Mahamudul Hasan Rubel
HomeAboutProjectsSkillsExperienceBlogPhotosContact
Mahamudul Hasan Rubel

Senior Software Engineer crafting high-performance web applications and SaaS platforms.

Navigation

  • Home
  • About
  • Projects
  • Skills
  • Experience
  • Blog
  • Photos
  • Contact

Get in Touch

Available for senior/lead roles and consulting.

bd.mhrubel@gmail.comHire Me

© 2026 Mahamudul Hasan Rubel. All rights reserved.

Built with using Next.js 16 & Tailwind v4

Back to Blog
AI/MLJune 20, 20264 min read

LLM Caching Strategies to Slash Latency and API Costs

Master LLM caching strategies to cut latency and API costs. Learn how to implement exact and semantic caches to optimize your production AI workflows.

LLMAICachingRedisVector DatabasePerformanceOptimizationRAGPrompt Engineering
A detailed view of assorted coins in a jar under warm light, depicting financial wealth.

Last month, I spent three days debugging a sudden spike in OpenAI API costs that turned out to be a repetitive internal dashboard query. We were hitting the model for the exact same summary 40 times an hour, costing us about $12 in tokens for data that hadn't changed since the previous morning.

If you’re building AI features, you’re likely familiar with the pain of waiting for a streaming response while your billing dashboard ticks upward. Implementing LLM caching is the most effective way to address this. It’s not just about speed; it’s about preventing your infrastructure from becoming a black hole for your budget.

The First Step: Exact Match Caching

The simplest approach is often the best. If your application sends identical prompts repeatedly—like "Summarize the daily sales report"—there’s no reason to ping the LLM.

We initially implemented a simple Redis cache using a hash of the user's prompt as the key. We used hash(prompt + model_version) to ensure we didn't serve stale responses if we upgraded from gpt-4o to a newer version.

PYTHON
import hashlib
import redis

cache = redis.Redis(host=CE9178">'localhost', port=6379, db=0)

def get_cached_response(prompt, model):
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    return cache.get(key)

This dropped our latency for repeat requests from ~800ms down to roughly 15ms. It’s a massive win, but it’s fragile. If a user adds a single space or changes the casing, the hash fails, and you're back to paying for a full inference.

Moving to Semantic Cache

To handle variations in user input, you need a semantic cache. Instead of looking for an identical string, you look for a "conceptually similar" one. This is where vector database caching comes into play.

When a request arrives, you embed the prompt using an embedding model (like text-embedding-3-small). You then search your vector store (we use Pinecone, but pgvector works fine too) for any previous prompts with a high cosine similarity—usually a threshold of 0.95 or higher.

If you find a match, you return the stored completion. If not, you proceed to the LLM and then save the new prompt-response pair to the vector database. This strategy is significantly more robust than exact matching, though it adds about 50-100ms of overhead for the similarity search.

Why Vector Database Caching Beats Simple KV Stores

Using a vector database for caching allows you to capture clusters of intent. I’ve found that even if users phrase things differently, the underlying answer remains the same.

However, don't ignore the trade-offs. While I've written before about controlling LLM cost and latency: A practical production guide, adding a vector lookup introduces a new dependency. If your vector store goes down, your entire AI pipeline stalls.

When we first tested this, we tried to cache everything. That was a mistake. We quickly hit storage limits and our cache hit rate stayed low because we were caching "long-tail" queries that never repeated. Now, we only cache prompts that meet a certain frequency threshold.

Implementing Prompt Caching

Some providers are now offering native prompt caching. This is different from the application-level caching I’ve described. By "pinning" your system prompt or long context windows to the provider's cache, you reduce the token cost for the input portion of your requests.

If you're already using RAG, ensure you're using hybrid search in RAG pipelines: Boosting retrieval accuracy to minimize the amount of irrelevant context you send to the LLM in the first place. Less context sent equals cheaper, faster inference.

Practical Considerations

Before you dive in, consider these failure modes:

  1. Stale Data: If you’re caching, you’re creating a "time-to-live" (TTL) problem. How long is a summary valid? If it’s a dynamic dashboard, 10 minutes might be too long.
  2. Privacy: Be careful about what you store in your cache. If your app handles PII, ensure your Redis or Vector DB instance is encrypted at rest and scoped correctly.
  3. Complexity: Don't build a semantic cache if your traffic is low. Start with Redis for exact matches and only move to vector-based semantic caching when you see a genuine pattern of near-duplicate requests.

Frequently Asked Questions

How do I decide between Redis and a Vector DB? Use Redis for exact matches—it’s faster and cheaper. Use a Vector DB only when you need to capture variations in intent that Redis can't catch.

Does caching interfere with LLM guardrails? Yes. If you implement LLM guardrails for production: Input validation and output filtering, make sure your cache stores the validated output. You don't want to serve a cached response that hasn't passed your current safety checks.

What is the "sweet spot" for cache hit rates? In our experience, a 20-30% hit rate is usually enough to justify the complexity. If you're hitting 50%+, you’re likely caching too aggressively or your users are very repetitive.

I’m still experimenting with how to best handle cache invalidation for multi-turn conversations. Managing context state while caching is tricky, and I haven't found a "one-size-fits-all" solution yet. Start small, monitor your hit rate, and don't over-engineer until the costs force your hand.

Back to Blog

Similar Posts

Aerial view of a modern highway interchange at dusk with flowing traffic.
AI/MLJune 20, 20265 min read

LLM Routing: A Strategy for Multi-Model Architectures

Master LLM routing to optimize costs and latency in production. Learn how to build a deterministic multi-model architecture for your AI application.

Read more
Close-up of an illuminated audio mixer panel in a recording studio, showcasing various controls and switches.
AI/ML
June 20, 2026
4 min read

Controlling LLM cost and latency: A Practical Production Guide

Controlling LLM cost and latency is the biggest hurdle in production. Learn how to optimize token usage and response times to keep your AI features fast.

Read more
View of large industrial pipelines running through a lush forest landscape.
AI/MLJune 20, 20264 min read

Hybrid Search in RAG Pipelines: Boosting Retrieval Accuracy

Hybrid search in RAG pipelines combines vector and keyword matching to solve retrieval failures. Learn how to implement it for better search relevance.

Read more