Master LLM caching strategies to cut latency and API costs. Learn how to implement exact and semantic caches to optimize your production AI workflows.

Last month, I spent three days debugging a sudden spike in OpenAI API costs that turned out to be a repetitive internal dashboard query. We were hitting the model for the exact same summary 40 times an hour, costing us about $12 in tokens for data that hadn't changed since the previous morning.
If you’re building AI features, you’re likely familiar with the pain of waiting for a streaming response while your billing dashboard ticks upward. Implementing LLM caching is the most effective way to address this. It’s not just about speed; it’s about preventing your infrastructure from becoming a black hole for your budget.
The simplest approach, exact-match caching, is often the best. If your application sends identical prompts repeatedly, like "Summarize the daily sales report", there’s no reason to call the LLM every time.
We initially implemented a simple Redis cache keyed on a SHA-256 hash of the model version and the user's prompt. Including the model version in the key ensured we didn't serve stale responses when we upgraded from gpt-4o to a newer version.
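```python
import hashlib

import redis

# Connect to a local Redis instance
cache = redis.Redis(host='localhost', port=6379, db=0)


def get_cached_response(prompt, model):
    # Key on model + prompt so an upgrade never serves the old model's answer
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    # Returns the cached value, or None on a miss
    return cache.get(key)
```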
This dropped our latency for repeat requests from ~800ms down to roughly 15ms. It’s a massive win, but it’s fragile. If a user adds a single space or changes the casing, the hash changes, the lookup misses, and you're back to paying for a full inference.
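A minimal sketch of one way to take the edge off that fragility: normalize the prompt before hashing, then wrap the lookup in a get-or-compute helper. Everything here is an assumption for illustration. `normalize_prompt`, `cache_key`, and `get_or_compute` are hypothetical names, `call_llm` is whatever function wraps your model call, and a plain dict stands in for Redis so the example runs on its own. Lowercasing is only safe if casing never changes the meaning of your prompts.

```python
import hashlib


def normalize_prompt(prompt: str) -> str:
    # Collapse runs of whitespace and lowercase the text.
    # Assumption: casing and spacing carry no meaning in your prompts.
    return " ".join(prompt.split()).lower()


def cache_key(prompt: str, model: str) -> str:
    # Same scheme as the exact-match key, applied to the normalized prompt
    normalized = normalize_prompt(prompt)
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()


def get_or_compute(prompt, model, store, call_llm):
    # store: any dict-like object (a stand-in for Redis in this sketch)
    # call_llm: hypothetical callable (prompt, model) -> response text
    key = cache_key(prompt, model)
    cached = store.get(key)
    if cached is not None:
        return cached  # cache hit: no model call
    response = call_llm(prompt, model)
    store[key] = response  # cache miss: save for next time
    return response
```

With a real Redis client you would swap the dict write for a `set` call with an expiry, so stale summaries age out on their own.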
To handle variations in user input, you need a semantic cache. Instead of looking for an identical string, you look for a "conceptually similar" one. This is where vector database caching comes into play.
When a request arrives, you embed the prompt using an embedding model (like text-embedding-3-small). You then search your vector store (we use Pinecone, but pgvector works fine too) for any previous prompts with a high cosine similarity—usually a threshold of 0.95 or higher.
If you find a match, you return the stored completion. If not, you proceed to the LLM and then save the new prompt-response pair to the vector database. This strategy is significantly more robust than exact matching, though it adds about 50-100ms of overhead for the similarity search.
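The lookup flow above can be sketched in a few lines. This is an illustration under stated assumptions, not a production setup: an in-memory list with a linear scan stands in for Pinecone or pgvector, `embed` is a hypothetical callable that returns a vector for a prompt (in practice a call to your embedding model), and `SemanticCache`, `answer`, and `call_llm` are made-up names.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (vector, response) pairs

    def lookup(self, vector):
        # Find the most similar stored prompt; a real vector store
        # would do this with an index instead of a linear scan.
        best_score, best_response = -1.0, None
        for stored_vector, response in self.entries:
            score = cosine_similarity(vector, stored_vector)
            if score > best_score:
                best_score, best_response = score, response
        # Only serve the match if it clears the similarity threshold
        return best_response if best_score >= self.threshold else None

    def store(self, vector, response):
        self.entries.append((vector, response))


def answer(prompt, cache, embed, call_llm):
    vector = embed(prompt)  # embed once, reuse for lookup and store
    cached = cache.lookup(vector)
    if cached is not None:
        return cached
    response = call_llm(prompt)
    cache.store(vector, response)
    return response
```

The threshold is the knob to watch. Set it too low and users get answers to questions they didn't ask; set it too high and the semantic cache behaves like an expensive exact-match cache.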
Using a vector database for caching allows you to capture clusters of intent. I’ve found that even when users phrase things differently, the underlying answer is often the same.
However, don't ignore the trade-offs. I've written before about this in "Controlling LLM cost and latency: A practical production guide", and the point applies here: adding a vector lookup introduces a new dependency. If your vector store goes down, your entire AI pipeline stalls.
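One way to soften that dependency is to treat the cache as optional: if the lookup or the write throws, log it and fall through to the model. The sketch below assumes hypothetical `cache_lookup`, `cache_store`, and `call_llm` callables that you supply; it is not tied to any particular vector store client. It also doesn't cover slow failures, so you would still want a short timeout on the client itself.

```python
import logging

logger = logging.getLogger("llm_cache")


def answer_with_fail_open(prompt, cache_lookup, cache_store, call_llm):
    # cache_lookup(prompt) -> cached response or None (hypothetical)
    # cache_store(prompt, response) -> None (hypothetical)
    try:
        cached = cache_lookup(prompt)
    except Exception:
        # Cache is down: treat it as a miss instead of failing the request
        logger.warning("cache lookup failed; falling back to the LLM", exc_info=True)
        cached = None

    if cached is not None:
        return cached

    response = call_llm(prompt)

    try:
        cache_store(prompt, response)
    except Exception:
        # A failed write costs a future cache hit, not the current response
        logger.warning("cache write failed; response not cached", exc_info=True)

    return response
```

The trade-off is cost: during an outage every request goes to the model, so alert on those warnings rather than letting them pile up silently.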
When we first tested this, we tried to cache everything. That was a mistake. We quickly hit storage limits and our cache hit rate stayed low because we were caching "long-tail" queries that never repeated. Now, we only cache prompts that meet a certain frequency threshold.
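A frequency threshold like that can be sketched as an admission gate: count how often each prompt shows up and only store the response once the count crosses a limit. The class name, the `min_hits` default of 3, and the in-memory counter are all assumptions for illustration, not the setup described above.

```python
import hashlib
from collections import Counter


class FrequencyGatedCache:
    def __init__(self, min_hits=3):
        self.min_hits = min_hits  # hypothetical threshold; tune per workload
        self.seen = Counter()     # how many times each key has been requested
        self.responses = {}       # responses admitted to the cache

    @staticmethod
    def _key(prompt, model):
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def get(self, prompt, model):
        # Returns the cached response, or None if it was never admitted
        return self.responses.get(self._key(prompt, model))

    def record(self, prompt, model, response):
        # Call after every model response; returns True once the
        # prompt has been seen often enough to be cached.
        key = self._key(prompt, model)
        self.seen[key] += 1
        if self.seen[key] >= self.min_hits:
            self.responses[key] = response
            return True
        return False
```

Note that the counter itself grows with every distinct prompt. In a real deployment you would want the counts to expire too, otherwise the long tail just moves from the cache into the counter.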
Some providers now offer native prompt caching. This is different from the application-level caching I’ve described: the model still runs and generates a fresh response, but a repeated prefix such as your system prompt or a long shared context is served from the provider's cache, which reduces the token cost for the input portion of your requests.
If you're already using RAG, the approach I covered in "Hybrid search in RAG pipelines: Boosting retrieval accuracy" helps minimize the amount of irrelevant context you send to the LLM in the first place. Less context sent equals cheaper, faster inference.
Before you dive in, consider these failure modes:
How do I decide between Redis and a Vector DB? Use Redis for exact matches—it’s faster and cheaper. Use a Vector DB only when you need to capture variations in intent that Redis can't catch.
Does caching interfere with LLM guardrails? Yes. If you follow the approach in "LLM guardrails for production: Input validation and output filtering", make sure your cache stores the validated output. You don't want to serve a cached response that hasn't passed your current safety checks.
What is the "sweet spot" for cache hit rates? In our experience, a 20-30% hit rate is usually enough to justify the complexity. If you're hitting 50%+, you’re likely caching too aggressively or your users are very repetitive.
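On the guardrails question, one option is to store the guardrail policy version next to each cached response and treat a version mismatch as a miss. This is a sketch under assumptions: the version strings, the function names, and the dict standing in for Redis are all hypothetical.

```python
import json


def store_validated(store, key, response, guardrail_version):
    # Only call this after the response has passed your output checks.
    # The entry is serialized to JSON so it could live in Redis as a string.
    store[key] = json.dumps(
        {"response": response, "guardrail_version": guardrail_version}
    )


def load_if_current(store, key, guardrail_version):
    raw = store.get(key)
    if raw is None:
        return None  # plain cache miss
    entry = json.loads(raw)
    if entry.get("guardrail_version") != guardrail_version:
        # Validated under older rules: treat as a miss and regenerate
        return None
    return entry["response"]
```

Bumping the version whenever the safety rules change invalidates every older entry without a manual flush.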
I’m still experimenting with how to best handle cache invalidation for multi-turn conversations. Managing context state while caching is tricky, and I haven't found a "one-size-fits-all" solution yet. Start small, monitor your hit rate, and don't over-engineer until the costs force your hand.