LLM cost optimization is achievable by implementing semantic caching with Redis and vector embeddings. Reduce latency and API bills with this practical guide.
When our production application started hitting the OpenAI API rate limits, our monthly bill was climbing faster than our user base. We were paying for identical or near-identical questions over and over, and the sub-second latency targets were slipping as the model struggled under the load.
If you’re struggling with similar constraints, implementing LLM cost optimization through semantic caching is your best path forward. Instead of relying on simple key-value lookups, we treat incoming queries as vectors and search for "meaning" rather than exact string matches.
Most developers start with simple exact-match caching. It’s easy to implement, but it’s brittle. If a user asks "What is the capital of France?" and then "Tell me the capital city of France," an exact cache misses entirely. Semantic caching solves this by converting the user query into a numerical vector embedding and searching your cache for high-similarity existing responses.
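To make the difference concrete, here is a minimal sketch using only the standard library. The three-dimensional vectors are made-up stand-ins for real embeddings (a real model produces hundreds of dimensions), and `exact_key` and `cosine_similarity` are hypothetical helper names, not part of our stack.

```python
import hashlib
import math

def exact_key(query: str) -> str:
    """Key for an exact-match cache: hash of the normalized query string."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

q1 = "What is the capital of France?"
q2 = "Tell me the capital city of France"

# Exact match: different strings hash to different keys, so q2 misses.
exact_hit = exact_key(q1) == exact_key(q2)

# Toy vectors standing in for the embeddings of q1 and q2 (illustrative values).
v1 = [0.90, 0.10, 0.40]
v2 = [0.85, 0.15, 0.45]
semantic_hit = cosine_similarity(v1, v2) > 0.90
```

Even with whitespace and case normalization, the exact-match key differs for the paraphrase, while the two nearby vectors clear the similarity threshold.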
We’ve previously explored LLM Caching Strategies to Slash Latency and API Costs for broad architectural patterns, but today we’re getting into the implementation details using Redis.
We use Redis (specifically with the RediSearch module) as our vector database. It’s fast, familiar, and handles similarity search natively.
Here is how the request flow looks in our current stack:
Flow diagram: User Query → Semantic Cache Check. On a hit → Return Cached Response. On a miss → Call LLM API → Store in Redis → Return Response.
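The flow can be sketched as a small cache-aside function. This is an illustration rather than our production code: `answer`, `lookup`, `call_llm`, and `store` are hypothetical names, and the three callables are passed in so the sketch stays independent of any particular cache or LLM client.

```python
from typing import Callable, Optional

def answer(
    query: str,
    lookup: Callable[[str], Optional[str]],
    call_llm: Callable[[str], str],
    store: Callable[[str, str], None],
) -> str:
    """Cache-aside flow: check the semantic cache, fall back to the LLM on a miss."""
    cached = lookup(query)
    if cached is not None:
        return cached              # hit: skip the API call entirely
    response = call_llm(query)     # miss: pay for one LLM call
    store(query, response)         # save it so similar queries hit next time
    return response
```

In a real deployment `lookup` would be the semantic search against Redis and `store` would write the new entry together with its embedding.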
To get this running, you need a Redis instance with the RediSearch module enabled. I typically use the redis-py client along with sentence-transformers for embedding generation.
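Before any lookup can work, the vector index has to exist. The sketch below builds the raw `FT.CREATE` command and sends it through the client's `execute_command`. Several details are assumptions on my part: the `query:` key prefix, the HNSW algorithm, and the COSINE distance metric are illustrative choices, and `create_index_args` and `create_index` are hypothetical helper names. The dimension of 384 is the output size of all-MiniLM-L6-v2.

```python
INDEX_NAME = "idx:queries"   # index name used for lookups in this post
KEY_PREFIX = "query:"        # hypothetical key prefix for cached entries
EMBEDDING_DIM = 384          # output size of all-MiniLM-L6-v2

def create_index_args(dim: int = EMBEDDING_DIM) -> list:
    """Build the FT.CREATE arguments for a hash-backed vector index."""
    return [
        "FT.CREATE", INDEX_NAME,
        "ON", "HASH",
        "PREFIX", 1, KEY_PREFIX,
        "SCHEMA",
        "response", "TEXT",
        "vector", "VECTOR", "HNSW", 6,   # 6 = number of attribute tokens below
        "TYPE", "FLOAT32",
        "DIM", dim,
        "DISTANCE_METRIC", "COSINE",     # assumed metric; scores are distances
    ]

def create_index(client) -> None:
    """Create the index; `client` is any object exposing execute_command."""
    client.execute_command(*create_index_args())
```

With redis-py you would call `create_index(redis.Redis(host="localhost", port=6379))` once at startup. If you swap the embedding model, the `DIM` value has to change with it.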
For embeddings, we use all-MiniLM-L6-v2. Here is a simplified snippet of how we handle the lookup:
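```python
import numpy as np
import redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

# Initialize the Redis client and the embedding model
r = redis.Redis(host='localhost', port=6379)
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_cached_response(query):
    # Encode the query as raw float32 bytes, the format RediSearch expects
    query_vector = model.encode(query).astype(np.float32).tobytes()

    # Fetch the single nearest neighbour from the vector index
    results = r.ft("idx:queries").search(
        Query("*=>[KNN 1 @vector $vec AS score]")
        .return_fields("response", "score")
        .dialect(2),
        query_params={"vec": query_vector}
    )

    # KNN returns a distance, not a similarity. Assuming the index uses the
    # COSINE metric, similarity = 1 - distance; accept matches above 0.90.
    if results.docs and 1 - float(results.docs[0].score) > 0.90:
        return results.docs[0].response
    return None
```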
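The lookup only pays off if misses get written back. Here is a minimal sketch of the store side, assuming entries live in Redis hashes with `response` and `vector` fields as the lookup expects. The `query:` key prefix and the helper names `vector_to_bytes` and `cache_response` are hypothetical; `client` is any object exposing redis-py's `hset`.

```python
import hashlib
from array import array

def vector_to_bytes(vector) -> bytes:
    """Pack floats as float32 bytes in native byte order."""
    return array("f", vector).tobytes()

def cache_response(client, query: str, response: str, vector) -> str:
    """Write one cache entry as a Redis hash and return its key."""
    key = "query:" + hashlib.sha256(query.encode("utf-8")).hexdigest()
    client.hset(key, mapping={
        "query": query,
        "response": response,
        "vector": vector_to_bytes(vector),
    })
    return key
```

`vector_to_bytes` stands in for the `.astype(np.float32).tobytes()` call used on the query side; the stored bytes and the query bytes must use the same float width and dimension, or the search will not match anything sensible.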
We first tried using a standard relational database for this, but the latency hit was roughly 400ms per query—far too slow. Switching to Redis dropped that to under 20ms.
However, don't ignore the danger of stale data. If the knowledge behind your LLM's answers changes, the cache can keep serving outdated responses. You need an invalidation strategy (see Semantic Cache Invalidation: Managing TTLs for RAG Pipelines) so your users aren't shown old facts.
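One simple form of invalidation is a time-to-live on each entry. The sketch below is an assumption about how that could look, not a description of our setup: the categories and durations in `TTL_BY_CATEGORY` are made-up values, `apply_ttl` is a hypothetical helper, and `client` is any object exposing redis-py's `expire`.

```python
# Hypothetical TTLs in seconds, keyed by how quickly answers go stale.
TTL_BY_CATEGORY = {
    "static": 7 * 24 * 3600,    # e.g. definitions that rarely change
    "default": 24 * 3600,
    "volatile": 15 * 60,        # e.g. answers built on fast-moving data
}

def apply_ttl(client, key: str, category: str = "default") -> int:
    """Set an expiry on a cache key so stale entries fall out on their own."""
    ttl = TTL_BY_CATEGORY.get(category, TTL_BY_CATEGORY["default"])
    client.expire(key, ttl)
    return ttl
```

Calling this right after writing an entry means an outdated answer lives at most one TTL window, at the cost of re-paying for the LLM call once the entry expires.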
| Cache Type | Latency | Complexity | Use Case |
|---|---|---|---|
| Exact Match | <5ms | Low | Static FAQs |
| Semantic Cache | 15-30ms | Medium | Conversational AI |
| RAG Pipeline | 500ms+ | High | Document Retrieval |
We managed to reduce our OpenAI spend by about 30% in the first month by simply catching the most common, slightly-varied user queries. It isn't a silver bullet—you still need to handle complex, unique prompts—but it’s the lowest hanging fruit for LLM cost optimization.
Next time, I’m planning to look into LLM Caching with Semantic Bloom Filters for RAG Latency Reduction to see if we can skip the vector search entirely for obvious misses.
How do I choose the similarity threshold? Start at 0.90. If you find the cache is returning irrelevant answers, tighten it to 0.95. If it's missing too many obvious matches, loosen it to 0.85.
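Rather than tuning by feel, you can score candidate thresholds against a small hand-labeled set of query pairs. The sketch below assumes you have collected `(similarity, same_intent)` pairs yourself; the sample values are invented for illustration and `evaluate_threshold` is a hypothetical helper.

```python
def evaluate_threshold(pairs: list, threshold: float) -> dict:
    """pairs: list of (similarity, same_intent) for labeled query pairs."""
    hits = [(s, same) for s, same in pairs if s > threshold]
    wrong = sum(1 for _, same in hits if not same)
    should_hit = sum(1 for _, same in pairs if same)
    caught = sum(1 for _, same in hits if same)
    return {
        "threshold": threshold,
        # Share of cache hits that would have returned an irrelevant answer
        "false_hit_rate": wrong / len(hits) if hits else 0.0,
        # Share of genuinely equivalent queries the cache would have caught
        "recall": caught / should_hit if should_hit else 0.0,
    }

# Invented sample data: similarity score and whether the intent really matched.
labeled_pairs = [
    (0.97, True), (0.93, True), (0.91, False),
    (0.88, True), (0.86, False), (0.80, False),
]
report = [evaluate_threshold(labeled_pairs, t) for t in (0.85, 0.90, 0.95)]
```

A rising false-hit rate is the signal to tighten the threshold; falling recall is the signal to loosen it.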
Does this increase my storage costs? Slightly, yes. You are storing strings and vectors in Redis. However, the cost of RAM in Redis is significantly lower than the cost of repeated LLM API tokens.
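For a rough sense of scale, the vector part of the footprint is simple arithmetic. This sketch assumes float32 vectors at the 384 dimensions produced by all-MiniLM-L6-v2 and counts raw vector bytes only; index overhead and the stored query and response strings come on top. The function names are hypothetical.

```python
def vector_bytes(dim: int = 384, bytes_per_float: int = 4) -> int:
    """Raw size of one float32 embedding."""
    return dim * bytes_per_float

def cache_vector_mebibytes(entries: int, dim: int = 384) -> float:
    """Raw vector storage for a given number of cached entries, in MiB."""
    return entries * vector_bytes(dim) / (1024 * 1024)

per_entry = vector_bytes()                      # 1536 bytes per vector
for_100k = cache_vector_mebibytes(100_000)      # about 146.5 MiB
```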
Can I use this for non-English queries? Yes, provided your embedding model (like multilingual-e5) supports the languages you're targeting.