AI/ML · June 28, 2026 · 3 min read

LLM Cost Optimization: Building a Semantic Cache with Redis

LLM cost optimization is achievable by implementing semantic caching with Redis and vector embeddings. Reduce latency and API bills with this practical guide.

Tags: LLM, Redis, Vector Embeddings, Cost Optimization, Latency, AI Engineering, AI, RAG

When our production application started hitting the OpenAI API rate limits, our monthly bill was climbing faster than our user base. We were paying for identical or near-identical questions over and over, and the sub-second latency targets were slipping as the model struggled under the load.

If you’re struggling with similar constraints, implementing LLM cost optimization through semantic caching is your best path forward. Instead of relying on simple key-value lookups, we treat incoming queries as vectors and search for "meaning" rather than exact string matches.

Why Semantic Caching Matters

Most developers start with simple exact-match caching. It’s easy to implement, but it’s brittle. If a user asks "What is the capital of France?" and then "Tell me the capital city of France," an exact cache misses entirely. Semantic caching solves this by converting the user query into a numerical vector embedding and searching your cache for high-similarity existing responses.
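To make the difference concrete, here is a minimal sketch. The three-dimensional vectors are hand-written stand-ins for real embeddings (an assumption for illustration; a real model produces hundreds of dimensions), and the names `exact_cache` and `toy_embeddings` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

original = "What is the capital of France?"
paraphrase = "Tell me the capital city of France"
unrelated = "How do I reset my password?"

# Exact-match cache: the key is the literal query string.
exact_cache = {original: "Paris"}
exact_hit = exact_cache.get(paraphrase)  # None, because the strings differ

# Toy vectors standing in for real embeddings (made-up values).
toy_embeddings = {
    original: [0.90, 0.10, 0.05],
    paraphrase: [0.88, 0.14, 0.07],
    unrelated: [0.05, 0.20, 0.95],
}

similar_score = cosine_similarity(toy_embeddings[original], toy_embeddings[paraphrase])
unrelated_score = cosine_similarity(toy_embeddings[original], toy_embeddings[unrelated])
```

With these toy values the paraphrase scores well above a 0.90 threshold while the unrelated question scores far below it, which is the behaviour a semantic cache relies on.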

We’ve previously explored LLM Caching Strategies to Slash Latency and API Costs for broad architectural patterns, but today we’re getting into the implementation details using Redis.

The Architecture

We use Redis (specifically with the RediSearch module) as our vector database. It’s fast, familiar, and handles similarity search natively.

Here is how the request flow looks in our current stack:


Flow diagram: User Query → Semantic Cache Check. On a hit → Return Cached Response. On a miss → Call LLM API → Store in Redis → Return Response.
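The same flow can be sketched as a single function. This is a minimal illustration rather than our production code: `lookup`, `call_llm`, and `store` are hypothetical callables that you would wire up to the Redis search, your LLM client, and the Redis write respectively.

```python
from typing import Callable, Optional

def answer(
    query: str,
    lookup: Callable[[str], Optional[str]],
    call_llm: Callable[[str], str],
    store: Callable[[str, str], None],
) -> str:
    """Check the semantic cache first; fall back to the LLM on a miss."""
    cached = lookup(query)
    if cached is not None:
        return cached             # hit: no API call, no tokens spent
    response = call_llm(query)    # miss: pay for one LLM call
    store(query, response)        # cache it for the next similar query
    return response
```

Passing the three steps in as callables keeps the flow testable without a live Redis instance or API key.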

Implementation Steps

To get this running, you need a Redis instance with the RediSearch module enabled. I typically use the redis-py client along with sentence-transformers for embedding generation.

Generate Embeddings: Convert every incoming prompt into a vector using a lightweight model like all-MiniLM-L6-v2.
Vector Search: Query Redis using a cosine similarity threshold (e.g., 0.90).
Handle Misses: If the score is below the threshold, proceed to the LLM, then cache the result.
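Before any lookup can work, the index has to exist and misses have to be written in a shape the index understands. The sketch below is one way to do that, under a few assumptions: the key prefix `cache:` and the helper names are hypothetical, the index name `idx:queries` and the field names `response` and `vector` mirror the lookup snippet in this post, and the commands are sent through redis-py's generic `execute_command` so the example does not depend on a particular helper API. The dimension of 384 is the output size of all-MiniLM-L6-v2.

```python
import struct
import uuid

INDEX_NAME = "idx:queries"  # same index name as the lookup snippet
KEY_PREFIX = "cache:"       # hypothetical key prefix for cached entries
VECTOR_DIM = 384            # all-MiniLM-L6-v2 produces 384-dimensional vectors

def to_float32_bytes(vector):
    """Pack a sequence of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)

def create_index(r):
    """Create an HNSW vector index over hashes stored under KEY_PREFIX."""
    r.execute_command(
        "FT.CREATE", INDEX_NAME, "ON", "HASH", "PREFIX", 1, KEY_PREFIX,
        "SCHEMA",
        "response", "TEXT",
        # The 6 is the number of attribute arguments that follow.
        "vector", "VECTOR", "HNSW", 6,
        "TYPE", "FLOAT32", "DIM", VECTOR_DIM, "DISTANCE_METRIC", "COSINE",
    )

def store_response(r, query_vector, response):
    """Write one cache entry after a miss and return its key."""
    key = KEY_PREFIX + uuid.uuid4().hex
    r.hset(key, mapping={
        "response": response,
        "vector": to_float32_bytes(query_vector),
    })
    return key
```

`create_index` only needs to run once per deployment; Redis returns an error if the index already exists, so a real setup script would want to catch that.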

Here is a simplified snippet of how we handle the lookup:


PYTHON
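import numpy as np
import redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

# Initialize the Redis client and the embedding model
r = redis.Redis(host='localhost', port=6379)
model = SentenceTransformer('all-MiniLM-L6-v2')

SIMILARITY_THRESHOLD = 0.90

def get_cached_response(query):
    # Encode the query as float32 bytes, the format the vector field expects
    query_vector = model.encode(query).astype(np.float32).tobytes()
    # Fetch the single nearest neighbour from the vector index
    results = r.ft("idx:queries").search(
        Query("*=>[KNN 1 @vector $vec AS score]")
        .return_fields("response", "score")
        .dialect(2),
        query_params={"vec": query_vector}
    )
    if not results.docs:
        return None
    # Assuming a COSINE index: "score" is a distance (1 - similarity), so convert it
    similarity = 1.0 - float(results.docs[0].score)
    if similarity >= SIMILARITY_THRESHOLD:
        return results.docs[0].response
    return None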
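One detail that is easy to get backwards in a lookup like this: when the index is built with the COSINE distance metric, the value a KNN query reports is a distance, not a similarity. Identical vectors score 0 and the value grows as they diverge. A small helper, with hypothetical names, keeps the threshold check honest:

```python
def distance_to_similarity(cosine_distance: float) -> float:
    """Convert a cosine distance (0 = identical) to a cosine similarity."""
    return 1.0 - cosine_distance

def is_cache_hit(cosine_distance: float, threshold: float = 0.90) -> bool:
    """True when the nearest neighbour is similar enough to reuse its response."""
    return distance_to_similarity(cosine_distance) >= threshold
```

For example, a reported distance of 0.05 corresponds to a similarity of 0.95 and counts as a hit at the 0.90 threshold, while a distance of 0.30 does not.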

Dealing with Trade-offs

We first tried using a standard relational database for this, but the latency hit was roughly 400ms per query—far too slow. Switching to Redis dropped that to under 20ms.

However, don't ignore the dangers of stale data. If your LLM's knowledge base updates, your cache might return outdated information. You must implement a strategy for Semantic Cache Invalidation: Managing TTLs for RAG Pipelines to ensure your users aren't seeing hallucinations or old facts.
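The simplest guard against stale entries is a TTL on every cached key. The sketch below assumes entries are stored as Redis hashes with `response` and `vector` fields; the one-day TTL and the function name are hypothetical, so pick a value that matches how often your underlying data changes.

```python
CACHE_TTL_SECONDS = 24 * 60 * 60  # hypothetical: expire entries after one day

def store_with_ttl(r, key, response, vector_bytes, ttl=CACHE_TTL_SECONDS):
    """Write a cache entry and set its expiry in one pipeline round trip."""
    pipe = r.pipeline()
    pipe.hset(key, mapping={"response": response, "vector": vector_bytes})
    pipe.expire(key, ttl)
    pipe.execute()
```

Sending both commands through a pipeline avoids a window where the entry exists without an expiry if the client dies between the two calls.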

Performance Comparison

Cache Type	Latency	Complexity	Use Case
Exact Match	<5ms	Low	Static FAQs
Semantic Cache	15-30ms	Medium	Conversational AI
RAG Pipeline	500ms+	High	Document Retrieval

Final Thoughts

We managed to reduce our OpenAI spend by about 30% in the first month simply by catching the most common, slightly varied user queries. It isn't a silver bullet, since you still need to handle complex, unique prompts, but it’s the lowest-hanging fruit for LLM cost optimization.

Next time, I’m planning to look into LLM Caching with Semantic Bloom Filters for RAG Latency Reduction to see if we can skip the vector search entirely for obvious misses.

FAQ

How do I choose the similarity threshold? Start at 0.90. If you find the cache is returning irrelevant answers, tighten it to 0.95. If it's missing too many obvious matches, loosen it to 0.85.
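If you can hand-label a small set of query pairs as "should match" or "should not match", you can compare thresholds with numbers instead of guessing. The sketch below is one way to do that; the sample scores are made up purely for illustration and the function name is hypothetical.

```python
def evaluate_threshold(scored_pairs, threshold):
    """Count errors at a threshold.

    scored_pairs is a list of (similarity, should_match) tuples.
    Returns (false_hits, missed_matches).
    """
    false_hits = sum(1 for sim, ok in scored_pairs if sim >= threshold and not ok)
    missed = sum(1 for sim, ok in scored_pairs if sim < threshold and ok)
    return false_hits, missed

# Made-up similarity scores with hand-assigned labels.
sample = [
    (0.97, True), (0.93, True), (0.91, False),
    (0.88, True), (0.86, False), (0.70, False),
]

for t in (0.85, 0.90, 0.95):
    print(t, evaluate_threshold(sample, t))
```

False hits (irrelevant answers served from cache) are usually the more expensive error, so it is reasonable to prefer the threshold that minimises them first.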

Does this increase my storage costs? Slightly, yes. You are storing strings and vectors in Redis. However, the cost of RAM in Redis is significantly lower than the cost of repeated LLM API tokens.
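For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes 384-dimensional float32 vectors (the output size of all-MiniLM-L6-v2) and a hypothetical average response length of 1,000 single-byte characters. It deliberately ignores Redis per-key overhead and the memory used by the vector index itself, so treat the result as a lower bound.

```python
def estimate_entry_bytes(dim=384, bytes_per_float=4, avg_response_chars=1000):
    """Approximate payload size of one cache entry (vector plus response text)."""
    vector_bytes = dim * bytes_per_float  # 384 * 4 = 1536 bytes
    return vector_bytes + avg_response_chars

def estimate_total_mb(entries, **kwargs):
    """Approximate payload size of the whole cache in MiB."""
    return entries * estimate_entry_bytes(**kwargs) / (1024 * 1024)
```

Under these assumptions each entry carries about 2.5 KB of payload, so 100,000 cached responses come to roughly 240 MiB before overhead.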

Can I use this for non-English queries? Yes, provided your embedding model (like multilingual-e5) supports the languages you're targeting.
