Mahamudul Hasan Rubel

AI/ML | June 28, 2026 | 3 min read

LLM Cost Optimization: Building a Semantic Cache with Redis

LLM cost optimization is achievable by implementing semantic caching with Redis and vector embeddings. Reduce latency and API bills with this practical guide.

Tags: LLM, Redis, Vector Embeddings, Cost Optimization, Latency, AI Engineering, AI, RAG

When our production application started hitting the OpenAI API rate limits, our monthly bill was climbing faster than our user base. We were paying for identical or near-identical questions over and over, and the sub-second latency targets were slipping as the model struggled under the load.

If you’re struggling with similar constraints, implementing LLM cost optimization through semantic caching is your best path forward. Instead of relying on simple key-value lookups, we treat incoming queries as vectors and search for "meaning" rather than exact string matches.

Why Semantic Caching Matters

Most developers start with simple exact-match caching. It’s easy to implement, but it’s brittle. If a user asks "What is the capital of France?" and then "Tell me the capital city of France," an exact cache misses entirely. Semantic caching solves this by converting the user query into a numerical vector embedding and searching your cache for high-similarity existing responses.
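To make the difference concrete, here is a minimal sketch. The three-dimensional vectors are invented stand-ins for real embeddings (which have hundreds of dimensions), and `exact_key` is a hypothetical helper rather than part of any library.

```python
import hashlib
import math

def exact_key(prompt: str) -> str:
    """Hypothetical exact-match cache key: hash of the trimmed, lowercased prompt."""
    return hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

q1 = "What is the capital of France?"
q2 = "Tell me the capital city of France"

# Exact-match caching: the two phrasings hash to different keys, so q2 misses.
exact_hit = exact_key(q1) == exact_key(q2)

# Semantic caching: compare embeddings instead. These toy vectors are made up
# for illustration; a real embedding model would produce them from the text.
toy_embedding_q1 = [0.90, 0.10, 0.40]
toy_embedding_q2 = [0.88, 0.15, 0.42]
similarity = cosine_similarity(toy_embedding_q1, toy_embedding_q2)

print(f"exact match hit: {exact_hit}")
print(f"cosine similarity: {similarity:.3f}")
```

Normalizing case and whitespace rescues only trivial variations of an exact-match key. Paraphrases still need the vector comparison.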

We’ve previously explored LLM Caching Strategies to Slash Latency and API Costs for broad architectural patterns, but today we’re getting into the implementation details using Redis.

The Architecture

We use Redis (specifically with the RediSearch module) as our vector database. It’s fast, familiar, and handles similarity search natively.

Here is how the request flow looks in our current stack:

Flow diagram:

  1. User Query → Semantic Cache Check
  2. Cache hit → Return Cached Response
  3. Cache miss → Call LLM API → Store in Redis → Return Response

Implementation Steps

To get this running, you need a Redis instance with the RediSearch module enabled. I typically use the redis-py client along with sentence-transformers for embedding generation.

  1. Generate Embeddings: Convert every incoming prompt into a vector using a lightweight model like all-MiniLM-L6-v2.
  2. Vector Search: Query Redis using a cosine similarity threshold (e.g., 0.90).
  3. Handle Misses: If the score is below the threshold, proceed to the LLM, then cache the result.
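The vector search in step 2 needs an index to exist first. Below is a minimal sketch of creating one with the raw `FT.CREATE` command. The `query:` key prefix and the choice of an HNSW index are assumptions on my part; the index name and the `response` and `vector` field names match the ones used in the lookup code.

```python
EMBEDDING_DIM = 384  # output size of all-MiniLM-L6-v2

def build_create_index_args(index_name="idx:queries", prefix="query:", dim=EMBEDDING_DIM):
    """Arguments for FT.CREATE: a hash index with a text field and an HNSW vector field."""
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", prefix,            # assumed key prefix for cached entries
        "SCHEMA",
        "response", "TEXT",
        "vector", "VECTOR", "HNSW", "6",  # 6 = number of attribute tokens that follow
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]

def create_index(client, **kwargs):
    """Create the index using any redis-py style client that exposes execute_command()."""
    return client.execute_command(*build_create_index_args(**kwargs))
```

With a real client this would be `create_index(redis.Redis(host='localhost', port=6379))`, run once at deploy time, since `FT.CREATE` returns an error if the index already exists. The `DIM` value has to match the embedding model, so changing models means rebuilding the index.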

Here is a simplified snippet of how we handle the lookup:

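PYTHON
import numpy as np
import redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

# Initialize the Redis client and the embedding model once at startup
r = redis.Redis(host='localhost', port=6379)
model = SentenceTransformer('all-MiniLM-L6-v2')

SIMILARITY_THRESHOLD = 0.90

def get_cached_response(query):
    # RediSearch expects the query vector as raw float32 bytes
    query_vector = model.encode(query).astype(np.float32).tobytes()
    # Fetch the single nearest neighbour from the vector index
    results = r.ft("idx:queries").search(
        Query("*=>[KNN 1 @vector $vec AS score]")
        .return_fields("response", "score")
        .dialect(2),
        query_params={"vec": query_vector}
    )
    if not results.docs:
        return None
    # KNN returns a distance, not a similarity. This assumes the index uses the
    # COSINE metric, where distance = 1 - cosine similarity.
    similarity = 1.0 - float(results.docs[0].score)
    if similarity > SIMILARITY_THRESHOLD:
        return results.docs[0].response
    return None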
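The miss path (step 3) also needs a write. Here is a minimal sketch, assuming cached entries are stored as Redis hashes under a `query:` prefix that the index covers. `cache_key` and `store_response` are hypothetical helpers, and `vector_bytes` is expected to be the float32 bytes produced the same way as `query_vector` in the lookup.

```python
import hashlib

def cache_key(query: str, prefix: str = "query:") -> str:
    """Deterministic key for a query; the prefix is an assumption and must match the index."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"{prefix}{digest}"

def store_response(client, query: str, response: str, vector_bytes: bytes) -> str:
    """Write the LLM response and its embedding into one Redis hash."""
    key = cache_key(query)
    # Field names match what the lookup reads back: "response" and "vector"
    client.hset(key, mapping={"response": response, "vector": vector_bytes})
    return key
```

Hashing the query text keeps keys a fixed length and means an identical query overwrites its own entry instead of creating a duplicate.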

Dealing with Trade-offs

We first tried using a standard relational database for this, but the latency hit was roughly 400ms per query—far too slow. Switching to Redis dropped that to under 20ms.

However, don’t ignore the dangers of stale data. If the knowledge base behind your LLM changes, cached responses can go out of date. You need an invalidation strategy, such as the TTL approach covered in Semantic Cache Invalidation: Managing TTLs for RAG Pipelines, so users aren’t served outdated answers.
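One simple way to sketch invalidation is to combine a Redis TTL with a version check. Everything here is illustrative: the one-day TTL, the `created_at` and `kb_version` fields, and both helper names are assumptions, not something our stack prescribes.

```python
import time
from typing import Optional

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # hypothetical: expire entries after one day

def is_fresh(entry: dict, current_kb_version: str, now: Optional[float] = None,
             max_age_seconds: int = DEFAULT_TTL_SECONDS) -> bool:
    """An entry is stale if it is too old or was built against an older knowledge base."""
    now = time.time() if now is None else now
    too_old = now - float(entry["created_at"]) > max_age_seconds
    wrong_version = entry.get("kb_version") != current_kb_version
    return not (too_old or wrong_version)

def expire_entry(client, key: str, ttl_seconds: int = DEFAULT_TTL_SECONDS) -> None:
    """Let Redis evict the key on its own once the TTL elapses."""
    client.expire(key, ttl_seconds)
```

The TTL bounds how long any answer can live, while the version check lets you invalidate everything at once after a knowledge base update by bumping a single value.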

Performance Comparison

Cache Type     | Latency | Complexity | Use Case
Exact Match    | <5ms    | Low        | Static FAQs
Semantic Cache | 15-30ms | Medium     | Conversational AI
RAG Pipeline   | 500ms+  | High       | Document Retrieval

Final Thoughts

We reduced our OpenAI spend by about 30% in the first month simply by catching the most common, slightly varied user queries. It isn’t a silver bullet, since you still need to handle complex, unique prompts, but it’s the lowest-hanging fruit for LLM cost optimization.
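For a rough sense of how hit rate maps to spend, here is a back-of-the-envelope sketch. It assumes every request costs about the same and that a cache hit costs nothing in API tokens. The request volume and per-request price are invented, and it ignores embedding compute and Redis hosting.

```python
def estimate_llm_spend(requests: int, cost_per_request: float, hit_rate: float) -> float:
    """Spend after caching, assuming a cache hit costs nothing in API tokens."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return requests * cost_per_request * (1.0 - hit_rate)

# Hypothetical inputs: 100,000 requests a month at $0.01 each
baseline = estimate_llm_spend(100_000, 0.01, hit_rate=0.0)
with_cache = estimate_llm_spend(100_000, 0.01, hit_rate=0.30)
savings_fraction = 1.0 - with_cache / baseline

print(f"baseline ${baseline:,.2f}, with cache ${with_cache:,.2f}, saved {savings_fraction:.0%}")
```

Under these assumptions the saving equals the hit rate, so tracking hit rate in production gives an early read on the bill.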

Next time, I’m planning to look into LLM Caching with Semantic Bloom Filters for RAG Latency Reduction to see if we can skip the vector search entirely for obvious misses.

FAQ

How do I choose the similarity threshold?

Start at 0.90. If you find the cache is returning irrelevant answers, tighten it to 0.95. If it's missing too many obvious matches, loosen it to 0.85.
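A small labeled sample makes this tuning less of a guess. The sketch below counts false hits and missed hits at each candidate threshold. The sample data is invented and `evaluate_threshold` is a hypothetical helper.

```python
def evaluate_threshold(labeled_scores, threshold):
    """Count false hits and missed hits for a given similarity threshold.

    labeled_scores: list of (similarity, should_match) pairs, where should_match
    is a human judgment that the cached answer fits the new query.
    """
    false_hits = sum(1 for score, ok in labeled_scores if score > threshold and not ok)
    missed_hits = sum(1 for score, ok in labeled_scores if score <= threshold and ok)
    return false_hits, missed_hits

# Invented sample: (similarity to nearest cached query, was the cached answer acceptable?)
sample = [(0.97, True), (0.93, True), (0.91, False), (0.88, True), (0.80, False)]

for t in (0.85, 0.90, 0.95):
    fh, mh = evaluate_threshold(sample, t)
    print(f"threshold {t:.2f}: {fh} false hits, {mh} missed hits")
```

False hits usually hurt more than missed hits, because a wrong cached answer reaches the user while a miss only costs one extra LLM call.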

Does this increase my storage costs?

Slightly, yes. You are storing strings and vectors in Redis. However, the cost of RAM in Redis is significantly lower than the cost of repeated LLM API tokens.
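A quick estimate of the raw payload size, as a sketch. The entry count and the 1 KiB average response size are made-up inputs, and the figure excludes Redis per-key overhead and the vector index itself.

```python
def estimate_cache_bytes(entries: int, dim: int = 384, avg_response_bytes: int = 1024) -> int:
    """Raw payload size: float32 vector (4 bytes per dimension) plus the response text."""
    vector_bytes = dim * 4
    return entries * (vector_bytes + avg_response_bytes)

# Hypothetical: 100,000 cached answers averaging 1 KiB each
total = estimate_cache_bytes(100_000)
print(f"{total / 1024 ** 2:.1f} MiB before Redis and index overhead")
```

A 384-dimensional float32 vector, the size all-MiniLM-L6-v2 produces, takes 1,536 bytes, so for long responses the text can outweigh the vector.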

Can I use this for non-English queries?

Yes, provided your embedding model (like multilingual-e5) supports the languages you're targeting.
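Swapping the model might look like the sketch below. It assumes the `intfloat/multilingual-e5-small` checkpoint from the Hugging Face hub, and both helper names are hypothetical. E5 models expect a text prefix, and the model card recommends `query: ` for symmetric similarity tasks, which is what a cache lookup is.

```python
def prepare_e5_query(text: str) -> str:
    """E5-family models expect a 'query: ' prefix on text used for similarity search."""
    return f"query: {text.strip()}"

def load_multilingual_model(name: str = "intfloat/multilingual-e5-small"):
    """Load the model lazily so importing this module stays cheap."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)

# Usage (downloads the model on first run):
# model = load_multilingual_model()
# vector = model.encode(prepare_e5_query("Quelle est la capitale de la France ?"))
```

Apply the same prefix when embedding queries for storage and for lookup. If the new model's output dimension differs from the index's `DIM`, the index has to be rebuilt.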
