AI/ML | June 24, 2026 | 4 min read

LLM Caching with Semantic Bloom Filters for RAG Latency Reduction

LLM caching with semantic Bloom filters helps you slash latency by pre-filtering queries. Learn to combine probabilistic structures with your RAG pipeline.

LLM, RAG, Vector Search, Caching, Python, Performance, AI, Prompt Engineering

Last month, our RAG pipeline's p99 latency spiked because we were hitting the vector database for every single user query, even when we knew the documents didn't contain the requested information. We needed a way to drop irrelevant requests before they ever touched our embedding model or the vector store.

Adding LLM caching is standard practice, but exact-match caches fail when user phrasing varies. Semantic caching (see "Semantic caching for RAG pipelines: Cut latency and costs") helps with similar queries, but we also needed a way to handle the "I don't know" cases, meaning queries that fall outside our knowledge base, without wasting compute. That’s where Bloom filters come in.

Why use Bloom filters for RAG optimization?

A Bloom filter is a space-efficient probabilistic data structure that tells you if an element is definitely not in a set or possibly in the set. In the context of latency reduction, it acts as a gatekeeper.
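To make the "definitely not" versus "possibly" distinction concrete, here is a toy Bloom filter in plain Python. It is a minimal sketch for illustration only: the class name `TinyBloom`, the bit count, and the salted SHA-256 hashing scheme are assumptions of this example, not what a production library does internally.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter: m bits, k hash positions per key."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive k bit positions by salting SHA-256 with the hash index
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        # All k bits set: "possibly present". Any bit clear: "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


topics = TinyBloom()
for term in ["parental leave", "expense policy", "remote work"]:
    topics.add(term)

known_hit = "parental leave" in topics   # added keys are always found
empty_check = "anything" in TinyBloom()  # an empty filter rejects every key
```

An added key can never come back as "no", because its bits stay set. An unseen key can occasionally come back as "yes" if other keys happened to set the same bits, which is the false-positive case.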

If you know your vector store only contains documents about "Internal HR Policies," you can hash incoming queries and check them against a Bloom filter that represents the "known topic space." If the query hashes to a "no," you skip the vector search entirely and return a canned response.

We initially tried storing every query-vector pair in a Redis hash. It worked for exact duplicates, but the memory overhead for semantic clustering grew too fast. By switching to a Bloom filter, we reduced our memory footprint by about 75% compared to a full hash map.
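For a rough sense of why the filter is so compact, the textbook sizing formulas give the bit count and hash count for a target capacity and error rate. This is a back-of-the-envelope sketch: `bloom_size` is a helper name made up for this example, and a real library may allocate somewhat more than the theoretical optimum.

```python
import math


def bloom_size(n_items, error_rate):
    """Return (bits, hash_count) from the standard optimal-sizing formulas."""
    # m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


bits, hashes = bloom_size(100_000, 0.01)
kib = bits / 8 / 1024  # about 117 KiB of bits for 100k entries at a 1% error rate
```

The size depends only on the number of entries and the error rate, not on how long each key is, which is why it stays small even when the keys themselves are long strings.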

Implementing the pipeline

To implement this, you need a way to map semantic space to the filter. A Bloom filter only answers exact-membership questions about hashed keys, so you can’t store embedding vectors directly: two nearly identical vectors would hash to unrelated bits. Instead, we derive a discrete key from each query, either with locality-sensitive hashing (LSH) or by simply hashing the top-k keywords extracted from the query via a lightweight NLP model like spaCy or a small BERT encoder.
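The LSH option can be sketched with random hyperplanes: each hyperplane contributes one bit depending on which side of it the embedding falls, and the resulting bit string is the key you would add to or look up in the filter. This is a minimal stdlib-only illustration. The function names, the 4-dimensional vector, and the 8-bit signature are all assumptions chosen for readability, not values from our service.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Generate n_bits random hyperplanes; a fixed seed keeps keys stable across runs."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: '1' if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


planes = make_hyperplanes(dim=4, n_bits=8)
query_vec = [0.2, -0.5, 0.1, 0.9]
key = lsh_key(query_vec, planes)

# Scaling a vector keeps its direction, so it lands in the same bucket
same_direction = lsh_key([2 * x for x in query_vec], planes)
```

One caveat: two similar vectors can still fall on opposite sides of a hyperplane and get different keys. That means an LSH-keyed filter can say "no" to a query that is actually in scope, so the "definitely not" guarantee applies to the key, not to the meaning behind it. Fewer bits per key, or several independent sets of hyperplanes, reduce that risk at the cost of more false positives.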

Here is how we structured the check in our Python-based RAG service:


PYTHON
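from pybloom_live import BloomFilter
import hashlib

# Initialize filter: capacity of 100k entries, 1% false-positive rate
bloom = BloomFilter(capacity=100000, error_rate=0.01)

def feature_hash(text):
    # extract_keywords is defined elsewhere in the service; it is assumed to
    # return one canonical string (e.g. the sorted top-k keywords, joined)
    features = extract_keywords(text)
    return hashlib.sha256(features.encode("utf-8")).hexdigest()

def index_text(text):
    # Call during ingestion: a filter that was never populated rejects everything
    bloom.add(feature_hash(text))

def is_query_relevant(query_text):
    # False: definitely not in our knowledge base. True: possibly in it.
    return feature_hash(query_text) in bloom

# Usage in the pipeline
def handle_query(user_query):
    if not is_query_relevant(user_query):
        return "I'm sorry, I only answer questions related to HR policies."
    # perform_vector_search is the existing retrieval step, defined elsewhere
    return perform_vector_search(user_query)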

Integrating with existing RAG strategies

The power of this approach isn't in replacing your vector search, but in acting as a fast-path filter. We often combine this with metadata filtering (see "Implementing Metadata Filtering for Precise RAG Pipeline Retrieval") to ensure that even if the Bloom filter suggests a match, the metadata constraints are still respected.

When you're designing your vector search layer, remember that Bloom filters are probabilistic. You will get false positives. This is fine! A false positive just means you perform a vector search that turns up nothing relevant (no hit above your similarity threshold), which is the same behavior you have without the filter. You lose a few milliseconds on those, but you save seconds on the true negatives.
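The trade-off can be written down as a simple expected-latency estimate. The sketch below is only a model: the function name is made up for this example, and every input value is an illustrative placeholder rather than a measurement.

```python
def expected_latency_ms(filter_ms, search_ms, out_of_scope_share, false_positive_rate):
    """Mean per-query latency when a Bloom filter gates the vector search."""
    # In-scope queries always pass; out-of-scope ones pass only on a false positive
    pass_share = (1 - out_of_scope_share) + out_of_scope_share * false_positive_rate
    return filter_ms + pass_share * search_ms


# Illustrative values, not measurements
without_filter = 200.0
with_filter = expected_latency_ms(
    filter_ms=1.0,
    search_ms=200.0,
    out_of_scope_share=0.3,
    false_positive_rate=0.01,
)
```

The model shows that the saving scales with the share of out-of-scope traffic. If almost every query is in scope, the filter only adds its own check time.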

Lessons from the field

We faced a major hurdle during the initial rollout: cache drift. As we added new documents to our knowledge base, our "no-match" filter became stale. We had to implement a periodic rebuild of the filter.

If you're worried about keeping things fresh, I highly recommend looking into the strategies discussed in "Semantic Cache Invalidation: Managing TTLs for RAG Pipelines." We now trigger a background job to re-populate the Bloom filter whenever our document ingestion pipeline finishes a batch update.
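One way to structure that background job is to build a fresh filter off to the side and swap it in once it is fully populated, so queries never see a half-built filter. This is a sketch under assumptions: `FilterHolder` and its method names are hypothetical, and a plain `set` stands in for the Bloom filter so the example runs without extra dependencies.

```python
import threading


class FilterHolder:
    """Holds the live filter and swaps in a freshly built one after ingestion."""

    def __init__(self, make_filter):
        # make_filter is any zero-argument factory returning an object with
        # .add() and `in` support; in production it would construct a Bloom filter
        self._make_filter = make_filter
        self._filter = make_filter()
        self._lock = threading.Lock()

    def rebuild(self, keys):
        # Build the replacement off to the side, then swap the reference
        fresh = self._make_filter()
        for key in keys:
            fresh.add(key)
        with self._lock:
            self._filter = fresh

    def might_contain(self, key):
        with self._lock:
            current = self._filter
        return key in current


holder = FilterHolder(make_filter=set)  # set is a stand-in for a Bloom filter
holder.rebuild(["parental leave", "expense policy"])
before = holder.might_contain("remote work")

# A later ingestion batch adds a new topic, then triggers a rebuild
holder.rebuild(["parental leave", "expense policy", "remote work"])
after = holder.might_contain("remote work")
```

The rebuild needs the full list of keys, not just the new batch, which is why a master copy of the keys has to live somewhere outside the filter.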

Some things I’m still experimenting with:

Multi-stage filtering: Using a small Bloom filter for "common irrelevant queries" and a larger one for "domain-specific knowledge."
Dynamic error rates: Adjusting the error_rate based on the current load on our vector database.
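A rough sketch of the multi-stage idea, with heavy caveats since this is still experimental: the first stage checks the normalized raw query text against a small filter of known off-topic queries, and only queries that pass it pay for feature extraction and the second check. The function names, the toy keyword extractor, and the use of plain sets in place of Bloom filters are all assumptions made to keep the example runnable.

```python
def route_query(query_text, offtopic_filter, domain_filter, extract_key):
    """Return 'reject' or 'search' using two membership checks in sequence."""
    normalized = " ".join(query_text.lower().split())
    # Stage 1: cheap check on raw text, no NLP needed
    if normalized in offtopic_filter:
        return "reject"
    # Stage 2: extract features, then check the domain filter
    if extract_key(normalized) not in domain_filter:
        return "reject"
    return "search"


def toy_key(text):
    # Hypothetical stand-in for keyword extraction: keep words longer than 4 chars
    words = [w for w in text.rstrip("?").split() if len(w) > 4]
    return " ".join(sorted(words))


offtopic = {"what's the weather today?"}  # set stands in for the small filter
domain = {"leave parental"}               # set stands in for the large filter

in_scope = route_query("How much parental leave do I get?", offtopic, domain, toy_key)
blocked = route_query("What's the   weather today?", offtopic, domain, toy_key)
unknown = route_query("Best pizza nearby?", offtopic, domain, toy_key)
```

One thing to watch with real Bloom filters: a false positive in the first stage rejects a valid query outright, so that filter needs a much stricter error rate than the second one, where a false positive only costs an unnecessary search.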

It’s not a silver bullet, but adding a probabilistic layer for LLM caching has consistently shaved off roughly 150ms from our average response time. It forces you to be disciplined about what your system actually knows, which usually leads to a better user experience anyway.

Frequently Asked Questions

1. Does a Bloom filter replace a vector database? No. It only helps you decide whether to query the vector database. It cannot return the actual document chunks or perform semantic similarity scoring.

2. What happens if the Bloom filter has a false positive? The system proceeds to the vector search as if the filter didn't exist. The search should return nothing above your similarity threshold, and your RAG logic should handle that gracefully (e.g., "I don't have information on that").

3. How do I handle updates to the knowledge base? Since standard Bloom filters don't support deletion, you must periodically recreate the filter. We do this by keeping a master set of hashes in a database and rebuilding the filter daily.
