LLM caching with semantic Bloom filters helps you slash latency by pre-filtering queries. Learn to combine probabilistic structures with your RAG pipeline.
Last month, our RAG pipeline's p99 latency spiked because we were hitting the vector database for every single user query, even when we knew the documents didn't contain the requested information. We needed a way to drop irrelevant requests before they ever touched our embedding model or the vector store.
Adding LLM caching is standard practice, but exact-match caches fail when user phrasing varies. Semantic caching (covered in "Semantic caching for RAG pipelines: Cut latency and costs") helps with similar queries, but we also needed a way to handle the "I don't know" cases, meaning queries that fall outside our knowledge base, without wasting compute. That's where Bloom filters come in.
A Bloom filter is a space-efficient probabilistic data structure that tells you if an element is definitely not in a set or possibly in the set. In the context of latency reduction, it acts as a gatekeeper.
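To make the "definitely not" versus "possibly" behavior concrete, here is a minimal from-scratch sketch using only the standard library. The class name `TinyBloomFilter`, its default sizes, and the example topic strings are illustrative assumptions, not what we run in production (that is a library filter).

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter: a bit array plus k salted hash functions."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # Any unset bit proves the item was never added.
        return all(self.bits[pos] for pos in self._positions(item))


topics = TinyBloomFilter()
for term in ["parental leave", "expense policy", "remote work"]:
    topics.add(term)

print("parental leave" in topics)  # True: added items always match
print("gpu pricing" in topics)     # almost certainly False; True would be a false positive
```

Because bits are only ever set and never cleared, an added item can never be reported as missing. The only possible error is an unrelated item whose positions all happen to be set already.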
If you know your vector store only contains documents about "Internal HR Policies," you can hash incoming queries and check them against a Bloom filter that represents the "known topic space." If the query hashes to a "no," you skip the vector search entirely and return a canned response.
We initially tried storing every query-vector pair in a Redis hash. It worked for exact duplicates, but the memory overhead for semantic clustering grew too fast. By switching to a Bloom filter, we reduced our memory footprint by about 75% compared to a full hash map.
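For a rough sense of where the savings come from, the standard sizing formulas give the optimal bit count and number of hash functions for a target capacity and false-positive rate. The sketch below applies them to 100k entries at a 1% error rate. It is a theoretical estimate: a real library may allocate somewhat differently, and the comparison line counts only raw digests, not the per-key overhead of a hash map or Redis.

```python
import math


def bloom_size(capacity: int, error_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for an optimally sized Bloom filter."""
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / capacity * math.log(2)))
    return bits, hashes


bits, hashes = bloom_size(capacity=100_000, error_rate=0.01)
print(f"{bits} bits (~{bits / 8 / 1024:.0f} KiB), {hashes} hash functions")

# For comparison: 100k raw SHA-256 digests at 32 bytes each, before any
# per-key overhead a hash map or Redis would add.
print(f"raw digests: ~{100_000 * 32 / 1024:.0f} KiB")
```

With these inputs the filter needs roughly 117 KiB and 7 hash functions, independent of how long the hashed keys are.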
To implement this, you need a way to map semantic space to the filter. A Bloom filter only answers exact-membership questions about hashed keys, so two nearly identical embeddings would land on unrelated bits and you can't store vectors directly. Instead, we either use locality-sensitive hashing (LSH), which maps nearby vectors to the same bucket, or simply hash the top-k keywords extracted from the user query via a lightweight NLP model like spaCy or a small BERT encoder.
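As a rough illustration of the LSH option, the sketch below uses random hyperplanes (one common LSH family for cosine similarity): each hyperplane contributes one bit depending on which side the vector falls on, and the resulting bit string is what you would add to, or look up in, the Bloom filter. The function names, the 4-dimensional embedding, and the 8-bit signature length are assumptions chosen for readability.

```python
import random


def make_hyperplanes(dim: int, num_bits: int, seed: int = 0):
    """Draw num_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes) -> str:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot > 0 else "0")
    return "".join(bits)


planes = make_hyperplanes(dim=4, num_bits=8)
embedding = [0.12, -0.40, 0.33, 0.90]  # hypothetical query embedding
scaled = [2 * x for x in embedding]    # same direction, different length

# Signatures depend only on direction, so these two match exactly.
print(lsh_signature(embedding, planes) == lsh_signature(scaled, planes))
```

Vectors that are close but not identical in direction usually share a signature, though not always. Shorter signatures make matches more likely at the cost of letting more unrelated queries through.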
Here is how we structured the check in our Python-based RAG service:
```python
import hashlib

from pybloom_live import BloomFilter

# Initialize filter: capacity of 100k entries, 0.01 (1%) false-positive rate
bloom = BloomFilter(capacity=100000, error_rate=0.01)


def is_query_relevant(query_text):
    # Extract domain-specific features.
    # extract_keywords is our own helper (not shown), assumed to return a string.
    features = extract_keywords(query_text)
    query_hash = hashlib.sha256(features.encode()).hexdigest()
    if query_hash not in bloom:
        return False  # Definitely not in our knowledge base
    return True  # Possibly in the knowledge base: run the search


# Usage in the pipeline
def answer(user_query):
    if not is_query_relevant(user_query):
        return "I'm sorry, I only answer questions related to HR policies."
    # perform_vector_search is the existing retrieval step (not shown)
    return perform_vector_search(user_query)
```
The power of this approach isn't in replacing your vector search, but in acting as a fast-path filter. We often combine this with metadata filtering (see "Implementing Metadata Filtering for Precise RAG Pipeline Retrieval") to ensure that even if the Bloom filter suggests a match, the metadata constraints are still respected.
When you're designing your vector search layer, remember that Bloom filters are probabilistic. You will get false positives. This is fine! A false positive just means you perform a vector search that turns up nothing, which is the same behavior you have without the filter. You lose a few milliseconds on those, but you save seconds on the true negatives.
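The trade-off can be written as simple arithmetic: every query pays for the filter check, but only in-scope queries and false positives pay for the vector search. The sketch below works through that with made-up numbers. The rates and timings are hypothetical placeholders, not measurements from our system.

```python
def expected_latency_ms(in_scope_rate, false_positive_rate, filter_ms, search_ms):
    """Average per-query latency when a Bloom filter gates the vector search."""
    # A query reaches the vector store if it is in scope, or if it is out of
    # scope but slips through as a false positive.
    pass_rate = in_scope_rate + (1 - in_scope_rate) * false_positive_rate
    return filter_ms + pass_rate * search_ms


# Hypothetical numbers for illustration only, not measurements.
baseline = 400.0  # every query hits the vector store
gated = expected_latency_ms(
    in_scope_rate=0.7, false_positive_rate=0.01, filter_ms=1.0, search_ms=400.0
)
print(f"baseline: {baseline:.1f} ms, gated: {gated:.1f} ms")
```

The false-positive term barely moves the result at a 1% error rate. The share of out-of-scope traffic is what determines how much the filter saves.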
We faced a major hurdle during the initial rollout: cache drift. As we added new documents to our knowledge base, our "no-match" filter became stale. We had to implement a periodic rebuild of the filter.
If you're worried about keeping things fresh, I highly recommend looking into the strategies discussed in "Semantic Cache Invalidation: Managing TTLs for RAG Pipelines." We now trigger a background job to re-populate the Bloom filter whenever our document ingestion pipeline finishes a batch update.
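One way to structure that rebuild is to construct the new filter off to the side and then swap a single reference, so in-flight queries never see a half-populated filter. The sketch below assumes a hypothetical `FilterHolder` class and an `on_ingestion_batch_finished` hook. It uses `frozenset` as a dependency-free stand-in for the filter, where a real service would build a Bloom filter instead.

```python
import threading


class FilterHolder:
    """Holds the live filter and swaps in a rebuilt one atomically."""

    def __init__(self, build_filter):
        self._build_filter = build_filter  # callable: iterable of keys -> filter
        self._lock = threading.Lock()
        self._filter = build_filter([])

    def rebuild(self, keys) -> None:
        # Build the replacement off to the side, then swap the reference so
        # readers never see a half-populated filter.
        fresh = self._build_filter(keys)
        with self._lock:
            self._filter = fresh

    def might_contain(self, key) -> bool:
        with self._lock:
            current = self._filter
        return key in current


# frozenset is a stand-in so the sketch runs without dependencies; in practice
# build_filter would create and populate a Bloom filter instead.
holder = FilterHolder(build_filter=frozenset)


def on_ingestion_batch_finished(all_feature_hashes):
    # Hypothetical hook called by the ingestion pipeline after each batch.
    holder.rebuild(all_feature_hashes)


on_ingestion_batch_finished(["hash-of-parental-leave", "hash-of-expenses"])
print(holder.might_contain("hash-of-parental-leave"))  # True
print(holder.might_contain("hash-of-gpu-pricing"))     # False
```

Rebuilding from the full set of hashes, rather than adding to the live filter, also drops entries for documents that have since been removed.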
Some things I’m still experimenting with:
- … error_rate based on the current load on our vector database.

It's not a silver bullet, but adding a probabilistic layer for LLM caching has consistently shaved off roughly 150ms from our average response time. It forces you to be disciplined about what your system actually knows, which usually leads to a better user experience anyway.
1. Does a Bloom filter replace a vector database? No. It only helps you decide whether to query the vector database. It cannot return the actual document chunks or perform semantic similarity scoring.
2. What happens if the Bloom filter has a false positive? The system proceeds to the vector search as if the filter didn't exist. The vector database will return nothing relevant (an empty result, or only matches below your similarity threshold), and your RAG logic should handle that gracefully (e.g., "I don't have information on that").
3. How do I handle updates to the knowledge base? Since standard Bloom filters don't support deletion, you must periodically recreate the filter. We do this by keeping a master set of hashes in a database and rebuilding the filter daily.