AI/ML | June 26, 2026 | 4 min read

Hybrid search for RAG: Combining Vector Embeddings and BM25

Hybrid search for RAG pipelines solves retrieval failures by combining vector embeddings with BM25. Learn the practical steps to boost your search accuracy.

Tags: AI, LLM, RAG

When I first started building RAG pipelines, I assumed semantic search was a silver bullet. I spent weeks fine-tuning my embedding models, only to find that my system consistently failed whenever a user searched for a specific product ID or a rare technical acronym. It turns out, relying solely on vector embeddings is a recipe for missing the "long tail" of exact matches.

That’s where hybrid search becomes essential. By layering BM25 on top of your semantic search, you get the best of both worlds: the conceptual understanding of dense vectors and the surgical precision of keyword matching.

Why Vector Embeddings Aren't Enough

Vector embeddings are fantastic at capturing intent. If a user asks about "the cost of running a server," a vector search will happily return documents about "cloud infrastructure pricing" even if the word "cost" never appears.

However, vectors often struggle with:

Exact Matches: Product serial numbers, specific error codes, or unique proper nouns.
Out-of-vocabulary terms: Words that didn't appear frequently in your embedding model's training data.
Ambiguity: Sometimes the "semantic neighborhood" of a query is too broad, leading to irrelevant results.

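BM25 (Best Matching 25) doesn't care about concepts. It scores a document by how often the query terms appear in it, how rare those terms are across the corpus, and how long the document is. If a user types "Error-404-X", BM25 will find that exact string with ease (provided your tokenizer keeps it intact), whereas a vector model might struggle to represent that specific token sequence accurately.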
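To make the scoring concrete, here is a minimal sketch of Okapi BM25 in plain Python. The function name, the toy documents, and the defaults `k1=1.5` and `b=0.75` are my own assumptions for illustration, and the IDF uses the smoothed "+1" variant; a real keyword engine will differ in the details.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs_tokens for term in set(d))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Documents longer than average are penalized via the length ratio.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus (hypothetical), already lowercased and split on whitespace.
docs = [
    "restart the service to clear error-404-x".split(),
    "cloud infrastructure pricing for small teams".split(),
    "generic error handling guide for http services".split(),
]
scores = bm25_scores(["error-404-x"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the first document contains the exact token, so it is the only one with a non-zero score. Note that "error" in the third document does not match "error-404-x": BM25 works on whole tokens, not substrings.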

Implementing Hybrid Search in RAG Pipelines

To build a production-grade hybrid system, you need to put the results from both search methods on a common footing before merging them. A vector search might return a cosine similarity score (bounded between -1 and 1, and in practice often between 0 and 1), while BM25 returns an unbounded relevance score. You cannot simply add these together: either normalize the scores first, or merge on ranks instead of scores.

The Workflow

Query Intake: Receive the user's query.
Parallel Execution: Send the query to your vector database (e.g., Pinecone, Weaviate, or Qdrant) and your keyword index (e.g., Elasticsearch or OpenSearch).
Reciprocal Rank Fusion (RRF): This is a widely used way to merge the lists. Instead of comparing raw scores, you take the rank of each document in each list and combine the ranks.

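The formula for RRF is simple:

RRF score(d) = sum over lists i of 1 / (k + rank_i(d))

Here rank_i(d) is the 1-based position of document d in list i (a list that doesn't contain d contributes nothing), and k is a constant, usually set to 60.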
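As a rough sketch, the fusion step fits in a few lines of Python. The function name and the document IDs are made up for illustration; the only inputs are the two ranked lists of IDs, so no score normalization is needed.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs; returns (doc_id, score) pairs, best first."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the top hit of a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search ranking
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Here `doc_a` (ranked 1st and 2nd) edges out `doc_c` (ranked 3rd and 1st), and both beat the documents that appear in only one list.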

Practical Implementation Snippet

If you're using Python, you don't need to reinvent the wheel. Many modern vector databases now support hybrid search natively. Here is how it looks conceptually, using a hypothetical search client (the parameter names vary from one database to another):


PYTHON
# Pseudo-code for a hybrid search request.
# `client` and its parameters are illustrative, not a specific library's API.
results = client.search(
    query="How to fix Error-404-X",
    hybrid=True,
    alpha=0.3,  # Weight for vector search
    beta=0.7,   # Weight for BM25 (alpha + beta = 1)
)

In this example, the alpha and beta parameters act as your "knobs." The snippet leans toward BM25 (0.3 vector, 0.7 keyword); in practice I usually start with a 50/50 split and adjust based on my evaluation set. If your use case is highly technical, bump the BM25 weight up. If you're building a conversational assistant, lean into the vector embeddings.
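If your database doesn't do the weighting for you, one way to approximate it is to min-max normalize each score set and blend them. This is a sketch under my own assumptions: the function names and scores are hypothetical, a single `alpha` stands in for the alpha/beta pair (BM25 gets `1 - alpha`), and a document missing from one list is treated as scoring 0 there.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the 0-1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every hit as equally good.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(vector_scores, bm25_scores, alpha=0.5):
    """Blend normalized scores; alpha weights vectors, (1 - alpha) weights BM25."""
    vec, kw = min_max(vector_scores), min_max(bm25_scores)
    combined = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in vec.keys() | kw.keys()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

vector_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}  # cosine similarities
bm25_scores = {"doc_c": 12.4, "doc_a": 3.1, "doc_d": 1.0}      # unbounded BM25 scores

keyword_heavy = weighted_fusion(vector_scores, bm25_scores, alpha=0.3)
vector_heavy = weighted_fusion(vector_scores, bm25_scores, alpha=0.7)
```

With `alpha=0.3` the strong keyword hit `doc_c` rises to the top; with `alpha=0.7` the semantic favorites `doc_a` and `doc_b` lead instead. Min-max scaling is sensitive to outliers, which is one reason rank-based fusion is often the safer default.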

The Trade-offs of Hybrid Search

I initially tried to implement a custom re-ranking layer using a cross-encoder model. While it gave me the best results, it added roughly 400ms of latency per query. For a real-time chatbot, that’s a lifetime.

Method	Precision (Exact)	Recall (Semantic)	Latency
Vector Only	Low	High	Low
BM25 Only	High	Low	Very Low
Hybrid	High	High	Medium

If you're finding that your retriever is still bringing back noise, you might also want to look into metadata filtering (see "Implementing Metadata Filtering for Precise RAG Pipeline Retrieval") to prune your search space before the ranking happens.

Lessons Learned

The biggest mistake I made was ignoring the data quality of my BM25 index. BM25 relies on inverted indexes; if your documents aren't properly tokenized, the results will be garbage. Ensure your preprocessing pipeline for the keyword index is just as robust as your embedding pipeline.
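As an illustration of what "properly tokenized" can mean for identifiers, here is a small regex tokenizer that keeps hyphenated or dotted codes together. The pattern and function name are my own assumptions, not what any particular search engine does by default; whatever you choose, apply the same tokenizer at index time and at query time.

```python
import re

# Keep runs of letters/digits joined by hyphens, dots, or underscores as one token,
# so identifiers such as "Error-404-X" or "v2.1.0" survive intact.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[-._][a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and extract identifier-preserving tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("How to fix Error-404-X in v2.1.0?")
# tokens == ["how", "to", "fix", "error-404-x", "in", "v2.1.0"]
```

A tokenizer that splits on every non-alphanumeric character would instead produce "error", "404", and "x", and the exact-match advantage of BM25 would be diluted across three very common tokens.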

Also, don't forget that if your chunks are too large, even the best hybrid search will struggle to provide precise answers. I've found that moving toward semantic chunking (see "Implementing Semantic Chunking for RAG Pipelines: A Practical Guide") helps keep the search results focused on specific topics, which naturally improves the BM25 hit rate.

Hybrid search isn't just a "nice to have"; for any RAG system that has to handle exact identifiers alongside natural-language questions, it's a requirement. It's messy, it requires balancing two different retrieval paradigms, and it adds complexity to your infrastructure. But when a user types in a specific technical identifier and your bot actually finds the right document, the extra work pays for itself immediately.

FAQ

Q: Do I need two separate databases for hybrid search?
A: Not necessarily. Many modern vector databases (like Weaviate or Qdrant) let you store embeddings alongside keyword-searchable text or sparse vectors, so both kinds of retrieval can run against the same store.

Q: Is hybrid search always better than vector search?
A: Not always. If your domain is highly conceptual and rarely uses specific terminology (e.g., a creative writing assistant), vector-only search is likely sufficient and cheaper to maintain.

Q: What is the biggest challenge with hybrid search?
A: Normalization. Because vector scores and BM25 scores exist on different scales, you have to be very intentional about how you combine them (by normalizing the scores or by fusing on ranks with RRF), otherwise one method will dominate the results.
