Advanced Retrieval Strategies for RAG: Hybrid Search & Reranking

Master production-grade retrieval strategies for RAG. Learn to implement hybrid search, optimize with cross-encoder reranking, and automate query expansion.


Previously in this course, we established the foundation for Vector Databases and Similarity Search. While pure vector search is the industry standard for semantic matching, it often fails on exact-match queries like product SKUs or specific technical acronyms.

In this lesson, we move beyond simple vector similarity to build a robust retrieval pipeline. We will implement hybrid search, integrate cross-encoder rerankers for precision, and automate query expansion to handle ambiguous user intent.

The Retrieval Hierarchy

In production, a single retrieval method is rarely sufficient. We combine three stages, covered below in this order:

Candidate Retrieval: Fast, broad search (Hybrid: Vector + BM25).
Reranking: Slower, precise scoring (Cross-Encoders).
Query Transformation: Refining the user's intent with an LLM. At query time this stage runs first, before candidate retrieval.

1. Implementing Hybrid Search

Hybrid search combines the semantic understanding of dense embeddings with the lexical precision of BM25. We use Reciprocal Rank Fusion (RRF) to merge the results from both sources into a single ranked list.


PYTHON
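def reciprocal_rank_fusion(results_list, k=60):
    """
    Combine multiple ranked result lists into one with Reciprocal Rank Fusion.

    results_list: list of ranked lists of (doc_id, score) tuples, best first.
    k: smoothing constant; larger values flatten the gap between top ranks.
    Returns a list of (doc_id, fused_score) sorted by fused score, descending.
    """
    fused_scores = {}
    for results in results_list:
        # Only the position in each list matters; the raw scores are ignored.
        for rank, (doc_id, _) in enumerate(results):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0.0
            # enumerate() is 0-based, so rank + 1 is the 1-based rank.
            fused_scores[doc_id] += 1.0 / (k + rank + 1)

    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)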

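As detailed in Hybrid search for RAG: Combining Vector Embeddings and BM25, the key is balancing how much each retriever contributes. The function above treats every ranking equally; to weight the retrievers, scale each one's reciprocal-rank term by a per-retriever weight. In production, start with a 0.5/0.5 split, tune it offline against your evaluation set, and confirm the chosen weights with A/B testing.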
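
One way to make that split explicit is a weighted variant of RRF. The sketch below is an illustration rather than a reference implementation: the function name weighted_rrf, the retriever names "vector" and "bm25", and the toy document IDs are all assumptions introduced here.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights=None, k=60):
    """Fuse ranked lists of doc IDs, scaling each list's contribution by a weight.

    rankings: dict mapping a retriever name to a list of doc IDs, best first.
    weights:  dict mapping the same names to non-negative floats
              (any retriever missing from the dict defaults to 1.0).
    """
    weights = weights or {}
    fused = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights.get(name, 1.0)
        # start=1 gives 1-based ranks, so the top hit contributes weight / (k + 1).
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] += weight / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Toy rankings (hypothetical doc IDs) from the two retrievers.
rankings = {
    "vector": ["d1", "d2", "d3"],
    "bm25": ["d3", "d1", "d4"],
}

balanced = weighted_rrf(rankings, weights={"vector": 0.5, "bm25": 0.5})
keyword_heavy = weighted_rrf(rankings, weights={"vector": 0.2, "bm25": 0.8})
```

With equal weights the ordering matches plain RRF (d1, d3, d2, d4). Shifting weight toward BM25 moves d3, the top lexical hit, ahead of d1.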

2. Semantic Reranking with Cross-Encoders

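Dense retrieval uses bi-encoders, which embed the query and each document independently and compare them only through a single vector similarity, so fine-grained interactions between query terms and document terms are lost. Cross-encoders process the query and document together in one forward pass, letting attention operate across both texts. This yields significantly higher accuracy at the cost of latency, because every query-document pair needs its own forward pass and nothing can be precomputed at indexing time.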

As discussed in Optimizing RAG Retrieval: A Practical Guide to Semantic Reranking, you should only rerank the "top-k" candidates (e.g., top 50) returned by your hybrid search to keep latency within acceptable bounds.


PYTHON
from sentence_transformers import CrossEncoder

# Load a pre-trained reranker
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, documents):
    # Each document is assumed to expose its content as a `.text` attribute.
    pairs = [(query, doc.text) for doc in documents]
    # predict() scores every (query, document) pair; higher means more relevant.
    scores = model.predict(pairs)
    # Pair each document with its score and sort best-first
    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
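
To keep the latency budget explicit, the candidate cap can live in the reranking call itself. The following is a minimal sketch under stated assumptions: rerank_top_k and overlap_scorer are hypothetical names, and the toy scorer only stands in for a real model so the example runs without downloading weights. In practice you would pass something like the predict method of the CrossEncoder loaded above as score_fn.

```python
from typing import Callable, List, Sequence, Tuple


def rerank_top_k(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    max_candidates: int = 50,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score only the first `max_candidates` texts and return the best `top_n`."""
    # Candidates are assumed to arrive best-first from the retrieval stage.
    shortlist = list(candidates[:max_candidates])
    if not shortlist:
        return []
    pairs = [(query, text) for text in shortlist]
    scores = score_fn(pairs)  # one batched call over all pairs
    ranked = sorted(zip(shortlist, scores), key=lambda item: item[1], reverse=True)
    return [(text, float(score)) for text, score in ranked[:top_n]]


def overlap_scorer(pairs):
    """Toy stand-in for a cross-encoder: counts words shared by query and text."""
    return [
        len(set(query.lower().split()) & set(text.lower().split()))
        for query, text in pairs
    ]


docs = [
    "reset your password from the login page",
    "billing cycles and invoices",
    "how to reset a forgotten password",
]
top = rerank_top_k("reset password", docs, overlap_scorer, top_n=2)
```

Because the cap is applied before scoring, the cost of the reranking stage is bounded by max_candidates regardless of how many documents the retrievers return.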

3. Automating Query Expansion

Users often write short, vague queries. Query expansion uses an LLM to generate multiple variations of the query, broadening the search space and increasing recall.

Strategy: Generate 3-5 variations using a prompt template: "You are an expert search assistant. Generate 3 alternative versions of the user's search query to improve retrieval from a vector database. Original query: {query}"

Once you have the variations, perform a batch search across all of them and aggregate the results using the RRF function defined above.
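
The expand-then-fuse flow can be sketched as follows. Everything outside the prompt text is an assumption made for illustration: generate is a hypothetical callable wrapping your LLM client, search is a hypothetical callable returning ranked doc IDs, the "Return one query per line" instruction is added here so the output is easy to parse, and fake_llm and fake_search are toy stand-ins. RRF is inlined over doc IDs so the block runs on its own.

```python
import re
from collections import defaultdict

PROMPT_TEMPLATE = (
    "You are an expert search assistant. Generate {n} alternative versions of "
    "the user's search query to improve retrieval from a vector database. "
    "Return one query per line. Original query: {query}"
)


def expand_query(query, generate, n=3):
    """Return the original query plus up to n de-duplicated LLM variations."""
    raw = generate(PROMPT_TEMPLATE.format(n=n, query=query))
    queries, seen = [query], {query.lower()}
    for line in raw.splitlines():
        # Strip list markers such as "1. ", "2) ", "- " or "* ".
        candidate = re.sub(r"^\s*(?:[-*]|\d+[.)])\s+", "", line).strip()
        if candidate and candidate.lower() not in seen:
            seen.add(candidate.lower())
            queries.append(candidate)
    return queries[: n + 1]


def search_expanded(query, generate, search, k=60):
    """Run `search` for every variation and fuse the ranked doc IDs with RRF."""
    fused = defaultdict(float)
    for variant in expand_query(query, generate):
        for rank, doc_id in enumerate(search(variant), start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def fake_llm(prompt):
    """Toy stand-in for an LLM call; returns a fixed, list-formatted answer."""
    return "1. how do I change my password\n2. reset account password\n- Reset Password"


def fake_search(q):
    """Toy stand-in for hybrid search: ranks docs by word overlap with the query."""
    corpus = {
        "d1": "change password settings",
        "d2": "reset account password",
        "d3": "billing invoices",
    }
    words = set(q.lower().split())
    hits = [(len(words & set(text.split())), doc_id) for doc_id, text in corpus.items()]
    return [doc_id for score, doc_id in sorted(hits, key=lambda h: -h[0]) if score > 0]


results = search_expanded("reset password", fake_llm, fake_search)
```

Keeping the original query in the list is a deliberate choice in this sketch: if the LLM drifts away from the user's intent, the unmodified query still contributes to the fused ranking.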

Hands-on Exercise

Implement a RetrievalOrchestrator class that:

Takes a raw user query.
Expands the query into 3 variations using a small LLM (e.g., Llama-3-8B).
Executes a hybrid search (Vector + BM25) for each variation.
Performs RRF to get a unified top-50 list.
Reranks the top-50 using a Cross-Encoder.
Returns the top-5 documents.

Tip: Use the rank_bm25 library for your local BM25 implementation.

Common Pitfalls

Latency Bloat: Running a cross-encoder on 1,000 documents will kill your inference speed. Always limit your reranking set to 20-50 documents.
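Normalization Mismatch: BM25 scores and vector similarity scores live on different scales: BM25 is unbounded and depends on corpus statistics, while cosine similarity is bounded in [-1, 1]. Never sum the raw scores directly; use RRF or another rank-based fusion method, or normalize both score sets to a common range first.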
Over-expansion: Generating too many query variations (e.g., > 10) introduces noise, pulling in irrelevant documents and diluting your final context window.
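
The normalization mismatch is easy to see with a few numbers. The scores below are invented purely for illustration; they are not measurements from any real index.

```python
# Hypothetical raw scores for the same three documents.
bm25_scores = {"a": 12.4, "b": 3.1, "c": 0.2}      # unbounded, corpus-dependent
cosine_scores = {"a": 0.41, "b": 0.38, "c": 0.93}  # bounded in [-1, 1]


def order_by(scores):
    """Return doc IDs sorted by score, highest first."""
    return [doc for doc, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]


def rrf(rankings, k=60):
    """Rank-based fusion: only positions matter, so score scales cancel out."""
    fused = {}
    for ranked_ids in rankings:
        for rank, doc in enumerate(ranked_ids, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused


# Summing raw scores: BM25's larger magnitudes decide the order on their own.
raw_sum = {doc: bm25_scores[doc] + cosine_scores[doc] for doc in bm25_scores}
raw_order = order_by(raw_sum)

# Fusing ranks: the strong semantic match "c" is no longer drowned out.
rrf_order = order_by(rrf([order_by(bm25_scores), order_by(cosine_scores)]))
```

Here the raw sum reproduces the BM25 ordering exactly (a, b, c), so the vector scores have no influence, while rank fusion lifts "c", the best semantic match, above "b".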

Recap

We've moved from basic vector search to a robust retrieval architecture. By combining BM25 for keyword precision, vector search for semantic relevance, and cross-encoders for final verification, we achieve a system capable of handling complex, real-world queries.

Integrating these steps directly into your project's Retriever module should reduce the rate of retrieval failures, a common cause of hallucinations in RAG systems.

Up next: Context Management and Windowing, where we optimize how those retrieved chunks are prepared for the LLM.
