Master production-grade retrieval strategies for RAG. Learn to implement hybrid search, optimize with cross-encoder reranking, and automate query expansion.
Previously in this course, we established the foundation for Vector Databases and Similarity Search. While pure vector search is the industry standard for semantic matching, it often fails on exact-match queries like product SKUs or specific technical acronyms.
In this lesson, we move beyond simple vector similarity to build a robust retrieval pipeline. We will implement hybrid search, integrate cross-encoder rerankers for precision, and automate query expansion to handle ambiguous user intent.
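In production, a single retrieval method is rarely sufficient. We use a multi-stage process: expand the query, retrieve candidates with hybrid search, and rerank the top candidates with a cross-encoder.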
Hybrid search combines the semantic understanding of dense embeddings with the lexical precision of BM25. We use Reciprocal Rank Fusion (RRF) to merge the results from both sources into a single ranked list.
```python
def reciprocal_rank_fusion(results_list, k=60):
    """
    RRF combines multiple search rankings into one.

    results_list: list of ranked lists of (doc_id, score) tuples, best first.
    Returns a list of (doc_id, fused_score) sorted by fused score, descending.
    """
    fused_scores = {}
    for results in results_list:
        # Only the rank matters; each retriever's raw score is ignored.
        for rank, (doc_id, _) in enumerate(results):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0.0
            # enumerate() is 0-based, so add 1 to get a 1-based rank.
            fused_scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
```
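Written as a formula, the function above scores each document by summing reciprocal ranks, where r_i(d) is the 1-based rank of document d in list i and a document missing from a list contributes nothing. The numbers on the right are an illustrative case, assuming k = 60 and a hypothetical document ranked 1st by BM25 and 3rd by vector search:

```latex
\mathrm{RRF}(d) = \sum_{i} \frac{1}{k + r_i(d)}
\qquad\Rightarrow\qquad
\frac{1}{60 + 1} + \frac{1}{60 + 3} \approx 0.0164 + 0.0159 = 0.0323
```

A document ranked 1st by only one retriever would score about 0.0164, so documents that both retrievers agree on rise to the top.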
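As detailed in Hybrid search for RAG: Combining Vector Embeddings and BM25, the key is balancing the weight given to each retriever. The RRF function above treats both rankings equally, which corresponds to a 0.5/0.5 split. In production, start there, tune the split against your evaluation set, and confirm the change with A/B testing.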
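The RRF function shown earlier has no weights, so the sketch below is one assumed way to apply a split: each list's reciprocal-rank contribution is scaled by a per-retriever weight. The name `weighted_rrf`, the `weights` parameter, and the sample document IDs are illustrative and do not come from the lesson.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, k=60):
    """Fuse ranked lists of doc IDs, scaling each list's contribution by its weight."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    fused = defaultdict(float)
    for doc_ids, weight in zip(ranked_lists, weights):
        # start=1 gives 1-based ranks, matching the usual RRF definition.
        for rank, doc_id in enumerate(doc_ids, start=1):
            fused[doc_id] += weight / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings: BM25 finds the exact SKU first, vector search does not.
bm25_ranking = ["sku-123", "doc-a", "doc-b"]
vector_ranking = ["doc-a", "doc-c", "sku-123"]

balanced = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.5, 0.5])
keyword_heavy = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.8, 0.2])

print(balanced[0][0])       # top document with an even split
print(keyword_heavy[0][0])  # top document when BM25 is favored
```

With these sample rankings, shifting weight toward BM25 moves the exact-match SKU to the top, which is the kind of effect you would measure against an evaluation set before changing the split.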
Dense retrieval with bi-encoders embeds the query and each document independently, so it misses fine-grained interactions between query terms and document terms. Cross-encoders process the query and document together in a single forward pass, letting attention operate across both texts. This gives significantly higher accuracy at the cost of latency.
As discussed in Optimizing RAG Retrieval: A Practical Guide to Semantic Reranking, you should only rerank the "top-k" candidates (e.g., top 50) returned by your hybrid search to keep latency within acceptable bounds.
```python
from sentence_transformers import CrossEncoder

# Load a pre-trained reranker
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, documents):
    # Each document is assumed to expose its content as a .text attribute.
    pairs = [(query, doc.text) for doc in documents]
    scores = model.predict(pairs)
    # Pair scores with documents and sort, highest score first
    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
```
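To make the top-k cutoff concrete, here is a minimal sketch that scores only the first `top_k` candidates and returns the best `final_n`. The function name, both parameters, and the toy `word_overlap_scorer` are assumptions for illustration; in a real pipeline you would likely pass the cross-encoder's `predict` method as `score_pairs`.

```python
from typing import Callable, List, Sequence, Tuple


def rerank_top_k(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int = 50,
    final_n: int = 5,
) -> List[Tuple[str, float]]:
    """Rerank only the first top_k candidates and return the best final_n."""
    shortlist = list(candidates[:top_k])
    if not shortlist:
        return []
    pairs = [(query, text) for text in shortlist]
    scores = score_pairs(pairs)
    ranked = sorted(zip(shortlist, scores), key=lambda item: item[1], reverse=True)
    return [(text, float(score)) for text, score in ranked[:final_n]]


def word_overlap_scorer(pairs):
    """Toy stand-in for a cross-encoder: counts words shared by query and text."""
    return [
        float(len(set(q.lower().split()) & set(t.lower().split())))
        for q, t in pairs
    ]


# Hypothetical candidates, already ordered by the first-stage retriever.
candidates = [
    "reset your password from the login page",
    "billing and invoices overview",
    "how to reset a forgotten password",
    "password policy for administrators",
]
top = rerank_top_k("reset password", candidates, word_overlap_scorer, top_k=3, final_n=2)
```

Because the expensive scorer only ever sees `top_k` pairs, reranking cost stays bounded no matter how many candidates the first stage returns. The trade-off is that anything ranked below the cutoff can never be recovered.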
Users often write short, vague queries. Query expansion uses an LLM to generate multiple variations of the query, broadening the search space and increasing recall.
Strategy: Generate 3-5 variations using a prompt template: "You are an expert search assistant. Generate 3 alternative versions of the user's search query to improve retrieval from a vector database. Original query: {query}"
Once you have the variations, perform a batch search across all of them and aggregate the results using the RRF function defined above.
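The expansion and aggregation steps can be sketched as follows. The `llm` and `search` callables are placeholders for whatever LLM client and retriever you use, and `fake_llm`, `fake_search`, and the sample index are made up for the demonstration. The fusion loop repeats the RRF rule inline so the example runs on its own.

```python
import re
from typing import Callable, Dict, List, Sequence, Tuple

PROMPT_TEMPLATE = (
    "You are an expert search assistant. Generate 3 alternative versions of "
    "the user's search query to improve retrieval from a vector database. "
    "Original query: {query}"
)


def expand_query(query: str, llm: Callable[[str], str], max_variations: int = 5) -> List[str]:
    """Return the original query followed by de-duplicated LLM variations."""
    raw = llm(PROMPT_TEMPLATE.format(query=query))
    queries = [query]
    seen = {query.strip().lower()}
    for line in raw.splitlines():
        # Drop list markers such as "1." or "-" that LLMs often prepend.
        candidate = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if candidate and candidate.lower() not in seen:
            seen.add(candidate.lower())
            queries.append(candidate)
    return queries[: max_variations + 1]


def multi_query_search(
    queries: Sequence[str],
    search: Callable[[str], Sequence[str]],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Run every query and fuse the ranked doc IDs with reciprocal rank fusion."""
    fused: Dict[str, float] = {}
    for q in queries:
        for rank, doc_id in enumerate(search(q), start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; note the duplicate on the last line.
    return (
        "1. how to change login password\n"
        "2. recover forgotten password\n"
        "- How to change login password"
    )


fake_index = {
    "reset password": ["doc-1", "doc-2"],
    "how to change login password": ["doc-2", "doc-3"],
    "recover forgotten password": ["doc-2", "doc-1"],
}


def fake_search(q: str) -> List[str]:
    return fake_index.get(q, [])


queries = expand_query("reset password", fake_llm)
results = multi_query_search(queries, fake_search)
```

Keeping the original query in the list is a deliberate choice here: if the LLM drifts from the user's intent, the unmodified query still contributes to the fused ranking.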
Implement a RetrievalOrchestrator class that ties together the stages covered in this lesson.
Tip: Use the rank_bm25 library for your local BM25 implementation.
We've moved from basic vector search to a robust retrieval architecture. By combining BM25 for keyword precision, vector search for semantic relevance, and cross-encoders for final verification, we achieve a system capable of handling complex, real-world queries.
Integrating these steps directly into your project's Retriever module should reduce the "retrieval failure" rate, a leading cause of hallucinations in RAG systems.
Up next: Context Management and Windowing, where we optimize how those retrieved chunks are prepared for the LLM.
Retrieval Strategies for RAG