Vector Databases and Similarity Search: Mastering HNSW for RAG

Master vector databases by implementing HNSW for high-dimensional similarity search. Learn to scale your RAG pipeline with production-grade indexing strategies.

Vector Database, HNSW, RAG, Embeddings, MLOps, ai, machine-learning, python

Previously in this course, we completed Project Milestone: Domain-Specific Fine-Tuning for LLMs, where we adapted our base model to a specific task. Now that we have a fine-tuned model, we need a way to feed it relevant context at inference time. This lesson introduces the Vector Database as the backbone of your Retrieval-Augmented Generation (RAG) pipeline.

The Problem of High-Dimensional Search

When we convert text into embeddings, we represent semantic meaning as a dense vector in a high-dimensional space (typically 768 to 4096 dimensions). Finding "similar" documents is mathematically equivalent to finding the nearest neighbors in this space.

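A brute-force search, which computes the cosine similarity between your query and every document in your collection, costs $O(N \cdot D)$ per query, where $N$ is the number of documents and $D$ is the embedding dimension. That is fine for a few thousand documents, but the cost grows linearly with the collection, so at millions of documents it becomes too slow for an interactive RAG system. We need approximate nearest neighbor (ANN) search, which trades a small amount of recall for much lower latency.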
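To make the baseline concrete, here is a minimal sketch of brute-force cosine search in NumPy. The function name `brute_force_search` and the synthetic random "embeddings" are illustrative assumptions, not part of the course project.

```python
import numpy as np

def brute_force_search(query, doc_matrix, top_k=3):
    """Return (indices, scores) of the top_k rows most similar to query by cosine similarity."""
    # Normalize rows and query so that a dot product equals cosine similarity.
    doc_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    # One dot product per document: N * D multiply-adds in total.
    scores = doc_unit @ query_unit
    # Sort by descending score and keep the best top_k.
    top_idx = np.argsort(-scores)[:top_k]
    return top_idx, scores[top_idx]

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 768))              # N=1000 synthetic vectors, D=768
query = docs[42] + 0.01 * rng.normal(size=768)   # slightly perturbed copy of document 42
indices, scores = brute_force_search(query, docs)
print(indices, scores)
```

Every query touches all $N$ rows, which is exactly the linear cost that an ANN index is meant to avoid.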

Understanding HNSW (Hierarchical Navigable Small World)

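One of the most widely used algorithms for efficient retrieval is HNSW. It builds a multi-layered graph in which the sparse top layers provide long-range "express" paths and the dense bottom layer provides local, granular accuracy. A search enters at the top layer, greedily moves toward the query, and drops down a layer each time it can no longer get closer, so it only visits a small fraction of the nodes.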

Feature	Brute Force Search	HNSW Indexing
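Complexity (per query)	$O(N \cdot D)$	Roughly $O(\log N)$ distance computations (empirical)
Accuracy	Exact (100% recall)	High but approximate (tunable recall)
Memory	Low (vectors only)	Higher (vectors plus graph links)
Latency	Grows linearly with $N$	Low and nearly flat as $N$ grows; depends on $D$ and tuning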
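The layered structure comes from how nodes are assigned to layers. In the original HNSW formulation, each inserted node draws its top layer as $\lfloor -\ln(U) \cdot m_L \rfloor$ with $U$ uniform on $(0, 1]$ and $m_L = 1 / \ln(M)$. The sketch below covers only that level-assignment step, not a full index; the function name `assign_level` and the choice `m=16` are illustrative.

```python
import math
import random
from collections import Counter

def assign_level(m, rng):
    """Draw the top layer for a new node: floor(-ln(U) * mL), with mL = 1 / ln(m)."""
    level_multiplier = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(0) cannot occur
    return int(-math.log(u) * level_multiplier)

rng = random.Random(0)
# Count how many nodes have each layer as their TOP layer.
# A node whose top layer is L is also present in every layer below L.
levels = Counter(assign_level(16, rng) for _ in range(100_000))
for level in sorted(levels):
    print(f"top layer {level}: {levels[level]} nodes")
```

Because the distribution decays exponentially, most nodes live only in the bottom layer and each higher layer is much sparser, which is what gives the upper layers their long-range "express" character.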

Setting Up a Vector Database

For our running project, we will use Qdrant as our vector database engine. It provides a robust Python client and handles HNSW indexing natively.


PYTHON
from qdrant_client import QdrantClient, models

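# 1. Initialize the client (":memory:" is local and in-memory; pass path=... to persist)
client = QdrantClient(":memory:")

# 2. Define the collection with its vector configuration.
# Size must match your embedding model's output (e.g., 768 for BGE-base, 1024 for BGE-large).
# Note: recreate_collection drops any existing collection with this name before creating it,
# and newer qdrant-client releases deprecate it in favor of create_collection.
client.recreate_collection(
    collection_name="project_docs",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
)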

High-Dimensional Similarity Search

To perform search, you must ensure your query passes through the exact same embedding pipeline used for your document chunks.


PYTHON
def search_documents(query_text, top_k=3):
    # Assume get_embedding() is the same encoder used to embed your document chunks
    query_vector = get_embedding(query_text)

    # Note: newer qdrant-client releases deprecate search() in favor of query_points()
    results = client.search(
        collection_name="project_docs",
        query_vector=query_vector,
        limit=top_k
    )
    return results

# Example Usage
hits = search_documents("How to optimize transformer throughput?")
for hit in hits:
    print(f"Score: {hit.score:.4f} | ID: {hit.id}")

Managing Persistence and Index Updates

In production, you cannot rebuild your index from scratch every time your application restarts. You must persist the state to disk and handle incremental updates.

Persistence: Pass a storage path to the client, e.g. QdrantClient(path="./qdrant_data"), instead of ":memory:".
Upserting: Use client.upsert() for both new documents and updates. If the ID already exists, the database overwrites the stored vector and payload.
Index Tuning: HNSW has two critical build parameters: m (the number of bi-directional links per node) and ef_construct (the size of the candidate list explored while inserting each node). Higher values generally improve recall but slow down indexing and increase memory use.

Hands-on Exercise

Install qdrant-client and create a script that embeds 100 dummy documents using your fine-tuned model.
Index these vectors into a persistent Qdrant collection.
Perform a query and inspect the score. If the score is low for relevant queries, look into Hybrid search for RAG: Combining Vector Embeddings and BM25 to augment your results.

Common Pitfalls

Dimensionality Mismatch: If your embedding model outputs 768 dimensions but your collection is configured for 1536, the database will reject the vectors. Always validate the output shape of your model.
Normalization: If you use cosine similarity, ensure your vectors are normalized. Some databases handle this internally (Qdrant, for instance, normalizes vectors on insert when the collection uses cosine distance), but it’s still a frequent source of "why are my scores weird?" bugs when you compare scores across systems.
Stale Indexes: After a massive batch update, some vector databases require an explicit optimization step to re-balance the HNSW graph. Check your database docs for the relevant optimization, compaction, or segment-merge operation.
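The first two pitfalls can be caught with a small guard before anything reaches the database. The sketch below is one possible approach; the helper name `prepare_vector` and the constant `EXPECTED_DIM` are hypothetical and should be set to match your own collection.

```python
import numpy as np

EXPECTED_DIM = 768  # assumed value; must equal the `size` configured on the collection

def prepare_vector(vector, expected_dim=EXPECTED_DIM):
    """Validate the embedding shape and return an L2-normalized float32 copy."""
    arr = np.asarray(vector, dtype=np.float32)
    if arr.shape != (expected_dim,):
        raise ValueError(f"expected shape ({expected_dim},), got {arr.shape}")
    norm = np.linalg.norm(arr)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return arr / norm

unit = prepare_vector(np.ones(768))
print(float(np.linalg.norm(unit)))

try:
    prepare_vector(np.ones(1536))  # wrong dimensionality is rejected before upsert
except ValueError as err:
    print(f"rejected: {err}")
```

Running every document and query vector through the same guard keeps the failure close to the embedding step, where it is easiest to diagnose.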

Recap

We've moved from fine-tuning models to building the retrieval infrastructure. By using HNSW, we keep retrieval latency low as the collection grows, at the cost of extra memory and approximate results. Remember that the vector database is only as good as the embeddings you provide; if your retrieval quality is lacking, revisit Implementing Semantic Chunking for RAG Pipelines: A Practical Guide to ensure your data is being indexed in meaningful units.

Up next: We will explore advanced retrieval techniques, specifically how to combine our vector search with traditional keyword-based BM25 to build a robust hybrid search pipeline.
