Mahamudul Hasan Rubel
Lesson 21 of the Advanced AI/ML: Deep Learning, LLMs & Production Systems course
AI/ML · June 27, 2026 · 4 min read

Vector Databases and Similarity Search: Mastering HNSW for RAG

Master vector databases by implementing HNSW for high-dimensional similarity search. Learn to scale your RAG pipeline with production-grade indexing strategies.

Tags: Vector Database, HNSW, RAG, Embeddings, MLOps, ai, machine-learning, python

Previously in this course, we completed Project Milestone: Domain-Specific Fine-Tuning for LLMs, where we adapted our base model to a specific task. Now that we have a fine-tuned model, we need a way to feed it relevant context at inference time. This lesson introduces the Vector Database as the backbone of your Retrieval-Augmented Generation (RAG) pipeline.

The Problem of High-Dimensional Search

When we convert text into embeddings, we represent semantic meaning as a dense vector in a high-dimensional space (typically 768 to 4096 dimensions). Finding "similar" documents is mathematically equivalent to finding the nearest neighbors in this space.

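A brute-force search (calculating the cosine similarity between your query and every document in your collection) costs $O(N \cdot D)$ per query, where $N$ is the number of documents and $D$ is the embedding dimension. That is fine for a few thousand documents, but at 1 million documents and 768 dimensions it is roughly 768 million multiply-adds for every query. For a production RAG system at that scale, we need approximate nearest neighbor (ANN) search.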
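
To make the cost concrete, here is a minimal brute-force baseline in plain Python. The function names and the toy 3-dimensional vectors are assumptions for illustration only; a real pipeline would use your encoder's output and a vectorized library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, documents, top_k=3):
    """Score every document against the query: N documents x D dimensions of work."""
    scored = [(cosine_similarity(query, doc), idx) for idx, doc in enumerate(documents)]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only)
documents = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
hits = brute_force_search([1.0, 0.05, 0.0], documents, top_k=2)
for score, idx in hits:
    print(f"doc {idx}: {score:.4f}")
```

Every query touches every document, which is exactly the work an ANN index tries to avoid.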

Understanding HNSW (Hierarchical Navigable Small World)

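One of the most widely used algorithms for efficient retrieval is HNSW. It builds a multi-layered graph: sparse top layers provide long-range "express" paths, and the dense bottom layer provides local, fine-grained accuracy. A search enters at the top layer, greedily moves toward the query, and drops down a layer each time it can get no closer.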

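Feature    | Brute-Force Search      | HNSW Indexing
Query cost | $O(N \cdot D)$          | Roughly $O(\log N)$ distance computations, each $O(D)$
Accuracy   | Exact (100% recall)     | High (approximate)
Memory     | Low (vectors only)      | Higher (vectors plus graph links)
Latency    | Grows linearly with $N$ | Typically low, often milliseconds or less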
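
The layered descent can be sketched in a few lines of plain Python. This is a deliberately simplified toy, not the real algorithm: it builds each layer as a brute-force k-nearest-neighbor graph over a hand-picked subset of nodes, and it tracks a single candidate. Actual HNSW assigns layers randomly, selects links with heuristics, and keeps a candidate list during search. All names and sizes below are assumptions.

```python
import math
import random

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_knn_layer(vectors, node_ids, k):
    """Link each node to its k nearest peers within one layer (brute force, toy only)."""
    layer = {}
    for i in node_ids:
        others = sorted((j for j in node_ids if j != i),
                        key=lambda j: distance(vectors[i], vectors[j]))
        layer[i] = others[:k]
    return layer

def greedy_search(vectors, layers, entry_point, query):
    """Descend from the sparsest layer to the densest, hopping to closer neighbors."""
    current = entry_point
    for layer in layers:  # ordered top (sparse) to bottom (dense)
        improved = True
        while improved:
            improved = False
            for neighbor in layer[current]:
                if distance(vectors[neighbor], query) < distance(vectors[current], query):
                    current = neighbor
                    improved = True
    return current

random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(200)]
all_ids = list(range(200))
top_ids = all_ids[:5]    # a handful of nodes carry the long-range "express" links
mid_ids = all_ids[:40]   # each layer contains every node of the layer above it
layers = [
    build_knn_layer(vectors, top_ids, k=2),
    build_knn_layer(vectors, mid_ids, k=4),
    build_knn_layer(vectors, all_ids, k=6),
]

query = [0.5, 0.5]
found = greedy_search(vectors, layers, entry_point=top_ids[0], query=query)
exact = min(all_ids, key=lambda i: distance(vectors[i], query))
print(f"greedy result: {found}, exact nearest: {exact}")
```

The greedy result is not guaranteed to match the exact nearest neighbor, which is the "approximate" in ANN: the search only ever compares the query against a small number of nodes along its path.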

Setting Up a Vector Database

For our running project, we will use Qdrant as our vector database engine. It provides a Python client (qdrant-client) and builds HNSW indexes natively.

PYTHON
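from qdrant_client import QdrantClient, models

# 1. Initialize the client (in-memory here; pass path="..." to persist to disk)
client = QdrantClient(":memory:")

# 2. Define the collection with a specific vector configuration.
# Size must match your embedding model's output (e.g., 768 for BGE-base).
# recreate_collection() is deprecated in recent qdrant-client releases,
# so drop any existing collection explicitly before creating it.
if client.collection_exists("project_docs"):
    client.delete_collection("project_docs")
client.create_collection(
    collection_name="project_docs",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
)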

High-Dimensional Similarity Search

To run a search, pass your query through the exact same embedding pipeline (same model, same preprocessing) used for your document chunks. Vectors produced by different models are not comparable.

PYTHON
def search_documents(query_text, top_k=3):
    # Assume get_embedding() is your pre-trained encoder from previous lessons
    query_vector = get_embedding(query_text)

    # query_points() supersedes the deprecated client.search() in recent
    # qdrant-client releases; on older versions, use client.search(query_vector=...)
    response = client.query_points(
        collection_name="project_docs",
        query=query_vector,
        limit=top_k,
    )
    return response.points

# Example usage
hits = search_documents("How to optimize transformer throughput?")
for hit in hits:
    print(f"Score: {hit.score:.4f} | ID: {hit.id}")

Managing Persistence and Index Updates

In production, you cannot rebuild your index from scratch every time your application restarts. You must persist the state to disk and handle incremental updates.

  1. Persistence: Use a persistent storage path in QdrantClient(path="./qdrant_data").
  2. Upserting: Use client.upsert() for both new documents and updates. If the ID exists, the database overwrites the vector.
  3. Index Tuning: HNSW has two critical build-time parameters: m (number of bi-directional links per node) and ef_construct (size of the dynamic candidate list during index building). Higher values improve recall but slow down indexing, and a larger m also increases memory use.

Hands-on Exercise

  1. Install qdrant-client and create a script that embeds 100 dummy documents using your fine-tuned model.
  2. Index these vectors into a persistent Qdrant collection.
  3. Perform a query and inspect the score. If the score is low for relevant queries, look into Hybrid search for RAG: Combining Vector Embeddings and BM25 to augment your results.

Common Pitfalls

  • Dimensionality Mismatch: If your embedding model outputs 768 dimensions but your collection is configured for 1536, the database will reject the vectors. Always validate the output shape of your model.
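  • Normalization: Cosine similarity itself ignores vector length, but many setups compute it as a dot product over unit-length vectors. If your collection uses dot-product distance, or your database does not normalize for you, L2-normalize vectors before inserting and querying. While some databases handle this internally, it's a frequent source of "why are my scores weird?" bugs.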
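  • Stale Indexes: After a massive batch update, some vector databases need time, or an explicit trigger, to rebuild or merge their HNSW index segments, and search quality or latency can suffer until that finishes. The mechanism is engine-specific, so check your database docs for optimization commands (for example, an optimize() call or a forced segment merge).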
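
The first two pitfalls can be caught with a small guard before any vector reaches the database. This is a minimal sketch in plain Python; prepare_vector and its expected_dim parameter are hypothetical names, not part of any library.

```python
import math

def prepare_vector(vector, expected_dim):
    """Check the dimension, then scale the vector to unit L2 norm."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

unit = prepare_vector([3.0, 4.0], expected_dim=2)
print(unit)  # [0.6, 0.8]
```

Calling the same function on both document and query embeddings keeps the two sides consistent, and a wrong-sized vector fails loudly in your own code rather than at insert time.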

Recap

We've moved from fine-tuning models to building the retrieval infrastructure. By using HNSW, we keep retrieval fast as the collection grows. Remember that the vector database is only as good as the embeddings you provide; if your retrieval quality is lacking, revisit "Implementing Semantic Chunking for RAG Pipelines: A Practical Guide" to ensure your data is being indexed in meaningful units.

Up next: We will explore advanced retrieval techniques, specifically how to combine our vector search with traditional keyword-based BM25 to build a robust hybrid search pipeline.

