Mahamudul Hasan Rubel
AI/ML · June 22, 2026 · 5 min read

LLM Documentation: Building Context-Aware Codebase Summarization Systems

LLM documentation tools can automate your codebase summaries. Learn how to build a robust RAG pipeline for code analysis that yields accurate, useful output.

Tags: LLM, RAG, Software Engineering, Documentation, Python, AI-Engineering, AI, Prompt Engineering

Last month, I spent about three days trying to get GPT-4o to write documentation for a legacy microservice. The results were initially useless; the model hallucinated method signatures and completely misunderstood our internal dependency injection pattern. It wasn't until I moved away from "dump the whole file into the prompt" and toward a structured RAG pipeline that I started seeing actual value.

If you’re tired of manual documentation updates, you're likely looking for a way to automate the process. But generating docs from code isn't just about throwing text at an LLM. It’s about building a system that understands your project structure as well as you do.

Why Naive Prompting Fails for Code

We first tried a naive approach: we picked a file, concatenated it with a few surrounding utility functions, and asked the model to "explain this." It failed because it lacked global context. The model didn't know how the service interacted with our message queue or how the database models were defined in a different directory.

You’re essentially fighting the context window. Even with a large window, noise is your enemy. If you're building an LLM documentation tool, you need to be surgical. You need to treat code like data, not just text.

The Architecture of Automated Documentation

To build a reliable system, you need to move beyond simple string matching. We settled on a modular architecture that treats code as a graph of dependencies.

  1. Code Parsing: Use Tree-sitter to parse your source code into an Abstract Syntax Tree (AST). This allows you to extract classes, methods, and docstrings programmatically rather than guessing with regex.
  2. Vectorized Retrieval: Once you have your AST chunks, index them in a vector database like Pinecone or Qdrant. This allows for semantic retrieval when you need to know, for example, "Where is the authentication logic handled?"
  3. Context Assembly: Instead of sending the whole file, you fetch the relevant chunks identified by your search. If you’re just getting started, building a small RAG pipeline end to end in Python is the best way to grasp how these pieces fit together.
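The parsing step can be sketched without any third-party dependency. The example below uses Python's standard-library ast module as a stand-in for Tree-sitter, so it only handles Python source and is not the Tree-sitter API. The extract_chunks function, the chunk field names, and the sample file path are all hypothetical.

```python
import ast


def extract_chunks(source: str, file_path: str) -> list[dict]:
    """Split a Python module into one chunk per class, function, or method.

    Only definitions at module level, or nested directly inside another
    definition, are collected (a def inside an `if` block is skipped).
    """
    tree = ast.parse(source)
    chunks = []

    def visit(node, parents):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                chunks.append({
                    "name": ".".join(parents + [child.name]),
                    "kind": type(child).__name__,
                    "file_path": file_path,
                    "docstring": ast.get_docstring(child),
                    "code": ast.get_source_segment(source, child),
                })
                # Recurse so that methods are recorded under their class name
                visit(child, parents + [child.name])

    visit(tree, [])
    return chunks


SAMPLE = '''
class PaymentService:
    """Handles payment processing."""

    def process_data(self, payload):
        """Validate and persist a payment."""
        return payload
'''

chunks = extract_chunks(SAMPLE, "services/payment.py")
for chunk in chunks:
    print(chunk["name"], "-", chunk["docstring"])
```

Each chunk keeps its qualified name and file path, which is the kind of metadata that becomes useful once the chunks are embedded and indexed.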

Implementing the RAG Pipeline for Code

When you implement a RAG pipeline for your codebase, you cannot rely on simple cosine similarity alone. Code is dense. A function named process_data in a payment service is semantically different from one in a logging service.
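As a toy illustration of that problem, the sketch below gives two process_data chunks identical vectors, so cosine similarity cannot tell them apart, and then uses a path filter to separate them. The vectors, file paths, and the search helper are made up for the example; a real vector database would expose its own filtering API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hand-written toy index: the two process_data chunks share the same vector
INDEX = [
    {"file_path": "payments/service.py", "name": "process_data", "vector": [0.9, 0.1, 0.3]},
    {"file_path": "logging/handlers.py", "name": "process_data", "vector": [0.9, 0.1, 0.3]},
    {"file_path": "payments/models.py", "name": "Invoice", "vector": [0.1, 0.8, 0.2]},
]


def search(query_vector, top_k=2, path_prefix=None):
    """Rank chunks by cosine similarity, optionally restricted to a path prefix."""
    candidates = [
        c for c in INDEX
        if path_prefix is None or c["file_path"].startswith(path_prefix)
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_vector, c["vector"]), reverse=True)
    return [(c["file_path"], c["name"]) for c in ranked[:top_k]]


query = [0.9, 0.1, 0.3]
unfiltered = search(query)                           # both process_data chunks tie
filtered = search(query, path_prefix="payments/")    # only the payment service remains
print(unfiltered)
print(filtered)
```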

We found that attaching metadata, such as file path, class hierarchy, and imports, to each embedded chunk significantly improved retrieval. Here is a simplified look at how we structure our retrieval query:

PYTHON
# Conceptual snippet for retrieving code context.
# resolve_dependencies() is our AST-based helper and is not shown here.
def get_relevant_context(query, vector_store, top_k=5):
    # Retrieve the base chunks by semantic similarity
    results = vector_store.similarity_search(query, k=top_k)

    # Enrich each chunk with the dependencies of the file it came from
    enriched_context = []
    for res in results:
        dependencies = resolve_dependencies(res.metadata['file_path'])
        enriched_context.append(f"{res.page_content}\nDeps: {dependencies}")

    return "\n---\n".join(enriched_context)

With these dependencies injected, the LLM has far less room to hallucinate: it can now see that process_data calls db.save() and queue.publish(). If your retrieval is still noisy, see "Optimizing RAG Retrieval: A Practical Guide to Semantic Reranking" for ways to filter out the irrelevant hits.
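The dependency lookup itself is not shown in this post. One hypothetical way to collect the calls a function makes, assuming Python source and the standard-library ast module (Python 3.9+ for ast.unparse), is sketched below. The list_calls name and the sample function are illustrative only, not the helper we run in production.

```python
import ast


def list_calls(source: str, function_name: str) -> list[str]:
    """Return the sorted, de-duplicated names of calls made inside one function."""
    tree = ast.parse(source)
    calls = set()
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name == function_name:
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    # ast.unparse turns the callee expression back into text, e.g. "db.save"
                    calls.add(ast.unparse(inner.func))
    return sorted(calls)


SAMPLE = '''
def process_data(payload):
    record = validate(payload)
    db.save(record)
    queue.publish("payment.processed", record)
    return record
'''

deps = list_calls(SAMPLE, "process_data")
print(deps)
```

A list like this can be appended to the chunk text as the "Deps:" line, which gives the model the call targets explicitly instead of leaving it to guess them.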

Refinement Through Prompt Engineering

Once you have the right context, prompt engineering becomes the final mile. Don't just ask the model to "write documentation." Use a structured format. We ask the model to output JSON that maps to our documentation schema:

  • Summary: A two-sentence, high-level overview.
  • Side Effects: What external systems does this touch?
  • Errors: What exceptions should the caller expect?

Because the output follows a schema, we can run automated tests against it. I highly recommend "LLM Evaluation Pipelines: Building Automated Tests with LangSmith" for making sure your generated docs don't drift as your code changes.
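A minimal schema check might look like the sketch below. It assumes the three fields above are emitted as the JSON keys summary, side_effects, and errors; those key names, the expected types, and the validate_doc helper are assumptions for illustration, not our exact schema.

```python
import json

# Assumed key names and types for the three documentation fields
REQUIRED_FIELDS = {"summary": str, "side_effects": list, "errors": list}


def validate_doc(raw: str) -> dict:
    """Parse model output and raise ValueError if it does not match the schema."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(doc, dict):
        raise ValueError("top-level value must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc:
            raise ValueError(f"missing field: {field}")
        if not isinstance(doc[field], expected_type):
            raise ValueError(f"{field} must be of type {expected_type.__name__}")
    return doc


good = (
    '{"summary": "Persists a payment. Publishes an event.", '
    '"side_effects": ["database", "message queue"], '
    '"errors": ["ValidationError"]}'
)
doc = validate_doc(good)
print(doc["side_effects"])
```

A check like this can run in CI on every generated doc, so malformed output is rejected before it reaches the published documentation.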

Lessons Learned and Trade-offs

One thing I'm still unsure about is the balance between "freshness" and "cost." Every time we push a PR, we trigger a partial re-indexing of the affected modules. It's roughly 1.8x more expensive than we initially budgeted. We've had to apply the techniques from "LLM Cost Control: Mastering Dynamic Context Window Management" to keep our token usage within reasonable limits.

If I were to start this over, I would spend more time on the parsing layer. General-purpose LLM tokenizers aren't always efficient on code: deep indentation and unusual naming conventions can inflate token counts and fragment identifiers. Using a custom tokenizer where you control the model, or at least pre-processing the code to normalize indentation, can save you a lot of headaches down the line.
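The indentation pre-processing can be as small as the sketch below, which uses only the standard library. The normalize_indentation name and the four-space tab size are assumptions, and note that expandtabs also rewrites tab characters inside string literals, so treat this as a starting point rather than a safe transform for every file.

```python
import textwrap


def normalize_indentation(code: str, tab_size: int = 4) -> str:
    """Expand tabs, drop shared leading indentation, and strip trailing whitespace."""
    expanded = code.expandtabs(tab_size)
    dedented = textwrap.dedent(expanded)
    lines = [line.rstrip() for line in dedented.splitlines()]
    # Remove blank lines at the edges and end with a single newline
    return "\n".join(lines).strip("\n") + "\n"


# A method body extracted from deep inside a class, indented with tabs
raw = "\t\tdef save(self, record):\n\t\t\tself.db.save(record)   \n"
clean = normalize_indentation(raw)
print(clean)
```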

Ultimately, automated documentation is an iterative game. You won't get it perfect on the first try. Start by documenting one service, measure the accuracy against existing human-written docs, and iterate on your retrieval strategy before scaling to the entire monorepo. It’s messy work, but having a codebase that explains itself is worth the effort.

FAQ

Q: Should I use a vector database or a graph database for code? A: Use both if you can. Vector databases are great for semantic search, but graph databases (like Neo4j) are superior for mapping complex dependency trees. We use a hybrid approach.
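To show what the graph side buys you, the sketch below walks a dependency graph breadth-first to find everything a module transitively depends on. A plain in-memory dictionary stands in for the graph database here; this is not Neo4j's API, and the module paths are invented.

```python
from collections import deque

# Hypothetical dependency graph: each module maps to the modules it imports
DEPENDS_ON = {
    "api/handlers.py": ["services/payment.py"],
    "services/payment.py": ["db/models.py", "queue/publisher.py"],
    "db/models.py": [],
    "queue/publisher.py": ["config/settings.py"],
    "config/settings.py": [],
}


def transitive_dependencies(start: str, graph: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every module reachable from start, excluding start."""
    seen = {start}
    order = []
    pending = deque([start])
    while pending:
        current = pending.popleft()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                pending.append(dep)
    return order


deps = transitive_dependencies("api/handlers.py", DEPENDS_ON)
print(deps)
```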

Q: How do you handle code updates? A: We use a Git hook that calculates a hash of each file's content. If the hash changes, we trigger an incremental update to the vector index for that specific file.
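The hash comparison behind that hook can be sketched as follows. SHA-256, the in-memory dictionaries, and the function names are assumptions for illustration; in practice the known hashes would be stored alongside the index, and deleted files would need separate handling.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hex digest of a file's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def files_to_reindex(current_files: dict[str, str], known_hashes: dict[str, str]) -> list[str]:
    """Return paths that are new or whose content hash differs from the stored one."""
    changed = []
    for path, text in current_files.items():
        if known_hashes.get(path) != content_hash(text):
            changed.append(path)
    return sorted(changed)


# Hashes recorded at the last indexing run
known = {"a.py": content_hash("x = 1\n"), "b.py": content_hash("y = 2\n")}
# Current working tree: b.py was edited, c.py is new
current = {"a.py": "x = 1\n", "b.py": "y = 3\n", "c.py": "z = 0\n"}

stale = files_to_reindex(current, known)
print(stale)
```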

Q: Can I use a smaller model for this? A: Absolutely. Once you have the context injection right, you can often get away with using something like GPT-4o-mini or even a fine-tuned Llama 3 for the actual summarization task, which saves significant latency and cost.
