Building a small RAG pipeline is the fastest way to ground LLMs in your data. Learn the end-to-end process of indexing, retrieval, and generation.

Building a small RAG pipeline is the most effective way to stop your LLM from hallucinating on domain-specific data. Last month, I had to integrate a private knowledge base into an existing customer support tool, and I realized that most tutorials overcomplicate the stack. You don't need a massive vector database cluster to start; you need a clean, repeatable process for turning raw text into accurate answers.
At its core, a Retrieval-Augmented Generation (RAG) system is just a three-stage filter: ingest, retrieve, and generate. We take raw documents, break them into manageable chunks, store them as vectors, and then query them to provide context to an LLM.
Before we dive into the code, remember that your retrieval is only as good as your chunking strategy. I initially tried fixed-character splits, but they cut sentences and paragraphs in half, which led to nonsensical answers. Switching to a recursive character splitter that respects paragraph boundaries improved my retrieval accuracy by roughly 30%.
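To make the difference concrete, here is a minimal pure-Python sketch that compares a naive fixed-size cut with a paragraph-aware packer. It is a simplified stand-in, not LangChain's actual splitter; `fixed_split`, `paragraph_split`, and the sample text are names I made up for illustration.

```python
def fixed_split(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring sentence and paragraph boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_split(text: str, max_size: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most `max_size` characters.

    A single paragraph longer than `max_size` falls back to fixed-size cuts.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) > max_size:
            # Flush what we have, then hard-cut the oversized paragraph.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(fixed_split(para, max_size))
        elif current and len(current) + 2 + len(para) > max_size:
            # Adding this paragraph would overflow, so start a new chunk.
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample document with three short paragraphs.
sample = (
    "To reset your password, open Settings.\n\n"
    "Then choose Security and click Reset.\n\n"
    "A confirmation email arrives within minutes."
)

fixed_chunks = fixed_split(sample, 50)
para_chunks = paragraph_split(sample, 50)

print(fixed_chunks)  # the first chunk ends in the middle of a word
print(para_chunks)   # each chunk is a complete paragraph
```

The fixed split slices the second paragraph mid-word, while the paragraph-aware version keeps every instruction intact. A recursive splitter goes further by falling back through several separators, but the core idea is the same.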
We’ll use LangChain and the ChromaDB local vector store for this. First, install your dependencies:
```bash
pip install langchain langchain-community langchain-openai langchain-text-splitters chromadb pypdf
```
You need to load your documents and split them into chunks. If you're using PDFs, PyPDFLoader is your friend.
```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF; each page becomes one Document
loader = PyPDFLoader("manual.pdf")
docs = loader.load()

# Split into ~500-character chunks with a 50-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
```
Once you have your chunks, you need to turn them into vectors. I prefer using OpenAI’s text-embedding-3-small model because it's cheap and performant for most internal knowledge bases.
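```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Name the model explicitly; otherwise OpenAIEmbeddings falls back to its default
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed the chunks and persist the index to a local directory
db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
```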
By persisting this to a local directory, you avoid re-embedding your docs every time you restart the script. It’s a small win, but it saves about $0.50 in API costs during the dev phase alone.
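The persistence only pays off if the script actually reuses the directory on the next run. A minimal sketch of that load-or-build check follows; `load_or_build` and the fake callbacks are hypothetical helpers of mine, not part of LangChain or Chroma.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    persist_dir: str,
    load: Callable[[Path], T],
    build: Callable[[Path], T],
) -> T:
    """Reuse an index already on disk; build (and pay for embeddings) only if none exists."""
    path = Path(persist_dir)
    if path.is_dir() and any(path.iterdir()):
        return load(path)
    path.mkdir(parents=True, exist_ok=True)
    return build(path)


# Demo with stand-in callbacks that record which branch ran.
calls: list[str] = []


def fake_build(path: Path) -> str:
    calls.append("build")
    (path / "index.bin").write_text("vectors")  # pretend we wrote an index
    return "built"


def fake_load(path: Path) -> str:
    calls.append("load")
    return "loaded"


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(f"{tmp}/chroma_db", fake_load, fake_build)
    second = load_or_build(f"{tmp}/chroma_db", fake_load, fake_build)

print(first, second, calls)
```

In the real pipeline, `build` would wrap the `Chroma.from_documents(...)` call above, and `load` would most likely construct `Chroma(persist_directory=..., embedding_function=...)` with the same embedding model, though I'd verify that against the version you have installed.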
Now the actual RAG magic happens. The retrieved chunks shouldn't land in the prompt as an undifferentiated blob; they need to be formatted so the model can tell which parts are instructions and which parts are retrieved facts.
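As an illustration of that separation, here is a small prompt builder. The wording of the instructions, the `<context>` delimiters, and the `build_prompt` name are my own assumptions, not the template LangChain uses internally.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Keep instructions, retrieved context, and the question in clearly separated sections."""
    # Number each chunk so the model (and you, when debugging) can refer to it.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "You are a support assistant. Answer using only the context below.\n"
        "If the context does not contain the answer, say you don't know.\n\n"
        "<context>\n"
        f"{context}\n"
        "</context>\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How do I reset my account password?",
    ["Open Settings, then Security.", "Click Reset and confirm by email."],
)
print(prompt)
```

Printing the prompt like this is also the quickest debugging tool you have: if the answer isn't visible in the context section, the problem is retrieval, not generation.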
If you’re struggling with the quality of the LLM’s response, you might need to look at getting reliable structured output from an LLM in production to ensure your retrieval metadata stays clean.
```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

# Fetch the 3 most similar chunks for each query
retriever = db.as_retriever(search_kwargs={"k": 3})

# "stuff" puts every retrieved chunk into a single prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)

response = qa_chain.invoke({"query": "How do I reset my account password?"})
print(response["result"])
```
I’ve learned that the "stuff" chain, which simply places all retrieved chunks into a single prompt, is sufficient for 90% of use cases. Don't go down the rabbit hole of complex re-ranking or multi-stage agents until your basic "stuff" pipeline fails to deliver.
One major hurdle I hit was handling updates. If your documents change, your vector store will get stale. You’ll eventually need a strategy to delete and re-index specific document IDs. If you're managing this as part of a larger infrastructure, consider how this fits into your CI/CD flow, perhaps mirroring the discipline found when building a GitOps Pipeline with Argo CD and Crossplane.
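One way to approach this is to store a content hash per source document and diff it against the current files on each run. The sketch below only computes which document IDs need attention; `plan_reindex` and the in-memory dictionaries are hypothetical stand-ins for whatever bookkeeping you keep next to your vector store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str],
    indexed_hashes: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Return (stale_ids, ids_to_index).

    stale_ids: documents that were removed or changed; their old chunks should be deleted.
    ids_to_index: documents that are new or changed; they need to be chunked and embedded.
    """
    stale = sorted(
        doc_id
        for doc_id, old_hash in indexed_hashes.items()
        if doc_id not in current_docs or content_hash(current_docs[doc_id]) != old_hash
    )
    to_index = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    return stale, to_index


# What the index saw last time (hypothetical file names).
indexed = {
    "faq.md": content_hash("old faq"),
    "billing.md": content_hash("billing"),
    "legacy.md": content_hash("legacy"),
}
# What is on disk now: faq.md changed, legacy.md was removed, pricing.md is new.
current = {"faq.md": "new faq", "billing.md": "billing", "pricing.md": "pricing"}

stale_ids, ids_to_index = plan_reindex(current, indexed)
print(stale_ids)     # ['faq.md', 'legacy.md']
print(ids_to_index)  # ['faq.md', 'pricing.md']
```

Unchanged documents like `billing.md` are never re-embedded, which keeps the update cheap. Applying the plan means deleting the chunks that belong to each stale ID and adding the fresh ones, which requires assigning chunk IDs derived from the document ID when you first index.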
How do I know if my RAG pipeline is working well? You need to measure retrieval accuracy. Check if the chunks returned by your retriever actually contain the answer to the user's query. If the retriever is failing, no amount of prompt engineering will fix the output.
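A rough way to put a number on this is a hit rate over a small hand-written evaluation set: for each question, check whether any of the top-k chunks contains a snippet you know must be in the answer. The sketch below is one simple metric under that assumption; `hit_rate`, the fake retriever, and the sample questions are all made up for illustration.

```python
from typing import Callable


def hit_rate(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 3,
) -> float:
    """Fraction of questions where a top-k chunk contains the expected snippet."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = 0
    for question, expected in eval_set:
        top_chunks = retrieve(question)[:k]
        # Case-insensitive substring match; crude, but easy to inspect.
        if any(expected.lower() in chunk.lower() for chunk in top_chunks):
            hits += 1
    return hits / len(eval_set)


# Stand-in retriever backed by a dictionary instead of a vector store.
fake_index = {
    "How do I reset my password?": [
        "Open Settings, then Security, then click Reset password.",
        "Billing runs monthly.",
    ],
    "What is the refund window?": [
        "Billing runs monthly.",
        "Invoices are emailed as PDFs.",
    ],
}


def fake_retrieve(question: str) -> list[str]:
    return fake_index.get(question, [])


eval_set = [
    ("How do I reset my password?", "click Reset password"),
    ("What is the refund window?", "30 days"),
]

score = hit_rate(eval_set, fake_retrieve, k=3)
print(score)  # 0.5
```

To run this against the real pipeline, `retrieve` could be something like `lambda q: [d.page_content for d in retriever.invoke(q)]`. Twenty or thirty question and snippet pairs are usually enough to see whether a chunking change helped or hurt.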
Is ChromaDB production-ready? For a small to medium-sized app, yes. It's easy to manage and runs in-process. If you scale to millions of vectors, you'll likely want to move to a managed service like Pinecone or Weaviate, but don't optimize for that until you have the traffic.
Should I use RAG or Fine-tuning? Always start with RAG. Fine-tuning is for changing the behavior or tone of a model; RAG is for giving it knowledge. RAG is also much easier to debug because you can inspect the retrieved context.
I’m still experimenting with embedding models. While OpenAI is great, I’m currently testing local embeddings via Ollama to see if I can keep the entire pipeline offline for sensitive client data. It's a trade-off between latency and privacy that I haven't fully resolved yet. Start small, keep your chunks clean, and don't over-engineer the retrieval layer until your logs tell you it's broken.