AI/ML | June 20, 2026 | 4 min read

Building a small RAG pipeline end to end in Python

Building a small RAG pipeline is the fastest way to ground LLMs in your data. Learn the end-to-end process of indexing, retrieval, and generation.

Tags: RAG, LLM, Python, LangChain, AI, Retrieval, Prompt Engineering

A small RAG pipeline is one of the most effective ways to keep your LLM from hallucinating on domain-specific data. Last month, I had to integrate a private knowledge base into an existing customer support tool, and I realized that most tutorials overcomplicate the stack. You don't need a massive vector database cluster to start; you need a clean, repeatable process for turning raw text into accurate answers.

The Pipeline Architecture

At its core, a Retrieval-Augmented Generation (RAG) system is a three-stage pipeline: ingest, retrieve, and generate. We take raw documents, break them into manageable chunks, and store them as vectors (ingest); query those vectors for the chunks closest to the user's question (retrieve); and pass those chunks to an LLM as context (generate).
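The three stages can be sketched without any external services. This toy version is only an illustration: it uses word-count vectors in place of real embeddings, and `generate` returns the prompt instead of calling an LLM. All function names and sample documents here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(documents: list[str]) -> list[tuple[str, Counter]]:
    """Stage 1: pair each chunk with its vector."""
    return [(doc, embed(doc)) for doc in documents]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Stage 2: return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def generate(query: str, context: list[str]) -> str:
    """Stage 3: build the prompt that a real pipeline would send to an LLM."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


index = ingest([
    "Reset your password from the Settings page.",
    "Invoices are emailed on the first of each month.",
    "Contact support to close your account.",
])
top = retrieve(index, "How do I reset my password?", k=1)
prompt = generate("How do I reset my password?", top)
print(prompt)
```

The rest of the post replaces each toy piece with a real component: a text splitter, an embedding model with a vector store, and a chat model.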

Before we dive into the code, remember that your retrieval can only be as good as your chunking strategy. I initially tried fixed-character splits, but they cut sentences and paragraphs in half, which led to nonsensical answers. Switching to a recursive character splitter that respects paragraph boundaries improved my retrieval accuracy by roughly 30%.
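To see why boundary-aware splitting helps, here is a minimal standard-library sketch. It is not LangChain's implementation: `fixed_split` and `paragraph_split` are hypothetical helpers, the sample text is made up, and a real recursive splitter also falls back to smaller separators when a single paragraph is too long.

```python
def fixed_split(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring sentence and paragraph boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_split(text: str, size: int) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most `size` characters.

    A paragraph longer than `size` is kept whole in this sketch.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = (
    "To reset your password, open Settings.\n\n"
    "Then choose Security and click Reset.\n\n"
    "Billing questions go to the finance team."
)

fixed_chunks = fixed_split(sample, 60)      # cuts in the middle of a sentence
para_chunks = paragraph_split(sample, 80)   # every chunk ends on a paragraph boundary

for chunk in para_chunks:
    print(repr(chunk))
```

With the fixed split, the second password-reset step is torn across two chunks; with the paragraph-aware split, both steps stay together and the unrelated billing sentence lands in its own chunk.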

Step 1: Ingestion and Chunking

We’ll use LangChain and the ChromaDB local vector store for this. First, install your dependencies:


Bash
pip install langchain langchain-community langchain-openai langchain-text-splitters chromadb pypdf

You need to load your documents and split them into chunks. If you're using PDFs, PyPDFLoader is your friend.


PYTHON
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

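# Load the PDF; PyPDFLoader returns one Document per page.
loader = PyPDFLoader("manual.pdf")
docs = loader.load()

# Split on paragraph, line, then word boundaries before falling back to characters.
# chunk_size and chunk_overlap are measured in characters by default.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)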

Step 2: Indexing with Embeddings

Once you have your chunks, you need to turn them into vectors. I prefer using OpenAI’s text-embedding-3-small model because it's cheap and performant for most internal knowledge bases.


PYTHON
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

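# Embed every chunk and store the vectors in a local Chroma directory.
db = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db",
)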

By persisting this to a local directory, you avoid re-embedding your docs every time you restart the script. It’s a small win, but it saved me about $0.50 in API costs during the dev phase alone.
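The same idea can be sketched at the level of individual chunks: cache each vector on disk under a hash of its text, and only call the embedding API for text you have not seen before. This is a standard-library illustration, not how Chroma persists data. `fake_embed`, `embed_with_cache`, and the cache file name are hypothetical, and the dummy vectors stand in for a paid embedding call.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable

calls: list[str] = []  # records which texts actually reached the "API"


def fake_embed(text: str) -> list[float]:
    """Placeholder for a paid embedding call; returns a dummy vector."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


def embed_with_cache(
    texts: list[str],
    embed_fn: Callable[[str], list[float]],
    cache_path: Path,
) -> list[list[float]]:
    """Embed texts, reusing vectors already stored in the JSON cache file."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    vectors = []
    for text in texts:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed_fn(text)  # only new text costs an API call
        vectors.append(cache[key])
    cache_path.write_text(json.dumps(cache))
    return vectors


cache_file = Path(tempfile.mkdtemp()) / "embedding_cache.json"
chunks = ["Reset your password in Settings.", "Invoices are sent monthly."]

first_run = embed_with_cache(chunks, fake_embed, cache_file)
second_run = embed_with_cache(chunks + ["A new chunk."], fake_embed, cache_file)
print(f"embedding calls made: {len(calls)}")
```

With Chroma itself, reopening an existing store typically means constructing `Chroma(persist_directory="./chroma_db", embedding_function=...)` instead of calling `from_documents` again; check the signature against the version you have installed.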

Step 3: Retrieval and Generation

Now for the part where retrieval meets generation. You don't just dump the context into the prompt; you need to format it so the model can tell which parts are instructions and which parts are retrieved facts.
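One way to keep instructions and retrieved facts apart is to give each its own labelled section. The sketch below is a hypothetical `build_prompt` helper with made-up wording, not the template that any LangChain chain uses internally.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Keep instructions, retrieved context, and the question in separate sections."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Instructions:\n"
        "Answer using only the numbered context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How do I reset my account password?",
    ["Open Settings, then Security.", "Click Reset Password and follow the email link."],
)
print(prompt)
```

Numbering the chunks also makes it easy to ask the model to cite which chunk it relied on, which helps when you debug a wrong answer.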

If you’re struggling with the quality of the LLM’s responses, the related post "Getting reliable structured output from an LLM in production" covers techniques that can also help keep your retrieval metadata clean.


PYTHON
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4o-mini")
retriever = db.as_retriever(search_kwargs={"k": 3})

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever
)

# RetrievalQA takes the question under "query" and returns the answer under "result".
response = qa_chain.invoke({"query": "How do I reset my account password?"})
print(response["result"])

Hard-Won Lessons from Production

I’ve learned that the "stuff" chain, where you simply put all retrieved chunks into the prompt, is sufficient for about 90% of the use cases I've seen. Don't go down the rabbit hole of complex re-ranking or multi-stage agents until your basic "stuff" pipeline fails to deliver.

One major hurdle I hit was handling updates. If your documents change, your vector store goes stale. You’ll eventually need a strategy to delete and re-index specific document IDs. If you're managing this as part of a larger infrastructure, consider how re-indexing fits into your CI/CD flow, perhaps mirroring the discipline described in "Building a GitOps Pipeline with Argo CD and Crossplane".
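A minimal sketch of one such strategy, assuming you give every chunk a deterministic ID derived from its source and text. Comparing the IDs already in the index with the IDs of the freshly split documents tells you what to delete and what to embed. `chunk_id` and `plan_sync` are hypothetical helpers, and the sample chunks are made up.

```python
import hashlib


def chunk_id(source: str, text: str) -> str:
    """Deterministic ID: the same source and text always map to the same ID."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:16]


def plan_sync(
    indexed_ids: set[str],
    current_chunks: list[tuple[str, str]],
) -> tuple[set[str], dict[str, tuple[str, str]]]:
    """Return (ids_to_delete, chunks_to_add) for bringing the index up to date."""
    current = {chunk_id(src, txt): (src, txt) for src, txt in current_chunks}
    to_delete = indexed_ids - set(current)
    to_add = {cid: chunk for cid, chunk in current.items() if cid not in indexed_ids}
    return to_delete, to_add


old_chunks = [
    ("manual.pdf", "Reset via Settings."),
    ("manual.pdf", "Invoices are monthly."),
]
new_chunks = [
    ("manual.pdf", "Reset via Settings > Security."),  # edited paragraph
    ("manual.pdf", "Invoices are monthly."),           # unchanged
]

indexed = {chunk_id(src, txt) for src, txt in old_chunks}
stale_ids, fresh_chunks = plan_sync(indexed, new_chunks)
print(f"delete {len(stale_ids)} chunk(s), add {len(fresh_chunks)} chunk(s)")
```

Unchanged chunks keep their IDs and are never re-embedded. LangChain vector stores generally accept an `ids` argument when adding documents and expose a `delete(ids=...)` method, so a plan like this can usually be applied directly, but verify both against the version you have installed.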

FAQ

How do I know if my RAG pipeline is working well?

You need to measure retrieval accuracy. Check if the chunks returned by your retriever actually contain the answer to the user's query. If the retriever is failing, no amount of prompt engineering will fix the output.
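A rough way to put a number on retrieval accuracy is a hit rate over a small hand-written evaluation set: for each question, check whether any of the top-k chunks contains a phrase you expect in the answer. The sketch below uses a stubbed retriever with canned results so it runs offline; `hit_rate_at_k`, `stub_retrieve`, and the evaluation pairs are all hypothetical.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 3,
) -> float:
    """Fraction of questions whose top-k chunks contain the expected phrase."""
    if not eval_set:
        return 0.0
    hits = 0
    for question, expected in eval_set:
        top_chunks = retrieve(question)[:k]
        if any(expected.lower() in chunk.lower() for chunk in top_chunks):
            hits += 1
    return hits / len(eval_set)


# Canned results standing in for a real retriever.
canned_results = {
    "How do I reset my password?": [
        "Open Settings, then click Reset Password.",
        "Contact support to close your account.",
    ],
    "When are invoices sent?": [
        "Contact support to close your account.",
    ],
}


def stub_retrieve(question: str) -> list[str]:
    return canned_results.get(question, [])


eval_set = [
    ("How do I reset my password?", "Reset Password"),
    ("When are invoices sent?", "first of each month"),
]
score = hit_rate_at_k(eval_set, stub_retrieve, k=3)
print(f"hit rate@3: {score:.2f}")
```

To run this against the pipeline above, `retrieve` could wrap `retriever.invoke(question)` and return each document's `page_content`. Substring matching is a crude proxy, so treat the score as a smoke test rather than a full evaluation.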

Is ChromaDB production-ready?

For a small to medium-sized app, yes. It's easy to manage and runs in-process. If you scale to millions of vectors, you'll likely want to move to a managed service like Pinecone or Weaviate, but don't optimize for that until you have the traffic.

Should I use RAG or fine-tuning?

Start with RAG. Fine-tuning is for changing the behavior or tone of a model; RAG is for giving it knowledge. RAG is also much easier to debug because you can inspect the retrieved context.

I’m still experimenting with embedding models. While OpenAI is great, I’m currently testing local embeddings via Ollama to see if I can keep the entire pipeline offline for sensitive client data. It's a trade-off between latency and privacy that I haven't fully resolved yet. Start small, keep your chunks clean, and don't over-engineer the retrieval layer until your logs tell you it's broken.
