LLM cost control is vital for production RAG pipelines. Learn how to implement dynamic context window management to optimize token usage and reduce latency.

Last month, our RAG pipeline's OpenAI bill spiked by 40% after we increased our document retrieval limit from 3 chunks to 10. We were blindly stuffing the context window, assuming more data meant better answers, but we were mostly just paying for noise and increasing our LLM latency.
I realized we needed a more surgical approach to token management. Instead of a fixed-size context window, we moved to a dynamic strategy that adjusts based on the specific query and the relevance score of retrieved documents.
In the early days, we used a fixed k=5 for our vector search. It’s easy to implement, but it’s wasteful. When a user asks a simple question like "What is our vacation policy?", you don’t need five full documents. You need the one specific paragraph from the handbook.
By forcing the model to process irrelevant text, we were making our LLM cost control problem significantly worse. Beyond the direct token cost, longer inputs force the model to perform more attention calculations, which translates directly into higher time-to-first-token (TTFT).

We started by refactoring our retrieval layer. Instead of just pulling the top k chunks, we now treat the context window as a budget.
Here is the strategy we settled on:
This approach ensures that we prioritize the most accurate information while keeping RAG pipelines performant. If the search returns ten documents, we might only inject two if they are highly relevant, saving us the cost of processing the other eight.
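As a minimal sketch of the relevance filter, assuming each retrieved chunk carries a similarity score between 0 and 1 (higher is better): the `ScoredChunk` type, the `select_chunks` name, and the 0.75 cutoff are illustrative assumptions, not values from our pipeline.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float  # assumed: similarity in [0, 1], higher means more relevant


def select_chunks(chunks, min_score=0.75, max_chunks=10):
    """Keep chunks scoring at or above min_score, best first, capped at max_chunks."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    relevant = [c for c in ranked if c.score >= min_score]
    return relevant[:max_chunks]


# Hypothetical search results for a vacation policy question.
results = [
    ScoredChunk("Vacation policy: 25 days per year.", 0.91),
    ScoredChunk("Office seating chart.", 0.42),
    ScoredChunk("Carry-over rules for unused leave.", 0.80),
]
selected = select_chunks(results)
```

Here only the two chunks above the cutoff survive, so the low-scoring seating chart never reaches the prompt. The right cutoff depends on the embedding model and the corpus, so treat the number as a starting point to tune.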
We use a simple helper function to calculate tokens before sending the request. Using tiktoken for OpenAI models, the logic looks roughly like this:
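```python
import tiktoken


def build_context(chunks, budget=4000):
    """Concatenate chunks in order until the token budget would be exceeded."""
    encoder = tiktoken.encoding_for_model("gpt-4o")
    context = ""
    for chunk in chunks:  # assumed to arrive in relevance order, best first
        chunk_text = chunk.text
        # Count tokens for the context as it would look with this chunk added.
        if len(encoder.encode(context + chunk_text)) < budget:
            context += chunk_text + "\n"
        else:
            break  # stop at the first chunk that does not fit
    return context
```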
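This keeps the retrieved context under a fixed token budget. As long as that budget leaves headroom for the system prompt, the question, and the model's answer, it saves us from those annoying "context window exceeded" errors that crash production flows.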
Even with dynamic context, you’re still paying for the prompt tokens every single time. We found that semantic caching for RAG pipelines was the perfect companion to our dynamic window strategy.
If a user asks a question that’s semantically similar to a previous one, we serve the cached response. This completely bypasses the context window logic for those requests. With caching layered on top of the dynamic window, we saw our average cost per query drop by about 35%.
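To make the lookup concrete, here is a rough sketch of a semantic cache, not our production code. The `SemanticCache` class, the 0.92 similarity cutoff, and `toy_embed` are all assumptions for illustration; in practice `embed` would call a real embedding model and the entries would live in a vector store rather than a Python list.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # hypothetical callable: str -> list[float]
        self.threshold = threshold  # assumed cutoff; tune on real traffic
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query):
        """Return the cached response for the most similar query, or None."""
        vec = self.embed(query)
        best_score, best_response = 0.0, None
        for stored_vec, response in self.entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))


def toy_embed(text):
    # Stand-in for a real embedding model: keyword occurrence counts.
    lowered = text.lower()
    return [float(lowered.count(w)) for w in ("vacation", "policy", "expense", "report")]


cache = SemanticCache(toy_embed)
cache.put("What is our vacation policy?", "You get 25 days per year.")
hit = cache.get("Tell me the vacation policy")
miss = cache.get("How do I file an expense report?")
```

On a hit, the request never reaches retrieval or the model. A cutoff that is too loose will serve answers to questions that only look similar, so it is worth logging hits and spot-checking them.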
This isn't a silver bullet. The biggest trade-off is the complexity of your retrieval logic. By making your context window dynamic, you’re introducing more points of failure. What if your similarity threshold is too high and you filter out the only chunk that contains the answer?
We mitigate this by using a "fallback" mode. If the model returns "I don't know" or a low-confidence signal, we rerun the query with a wider retrieval radius and a lower threshold. It’s a bit slower, but it saves the user experience.
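A simplified sketch of that fallback, under some assumptions: `retrieve` and `generate` are hypothetical callables standing in for the vector search and the LLM call, the default values are placeholders, and the low-confidence check is reduced to a plain string match.

```python
def is_low_confidence(answer):
    """Crude signal: treat an explicit 'I don't know' as low confidence."""
    return "i don't know" in answer.lower()


def answer_with_fallback(question, retrieve, generate,
                         k=3, threshold=0.75,
                         fallback_k=10, fallback_threshold=0.5):
    """Try a tight retrieval first; widen it once if the answer looks weak.

    retrieve(question, k, threshold) -> list of chunks   (hypothetical)
    generate(question, chunks) -> answer string          (hypothetical)
    """
    chunks = retrieve(question, k, threshold)
    answer = generate(question, chunks)
    if is_low_confidence(answer):
        # Second pass: more candidates and a lower relevance bar.
        chunks = retrieve(question, fallback_k, fallback_threshold)
        answer = generate(question, chunks)
    return answer
```

The retry is capped at one extra pass so that a question with no answer in the corpus costs at most two model calls.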
We also keep a close eye on our LLM evaluation pipelines to ensure that our aggressive pruning isn't degrading answer quality. If our precision drops, we dial back the threshold by 0.05.

I'm still not entirely happy with how we handle multi-hop questions. If the answer requires synthesizing information from three different documents, our simple relevance-based pruning might drop the third, less-relevant document that holds the final piece of the puzzle.
We’re experimenting with "context summarization," where we compress less relevant documents into short snippets rather than discarding them entirely. It adds a bit of latency, but it might be the key to better accuracy without blowing the budget.
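The rough shape of the idea, as a sketch only: `compress_context`, the `keep_full` parameter, and the `summarize` callable are hypothetical names, and in practice `summarize` would be a call to a cheaper model, which is where the extra latency comes from.

```python
def compress_context(scored_chunks, summarize, keep_full=2):
    """Keep the top chunks verbatim and replace the rest with summaries.

    scored_chunks: list of (text, score) pairs, higher score = more relevant
    summarize:     hypothetical callable, str -> shorter str
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    parts = []
    for index, (text, _score) in enumerate(ranked):
        # The most relevant chunks stay intact; the tail is compressed, not dropped.
        parts.append(text if index < keep_full else summarize(text))
    return "\n".join(parts)
```

The third document from a multi-hop question would survive here as a short snippet instead of being pruned outright.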
Dynamic context management is never "done." It’s a constant tuning game between your token budget, the model's intelligence, and the user's need for speed. Start with simple thresholds, monitor your costs, and don't be afraid to experiment with the pruning logic.