AI/ML · June 21, 2026 · 4 min read

LLM Cost Control: Mastering Dynamic Context Window Management

LLM cost control is vital for production RAG pipelines. Learn how to implement dynamic context window management to optimize token usage and reduce latency.

LLM, RAG, AI Engineering, LLM Optimization, Python, Performance, AI, Prompt Engineering

Last month, our RAG pipeline's OpenAI bill spiked by 40% after we increased our document retrieval limit from 3 chunks to 10. We were blindly stuffing the context window, assuming more data meant better answers, but we were mostly just paying for noise and increasing our LLM latency.

I realized we needed a more surgical approach to token management. Instead of a fixed-size context window, we moved to a dynamic strategy that adjusts based on the specific query and the relevance score of retrieved documents.

Why static context windows fail in production

In the early days, we used a fixed k=5 for our vector search. It’s easy to implement, but it’s wasteful. When a user asks a simple question like "What is our vacation policy?", you don’t need five full documents. You need the one specific paragraph from the handbook.

By forcing the model to process irrelevant text, we were paying for tokens that added nothing to the answer. Beyond the direct token cost, longer inputs mean more attention computation before the first output token appears, which translates directly into higher time-to-first-token (TTFT).
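To make the waste concrete, here is a back-of-the-envelope estimate of input cost per query. The chunk size, prompt overhead, and price per million tokens are hypothetical placeholders, not our real numbers or any provider's actual pricing.

```python
def input_cost_per_query(k, tokens_per_chunk, overhead_tokens, usd_per_million_tokens):
    """Estimate the input-token cost of one query that injects k chunks."""
    input_tokens = overhead_tokens + k * tokens_per_chunk
    return input_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical values, chosen only for illustration.
TOKENS_PER_CHUNK = 400
OVERHEAD_TOKENS = 200  # system prompt + user question
PRICE = 2.50           # USD per million input tokens (placeholder)

cost_k1 = input_cost_per_query(1, TOKENS_PER_CHUNK, OVERHEAD_TOKENS, PRICE)
cost_k5 = input_cost_per_query(5, TOKENS_PER_CHUNK, OVERHEAD_TOKENS, PRICE)
print(f"k=1: ${cost_k1:.6f}  k=5: ${cost_k5:.6f}  ratio: {cost_k5 / cost_k1:.2f}x")
```

Under these assumed numbers, the simple vacation-policy question costs several times more at k=5 than it would with the single paragraph it actually needs.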

Implementing dynamic context window management

We started by refactoring our retrieval layer. Instead of just pulling the top k chunks, we now treat the context window as a budget.

Here is the strategy we settled on:

Relevance Thresholding: We discard any chunks with a cosine similarity score below 0.75.
Token Budgeting: We define a hard limit (e.g., 4000 tokens for GPT-4o).
Dynamic Pruning: We sort the remaining chunks by relevance and add them to the prompt until we hit 90% of our budget.

This approach ensures that we prioritize the most accurate information while keeping RAG pipelines performant. If the search returns ten documents, we might only inject two if they are highly relevant, saving us the cost of processing the other eight.
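The three steps can be sketched roughly as follows. This is a minimal illustration, not our production code: the `Chunk` class, the `select_chunks` name, and the whitespace-based `approx_token_count` are assumptions made so the example runs without a tokenizer installed.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # cosine similarity to the query, higher is better


def approx_token_count(text):
    # Stand-in tokenizer: whitespace-separated words. Swap in a real
    # tokenizer (e.g. tiktoken) for production counts.
    return len(text.split())


def select_chunks(chunks, min_score=0.75, budget=4000, fill_ratio=0.9,
                  count_tokens=approx_token_count):
    """Apply thresholding, budgeting, and pruning; return the kept chunks."""
    limit = int(budget * fill_ratio)
    # 1. Relevance thresholding: drop chunks below the similarity cutoff.
    relevant = [c for c in chunks if c.score >= min_score]
    # 2. Sort best-first so the budget is spent on the strongest matches.
    relevant.sort(key=lambda c: c.score, reverse=True)
    # 3. Dynamic pruning: add chunks until the next one would exceed the limit.
    selected, used = [], 0
    for chunk in relevant:
        cost = count_tokens(chunk.text)
        if used + cost > limit:
            break
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    Chunk("vacation policy paragraph " * 10, 0.91),
    Chunk("unrelated onboarding notes " * 10, 0.62),
    Chunk("holiday calendar details " * 10, 0.80),
]
kept = select_chunks(chunks, budget=50)
print([c.score for c in kept])
```

With the tiny budget used here, only the best-scoring chunk fits; with the default budget, both chunks above the 0.75 cutoff are kept and the 0.62 chunk is still discarded.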

The implementation logic

We use a simple helper function to calculate tokens before sending the request. Using tiktoken for OpenAI models, the logic looks roughly like this:


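PYTHON
import tiktoken

def build_context(chunks, budget=4000):
    # Chunks are expected to arrive sorted by relevance, best first.
    encoder = tiktoken.encoding_for_model("gpt-4o")
    parts = []
    used = 0
    for chunk in chunks:
        # Count each chunk once (trailing newline included) instead of
        # re-encoding the whole context on every iteration.
        chunk_text = chunk.text + "\n"
        cost = len(encoder.encode(chunk_text))
        if used + cost > budget:
            break  # stop at the first chunk that no longer fits
        parts.append(chunk_text)
        used += cost
    return "".join(parts)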

This keeps the retrieved context within the token budget. As long as that budget leaves room for the system prompt, the question, and the response, it saves us from those annoying "context window exceeded" errors that crash production flows.
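One way to pick that budget is to subtract everything else the request needs from the model's limit. The sketch below assumes the limit covers both prompt and completion tokens, as it does for many chat models; the `chunk_budget` helper and all of the sizes are hypothetical.

```python
def chunk_budget(context_limit, system_tokens, question_tokens, max_output_tokens):
    """Tokens left for retrieved chunks after reserving everything else."""
    remaining = context_limit - system_tokens - question_tokens - max_output_tokens
    if remaining <= 0:
        raise ValueError("No room left for retrieved context")
    return remaining


# Hypothetical sizes for illustration only.
budget = chunk_budget(context_limit=8000, system_tokens=300,
                      question_tokens=50, max_output_tokens=1000)
print(budget)
```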

Bridging the gap with caching

Even with dynamic context, you’re still paying for the prompt tokens every single time. We found that semantic caching for RAG pipelines was the perfect companion to our dynamic window strategy.

If a user asks a question that’s semantically similar to a previous one, we serve the cached response. This completely bypasses the context window logic for those requests. Combining the dynamic window with LLM caching strategies cut our average cost per query by about 35%.
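A toy version of the idea, assuming query embeddings are already available as plain lists of floats. The `SemanticCache` class, the linear scan, and the 0.95 hit threshold are illustrative assumptions; a real deployment would typically use a vector store and a tuned threshold.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Toy in-memory cache keyed by query embeddings (linear scan)."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query_embedding):
        # Return the response of the most similar cached query, if close enough.
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query_embedding, response):
        self.entries.append((query_embedding, response))


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.0], "cached answer about the vacation policy")
print(cache.get([0.99, 0.05, 0.0]))  # near-duplicate query: cache hit
print(cache.get([0.0, 1.0, 0.0]))    # unrelated query: miss, returns None
```

On a hit, retrieval and generation are skipped entirely; on a miss, the request falls through to the normal pipeline and the new response is stored with `put`.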

Managing the trade-offs

This isn't a silver bullet. The biggest trade-off is the complexity of your retrieval logic. By making your context window dynamic, you’re introducing more points of failure. What if your similarity threshold is too high and you filter out the only chunk that contains the answer?

We mitigate this by using a "fallback" mode. If the model returns "I don't know" or a low-confidence signal, we rerun the query with a wider retrieval radius and a lower threshold. It’s a bit slower, but it saves the user experience.
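The fallback can be sketched as a two-pass wrapper. The `retrieve` and `generate` callables, the string check for "I don't know", and the specific `k` and threshold values are assumptions for illustration; the stubs below exist only so the example runs without external services.

```python
def answer_with_fallback(question, retrieve, generate,
                         strict=(5, 0.75), relaxed=(15, 0.60)):
    """Try a tight retrieval first; widen it once if the model declines."""
    k, min_score = strict
    answer = generate(question, retrieve(question, k, min_score))
    if "i don't know" not in answer.lower():
        return answer, "strict"
    # Fallback: more candidates and a lower similarity threshold.
    k, min_score = relaxed
    answer = generate(question, retrieve(question, k, min_score))
    return answer, "relaxed"


# Stub retriever and model so the sketch runs on its own.
def fake_retrieve(question, k, min_score):
    corpus = [("Remote work requires manager approval.", 0.68)]
    return [text for text, score in corpus if score >= min_score][:k]


def fake_generate(question, chunks):
    return chunks[0] if chunks else "I don't know."


result = answer_with_fallback("Can I work remotely?", fake_retrieve, fake_generate)
print(result)
```

Here the only relevant chunk scores 0.68, so the strict pass finds nothing and the relaxed pass recovers the answer at the cost of a second model call.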

We also keep a close eye on our LLM evaluation pipelines to ensure that our aggressive pruning isn't degrading answer quality. If our precision drops, we lower the similarity threshold by 0.05 so that more chunks make it through.

What I'm still questioning

I'm still not entirely happy with how we handle multi-hop questions. If the answer requires synthesizing information from three different documents, our simple relevance-based pruning might drop the third, less-relevant document that holds the final piece of the puzzle.

We’re experimenting with "context summarization," where we compress less relevant documents into short snippets rather than discarding them entirely. It adds a bit of latency, but it might be the key to better accuracy without blowing the budget.
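A rough sketch of the shape this could take. Plain truncation stands in for a real summarizer here, and `compress`, `build_mixed_context`, and their parameters are hypothetical names, not what we actually run.

```python
def compress(text, max_words=20):
    # Stand-in for a real summarizer: keep only the first max_words words.
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " ..."


def build_mixed_context(ranked_texts, full_count=2, max_words=20):
    """Keep the top full_count chunks whole; compress the rest to snippets."""
    parts = []
    for position, text in enumerate(ranked_texts):
        if position < full_count:
            parts.append(text)
        else:
            parts.append(compress(text, max_words))
    return "\n".join(parts)


ranked = [("alpha " * 30).strip(), ("beta " * 30).strip(), ("gamma " * 30).strip()]
context = build_mixed_context(ranked, full_count=2, max_words=5)
print(context.splitlines()[-1])
```

The third document survives as a short snippet instead of being dropped, which is the property that matters for multi-hop questions; how well a snippet preserves the needed fact depends entirely on the summarizer.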

Dynamic context management is never "done." It’s a constant tuning game between your token budget, the model's intelligence, and the user's need for speed. Start with simple thresholds, monitor your costs, and don't be afraid to experiment with the pruning logic.
