AI/MLJune 20, 20264 min read

Controlling LLM cost and latency: A Practical Production Guide

Controlling LLM cost and latency is the biggest hurdle in production. Learn how to optimize token usage and response times to keep your AI features fast.

LLMAI EngineeringPerformanceCost OptimizationLatencyProductionAIRAGPrompt Engineering

Last month, our team pushed an update that integrated a chat-based assistant into our main dashboard. The prototype looked great, but within 48 hours, our API bill spiked by 40% and our p99 latency ballooned to over 5 seconds. We had to rethink our approach to controlling LLM cost and latency before the feature became a liability.

It’s easy to get lost in the hype of new models, but production stability requires a more surgical approach. If you’re building AI features, you’re likely fighting the same battle between model capability and budget constraints.

Why Latency and Cost Go Hand-in-Hand

When you’re working with LLMs, latency is almost always a function of token count. The more tokens you send and receive, the longer the request takes and the more you pay. We initially tried to solve this by simply switching to a cheaper model, but that broke our logic—the model couldn't follow complex instructions reliably.

Instead of just swapping models, we started looking at the entire pipeline. We realized that our prompts were massive, including redundant context that we were sending on every single turn. By aggressively trimming our system prompts and caching common responses, we managed to cut our average token usage per request by about 35% without losing any reasoning capability.

Strategies for Controlling LLM Cost and Latency

Controlling LLM cost and latency in a real application requires a multi-layered defense. You can't rely on a single "silver bullet" setting or prompt tweak.

1. Implement Semantic Caching

Don't re-calculate what you've already answered. We implemented a Redis-based cache using embeddings to check if a user's question was semantically similar to a previous query. If the similarity score exceeds 0.95, we serve the cached response. This drops latency from ~2,200ms to under 50ms for repeat questions.

2. Enforce Strict Structured Output

One of the biggest hidden costs is the "chattiness" of models. If you ask for a JSON response, the model often adds conversational filler. We switched to enforcing getting reliable structured output from an LLM in production using constrained generation libraries like Instructor or guidance. By forcing the model to output only the schema we need, we save hundreds of tokens per call.

3. Smart Context Window Management

When building a small RAG pipeline end to end in Python, it's tempting to shove as much context as possible into the prompt. That's a mistake. We moved to a dynamic chunking strategy where we only include the most relevant fragments based on the user's current intent. If the user is asking about a specific invoice, why are we sending the entire user manual?

The Trade-off: When to Use Smaller Models

Two businessmen discuss stock market trends using a tablet with visible graphs.

We eventually settled on a tiered model strategy. For 80% of our interactions, we use a smaller, faster model like GPT-4o-mini or Haiku. We only route to a "heavier" model (like Claude 3.5 Sonnet or GPT-4o) if the user's intent classification indicates a complex query that requires higher reasoning.

This routing logic is simple:


PYTHON
def get_model(prompt):
    if is_complex_query(prompt):
        return "gpt-4o"
    return "gpt-4o-mini"

It sounds simple, but it saved us roughly $150 in the first week alone. The key is knowing which tasks actually require high-end reasoning and which ones are just simple data retrieval.

Lessons Learned

I’m still not 100% satisfied with our current setup. We’re currently looking into streaming responses to mask latency, but that brings its own set of UI challenges. We've also had to be careful with prompt patterns that survive contact with production because a prompt that works in the playground often fails when the user provides unexpected input.

If I were starting this project today, I would prioritize observability from day one. You can't optimize what you can't measure. We spent too long guessing why costs were high before we finally hooked up proper trace logging to see exactly how many tokens were being consumed per user session.

Controlling LLM cost and latency isn't a one-time task; it's an ongoing process of monitoring, pruning, and routing. Keep your prompts lean, your cache warm, and your model routing smart, and you'll keep your infrastructure costs under control.

Frequently Asked Questions

Close-up of a magnifying glass focusing on the phrase 'Frequently Asked Questions'.

How do I decide when to switch models? Use a test set of 50-100 real-world queries. If the smaller model achieves >90% accuracy compared to the large one, switch. If not, refine your few-shot examples.

Does streaming actually make the app faster? It doesn't reduce the time-to-completion, but it significantly improves the perceived latency for the user. It’s a vital UX pattern for any LLM-powered feature.

What is the best way to monitor token usage? Use a tool like LangSmith or native provider logging to track token usage per user ID. If you see a user spiraling, you can implement rate-limiting at the application level.

Back to Blog

Controlling LLM cost and latency: A Practical Production Guide

Why Latency and Cost Go Hand-in-Hand

Strategies for Controlling LLM Cost and Latency

1. Implement Semantic Caching

2. Enforce Strict Structured Output

3. Smart Context Window Management

The Trade-off: When to Use Smaller Models

Lessons Learned

Frequently Asked Questions

Similar Posts

Prompt patterns that survive contact with production

Building a small RAG pipeline end to end in Python

Getting reliable structured output from an LLM in production