Master LLM routing to optimize your AI infrastructure. Learn how to implement semantic classification for dynamic model selection and better cost control.
Last month, our production dashboard showed a recurring issue: we were burning through our token budget on simple intent-classification tasks while simultaneously hitting latency bottlenecks on complex reasoning requests. We were using GPT-4o for everything, treating a sledgehammer like a scalpel.
I realized we needed a better way to route traffic. If you’re building production apps, you’ve likely felt this friction. You don't need a frontier model to tell you if a user wants to "reset password" or "check billing." By implementing an LLM routing strategy, you can drastically cut inference costs without sacrificing the user experience for the hard stuff.
The core problem is one-size-fits-all prompting. When you send every user input to the same high-end model, you pay for reasoning capabilities that often go unused. My goal was to build a "triage" layer that sits in front of our main LLM pipeline.
I first tried a simple regex-based router. It was fast, but it broke the moment a user phrased a query slightly differently. Hard-coded rules are fragile. I needed something that understood the intent behind the input. That’s where semantic classification comes in.
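To make that fragility concrete, here is a minimal sketch of the kind of regex router I mean. The patterns and intent names are hypothetical stand-ins, not the rules from our production system.

```python
import re
from typing import Optional

# Hypothetical hard-coded rules: intent name -> compiled pattern.
RULES = {
    "reset_password": re.compile(r"\breset\b.*\bpassword\b", re.IGNORECASE),
    "check_billing": re.compile(r"\b(check|view)\b.*\bbilling\b", re.IGNORECASE),
}

def regex_route(user_input: str) -> Optional[str]:
    """Return the first intent whose pattern matches, or None if nothing matches."""
    for intent, pattern in RULES.items():
        if pattern.search(user_input):
            return intent
    return None

# Matches the literal phrasing the rule was written for.
print(regex_route("Please reset my password"))
# Same intent, different wording: no rule fires, so the router returns None.
print(regex_route("I can't log in, I forgot my credentials"))
```

The second call is the failure mode: the user wants the same thing, but nothing in the input matches the pattern.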
By using a smaller, cheaper model (like GPT-4o-mini or a fine-tuned Llama 3 8B) as a classifier, I could map incoming requests to specific categories. Each category then determines which model handles the request.
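As a rough sketch, the category-to-model step can be a plain lookup table. The three category names are the ones this post uses for its classifier; the model assigned to each one, and the `select_model` helper, are assumptions for illustration, so substitute whatever your stack actually runs.

```python
from enum import Enum

class TaskCategory(str, Enum):
    SIMPLE = "simple"
    REASONING = "reasoning"
    TOOL_USE = "tool_use"

# Hypothetical mapping: which model serves which category.
MODEL_FOR_CATEGORY = {
    TaskCategory.SIMPLE: "gpt-4o-mini",
    TaskCategory.REASONING: "gpt-4o",
    TaskCategory.TOOL_USE: "gpt-4o",
}

# Assumption: unknown categories go to the most capable model.
DEFAULT_MODEL = "gpt-4o"

def select_model(category: str) -> str:
    """Map a category label to a model name, falling back on unknown labels."""
    try:
        return MODEL_FOR_CATEGORY[TaskCategory(category)]
    except ValueError:
        return DEFAULT_MODEL
```

Keeping the mapping in one table means changing a model assignment is a one-line edit rather than a change to the routing code.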
To make this work, I used a lightweight class that performs a zero-shot classification before the main request hits our primary logic. Here is how I structured the dispatcher:
```python
import instructor
import openai
from enum import Enum
from pydantic import BaseModel, Field


class TaskCategory(str, Enum):
    SIMPLE = "simple"
    REASONING = "reasoning"
    TOOL_USE = "tool_use"


class Route(BaseModel):
    category: TaskCategory = Field(..., description="The classification of the user input")


def get_route(user_input: str) -> Route:
    # Using a small, fast model for classification.
    # instructor patches the client so `response_model` returns a validated Route.
    client = instructor.patch(openai.Client())
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Route,
        messages=[{"role": "user", "content": user_input}],
    )
```
The latency overhead here is around 180ms. That’s a small price to pay when it saves us from sending a 2,000-token prompt to a more expensive model. If you're already managing context windows dynamically (see "LLM Cost Control: Mastering Dynamic Context Window Management"), this routing layer acts as a critical filter to keep your context usage lean.
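A back-of-the-envelope way to check whether the router pays for itself is to compare input-token spend with and without it. This is only a sketch: the prices, the share of simple traffic, and the classifier's token count are placeholders you would replace with your own numbers, and it ignores output tokens entirely.

```python
def input_cost_usd(tokens: float, price_per_million: float) -> float:
    """Cost of `tokens` input tokens at a given price per million tokens."""
    return tokens * price_per_million / 1_000_000

def compare_costs(
    n_requests: int,
    prompt_tokens: int,
    simple_fraction: float,   # share of traffic the router sends to the small model
    big_price: float,         # placeholder price per 1M input tokens, large model
    small_price: float,       # placeholder price per 1M input tokens, small model
    classifier_tokens: int,   # placeholder tokens consumed by each routing call
) -> dict:
    # Baseline: every request goes to the large model.
    baseline = input_cost_usd(n_requests * prompt_tokens, big_price)

    # Routed: every request pays for classification, then goes to one of two models.
    classify = input_cost_usd(n_requests * classifier_tokens, small_price)
    simple = input_cost_usd(n_requests * simple_fraction * prompt_tokens, small_price)
    hard = input_cost_usd(n_requests * (1 - simple_fraction) * prompt_tokens, big_price)

    return {"baseline": baseline, "routed": classify + simple + hard}

# Placeholder numbers, chosen only to make the arithmetic easy to follow.
print(compare_costs(1000, 2000, 0.5, big_price=10.0, small_price=1.0, classifier_tokens=100))
```

The classification cost is paid on every request, so the savings depend almost entirely on how much traffic actually lands in the cheap bucket.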
Every time you introduce a routing step, you add network round-trips. If your classification model is too slow, the user notices.
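One way to bound that overhead is to give the router a deadline and fall back to a default category when it misses. This is a sketch of the idea rather than what we run in production; `route_with_deadline`, the half-second default, and the "reasoning" fallback are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

# Shared pool so each routing call does not pay thread start-up cost.
_executor = ThreadPoolExecutor(max_workers=4)

def route_with_deadline(
    classify: Callable[[str], str],
    user_input: str,
    timeout_s: float = 0.5,
    fallback: str = "reasoning",
) -> str:
    """Run `classify` in a worker thread; return `fallback` if it misses the deadline."""
    future = _executor.submit(classify, user_input)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker keeps running in the background, but the caller is unblocked.
        return fallback
```

Here `classify` is any function that takes the user input and returns a category label, such as a thin wrapper around the classifier call. Falling back to the most capable bucket trades cost for safety when the router is slow.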
I initially tried to do this with a local embedding-based classifier. It was lightning-fast, but it struggled to differentiate between "reset password" and "update email" because the vector space for those intents was too similar. Moving to a tiny LLM—specifically gpt-4o-mini—gave me the semantic nuance I needed at a cost that is effectively a rounding error compared to our main model usage.
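The failure is easy to picture with a nearest-centroid classifier over cosine similarity. The vectors below are hand-made toy values, not real embeddings, and `classify_with_margin` is a hypothetical helper; the point is only that when two intents sit close together, the gap between the best and second-best score becomes too small to trust.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy centroids (made up for illustration): two intents nearly overlap.
CENTROIDS = {
    "reset_password": [0.90, 0.40, 0.10],
    "update_email": [0.85, 0.45, 0.15],
    "check_billing": [0.10, 0.20, 0.95],
}

def classify_with_margin(vec, min_margin=0.05):
    """Return (best_intent, is_confident) using the top-1 vs top-2 score gap."""
    scored = sorted(
        ((cosine(vec, centroid), name) for name, centroid in CENTROIDS.items()),
        reverse=True,
    )
    (best_score, best_name), (second_score, _) = scored[0], scored[1]
    return best_name, (best_score - second_score) >= min_margin

# Close to both account intents: a winner is returned, but the margin is tiny.
print(classify_with_margin([0.88, 0.42, 0.12]))
# Far from the account intents: the margin is wide.
print(classify_with_margin([0.10, 0.20, 0.90]))
```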
It’s worth noting that this approach requires a rigorous multi-model strategy (see "LLM Routing: A Strategy for Multi-Model Architectures"). You aren't just routing to a model; you are routing to a capability.
Once you have the routing logic, the next step is prompt optimization. I found that the prompt used for the classifier needs to be extremely sparse. If the classifier prompt is too long, you’re just wasting the tokens you’re trying to save.
Keep the system prompt for your router under 50 tokens:
"Classify the user intent into one of these categories: [SIMPLE, REASONING, TOOL_USE]. Only return the JSON object."
If you find your router is misclassifying, don't make the prompt longer. Instead, provide 3-5 high-quality few-shot examples. This significantly improves accuracy without ballooning your input costs.
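Here is a sketch of how few-shot examples can be added as prior chat turns while the system prompt stays short. The system prompt is the one quoted above; the example inputs and the `build_router_messages` helper are hypothetical.

```python
from typing import Dict, List

ROUTER_SYSTEM_PROMPT = (
    "Classify the user intent into one of these categories: "
    "[SIMPLE, REASONING, TOOL_USE]. Only return the JSON object."
)

# Hypothetical few-shot pairs: (user input, expected classifier output).
FEW_SHOT = [
    ("How do I reset my password?", '{"category": "simple"}'),
    ("Compare these two plans and recommend one for a 40-person team.", '{"category": "reasoning"}'),
    ("Look up the status of my latest order.", '{"category": "tool_use"}'),
]

def build_router_messages(user_input: str) -> List[Dict[str, str]]:
    """System prompt, then each example as a user/assistant turn, then the real input."""
    messages = [{"role": "system", "content": ROUTER_SYSTEM_PROMPT}]
    for example_input, example_output in FEW_SHOT:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages
```

The returned list can be passed as the `messages` argument of the classifier call. Each example adds only a handful of tokens, which is why this tends to be cheaper than writing out more instructions.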
I’m still not 100% happy with how we handle "uncertainty." Sometimes the router isn't sure, and it picks the wrong bucket. I’ve started adding a confidence field to my Pydantic model. If that score, or the model's logprobs on the classification, indicates low confidence, I force the request to the most capable model as a fallback. It’s safer, though it does cost more.
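The fallback rule itself is small. In this sketch, `category_logprob` is assumed to be the summed log-probability of the tokens that spell out the chosen label, which you would pull from the API response when log-probabilities are enabled. The 0.8 threshold and the model names are placeholders, not tuned values.

```python
import math

# Assumption: the most capable model doubles as the safety net.
FALLBACK_MODEL = "gpt-4o"

# Hypothetical category -> model mapping.
MODEL_FOR_CATEGORY = {
    "simple": "gpt-4o-mini",
    "reasoning": "gpt-4o",
    "tool_use": "gpt-4o",
}

def pick_model(category: str, category_logprob: float, min_confidence: float = 0.8) -> str:
    """Route by category unless the classifier's probability for it is too low."""
    confidence = math.exp(category_logprob)  # convert log-probability to probability
    if confidence < min_confidence or category not in MODEL_FOR_CATEGORY:
        return FALLBACK_MODEL
    return MODEL_FOR_CATEGORY[category]
```

Raising `min_confidence` sends more traffic to the expensive model, so the threshold is effectively a dial between cost and misrouting risk.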
Ultimately, LLM routing is about continuous adjustment. You won't get the routing thresholds perfect on the first try. I’m currently looking into how we can automate the feedback loop—logging the "router choice" vs. "human correction" to retrain the classifier periodically.
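A first version of that feedback loop could be as simple as the sketch below: record each routing decision, attach a human label when one exists, and summarize where the router disagrees with reviewers. The record fields and function names are assumptions; in practice the records would likely be written out as JSON lines rather than kept in memory.

```python
from collections import Counter
from typing import Dict, List, Optional

def make_record(user_input: str, router_choice: str, human_label: Optional[str] = None) -> Dict:
    """One routing decision; `human_label` stays None until someone reviews it."""
    return {"input": user_input, "router_choice": router_choice, "human_label": human_label}

def summarize(records: List[Dict]):
    """Return (accuracy over reviewed records, Counter of (router, human) disagreements)."""
    reviewed = [r for r in records if r["human_label"] is not None]
    confusions = Counter(
        (r["router_choice"], r["human_label"])
        for r in reviewed
        if r["router_choice"] != r["human_label"]
    )
    if not reviewed:
        return None, confusions
    correct = len(reviewed) - sum(confusions.values())
    return correct / len(reviewed), confusions
```

The disagreement pairs are the useful part: they show which bucket the router confuses with which, and those inputs are natural candidates for new few-shot examples or fine-tuning data.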
We’re still iterating on our thresholds. The key is to treat your routing layer as a piece of infrastructure that evolves, not a static config file.