LLM fallback strategies are essential for production AI. Learn how to design a multi-model architecture that manages latency and API costs during outages.
Last month, our primary LLM provider experienced a massive latency spike that pushed response times from a snappy 600ms to over 12 seconds. Our user dashboard hung, the UI locked up, and we started bleeding requests. We didn't have an automated safety net, so we spent the next two hours manually swapping environment variables and redeploying.
Never again. After that incident, I spent the following week building a robust, automated switching layer. If you’re shipping AI features, you cannot rely on a single endpoint—you need LLM fallback strategies to keep your application afloat when the primary API inevitably stumbles.
When you hardcode an API call to a single model—like gpt-4o—you’re building a single point of failure. If that model slows down or hits a rate limit, your entire application dies with it.
We initially tried wrapping our calls in a standard retry loop. It was a disaster. If the provider is slow, retrying just compounds the latency: every retry waits out another slow call while new requests pile up behind it, effectively DDoSing your own infrastructure. That’s when I realized we needed proper LLM routing: a multi-model strategy that doesn't just retry, but actively pivots to a different resource.
The goal of a resilient system is to degrade gracefully. If your heavy, expensive model (like Claude 3.5 Sonnet or GPT-4o) fails to respond within a specific window, you shouldn't just error out. You should fall back to a "fast" model—something like GPT-4o-mini or Haiku—which usually responds in under 400ms.
Here is a simplified pattern for a Python-based fallback handler:
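```python
import asyncio
import time


async def get_llm_response(prompt, model_chain, timeout_s=5.0):
    """Try each model in order and return the first response that arrives in time."""
    for model in model_chain:
        start = time.perf_counter()
        try:
            # call_llm is your own async provider wrapper (not shown here).
            # wait_for cancels the call once it exceeds timeout_s, so a slow
            # model triggers the fallback instead of blocking the request.
            response = await asyncio.wait_for(call_llm(prompt, model), timeout=timeout_s)
        except asyncio.TimeoutError:
            print(f"Model {model} timed out after {timeout_s}s. Trying next...")
        except Exception as e:
            print(f"Model {model} failed: {e}. Trying next...")
        else:
            print(f"{model} responded in {time.perf_counter() - start:.2f}s")
            return response
    raise RuntimeError("All models failed.")
```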
This simple iterative approach solves the immediate "it crashed" problem. However, it still sends every request to the failing model first, and it has no way of knowing when that model has recovered. For that, you need circuit breakers: they stop cascading failures by cutting off requests to a provider that is clearly in a death spiral, then let traffic back in once it recovers.
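To make the circuit-breaker idea concrete, here is a minimal sketch, not a production implementation. The class name, the `failure_threshold` and `cooldown_s` defaults, and the injectable `clock` are all assumptions chosen for illustration; you would keep one breaker per provider or model.

```python
import time


class CircuitBreaker:
    """Per-provider breaker: opens after repeated failures, probes again after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # illustrative default
        self.cooldown_s = cooldown_s                # illustrative default
        self.clock = clock                          # injectable so it can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: once the cooldown has passed, let probe requests through.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self):
        # A successful call closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cooldown.
            self.opened_at = self.clock()
```

Inside a loop like `get_llm_response`, you would skip any model whose breaker returns `False` from `allow_request()`, then call `record_success()` or `record_failure()` depending on how the request went.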
Fallback isn't just about uptime; it's about API cost optimization. If you run a high-traffic app, you can't afford to run your most expensive model for every request anyway.
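My current setup uses a tiered approach: the expensive "smart" model handles requests first, and cheaper, faster models sit behind it as fallbacks.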
This strategy ensures that I only pay the premium for "smart" responses when the system is healthy. When the system is struggling, I prioritize availability over perfection.
I’m still refining the "timeout" logic. Setting a hard timeout is tricky because different prompts take different amounts of time to generate. A simple summarization task should return in under a second, but a complex code-generation task might take five.
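One way to soften the hard-timeout problem is to give each kind of task its own budget. The sketch below assumes you can label requests with a task type. The task names, the budgets, and the `seconds_per_token` rate are placeholders, not measured values, so replace them with numbers from your own latency logs.

```python
# Hypothetical per-task budgets in seconds; placeholders, not measurements.
TASK_TIMEOUTS_S = {
    "summarize": 1.5,
    "codegen": 8.0,
}
DEFAULT_TIMEOUT_S = 5.0


def timeout_for(task_type, expected_output_tokens=None, seconds_per_token=0.02):
    """Pick a timeout for a request based on its task type and expected output size."""
    base = TASK_TIMEOUTS_S.get(task_type, DEFAULT_TIMEOUT_S)
    if expected_output_tokens is None:
        return base
    # Long generations get more room, but never less than the task's base budget.
    return max(base, expected_output_tokens * seconds_per_token)
```

The returned value can be passed as the timeout for the primary model, so a short summarization request gives up quickly while a long code-generation request gets more time before the fallback kicks in.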
If you're just starting, don't over-engineer the switching logic. Start by logging the latency of every request. You'll quickly see that your "fast" model is actually slower than you think at certain times of the day.
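If you want a starting point for that logging, here is a small sketch that records latency per model and per hour of day. Everything in it (`latency_log`, `record_latency`, `timed_call`, `mean_latency_by_hour`) is a hypothetical helper, and it keeps data in memory only; a real setup would ship these numbers to your metrics system.

```python
import time
from collections import defaultdict
from datetime import datetime, timezone

# (model, hour_of_day_utc) -> list of latencies in seconds
latency_log = defaultdict(list)


def record_latency(model, seconds, when=None):
    """Store one latency sample under the model and the UTC hour it occurred in."""
    when = when or datetime.now(timezone.utc)
    latency_log[(model, when.hour)].append(seconds)


async def timed_call(call, model, *args, **kwargs):
    """Await call(*args, **kwargs) and record how long it took, even if it raises."""
    start = time.perf_counter()
    try:
        return await call(*args, **kwargs)
    finally:
        record_latency(model, time.perf_counter() - start)


def mean_latency_by_hour(model):
    """Return {hour: mean latency in seconds} for one model."""
    return {
        hour: sum(samples) / len(samples)
        for (name, hour), samples in sorted(latency_log.items())
        if name == model
    }
```

Grouping by hour is what makes the time-of-day pattern visible: a model that looks fast on average can still be slow during its provider's peak hours.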
Also, keep an eye on your token usage when you fall back. If your fallback model has a smaller context window, you might need to truncate your prompt dynamically before switching. It’s a messy detail, but it’s the difference between a system that works and one that crashes with a 400 Bad Request.
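A rough sketch of that truncation step is below. It assumes about four characters per token, which is only a crude heuristic, so swap in your provider's tokenizer for real counts. It also assumes the most recent part of the prompt matters most. The function names and the `reserved_for_output` default are illustrative.

```python
def estimate_tokens(text):
    """Crude estimate (~4 characters per token); use a real tokenizer in production."""
    return max(1, len(text) // 4)


def fit_prompt(prompt, context_window, reserved_for_output=512):
    """Trim the start of the prompt so it fits the fallback model's context window."""
    budget = context_window - reserved_for_output
    if budget <= 0:
        raise ValueError("context_window is too small for the reserved output")
    if estimate_tokens(prompt) <= budget:
        return prompt
    # Keep the tail of the prompt, assuming the latest text is the most relevant.
    return prompt[-budget * 4:]
```

If your prompts start with a system instruction, you would want to keep that part intact and trim only the middle or the oldest history; this sketch does not handle that case.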
How do I decide when to trigger a fallback? Don't just use errors. Use a timeout. If a request takes longer than your P95 latency threshold, treat it as a "soft failure" and trigger the fallback.
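As a sketch of how that threshold could be derived, the snippet below computes a P95 from recent latency samples using the standard library. The `floor_s` guard and both function names are assumptions for illustration; the floor simply prevents the threshold from getting unreasonably tight when you have very little data.

```python
import statistics


def p95_threshold(latencies_s, floor_s=1.0):
    """Return the 95th-percentile latency in seconds, never lower than floor_s."""
    if len(latencies_s) < 2:
        return floor_s  # statistics.quantiles needs at least two samples
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies_s, n=20)[-1]
    return max(p95, floor_s)


def is_soft_failure(elapsed_s, threshold_s):
    """A request slower than the threshold counts as a soft failure."""
    return elapsed_s > threshold_s
```

Recomputing the threshold from a rolling window of recent requests keeps it in step with how the provider is actually behaving, rather than relying on a number picked once and forgotten.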
Does switching models hurt the user experience? Sometimes. A smaller model might give a slightly less accurate answer. I recommend logging when a fallback occurs so you can review those responses later and tune your prompts to be more "model-agnostic."
Is this overkill for a small app? If you're in development, maybe. But if you have paying users, your reputation is tied to your uptime. Implementing a basic fallback is about two days of work, and it saves you from an embarrassing outage.
I’m still not entirely happy with how we handle "partial" failures—where the API returns a response but it's truncated or nonsensical. We're currently exploring better evals to catch that before it hits the user, but for now, the circuit-breaker-and-fallback pattern is doing the heavy lifting.
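Until proper evals are in place, a cheap sanity check can at least catch the obvious cases. This is a sketch of one possible heuristic, not the approach described above: it treats a response as unusable if the provider reports that it hit the output limit, or if the text is suspiciously short. The `min_chars` value and the function name are assumptions, and a check like this does nothing for responses that are fluent but wrong.

```python
# Stop reasons that mean the model ran out of output tokens.
# "length" is what OpenAI's chat API reports; "max_tokens" is Anthropic's equivalent.
TRUNCATED_REASONS = {"length", "max_tokens"}


def looks_usable(text, finish_reason, min_chars=20):
    """Cheap pre-check before showing a response to the user."""
    if finish_reason in TRUNCATED_REASONS:
        return False  # the response was cut off mid-generation
    if len(text.strip()) < min_chars:
        return False  # empty or near-empty output
    return True
```

A response that fails this check can be handled like any other failure: record it and move on to the next model in the chain.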