AI/ML · June 21, 2026 · 4 min read

LLM Fallback Strategies: Designing Resilient AI Architectures

LLM fallback strategies are essential for production AI. Learn how to design a multi-model architecture that manages latency and API costs during outages.

Tags: LLM, artificial intelligence, software engineering, system design, API, resilience, AI, RAG, Prompt Engineering

Last month, our primary LLM provider experienced a massive latency spike that pushed response times from a snappy 600ms to over 12 seconds. Our user dashboard hung, the UI locked up, and we started bleeding requests. We didn't have an automated safety net, so we spent the next two hours manually swapping environment variables and redeploying.

Never again. After that incident, I spent the following week building a robust, automated switching layer. If you’re shipping AI features, you cannot rely on a single endpoint—you need LLM fallback strategies to keep your application afloat when the primary API inevitably stumbles.

The Problem with Static Endpoints

When you hardcode an API call to a single model—like gpt-4o—you’re building a single point of failure. If that model slows down or hits a rate limit, your entire application dies with it.

We initially tried wrapping our calls in a standard retry loop. It was a disaster. If the provider is slow, retrying just compounds the latency: every retry adds another full wait for the user while piling more in-flight requests onto your own infrastructure, effectively DDoSing yourself. That’s when I realized we needed proper LLM routing: a strategy for multi-model architectures that doesn't just retry, but actively pivots to a different resource.
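To put a number on how retries compound, here is a back-of-the-envelope sketch. The function name and the retry count are illustrative assumptions, not our production settings; the 12-second figure is the spike described above.

```python
def worst_case_wait(timeout_s: float, retries: int, backoff_s: float = 0.0) -> float:
    """Seconds a user waits if every attempt runs to its full timeout.

    There is 1 initial attempt plus `retries` retries, with one fixed
    backoff sleep before each retry.
    """
    attempts = 1 + retries
    return attempts * timeout_s + retries * backoff_s


# Hypothetical numbers: each attempt takes 12 s, and we retry 3 times.
print(worst_case_wait(timeout_s=12.0, retries=3))        # 48.0
print(worst_case_wait(12.0, 3, backoff_s=1.0))           # 51.0
```

A fallback chain with a short per-model timeout bounds the same worst case by the sum of the timeouts in the chain, which you control.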

Implementing Multi-Model Architecture

The goal of a resilient system is to degrade gracefully. If your heavy, expensive model (like Claude 3.5 Sonnet or GPT-4o) fails to respond within a specific window, you shouldn't just error out. You should fall back to a "fast" model—something like GPT-4o-mini or Haiku—which usually responds in under 400ms.

Here is a simplified pattern for a Python-based fallback handler:


PYTHON
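import asyncio
import logging
import time

logger = logging.getLogger(__name__)

async def get_llm_response(prompt, model_chain, timeout_s=5.0):
    """Try each model in order and return the first response that arrives in time."""
    last_error = None
    for model in model_chain:
        start = time.perf_counter()
        try:
            # call_llm is your own async provider wrapper (not shown here).
            # wait_for cancels the call and raises a timeout error after timeout_s.
            response = await asyncio.wait_for(call_llm(prompt, model), timeout=timeout_s)
        except Exception as e:  # provider errors and timeouts alike
            logger.warning("Model %s failed: %r. Trying next...", model, e)
            last_error = e
            continue
        logger.info("Model %s responded in %.2fs", model, time.perf_counter() - start)
        return response
    raise RuntimeError("All models failed.") from last_error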

This simple iterative approach solves the immediate "it crashed" problem. However, it keeps sending every new request to the failing model first, and it has no notion of when that model might have recovered. For that, you need a circuit breaker: it stops cascading failures by cutting off requests to a provider that is clearly in a death spiral, then lets traffic back in after a cooldown.
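A minimal sketch of such a breaker, assuming one instance per provider. The class name, the threshold of three consecutive failures, and the injectable clock are illustrative choices, not a specific library's API.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, then allows traffic again
    once `cooldown_s` seconds have passed."""

    def __init__(self, failure_threshold=3, cooldown_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock        # injectable so the breaker can be tested
        self._failures = 0
        self._opened_at = None     # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cooldown, requests are allowed again as a trial.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        # A success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Opens the circuit, or restarts the cooldown if a trial failed.
            self._opened_at = self._clock()
```

In the fallback loop, you would skip any model whose breaker returns False from allow_request(), and call record_success() or record_failure() after each attempt.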

Managing Latency and Costs

Fallback isn't just about uptime; it's about API cost optimization. If you run a high-traffic app, you can't afford to run your most expensive model for every request anyway.

My current setup uses a tiered approach:

1. Semantic Cache: Before hitting any API, I check a Redis store. If the request is a near-match for one I've already answered, I return the cached result. This is the biggest win for both cost and latency.
2. Primary Model: If it's a cache miss, I route to the primary high-intelligence model.
3. Fallback: If the primary exceeds a 2-second timeout, the circuit breaker trips, and I automatically route to a cheaper, faster model for the next 60 seconds.
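The three tiers can be sketched roughly like this. Everything here is a simplification under stated assumptions: the dict is an exact-match stand-in for the Redis semantic cache (real near-match lookup needs embeddings), call_llm is a hypothetical async function you supply, and the class name is made up for illustration.

```python
import asyncio
import time


class TieredRouter:
    """Cache -> primary -> fallback, skipping the primary for a window
    after it times out or errors."""

    def __init__(self, call_llm, primary, fallback,
                 timeout_s=2.0, fallback_window_s=60.0, clock=time.monotonic):
        self.call_llm = call_llm          # hypothetical: async (prompt, model) -> str
        self.primary = primary
        self.fallback = fallback
        self.timeout_s = timeout_s
        self.fallback_window_s = fallback_window_s
        self._clock = clock
        self._cache = {}                  # stand-in for Redis; exact match only
        self._skip_primary_until = float("-inf")

    async def complete(self, prompt: str) -> str:
        # Tier 1: cache hit, no API call at all.
        if prompt in self._cache:
            return self._cache[prompt]

        # Tier 2: primary model, unless we are inside a fallback window.
        if self._clock() >= self._skip_primary_until:
            try:
                answer = await asyncio.wait_for(
                    self.call_llm(prompt, self.primary), timeout=self.timeout_s)
                self._cache[prompt] = answer
                return answer
            except Exception:
                # Trip: route around the primary for the next window.
                self._skip_primary_until = self._clock() + self.fallback_window_s

        # Tier 3: cheaper, faster model. Errors here reach the caller.
        return await self.call_llm(prompt, self.fallback)
```

One design choice in this sketch is that fallback answers are not cached, so a degraded answer is not served again after the primary recovers. Whether that trade-off is right depends on your traffic.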

This strategy ensures that I only pay the premium for "smart" responses when the system is healthy. When the system is struggling, I prioritize availability over perfection.

Pragmatic Lessons from Production

I’m still refining the "timeout" logic. Setting a hard timeout is tricky because different prompts take different amounts of time to generate. A simple summarization task should return in under a second, but a complex code-generation task might take five.
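One way to handle this is a timeout budget per task type instead of a single global value. The sketch below is an assumption about how that could look: the task names and the lookup helper are hypothetical, and the numbers simply mirror the rough figures mentioned in this post.

```python
# Hypothetical per-task budgets in seconds; tune them from your own latency logs.
TIMEOUT_BUDGETS_S = {
    "summarize": 1.0,   # simple summarization should return in about a second
    "codegen": 5.0,     # complex code generation may need around five
}
DEFAULT_TIMEOUT_S = 2.0  # used for any task type without its own budget


def timeout_for(task_type: str) -> float:
    """Return the timeout budget for a task type, or the default."""
    return TIMEOUT_BUDGETS_S.get(task_type, DEFAULT_TIMEOUT_S)
```

The caller passes timeout_for(task_type) wherever it would otherwise pass a hard-coded timeout.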

If you're just starting, don't over-engineer the switching logic. Start by logging the latency of every request. You'll quickly see that your "fast" model is actually slower than you think at certain times of the day.
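A minimal way to start logging is a small context manager around each call. This is a sketch, not our production setup: the in-memory list stands in for whatever metrics backend you use, and the names are illustrative.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("llm.latency")
latency_log = []  # (model, seconds, succeeded) tuples; swap for a real metrics store


@contextmanager
def track_latency(model: str):
    """Record how long the wrapped block took and whether it raised."""
    start = time.perf_counter()
    ok = False
    try:
        yield
        ok = True
    finally:
        elapsed = time.perf_counter() - start
        latency_log.append((model, elapsed, ok))
        logger.info("model=%s elapsed=%.3fs ok=%s", model, elapsed, ok)
```

Wrap each provider call in `with track_latency(model):` and the log fills up on its own. It also works around an awaited call inside an async function.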

Also, keep an eye on your token usage when you fall back. If your fallback model has a smaller context window, you might need to truncate your prompt dynamically before switching. It’s a messy detail, but it’s the difference between a system that works and one that crashes with a 400 Bad Request.
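Here is a rough sketch of that truncation step. It assumes about four characters per token, which is only a crude approximation (use your provider's tokenizer for real counts), and it keeps the end of the prompt on the assumption that the most recent context matters most. Both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def truncate_for_model(prompt: str, context_window: int,
                       reserve_for_output: int = 512) -> str:
    """Trim the prompt so prompt plus reserved output fits the context window."""
    budget_tokens = context_window - reserve_for_output
    if budget_tokens <= 0:
        raise ValueError("context window is too small for the reserved output")
    if estimate_tokens(prompt) <= budget_tokens:
        return prompt
    # Keep the tail of the prompt; everything before it is dropped.
    return prompt[-budget_tokens * 4:]
```

If your prompts put system instructions at the start, you would want to keep that prefix and trim the middle instead.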

Frequently Asked Questions

How do I decide when to trigger a fallback?

Don't just use errors. Use a timeout. If a request takes longer than your P95 latency threshold, treat it as a "soft failure" and trigger the fallback.
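As a sketch, the P95 threshold can be derived from logged latencies with the standard library. The helper names are illustrative, and the sample list is expected to come from your own logs.

```python
import statistics


def p95(latencies_s):
    """95th percentile of observed latencies (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(latencies_s, n=100, method="inclusive")[94]


def is_soft_failure(elapsed_s: float, threshold_s: float) -> bool:
    """True when a request was slow enough to count as a soft failure."""
    return elapsed_s > threshold_s
```

Keep in mind that roughly 5% of healthy requests exceed P95 by definition, so a threshold set exactly at P95 sends about that share of normal traffic to the fallback. Adding some headroom above P95 is one way to soften that.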

Does switching models hurt the user experience?

Sometimes. A smaller model might give a slightly less accurate answer. I recommend logging when a fallback occurs so you can review those responses later and tune your prompts to be more "model-agnostic."

Is this overkill for a small app?

If you're in development, maybe. But if you have paying users, your reputation is tied to your uptime. Implementing a basic fallback is about two days of work, and it saves you from an embarrassing outage.

I’m still not entirely happy with how we handle "partial" failures—where the API returns a response but it's truncated or nonsensical. We're currently exploring better evals to catch that before it hits the user, but for now, the circuit-breaker-and-fallback pattern is doing the heavy lifting.
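Until proper evals are in place, a cheap pre-check can at least catch the obvious cases. This is a hypothetical sketch, not what we run: it only flags empty output and responses the provider itself marks as cut off at the token limit (for example an OpenAI-style finish_reason of "length" or an Anthropic-style stop_reason of "max_tokens"). It does nothing for well-formed nonsense.

```python
TRUNCATION_REASONS = {"length", "max_tokens"}  # provider-reported token-limit stops


def looks_partial(text: str, stop_reason=None) -> bool:
    """Cheap heuristic: True for empty or provider-flagged truncated output."""
    if stop_reason in TRUNCATION_REASONS:
        return True
    return not text.strip()
```

A response flagged this way can be treated like any other failure and sent down the fallback chain.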
