Failover Across OpenAI, Anthropic, and Google: A Production Pattern
A copy-paste production pattern for cross-provider LLM failover: preferring one provider, draining to another within milliseconds, and staying up through any single vendor's outage.
Gateway-LLM team · 7 min read
TL;DR
OpenAI has had four 30+ minute outages in the last twelve months. Anthropic has had two. Google's Gemini API rate-limits during regional traffic surges. If your product depends on any single provider being up, your reliability is bounded by theirs — and theirs is not 99.99%.
The fix is cross-provider failover: a single model alias backed by deployments across multiple vendors, where the gateway transparently retries the next one when the primary degrades. This post is a copy-paste production pattern for that, with the gotchas that bite teams in production.
The minimum viable failover config
model_list:
  - model_alias: 'general'
    deployments:
      - provider: openai
        model: gpt-4o
        api_key_env: OPENAI_API_KEY
      - provider: anthropic
        model: claude-sonnet-4-20250514
        api_key_env: ANTHROPIC_API_KEY
      - provider: gemini
        model: gemini-2.0-flash
        api_key_env: GEMINI_API_KEY
router:
  strategy: round_robin
  retries: 2
  retry_after_ms: 200
This single block buys you a lot:
- A request to model: 'general' is sent to OpenAI first (top of the list).
- If OpenAI returns 5xx or times out, the gateway retries Anthropic.
- If Anthropic also fails, it retries Gemini.
- If all three fail, the client gets a 502 and a gatewayllm_router_exhaustion_total counter increments, which is the one you alert on.
retries: 2 is the budget beyond the first attempt. Total possible attempts = 3 (one per deployment). retry_after_ms: 200 is a tiny back-off between attempts so you don't slam a recovering provider.
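The retry budget described above can be sketched in a few lines of Python. This is a minimal illustration of ordered failover, not the gateway's actual implementation: `call_with_failover`, `UpstreamError`, and `RouterExhausted` are hypothetical names, and each deployment is modelled as a plain callable.

```python
import time
from typing import Callable, Sequence


class UpstreamError(Exception):
    """Stands in for a 5xx or timeout from one provider (hypothetical)."""


class RouterExhausted(Exception):
    """Every permitted attempt failed; the gateway would answer with a 502."""


def call_with_failover(
    deployments: Sequence[Callable[[], str]],
    retries: int = 2,
    retry_after_ms: int = 200,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, int]:
    """Try deployments in list order; return (response, retries_used)."""
    # One first attempt plus `retries` more, capped at one try per deployment.
    max_attempts = min(len(deployments), retries + 1)
    for attempt in range(max_attempts):
        try:
            return deployments[attempt](), attempt
        except UpstreamError:
            if attempt < max_attempts - 1:
                # Short back-off so a recovering provider is not hammered.
                sleep(retry_after_ms / 1000)
    raise RouterExhausted(f"all {max_attempts} attempts failed")
```

With three deployments and `retries=2`, the loop makes at most three attempts and sleeps at most twice, which matches the arithmetic above. The `sleep` parameter is injectable only so the sketch can be exercised without real delays.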
What "transparent" actually means
The application code doesn't change. The OpenAI Python SDK — pointed at the gateway — gets the same response shape regardless of which upstream actually served the call. The provider translation layer rewrites the response into OpenAI Chat Completions / Responses API format on the way out.
from openai import OpenAI

client = OpenAI(base_url="https://gateway/v1", api_key="gw_virt_…")
resp = client.chat.completions.create(
    model="general",
    messages=[{"role": "user", "content": "Hi"}],
)
# resp.choices[0].message.content works the same whether
# the upstream was OpenAI, Anthropic, or Gemini.
The gateway adds three response headers so you know which one served you:
- X-Gateway-Decision: openai/gpt-4o
- X-Gateway-Retries: 0 (or 2 if it failed over twice)
- X-Gateway-Cost-Usd: 0.001234
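For logging or debugging, those headers can be folded into a small record. The helper below is a sketch: `describe_gateway_response` is a hypothetical name, and it assumes the header values have the shapes shown above (a `provider/model` pair, an integer, and a decimal).

```python
from typing import Mapping


def describe_gateway_response(headers: Mapping[str, str]) -> dict:
    """Pull the three gateway headers out of a response-header mapping."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    provider, _, model = lowered.get("x-gateway-decision", "").partition("/")
    retries = int(lowered.get("x-gateway-retries", "0"))
    return {
        "provider": provider,
        "model": model,
        "retries": retries,
        "failed_over": retries > 0,
        "cost_usd": float(lowered.get("x-gateway-cost-usd", "0")),
    }
```

With the OpenAI Python SDK, the raw headers should be reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and a `.parse()` method that returns the usual completion object.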
Three failover patterns
Pattern 1: prefer one provider, drain to another
Most teams want OpenAI primary, others as backup. Use weights:
- model_alias: 'general'
  deployments:
    - provider: openai
      model: gpt-4o
      weight: 95
    - provider: anthropic
      model: claude-sonnet-4-20250514
      weight: 5
router.strategy: round_robin with these weights means 95% of traffic hits OpenAI, 5% hits Anthropic. The 5% serves two purposes:
- It's a smoke test — Anthropic is exercised continuously, so when you actually need to fail over, the path is warm.
- It validates response parity — you can compare outputs and notice if model behaviour drifts.
When OpenAI 5xxs, the retry budget kicks in and the failing 95% goes to Anthropic until OpenAI recovers. Your 95/5 split during steady state becomes 0/100 during an outage, automatically.
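A toy simulation makes the 95/5 to 0/100 shift concrete. This is a sketch under assumptions, not the gateway's router: `route` is a hypothetical function, health is modelled as a simple predicate, and the order in which the remaining providers are tried (descending weight) is a guess.

```python
import random
from typing import Callable, Mapping


def route(
    weights: Mapping[str, int],
    is_healthy: Callable[[str], bool],
    rng: random.Random,
) -> str:
    """Pick a provider by weight, then fall over to the others if it is down."""
    names = list(weights)
    first = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    # Assumption: remaining providers are tried in descending weight order.
    rest = sorted((n for n in names if n != first), key=lambda n: -weights[n])
    for name in [first, *rest]:
        if is_healthy(name):
            return name
    raise RuntimeError("all deployments failed")


weights = {"openai": 95, "anthropic": 5}
rng = random.Random(0)

steady = [route(weights, lambda n: True, rng) for _ in range(1000)]
outage = [route(weights, lambda n: n != "openai", rng) for _ in range(1000)]
```

In `steady`, roughly 95% of picks are OpenAI; in `outage`, every request lands on Anthropic, because each weighted pick of OpenAI fails over.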
Pattern 2: latency-driven failover
If your priority is latency, not vendor preference:
router:
  strategy: least_latency
  retries: 1
The router picks the deployment with the lowest p50 over the last 60 seconds. When one provider degrades by 200 ms, traffic shifts away from it — usually before any 5xx is observed. Combined with retries, this is the most resilient configuration we ship.
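The selection rule can be sketched as a sliding-window median. `LeastLatencyPicker` is a hypothetical class written for illustration; the 60-second window and the p50 criterion come from the description above, and everything else (the data structure, how idle deployments are handled) is an assumption.

```python
import statistics
from collections import defaultdict, deque


class LeastLatencyPicker:
    """Pick the deployment with the lowest median latency in a time window."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        # deployment name -> deque of (timestamp_s, latency_ms), oldest first
        self._samples = defaultdict(deque)

    def record(self, deployment: str, latency_ms: float, now: float) -> None:
        self._samples[deployment].append((now, latency_ms))

    def pick(self, now: float) -> str:
        best, best_p50 = None, float("inf")
        for deployment, samples in self._samples.items():
            # Drop samples that have aged out of the window.
            while samples and samples[0][0] < now - self.window_s:
                samples.popleft()
            if not samples:
                continue  # no recent data for this deployment
            p50 = statistics.median(latency for _, latency in samples)
            if p50 < best_p50:
                best, best_p50 = deployment, p50
        if best is None:
            raise LookupError("no latency samples in the window")
        return best
```

One design point the sketch exposes: a deployment that receives no traffic has no samples and can never be picked again, so a real router presumably needs some exploration traffic or a default latency for idle deployments.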
Pattern 3: smart-route + failover
The combination that gives both cost savings and resilience:
model_list:
  - model_alias: 'auto'
    deployments:
      - provider: openai
        model: gpt-4o-mini
        tier: simple
      - provider: anthropic
        model: claude-haiku
        tier: simple   # fallback for simple bucket
      - provider: openai
        model: gpt-4o
        tier: medium
      - provider: anthropic
        model: claude-sonnet-4-20250514
        tier: medium   # fallback for medium bucket
      - provider: openai
        model: gpt-5
        tier: complex
router:
  strategy: classifier
  retries: 1
The classifier picks the bucket; within the bucket, the gateway tries the first matching deployment, then falls over to the second on failure. You get smart routing's cost savings and cross-provider resilience without picking between them.
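The "bucket first, then order within the bucket" rule can be sketched as a filter over the deployment list. `Deployment` and `candidates_for` are hypothetical names; the classifier itself is out of scope here, so the tier is passed in as a string.

```python
from typing import NamedTuple


class Deployment(NamedTuple):
    provider: str
    model: str
    tier: str


# Mirrors the 'auto' alias in the config above, in the same order.
DEPLOYMENTS = [
    Deployment("openai", "gpt-4o-mini", "simple"),
    Deployment("anthropic", "claude-haiku", "simple"),
    Deployment("openai", "gpt-4o", "medium"),
    Deployment("anthropic", "claude-sonnet-4-20250514", "medium"),
    Deployment("openai", "gpt-5", "complex"),
]


def candidates_for(tier: str, deployments=DEPLOYMENTS, retries: int = 1):
    """Deployments to try for a classified tier, in config order."""
    matching = [d for d in deployments if d.tier == tier]
    # One first attempt plus `retries` fallbacks within the same tier.
    return matching[: retries + 1]
```

Running `candidates_for("complex")` against this config returns a single deployment, so the complex bucket as written has no cross-provider fallback; adding a second `tier: complex` deployment would close that gap.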
The non-obvious gotchas
Style drift across providers
OpenAI's GPT-4o, Anthropic's Sonnet, and Google's Gemini Flash answer the same prompt slightly differently. All three follow instructions well; what differs is their voice. If your product depends on a specific tone (a customer-facing chatbot with a defined persona), the failover hop can produce a noticeable shift in response style.
The fix is not to abandon failover. The fix is:
- For style-critical routes, pin them to a single-deployment alias and accept the reduced resilience for that one route.
- For everything else, failover is fine — users don't notice tone drift in form-filling or back-office workflows.
Tool-call format differences
OpenAI's tool-calling format is the most widely supported. Anthropic's is bidirectionally translated by the gateway, but edge cases exist (deeply nested schemas, multi-step tool chains). If your product relies on tool calls at scale, validate the failover path explicitly:
# Force Anthropic by temporarily disabling the OpenAI deployment
curl -X POST http://localhost:8080/admin/deployments/disable \
  -H 'Content-Type: application/json' \
  -d '{ "alias": "general", "provider": "openai" }'
Run your tool-using flows; verify they work. Re-enable the OpenAI deployment.
Rate limits look like outages
When a provider rate-limits you (429), the gateway treats it like a 5xx: retry the next deployment. That's the right behaviour for a transient spike, but if you're systemically rate-limited on the primary, every request fails over.
Watch for it:
- gatewayllm_router_failover_total{from="openai", to="anthropic"} should be near-zero in steady state.
- If it climbs, your primary is hitting limits. Either raise your provider quota or weight more traffic to the secondary.
Cost surprises during failover
Your secondary provider may not be priced like your primary. If OpenAI primary costs $X/request and Anthropic secondary costs $1.5X/request, your bill spikes during an OpenAI outage. Two mitigations:
- Cap with virtual key budgets (monthly_budget_usd) so a long outage can't run away.
- Use the gatewayllm_cost_usd_total{provider} metric to set a Slack alert on hourly spend exceeding a threshold.
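To size the spike before it happens, it helps to work the arithmetic. The sketch below assumes the illustrative figures used in this post (OpenAI at $X per request, Anthropic at $1.5X, a 95/5 steady-state split); `blended_cost_per_request` is a hypothetical helper, and real costs vary per request with token counts.

```python
def blended_cost_per_request(costs: dict, traffic_share: dict) -> float:
    """Average cost per request, given per-provider cost and traffic share."""
    return sum(costs[provider] * share for provider, share in traffic_share.items())


# Costs expressed as multiples of X, the primary's cost per request.
costs = {"openai": 1.0, "anthropic": 1.5}

steady = blended_cost_per_request(costs, {"openai": 0.95, "anthropic": 0.05})
outage = blended_cost_per_request(costs, {"openai": 0.0, "anthropic": 1.0})
increase = outage / steady - 1  # fractional increase in spend during the outage
```

Under these assumptions the steady-state blend is about 1.025X and the outage cost is 1.5X, an increase of roughly 46%. That figure is a reasonable starting point for the hourly spend threshold.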
Observability
Three Prometheus series do most of the work for failover monitoring:
- gatewayllm_requests_total{provider, status}: the RED-style request rate, broken down by provider. A spike in status=5xx for one provider is your first signal.
- gatewayllm_router_failover_total{from, to}: counts how often the gateway moved a request from `from` to `to`. Should be a flat line in steady state.
- gatewayllm_router_exhaustion_total: counts requests that ran out of retries. Anything above zero deserves an alert.
A reasonable Grafana panel: gatewayllm_router_failover_total rate over the last 30 minutes, paged when above 5% of total request rate.
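The paging rule boils down to a ratio of two counter increases over the same window. The sketch below shows that calculation on raw counter samples; `failover_ratio` and `should_page` are hypothetical helpers, and the code ignores counter resets, which a Prometheus `rate()` would handle for you.

```python
def failover_ratio(
    failovers_start: float,
    failovers_end: float,
    requests_start: float,
    requests_end: float,
) -> float:
    """Share of requests in the window that were failed over."""
    requests = requests_end - requests_start
    if requests <= 0:
        return 0.0  # no traffic in the window, so nothing to page on
    return (failovers_end - failovers_start) / requests


def should_page(ratio: float, threshold: float = 0.05) -> bool:
    """Page when failovers exceed 5% of the request rate."""
    return ratio > threshold
```

For example, 60 failovers against 1,000 requests in the window gives a ratio of 0.06, which crosses the 5% threshold; 10 failovers against the same traffic does not.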
Try it
If you're already running Gateway-LLM, the only thing standing between you and cross-provider failover is adding a second deployment to a model_list entry and reloading config (SIGHUP). The router does the rest.
- Quickstart → /docs/quickstart
- Smart routing & failover → /docs/smart-routing
- Provider gotchas → /docs/providers
Related reading
- What is an LLM gateway? — the category primer.
- Smart routing for cost savings — pair failover with classifier routing.
- Virtual API keys for governance — pin a single team to a single provider when failover isn't acceptable.
FAQ
How does multi-provider LLM failover work?
A single alias is backed by deployments across multiple providers. The gateway tries them in order; on 5xx or timeout, it retries the next one transparently.
Won't failing over to a different provider give me a different response style?
Slightly. Use comparable tiers across providers; for style-critical routes, pin them to a single-deployment alias.
What about provider rate-limiting (429s)?
Treat them like 5xx — retry the next deployment. Watch the failover counter; chronic failover means your primary is rate-limited and you should rebalance.
Run Gateway-LLM in five minutes.
Open-source, OpenAI-compatible, and Apache 2.0. The Quickstart is a single docker compose up.