
Failover Across OpenAI, Anthropic, and Google: A Production Pattern

Q: How does multi-provider LLM failover work?

You configure a single model alias backed by deployments across multiple providers. When the primary deployment returns a 5xx or times out, the gateway transparently retries the next deployment in the list. The client sees one response; the routing decision itself takes single-digit milliseconds.

Q: Won't failing over to a different provider give me a different response style?

Yes — slightly. The fix is to use providers with comparable model tiers (Claude Sonnet ≈ GPT-4o for general tasks; Haiku ≈ GPT-4o-mini for cheap classification). For prompts that depend on a specific model's style, pin them to a non-failover alias.

Q: What about provider rate-limiting (429s)?

Treat 429 like 5xx — retry the next deployment. The gateway does this automatically and emits a Prometheus counter (`gatewayllm_router_failover_total`) so you can alert on chronic primary-provider rate-limiting and rebalance.

A copy-paste production pattern for cross-provider LLM failover: preferring one provider, draining to another within a few hundred milliseconds, and never going down because of one vendor's outage.

Gateway-LLM team · April 12, 2026 · 7 min read

TL;DR

OpenAI has had four 30+ minute outages in the last twelve months. Anthropic has had two. Google's Gemini API rate-limits during regional traffic surges. If your product depends on any single provider being up, your reliability is bounded by theirs — and theirs is not 99.99%.

The fix is cross-provider failover: a single model alias backed by deployments across multiple vendors, where the gateway transparently retries the next one when the primary degrades. This post is a copy-paste production pattern for that, with the gotchas that bite teams in production.

The minimum viable failover config

model_list:
  - model_alias: 'general'
    deployments:
      - provider: openai
        model: gpt-4o
        api_key_env: OPENAI_API_KEY
      - provider: anthropic
        model: claude-sonnet-4-20250514
        api_key_env: ANTHROPIC_API_KEY
      - provider: gemini
        model: gemini-2.0-flash
        api_key_env: GEMINI_API_KEY

router:
  strategy: round_robin
  retries: 2
  retry_after_ms: 200

This single block buys you a lot:

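A request to model: 'general' is sent to OpenAI first (top of the list). Strictly, this walkthrough follows a request whose first pick is OpenAI: round_robin without weights would be expected to spread first attempts across all three deployments, so add weights (see Pattern 1) if OpenAI should be the clear primary.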
If OpenAI returns 5xx or times out, the gateway retries Anthropic.
If Anthropic also fails, it retries Gemini.
If all three fail, the client gets a 502 and a gatewayllm_router_exhaustion_total counter increments — you alert on it.

retries: 2 is the budget beyond the first attempt. Total possible attempts = 3 (one per deployment). retry_after_ms: 200 is a tiny back-off between attempts so you don't slam a recovering provider.
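The retry semantics above can be sketched in a few lines of Python. This is an illustration of the behaviour described in this post, not the gateway's actual implementation; UpstreamError, Exhausted, call_with_failover, and the send callback are hypothetical names introduced for the example.

```python
import time
from typing import Callable, Sequence


class UpstreamError(Exception):
    """Hypothetical error standing in for a 5xx, a 429, or a timeout."""


class Exhausted(Exception):
    """Raised when every attempt failed (the gateway would answer 502)."""


def call_with_failover(
    deployments: Sequence[str],
    send: Callable[[str], str],
    retries: int = 2,
    retry_after_ms: int = 200,
) -> tuple[str, int]:
    """Try deployments in list order; return (response, retries_used)."""
    # One first attempt plus `retries` more, capped at one per deployment.
    attempts = min(retries + 1, len(deployments))
    for i in range(attempts):
        try:
            return send(deployments[i]), i
        except UpstreamError:
            if i < attempts - 1:
                # Brief back-off so a recovering provider is not hammered.
                time.sleep(retry_after_ms / 1000)
    raise Exhausted(f"all {attempts} attempts failed")
```

With retries=2 and three deployments, the loop makes at most three attempts, which matches the arithmetic above. Lowering retries to 1 would leave the third deployment unused.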

What "transparent" actually means

The application code doesn't change. The OpenAI Python SDK — pointed at the gateway — gets the same response shape regardless of which upstream actually served the call. The provider translation layer rewrites the response into OpenAI Chat Completions / Responses API format on the way out.

from openai import OpenAI

# Point the stock OpenAI SDK at the gateway instead of api.openai.com.
client = OpenAI(base_url="https://gateway/v1", api_key="gw_virt_…")

resp = client.chat.completions.create(
    model="general",
    messages=[{"role": "user", "content": "Hi"}],
)

# resp.choices[0].message.content works the same whether
# the upstream was OpenAI, Anthropic, or Gemini.

The gateway adds three response headers so you know which one served you:

X-Gateway-Decision: openai/gpt-4o
X-Gateway-Retries: 0 (or 2 if it failed over twice)
X-Gateway-Cost-Usd: 0.001234
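If you want these headers in logs or traces, a small parser helps. The sketch below assumes the header names and value formats shown above; GatewayDecision and parse_gateway_headers are hypothetical helpers, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class GatewayDecision:
    provider: str
    model: str
    retries: int
    cost_usd: float


def parse_gateway_headers(headers: Mapping[str, str]) -> GatewayDecision:
    """Parse the three gateway headers into typed fields."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    # "openai/gpt-4o" -> ("openai", "gpt-4o"); split on the first slash only.
    provider, _, model = lowered["x-gateway-decision"].partition("/")
    return GatewayDecision(
        provider=provider,
        model=model,
        retries=int(lowered["x-gateway-retries"]),
        cost_usd=float(lowered["x-gateway-cost-usd"]),
    )
```

With the OpenAI Python SDK, one way to reach the headers is client.chat.completions.with_raw_response.create(...), whose result exposes .headers alongside .parse() for the usual response object.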

Three failover patterns

Pattern 1: prefer one provider, drain to another

Most teams want OpenAI primary, others as backup. Use weights:

- model_alias: 'general'
  deployments:
    - provider: openai
      model: gpt-4o
      weight: 95
    - provider: anthropic
      model: claude-sonnet-4-20250514
      weight: 5

router.strategy: round_robin with these weights means 95% of traffic hits OpenAI, 5% hits Anthropic. The 5% serves two purposes:

It's a smoke test — Anthropic is exercised continuously, so when you actually need to fail over, the path is warm.
It validates response parity — you can compare outputs and notice if model behaviour drifts.

When OpenAI 5xxs, the retry budget kicks in and the failing 95% goes to Anthropic until OpenAI recovers. Your 95/5 split during steady state becomes 0/100 during an outage, automatically.
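To see the 95/5 to 0/100 shift in numbers, here is a small simulation. It is a simplification: it models the outage as an "unhealthy" set rather than replaying per-request retries, and pick_deployment and traffic_share are hypothetical names, not gateway internals.

```python
import random
from typing import Mapping, Set


def pick_deployment(
    weights: Mapping[str, int],
    unhealthy: Set[str],
    rng: random.Random,
) -> str:
    """Weighted random pick among deployments that are not currently failing."""
    candidates = {name: w for name, w in weights.items() if name not in unhealthy}
    if not candidates:
        raise RuntimeError("no healthy deployment left")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]


def traffic_share(
    weights: Mapping[str, int],
    unhealthy: Set[str],
    n: int = 10_000,
    seed: int = 0,
) -> dict:
    """Fraction of n simulated requests served by each deployment."""
    rng = random.Random(seed)
    counts = {name: 0 for name in weights}
    for _ in range(n):
        counts[pick_deployment(weights, unhealthy, rng)] += 1
    return {name: c / n for name, c in counts.items()}
```

Calling traffic_share({"openai": 95, "anthropic": 5}, set()) gives roughly the steady-state split; passing {"openai"} as the unhealthy set sends everything to Anthropic.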

Pattern 2: latency-driven failover

If your priority is latency, not vendor preference:

router:
  strategy: least_latency
  retries: 1

The router picks the deployment with the lowest p50 over the last 60 seconds. When one provider degrades by 200 ms, traffic shifts away from it — usually before any 5xx is observed. Combined with retries, this is the most resilient configuration we ship.
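A minimal sketch of the selection rule, assuming a sliding 60-second window and a median over the samples inside it. LatencyRouter is a hypothetical class for illustration; the real router's bookkeeping may differ.

```python
import statistics
from collections import defaultdict, deque


class LatencyRouter:
    """Pick the deployment with the lowest median latency in a sliding window."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        # deployment name -> deque of (timestamp_s, latency_ms), oldest first
        self.samples = defaultdict(deque)

    def record(self, name: str, latency_ms: float, now: float) -> None:
        self.samples[name].append((now, latency_ms))

    def p50(self, name: str, now: float) -> float:
        q = self.samples[name]
        # Drop samples that have aged out of the window.
        while q and q[0][0] < now - self.window_s:
            q.popleft()
        # Assumption: a deployment with no recent samples ranks last.
        return statistics.median(lat for _, lat in q) if q else float("inf")

    def pick(self, names, now: float) -> str:
        return min(names, key=lambda n: self.p50(n, now))
```

One design choice worth noting: ranking sample-less deployments last means a cold backup never gets traffic under this rule alone, which is one more argument for keeping a small steady-state weight on the secondary.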

Pattern 3: smart-route + failover

The combination that gives both cost savings and resilience:

- model_alias: 'auto'
  deployments:
    - provider: openai
      model: gpt-4o-mini
      tier: simple
    - provider: anthropic
      model: claude-haiku
      tier: simple        # fallback for simple bucket
    - provider: openai
      model: gpt-4o
      tier: medium
    - provider: anthropic
      model: claude-sonnet-4-20250514
      tier: medium        # fallback for medium bucket
    - provider: openai
      model: gpt-5
      tier: complex

router:
  strategy: classifier
  retries: 1

The classifier picks the bucket; within the bucket, the gateway tries the first matching deployment, then fails over to the second on failure. You get smart routing's cost savings and cross-provider resilience without picking between them.
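The bucket-then-failover order can be sketched as a filter over the deployment list. Deployment, AUTO, and candidates_for are hypothetical names; the entries mirror the config above.

```python
from typing import List, NamedTuple


class Deployment(NamedTuple):
    provider: str
    model: str
    tier: str


# Mirrors the 'auto' alias above, in list order.
AUTO = [
    Deployment("openai", "gpt-4o-mini", "simple"),
    Deployment("anthropic", "claude-haiku", "simple"),
    Deployment("openai", "gpt-4o", "medium"),
    Deployment("anthropic", "claude-sonnet-4-20250514", "medium"),
    Deployment("openai", "gpt-5", "complex"),
]


def candidates_for(
    tier: str, deployments: List[Deployment], retries: int = 1
) -> List[Deployment]:
    """Deployments to try, in order, for the bucket the classifier picked."""
    in_tier = [d for d in deployments if d.tier == tier]
    # First attempt plus `retries` fallbacks, never more than the bucket holds.
    return in_tier[: retries + 1]
```

Running candidates_for("complex", AUTO) returns a single entry: as written, the complex bucket has no second deployment, so it has no cross-provider fallback until you add one.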

The non-obvious gotchas

Style drift across providers

OpenAI's GPT-4o, Anthropic's Sonnet, and Gemini's Flash answer the same prompt slightly differently. They follow instructions comparably well, but their voice differs. If your product depends on a specific tone (a customer-facing chatbot with a defined persona), the failover hop can produce a noticeable response-style shift.

The fix is not to abandon failover. The fix is:

For style-critical routes, pin them to a single-deployment alias and accept the reduced resilience for that one route.
For everything else, failover is fine — users don't notice tone drift in form-filling or back-office workflows.

Tool-call format differences

OpenAI's tool-calling format is the most widely supported. Anthropic's is bidirectionally translated by the gateway, but edge cases exist (deeply nested schemas, multi-step tool chains). If your product relies on tool calls at scale, validate the failover path explicitly:

# Force Anthropic by temporarily disabling the OpenAI deployment
curl -X POST http://localhost:8080/admin/deployments/disable \
  -H 'Content-Type: application/json' \
  -d '{ "alias": "general", "provider": "openai" }'

Run your tool-using flows; verify they work. Re-enable the OpenAI deployment.
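While the OpenAI deployment is disabled, a quick structural check on each response catches most translation problems. The sketch below assumes the response has been converted to a plain dict in OpenAI Chat Completions format (for example via resp.model_dump() in the Python SDK); check_tool_calls is a hypothetical helper.

```python
import json
from typing import List, Set


def check_tool_calls(response: dict, expected_tools: Set[str]) -> List[str]:
    """Return a list of problems found in an OpenAI-format chat completion."""
    problems = []
    message = response["choices"][0]["message"]
    # tool_calls may be absent or None when the model answered in plain text.
    for call in message.get("tool_calls") or []:
        fn = call.get("function") or {}
        name = fn.get("name")
        if name not in expected_tools:
            problems.append(f"unexpected tool: {name!r}")
        try:
            # In this format, arguments arrive as a JSON-encoded string.
            args = json.loads(fn.get("arguments") or "")
        except json.JSONDecodeError:
            problems.append(f"{name}: arguments are not valid JSON")
            continue
        if not isinstance(args, dict):
            problems.append(f"{name}: arguments did not decode to an object")
    return problems
```

An empty list means the shape survived the round trip; it says nothing about whether the model chose the right tool, so keep your end-to-end assertions as well.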

Rate limits look like outages

When a provider rate-limits you (429), the gateway treats it like a 5xx: retry the next deployment. That's the right behaviour for a transient spike, but if you're systemically rate-limited on the primary, every request fails over.

Watch for it:

gatewayllm_router_failover_total{from="openai", to="anthropic"} should be near-zero in steady state.
If it climbs, your primary is hitting limits. Either raise your provider quota or weight more traffic to the secondary.

Cost surprises during failover

Your secondary provider may not be priced like your primary. If OpenAI primary costs $X/request and Anthropic secondary costs $1.5X/request, your bill spikes during an OpenAI outage. Two mitigations:

Cap with virtual key budgets (monthly_budget_usd) so a long outage can't run away.
Use the gatewayllm_cost_usd_total{provider} metric to set a Slack alert on hourly spend exceeding a threshold.
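A back-of-the-envelope calculation shows why the budget cap matters. The traffic and price figures used with this sketch are made-up inputs, and outage_extra_cost and hours_until_budget are hypothetical helpers.

```python
def outage_extra_cost(
    requests_per_hour: float,
    primary_cost_usd: float,
    secondary_multiplier: float,
    outage_hours: float,
) -> float:
    """Extra spend from serving an outage's traffic on the pricier secondary."""
    baseline = requests_per_hour * primary_cost_usd * outage_hours
    return baseline * (secondary_multiplier - 1.0)


def hours_until_budget(
    budget_remaining_usd: float,
    requests_per_hour: float,
    cost_per_request_usd: float,
) -> float:
    """How long the remaining budget lasts at the failover price."""
    hourly = requests_per_hour * cost_per_request_usd
    return float("inf") if hourly <= 0 else budget_remaining_usd / hourly
```

For example, at 10,000 requests per hour, $0.002 per request on the primary, and a 1.5X secondary, a two-hour outage adds about $20 on top of the usual $40 for that window.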

Observability

Three Prometheus series do most of the work for failover monitoring:

gatewayllm_requests_total{provider, status} — the RED-style request rate, broken down by provider. A spike in status=5xx for one provider is your first signal.
gatewayllm_router_failover_total{from, to} — counts how often the gateway moved a request from from to to. Should be a flat line in steady state.
gatewayllm_router_exhaustion_total — counts requests that ran out of retries. Anything above zero deserves an alert.

A reasonable Grafana panel: the rate of gatewayllm_router_failover_total over the last 30 minutes, with a page when it exceeds 5% of the total request rate.
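The paging rule boils down to a ratio of two counter deltas over the same window. The sketch below works on raw counter readings taken at the start and end of the window; failover_ratio and should_page are hypothetical helpers, and in practice you would express the same thing in your alerting system's query language.

```python
def failover_ratio(
    failover_start: float,
    failover_end: float,
    requests_start: float,
    requests_end: float,
) -> float:
    """Share of requests in the window that were moved to another deployment."""
    d_requests = requests_end - requests_start
    if d_requests <= 0:
        return 0.0  # no traffic (or a counter reset): nothing to judge
    # Clamp at zero so a failover-counter reset cannot produce a negative ratio.
    return max(failover_end - failover_start, 0.0) / d_requests


def should_page(ratio: float, threshold: float = 0.05) -> bool:
    """Page when failovers exceed the threshold share of total requests."""
    return ratio > threshold
```

With 60 failovers out of 1,000 requests in the window, the ratio is 6% and the rule pages; at 40 out of 1,000 it stays quiet.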

Try it

If you're already running Gateway-LLM, the only thing standing between you and cross-provider failover is adding a second deployment to a model_list entry and reloading config (SIGHUP). The router does the rest.

Quickstart → /docs/quickstart
Smart routing & failover → /docs/smart-routing
Provider gotchas → /docs/providers

FAQ

How does multi-provider LLM failover work?
A single alias is backed by deployments across multiple providers. The gateway tries them in order; on 5xx or timeout, it retries the next one transparently.

Won't failing over to a different provider give me a different response style?
Slightly. Use comparable tiers across providers; for style-critical routes, pin them to a single-deployment alias.

What about provider rate-limiting (429s)?
Treat them like 5xx — retry the next deployment. Watch the failover counter; chronic failover means your primary is rate-limited and you should rebalance.

Run Gateway-LLM in five minutes.

Open-source, OpenAI-compatible, and Apache 2.0. The Quickstart is a single docker compose up.
