Smart routing
Three router strategies (round_robin, least_latency, classifier), how they pick deployments, and how they handle failover.
4 min read · updated 2026-04-29
When a request arrives at an alias backed by more than one deployment, the router decides which deployment actually handles it. Pick a strategy that matches what you're optimising for: cost, latency, or quality.
router:
  strategy: classifier # round_robin | least_latency | classifier
  retries: 2
  retry_after_ms: 200
Strategies
round_robin
The simplest. Cycles through deployments in order, skewed by weight if you set it. Good fit when:
- All deployments are roughly equivalent (same model, multiple keys).
- You want predictable, even spread for cost reporting.
- model_alias: 'gpt-4o'
  deployments:
    - provider: openai
      model: gpt-4o
      api_key_env: OPENAI_KEY_TEAM_A
      weight: 50
    - provider: openai
      model: gpt-4o
      api_key_env: OPENAI_KEY_TEAM_B
      weight: 50
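The text above does not say which algorithm skews the cycle by weight. One common choice is smooth weighted round-robin, sketched here in Python purely as an illustration; `Deployment` and `pick` are hypothetical names, not gateway APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    name: str         # hypothetical label; here the api_key_env value
    weight: int = 1
    current: int = 0  # running counter used by the smoothing step


def pick(deployments: List[Deployment]) -> Deployment:
    """Smooth weighted round-robin: return the next deployment in the cycle."""
    total = sum(d.weight for d in deployments)
    for d in deployments:
        d.current += d.weight
    # max() returns the first deployment among ties, so list order breaks ties.
    chosen = max(deployments, key=lambda d: d.current)
    chosen.current -= total
    return chosen


pool = [Deployment("OPENAI_KEY_TEAM_A", 50), Deployment("OPENAI_KEY_TEAM_B", 50)]
order = [pick(pool).name for _ in range(4)]
print(order)
```

With equal weights the picks simply alternate; with unequal weights the heavier deployment is chosen proportionally more often without long runs on one key.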
least_latency
Tracks rolling p50 latency per deployment over the last 60 seconds and picks the lowest. Excellent fit for an alias backed by several providers when one is briefly degraded — the router routes around it without you noticing.
It's not a magic bullet: if every deployment is healthy, this behaves like round-robin most of the time, then quietly leans away from a slow one when the data says so.
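A rolling p50 selector along these lines could look like the sketch below. It assumes a plain median over samples from the last 60 seconds and treats a deployment with no recent samples as the fastest so that it gets measured; both choices, and the `LatencyTracker` name, are assumptions rather than documented gateway behaviour.

```python
import statistics
import time
from collections import defaultdict, deque
from typing import Iterable, Optional

WINDOW_SECONDS = 60.0  # the 60-second window described above


class LatencyTracker:
    def __init__(self, window: float = WINDOW_SECONDS):
        self.window = window
        # deployment name -> deque of (timestamp, latency_ms), oldest first
        self.samples = defaultdict(deque)

    def record(self, deployment: str, latency_ms: float, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.samples[deployment].append((now, latency_ms))

    def p50(self, deployment: str, now: Optional[float] = None) -> Optional[float]:
        now = time.monotonic() if now is None else now
        window = self.samples[deployment]
        # Drop samples that have aged out of the rolling window.
        while window and now - window[0][0] > self.window:
            window.popleft()
        if not window:
            return None
        return statistics.median(latency for _, latency in window)

    def pick(self, deployments: Iterable[str], now: Optional[float] = None) -> str:
        def key(deployment: str) -> float:
            p = self.p50(deployment, now)
            return 0.0 if p is None else p  # assumption: unmeasured goes first
        return min(deployments, key=key)


tracker = LatencyTracker()
tracker.record("openai", 100.0, now=0.0)     # ages out before the pick below
tracker.record("openai", 800.0, now=100.0)   # provider is currently slow
tracker.record("anthropic", 300.0, now=100.0)
print(tracker.pick(["openai", "anthropic"], now=100.0))
```

The `now` parameter exists only to make the sketch testable without sleeping; a real router would rely on its own clock.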
classifier
The marquee strategy. Each prompt is scored for complexity in microseconds and dropped into one of three buckets:
| Bucket | Score range | Routed to (typical config) |
|---|---|---|
| simple | 0.00 – 0.33 | Mini-tier model (gpt-4o-mini, gemini-flash, …) |
| medium | 0.33 – 0.66 | Mid-tier model (gpt-4o, claude-sonnet, …) |
| complex | 0.66 – 1.00 | Flagship (gpt-5, claude-opus, …) |
Configure the bucket-to-deployment mapping in your alias:
- model_alias: 'auto'
  deployments:
    - provider: openai
      model: gpt-4o-mini
      tier: simple
    - provider: openai
      model: gpt-4o
      tier: medium
    - provider: openai
      model: gpt-5
      tier: complex
The signals the classifier looks at (long inputs, code blocks, math, and explicit reasoning verbs such as reason, derive, compare) are the same set the public live demo uses. They're deterministic and fast (sub-microsecond on warm classifiers), and you can override the default bucket boundaries with router.classifier.thresholds.
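To make the bucket table concrete, here is a minimal sketch of mapping a score to a tier. It assumes router.classifier.thresholds holds the two cut points between buckets and that a score sitting exactly on a cut point goes to the higher tier; the documentation above specifies neither, and `bucket_for` is a hypothetical helper. How the score itself is computed is not covered here.

```python
from bisect import bisect_right
from typing import Sequence

TIERS = ["simple", "medium", "complex"]
DEFAULT_THRESHOLDS = (0.33, 0.66)  # cut points taken from the table above

# Tier-to-deployment mapping mirroring the 'auto' alias config.
DEPLOYMENTS = {
    "simple": "openai/gpt-4o-mini",
    "medium": "openai/gpt-4o",
    "complex": "openai/gpt-5",
}


def bucket_for(score: float, thresholds: Sequence[float] = DEFAULT_THRESHOLDS) -> str:
    """Return the tier for a complexity score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    # bisect_right sends a score equal to a cut point to the higher tier (assumption).
    return TIERS[bisect_right(list(thresholds), score)]


for score in (0.10, 0.50, 0.90):
    tier = bucket_for(score)
    print(score, tier, DEPLOYMENTS[tier])
```

Moving a cut point up widens the cheaper bucket below it, so more prompts stay on the lower tier; moving it down does the opposite.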
Failover and retries
Every strategy is wrapped in a retry budget:
router:
  retries: 2 # how many additional deployments to try
  retry_after_ms: 200 # backoff between attempts
A request that fails (5xx, timeout, or a provider 429 rate limit) is automatically retried against the next deployment in the alias's list. The client sees one response, with a header x-gateway-retries: N if any retries happened.
If every deployment fails, the gateway returns 502 Bad Gateway with the upstream error in the body, and increments gatewayllm_router_exhaustion_total so you can alert on it.
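The retry budget described above can be sketched as a simple loop. This is an illustration under assumptions: `route_with_failover` and the `call` callback are hypothetical, a timeout is modelled as the callback raising `TimeoutError`, and the sketch sets no retry header on the 502 response because the text does not say whether the gateway does.

```python
import time
from typing import Callable, Dict, Optional, Sequence, Tuple


def is_retryable(status: int) -> bool:
    """5xx and 429 count as failures worth retrying, per the text above."""
    return status == 429 or 500 <= status <= 599


def route_with_failover(
    deployments: Sequence[str],
    call: Callable[[str], Tuple[int, str]],  # hypothetical: returns (status, body)
    retries: int = 2,
    retry_after_ms: int = 200,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Optional[str], Dict[str, str]]:
    """Try up to retries + 1 deployments in list order."""
    last_error: Optional[str] = None
    for attempt, deployment in enumerate(deployments[: retries + 1]):
        if attempt > 0:
            sleep(retry_after_ms / 1000)  # backoff between attempts
        try:
            status, body = call(deployment)
        except TimeoutError:
            last_error = f"{deployment}: timeout"
            continue
        if is_retryable(status):
            last_error = body
            continue
        # Only advertise retries when at least one happened.
        headers = {"x-gateway-retries": str(attempt)} if attempt else {}
        return status, body, headers
    # Every attempted deployment failed: surface the last upstream error as a 502.
    return 502, last_error, {}


responses = {
    "primary": (503, "upstream down"),
    "secondary": (429, "rate limited"),
    "tertiary": (200, "ok"),
}
result = route_with_failover(list(responses), responses.__getitem__, sleep=lambda s: None)
print(result)
```

Passing `sleep` as a parameter keeps the example fast to run; the default uses `time.sleep` with the configured backoff.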
Per-team overrides (with virtual keys)
A virtual key can pin or restrict the router's choices:
curl -X POST http://localhost:8080/admin/keys \
  -H "Authorization: Bearer $MASTER_KEY" \
  -d '{
    "name": "compliance-team",
    "models": ["gpt-4o", "claude-sonnet"],
    "router": { "force_strategy": "least_latency" }
  }'
That key can only call those two aliases, and the router runs least_latency for it specifically — independently of the global default.
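One way to picture how such a key interacts with the global default is the small resolution sketch below. `VirtualKey` and `resolve_strategy` are hypothetical names, and treating an empty models list as "no restriction" is an assumption the text does not confirm.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VirtualKey:
    name: str
    models: List[str] = field(default_factory=list)  # allowed aliases
    force_strategy: Optional[str] = None             # per-key router override


def resolve_strategy(key: VirtualKey, alias: str, global_strategy: str) -> str:
    """Reject aliases the key may not call, then pick the effective strategy."""
    if key.models and alias not in key.models:
        raise PermissionError(f"key {key.name!r} may not call alias {alias!r}")
    return key.force_strategy or global_strategy


compliance = VirtualKey(
    name="compliance-team",
    models=["gpt-4o", "claude-sonnet"],
    force_strategy="least_latency",
)
print(resolve_strategy(compliance, "gpt-4o", global_strategy="classifier"))
```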
Observability
Every routing decision lands in two places:
- Response headers: x-gateway-decision: openai/gpt-4o-mini and x-gateway-route-bucket: simple.
- A Prometheus counter: gatewayllm_smart_route_decisions_total{bucket="simple",override="false"}.
Hit /metrics to see the live distribution; sum it over a day to see how much traffic is hitting your cheap tier vs your flagship. That ratio is your savings number.
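As a rough illustration of reading that ratio, the sketch below parses counter lines in the Prometheus text format and computes each bucket's share of decisions. The sample values are made up for the example, `bucket_shares` is a hypothetical helper, and a single scrape shows cumulative totals only; a per-day figure would need the difference between two scrapes or a query in your metrics backend.

```python
import re
from collections import Counter
from typing import Dict

METRIC = "gatewayllm_smart_route_decisions_total"
LINE = re.compile(r"^" + re.escape(METRIC) + r"\{([^}]*)\}\s+([0-9.eE+-]+)")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def bucket_shares(metrics_text: str) -> Dict[str, float]:
    """Return each bucket's fraction of all routing decisions in the scrape."""
    counts: Counter = Counter()
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL.findall(match.group(1)))
        counts[labels.get("bucket", "unknown")] += float(match.group(2))
    total = sum(counts.values())
    return {bucket: count / total for bucket, count in counts.items()} if total else {}


# Made-up sample scrape, shaped like the counter shown above.
SAMPLE = """
# TYPE gatewayllm_smart_route_decisions_total counter
gatewayllm_smart_route_decisions_total{bucket="simple",override="false"} 720
gatewayllm_smart_route_decisions_total{bucket="medium",override="false"} 200
gatewayllm_smart_route_decisions_total{bucket="complex",override="false"} 80
"""
shares = bucket_shares(SAMPLE)
print(shares)
```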
Tuning advice
- Start in the live demo, not in production. Drop in some real prompts, see which bucket they land in. If the classifier downgrades something it shouldn't, raise that bucket's threshold.
- Use shadow mode during rollout (router.classifier.shadow: true). The classifier still decides; the request still goes to your old default. You compare deltas in the dashboard before swapping.
- Pin sensitive routes with a virtual key that locks models: to a single alias. Customer-facing legal copy never has to share a router decision with internal classification jobs.
What's next
- Cut LLM costs with smart routing — the narrative version with worked token math.
- Multi-provider failover — failover patterns at scale.
- Virtual API keys — per-key router and budget caps.
Stuck or want a feature? Email the founders directly at mitshawtechnologies@gmail.com. We answer fast.