Smart routing

Three router strategies (round_robin, least_latency, classifier), how they pick deployments, and how they handle failover.

4 min read · updated 2026-04-29

When a request arrives at an alias backed by more than one deployment, the router decides which deployment actually handles it. Pick a strategy that matches what you're optimising for: cost, latency, or quality.

router:
  strategy: classifier  # round_robin | least_latency | classifier
  retries: 2
  retry_after_ms: 200

Strategies

`round_robin`

The simplest. Cycles through deployments in order, skewed by weight if you set it. Good fit when:

- All deployments are roughly equivalent (same model, multiple keys).
- You want predictable, even spread for cost reporting.

- model_alias: 'gpt-4o'
  deployments:
    - provider: openai
      model: gpt-4o
      api_key_env: OPENAI_KEY_TEAM_A
      weight: 50
    - provider: openai
      model: gpt-4o
      api_key_env: OPENAI_KEY_TEAM_B
      weight: 50
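
To make "cycles through deployments in order, skewed by weight" concrete, here is a minimal Python sketch. It assumes a smooth weighted round-robin scheme (the kind nginx popularised); the gateway's actual algorithm is not documented here, and `Deployment` and `pick` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    weight: int
    current: int = 0  # running credit used by the smooth weighted algorithm


def pick(deployments: list[Deployment]) -> Deployment:
    """Return the next deployment using smooth weighted round-robin (assumed scheme)."""
    total = sum(d.weight for d in deployments)
    for d in deployments:
        d.current += d.weight          # every deployment earns its weight each round
    chosen = max(deployments, key=lambda d: d.current)
    chosen.current -= total            # the winner pays back the combined weight
    return chosen


# Two equally weighted keys, as in the alias config above.
pool = [Deployment("team_a", 50), Deployment("team_b", 50)]
order = [pick(pool).name for _ in range(4)]
print(order)

# A 3:1 split sends three of every four requests to the heavier deployment.
skewed = [Deployment("x", 3), Deployment("y", 1)]
skewed_order = [pick(skewed).name for _ in range(4)]
print(skewed_order)
```

With equal weights the picks simply alternate; with unequal weights the heavier deployment is chosen proportionally more often, without sending it long bursts.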

`least_latency`

Tracks rolling p50 latency per deployment over the last 60 seconds and picks the lowest. Excellent fit for an alias backed by several providers when one is briefly degraded — the router routes around it without you noticing.

It's not a magic bullet: if every deployment is healthy, this behaves like round-robin most of the time, then quietly leans away from a slow one when the data says so.
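
The rolling-p50 idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the gateway's implementation: `LatencyTracker` is a hypothetical class, and the choice to prefer deployments with no recent samples (so they get probed) is an assumption of this sketch.

```python
import time
from collections import defaultdict, deque
from statistics import median

WINDOW_SECONDS = 60.0  # matches the 60-second window described above


class LatencyTracker:
    def __init__(self):
        # deployment name -> deque of (timestamp, latency_ms), oldest first
        self.samples = defaultdict(deque)

    def record(self, name, latency_ms, now=None):
        now = time.monotonic() if now is None else now
        self.samples[name].append((now, latency_ms))

    def p50(self, name, now=None):
        """Median latency over the window, or None if there are no recent samples."""
        now = time.monotonic() if now is None else now
        window = self.samples[name]
        while window and now - window[0][0] > WINDOW_SECONDS:
            window.popleft()           # drop samples older than the window
        if not window:
            return None
        return median(lat for _, lat in window)

    def pick(self, names, now=None):
        """Pick the lowest p50; deployments without data sort first (assumption)."""
        def sort_key(name):
            p = self.p50(name, now)
            return (p is not None, p or 0.0)
        return min(names, key=sort_key)


tracker = LatencyTracker()
tracker.record("a", 900, now=0)    # a slow blip on deployment "a"
tracker.record("a", 50, now=50)
tracker.record("b", 100, now=50)
during_blip = tracker.pick(["a", "b"], now=55)   # a's p50 is 475 ms, b's is 100 ms
after_blip = tracker.pick(["a", "b"], now=70)    # the 900 ms sample has aged out
print(during_blip, after_blip)
```

The explicit `now` argument exists only so the example is deterministic; a real tracker would rely on the monotonic clock.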

`classifier`

The marquee strategy. Each prompt is scored for complexity in microseconds and dropped into one of three buckets:

| Bucket | Score range | Routed to (typical config) |
|---|---|---|
| simple | 0.00 – 0.33 | Mini-tier model (gpt-4o-mini, gemini-flash, …) |
| medium | 0.33 – 0.66 | Mid-tier model (gpt-4o, claude-sonnet, …) |
| complex | 0.66 – 1.00 | Flagship (gpt-5, claude-opus, …) |

Configure the bucket-to-deployment mapping in your alias:

- model_alias: 'auto'
  deployments:
    - provider: openai
      model: gpt-4o-mini
      tier: simple
    - provider: openai
      model: gpt-4o
      tier: medium
    - provider: openai
      model: gpt-5
      tier: complex

The classifier looks at these signals: long inputs, code blocks, math, and explicit reasoning verbs (reason, derive, compare). They are the same set the public live demo uses. They're deterministic and fast (sub-microsecond on warm classifiers), and you can override the default bucket thresholds with router.classifier.thresholds.
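
As a rough illustration of the bucket step only (not the scoring itself), the Python sketch below maps a score to a tier and then to a model. The default thresholds come from the table above and the models from the `auto` alias; sending a score that sits exactly on a boundary to the higher tier is an assumption of this sketch, and `bucket_for` and `route` are hypothetical names.

```python
from bisect import bisect_right

# Upper bounds of the simple and medium buckets, taken from the table above.
DEFAULT_THRESHOLDS = (0.33, 0.66)
TIERS = ("simple", "medium", "complex")

# Mirrors the tier assignments in the 'auto' alias example.
TIER_TO_MODEL = {
    "simple": "gpt-4o-mini",
    "medium": "gpt-4o",
    "complex": "gpt-5",
}


def bucket_for(score: float, thresholds=DEFAULT_THRESHOLDS) -> str:
    """Map a complexity score in [0, 1] to a tier name."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    # bisect_right sends a score exactly on a boundary to the higher tier (assumption).
    return TIERS[bisect_right(thresholds, score)]


def route(score: float, thresholds=DEFAULT_THRESHOLDS) -> str:
    return TIER_TO_MODEL[bucket_for(score, thresholds)]


print(route(0.10))                          # lands in the simple tier
print(route(0.50))                          # lands in the medium tier
print(route(0.30, thresholds=(0.25, 0.66))) # lowered boundary moves 0.30 up a tier
```

Lowering the first threshold, as in the last call, is how a prompt that was being routed too cheaply would move up a tier in this sketch.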

Failover and retries

Every strategy is wrapped in a retry budget:

router:
  retries: 2          # how many additional deployments to try
  retry_after_ms: 200 # backoff between attempts

A request that fails (5xx, timeout, or a provider rate-limit 429) is automatically retried against the next deployment in the alias's list. The client sees one response, with an x-gateway-retries: N header if any retries happened.

If every deployment fails, the gateway returns 502 Bad Gateway with the upstream error in the body, and increments gatewayllm_router_exhaustion_total so you can alert on it.
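
The retry budget can be pictured as a small loop. The Python sketch below is a simplified model under stated assumptions, not the gateway's code: `UpstreamError`, `call_with_failover`, and the `send` callback are hypothetical, and the sketch also returns 502 when the retry budget runs out before the list does.

```python
import time


class UpstreamError(Exception):
    """A retryable failure: 5xx, timeout, or a provider 429."""

    def __init__(self, status):
        super().__init__(f"upstream returned {status}")
        self.status = status


def call_with_failover(deployments, send, retries=2, retry_after_ms=200,
                       sleep=time.sleep):
    """Try up to retries + 1 deployments in list order; return (status, headers, body)."""
    last_error = None
    for attempt, deployment in enumerate(deployments[: retries + 1]):
        if attempt > 0:
            sleep(retry_after_ms / 1000)   # backoff between attempts
        try:
            body = send(deployment)
        except UpstreamError as err:
            last_error = err
            continue                        # move on to the next deployment
        # The header is only present when at least one retry happened.
        headers = {"x-gateway-retries": str(attempt)} if attempt else {}
        return 200, headers, body
    return 502, {}, str(last_error)         # budget exhausted: surface the upstream error


def flaky_send(deployment):
    # Hypothetical upstream: the first key is rate-limited, the second answers.
    if deployment == "openai/key-a":
        raise UpstreamError(429)
    return f"ok from {deployment}"


def always_down(deployment):
    raise UpstreamError(503)


def no_sleep(seconds):
    pass  # skip real backoff in the demo


recovered = call_with_failover(["openai/key-a", "openai/key-b"], flaky_send, sleep=no_sleep)
exhausted = call_with_failover(["openai/key-a", "openai/key-b"], always_down, sleep=no_sleep)
print(recovered)
print(exhausted)
```

A real gateway would also increment the exhaustion counter on the 502 path; that is left out to keep the sketch short.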

Per-team overrides (with virtual keys)

A virtual key can pin or restrict the router's choices:

curl -X POST http://localhost:8080/admin/keys \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "compliance-team",
    "models": ["gpt-4o", "claude-sonnet"],
    "router": { "force_strategy": "least_latency" }
  }'

That key can only call those two aliases, and the router runs least_latency for it specifically — independently of the global default.
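
One way to picture how a key's settings interact with the global default is the Python sketch below. It is a mental model only: `VirtualKey` and `resolve_strategy` are hypothetical names, and signalling a disallowed alias with `PermissionError` is an assumption of this sketch rather than documented gateway behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VirtualKey:
    name: str
    models: list[str]                      # aliases this key may call
    force_strategy: Optional[str] = None   # per-key router override, if any


def resolve_strategy(key: VirtualKey, alias: str, global_strategy: str) -> str:
    """Return the strategy that applies to this key and alias."""
    if alias not in key.models:
        # How the real gateway rejects the call is not specified; this is illustrative.
        raise PermissionError(f"key {key.name!r} may not call alias {alias!r}")
    return key.force_strategy or global_strategy


compliance = VirtualKey(
    name="compliance-team",
    models=["gpt-4o", "claude-sonnet"],
    force_strategy="least_latency",
)
unrestricted = VirtualKey(name="default-team", models=["gpt-4o", "auto"])

print(resolve_strategy(compliance, "gpt-4o", "classifier"))    # key override wins
print(resolve_strategy(unrestricted, "gpt-4o", "classifier"))  # falls back to the global default
```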

Observability

Every routing decision lands in two places:

- Response headers, e.g. x-gateway-decision: openai/gpt-4o-mini and x-gateway-route-bucket: simple.
- A Prometheus counter, e.g. gatewayllm_smart_route_decisions_total{bucket="simple",override="false"}.

Hit /metrics to see the live distribution; sum it over a day to see how much traffic is hitting your cheap tier vs your flagship. That ratio is your savings number.
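
To turn the counter into a ratio, a short Python sketch like the one below can parse the /metrics text and compute each bucket's share of decisions. The sample values are made up for illustration, `bucket_shares` is a hypothetical helper, and the parser only handles simple label values without escaped quotes.

```python
import re
from collections import Counter

# Matches sample lines for the routing-decision counter in Prometheus text format.
SAMPLE_LINE = re.compile(
    r'^gatewayllm_smart_route_decisions_total\{([^}]*)\}\s+(\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def bucket_shares(metrics_text: str) -> dict[str, float]:
    """Return each bucket's fraction of all routing decisions."""
    counts = Counter()
    for line in metrics_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        labels = dict(LABEL.findall(match.group(1)))
        counts[labels.get("bucket", "unknown")] += float(match.group(2))
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()} if total else {}


# Made-up sample output, shaped like the counter shown above.
sample = """
# TYPE gatewayllm_smart_route_decisions_total counter
gatewayllm_smart_route_decisions_total{bucket="simple",override="false"} 600
gatewayllm_smart_route_decisions_total{bucket="medium",override="false"} 300
gatewayllm_smart_route_decisions_total{bucket="complex",override="false"} 100
"""
shares = bucket_shares(sample)
print(shares)
```

Because Prometheus counters are cumulative, a daily figure would come from the difference between two scrapes (or PromQL's `increase()` over a one-day range) rather than from a single reading.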

Tuning advice

- Start in the live demo, not in production. Drop in some real prompts and see which bucket they land in. If the classifier downgrades something it shouldn't, lower the upper threshold of the bucket it landed in so that prompt moves up a tier.
- Use shadow mode during rollout (router.classifier.shadow: true). The classifier still decides; the request still goes to your old default. You compare deltas in the dashboard before swapping.
- Pin sensitive routes with a virtual key that locks models: to a single alias. Customer-facing legal copy never has to share a router decision with internal classification jobs.

What's next

- Cut LLM costs with smart routing: the narrative version with worked token math.
- Multi-provider failover: failover patterns at scale.
- Virtual API keys: per-key router and budget caps.

Stuck or want a feature? Email the founders directly at mitshawtechnologies@gmail.com. We answer fast.