Smart routing

Three router strategies (round_robin, least_latency, classifier), how they pick deployments, and how they handle failover.

4 min read · updated 2026-04-29

When a request arrives at an alias backed by more than one deployment, the router decides which deployment actually handles it. Pick a strategy that matches what you're optimising for: cost, latency, or quality.

router:
  strategy: classifier  # round_robin | least_latency | classifier
  retries: 2
  retry_after_ms: 200

Strategies

round_robin

The simplest. Cycles through deployments in order, skewed by weight if you set it. Good fit when:

  • All deployments are roughly equivalent (same model, multiple keys).
  • You want predictable, even spread for cost reporting.

- model_alias: 'gpt-4o'
  deployments:
    - provider: openai
      model: gpt-4o
      api_key_env: OPENAI_KEY_TEAM_A
      weight: 50
    - provider: openai
      model: gpt-4o
      api_key_env: OPENAI_KEY_TEAM_B
      weight: 50
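
A minimal sketch of how weight-skewed cycling could work, written in Python for illustration only. The page does not say which algorithm the router uses; this sketch uses smooth weighted round-robin, and the Deployment class and pick_weighted function are hypothetical names, not part of the gateway.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str         # hypothetical label, e.g. the api_key_env value
    weight: int       # same meaning as `weight:` in the alias config
    current: int = 0  # running counter used by the scheduler


def pick_weighted(deployments: list[Deployment]) -> Deployment:
    """Smooth weighted round-robin: each call returns one deployment.

    Over any run of sum(weights) calls, each deployment is picked
    in proportion to its weight.
    """
    total = sum(d.weight for d in deployments)
    for d in deployments:
        d.current += d.weight
    # On a tie, max() returns the first deployment in list order.
    chosen = max(deployments, key=lambda d: d.current)
    chosen.current -= total
    return chosen


if __name__ == "__main__":
    pool = [Deployment("OPENAI_KEY_TEAM_A", 50), Deployment("OPENAI_KEY_TEAM_B", 50)]
    for _ in range(4):
        print(pick_weighted(pool).name)
```

With equal weights, as in the config above, the picks simply alternate between the two keys; unequal weights skew the cycle without sending long bursts to one deployment.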

least_latency

Tracks rolling p50 latency per deployment over the last 60 seconds and picks the lowest. A good fit for an alias backed by several providers: when one is briefly degraded, the router steers traffic away from it without any action on your part.

It's not a magic bullet: if every deployment is healthy, this behaves like round-robin most of the time, then quietly leans away from a slow one when the data says so.
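
The selection rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the gateway's implementation: LatencyTracker and its methods are hypothetical names, and treating a deployment with no recent samples as 0 ms (so that it gets probed) is a guess the page does not confirm.

```python
import time
from collections import defaultdict, deque
from statistics import median
from typing import Optional


class LatencyTracker:
    """Keeps per-deployment latency samples for a rolling time window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.samples = defaultdict(deque)  # name -> deque of (timestamp, latency_ms)

    def record(self, deployment: str, latency_ms: float, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.samples[deployment].append((now, latency_ms))

    def p50(self, deployment: str, now: Optional[float] = None) -> Optional[float]:
        now = time.monotonic() if now is None else now
        window = self.samples[deployment]
        # Drop samples that have aged out of the rolling window.
        while window and now - window[0][0] > self.window:
            window.popleft()
        if not window:
            return None
        return median(latency for _, latency in window)

    def pick(self, deployments: list[str], now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now

        def score(name: str) -> float:
            value = self.p50(name, now)
            # Assumption: no recent data counts as 0 ms so the deployment is tried.
            return 0.0 if value is None else value

        return min(deployments, key=score)
```

Using the median rather than the mean keeps one slow outlier from pushing a healthy deployment out of rotation.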

classifier

The marquee strategy. Each prompt is scored for complexity in microseconds and dropped into one of three buckets:

| Bucket | Score range | Routed to (typical config) |
|---|---|---|
| simple | 0.00 – 0.33 | Mini-tier model (gpt-4o-mini, gemini-flash, …) |
| medium | 0.33 – 0.66 | Mid-tier model (gpt-4o, claude-sonnet, …) |
| complex | 0.66 – 1.00 | Flagship (gpt-5, claude-opus, …) |

Configure the bucket-to-deployment mapping in your alias:

- model_alias: 'auto'
  deployments:
    - provider: openai
      model: gpt-4o-mini
      tier: simple
    - provider: openai
      model: gpt-4o
      tier: medium
    - provider: openai
      model: gpt-5
      tier: complex

The classifier looks at long inputs, code blocks, math, and explicit reasoning verbs (reason, derive, compare), the same set of signals the public live demo uses. The signals are deterministic and fast to compute (sub-microsecond on warm classifiers), and you can override the bucket thresholds with router.classifier.thresholds.
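
To make the score-to-bucket step concrete, here is a small Python sketch. It covers only the mapping from a score to a tier and a model, not the scoring itself. The function names are hypothetical, and sending a score that sits exactly on a boundary to the higher tier is an assumption, since the table's ranges share their endpoints.

```python
from bisect import bisect_right

# Upper bounds of the simple and medium buckets, taken from the table above.
DEFAULT_THRESHOLDS = (0.33, 0.66)
TIERS = ("simple", "medium", "complex")

# Mirrors the `auto` alias config: tier -> model.
TIER_TO_MODEL = {
    "simple": "gpt-4o-mini",
    "medium": "gpt-4o",
    "complex": "gpt-5",
}


def bucket_for(score: float, thresholds: tuple = DEFAULT_THRESHOLDS) -> str:
    """Return the tier name for a complexity score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    # Assumption: a score exactly on a boundary goes to the higher tier.
    return TIERS[bisect_right(thresholds, score)]


def model_for(score: float, thresholds: tuple = DEFAULT_THRESHOLDS) -> str:
    """Return the model the `auto` alias would use for this score."""
    return TIER_TO_MODEL[bucket_for(score, thresholds)]
```

Under these assumptions, a prompt scored 0.30 lands in simple with the default boundaries, and in medium if the first boundary is lowered to 0.25.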

Failover and retries

Every strategy is wrapped in a retry budget:

router:
  retries: 2          # how many additional deployments to try
  retry_after_ms: 200 # backoff between attempts

A request that fails (5xx, timeout, or a provider rate-limit 429) is automatically retried against the next deployment in the alias's list. The client sees a single response; if any retries happened, it carries the header x-gateway-retries: N.

If every deployment fails, the gateway returns 502 Bad Gateway with the upstream error in the body, and increments gatewayllm_router_exhaustion_total so you can alert on it.
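
The failover behaviour described above can be sketched in Python like this. It is a simplified illustration, not the gateway's code: route_with_failover, UpstreamError, and GatewayExhausted are hypothetical names, the deployments list is assumed to start with the strategy's first pick, and stopping early when the list is shorter than the retry budget is an assumption.

```python
import time
from typing import Callable, Optional


class UpstreamError(Exception):
    """A failed upstream call, carrying the provider's HTTP status."""

    def __init__(self, status: int, body: str = ""):
        super().__init__(f"upstream returned {status}")
        self.status = status
        self.body = body


class GatewayExhausted(Exception):
    """Every attempted deployment failed; maps to 502 Bad Gateway."""

    status = 502

    def __init__(self, upstream: Optional[Exception]):
        super().__init__("all deployments failed")
        self.upstream = upstream


def is_retryable(exc: Exception) -> bool:
    # Retryable failures per the text: timeout, 429, or any 5xx.
    if isinstance(exc, TimeoutError):
        return True
    return isinstance(exc, UpstreamError) and (exc.status == 429 or 500 <= exc.status <= 599)


def route_with_failover(
    deployments: list[str],
    send: Callable[[str], str],
    retries: int = 2,
    retry_after_ms: int = 200,
    sleep: Callable[[float], None] = time.sleep,
):
    """Try up to retries + 1 deployments in order; return (response, retries_used)."""
    last_error: Optional[Exception] = None
    for attempt, deployment in enumerate(deployments[: retries + 1]):
        if attempt > 0:
            sleep(retry_after_ms / 1000)  # backoff between attempts
        try:
            return send(deployment), attempt
        except (UpstreamError, TimeoutError) as exc:
            if not is_retryable(exc):
                raise  # e.g. a 4xx other than 429 is not retried
            last_error = exc
    raise GatewayExhausted(last_error)
```

The second element of the returned tuple corresponds to the N in x-gateway-retries, and the point where GatewayExhausted is raised is where a counter such as gatewayllm_router_exhaustion_total would be incremented.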

Per-team overrides (with virtual keys)

A virtual key can pin or restrict the router's choices:

curl -X POST http://localhost:8080/admin/keys \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "compliance-team",
    "models": ["gpt-4o", "claude-sonnet"],
    "router": { "force_strategy": "least_latency" }
  }'

That key can only call those two aliases, and the router runs least_latency for it specifically — independently of the global default.
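
As a rough illustration of how such a key could be evaluated at request time, here is a Python sketch. The VirtualKey class and both helper functions are hypothetical; only the field names (name, models, force_strategy) come from the request body above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VirtualKey:
    name: str
    models: list[str]                     # aliases this key may call
    force_strategy: Optional[str] = None  # mirrors "router.force_strategy"


def authorize(key: VirtualKey, alias: str) -> bool:
    """A key may only call the aliases listed in its `models` field."""
    return alias in key.models


def effective_strategy(key: VirtualKey, global_strategy: str) -> str:
    """A forced strategy on the key wins over the global default."""
    return key.force_strategy or global_strategy


if __name__ == "__main__":
    key = VirtualKey("compliance-team", ["gpt-4o", "claude-sonnet"], "least_latency")
    print(authorize(key, "gpt-4o"), effective_strategy(key, "classifier"))
```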

Observability

Every routing decision lands in two places:

  • Two response headers: x-gateway-decision: openai/gpt-4o-mini and x-gateway-route-bucket: simple.
  • A Prometheus counter: gatewayllm_smart_route_decisions_total{bucket="simple",override="false"}.

Hit /metrics to see the live distribution; sum it over a day to see how much traffic is hitting your cheap tier vs your flagship. That ratio is your savings number.
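
The ratio can be computed from the /metrics output with a few lines of Python. The sketch below parses the Prometheus text format for the counter named above. The sample values are invented for illustration, and bucket_shares is a hypothetical helper. Because Prometheus counters are cumulative, a one-day figure would come from the difference between two scrapes taken a day apart; this sketch works on a single scrape.

```python
import re
from collections import Counter

# Made-up sample of /metrics output; real values will differ.
SAMPLE_METRICS = """\
# TYPE gatewayllm_smart_route_decisions_total counter
gatewayllm_smart_route_decisions_total{bucket="simple",override="false"} 700
gatewayllm_smart_route_decisions_total{bucket="medium",override="false"} 200
gatewayllm_smart_route_decisions_total{bucket="complex",override="false"} 90
gatewayllm_smart_route_decisions_total{bucket="complex",override="true"} 10
"""

LINE = re.compile(r'^gatewayllm_smart_route_decisions_total\{([^}]*)\}\s+([0-9.eE+-]+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def bucket_shares(metrics_text: str) -> dict[str, float]:
    """Return each bucket's share of all routing decisions (0.0 to 1.0)."""
    totals = Counter()
    for line in metrics_text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # comments and other metrics are ignored
        labels = dict(LABEL.findall(match.group(1)))
        totals[labels.get("bucket", "unknown")] += float(match.group(2))
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {bucket: count / grand_total for bucket, count in totals.items()}


if __name__ == "__main__":
    for bucket, share in bucket_shares(SAMPLE_METRICS).items():
        print(f"{bucket}: {share:.0%}")
```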

Tuning advice

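  • Start in the live demo, not in production. Drop in some real prompts, see which bucket they land in. If the classifier downgrades something it shouldn't, adjust router.classifier.thresholds so that prompt's score falls in the higher bucket (with the score ranges above, that means lowering the boundary just above its score).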
  • Use shadow mode during rollout (router.classifier.shadow: true). The classifier still decides; the request still goes to your old default. You compare deltas in the dashboard before swapping.
  • Pin sensitive routes with a virtual key that locks models: to a single alias. Customer-facing legal copy never has to share a router decision with internal classification jobs.

Stuck or want a feature? Email the founders directly at mitshawtechnologies@gmail.com. We answer fast.