How Smart Routing Cuts LLM Costs by 40–70% Without Quality Loss
A walkthrough of prompt-complexity classification, tier-based routing, and the worked token math that turns a $40k OpenAI bill into a $14k one — with no measurable change in output quality.
Gateway-LLM team · 9 min read
TL;DR
On most LLM bills, 40–70% of the spend is avoidable, because flagship models are answering prompts that a mini-tier model would handle just as well. Smart routing scores each prompt in under a microsecond, drops it into a complexity bucket (simple / medium / complex), and dispatches to the cheapest tier that clears your quality bar. This post walks through:
- The prompt-classifier features that actually predict complexity.
- The token math that explains where the savings come from.
- A full worked example: a $40k/month bill becoming $14k.
- How to ship it safely with shadow mode and a rollback plan.
The math problem you didn't realise you had
Pretend a single team sends 10 million chat completions to GPT-4o in a month. Average prompt: 800 tokens in, 200 tokens out. At 2026 OpenAI pricing ($2.50 / $10 per million tokens, in / out), that's:
prompt_cost = 10M × 800 × $2.50 / 1M = $20,000
output_cost = 10M × 200 × $10 / 1M = $20,000
total = $40,000 / month
Now look at what those 10M requests actually contained. Roughly:
- 70% are short, predictable work: extract a date, classify a sentiment, summarise a paragraph. A mini-tier model (gpt-4o-mini at $0.15 / $0.60) handles them indistinguishably.
- 20% are mid-difficulty: drafting a paragraph of marketing copy, paraphrasing, light reasoning. A mid-tier model (gpt-4o) is the right call.
- 10% are genuinely hard: multi-step reasoning, code review, math, long-context summarisation. Flagship-only.
Pin every request to GPT-4o and that 70% of trivial work pays $0.004/request (800 × $2.50/1M in plus 200 × $10/1M out) when it could pay $0.00024 on gpt-4o-mini. That's a roughly 17x markup on prompts that don't need flagship inference, repeated across 7 million requests.
Re-route per-tier and the bill becomes:
| Bucket | Volume | Per-request cost | Subtotal |
|---|---|---|---|
| simple → gpt-4o-mini | 7,000,000 | $0.00024 | $1,680 |
| medium → gpt-4o | 2,000,000 | $0.004 | $8,000 |
| complex → gpt-4o (kept) | 1,000,000 | $0.004 | $4,000 |
| Total | 10,000,000 | | $13,680 |
$13.7k vs $40k. 66% reduction. No SDK changes, no code rewrite.
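The arithmetic is easy to reproduce. The sketch below recomputes both bills from the average token counts and list prices quoted earlier; the type and function names (Price, Slice, MonthlyBill) are illustrative and not part of Gateway-LLM, and it assumes every request has the same average size.

```go
package main

import "fmt"

// Price is a list price in USD per one million tokens.
type Price struct {
	In, Out float64
}

// PerRequest returns the USD cost of one call with the given token counts.
func (p Price) PerRequest(inTokens, outTokens int) float64 {
	return (float64(inTokens)*p.In + float64(outTokens)*p.Out) / 1e6
}

// Slice is a share of monthly traffic served at one price.
type Slice struct {
	Bucket   string
	Requests int
	Price    Price
}

// MonthlyBill sums volume times per-request cost over all slices,
// assuming every request has the same average token counts.
func MonthlyBill(slices []Slice, inTokens, outTokens int) float64 {
	total := 0.0
	for _, s := range slices {
		total += float64(s.Requests) * s.Price.PerRequest(inTokens, outTokens)
	}
	return total
}

// Prices as quoted in this post (USD per 1M tokens, in / out).
var (
	gpt4o     = Price{In: 2.50, Out: 10.00}
	gpt4oMini = Price{In: 0.15, Out: 0.60}
)

func main() {
	pinned := []Slice{{"all", 10_000_000, gpt4o}}
	routed := []Slice{
		{"simple", 7_000_000, gpt4oMini},
		{"medium", 2_000_000, gpt4o},
		{"complex", 1_000_000, gpt4o},
	}
	before := MonthlyBill(pinned, 800, 200)
	after := MonthlyBill(routed, 800, 200)
	fmt.Printf("pinned: $%.0f  routed: $%.0f  saved: %.0f%%\n",
		before, after, 100*(1-after/before))
}
```

With these inputs the routed bill comes out at about $13,680 against $40,000, which is the $14k figure in the headline. Swap in your own bucket mix and token averages to see how sensitive the result is.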
Real production traffic is messier than this — averages don't represent every request, and some teams have a much more skewed bucket mix. The savings range we see across replayed customer traces lands at 40–70%, which is the headline number we're comfortable putting on the homepage.
The classifier: what actually predicts complexity
A useful prompt classifier doesn't need to be a model. The features that empirically predict whether a prompt needs a flagship are deterministic and cheap to compute:
| Signal | Score weight | Why it matters |
|---|---|---|
| Prompt length (runes) | up to 0.55 | Long inputs require holding more context; flagships handle that better. |
| Code blocks (```) | +0.15 | Code reasoning benefits from the larger model. |
| Math notation ($$, LaTeX, "derivative", "integral") | +0.20 | Math needs strong symbolic reasoning. |
| Reasoning verbs ("derive", "compare", "step by step", "prove") | +0.20 | Explicit chain-of-thought asks for flagship-level inference. |
Sum the weights, clamp to [0, 1], drop into a bucket:
- score < 0.33 → simple
- 0.33 ≤ score < 0.66 → medium
- score ≥ 0.66 → complex
Run that on a real request and you get a decision in under a microsecond. The whole thing is ~80 lines of Go and lives in backend/internal/smartroute/smartroute.go in our open-source repo. There's no model call. There's no embedding lookup. It's a feature scorer.
You can absolutely ship more sophisticated classifiers — small distilled BERTs running on CPU, in-house tuned scorers. The 40–70% savings number above comes from this simple deterministic scorer; the marginal gain from more complex classifiers, in our measurements, is small.
"Won't this hurt my output quality?"
This is the right question. The answer has three parts.
Part 1: most prompts genuinely don't need flagship. Output quality on "extract this date", "classify this support ticket as billing/technical/other", or "summarise this email in two sentences" is functionally identical between gpt-4o-mini and gpt-4o. If you don't believe us, run the comparison yourself for a hundred prompts. (Most teams have, which is how the 40–70% number became consensus.)
Part 2: the classifier is conservative. Ambiguous cases are routed up, not down. The threshold for simple → medium is set at 0.33 by default and is config-tunable. Errors of the form "should have been complex but was routed to mini" are rare; errors of the form "could have been mini but was routed to medium" are tolerated.
Part 3: shadow mode is mandatory before you ship. Run the classifier in shadow: true for a week. The classifier still decides; the request still goes to your old default. Compare deltas in your dashboard:
- LLM-judge similarity score: do mini-tier and flagship produce equivalent outputs on simple prompts?
- Human-eval sample: does the QA team see any difference?
- Downstream metrics: did task-completion rate, user satisfaction, support ticket volume change?
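If any of those deltas are concerning on a bucket, tighten that bucket's score cut-off so fewer prompts land in it (for simple, that means lowering the 0.33 boundary so borderline prompts route up). If they're not, flip shadow: false and ship.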
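One way to turn the shadow-week comparison into a go/no-go signal is to average the judge score per bucket and flag any bucket that falls under a bar. The sketch below assumes a similarity score in [0, 1] and a bar you choose yourself; ShadowSample, MeanSimilarity, and BucketsToHold are hypothetical names, and the sample values in main are made up purely for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// ShadowSample is one replayed request: the bucket the classifier chose and
// an LLM-judge similarity in [0, 1] between the cheap-tier output and the
// output of the old default model.
type ShadowSample struct {
	Bucket     string
	Similarity float64
}

// MeanSimilarity averages the judge score per bucket.
func MeanSimilarity(samples []ShadowSample) map[string]float64 {
	sum := map[string]float64{}
	n := map[string]int{}
	for _, s := range samples {
		sum[s.Bucket] += s.Similarity
		n[s.Bucket]++
	}
	mean := make(map[string]float64, len(sum))
	for b, total := range sum {
		mean[b] = total / float64(n[b])
	}
	return mean
}

// BucketsToHold returns, in sorted order, the buckets whose mean similarity
// falls below bar. Those are the ones to keep in shadow.
func BucketsToHold(samples []ShadowSample, bar float64) []string {
	var hold []string
	for b, m := range MeanSimilarity(samples) {
		if m < bar {
			hold = append(hold, b)
		}
	}
	sort.Strings(hold)
	return hold
}

func main() {
	// Made-up scores, for illustration only.
	samples := []ShadowSample{
		{"simple", 0.98}, {"simple", 0.96},
		{"medium", 0.70}, {"medium", 0.80},
	}
	fmt.Println("keep in shadow:", BucketsToHold(samples, 0.90))
}
```

A mean hides outliers, so in practice you would probably also look at the worst few samples per bucket before flipping anything live.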
A rollout plan that won't get you fired
# config.yaml — week 1, shadow mode
router:
  strategy: classifier
  shadow: true
Week 1: classifier runs but doesn't dispatch. You collect routing decisions in your spend log and compare them to actual model assignments offline.
# week 2, simple bucket only
router:
  strategy: classifier
  shadow: false
  buckets:
    simple: live
    medium: shadow
    complex: shadow
Week 2: only simple prompts route. They're 70% of traffic and the lowest risk. You see real savings, and medium/complex keep going to your old default while you monitor.
# week 3, full live
router:
  strategy: classifier
  shadow: false
Week 3: full live. Continue monitoring the dashboard. If a regression shows up, flip shadow on for the offending bucket and investigate.
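The three configs differ only in which requests are actually dispatched to the classifier's pick. A rough sketch of that decision, assuming a global shadow switch plus optional per-bucket modes as in the YAML above; Router, Decision, and the model names are hypothetical, not Gateway-LLM's internal types.

```go
package main

import "fmt"

// Mode is the per-bucket rollout state.
type Mode string

const (
	Live   Mode = "live"
	Shadow Mode = "shadow"
)

// Router mirrors the rollout config: a global shadow switch plus optional
// per-bucket modes. A bucket with no entry is treated as live.
type Router struct {
	Shadow       bool
	Buckets      map[string]Mode
	TierModel    map[string]string // bucket -> model the classifier would pick
	DefaultModel string            // the old pinned default
}

// Decision records what the classifier proposed and what actually served.
type Decision struct {
	Bucket   string
	Proposed string
	Served   string
}

// Dispatch always records the proposal, but only serves it when neither the
// global switch nor the bucket's own mode is shadow.
func (r Router) Dispatch(bucket string) Decision {
	d := Decision{Bucket: bucket, Proposed: r.TierModel[bucket], Served: r.DefaultModel}
	if r.Shadow || r.Buckets[bucket] == Shadow || d.Proposed == "" {
		return d
	}
	d.Served = d.Proposed
	return d
}

func main() {
	week2 := Router{
		Buckets:      map[string]Mode{"simple": Live, "medium": Shadow, "complex": Shadow},
		TierModel:    map[string]string{"simple": "cheap-model", "medium": "mid-model", "complex": "flagship-model"},
		DefaultModel: "flagship-model",
	}
	for _, b := range []string{"simple", "medium", "complex"} {
		fmt.Printf("%+v\n", week2.Dispatch(b))
	}
}
```

Because the proposal is recorded even in shadow, the spend log keeps the data needed for the offline comparison, and rolling back is a config flip rather than a deploy.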
The whole rollout (design, shadow, partial, full) is two weeks. The savings on ten million prompts/month at the numbers above is roughly $26,000/month, the difference between the $40k bill and the roughly $14k routed one. The cost is two weeks of one engineer's attention.
The headers that prove it's working
Every routed request carries three response headers:
- X-Gateway-Decision: openai/gpt-4o-mini (which deployment served the call)
- X-Gateway-Route-Bucket: simple (what the classifier picked)
- X-Gateway-Cost-Usd: 0.000099 (the metered cost of this call)
Plus the SSE meta event includes caller, routed, overridden, classifier score and bucket. You can trace any single request end-to-end and explain why it cost what it cost.
In Prometheus, the live aggregates are:
- gatewayllm_smart_route_decisions_total{bucket, override}: how often each bucket fired.
- gatewayllm_cost_usd_total{model, provider}: running spend per deployment.
Sum the second one over a day, divide by what you'd have spent on the same traffic at flagship-only pricing, and subtract the result from 1: that's your savings ratio. You can put it in a Grafana stat panel and stop arguing about it in finance reviews.
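Written out, the savings ratio is one minus actual spend over the flagship-only baseline. A tiny sketch, with SavingsRatio as a hypothetical helper and the headline $14k and $40k figures as inputs:

```go
package main

import "fmt"

// SavingsRatio returns 1 - actual/baseline, the fraction of the
// flagship-only bill that routing avoided. It returns 0 when the
// baseline is not positive, to avoid dividing by zero on idle days.
func SavingsRatio(actualUSD, baselineUSD float64) float64 {
	if baselineUSD <= 0 {
		return 0
	}
	return 1 - actualUSD/baselineUSD
}

func main() {
	fmt.Printf("savings: %.0f%%\n", 100*SavingsRatio(14000, 40000))
}
```

The baseline is the part you have to estimate: it is the same token volume priced at the flagship rate, not a number the gateway observes directly.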
Multi-provider smart routing
The classifier is provider-agnostic. It picks a tier; the gateway maps the tier onto your deployment list. Some patterns that work in production:
- model_alias: 'auto'
  deployments:
    - provider: groq
      model: llama-3.3-70b
      api_key_env: GROQ_API_KEY
      tier: simple # free / very fast
    - provider: anthropic
      model: claude-haiku
      api_key_env: ANTHROPIC_API_KEY
      tier: simple # cheap fallback
    - provider: openai
      model: gpt-4o
      api_key_env: OPENAI_API_KEY
      tier: medium
    - provider: openai
      model: gpt-5
      api_key_env: OPENAI_API_KEY
      tier: complex
Now simple prompts go to free Llama on Groq with Haiku as fallback; medium goes to GPT-4o; complex goes to flagship GPT-5. That's not a 40–70% cost reduction anymore — that's an 80–95% reduction on the simple bucket.
When not to smart-route
- Strictly latency-bound apps. If you're chasing a 200 ms total response time, sometimes the cheapest model is also the slowest (Anthropic Haiku is great in cost, less great in TTFT compared to OpenAI). In that case, run the least_latency strategy and pin the alias to a single fast deployment.
- Tool-using agents that depend on a specific model's behaviour. If your tool-calling chain is tuned to GPT-4o's tool format, don't have it suddenly route to Gemini. Pin those calls to a non-routed alias and only smart-route the prompts that aren't part of an agent loop.
- Hand-tuned prompts that lean on flagship reasoning. Your golden retrieval-augmented system prompt that uses 8k tokens of in-context examples is, by classifier definition, complex. Trust the classifier: it'll keep that one on the flagship.
Try it
Gateway-LLM ships with the classifier above pre-wired and configurable. The Quickstart is a five-minute Docker setup; the live demo on the homepage runs the same classifier against real prompts and streams real OpenAI responses.
Run it locally → /docs/quickstart
Tune the router → /docs/smart-routing
See the savings live → the Try the live router section on the homepage
Related reading
- What is an LLM gateway? — the architectural primer.
- Multi-provider LLM failover — what happens when the cheap deployment goes down.
- LiteLLM vs Gateway-LLM — comparing the two most common router implementations.
FAQ
How does smart routing reduce LLM costs?
By routing easy prompts (which are most prompts) to cheap mini-tier models and saving flagship models for genuinely hard ones. Replayed customer traces show 40–70% savings.
Will smart routing hurt my output quality?
Not on prompts that genuinely don't need a flagship — and most don't. The classifier is conservative; ambiguous cases route up. Shadow mode lets you measure the delta before flipping live.
How fast is the routing decision?
Sub-microsecond. The classifier is a deterministic feature scorer, not a model call.
Does smart routing work across providers?
Yes. It picks a tier, you map the tier to whichever deployments you've registered.
Run Gateway-LLM in five minutes.
Open-source, OpenAI-compatible, and Apache 2.0. The Quickstart is a single docker compose up.