Virtual API Keys: How to Govern LLM Spend Across Teams
A practical guide to issuing per-team virtual API keys with rate limits, model allowlists, and hard budget caps — without ever rotating your provider credentials.
Gateway-LLM team · 6 min read
TL;DR
A virtual API key is a short, revocable token your application uses to call the gateway. It sits in front of your provider credentials, so a leaked key is a five-second revocation, not a five-hour key rotation. Each one carries a rate limit (RPM and TPM), a model allowlist, and an optional monthly USD budget. This post covers how to design them, how to scope them per team, and the spend-governance patterns that come naturally once you have them.
The problem they solve
Without a gateway, your three engineering teams share OPENAI_API_KEY. When a developer pushes it to a public GitHub repo at 11 PM:
- You revoke OPENAI_API_KEY.
- Every running service that holds that key starts failing.
- You generate a new one in OpenAI's dashboard.
- You distribute it to every service via your secret manager.
- You wait for re-deploys to pick it up.
- You answer Slack messages from the team that was in the middle of a launch.
This takes hours and you've lost product time. The leak is a people problem, but the blast radius is an architectural problem.
Virtual API keys turn those 11 PM hours into a five-second DELETE /admin/keys/vk_…. Provider credentials don't change. Other services keep running. The leaked key returns 401 immediately for anything still trying to use it.
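As a rough illustration of that revocation call, here is a minimal Python sketch that builds (but does not send) the DELETE request. The key id `vk_example` and the master-key placeholder are hypothetical, and the Bearer header on the DELETE endpoint is an assumption carried over from the key-creation example later in this post.

```python
import urllib.request


def build_revoke_request(base_url: str, key_id: str, master_key: str) -> urllib.request.Request:
    """Build the DELETE request that revokes one virtual key (nothing is sent here)."""
    return urllib.request.Request(
        url=f"{base_url}/admin/keys/{key_id}",
        method="DELETE",
        # Assumption: admin endpoints accept the master key as a Bearer token.
        headers={"Authorization": f"Bearer {master_key}"},
    )


# Hypothetical values for illustration only.
req = build_revoke_request("http://localhost:8080", "vk_example", "master-key-placeholder")
print(req.get_method(), req.full_url)
```

Against a running gateway, passing `req` to `urllib.request.urlopen` would send it.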
Anatomy of a virtual key
{
"id": "vk_01HZX...",
"key": "gw_virt_a8f2...e7c1",
"name": "checkout-service",
"team": "payments",
"models": ["gpt-4o-mini", "gpt-4o"],
"rpm": 120,
"tpm": 200000,
"monthly_budget_usd": 50,
"router": { "force_strategy": "least_latency" },
"created_at": "2026-04-29T12:00:00Z"
}
Five things to understand:
- key: the bearer token your app sends. Hashed at rest; shown once at creation.
- models: the allowlist. The key gets 403 Forbidden if it tries to call an alias not in this list.
- rpm / tpm: requests-per-minute / tokens-per-minute. The first ceiling hit returns 429.
- monthly_budget_usd: hard cap. Once exceeded, every call returns 429 with X-Gateway-Limit-Kind: budget until the next month.
- router: optional per-key router overrides. Lets a compliance team get least_latency even if the global default is classifier.
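To make those status codes concrete, here is a small Python sketch of how a gateway might evaluate one request against a key. It is not the gateway's actual implementation: the `VirtualKey` class, the `check` function, the order of the checks, and the empty header set on rate-limit rejections are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class VirtualKey:
    models: List[str]
    rpm: int
    tpm: int
    monthly_budget_usd: Optional[float] = None


def check(key: VirtualKey, model: str, requests_this_minute: int,
          tokens_this_minute: int, spent_this_month_usd: float) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a request; counters describe usage before this call."""
    if model not in key.models:
        return 403, {}  # alias is not on the allowlist
    if key.monthly_budget_usd is not None and spent_this_month_usd > key.monthly_budget_usd:
        return 429, {"X-Gateway-Limit-Kind": "budget"}  # hard cap exceeded
    if requests_this_minute >= key.rpm or tokens_this_minute >= key.tpm:
        return 429, {}  # whichever rate ceiling is hit first
    return 200, {}


# Limits copied from the checkout-service example above.
checkout = VirtualKey(models=["gpt-4o-mini", "gpt-4o"], rpm=120, tpm=200000, monthly_budget_usd=50)
print(check(checkout, "claude-sonnet", 0, 0, 0.0))
print(check(checkout, "gpt-4o", 10, 1000, 50.01))
```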
Designing keys per team
The pattern that holds up across team sizes:
One key per service, not per developer
Services have stable identities; developers churn. If checkout-service and notifications-service share a key, you can't tell which one is responsible for a spike. Keep them separate from day one.
Use the team tag for cost reporting
Spend reports are most useful aggregated by team — it answers "who needs to optimise" without naming individuals. Tag every service-key with the owning team. Multiple keys can share a team tag.
curl -X POST http://localhost:8080/admin/keys \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ml-platform-batch",
    "team": "ml-platform",
    "models": ["embed", "summarize"],
    "rpm": 60,
    "monthly_budget_usd": 200
  }'
Lock models to the cheapest viable alias
Async / batch jobs (embeddings, summaries, classifications) almost always belong on a cheap, specifically-tuned alias. Customer-facing services are the only ones that need flagship access. Restricting the allowlist on batch keys prevents accidental six-figure mistakes.
Set a budget even when you trust the team
The cap is a fire alarm, not a financial penalty. Setting monthly_budget_usd: 200 on a job that normally spends $40 means a runaway loop pages you at $200, not at $20,000. The team that owns the key gets a cost-overrun alert and decides whether to raise the cap or fix the bug.
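The fire-alarm framing can be put into numbers. The sketch below estimates how long a runaway loop would take to hit the cap if it ran flat out at the key's RPM limit. The function name and the $0.01 cost per request are hypothetical; the $200 budget, $40 normal spend, and 60 RPM come from the examples above.

```python
def minutes_to_cap(budget_usd: float, spent_usd: float, rpm: int, cost_per_request_usd: float) -> float:
    """Minutes until the budget cap is reached if every minute uses the full RPM limit."""
    remaining = max(budget_usd - spent_usd, 0.0)
    burn_per_minute = rpm * cost_per_request_usd
    if burn_per_minute <= 0:
        return float("inf")  # nothing is being spent, so the cap is never reached
    return remaining / burn_per_minute


# $200 cap, $40 already spent, 60 RPM, and an assumed $0.01 per request.
eta = minutes_to_cap(200, 40, 60, 0.01)
print(f"cap reached after about {eta:.0f} minutes")
```

Under these assumed numbers the loop is cut off after a few hours of worst-case burn, with the loss bounded at the remaining $160.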
Concrete team archetypes
Three patterns we see most often:
Customer-facing service
{
"name": "support-chatbot",
"team": "customer-success",
"models": ["auto"],
"rpm": 600,
"tpm": 1000000,
"monthly_budget_usd": 5000,
"router": { "force_strategy": "classifier" }
}
High RPM (we expect spiky human traffic), generous budget (this is revenue-generating), classifier strategy (customer prompts span a wide range of difficulty, so smart routing pays off).
Internal automation
{
"name": "support-ticket-classifier",
"team": "data",
"models": ["gpt-4o-mini"],
"rpm": 30,
"tpm": 80000,
"monthly_budget_usd": 100
}
Low RPM (this runs on a cron), tight budget (it's batch work, not revenue), pinned to gpt-4o-mini because we already know mini-tier is sufficient for the task.
Compliance / regulated workload
{
"name": "legal-redaction",
"team": "legal",
"models": ["claude-sonnet"],
"rpm": 20,
"monthly_budget_usd": 800,
"router": {
"force_strategy": "least_latency",
"allow_providers": ["anthropic"]
}
}
Locked to a single alias and a single provider (compliance contracts often forbid certain vendors), conservative router (don't classifier-route around the contractually-mandated provider), modest RPM.
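One way to picture the allow_providers restriction is as a filter applied to candidate deployments before the router picks one. This Python sketch assumes deployments are named "provider/model", as in the audit row shown later in this post; the function name and the `bedrock/claude-sonnet` candidate are hypothetical.

```python
from typing import List, Optional


def allowed_deployments(candidates: List[str], allow_providers: Optional[List[str]]) -> List[str]:
    """Keep only deployments whose provider prefix is on the key's provider allowlist."""
    if not allow_providers:
        return list(candidates)  # no restriction configured on this key
    allowed = set(allow_providers)
    return [d for d in candidates if d.split("/", 1)[0] in allowed]


# Hypothetical candidate list for the claude-sonnet alias.
candidates = ["anthropic/claude-sonnet", "bedrock/claude-sonnet"]
print(allowed_deployments(candidates, ["anthropic"]))
```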
Day 30 governance review
After thirty days of running on virtual keys, do a sweep:
- Which keys hit their budget cap? Either the cap was wrong (raise it) or the spend pattern is wrong (investigate). Both deserve attention.
- Which keys never came close to their RPM? Lower the limit. Tight RPM caps are early-warning systems for runaway loops.
- Which keys are still on models: ["*"]? Pin them. The audit trail gets far more useful when every key has a tight allowlist.
- Which teams have multiple keys with the same team tag? That's fine. It's how you should be set up.
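The sweep above can be scripted once key definitions and thirty days of usage are available. The following Python sketch is one possible shape: the usage field names, the "peak below half the limit" threshold, the usage figures, and the `scratch` key are all assumptions for illustration, not gateway features.

```python
from typing import Dict, List, Tuple


def review(keys: List[dict], usage: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Flag keys that hit their budget, have slack in their RPM limit, or use a wildcard allowlist."""
    findings = []
    for key in keys:
        stats = usage.get(key["name"], {})
        budget = key.get("monthly_budget_usd")
        if budget is not None and stats.get("spend_usd", 0.0) >= budget:
            findings.append((key["name"], "hit budget cap"))
        # Assumed threshold: peak traffic under half the limit means the cap can be tightened.
        if stats.get("peak_rpm", 0) < 0.5 * key["rpm"]:
            findings.append((key["name"], "rpm limit has slack"))
        if "*" in key["models"]:
            findings.append((key["name"], "wildcard allowlist"))
    return findings


keys = [
    {"name": "support-chatbot", "models": ["auto"], "rpm": 600, "monthly_budget_usd": 5000},
    {"name": "support-ticket-classifier", "models": ["gpt-4o-mini"], "rpm": 30, "monthly_budget_usd": 100},
    {"name": "scratch", "models": ["*"], "rpm": 60, "monthly_budget_usd": None},  # hypothetical key
]
usage = {  # made-up thirty-day figures
    "support-chatbot": {"spend_usd": 5003.12, "peak_rpm": 540},
    "support-ticket-classifier": {"spend_usd": 38.0, "peak_rpm": 4},
    "scratch": {"spend_usd": 3.0, "peak_rpm": 55},
}
findings = review(keys, usage)
for name, issue in findings:
    print(f"{name}: {issue}")
```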
Auditing
Every request through a virtual key writes a structured audit row:
{
"request_id": "req_01HZ...",
"key_id": "vk_01HZX...",
"team": "payments",
"model_alias": "gpt-4o-mini",
"deployment": "openai/gpt-4o-mini",
"prompt_tokens": 312,
"completion_tokens": 81,
"cost_usd": 0.000099,
"duration_ms": 412,
"ts": "2026-04-29T12:34:56Z"
}
Stream those rows to Snowflake / BigQuery / your warehouse via the observability callbacks, and you can answer any spend question with SQL. "What did the payments team spend last month, broken down by alias?" is a GROUP BY team, model_alias, date_trunc('month', ts).
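To show the shape of that query without a warehouse, here is a Python sketch that runs it against an in-memory SQLite table. The table name `audit` and every row except the first are made up, and `substr(ts, 1, 7)` stands in for `date_trunc('month', ts)`, which SQLite does not provide.

```python
import sqlite3

# (team, model_alias, cost_usd, ts); only the first row mirrors the audit example above.
rows = [
    ("payments", "gpt-4o-mini", 0.000099, "2026-04-29T12:34:56Z"),
    ("payments", "gpt-4o-mini", 0.000120, "2026-04-30T08:00:00Z"),
    ("payments", "gpt-4o", 0.004100, "2026-04-30T09:15:00Z"),
    ("data", "gpt-4o-mini", 0.000050, "2026-05-01T00:00:05Z"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit (team TEXT, model_alias TEXT, cost_usd REAL, ts TEXT)")
conn.executemany("INSERT INTO audit VALUES (?, ?, ?, ?)", rows)

# Spend per team, alias, and calendar month (the first 7 characters of an ISO timestamp).
report = conn.execute(
    """
    SELECT team, model_alias, substr(ts, 1, 7) AS month, SUM(cost_usd) AS spend_usd
    FROM audit
    GROUP BY team, model_alias, substr(ts, 1, 7)
    ORDER BY team, model_alias, substr(ts, 1, 7)
    """
).fetchall()

for team, alias, month, spend in report:
    print(f"{team:10} {alias:12} {month} ${spend:.6f}")
```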
Try it
If you're already running Gateway-LLM, virtual keys are in the admin UI under Keys → Create. If you're not, the Quickstart gets you running in five minutes; the second curl example after that creates your first virtual key.
Provider keys stay in your secret manager. Virtual keys live in Postgres, scoped, observable, and revocable. That's the whole shape.
Related reading
- What is an LLM gateway? — the category primer.
- Smart routing for cost savings — pair virtual keys with smart routing for tighter cost control.
- Multi-provider LLM failover — how virtual keys interact with cross-provider routing.
FAQ
What are virtual API keys for LLMs?
Short, scoped, revocable tokens you issue to your apps. They sit in front of your provider credentials so a leaked virtual key never forces a provider rotation.
Why not just use OpenAI's native API keys per team?
Provider keys can't be cross-provider, model-allowlisted, or budget-capped. Virtual keys give you all of those.
How do virtual key budgets actually work?
The gateway sums per-request costs against your monthly cap and 429s once you exceed it. Default reset is calendar-monthly; rolling windows are configurable.
Run Gateway-LLM in five minutes.
Open-source, OpenAI-compatible, and Apache 2.0. The Quickstart is a single docker compose up.