Semantic Caching for LLMs: When It Works, When It Hurts

A pragmatic guide to semantic caching for LLM responses — embedding similarity thresholds, cache invalidation, and the failure modes that produce confidently-wrong answers.

Gateway-LLM team · 7 min read


TL;DR

Semantic caching keys LLM responses by an embedding of the prompt, so re-phrased duplicates hit the cache instead of the upstream model. Done right, it cuts cost 20–40% on chat-heavy workloads and shaves 200+ ms off every cache hit. Done wrong, it serves the answer to the wrong question, fluently and confidently, and you don't notice for a week.

This post covers the cases where semantic caching genuinely helps, the failure modes that make people swear off it, and a tuning playbook that sits between "off" and "panic".

What semantic caching actually is

Two layers of caching for LLMs:

  • Exact-match cache. Hash the prompt. If we've seen this exact string before, return the stored response. Free, deterministic, narrow.
  • Semantic cache. Embed the prompt with a small embedding model. Compare the embedding against stored ones; if the cosine similarity exceeds a threshold, return the stored response.
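The two layers above can be sketched as a single lookup: try the exact hash first, then fall back to cosine similarity. This is a minimal illustration, not Gateway-LLM's implementation. `toy_embed` is a bag-of-words stand-in for a real embedding model, and `TwoLayerCache` and its method names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding (word counts). A real deployment would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TwoLayerCache:
    def __init__(self, embed=toy_embed, threshold=0.93):
        self.embed = embed
        self.threshold = threshold
        self.exact = {}      # sha256(prompt) -> response
        self.semantic = []   # list of (embedding, response)

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt, response):
        self.exact[self._key(prompt)] = response
        self.semantic.append((self.embed(prompt), response))

    def get(self, prompt):
        """Return (response, layer), where layer is 'exact', 'semantic', or 'miss'."""
        key = self._key(prompt)
        if key in self.exact:
            return self.exact[key], "exact"
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for stored_embedding, response in self.semantic:
            score = cosine(query, stored_embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Serve only when similarity exceeds the threshold.
        if best_score > self.threshold:
            return best_response, "semantic"
        return None, "miss"
```

The linear scan keeps the sketch short; at any real volume you would swap it for a vector index. Scores from the toy embedding are not comparable to scores from a real model, so the thresholds discussed in this post do not carry over to it.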

The semantic version is where the cost savings live. In a customer-support chat workload, prompts with the same intent are phrased dozens of different ways:

  • "How do I cancel my subscription?"
  • "I want to cancel"
  • "cancel my account please"
  • "How can I stop my subscription?"

All four mean the same thing. The exact-match cache treats them as four prompts; the semantic cache treats them as one. That's a 75% reduction in upstream calls for that intent.

When it works

Semantic caching pays off most on workloads with these properties:

  • Repetitive intent. The same handful of questions, phrased differently, dominate the volume. Customer support, FAQ chatbots, internal documentation Q&A.
  • Stable answers. "What's the capital of France?" doesn't change. Caching a stable answer for a year is fine.
  • Latency-sensitive UX. A 200 ms cache hit feels instant; a 2,000 ms model call feels like a wait.
  • Bounded prompt diversity. When the universe of prompts is finite, the cache hit rate climbs over time. Open-ended prompts (write me a story about X) cap out at 5–10% hit rates.

In practice, the workloads that hit 30–50% cache hit rates are FAQ bots, intent classification, embedding-based recommendation, and sales-tooling lookups.

When it hurts

The failure mode that makes engineering leads disable caching forever:

  • The threshold is set too low (0.85, say).
  • Two prompts with overlapping vocabulary but different intent end up above the threshold.
  • The cache returns the wrong answer, fluently and confidently.
  • Nobody notices because the response looks right.
  • A week later a customer complains, you investigate, and you find that you've been serving the answer to "summarise this contract" for prompts asking "translate this contract".

Three honest examples we've seen in customer audits:

| Cached prompt | New prompt | Cosine similarity | Outcome |
|---|---|---|---|
| "Summarise this contract" | "Translate this contract" | 0.91 | Wrong answer served |
| "What's the refund policy?" | "What's the cancellation policy?" | 0.94 | Sometimes right, sometimes wrong |
| "Diagnose this stack trace" | "Explain this stack trace" | 0.93 | Mostly right; missed the diagnostic asks |

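The numbers above point to a band, roughly 0.90–0.94, where false positives become real and the cache stops being safe. The default threshold in Gateway-LLM is 0.93, near the top of that band. We pick conservatively for a reason, but a pair like refund/cancellation at 0.94 still scores above it, which is why the playbook below starts in shadow mode.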
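To see how a given threshold treats the three audit examples, you can replay them against candidate values. The sketch below hard-codes the rows from the table and assumes a hit means the similarity strictly exceeds the threshold, as described earlier. `served_at` is a hypothetical helper name.

```python
# (cached prompt, new prompt, cosine similarity) taken from the audit table above.
AUDIT_PAIRS = [
    ("Summarise this contract", "Translate this contract", 0.91),
    ("What's the refund policy?", "What's the cancellation policy?", 0.94),
    ("Diagnose this stack trace", "Explain this stack trace", 0.93),
]


def served_at(threshold, pairs=AUDIT_PAIRS):
    """Return the new prompts that would be answered from cache at this threshold."""
    return [new for _cached, new, score in pairs if score > threshold]


for candidate in (0.85, 0.93, 0.95):
    print(candidate, served_at(candidate))
```

At 0.85 all three pairs are served from cache. At 0.93 only the refund/cancellation pair still gets through, and at 0.95 none do.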

A tuning playbook

1. Turn it on in shadow mode

cache:
  enabled: true
  semantic:
    enabled: true
    similarity_threshold: 0.93
    ttl_seconds: 3600
    shadow: true   # log hits, don't serve them

In shadow mode, the cache records hits in your spend log and metrics, but every request still hits the upstream model. You compare the cached response to the actual response and look for differences.
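Conceptually, shadow mode is a small change in the request path: a cache hit is recorded but never returned. The sketch below illustrates that control flow under assumptions. `lookup` and `call_upstream` are hypothetical callables, and the audit fields simply mirror the column names used in the query in step 2.

```python
def handle_request(prompt, lookup, call_upstream, audit_log, shadow=True):
    """Serve one request with an optional shadow-mode cache.

    lookup(prompt) is assumed to return (cached_response_or_None, similarity_score).
    """
    cached, score = lookup(prompt)

    # Live mode: a hit short-circuits the upstream call.
    if cached is not None and not shadow:
        return cached

    # Shadow mode or a miss: always call the upstream model.
    actual = call_upstream(prompt)

    # Shadow hit: record both responses so they can be compared later.
    if cached is not None:
        audit_log.append({
            "prompt": prompt,
            "cached_response": cached,
            "actual_response": actual,
            "similarity_score": score,
            "cache_shadow_hit": True,
        })
    return actual
```

The user always gets the upstream answer while `shadow` is true, so the only cost of the experiment is the embedding lookup and the extra log row.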

2. Sample the false-positive rate

Do this for at least a week of real traffic:

SELECT request_id, prompt, cached_response, actual_response,
       similarity_score
FROM gateway_audit
WHERE cache_shadow_hit = true
ORDER BY ts DESC
LIMIT 200;

Eyeball the rows. If any of them are wrong, raise the threshold. If they're all clean, lower the threshold by 0.01 and try again.
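The review loop can be written down as a small rule once each sampled row has a verdict. This is a sketch, assuming a reviewer has marked each row as wrong or not. The 0.01 step for raising is our assumption, since the text only gives a step for lowering, and both function names are hypothetical.

```python
def false_positive_rate(verdicts):
    """verdicts: list of booleans, True where the cached response was judged wrong."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def next_threshold(current, verdicts, step=0.01):
    """Suggest the threshold for the next shadow-mode round."""
    if not verdicts:
        return current  # nothing reviewed yet, so nothing to conclude
    if any(verdicts):
        return round(min(current + step, 1.0), 2)  # any wrong row: tighten
    return round(current - step, 2)                # all clean: loosen slightly


print(next_threshold(0.93, [False] * 200))          # all clean
print(next_threshold(0.93, [False] * 199 + [True])) # one wrong row
```

A stricter variant would raise the threshold to just above the highest-scoring wrong row instead of moving one step at a time.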

3. Disable caching on high-stakes aliases

Some routes shouldn't be cached at any threshold:

- model_alias: 'legal'
  cache: { enabled: false }
- model_alias: 'support'
  cache: { enabled: false }
- model_alias: 'medical'
  cache: { enabled: false }

The rule of thumb: if a wrong answer is expensive (legal liability, customer escalation, medical risk), the cache is off, full stop. Repetition savings are not worth wrong-answer risk on those routes.

4. Use TTL aggressively

Stable answers can cache for hours. Anything that mentions current state ("today", "this week", "your account balance") shouldn't cache more than a few minutes:

cache:
  semantic:
    ttl_seconds: 300       # 5 min default
    overrides:
      - alias: 'faq'
        ttl_seconds: 86400 # 24 hours
      - alias: 'pricing'
        ttl_seconds: 600   # 10 min
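To make the override semantics concrete, here is one way the TTL for a request could be resolved from that config. It is a sketch under assumptions: the dictionary mirrors the YAML above, first-match-wins is our guess at the lookup order, and `resolve_ttl` and `is_fresh` are hypothetical names.

```python
# Python mirror of the YAML above.
CACHE_CONFIG = {
    "semantic": {
        "ttl_seconds": 300,
        "overrides": [
            {"alias": "faq", "ttl_seconds": 86400},
            {"alias": "pricing", "ttl_seconds": 600},
        ],
    }
}


def resolve_ttl(alias, config=CACHE_CONFIG):
    """Return the TTL for an alias, falling back to the semantic default."""
    semantic = config["semantic"]
    for override in semantic.get("overrides", []):
        if override["alias"] == alias:
            return override["ttl_seconds"]
    return semantic["ttl_seconds"]


def is_fresh(stored_at, now, ttl_seconds):
    """True while a cache entry is younger than its TTL (timestamps in seconds)."""
    return (now - stored_at) < ttl_seconds
```

With this config, `resolve_ttl("faq")` gives 86400, `resolve_ttl("pricing")` gives 600, and any other alias falls back to 300.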

5. Monitor cache effectiveness

The two metrics that matter:

  • gatewayllm_cache_hit_total{type="semantic"} — semantic hits served.
  • gatewayllm_cache_miss_total{type="semantic"} — misses (i.e. requests that fell through to the upstream).

Hit rate = hits / (hits + misses). Track it as a Grafana panel. If it suddenly drops, your prompt distribution has shifted (a new feature shipped, traffic doubled in a region) and your cache hit assumptions need re-validating.
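The same formula in code, with a guard for an empty window and a simple drop check. Both counters are cumulative, so pass the increase over the window you care about rather than the raw totals. The function names and the 10-point tolerance are illustrative assumptions.

```python
def hit_rate(hits, misses):
    """Hit rate = hits / (hits + misses); 0.0 when there was no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


def dropped(previous_rate, current_rate, tolerance=0.10):
    """True if the hit rate fell by more than `tolerance` (absolute) between windows."""
    return (previous_rate - current_rate) > tolerance


last_week = hit_rate(hits=400, misses=600)
this_week = hit_rate(hits=250, misses=750)
print(last_week, this_week, dropped(last_week, this_week))
```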

6. Don't cache streaming responses naïvely

Streaming responses present differently: the user sees tokens trickle in. A cache hit returns the whole response instantly, which feels weird. Two options:

  • Replay the cached response with synthetic chunking so it still streams to the client (Gateway-LLM does this by default).
  • Bypass cache for streaming when UX consistency matters more than savings.
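Synthetic chunking, the first option, amounts to slicing the stored text and yielding the pieces with an optional pause. The sketch below shows the idea with fixed-size character chunks. It is not how Gateway-LLM does it internally, and `replay_as_stream`, `chunk_size`, and `delay_seconds` are hypothetical names.

```python
import time


def replay_as_stream(cached_text, chunk_size=16, delay_seconds=0.0):
    """Yield a cached response in small pieces so the client still sees a stream."""
    for start in range(0, len(cached_text), chunk_size):
        yield cached_text[start:start + chunk_size]
        if delay_seconds:
            time.sleep(delay_seconds)  # pacing between chunks; 0 disables it


for chunk in replay_as_stream("Go to Settings, then Billing, then Cancel.", chunk_size=12):
    print(repr(chunk))
```

A production version would more likely split on token or word boundaries and emit chunks in whatever wire format the client already expects from the upstream model.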

A pragmatic default config

For most teams, this is the place to land:

cache:
  enabled: true
  exact: true
  semantic:
    enabled: true
    similarity_threshold: 0.93
    ttl_seconds: 600
    shadow: false
    embedding_model: text-embedding-3-small
  overrides:
    - alias: 'legal'
      enabled: false
    - alias: 'medical'
      enabled: false

What to expect from this config:

  • Hit rate: typically 25–45% on customer-facing chat workloads, less on agent/tool-use workloads.
  • Cost savings: roughly proportional to the hit rate, because cached responses cost effectively zero.
  • False positives: with a 0.93 threshold and high-stakes aliases disabled, near-zero in production for the workloads we've measured.

When to skip caching entirely

  • You're under 100k requests/month. The savings are small enough that it's not worth the operational complexity.
  • Every prompt is unique (creative generation, code completion in your specific repo). Cache hit rate caps at 2–5%; the embedding overhead isn't worth the saving.
  • You can't tolerate any false-positive risk. Some compliance contexts (financial advice, medical) make this the right call. Trust the upstream model on every request; eat the cost.

Try it

Gateway-LLM ships with both exact and semantic caching enabled by default at the 0.93 threshold. The Configuration page covers every cache knob; shadow mode is a single line.

If you're new here, Quickstart is the place to start; turn caching off in config.yaml if you'd rather measure savings before flipping it on.

FAQ

What is semantic caching for LLMs?
Storing past LLM responses keyed by an embedding of the prompt, so near-duplicate prompts can be served from cache instead of re-calling the model.

When does semantic caching go wrong?
When the similarity threshold is too loose, related-but-different prompts hit the same cache entry. Fix: raise the threshold or disable caching for high-stakes aliases.

How do I tune the similarity threshold?
Start at 0.93. Run a week in shadow mode, eyeball the false positives, then raise or lower the threshold from there.


Run Gateway-LLM in five minutes.

Open-source, OpenAI-compatible, and Apache 2.0. The Quickstart is a single docker compose up.