TrueFoundry AI Gateway caches LLM responses so that repeated or similar queries are served instantly without calling the model provider. The gateway supports two caching strategies:
- Exact-Match — returns cached responses only when the request is identical
- Semantic — returns cached responses when requests are semantically similar, even if worded differently
Prompt Caching vs Gateway Caching
LLM providers and the TrueFoundry AI Gateway offer caching at different layers. Understanding the difference helps you pick the right approach — or combine both for maximum savings.

| | Provider Prompt Caching | Gateway Caching (Exact-Match & Semantic) |
|---|---|---|
| Where it runs | Inside the model provider (e.g. Anthropic) | In the TrueFoundry AI Gateway, before the request reaches any provider |
| What is cached | The prompt prefix — the provider skips re-processing tokens it has already seen | The complete LLM response — a cache hit returns the response instantly with zero model invocation |
| How it matches | Exact token-prefix match on the prompt | Exact request hash (exact-match) or embedding cosine similarity (semantic) |
| Latency savings | Reduces time-to-first-token; the model still generates a new completion | Eliminates model call entirely — response is returned from cache in milliseconds |
| Cost savings | Cached input tokens are billed at a reduced rate (varies by provider) | Cache hit = zero model cost for that request |
| Provider support | Provider-specific (see Prompt Caching) | Works with every provider routed through the gateway |
Cache Types
Exact-Match Cache
Exact-match caching stores responses keyed by a hash of the complete request — messages, model, and all parameters. A cached response is returned only when every part of the request matches exactly.

Best for:

- API calls with identical parameters
- Deterministic queries that always need the same response
- Development and testing environments
- Applications with predictable, repetitive queries
Semantic Cache
Semantic caching uses embeddings and cosine similarity to match requests that express the same intent, even when worded differently. For example, “How do I reset my password?” and “What’s the password reset process?” would match. The gateway extracts the last message, generates an embedding, and compares it against cached embeddings. All other request parameters (model, prior messages, temperature, etc.) are hashed and must still match exactly — only the last message is compared semantically.

Best for:

- Customer support chatbots
- FAQ systems where users phrase questions differently
- Conversational AI applications
- Any scenario where query variations express the same intent
Semantic cache is a superset of exact-match cache. Setting the cache type to semantic will also return results for exact-match hits, so you don’t need to configure both.

How to control the similarity required for a cache hit?
The `similarity_threshold` parameter (0–1.0) controls how close two queries must be for the cached response to be returned. A higher value demands closer matches; a lower value allows broader matching.

| Range | Behaviour | Recommended for |
|---|---|---|
| 0.95 – 1.0 | Very strict — only nearly identical queries match | High-precision use cases where incorrect cache hits are costly |
| 0.85 – 0.95 | Balanced — works well for most conversational applications | General-purpose chatbots and FAQ systems |
| < 0.85 | Broad — may return results for loosely related queries | Exploratory or low-risk scenarios |
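To make the threshold check concrete, here is a minimal sketch of cosine-similarity matching. This is illustrative only, not the gateway’s implementation, and the tiny three-dimensional “embeddings” are made up (real embedding models produce vectors with hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors: a.b / (|a||b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(new_embedding, cached_embedding, similarity_threshold=0.9):
    """Serve from cache only if the score meets or exceeds the threshold."""
    return cosine_similarity(new_embedding, cached_embedding) >= similarity_threshold

# Toy embeddings for three queries:
reset_password = [0.9, 0.1, 0.2]     # "How do I reset my password?"
reset_process  = [0.85, 0.15, 0.25]  # "What's the password reset process?"
unrelated      = [0.1, 0.9, 0.1]     # "What's the weather today?"

print(is_cache_hit(reset_process, reset_password))  # similar intent: hit
print(is_cache_hit(unrelated, reset_password))      # different intent: miss
```

Raising `similarity_threshold` toward 1.0 makes the second argument of the comparison harder to satisfy, which is exactly the strict/broad trade-off in the table above.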
How is similarity computed?
The gateway converts the last message of each request into a vector embedding using an embedding model. When a new request arrives, its embedding is compared against cached embeddings using cosine similarity — a score between 0 and 1.0 that measures how close two vectors are in meaning. If the score meets or exceeds the configured `similarity_threshold`, the cached response is returned. All other request parameters (model, prior messages, temperature, etc.) are hashed separately and must match exactly — only the last message is compared semantically.

Namespacing and Cache Isolation
The gateway isolates cache entries at two levels to prevent data leaking across boundaries.

Level 1 — User / Virtual Account (automatic)
Every cache entry is scoped to the user or virtual account that created it. This happens automatically — you don’t need to configure anything. A cache entry created by User A is never visible to User B, even if they send the exact same request.

Level 2 — Custom Namespace (optional)
Within a user’s or virtual account’s cache, you can further partition entries by providing a `namespace` string. Entries in one namespace are invisible to requests with a different namespace (or no namespace).
This is useful when a single virtual account serves multiple downstream end-users or application contexts and you want per-context cache isolation. For example, an application that serves many tenants through a single virtual account can set namespace to each tenant’s ID so that cached responses never cross tenant boundaries.
| Scenario | Namespace value | Effect |
|---|---|---|
| Single application, shared cache | omit or set to "default" | All requests under the same user/virtual account share a single cache pool |
| Multi-tenant application | Set to the tenant or end-user ID (e.g. "tenant-123") | Each tenant gets its own isolated cache within the same virtual account |
| Multiple environments | Set to the environment name (e.g. "staging", "production") | Prevents staging queries from returning production-cached responses |
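The two isolation levels can be illustrated with a toy in-memory cache keyed by (user, namespace, request hash). This is a conceptual sketch, not the gateway’s actual storage layout, and the payload is a made-up example:

```python
import hashlib
import json

cache = {}  # (user, namespace, request_hash) -> cached response

def request_hash(payload: dict) -> str:
    """Hash the canonicalized request body (stable key order)."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def cache_key(user: str, payload: dict, namespace: str = "default"):
    # Level 1: every key is scoped to the user / virtual account.
    # Level 2: the optional namespace partitions that user's cache further.
    return (user, namespace, request_hash(payload))

payload = {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hi"}]}

# Same request, two tenants served through one virtual account:
cache[cache_key("va-1", payload, namespace="tenant-123")] = "response for tenant 123"

print(cache.get(cache_key("va-1", payload, namespace="tenant-123")))  # found
print(cache.get(cache_key("va-1", payload, namespace="tenant-456")))  # isolated: None
print(cache.get(cache_key("va-2", payload, namespace="tenant-123")))  # other account: None
```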
Configuration
Enable caching by adding the `x-tfy-cache-config` header to your requests.
Examples
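Here is a sketch of what a request with semantic caching enabled might look like. The JSON shape of the `x-tfy-cache-config` header value is assumed from the parameters described on this page (cache type, `similarity_threshold`, `namespace`); confirm the exact schema for your gateway version. The model name and endpoint path are placeholders:

```python
import json
import os

# Assumed cache-config shape; field names follow this page's terminology.
cache_config = {
    "type": "semantic",
    "similarity_threshold": 0.9,
    "namespace": "tenant-123",
}

headers = {
    "Authorization": "Bearer your-truefoundry-api-key",
    "Content-Type": "application/json",
    "x-tfy-cache-config": json.dumps(cache_config),
}

body = {
    "model": "your-provider/your-model",  # placeholder
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}

base_url = os.environ.get("GATEWAY_BASE_URL")
if base_url:
    # Only sent when a gateway is actually configured.
    import requests
    resp = requests.post(
        f"{base_url}/chat/completions", headers=headers, json=body, timeout=30
    )
    print(resp.headers.get("x-tfy-cache-status"))
```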
Replace `{GATEWAY_BASE_URL}` with the base URL of the TrueFoundry AI Gateway and `your-truefoundry-api-key` with your API key.

Response Headers
The gateway returns headers indicating cache status on every response:

| Header | Description | Example Values |
|---|---|---|
| `x-tfy-cache-status` | Whether the cache was hit | `hit`, `miss`, `error` |
| `x-tfy-cached-trace-id` | Trace ID of the original request that populated the cache (only on hits) | `trace_abc123` |
| `x-tfy-cache-similarity-score` | Cosine similarity score (semantic cache hits only) | `0.95` |
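A client can branch on these headers for logging or metrics. The header names below come from the table above; the helper itself is an illustrative sketch, not part of any TrueFoundry SDK:

```python
def describe_cache_result(headers: dict) -> str:
    """Summarize the gateway's cache headers for a log line."""
    status = headers.get("x-tfy-cache-status", "miss")
    if status == "hit":
        trace = headers.get("x-tfy-cached-trace-id", "unknown")
        score = headers.get("x-tfy-cache-similarity-score")
        if score is not None:
            # Score is only present on semantic hits.
            return f"semantic hit (score={score}) from {trace}"
        return f"exact-match hit from {trace}"
    return status  # "miss" or "error"

print(describe_cache_result({
    "x-tfy-cache-status": "hit",
    "x-tfy-cached-trace-id": "trace_abc123",
    "x-tfy-cache-similarity-score": "0.95",
}))
```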
How It Works

Exact-Match

1. Hash the request: the gateway generates a hash of the complete request (messages, model, parameters).
2. Look up the hash: on a hit, the cached response is returned immediately; on a miss, the request is forwarded to the provider and the response is stored under that hash.

Semantic

1. Hash all request parameters except the last message; these must match exactly.
2. Generate an embedding for the last message and compare it against cached embeddings using cosine similarity.
3. If the best score meets or exceeds the configured `similarity_threshold`, return the cached response; otherwise forward the request to the provider and cache the new response with its embedding.
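The exact-match hashing idea can be sketched as follows. The gateway’s actual hashing scheme is internal; this just shows why changing any parameter produces a different cache key (the request payload is a made-up example):

```python
import hashlib
import json

def request_fingerprint(request: dict) -> str:
    """Canonicalize the request (stable key order, fixed separators) and hash it."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

base = {
    "model": "gpt-4o",
    "temperature": 0.0,
    "messages": [{"role": "user", "content": "Hello"}],
}

# An identical request hashes to the same key...
print(request_fingerprint(base) == request_fingerprint(dict(base)))
# ...while changing any parameter (here, temperature) yields a different key.
print(request_fingerprint(base) != request_fingerprint(dict(base, temperature=0.1)))
```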
Infrastructure Setup
For SaaS customers, caching infrastructure is fully managed — no additional setup is required. For On-Premise deployments, caching requires two infrastructure components:

- Redis — cache store for both exact-match and semantic caching
- Embedding Model — required for semantic caching to generate vector embeddings (see Embedding Model)
Redis Setup
You can either enable the bundled Redis instance in the `tfy-llm-gateway` Helm chart, or connect your own Redis (or Redis-compatible store such as Valkey).
- Bundled Redis: enable Redis directly in the `tfy-llm-gateway` chart values.
- Bring Your Own Redis: connect the gateway to your existing Redis (or a Redis-compatible store such as Valkey).

Embedding Model
The embedding model configuration depends on your deployment mode:

- TrueFoundry SaaS: No setup is required; OpenAI’s `text-embedding-3-small` model is used by default. This model is not configurable.
- Self Hosted (Control Plane + Gateway Plane): Configure the model under “Settings” → “Semantic Cache” by selecting an embedding model (it must already be added as an integration).
- Hybrid (TrueFoundry Control Plane + Self Hosted Gateway Plane): Add an environment variable in the values of the `tfy-llm-gateway` Helm chart, referencing an embedding model already registered on TrueFoundry.