
TrueFoundry AI Gateway caches LLM responses so that repeated or similar queries are served instantly without calling the model provider. The gateway supports two caching strategies:
  • Exact-Match — returns cached responses only when the request is identical
  • Semantic — returns cached responses when requests are semantically similar, even if worded differently

Prompt Caching vs Gateway Caching

LLM providers and the TrueFoundry AI Gateway offer caching at different layers. Understanding the difference helps you pick the right approach — or combine both for maximum savings.
|  | Provider Prompt Caching | Gateway Caching (Exact-Match & Semantic) |
| --- | --- | --- |
| Where it runs | Inside the model provider (e.g. Anthropic) | In the TrueFoundry AI Gateway, before the request reaches any provider |
| What is cached | The prompt prefix — the provider skips re-processing tokens it has already seen | The complete LLM response — a cache hit returns the response instantly with zero model invocation |
| How it matches | Exact token-prefix match on the prompt | Exact request hash (exact-match) or embedding cosine similarity (semantic) |
| Latency savings | Reduces time-to-first-token; the model still generates a new completion | Eliminates the model call entirely — the response is returned from cache in milliseconds |
| Cost savings | Cached input tokens are billed at a reduced rate (varies by provider) | Cache hit = zero model cost for that request |
| Provider support | Provider-specific (see Prompt Caching) | Works with every provider routed through the gateway |

Cache Types

Exact-Match Cache

Exact-match caching stores responses keyed by a hash of the complete request — messages, model, and all parameters. A cached response is returned only when every part of the request matches exactly. Best for:
  • API calls with identical parameters
  • Deterministic queries that always need the same response
  • Development and testing environments
  • Applications with predictable, repetitive queries
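To make the matching rule concrete, here is a minimal sketch of how such a key could be derived (an illustration only, not the gateway's actual implementation; the specific hashing scheme is an assumption):

```python
import hashlib
import json

def exact_match_cache_key(request: dict) -> str:
    # Canonicalize the request so key ordering does not affect the hash.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

base = {
    "model": "openai-main/gpt-4o",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0,
}

# An identical request produces the same key, so it would be a cache hit.
assert exact_match_cache_key(dict(base)) == exact_match_cache_key(base)

# Changing any parameter (here, temperature) produces a different key: a miss.
assert exact_match_cache_key({**base, "temperature": 0.7}) != exact_match_cache_key(base)
```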

Semantic Cache

Semantic caching uses embeddings and cosine similarity to match requests that express the same intent, even when worded differently. For example, “How do I reset my password?” and “What’s the password reset process?” would match. The gateway extracts the last message, generates an embedding, and compares it against cached embeddings. All other request parameters (model, prior messages, temperature, etc.) are hashed and must still match exactly — only the last message is compared semantically. Best for:
  • Customer support chatbots
  • FAQ systems where users phrase questions differently
  • Conversational AI applications
  • Any scenario where query variations express the same intent
Semantic cache is a superset of exact-match cache. Setting the cache type to semantic will also return results for exact-match hits, so you don’t need to configure both.
The similarity_threshold parameter (0 – 1.0) controls how close two queries must be for the cached response to be returned. A higher value demands closer matches; a lower value allows broader matching.
| Range | Behaviour | Recommended for |
| --- | --- | --- |
| 0.95 – 1.0 | Very strict — only nearly identical queries match | High-precision use cases where incorrect cache hits are costly |
| 0.85 – 0.95 | Balanced — works well for most conversational applications | General-purpose chatbots and FAQ systems |
| < 0.85 | Broad — may return results for loosely related queries | Exploratory or low-risk scenarios |
Start with a threshold of 0.9 and monitor cache hit rates. Adjust up for precision or down for coverage.
Cosine similarity is a score between 0 and 1.0 that measures how close two vectors are in meaning. When a new request arrives, the gateway embeds its last message with the configured embedding model and compares that embedding against the cached embeddings; if the best score meets or exceeds the configured similarity_threshold, the cached response is returned. Otherwise the request is treated as a miss and forwarded to the provider.
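To make the scoring rule concrete, here is a minimal sketch (an illustration, not the gateway's internal code; the embedding vectors are stand-ins for the output of the configured embedding model):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|): how close two embedding vectors are in direction.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def is_semantic_hit(new_embedding, cached_embedding, similarity_threshold=0.9):
    # Serve the cached response only when the score meets the threshold.
    return cosine_similarity(new_embedding, cached_embedding) >= similarity_threshold
```

With a threshold of 0.9, paraphrases such as "How do I reset my password?" and "What's the password reset process?" would typically score above the cutoff and hit the cache, while unrelated queries would fall below it and miss.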

Namespacing and Cache Isolation

The gateway isolates cache entries at two levels to prevent data leaking across boundaries.

Level 1 — User / Virtual Account (automatic)

Every cache entry is scoped to the user or virtual account that created it. This happens automatically — you don’t need to configure anything. A cache entry created by User A is never visible to User B, even if they send the exact same request.

Level 2 — Custom Namespace (optional)

Within a user’s or virtual account’s cache, you can further partition entries by providing a namespace string. Entries in one namespace are invisible to requests with a different namespace (or no namespace). This is useful when a single virtual account serves multiple downstream end-users or application contexts and you want per-context cache isolation. For example, an application that serves many tenants through a single virtual account can set namespace to each tenant’s ID so that cached responses never cross tenant boundaries.
| Scenario | Namespace value | Effect |
| --- | --- | --- |
| Single application, shared cache | Omit or set to "default" | All requests under the same user/virtual account share a single cache pool |
| Multi-tenant application | Set to the tenant or end-user ID (e.g. "tenant-123") | Each tenant gets its own isolated cache within the same virtual account |
| Multiple environments | Set to the environment name (e.g. "staging", "production") | Prevents staging queries from returning production-cached responses |
For example, a cache config that scopes semantic caching to a single tenant:

```json
{
  "type": "semantic",
  "similarity_threshold": 0.9,
  "ttl": 600,
  "namespace": "tenant-123"
}
```
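As a sketch of how a multi-tenant application might build this header per request (the helper below is hypothetical, not part of any TrueFoundry SDK):

```python
import json

def cache_config_header(tenant_id: str) -> dict:
    # Hypothetical helper: serializes the cache config into the
    # x-tfy-cache-config header so cached responses never cross tenants.
    config = {
        "type": "semantic",
        "similarity_threshold": 0.9,
        "ttl": 600,
        "namespace": tenant_id,
    }
    return {"x-tfy-cache-config": json.dumps(config)}

headers = cache_config_header("tenant-123")
```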

Configuration

Enable caching by adding the x-tfy-cache-config header to your requests.

Examples

Replace {GATEWAY_BASE_URL} with the base URL of the TrueFoundry AI Gateway and your-truefoundry-api-key with your API key.
```bash
curl {GATEWAY_BASE_URL}/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-truefoundry-api-key" \
  -H "x-tfy-cache-config: {\"type\": \"exact-match\", \"ttl\": 600}" \
  -d '{
    "model": "openai-main/gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
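Because the gateway serves an OpenAI-compatible /chat/completions endpoint, the same request can also be sent from the official OpenAI Python SDK by attaching the cache header. A minimal sketch, assuming the openai package and the same placeholders as above:

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="{GATEWAY_BASE_URL}",       # replace with your gateway base URL
    api_key="your-truefoundry-api-key",  # replace with your API key
    default_headers={
        "x-tfy-cache-config": json.dumps(
            {"type": "semantic", "similarity_threshold": 0.9, "ttl": 600}
        )
    },
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```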

Response Headers

The gateway returns headers indicating cache status on every response:
| Header | Description | Example Values |
| --- | --- | --- |
| x-tfy-cache-status | Whether the cache was hit | hit, miss, error |
| x-tfy-cached-trace-id | Trace ID of the original request that populated the cache (only on hits) | trace_abc123 |
| x-tfy-cache-similarity-score | Cosine similarity score (semantic cache hits only) | 0.95 |
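One way to observe these headers is to send a request with a plain HTTP client and inspect the response. A sketch using the requests library, with the same placeholders as above:

```python
import requests

resp = requests.post(
    "{GATEWAY_BASE_URL}/chat/completions",
    headers={
        "Authorization": "Bearer your-truefoundry-api-key",
        "x-tfy-cache-config": '{"type": "semantic", "similarity_threshold": 0.9}',
    },
    json={
        "model": "openai-main/gpt-4o",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
    },
)

print(resp.headers.get("x-tfy-cache-status"))            # hit, miss, or error
print(resp.headers.get("x-tfy-cached-trace-id"))         # present only on cache hits
print(resp.headers.get("x-tfy-cache-similarity-score"))  # semantic cache hits only
```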

How It Works

  1. Hash the request — the gateway generates a hash of the complete request (messages, model, parameters).
  2. Look up the cache — if a cached response exists for this hash, it is returned immediately.
  3. Forward on miss — the request is forwarded to the model provider, and the response is cached before being returned.
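Put together, the request path looks roughly like this (a simplified sketch of the flow above; cache and provider are stand-in interfaces, not gateway internals):

```python
def handle_request(request, cache, provider, ttl=600):
    # 1. Hash the complete request (messages, model, parameters).
    key = exact_match_cache_key(request)  # see the earlier exact-match sketch

    # 2. Return immediately on a cache hit.
    cached = cache.get(key)
    if cached is not None:
        return cached

    # 3. On a miss, call the provider, then cache the response before returning.
    response = provider.complete(request)
    cache.set(key, response, ttl=ttl)
    return response
```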

Infrastructure Setup

For SaaS customers, caching infrastructure is fully managed — no additional setup is required. For On-Premise deployments, caching requires two infrastructure components:
  1. Redis — cache store for both exact-match and semantic caching
  2. Embedding Model — required for semantic caching to generate vector embeddings (see Embedding Model)

Redis Setup

You can either enable the bundled Redis instance in the tfy-llm-gateway Helm chart, or connect your own Redis (or Redis-compatible store such as Valkey).
Enable Redis directly in the tfy-llm-gateway chart values:
```yaml
tfy-llm-gateway:
  redis:
    enabled: true
```

Embedding Model

The embedding model configuration depends on your deployment mode:
  1. TrueFoundry SaaS: No setup required; OpenAI's text-embedding-3-small model is used by default. This model is not configurable.
  2. Self-Hosted (Control Plane + Gateway Plane): Configure the model under “Settings” → “Semantic Cache” by selecting an embedding model (the model should already be added as an integration).
  3. Hybrid (TrueFoundry Control Plane + Self-Hosted Gateway Plane): Add an environment variable to the tfy-llm-gateway Helm chart values, referencing an embedding model already registered on TrueFoundry:

```yaml
tfy-llm-gateway:
  env:
    SEMANTIC_CACHE_MODEL_IDENTIFIER: "openai-account/embedding-model-name" # model identifier on TrueFoundry for the embedding model
```