Caching is available for TrueFoundry SaaS customers. Contact support@truefoundry.com to enable this feature for your account.
TrueFoundry AI Gateway caches LLM responses to deliver faster results and reduce costs. By storing and reusing responses from previous requests, you can serve repeated or similar queries instantly. The gateway supports two caching strategies:
  • Exact-Match: Returns cached responses only for identical requests
  • Semantic: Returns cached responses for semantically similar requests using embeddings and similarity scoring
Caching can reduce response times by up to 20x and significantly lower costs by avoiding redundant model calls.

Why Use Caching?

Cached responses are returned instantly without calling the model provider. This is particularly valuable for chat applications, customer support systems, and any scenario with repeated queries.
Every cache hit eliminates a model API call. For high-traffic applications or expensive models, this translates to significant cost savings over time.
Semantic caching recognizes when users ask the same question in different ways, serving cached responses for variations like “How do I reset my password?” and “What’s the password reset process?”
Development workflows often involve repeated requests. Caching speeds up iteration cycles and reduces API costs during testing.

Cache Types

Exact-Match Cache

Exact-match caching stores responses for identical requests. The gateway creates a hash of the complete request (messages, model, parameters) and returns the cached response if an exact match is found.

When to Use:
  • API calls with identical parameters
  • Deterministic queries that always need the same response
  • Development and testing environments
  • Applications with predictable, repetitive queries
Configuration:
{
  "type": "exact-match",
  "ttl": 600
}

Semantic Cache

Semantic caching uses embeddings and cosine similarity to identify semantically similar requests. The gateway returns the cached response when the similarity score exceeds the configured threshold.

When to Use:
  • Customer support chatbots
  • FAQ systems where users phrase questions differently
  • Conversational AI applications
  • Any scenario where query variations express the same intent
Configuration:
{
  "type": "semantic",
  "similarity_threshold": 0.9,
  "ttl": 600
}
The similarity_threshold ranges from 0 to 1.0:
  • 0.95 - 1.0: Very strict, only nearly identical queries match
  • 0.85 - 0.95: Balanced, works well for most conversational applications
  • < 0.85: Broader matching, may return results for loosely related queries
Start with a threshold of 0.9 for semantic caching. Monitor your cache hit rates and adjust based on your accuracy requirements.
Semantic caching is effectively a superset of exact-match caching: setting the cache type to "semantic" also serves exact-match hits.
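To build intuition for how the threshold gates matches, here is a toy cosine-similarity sketch. The three-dimensional vectors are made-up stand-ins for real embeddings (which have hundreds or thousands of dimensions); only the shape of the comparison is meant to be instructive.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for query embeddings.
reset_pw   = [0.9, 0.1, 0.2]    # "How do I reset my password?"
pw_process = [0.85, 0.15, 0.25] # "What's the password reset process?"
weather    = [0.1, 0.9, 0.3]    # "What's the weather today?"

# Paraphrases land close together (above a 0.9 threshold -> cache hit);
# unrelated queries land far apart (below 0.9 -> cache miss).
print(cosine_similarity(reset_pw, pw_process) > 0.9)  # True
print(cosine_similarity(reset_pw, weather) > 0.9)     # False
```

A stricter threshold (e.g. 0.95) would shrink the "hit" region around each cached query, trading hit rate for accuracy.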

Configuration

Enable caching by adding the x-tfy-cache-config header to your requests.
Replace {controlPlaneURL} with your TrueFoundry control plane URL and your-truefoundry-api-key with your API key.
curl https://{controlPlaneURL}/api/llm/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-truefoundry-api-key" \
  -H "x-tfy-cache-config: {\"type\": \"exact-match\", \"ttl\": 600}" \
  -d '{
    "model": "openai-main/gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
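The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the same placeholder URL and API key as the curl example above. The main point is that the header value is itself JSON, so it is safest to build it with json.dumps rather than hand-escaping quotes.

```python
import json
import urllib.request

CONTROL_PLANE_URL = "https://{controlPlaneURL}"  # replace with your control plane URL
API_KEY = "your-truefoundry-api-key"             # replace with your API key

# The x-tfy-cache-config header carries a JSON string.
cache_config = {"type": "exact-match", "ttl": 600}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
    "x-tfy-cache-config": json.dumps(cache_config),
}

payload = {
    "model": "openai-main/gpt-4o",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

def chat_completion():
    req = urllib.request.Request(
        f"{CONTROL_PLANE_URL}/api/llm/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```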

Configuration Parameters

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| type | Yes | "exact-match" \| "semantic" | Cache matching strategy |
| ttl | Yes | number | Time to live in seconds. Maximum: 3600 (1 hour) |
| similarity_threshold | Semantic only | number | Similarity score between 0 and 1.0. Higher values require closer matches |

Response Headers

The gateway returns headers indicating cache status:
| Header | Description | Example Values |
| --- | --- | --- |
| x-tfy-cache-status | Cache hit status | hit, miss, error |
| x-tfy-cached-trace-id | Original request trace ID for cache hits | trace_abc123 |
| x-tfy-cache-similarity-score | Similarity score (semantic cache only) | 0.95 |
curl -i https://{controlPlaneURL}/api/llm/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-truefoundry-api-key" \
  -H "x-tfy-cache-config: {\"type\": \"semantic\", \"similarity_threshold\": 0.9, \"ttl\": 600}" \
  -d '{
    "model": "openai-main/gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Check response headers for:
# x-tfy-cache-status: hit/miss/error
# x-tfy-cached-trace-id: trace_abc123
# x-tfy-cache-similarity-score: 0.95
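In client code, these headers can drive logging or metrics. A small illustrative sketch (the header names are from the table above; the plain dict stands in for a real HTTP response's headers):

```python
def describe_cache_result(headers: dict) -> str:
    """Summarize the gateway's cache headers for logging."""
    status = headers.get("x-tfy-cache-status", "miss")
    if status == "hit":
        trace = headers.get("x-tfy-cached-trace-id", "unknown")
        score = headers.get("x-tfy-cache-similarity-score")  # semantic cache only
        if score is not None:
            return f"cache hit (trace {trace}, similarity {score})"
        return f"cache hit (trace {trace})"
    return f"cache {status}: response came from the model provider"

print(describe_cache_result({
    "x-tfy-cache-status": "hit",
    "x-tfy-cached-trace-id": "trace_abc123",
    "x-tfy-cache-similarity-score": "0.95",
}))
# cache hit (trace trace_abc123, similarity 0.95)
```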

How It Works

Exact-Match Cache

  1. Gateway generates a hash of the complete request
  2. Checks if a cached response exists for this hash
  3. Returns cached response if found, otherwise forwards to model and caches the response
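The steps above can be sketched in a few lines. The gateway's actual keying scheme is internal; this sketch only illustrates why identical requests share a cache key while any parameter change produces a new one.

```python
import hashlib
import json

def cache_key(request: dict) -> str:
    # Canonicalize (sorted keys, fixed separators) so dict ordering
    # doesn't affect the hash.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

base = {"model": "openai-main/gpt-4o",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "temperature": 0}

identical = json.loads(json.dumps(base))  # same request -> same key
tweaked = {**base, "temperature": 0.7}    # any parameter change -> new key

assert cache_key(base) == cache_key(identical)
assert cache_key(base) != cache_key(tweaked)
```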

Semantic Cache

  1. Gateway extracts the last message from the request and generates an embedding vector
  2. All other request parameters (model, previous messages, temperature, etc.) are hashed and must match exactly
  3. Compares the embedding with cached embeddings using cosine similarity
  4. If similarity score exceeds the threshold and parameter hash matches, returns the cached response
  5. Otherwise, forwards to model and caches both response and embedding
To keep cache-hit matching accurate, semantic caching only applies to requests with at most 8,191 input tokens — the input limit of the OpenAI text-embedding model used for similarity matching.
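Putting the steps together, a toy in-memory version of the semantic lookup might look like the following. The hard-coded vectors stand in for a real embedding model, and the context-hashing scheme is an assumption for illustration; it only shows how the exact-match check on other parameters combines with the similarity check on the last message.

```python
import hashlib
import json
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def context_hash(request: dict) -> str:
    """Hash everything except the last message; these must match exactly."""
    ctx = {**request, "messages": request["messages"][:-1]}
    return hashlib.sha256(json.dumps(ctx, sort_keys=True).encode()).hexdigest()

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # (context_hash, embedding, response)

    def lookup(self, request: dict, embedding):
        h = context_hash(request)
        for entry_hash, entry_emb, response in self.entries:
            if entry_hash == h and cosine(embedding, entry_emb) >= self.threshold:
                return response
        return None  # miss: caller forwards to the model, then store()s

    def store(self, request: dict, embedding, response):
        self.entries.append((context_hash(request), embedding, response))

cache = SemanticCache(threshold=0.9)
req = {"model": "openai-main/gpt-4o",
       "messages": [{"role": "user", "content": "How do I reset my password?"}]}
cache.store(req, [0.9, 0.1, 0.2], "Go to Settings > Security ...")

# A paraphrase with the same model and context hits the cache.
similar = {"model": "openai-main/gpt-4o",
           "messages": [{"role": "user", "content": "What's the password reset process?"}]}
print(cache.lookup(similar, [0.85, 0.15, 0.25]))
```

Note that changing the model (or any other parameter) misses even when the embeddings are nearly identical, mirroring step 2 above.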