Caching is available for TrueFoundry SaaS customers. Contact support@truefoundry.com to enable this feature for your account.
The gateway supports two caching modes:
- Exact-Match: Returns cached responses only for identical requests
- Semantic: Returns cached responses for semantically similar requests, using embeddings and similarity scoring
Why Use Caching?
Reduce Response Latency
Cached responses are returned instantly without calling the model provider. This is particularly valuable for chat applications, customer support systems, and any scenario with repeated queries.
Lower LLM Costs
Every cache hit eliminates a model API call. For high-traffic applications or expensive models, this translates to significant cost savings over time.
Handle Query Variations
Semantic caching recognizes when users ask the same question in different ways, serving cached responses for variations like “How do I reset my password?” and “What’s the password reset process?”
Faster Development & Testing
Development workflows often involve repeated requests. Caching speeds up iteration cycles and reduces API costs during testing.
Cache Types
Exact-Match Cache
Exact-match caching stores responses for identical requests. The gateway creates a hash of the complete request (messages, model, parameters) and returns the cached response if an exact match is found.
When to Use:
- API calls with identical parameters
- Deterministic queries that always need the same response
- Development and testing environments
- Applications with predictable, repetitive queries
Semantic Cache
Semantic caching uses embeddings and cosine similarity to identify semantically similar requests. The gateway returns cached responses when the similarity score exceeds the configured threshold.
When to Use:
- Customer support chatbots
- FAQ systems where users phrase questions differently
- Conversational AI applications
- Any scenario where query variations express the same intent
`similarity_threshold` ranges from 0 to 1.0:
- 0.95 - 1.0: Very strict, only nearly identical queries match
- 0.85 - 0.95: Balanced, works well for most conversational applications
- < 0.85: Broader matching, may return results for loosely related queries
Semantic caching is effectively a superset of exact-match caching: setting the cache mode to “semantic” also serves exact-match cache hits.
Configuration
Enable caching by adding the `x-tfy-cache-config` header to your requests.
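For example, using the OpenAI Python SDK pointed at the gateway (a minimal sketch: the base-URL path, model ID, and config values are illustrative, and it assumes the header carries the config as a JSON object):

```python
import json
from openai import OpenAI

client = OpenAI(
    api_key="your-truefoundry-api-key",
    base_url="{controlPlaneURL}/api/llm",  # assumed gateway base path
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",  # illustrative model ID
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    # Cache config passed as a JSON string in the x-tfy-cache-config header.
    extra_headers={
        "x-tfy-cache-config": json.dumps(
            {"type": "semantic", "ttl": 600, "similarity_threshold": 0.9}
        )
    },
)
print(response.choices[0].message.content)
```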
Replace `{controlPlaneURL}` with your TrueFoundry control plane URL and `your-truefoundry-api-key` with your API key.
Configuration Parameters
| Parameter | Required | Type | Description |
|---|---|---|---|
| `type` | Yes | `"exact-match"` or `"semantic"` | Cache matching strategy |
| `ttl` | Yes | number | Time to live in seconds. Maximum: 3600 (1 hour) |
| `similarity_threshold` | Semantic only | number | Similarity score between 0 and 1.0. Higher values require closer matches |
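Assuming the JSON header shape shown above, the config value for each mode looks like:

```json
{ "type": "exact-match", "ttl": 300 }
```

```json
{ "type": "semantic", "ttl": 600, "similarity_threshold": 0.9 }
```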
Response Headers
The gateway returns headers indicating cache status:
| Header | Description | Example Values |
|---|---|---|
| `x-tfy-cache-status` | Cache hit status | `hit`, `miss`, `error` |
| `x-tfy-cached-trace-id` | Original request trace ID for cache hits | `trace_abc123` |
| `x-tfy-cache-similarity-score` | Similarity score (semantic cache only) | `0.95` |
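To inspect these headers programmatically, a sketch using `requests` (the endpoint path and model ID are assumptions, as above):

```python
import requests

resp = requests.post(
    "{controlPlaneURL}/api/llm/chat/completions",  # assumed endpoint path
    headers={
        "Authorization": "Bearer your-truefoundry-api-key",
        "x-tfy-cache-config": '{"type": "semantic", "ttl": 600, "similarity_threshold": 0.9}',
    },
    json={
        "model": "openai-main/gpt-4o-mini",  # illustrative model ID
        "messages": [{"role": "user", "content": "How do I reset my password?"}],
    },
)
print(resp.headers.get("x-tfy-cache-status"))            # "hit", "miss", or "error"
print(resp.headers.get("x-tfy-cached-trace-id"))         # present on cache hits
print(resp.headers.get("x-tfy-cache-similarity-score"))  # semantic cache only
```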
How It Works
Exact-Match Cache
- Gateway generates a hash of the complete request
- Checks if a cached response exists for this hash
- Returns cached response if found, otherwise forwards to model and caches the response
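A simplified sketch of this lookup (not the gateway's actual implementation; the hashing scheme and in-memory store are illustrative):

```python
import hashlib
import json

# In-memory stand-in for the gateway's cache store.
cache: dict[str, dict] = {}

def request_key(request: dict) -> str:
    # Hash the complete request: messages, model, and all parameters.
    canonical = json.dumps(request, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def exact_match_lookup(request: dict) -> dict | None:
    # Return the cached response for an identical request, if any.
    return cache.get(request_key(request))

def exact_match_store(request: dict, response: dict) -> None:
    cache[request_key(request)] = response
```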
Semantic Cache
- Gateway extracts the last message from the request and generates an embedding vector
- All other request parameters (model, previous messages, temperature, etc.) are hashed and must match exactly
- Compares the embedding with cached embeddings using cosine similarity
- If similarity score exceeds the threshold and parameter hash matches, returns the cached response
- Otherwise, forwards to model and caches both response and embedding
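A simplified sketch of the matching logic (illustrative only; `embed` stands in for the embedding model call, and the store is an in-memory dict):

```python
import hashlib
import json
import math

# Cached entries grouped by parameter hash; each entry pairs an
# embedding of the last message with the cached response.
semantic_cache: dict[str, list[dict]] = {}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_lookup(request: dict, embed, threshold: float) -> dict | None:
    # Everything except the last message must match exactly.
    params = {**request, "messages": request["messages"][:-1]}
    params_hash = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()
    query_embedding = embed(request["messages"][-1]["content"])
    best_score, best_response = 0.0, None
    for entry in semantic_cache.get(params_hash, []):
        score = cosine_similarity(query_embedding, entry["embedding"])
        if score >= threshold and score > best_score:
            best_score, best_response = score, entry["response"]
    return best_response
```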
To keep similarity matching accurate, semantic caching only applies to requests containing fewer than 8,191 input tokens (the limit of the OpenAI text-embedding model used for similarity matching).