Caching is available for TrueFoundry SaaS customers. Contact support@truefoundry.com to enable this feature for your account.
The gateway supports two caching modes:
- Exact-Match: Returns cached responses only for identical requests
- Semantic: Returns cached responses for semantically similar requests, using embeddings and similarity scoring
Why Use Caching?
Reduce Response Latency
Cached responses are returned instantly without calling the model provider. This is particularly valuable for chat applications, customer support systems, and any scenario with repeated queries.
Lower LLM Costs
Every cache hit eliminates a model API call. For high-traffic applications or expensive models, this translates to significant cost savings over time.
Handle Query Variations
Semantic caching recognizes when users ask the same question in different ways, serving cached responses for variations like “How do I reset my password?” and “What’s the password reset process?”
Faster Development & Testing
Development workflows often involve repeated requests. Caching speeds up iteration cycles and reduces API costs during testing.
Cache Types
Exact-Match Cache
Exact-match caching stores responses for identical requests. The gateway creates a hash of the complete request (messages, model, parameters) and returns the cached response if an exact match is found.
When to Use:
- API calls with identical parameters
- Deterministic queries that always need the same response
- Development and testing environments
- Applications with predictable, repetitive queries
Semantic Cache
Semantic caching uses embeddings and cosine similarity to identify semantically similar requests. The gateway returns cached responses when the similarity score exceeds the configured threshold.
When to Use:
- Customer support chatbots
- FAQ systems where users phrase questions differently
- Conversational AI applications
- Any scenario where query variations express the same intent
`similarity_threshold` ranges from 0 to 1.0:
- 0.95 - 1.0: Very strict, only nearly identical queries match
- 0.85 - 0.95: Balanced, works well for most conversational applications
- < 0.85: Broader matching, may return results for loosely related queries
Semantic caching is effectively a superset of exact-match caching: setting the cache mode to “semantic” also serves exact-match cache hits.
Configuration
Enable caching by adding the `x-tfy-cache-config` header to your requests.
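For example, using the OpenAI Python SDK pointed at the gateway (a minimal sketch: the base-URL path, model ID, and config values are illustrative, and it assumes the header carries the config as a JSON object):

```python
import json
from openai import OpenAI

client = OpenAI(
    api_key="your-truefoundry-api-key",
    base_url="{controlPlaneURL}/api/llm",  # assumed gateway base path
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",  # illustrative model ID
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    # Cache config passed as a JSON string in the x-tfy-cache-config header.
    extra_headers={
        "x-tfy-cache-config": json.dumps(
            {"type": "semantic", "ttl": 600, "similarity_threshold": 0.9}
        )
    },
)
print(response.choices[0].message.content)
```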
Replace `{controlPlaneURL}` with your TrueFoundry control plane URL and `your-truefoundry-api-key` with your API key.
Configuration Parameters
| Parameter | Required | Type | Description |
|---|---|---|---|
| `type` | Yes | `"exact-match"` or `"semantic"` | Cache matching strategy |
| `ttl` | Yes | number | Time to live in seconds. Maximum: 3600 (1 hour) |
| `similarity_threshold` | Semantic only | number | Similarity score between 0 and 1.0. Higher values require closer matches |
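Assuming the JSON header shape shown above, the config value for each mode looks like:

```json
{ "type": "exact-match", "ttl": 300 }
```

```json
{ "type": "semantic", "ttl": 600, "similarity_threshold": 0.9 }
```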
Response Headers
The gateway returns headers indicating cache status:
| Header | Description | Example Values |
|---|---|---|
| `x-tfy-cache-status` | Cache hit status | `hit`, `miss`, `error` |
| `x-tfy-cached-trace-id` | Original request trace ID for cache hits | `trace_abc123` |
| `x-tfy-cache-similarity-score` | Similarity score (semantic cache only) | `0.95` |
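To inspect these headers programmatically, a sketch using `requests` (the endpoint path and model ID are assumptions, as above):

```python
import requests

resp = requests.post(
    "{controlPlaneURL}/api/llm/chat/completions",  # assumed endpoint path
    headers={
        "Authorization": "Bearer your-truefoundry-api-key",
        "x-tfy-cache-config": '{"type": "semantic", "ttl": 600, "similarity_threshold": 0.9}',
    },
    json={
        "model": "openai-main/gpt-4o-mini",  # illustrative model ID
        "messages": [{"role": "user", "content": "How do I reset my password?"}],
    },
)
print(resp.headers.get("x-tfy-cache-status"))            # "hit", "miss", or "error"
print(resp.headers.get("x-tfy-cached-trace-id"))         # present on cache hits
print(resp.headers.get("x-tfy-cache-similarity-score"))  # semantic cache only
```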
How It Works
Exact-Match Cache
- Gateway generates a hash of the complete request
- Checks if a cached response exists for this hash
- Returns cached response if found, otherwise forwards to model and caches the response
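A simplified sketch of this lookup (not the gateway's actual implementation; the hashing scheme and in-memory store are illustrative):

```python
import hashlib
import json

# In-memory stand-in for the gateway's cache store.
cache: dict[str, dict] = {}

def request_key(request: dict) -> str:
    # Hash the complete request: messages, model, and all parameters.
    canonical = json.dumps(request, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def exact_match_lookup(request: dict) -> dict | None:
    # Return the cached response for an identical request, if any.
    return cache.get(request_key(request))

def exact_match_store(request: dict, response: dict) -> None:
    cache[request_key(request)] = response
```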
Semantic Cache
- Gateway extracts the last message from the request and generates an embedding vector
- All other request parameters (model, previous messages, temperature, etc.) are hashed and must match exactly
- Compares the embedding with cached embeddings using cosine similarity
- If similarity score exceeds the threshold and parameter hash matches, returns the cached response
- Otherwise, forwards to model and caches both response and embedding
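A simplified sketch of the matching logic (illustrative only; `embed` stands in for the embedding model call, and the store is an in-memory dict):

```python
import hashlib
import json
import math

# Cached entries grouped by parameter hash; each entry pairs an
# embedding of the last message with the cached response.
semantic_cache: dict[str, list[dict]] = {}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_lookup(request: dict, embed, threshold: float) -> dict | None:
    # Everything except the last message must match exactly.
    params = {**request, "messages": request["messages"][:-1]}
    params_hash = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()
    query_embedding = embed(request["messages"][-1]["content"])
    best_score, best_response = 0.0, None
    for entry in semantic_cache.get(params_hash, []):
        score = cosine_similarity(query_embedding, entry["embedding"])
        if score >= threshold and score > best_score:
            best_score, best_response = score, entry["response"]
    return best_response
```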
To keep similarity matching accurate, semantic caching only applies to requests containing fewer than 8,191 input tokens (the limit of the OpenAI text-embedding model used for similarity matching).