
Why Guardrails?

Once AI applications go to production, they handle real user data and — in the case of agents — call external tools on their own. Things can go wrong fast:
  • A customer support chatbot leaks a user’s credit card number because PII wasn’t stripped from the context.
  • A coding agent runs rm -rf / through an MCP tool after hallucinating a shell command — and nothing stopped it.
  • A healthcare assistant makes up drug dosage numbers. The response reaches the patient unchecked.
  • An internal Q&A bot gets jailbroken through prompt injection, leaking confidential company data.
Guardrails prevent these scenarios. They sit between your application and the LLM (or MCP tool), inspecting and — when needed — blocking or rewriting data before it causes damage. You can attach them to LLM requests (check the prompt going in, check the response coming out) and to MCP tool calls (check the arguments before the tool runs, check the results after it returns).

How a TrueFoundry Guardrail Works

Each guardrail has two settings you configure: what it does with the data, and how strictly it enforces its decisions.

Operation Mode

| Mode | Behavior | Execution |
| --- | --- | --- |
| Validate | Looks at the data, blocks the request if something is wrong. Doesn't touch the data itself. E.g., a content moderation guardrail sees hate speech in the prompt and blocks the request outright. | Runs in parallel (faster) |
| Mutate | Looks at the data and rewrites it. Can also block. E.g., a PII guardrail rewrites "My SSN is 123-45-6789" to "My SSN is REDACTED" and lets the request through. | Runs sequentially by priority (lower = first) |
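To make the distinction concrete, here is a minimal sketch of the two modes as plain functions. The names and the `GuardrailResult` shape are illustrative, not the gateway's actual plugin API:

```python
import re
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    allowed: bool   # False => block the request
    text: str       # possibly rewritten payload (Mutate mode)

def validate_moderation(text: str) -> GuardrailResult:
    """Validate mode: inspect only, never rewrite."""
    blocked_terms = {"hate speech"}  # toy deny-list for illustration
    allowed = not any(term in text.lower() for term in blocked_terms)
    return GuardrailResult(allowed=allowed, text=text)

def mutate_pii(text: str) -> GuardrailResult:
    """Mutate mode: rewrite the payload, and let the request through."""
    redacted = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "REDACTED", text)
    return GuardrailResult(allowed=True, text=redacted)
```

A Validate guardrail's output is a verdict; a Mutate guardrail's output is the (possibly rewritten) payload itself, which is why Mutate guardrails must run one after another while Validate guardrails can fan out in parallel.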

Enforcement Strategy

This decides what happens when a guardrail catches a violation — and also what happens if the guardrail itself has a problem (like a timeout or a provider outage).
| Strategy | On Violation | On Guardrail Error |
| --- | --- | --- |
| Enforce | Block | Block |
| Enforce But Ignore On Error | Block | Let through (graceful degradation) |
| Audit | Let through (log only) | Let through |
How to roll out safely:
  • Start with Audit so you can see what guardrails would catch without affecting users.
  • Once things look right, switch to Enforce But Ignore On Error — you get protection, but a guardrail provider outage won’t take your app down.
  • Move to Enforce when you need strict compliance.
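The enforcement matrix above boils down to a small decision function. This is a sketch with illustrative strategy identifiers, not the gateway's configuration values:

```python
def outcome(strategy: str, violation: bool, guardrail_error: bool) -> str:
    """Return 'block' or 'allow' per the enforcement matrix.
    Strategy names here are illustrative identifiers."""
    if guardrail_error:
        # Only plain Enforce blocks when the guardrail itself fails.
        return "block" if strategy == "enforce" else "allow"
    if violation:
        # Audit logs the violation but lets the request through.
        return "allow" if strategy == "audit" else "block"
    return "allow"
```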

How TrueFoundry AI Gateway Runs Guardrails

Where guardrails run depends on whether you’re making an LLM call or invoking an MCP tool.
LLM requests have two hooks — Input (before the model sees the prompt) and Output (after the model responds):

LLM Input

Runs before the prompt reaches the LLM:
  • PII masking and redaction
  • Prompt injection detection
  • Content moderation

LLM Output

Runs after the LLM responds:
  • Hallucination detection
  • Secrets detection
  • Content filtering
[Diagram: flow of LLM requests through input and output guardrails]
Here’s the order of operations when a request hits the gateway:
  1. Input Mutation guardrails run first and block until they finish (e.g., redacting PII from the prompt).
  2. Input Validation kicks off in the background — it checks for things like prompt injection while the model request is already in flight.
  3. The model request starts with the mutated prompt.
  4. If Input Validation fails while the model is still running, the gateway cancels the model request right away so you don’t pay for it.
  5. Once the model responds, Output Mutation guardrails process the response (e.g., stripping secrets).
  6. Output Validation checks the final result. If it fails, the response is blocked — though model costs have already been incurred at this point.
  7. The clean response goes back to the client.
| Hook | Execution | What Happens on Failure |
| --- | --- | --- |
| Input Validation | Async (parallel with model request) | Model request cancelled |
| Input Mutation | Sync (before model request) | Request blocked |
| Output Mutation | Sync (after model response) | Response blocked |
| Output Validation | Sync (after output mutation) | Response rejected |
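The input-side ordering can be sketched with asyncio: mutation runs first and blocks, then validation races the in-flight model call and cancels it on failure. Everything here is a stand-in (simulated latencies, toy checks), not gateway code:

```python
import asyncio

async def input_mutation(prompt: str) -> str:
    # Sync step: e.g. PII redaction before anything else happens.
    return prompt.replace("123-45-6789", "REDACTED")

async def input_validation(prompt: str) -> bool:
    await asyncio.sleep(0.01)  # simulated guardrail latency
    return "ignore previous instructions" not in prompt.lower()

async def call_model(prompt: str) -> str:
    await asyncio.sleep(0.05)  # simulated model latency
    return f"response to: {prompt}"

async def handle(prompt: str) -> str:
    prompt = await input_mutation(prompt)            # 1. mutation blocks
    model = asyncio.create_task(call_model(prompt))  # 3. model starts
    if not await input_validation(prompt):           # 2. validation in flight
        model.cancel()                               # 4. cancel on failure
        try:
            await model
        except asyncio.CancelledError:
            pass
        return "blocked"
    return await model  # 5-7. output-side hooks omitted for brevity

print(asyncio.run(handle("My SSN is 123-45-6789")))
# prints: response to: My SSN is REDACTED
```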

Latency Impact of Guardrails

Guardrails add processing time — but the gateway is designed to keep that impact small.
  • Input Validation runs in parallel with the model request, so in the happy path, it adds no extra wait time before you see the first token.
  • Input Mutation runs before the model request, so its processing time is added directly.
  • When Input Validation fails, the model request gets cancelled immediately — you don’t pay for a response you were going to throw away.
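On the happy path, the input-side latency math works out as follows. This is a rough model of the behavior described above (input side only), not a measured formula:

```python
def added_first_token_latency_ms(mutation_ms: float,
                                 validation_ms: float,
                                 model_first_token_ms: float) -> float:
    """Extra wait before the first token on the happy path.
    Mutation is strictly serial; validation overlaps the in-flight model
    request, so it only adds time when it outlasts the first token."""
    return mutation_ms + max(0.0, validation_ms - model_first_token_ms)
```

For example, a 20 ms mutation plus a 100 ms validation against a model with a 400 ms first token adds only 20 ms, because the validation is fully hidden behind the model call.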
Execution Flow Examples
  • Happy path: input validation runs in parallel with the model request, and output guardrails process the response before it's returned.
  • Input validation failure: input validation fails while the model is still running, so the model request is cancelled immediately to save costs.
  • Output validation failure: the model finishes, but output validation fails and the response is rejected; model costs are already incurred at this point.
You can track the latency impact of each guardrail in AI Gateway → Monitor → Request Traces. Each guardrail span shows its execution time, result, scope, and which entity it was applied on.
[Screenshot: Request Traces view showing guardrail execution latency, result type, scope, and applied entity for each span]

How to Apply Guardrails

You can attach guardrails in two ways:
Pass the X-TFY-GUARDRAILS header to apply guardrails on a single request. Handy for testing or when different requests need different guardrails.
{
  "llm_input_guardrails": ["my-group/pii-redaction"],
  "llm_output_guardrails": ["my-group/secrets-detection"],
  "mcp_tool_pre_invoke_guardrails": ["my-group/sql-sanitizer"],
  "mcp_tool_post_invoke_guardrails": ["my-group/code-safety"]
}
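In code, the header value is just that object serialized as JSON. A minimal Python sketch (the base URL, API key, and guardrail IDs are placeholders):

```python
import json
import urllib.request

GATEWAY_BASE_URL = "https://gateway.example.com/api/llm"  # placeholder

guardrails = {
    "llm_input_guardrails": ["my-group/pii-redaction"],
    "llm_output_guardrails": ["my-group/secrets-detection"],
}

req = urllib.request.Request(
    f"{GATEWAY_BASE_URL}/chat/completions",
    data=json.dumps({
        "model": "openai/gpt-4o",
        "messages": [{"role": "user", "content": "Hello"}],
    }).encode(),
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
        # The header value is the guardrails object serialized as JSON.
        "X-TFY-GUARDRAILS": json.dumps(guardrails),
    },
)
# urllib.request.urlopen(req)  # not executed here: the URL is a placeholder
```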
Set up guardrail rules in AI Gateway → Controls → Guardrails to apply guardrails automatically based on who's making the request, which model they're calling, or which MCP tool is being used. This is the way to go for org-wide enforcement.

For a step-by-step walkthrough, see the Getting Started guide. For the full policy reference, see Guardrails Configuration.

Supported Guardrails

The AI Gateway ships with built-in guardrails and integrates with a range of external providers — all managed through a single interface.

TrueFoundry Built-in Guardrails

Ready to use out of the box — no external credentials needed.

Secrets Detection

Catches and redacts credentials like AWS keys, API keys, JWT tokens, and private keys.
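In spirit, this works like pattern matching over the payload. The patterns below are a simplified illustration of the idea, not the gateway's actual rule set (which covers many more credential formats):

```python
import re

# Two illustrative credential patterns: AWS access key IDs and JWTs.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "jwt": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+\b"),
}

def redact_secrets(text: str) -> str:
    """Replace every detected secret with a labeled placeholder."""
    for name, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[{name} REDACTED]", text)
    return text
```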

Code Safety Linter

Flags unsafe code patterns — eval, exec, os.system, subprocess calls, dangerous shell commands.

SQL Sanitizer

Catches risky SQL: DROP, TRUNCATE, DELETE/UPDATE without WHERE, string interpolation.
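A toy version of two of these checks, assuming simple pattern matching (a real sanitizer would parse the SQL, and this sketch omits the string-interpolation check):

```python
import re

def risky_sql(query: str) -> bool:
    """Flag DROP/TRUNCATE, and DELETE/UPDATE without a WHERE clause."""
    q = query.strip().rstrip(";")
    if re.match(r"(?i)(DROP|TRUNCATE)\b", q):
        return True
    if re.match(r"(?i)(DELETE|UPDATE)\b", q) and not re.search(r"(?i)\bWHERE\b", q):
        return True
    return False
```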

Regex Pattern Matching

Matches and redacts sensitive patterns (PII, payment cards, credentials) using built-in or custom regex.

Prompt Injection

Detects prompt injection attacks and jailbreak attempts using model-based analysis.

PII Detection

Finds and redacts personally identifiable information with configurable entity categories.

Content Moderation

Blocks harmful content across hate, self-harm, sexual, and violence categories with adjustable thresholds.

Cedar Guardrails

Fine-grained access control for MCP tools using Cedar policy language with default-deny security.
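As an illustration, a Cedar policy granting a single principal access to one MCP tool might look like this (the entity type names `User`, `Action`, and `MCPTool` are hypothetical; the actual schema comes from your gateway configuration):

```cedar
// Hypothetical policy: allow one user to invoke one MCP tool.
// With default-deny, anything not explicitly permitted is blocked.
permit (
  principal == User::"data-analyst",
  action == Action::"invoke",
  resource == MCPTool::"query_database"
);
```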

OPA Guardrails

Fine-grained access control with full policy lifecycle management using Open Policy Agent.

External Providers

We also integrate with third-party guardrail providers. Don’t see yours? Reach out — we’re happy to add it.

OpenAI Moderations

OpenAI’s moderation API for detecting violence, hate speech, harassment, and other policy violations.

AWS Bedrock Guardrail

AWS Bedrock’s guardrail capabilities for AI models.

Azure PII

Azure’s PII detection service for identifying and redacting personal data.

Azure Content Safety

Azure Content Safety for detecting harmful or inappropriate content.

Azure Prompt Shield

Azure Prompt Shield for blocking prompt injection and jailbreak attempts.

Enkrypt AI

Advanced moderation and compliance — toxicity, bias, and sensitive data detection.

Palo Alto Prisma AIRS

Palo Alto AI Risk for content safety and threat detection.

PromptFoo

Promptfoo integration for content moderation guardrails.

Fiddler

Fiddler-Safety and Fiddler-Response-Faithfulness guardrails.

CrowdStrike

API-based security — content moderation, prompt injection detection, toxicity analysis. (Formerly Pangea, acquired by CrowdStrike in 2025.)

Patronus AI

Hallucination detection, prompt injection, PII leakage, toxicity, and bias evaluators.

Google Model Armor

Google Cloud Model Armor for prompt injection, harmful content, PII, and malicious URI detection.

GraySwan Cygnal

Policy violation detection and content safety monitoring powered by GraySwan Cygnal.

Akto

LLM security, prompt injection detection, and policy violation monitoring with native streaming support.

Bring Your Own Guardrail / Plugin

If the built-in and provider integrations don’t cover your use case, you can write your own. Build a custom guardrail with Guardrails.AI, a plain Python function, or any framework you prefer, and plug it into the gateway.

Custom Guardrails

Build and integrate your own guardrail using Guardrails.AI, a Python function, or any custom logic.
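As a sketch, a custom guardrail can be as simple as a function that receives the payload and returns a verdict plus an optionally rewritten payload. The function shape and return keys below are hypothetical; the gateway's actual plugin contract is documented separately:

```python
def no_internal_hosts(text: str) -> dict:
    """Illustrative custom rule: block responses that leak internal hostnames.
    The hostname suffix is a made-up example."""
    if ".internal.example.com" in text:
        return {"allowed": False, "reason": "internal hostname leaked"}
    return {"allowed": True, "text": text}
```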

FAQ

Which messages do guardrails inspect?

By default, guardrails look at all messages in the conversation. If you only care about the latest message, set the X-TFY-GUARDRAILS-SCOPE header:
  • all (default): Checks the full conversation history
  • last: Checks only the most recent message
Using last is faster when you don’t need to scan the whole conversation.
curl -X POST "{GATEWAY_BASE_URL}/chat/completions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H 'X-TFY-GUARDRAILS: {"llm_input_guardrails":["my-group/pii-redaction"]}' \
  -H "X-TFY-GUARDRAILS-SCOPE: last" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, how can you help me today?"}
    ]
  }'
What happens when a guardrail itself fails (timeout, provider outage)?

Depends on the enforcement strategy you picked:
  • Enforce: Request gets blocked.
  • Enforce But Ignore On Error: Request goes through anyway. This is the safest default — you stay protected when guardrails work, but a provider outage won’t break your app.
  • Audit: Request always goes through.
How do I see what guardrails did on each request?

Go to AI Gateway → Monitor → Request Traces. You'll see:
  • Which guardrails ran on each hook
  • Whether they passed or failed, and how long they took
  • What they found (secrets, SQL issues, unsafe patterns, etc.)
  • What mutations were applied
Traces are logged for both successful and blocked requests.
How do output guardrails work with streaming responses?

Streaming and output guardrails are fundamentally at odds — guardrails need the complete response to evaluate it, but streaming sends tokens to the client as they're generated. The gateway handles this by buffering the entire response, running output guardrails on the full text, and only then streaming it back to the client if the guardrails pass. This means you lose the time-to-first-token benefit of streaming when output guardrails are active.

There are alternative strategies, such as running guardrails on batches of tokens as they arrive, but these only work for guardrails that operate on local patterns — for example, regex matching, PII detection, or secrets scanning can catch issues in a chunk without seeing the full response. Guardrails that need the complete output as context — such as hallucination detection, factual consistency checks, or content moderation that evaluates the overall message — cannot work incrementally and still require the full response to be buffered.

In practice:
| Approach | Time to First Token | Works For |
| --- | --- | --- |
| Full buffering (default) | Delayed until guardrails complete | All guardrails — hallucination, moderation, factual checks, etc. |
| Chunked evaluation | Near-normal streaming | Pattern-based guardrails only — regex, PII, secrets scanning |
If low latency matters more than full output coverage, consider using only input guardrails with streaming, and running output guardrails asynchronously for audit purposes.
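Chunked evaluation for a pattern-based guardrail can be sketched as a generator that scans a buffer as tokens stream in. This is an illustration of the idea, not gateway code; the tricky part, handled below, is a pattern split across chunk boundaries:

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # example pattern, max 11 chars

def stream_with_redaction(chunks, pattern=SSN, max_len=11):
    """Yield redacted output incrementally, holding back just enough text
    that a match can never straddle an emitted boundary."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Tentatively keep the last max_len-1 chars: a partial match that
        # completes in a later chunk must lie entirely within them.
        cut = max(0, len(buffer) - (max_len - 1))
        # Never cut through a match that is already complete in the buffer.
        for m in pattern.finditer(buffer):
            if m.start() < cut < m.end():
                cut = m.start()
        yield pattern.sub("REDACTED", buffer[:cut])
        buffer = buffer[cut:]
    yield pattern.sub("REDACTED", buffer)  # flush the held-back tail
```

This only works because an SSN is a local pattern with a bounded length; a hallucination check has no such bound, which is why it needs the full buffered response.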