Content Moderation can be applied to all four guardrail hooks: LLM Input, LLM Output, MCP Pre Tool, and MCP Post Tool, providing comprehensive content safety across your entire AI workflow.
What is Content Moderation?
Content Moderation is a built-in TrueFoundry guardrail that analyzes text content for harmful material across four safety categories: hate speech, self-harm, sexual content, and violence. It uses a model-based approach with configurable severity thresholds so you can tune sensitivity to match your use case. The guardrail is fully managed by TrueFoundry; no external credentials or setup are required.
Key Features
- Four Safety Categories: Detects harmful content across:
- Hate — Hate speech, discrimination, and derogatory content
- SelfHarm — Self-injury, suicide-related content
- Sexual — Sexually explicit or suggestive content
- Violence — Violent content, threats, and graphic descriptions
- Configurable Severity Threshold: Set the sensitivity level (0–6) to control what gets flagged — from safe content only to high-risk content, allowing you to balance safety with usability.
- Selective Category Detection: Choose which categories to monitor. Enable all four or only the ones relevant to your application.
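The category names and severity scale can be represented conceptually as follows. This is an illustrative sketch only; the enum names are assumptions for the example and are not a TrueFoundry API.

```python
from enum import Enum, IntEnum

class ContentCategory(str, Enum):
    """The four safety categories checked by the guardrail."""
    HATE = "Hate"           # hate speech, discrimination, derogatory content
    SELF_HARM = "SelfHarm"  # self-injury, suicide-related content
    SEXUAL = "Sexual"       # sexually explicit or suggestive content
    VIOLENCE = "Violence"   # violent content, threats, graphic descriptions

class Severity(IntEnum):
    """Severity scale used by the guardrail (0-6)."""
    SAFE = 0    # no harmful content detected
    LOW = 2     # mildly concerning content (default threshold)
    MEDIUM = 4  # moderately harmful content
    HIGH = 6    # severely harmful content
```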
Adding Content Moderation Guardrail
Create or Select a Guardrails Group
Create a new guardrails group or select an existing one where you want to add the Content Moderation guardrail.
Add Content Moderation Integration
Click on Add Guardrail and select Content Moderation from the TrueFoundry Guardrails section.

Configure the Guardrail
Fill in the configuration form:
- Name: Enter a unique name for this guardrail configuration (e.g., content-moderation)
- Severity Threshold: Set the minimum severity level to flag (default: 2)
- Categories: Select which content categories to check
- Enforcing Strategy: Choose how violations are handled

Configuration Options
| Parameter | Description | Default |
|---|---|---|
| Name | Unique identifier for this guardrail | Required |
| Operation | validate only (detects and blocks, no mutation) | validate |
| Enforcing Strategy | enforce, enforce_but_ignore_on_error, or audit | enforce |
| Severity Threshold | Minimum severity level (0–6) to flag content | 2 |
| Categories | Array of content categories to check | Required |
Content Moderation only supports validate mode — it detects and blocks harmful content but does not modify it. See Guardrails Overview for details on Enforcing Strategy.
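As an illustration, the parameters above could be assembled into a configuration like the one sketched below. The exact field names and structure TrueFoundry expects may differ, so treat the keys here as assumptions and follow the configuration form in the UI.

```python
# Hypothetical configuration mirroring the parameters in the table above.
# Field names are illustrative assumptions, not TrueFoundry's exact schema.
content_moderation_guardrail = {
    "name": "content-moderation",     # unique identifier for this guardrail
    "operation": "validate",          # only validate is supported
    "enforcing_strategy": "enforce",  # enforce | enforce_but_ignore_on_error | audit
    "severity_threshold": 2,          # flag content at Low severity or above
    "categories": ["Hate", "SelfHarm", "Sexual", "Violence"],  # categories to check
}
```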
Categories and Severity Levels
Content Categories
| Category | Description |
|---|---|
| Hate | Content expressing hatred, discrimination, or derogation based on identity characteristics |
| SelfHarm | Content related to self-injury, suicide, or self-destructive behavior |
| Sexual | Sexually explicit or suggestive content |
| Violence | Content depicting or promoting physical violence, threats, or graphic injury |
Severity Levels
The severity threshold controls how sensitive the detection is. Content is flagged when any category’s severity meets or exceeds the threshold.
| Severity | Level | Description |
|---|---|---|
| 0 | Safe | No harmful content detected |
| 2 | Low | Mildly concerning content (default threshold) |
| 4 | Medium | Moderately harmful content |
| 6 | High | Severely harmful content |
How It Works
The guardrail analyzes content and returns severity scores (0–6) for each enabled category. If any category’s severity meets or exceeds the configured threshold, the content is flagged.
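For example, the flagging decision can be sketched as follows. This is a simplified illustration of the thresholding logic described above, using hypothetical per-category scores; it is not the actual guardrail implementation.

```python
# Hypothetical severity scores (0-6) returned for each enabled category.
scores = {"Hate": 0, "SelfHarm": 0, "Sexual": 2, "Violence": 4}

severity_threshold = 2  # default threshold

# Content is flagged if ANY enabled category meets or exceeds the threshold.
flagged_categories = {
    category: severity
    for category, severity in scores.items()
    if severity >= severity_threshold
}

if flagged_categories:
    # With enforcing strategy "enforce", the request would be blocked here.
    print(f"Blocked: {flagged_categories}")  # Blocked: {'Sexual': 2, 'Violence': 4}
else:
    print("Content passed moderation")
```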
Use Cases
Recommended Hook Usage
| Hook | Use Case |
|---|---|
| LLM Input | Block harmful user inputs before they reach the LLM |
| LLM Output | Ensure LLM responses don’t contain harmful content |
| MCP Pre Tool | Validate tool parameters for harmful content |
| MCP Post Tool | Check tool outputs for harmful content |
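To make the hook table concrete, here is a conceptual sketch of how a moderation check wraps an LLM call at the LLM Input and LLM Output hooks. All functions are assumed placeholders, not TrueFoundry APIs; the gateway applies the guardrail automatically at the configured hooks, so you do not write this code yourself.

```python
# Illustrative only: conceptual flow of the LLM Input / LLM Output hooks.
# Every function here is a stand-in, not a TrueFoundry API.

def score_content(text: str) -> dict[str, int]:
    """Stub for the model-based check; returns per-category severities (0-6)."""
    return {"Hate": 0, "SelfHarm": 0, "Sexual": 0, "Violence": 0}

def call_llm(prompt: str) -> str:
    """Stub for the underlying LLM call."""
    return "model response"

def moderate(text: str, threshold: int = 2) -> bool:
    """Return True if any category's severity meets or exceeds the threshold."""
    return any(severity >= threshold for severity in score_content(text).values())

def guarded_completion(user_input: str) -> str:
    if moderate(user_input):         # LLM Input hook: block harmful prompts
        raise ValueError("Input rejected by Content Moderation guardrail")
    response = call_llm(user_input)  # placeholder LLM call
    if moderate(response):           # LLM Output hook: block harmful responses
        raise ValueError("Response rejected by Content Moderation guardrail")
    return response
```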