When you interact with a prompt in the AI Gateway playground, the playground UI renders the tool calls, their arguments, the tool results, and the LLM responses as they are streamed back from the gateway. If you want to do the same in your own application, or in a UI other than the TrueFoundry playground, you can use the Agent API described below.
http POST https://${TFY_CONTROL_PLANE_BASE_URL}/api/llm/agent/chat/completions \
  Authorization:"Bearer ${TFY_API_TOKEN}" \
  Content-Type:application/json \
  model=openai/gpt-4o \
  stream:=true \
  messages:='[{"role":"user","content":"Help me find a model that can generate images"}]' \
  mcp_servers:='[{"integration_fqn":"common-tools","enable_all_tools":false,"tools":[{"name":"web_search"}]}]'
curl --location 'https://${TFY_CONTROL_PLANE_BASE_URL}/api/llm/agent/chat/completions' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer ${TFY_API_TOKEN}" \
  --data-raw '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Help me find a model that can generate images"}],
    "stream": true,
    "mcp_servers": [{
      "integration_fqn": "common-tools",
      "enable_all_tools": false,
      "tools": [{ "name": "web_search" }]
    }]
  }'
When you have MCP servers already registered in your TrueFoundry AI Gateway, you can reference them using their integration_fqn:
http POST https://${TFY_CONTROL_PLANE_BASE_URL}/api/llm/agent/chat/completions \
  Authorization:"Bearer ${TFY_API_TOKEN}" \
  Content-Type:application/json \
  model=openai/gpt-4o \
  stream:=true \
  messages:='[{"role":"user","content":"Search for Python tutorials and run a simple code example"}]' \
  mcp_servers:='[{"integration_fqn":"common-tools","enable_all_tools":false,"tools":[{"name":"web_search"},{"name":"code_executor"}]}]'
Method 2: Using x-tfy-mcp-headers header (Registered servers only)
For Agent API requests with registered MCP servers, you can also pass custom headers using the x-tfy-mcp-headers HTTP header. This method uses an FQN-based format where you specify headers for each MCP server using its fully qualified name (FQN).
The x-tfy-mcp-headers header only works with registered MCP servers (using integration_fqn). For external servers (using url), you must use the headers field within the mcp_servers array as shown in Method 1.
Format: The header value should be a JSON string with the structure:
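The exact schema is not shown on this page, but given the FQN-based description above, a reasonable sketch is a JSON object keyed by each registered server's FQN, mapping to the headers that should be forwarded to that server. The FQN and header name below are placeholders, not real values:

x-tfy-mcp-headers: '{"<mcp-server-integration-fqn>": {"X-Custom-Header": "custom-value"}}'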
stream: Whether to stream responses (only true is supported)
iteration_limit (number, optional, default 5): Maximum tool call iterations (1-20)
About tool call iterations: An iteration represents a full loop of user → model → tool call → tool result → model. The iteration_limit sets the maximum number of such loops per request to prevent runaway chains.
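For reference, this is how the limit can be set in the request body. The snippet below is illustrative and assumes iteration_limit is a top-level field of the request body, as suggested by the parameter list above; the complete Python example at the end of this page passes the same value via extra_body:

{
  "model": "openai/gpt-4o",
  "stream": true,
  "messages": [{"role": "user", "content": "Help me find a model that can generate images"}],
  "mcp_servers": [{"integration_fqn": "common-tools", "enable_all_tools": false, "tools": [{"name": "web_search"}]}],
  "iteration_limit": 10
}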
The Chat API uses Server-Sent Events (SSE) to stream responses in real-time. This includes assistant text, tool calls (function names and their arguments), and tool results.
Both assistant content and tool call arguments are streamed incrementally across multiple chunks. You must accumulate these fragments to build complete responses.
Compatibility: The streaming format follows OpenAI Chat Completions streaming semantics. See the official guide: OpenAI streaming responses. In addition, the Gateway emits tool result chunks as extra delta events (with role: "tool", tool_call_id, and content) to carry tool outputs.
Role emission differs by provider: some send the role only on the first chunk of a message, while others repeat it on later chunks. Clients should capture the role from the first chunk that carries it, reuse it for subsequent chunks, and safely ignore repeated role fields.
Concatenate delta.content across chunks to build the full assistant message.
Append function.arguments fragments per index to reconstruct full arguments.
Completion of this phase is indicated by finish_reason: "tool_calls"; a minimal sketch of this accumulation follows below.
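The following sketch shows these three points, assuming the same OpenAI Python client and AGENT_CONFIG setup as the complete example at the end of this page. It accumulates only a single assistant message; the get_messages function later on this page handles full multi-message Agent API streams by tracking chunk IDs.

stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Help me find a model that can generate images"}],
    **AGENT_CONFIG,
)

role = None
content = ""       # assistant text, accumulated across chunks
tool_calls = {}    # tool_call.index -> {"id", "name", "arguments"}

for chunk in stream:
    if not chunk.choices:
        continue
    choice = chunk.choices[0]
    delta = choice.delta

    # Capture the role from the first chunk that carries it; ignore repeats
    if role is None and delta.role:
        role = delta.role

    # Concatenate delta.content across chunks
    if delta.content:
        content += delta.content

    # Append function.arguments fragments per index
    for tc in delta.tool_calls or []:
        entry = tool_calls.setdefault(tc.index, {"id": tc.id, "name": None, "arguments": ""})
        if tc.function and tc.function.name:
            entry["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            entry["arguments"] += tc.function.arguments

    # The tool-call phase of this assistant message is complete
    if choice.finish_reason == "tool_calls":
        print("Assistant requested tools:", [t["name"] for t in tool_calls.values()])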
Anthropic-specific behavior: Anthropic may stream an empty string for tool-call arguments ("arguments": ""). When invoking the tool, their API expects a valid JSON object. Normalize empty arguments to {} before issuing the call.
args = tool_call.function.arguments or ""
if args.strip() == "":
    args = "{}"  # Anthropic requires a valid (empty) JSON object
You can generate a ready-to-use code snippet directly from the AI Gateway web UI:
Go to the Playground or your MCP Server group in the AI Gateway.
Click the API Code Snippet button.
Copy the generated code and use it in your application.
The generated code snippet from the playground will only show the last assistant message, and will not show tool calls and results from that conversation.
The get_messages function processes the streaming response to reconstruct complete messages. Let’s break it down:
1. Initialize and detect new messages
def get_messages(chat_stream):
    messages = []
    previous_chunk_id = None

    for chunk in chat_stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta

        # Detect new message when chunk ID changes
        if chunk.id != previous_chunk_id:
            previous_chunk_id = chunk.id
            messages.append({"role": delta.role, "content": ""})

        current_message = messages[-1]
What’s happening: Each streaming chunk has an ID. When the ID changes, it signals a new message starting (assistant, tool result, etc.). We create a new message object with the role and empty content.
2. Handle tool result messages
# Set tool_call_id for tool result messages
if delta.role == 'tool':
    current_message["tool_call_id"] = delta.tool_call_id
What’s happening: Tool result messages have role: "tool" and include a tool_call_id that links the result back to the specific tool call that generated it.
3. Accumulate message content
# Accumulate content for all message types (assistant text, tool results)
current_message["content"] += delta.content if delta.content else ""
What’s happening: Both assistant responses and tool results stream their content incrementally. We concatenate each chunk’s content to build the complete message.
4. Handle tool calls (function name and arguments)
# Process tool calls - function names and arguments are streamed
for tool_call in delta.tool_calls or []:
    # Initialize tool_calls array if needed
    if "tool_calls" not in current_message:
        current_message["tool_calls"] = []

    # Add new tool call if index exceeds current array length
    if tool_call.index >= len(current_message["tool_calls"]):
        current_message["tool_calls"].append({
            "id": tool_call.id,
            "type": "function",
            "function": {
                "name": tool_call.function.name,
                "arguments": tool_call.function.arguments or ""
            }
        })
        # If the first tool call event already has arguments, continue to the next event
        if tool_call.function.arguments:
            continue

    # Accumulate function arguments (streamed incrementally)
    current_message["tool_calls"][tool_call.index]["function"]["arguments"] += tool_call.function.arguments or ""
What’s happening:
Tool calls are streamed with function names first, then arguments in chunks
Each tool call has an index to handle multiple simultaneous tool calls
We accumulate the arguments string as it streams in (like {"query": "Python tutorials"})
5. Apply Anthropic fix for empty arguments
# Anthropic fix: normalize empty argument strings to valid JSON
for msg in messages:
    if msg["role"] == "assistant" and len(msg.get("tool_calls", [])) > 0:
        for tool_call in msg["tool_calls"]:
            if not tool_call["function"]["arguments"].strip():
                tool_call["function"]["arguments"] = "{}"

return messages
What’s happening: Anthropic models sometimes send empty strings "" for tool arguments, but the OpenAI format expects "{}" for empty JSON objects. We normalize this.
Putting it together, you can run a multi-turn conversation with the chat_with_agent helper (defined in the complete code below) and print the reconstructed messages:

conversation = [
    {"role": "user", "content": "Search for Python tutorials and run a simple hello world example"}
]
conversation = chat_with_agent(conversation)

conversation.append(
    {"role": "user", "content": "Now scrape a Python documentation page and extract key concepts"}
)
conversation = chat_with_agent(conversation)

for message in conversation:
    if message["role"] in ["user", "assistant"]:
        print(f'{message["role"].title()}: {message["content"]}')
    elif message["role"] == "tool":
        print(f'Tool Result ({message["tool_call_id"]}):\n{message["content"]}')
    if message["role"] == "assistant" and len(message.get("tool_calls", [])) > 0:
        for tool_call in message["tool_calls"]:
            print(f'Tool call id: {tool_call["id"]}')
            print(f'Tool call function: {tool_call["function"]["name"]}')
            print(f'Tool call arguments: {tool_call["function"]["arguments"]}\n\n')
    print()
Complete working code
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-control-plane-url>/api/llm/agent",
    api_key="****",
)

# Common configuration
AGENT_CONFIG = {
    "model": "openai/gpt-4o",
    "stream": True,
    "extra_body": {
        "mcp_servers": [{
            "integration_fqn": "common-tools",
            "enable_all_tools": False,
            "tools": [{"name": "web_search"}, {"name": "code_executor"}]
        }],
        "iteration_limit": 10
    }
}


def get_messages(chat_stream):
    messages = []
    previous_chunk_id = None

    for chunk in chat_stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta

        # Detect new message when chunk ID changes
        if chunk.id != previous_chunk_id:
            previous_chunk_id = chunk.id
            messages.append({"role": delta.role, "content": ""})

        current_message = messages[-1]

        # Set tool_call_id for tool result messages
        if delta.role == 'tool':
            current_message["tool_call_id"] = delta.tool_call_id

        # Accumulate content for all message types (assistant text, tool results)
        current_message["content"] += delta.content if delta.content else ""

        # Process tool calls - function names and arguments are streamed
        for tool_call in delta.tool_calls or []:
            # Initialize tool_calls array if needed
            if "tool_calls" not in current_message:
                current_message["tool_calls"] = []

            # Add new tool call if index exceeds current array length
            if tool_call.index >= len(current_message["tool_calls"]):
                current_message["tool_calls"].append({
                    "id": tool_call.id,
                    "type": "function",
                    "function": {
                        "name": tool_call.function.name,
                        "arguments": tool_call.function.arguments or ""
                    }
                })
                # If the first tool call event already has arguments, continue to the next event
                if tool_call.function.arguments:
                    continue

            # Accumulate function arguments (streamed incrementally)
            current_message["tool_calls"][tool_call.index]["function"]["arguments"] += tool_call.function.arguments or ""

    # Anthropic fix: normalize empty argument strings to valid JSON
    for msg in messages:
        if msg["role"] == "assistant" and len(msg.get("tool_calls", [])) > 0:
            for tool_call in msg["tool_calls"]:
                if not tool_call["function"]["arguments"].strip():
                    tool_call["function"]["arguments"] = "{}"

    return messages


def chat_with_agent(messages):
    stream = client.chat.completions.create(messages=messages, **AGENT_CONFIG)
    messages += get_messages(stream)
    return messages


# Example usage
conversation = [
    {"role": "user", "content": "Search for Python tutorials and run a simple hello world example"}
]
conversation = chat_with_agent(conversation)

conversation.append(
    {"role": "user", "content": "Now scrape a Python documentation page and extract key concepts"}
)
conversation = chat_with_agent(conversation)

# Print the conversation
for message in conversation:
    if message["role"] in ["user", "assistant"]:
        print(f'{message["role"].title()}: {message["content"]}')
    elif message["role"] == "tool":
        print(f'Tool Result ({message["tool_call_id"]}):\n{message["content"]}')
    if message["role"] == "assistant" and len(message.get("tool_calls", [])) > 0:
        for tool_call in message["tool_calls"]:
            print(f'Tool call id: {tool_call["id"]}')
            print(f'Tool call function: {tool_call["function"]["name"]}')
            print(f'Tool call arguments: {tool_call["function"]["arguments"]}\n\n')
    print()