Compaction reduces context size while preserving state for the next turn, so you can balance quality, cost, and latency as conversations grow.
The gateway supports standalone POST /responses/compact and server-side compaction via context_management in POST /responses.
Compaction is currently supported only for OpenAI models. Use the gateway’s OpenAI inference base URL.
Standalone: POST /responses/compact
Send a full context window; the API returns a compacted window (including an opaque, encrypted compaction item) to pass as input to your next /responses call. The request body accepts model and input, plus optional instructions and previous_response_id.
Do not prune the compacted response; pass its full output into your next /responses call as-is.
Code
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-tfy-api-key",
    base_url="https://{controlPlaneUrl}/api/llm",
)

# Compact the accumulated context window.
compacted = client.responses.compact(
    model="openai-main/gpt-4o",
    input=long_input_items_array,
)

# Pass the full compacted output (including the encrypted compaction item) as-is.
next_input = [
    *compacted.output,
    {"type": "message", "role": "user", "content": user_input_message()},
]

next_response = client.responses.create(
    model="openai-main/gpt-4o",
    input=next_input,
    store=False,
)
```
Response shape
```json
{
  "id": "resp_compact_123",
  "object": "response.compaction",
  "created_at": 1234567890,
  "output": [
    { "type": "compaction", "encrypted_content": "..." }
  ],
  "usage": {
    "input_tokens": 15000,
    "output_tokens": 1200,
    "total_tokens": 16200
  }
}
```
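The usage fields above let you gauge how much a compaction pass saved. A minimal sketch (the helper name is illustrative; field names follow the response shape above):

```python
def compaction_savings(usage: dict) -> float:
    """Fraction of input tokens eliminated by a compaction pass."""
    return 1 - usage["output_tokens"] / usage["input_tokens"]

# With the example usage above: 1 - 1200/15000 = 0.92
savings = compaction_savings(
    {"input_tokens": 15000, "output_tokens": 1200, "total_tokens": 16200}
)
print(f"{savings:.0%}")  # 92%
```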
Server-side: POST /responses with context_management
Set context_management: [{"type": "compaction", "compact_threshold": 200000}] on create. When the rendered token count crosses the threshold, the server compacts and emits a compaction item in the stream. No separate /responses/compact call needed.
- Stateless: append each response's output (including compaction items) to your input every turn.
- Stateful: use previous_response_id and send only the new user message; do not manually prune.
With stateless chaining, you can drop items that came before the most recent compaction item to keep requests smaller. With previous_response_id, do not prune.
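The stateless pruning rule above can be sketched as a small helper (the function name is illustrative; items are dicts shaped like the response output shown earlier):

```python
def prune_before_last_compaction(items: list[dict]) -> list[dict]:
    """Keep only the most recent compaction item and everything after it."""
    last = None
    for i, item in enumerate(items):
        if item.get("type") == "compaction":
            last = i
    # No compaction item yet: nothing can be safely dropped.
    return items if last is None else items[last:]
```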
Code
```python
conversation = [
    {"type": "message", "role": "user", "content": "Let's begin a long coding task."},
]

while keep_going:
    response = client.responses.create(
        model="openai-main/gpt-4o",
        input=conversation,
        store=False,
        context_management=[{"type": "compaction", "compact_threshold": 200000}],
    )
    # Append the full output (including any compaction items) before the next user turn.
    conversation.extend(response.output)
    conversation.append(
        {"type": "message", "role": "user", "content": get_next_user_input()},
    )
```
References