Alerting

The TrueFoundry AI Gateway exposes many metrics that you may want to alert on. Currently, it is not possible to set alerts on Gateway metrics if you are using the managed version of the gateway; this is in progress and will be available soon. A self-hosted gateway exposes its metrics on the /metrics endpoint, which can be scraped by Prometheus or any other monitoring solution you use. You can read more about these metrics here.
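As an illustration of scraping the /metrics endpoint, a Prometheus scrape job for a self-hosted gateway running on Kubernetes might look roughly like the sketch below. The job name and scrape interval are assumptions chosen for illustration. The namespace `truefoundry` and container name `tfy-llm-gateway` are taken from the alert queries on this page and may differ in your installation. If you use the Prometheus Operator, a ServiceMonitor or PodMonitor would play the same role.

```yaml
# prometheus.yml (fragment): a minimal sketch, not an official TrueFoundry config.
scrape_configs:
  - job_name: tfy-llm-gateway            # hypothetical job name
    metrics_path: /metrics               # endpoint exposed by the self-hosted gateway
    scrape_interval: 30s                 # assumed interval; tune to your needs
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - truefoundry                # assumed namespace, as used in the alert queries
    relabel_configs:
      # Keep only targets that belong to the gateway container.
      - source_labels: [__meta_kubernetes_pod_container_name]
        regex: tfy-llm-gateway
        action: keep
      # Expose namespace, pod, and container as labels so the alert queries can filter on them.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
```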

The rest of this section applies only if you are self-hosting the Gateway plane.

If you are self-hosting the gateway, we recommend setting up at least the following three alerts. You can add more according to your needs.

Gateway pods not healthy

sum by (namespace, pod) (
  kube_pod_status_phase {
    job="kube-state-metrics",
    namespace=~"truefoundry",
    phase=~"Pending|Unknown",
    pod=~"truefoundry-tfy-llm-gateway-.*"
  }
) > 0

Gateway pod restarts in the last 5 min

increase(
  kube_pod_container_status_restarts_total {
    job="kube-state-metrics",
    namespace=~"truefoundry",
    pod=~"truefoundry-tfy-llm-gateway-.*"
  }[5m]
) > 0

Gateway requests failing with 5xx status codes in the last 5 min

(
  sum(
    rate(
      http_request_duration_seconds_count{
        status_code=~"^5..$",
        container="tfy-llm-gateway",
        namespace="truefoundry"
      }[5m]
    )
  )
  /
  sum(
    rate(
      http_request_duration_seconds_count{
        container="tfy-llm-gateway",
        namespace="truefoundry"
      }[5m]
    )
  )
) * 100 > 0
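The three expressions on this page can be packaged as Prometheus alerting rules. The following rules file is a minimal sketch: the group name, alert names, `for` durations, severity labels, and annotation texts are assumptions chosen for illustration, while the `expr` values are the queries from this page unchanged.

```yaml
# gateway-alerts.yml: a sketch of a Prometheus rules file, not an official TrueFoundry artifact.
groups:
  - name: truefoundry-gateway            # hypothetical group name
    rules:
      - alert: GatewayPodsNotHealthy     # hypothetical alert name
        expr: |
          sum by (namespace, pod) (
            kube_pod_status_phase{
              job="kube-state-metrics",
              namespace=~"truefoundry",
              phase=~"Pending|Unknown",
              pod=~"truefoundry-tfy-llm-gateway-.*"
            }
          ) > 0
        for: 5m                          # assumed: fire only if the condition persists
        labels:
          severity: critical             # assumed severity
        annotations:
          summary: "Gateway pod {{ $labels.pod }} is in Pending or Unknown phase"

      - alert: GatewayPodRestarts        # hypothetical alert name
        expr: |
          increase(
            kube_pod_container_status_restarts_total{
              job="kube-state-metrics",
              namespace=~"truefoundry",
              pod=~"truefoundry-tfy-llm-gateway-.*"
            }[5m]
          ) > 0
        labels:
          severity: warning              # assumed severity
        annotations:
          summary: "Gateway pod {{ $labels.pod }} restarted in the last 5 minutes"

      - alert: GatewayHigh5xxRate        # hypothetical alert name
        expr: |
          (
            sum(rate(http_request_duration_seconds_count{
              status_code=~"^5..$",
              container="tfy-llm-gateway",
              namespace="truefoundry"
            }[5m]))
            /
            sum(rate(http_request_duration_seconds_count{
              container="tfy-llm-gateway",
              namespace="truefoundry"
            }[5m]))
          ) * 100 > 0
        for: 5m                          # assumed: avoids paging on a single failed request
        labels:
          severity: critical             # assumed severity
        annotations:
          summary: '{{ $value | printf "%.2f" }}% of gateway requests returned 5xx'
```

A threshold of `> 0` fires on any 5xx response. If that proves too noisy, raising the threshold (for example to 1 percent) is a reasonable adjustment. The file can be checked with `promtool check rules gateway-alerts.yml` before loading it into Prometheus.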
