Truefoundry AI Gateway exposes many metrics that you may want to alert on. Currently, it is not possible to set alerts on Gateway metrics if you are using the managed version of the gateway. We are working on this, and it will be available soon.
For a self-hosted gateway, the gateway exposes metrics on the /metrics endpoint, which can be scraped by Prometheus or any other monitoring solution you use. You can read more about these metrics here.
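As a rough illustration of scraping the /metrics endpoint, here is a minimal Prometheus scrape configuration sketch using Kubernetes pod discovery. It is not an official Truefoundry configuration. The job name is hypothetical, and the sketch assumes the gateway pods declare the container port that serves /metrics. The namespace and pod name pattern are taken from the alert queries later on this page, and the relabeling sets the `namespace`, `pod`, and `container` labels that those queries filter on.

```yaml
scrape_configs:
  - job_name: tfy-llm-gateway          # hypothetical job name
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [truefoundry]         # namespace used in the queries on this page
    relabel_configs:
      # Keep only the gateway pods (pattern taken from the alert queries).
      - source_labels: [__meta_kubernetes_pod_name]
        regex: truefoundry-tfy-llm-gateway-.*
        action: keep
      # Expose the Kubernetes metadata as the labels the queries filter on.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
```

If you run the Prometheus Operator, a ServiceMonitor or PodMonitor would typically take the place of this block; adapt the selectors to your installation.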
The rest of this section applies only if you are self-hosting the gateway plane.
In that case, we recommend setting up the following three alerts. You can add more according to your needs, but these three should be in place.
- Gateway pods not healthy
  sum by (namespace, pod) (
    kube_pod_status_phase{
      job="kube-state-metrics",
      namespace=~"truefoundry",
      phase=~"Pending|Unknown",
      pod=~"truefoundry-tfy-llm-gateway-.*"
    }
  ) > 0
- Gateway pod restarted in the last 5 minutes
  increase(
    kube_pod_container_status_restarts_total{
      job="kube-state-metrics",
      namespace=~"truefoundry",
      pod=~"truefoundry-tfy-llm-gateway-.*"
    }[5m]
  ) > 0
- Gateway requests failing with 5xx status codes in the last 5 minutes (percentage of all requests)
  (
    sum(
      rate(
        http_request_duration_seconds_count{
          status_code=~"^5..$",
          container="tfy-llm-gateway",
          namespace="truefoundry"
        }[5m]
      )
    )
    /
    sum(
      rate(
        http_request_duration_seconds_count{
          container="tfy-llm-gateway",
          namespace="truefoundry"
        }[5m]
      )
    )
  ) * 100 > 0
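
One way to wire the three queries above into Prometheus is an alerting rules file. The sketch below is a minimal example, not an official Truefoundry configuration. The group name, alert names, `for` durations, severity labels, and annotation text are all assumptions chosen for illustration; only the expressions come from this page.

```yaml
groups:
  - name: tfy-llm-gateway-alerts        # hypothetical group name
    rules:
      - alert: GatewayPodsNotHealthy    # hypothetical alert name
        expr: |
          sum by (namespace, pod) (
            kube_pod_status_phase{
              job="kube-state-metrics",
              namespace=~"truefoundry",
              phase=~"Pending|Unknown",
              pod=~"truefoundry-tfy-llm-gateway-.*"
            }
          ) > 0
        for: 5m                         # assumed: fire only if the pod stays unhealthy
        labels:
          severity: critical            # assumed severity
        annotations:
          summary: "Gateway pod {{ $labels.pod }} is Pending or Unknown"

      - alert: GatewayPodRestarted      # hypothetical alert name
        expr: |
          increase(
            kube_pod_container_status_restarts_total{
              job="kube-state-metrics",
              namespace=~"truefoundry",
              pod=~"truefoundry-tfy-llm-gateway-.*"
            }[5m]
          ) > 0
        labels:
          severity: warning             # assumed severity
        annotations:
          summary: "Gateway pod {{ $labels.pod }} restarted in the last 5 minutes"

      - alert: GatewayHigh5xxRate       # hypothetical alert name
        expr: |
          (
            sum(rate(http_request_duration_seconds_count{
              status_code=~"^5..$",
              container="tfy-llm-gateway",
              namespace="truefoundry"
            }[5m]))
            /
            sum(rate(http_request_duration_seconds_count{
              container="tfy-llm-gateway",
              namespace="truefoundry"
            }[5m]))
          ) * 100 > 0
        for: 5m                         # assumed duration
        labels:
          severity: critical            # assumed severity
        annotations:
          summary: "Gateway is returning 5xx responses ({{ $value }}% of requests)"
```

A threshold of 0 on the 5xx percentage fires on any single failed request, so you may want to raise it to a value that matches your error budget. Before loading the file, you can validate it with `promtool check rules <file>`.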