Skip to main content
This section is applicable only if you are self-hosting the Gateway plane. In case Truefoundry is hosting the Gateway, everything is managed by Truefoundry and you don’t need to setup anything here.
TrueFoundry exposes metrics that can be used to set different alerts. Here are few recommended Alerts:
  1. Gateway pod not healthy
sum by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown", namespace=~"truefoundry", pod=~"truefoundry-tfy-llm-gateway-.*"}) > 0
  1. Gateway pod restarts in last 5 min
increase(kube_pod_container_status_restarts_total{job="kube-state-metrics", namespace=~"truefoundry", pod=~"truefoundry-tfy-llm-gateway-.*"}[5m]) > 0
  1. Gateway request failing for 5xx error code in last 5 min
sum(rate(http_request_duration_seconds_count{status_code=~"^5..$", container="tfy-llm-gateway", namespace="truefoundry"}[5m])) / sum (rate(http_request_duration_seconds_count{container="tfy-llm-gateway", namespace="truefoundry"}[5m])) * 100 > 0