Benchmarking the TrueFoundry LLM Gateway: it's blazing fast ⚡

November 12, 2024

Highlights

  • TrueFoundry LLM Gateway provides a unified OpenAI-compatible interface to various LLM providers such as Anthropic, OpenAI, Bedrock, Gemini, and many others
  • TrueFoundry LLM Gateway scales seamlessly to 350 RPS on a single replica with 1 vCPU while using 270 MB of memory. We compared it with another gateway product, LiteLLM, on a similar setup, and LiteLLM failed to scale beyond 50 RPS
  • TrueFoundry LLM Gateway adds only 3-5 ms of extra latency per request, while LiteLLM adds 15-30 ms

Why does your org need an LLM Gateway?

An LLM Gateway provides a unified interface to manage your organisation's LLM usage:

  • Unified API: Access multiple LLM providers through a single OpenAI-compatible interface with no code changes needed (see the client sketch after this list)
  • API Key Security: Secure, centralised credential management
  • Governance & Control: Set limits, access controls, and content filtering
  • Rate Limiting: Prevent abuse and ensure fair usage
  • Observability: Track usage, costs, latency and performance
  • Load Balancing: Route requests across providers automatically
  • Cost Management: Monitor spending and set budget alerts
  • Audit Trails: Log all LLM interactions for compliance
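
As an illustration of the unified API, the standard OpenAI SDK can simply be pointed at the gateway instead of the provider. This is a minimal sketch: the base URL, API key, and model identifier below are placeholders for your own deployment, not fixed values.

```python
from openai import OpenAI

# Point the standard OpenAI client at the gateway instead of api.openai.com.
# The base URL, key, and model name are placeholders for your own deployment.
client = OpenAI(
    base_url="https://your-gateway.example.com/api/llm",  # hypothetical gateway URL
    api_key="your-truefoundry-api-key",                   # issued by the gateway, not the provider
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # hypothetical provider/model identifier
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Switching providers is then just a matter of changing the model identifier; the client code stays the same.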

How fast is TrueFoundry LLM Gateway?

Load Test Setup

For our load-testing experiment, we deployed a fake OpenAI endpoint service using TrueFoundry. The service simulates the OpenAI request and response format without actually producing any tokens.
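
Such a fake endpoint boils down to returning a canned OpenAI-style payload. Here is a minimal sketch of the idea in FastAPI; this is an illustration, not TrueFoundry's actual service, with response fields mirroring the OpenAI chat completions schema:

```python
from fastapi import FastAPI

app = FastAPI()

# Stand-in for the real OpenAI API: accepts the same request shape and
# returns a canned chat-completion payload without generating any tokens.
@app.post("/v1/chat/completions")
async def fake_chat_completions(body: dict):
    return {
        "id": "chatcmpl-fake",
        "object": "chat.completion",
        "model": body.get("model", "gpt-3.5-turbo"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "This is a fake response."},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }
```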

We also deployed the TrueFoundry LLM Gateway and the LiteLLM Proxy Server, both running on a single replica with 1 vCPU and 1 GB of memory.

We added our fake OpenAI provider to both the TrueFoundry and LiteLLM gateways. During load testing, we made requests to the fake OpenAI server in three different ways (a load-generator sketch follows the list):

  • Setup 1: Directly without using any proxy or gateway
  • Setup 2: Through the TrueFoundry LLM Gateway deployed on 1 vCPU and 1 GB memory
  • Setup 3: Through the LiteLLM Proxy Server deployed on 1 vCPU and 1 GB memory
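
A load test along these lines can be reproduced with Locust. The endpoint path, payload, and header below are illustrative assumptions, not our exact harness; point the host at the fake endpoint, the gateway, or LiteLLM to reproduce the three setups.

```python
from locust import HttpUser, task, constant_throughput

# Minimal Locust user firing OpenAI-style chat requests at a fixed rate.
# Run with e.g.: locust -f loadtest.py --host https://target.example.com -u 300
# so that 300 users approximate 300 RPS with constant_throughput(1).
class ChatUser(HttpUser):
    wait_time = constant_throughput(1)  # each user contributes ~1 request/second

    @task
    def chat_completion(self):
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "gpt-3.5-turbo",
                "messages": [{"role": "user", "content": "Hello!"}],
            },
            headers={"Authorization": "Bearer test-key"},  # placeholder key
        )
```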

Response times at each request rate:

Setup                               10 RPS          50 RPS          200 RPS                     300 RPS
OpenAI direct (Setup 1)             73 ms           73 ms           73 ms                       73 ms
TrueFoundry LLM Gateway (Setup 2)   76 ms (+3 ms)   76 ms (+3 ms)   76 ms (+3 ms)               77 ms (+4 ms)
LiteLLM Proxy (Setup 3)             88 ms (+15 ms)  99 ms (+26 ms)  Could not scale to 200 RPS  Could not scale to 300 RPS

Observations

  1. The TrueFoundry Gateway adds only an extra 3 ms of latency up to 250 RPS, and 4 ms at RPS > 300
  2. The TrueFoundry LLM Gateway scaled without any degradation in performance until about 350 RPS (1 vCPU, 1 GB machine), at which point CPU utilisation reached 100% and latencies started getting affected. With more CPU or more replicas, the LLM Gateway can scale to tens of thousands of requests per second.
  3. LiteLLM on the same machine was not able to scale beyond 40-50 RPS before reaching its CPU limit

More metrics

Setup 1: Direct OpenAI endpoint calling

[Chart: Stats @ 200 RPS]
[Chart: Stats @ 300 RPS]
[Chart: Response Time vs. RPS]

Setup 2: TrueFoundry LLM Gateway

[Chart: Stats @ 200 RPS]
[Chart: Stats @ 300 RPS]
[Chart: Response Time vs. RPS]

Setup 3: LiteLLM

[Chart: Stats @ ~58 RPS]
[Chart: Response Time vs. RPS]

Speed features of LLM Gateway

  • Near-Zero Overhead: Just 3-5 ms of added latency
  • Optimised Backend: Built on a performant Node.js framework
  • Config Caching: Config is stored in memory for quick lookup (see the sketch after this list)
  • Smart Routing: Minimal processing overhead
  • Edge Ready: Deploy close to your apps
  • High Capacity: A t2.2xlarge AWS instance ($43 per month on spot) can scale up to ~3000 RPS with no issues.
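
To illustrate the config-caching idea, here is a minimal in-memory TTL cache. This is a conceptual sketch in Python, not the gateway's actual implementation (which is built on Node.js):

```python
import time

# Minimal in-memory TTL cache illustrating the config-caching idea:
# reads become a dict lookup instead of a network or database round trip.
class ConfigCache:
    def __init__(self, ttl_seconds: float = 30.0):
        self._ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # stale entry: evict and report a miss
            return None
        return value

    def set(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic() + self._ttl, value)
```

On a miss, the caller fetches the config from its source of record and re-populates the cache, so the hot path stays entirely in memory.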

[Diagram: Edge Deployment of TrueFoundry LLM Gateway]

Supported Providers

Below is a list of popular LLM providers that are supported by the TrueFoundry LLM Gateway (a streaming client sketch follows the list):

  • GCP
  • AWS
  • Azure OpenAI
  • Self Hosted Models on TrueFoundry
  • OpenAI
  • Cohere
  • AI21
  • Anthropic
  • Anyscale
  • Together AI
  • DeepInfra
  • Ollama
  • Palm
  • Perplexity AI
  • Mistral AI
  • Groq
  • Nomic
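
For providers with streaming support, streamed responses go through the same OpenAI-compatible interface. Here is a minimal sketch; the base URL, API key, and model identifier are placeholders for your own deployment:

```python
from openai import OpenAI

# Same client setup as before, pointed at the gateway (placeholder values).
client = OpenAI(
    base_url="https://your-gateway.example.com/api/llm",
    api_key="your-truefoundry-api-key",
)

# stream=True yields incremental chunks in the standard OpenAI format.
stream = client.chat.completions.create(
    model="anthropic-main/claude-3-5-sonnet",  # hypothetical provider/model identifier
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g. the final one) may carry no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```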
