A 2-Person Team Serving a Model to 1.5 Million People with TrueFoundry

January 30, 2024

In the last few months, we have had the opportunity to work with a lean but extremely talented team. They have developed a state-of-the-art deep learning model and created partnerships to ship it to more than 10 million users.

The last missing piece in their impact story was the engineering needed to accomplish this. The model was compute-heavy, and at the scale at which they wanted to serve it to end users, they needed a reliable and performant infrastructure stack that the 2 of them could manage (1 DevOps Engineer and 1 ML Engineer).

Need for Async Deployment

The model was built to process audio inputs of varying sizes. Since the model had a high processing time (averaging ~5 seconds per request), it needed an asynchronous inference setup that could queue incoming requests, process them, and respond once results were ready.
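For intuition, the async pattern looks like the sketch below: a caller enqueues a request and immediately gets a job id back, while a worker drains the queue at the model's pace, so a ~5-second inference never holds a client connection open. This is a minimal, illustrative sketch (Python 3.10+), not the team's actual stack; managed async inference (SageMaker Async, or TrueFoundry's async service) adds durable queues, retries, and autoscaling on top of the same idea.

```python
import asyncio
import uuid

jobs: dict[str, asyncio.Future] = {}    # job_id -> pending result
queue: asyncio.Queue = asyncio.Queue()  # backlog of submitted requests

async def run_model(audio: bytes) -> str:
    # Stand-in for ~5 seconds of real inference on one audio input.
    await asyncio.sleep(5)
    return f"transcript for {len(audio)} input bytes"

async def submit(audio: bytes) -> str:
    # Enqueue the request and hand back a job id right away; the caller
    # is never blocked for the duration of inference.
    job_id = uuid.uuid4().hex
    jobs[job_id] = asyncio.get_running_loop().create_future()
    await queue.put((job_id, audio))
    return job_id

async def worker() -> None:
    # Drain the backlog one request at a time; in production this loop
    # runs on each GPU replica.
    while True:
        job_id, audio = await queue.get()
        jobs[job_id].set_result(await run_model(audio))
        queue.task_done()

async def main() -> None:
    asyncio.create_task(worker())
    job_id = await submit(b"\x00" * 16000)
    print("result:", await jobs[job_id])  # await (or poll) the result later

asyncio.run(main())
```

A useful side effect of this design is that the queue depth becomes a direct measure of how far behind the system is, which is exactly the signal the team would later autoscale on.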

The team had developed a stack on AWS SageMaker

The team had built its initial stack for serving the model on SageMaker. However, when they conducted their first pilot using this design, they realized that serving the model reliably at the desired scale would be difficult with this stack.

Users faced lags of 8-10 Minutes

Even with the async setup, instances took time to scale up (8-10 mins per machine), and end users had to bear this lag, compromising their experience.

Team's Initial Setup on SageMaker

During the PoC, they faced huge lags in response times. Being new to many of SageMaker's controls, they lost critical time finding the reason for the lags. Some of the challenges they faced were:

  1. Difficult to learn: As DS/MLEs, they found it difficult to understand the new concepts required to use SageMaker.
  2. Limited Visibility: Doing a root cause analysis of issues, especially in production, was difficult due to unintuitive dashboards and interfaces.
  3. Difficult to scale: Scaling SageMaker was slow, causing delays in user responses and a poor customer experience.
  4. Separate Quota: AWS requires a separate support case to get capacity for SageMaker-reserved GPU instances. The team found this process slow and restrictive.
  5. Expensive: Using GPUs with SageMaker was expensive for the team because SageMaker marks up such instances by 25-40% over raw EKS.

After the PoC, the team lost confidence in SageMaker and decided they needed a solution that the two of them (1 ML Engineer and 1 DevOps Engineer) could use to serve their target audience of 10Mn+ users.

Deploying the System on TrueFoundry in <2 Days

When we started engaging with the team, their pilot was ~7 days away. We assured them that we could help migrate the entire stack and rebuild it using TrueFoundry's modules in <2 days, leaving ample time to test before the pilot went to production.

New Stack with TrueFoundry

Much faster scaling

The team benchmarked performance against SageMaker by sending a burst of 88 requests to the model. TrueFoundry scaled up 78% faster than SageMaker, giving users much faster responses; end to end, the burst was processed in 395.9s vs 660s, a 40% reduction in total time.

AUTOSCALING TEST RESULTS (G5.XLARGE, 2 WORKERS, 88 REQUESTS)

                                      AWS SageMaker    TrueFoundry
  Total time to process 88 requests   660s             395.9s


Reliable scaling to 150+ Nodes

The team was able to scale the application to 150+ GPU nodes with minimal effort because:

  1. Easy to configure: They just had to change an argument in the UI to configure autoscaling rules based on the backlog of incoming requests (see the backlog-sizing sketch below). This would otherwise have taken multiple back-and-forths with the engineering team.
  2. Increased GPU Quota: With TrueFoundry, they could use both spot instances and raw EKS. Given the GPU shortage at cloud providers, TrueFoundry also gave the team the option to scale across different GPU providers and regions.

Seamless Spot usage

  1. Spot Usage and Autoscaling: The team needed no additional effort to configure spot instances for their services, and instances were scaled down when traffic was low. Using TrueFoundry's reliability mechanism for spot usage and its autoscaling settings, the team saved $100K+ during the pilot period.
  2. Dev and Demo Environment: The team also deployed dev and demo services of the model to collect feedback while scaling down the machines when not in use.
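For intuition on what a backlog-based autoscaling rule does, the sketch below sizes the replica count so that the current queue backlog drains within a target window, clamped between a floor and a ceiling. This is a hand-rolled illustration, not TrueFoundry's API; the function name, parameters, and defaults are all assumptions for the sake of the example.

```python
import math

def desired_replicas(backlog: int,
                     seconds_per_request: float = 5.0,
                     target_drain_seconds: float = 60.0,
                     min_replicas: int = 1,
                     max_replicas: int = 150) -> int:
    """Hypothetical backlog-based scaling rule: size the deployment so the
    current backlog drains within the target window, within [min, max]."""
    # Each replica clears target_drain_seconds / seconds_per_request
    # requests per window (~12 with the defaults above).
    per_replica = target_drain_seconds / seconds_per_request
    needed = math.ceil(backlog / per_replica) if backlog else min_replicas
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(0))     # 1   -> idle: scale down to the floor
print(desired_replicas(88))    # 8   -> the benchmark burst above
print(desired_replicas(5000))  # 150 -> clamped at the configured ceiling
```

Expressing the rule over queue backlog rather than CPU or GPU utilization is what lets an async service scale down when idle and react before requests pile up.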

1.5 Mn users already served and increasing by the day!

Using TrueFoundry, the 2-member team manages their entire workload, which often scales to more than 150 GPU nodes, by themselves. While working with us, what stood out most to the team was our customer support and low response times. TrueFoundry is invested in the success of its clients, and we hope that all our clients can scale and create impact similar to this project!

