Autopilot: Automating Infrastructure Management for GenAI

March 18, 2025

What is Autopilot

Machine learning operations (MLOps) often involve complex, manual processes that consume time and resources. Truefoundry's Autopilot aims to eliminate these operational burdens, enabling developers to focus solely on writing code and data scientists to refine their models. Autopilot automatically handles resource optimization and reliability fixes, ensuring a frictionless workflow with minimal human intervention.

Why do we need this

The operational concerns of any software development lifecycle can be divided into three stages -

  1. Day 0 – Design & Planning: Define architecture, provisioning strategies, security policies, and scaling frameworks before deployment.
  2. Day 1 – Deployment & Implementation: Set up infrastructure, deploy applications, configure observability, and establish CI/CD pipelines.
  3. Day 2 – Operations & Maintenance: Continuously monitor, auto-scale resources, apply security patches, and handle incident management.

Organizations typically handle these stages in one of three successive phases, evolving from fragmented responsibilities toward automation-driven efficiency.

Phase 1: Separate Dev and Ops

This is the stage most teams start at. In this phase, the three stages of operations typically involve the following.

  • Day 0: Developers focus on application design, while operations teams handle infrastructure and security.
  • Day 1: Developers package and prepare the applications for deployment, while the operations team focuses on provisioning resources and configuring them.
  • Day 2: Operations manage scaling, monitoring, and security patches while developers still troubleshoot issues.

This separation of responsibility creates unnecessary friction between the development and operations teams. The problem is compounded when the two teams lack a shared vocabulary, which makes sharing context even harder.

For a single application, this can translate into an initial release timeline of several weeks, with each subsequent Day 2 operation taking a few days on top of the inevitable alignment overhead between the operations and development teams.

Phase 2: Internal platform creation

In phase 2, an organization adopts an internal platform that lets the development team configure and control most operational levers as they see fit. The operations team shifts into an enforcement and standardization role, using the platform as the layer that orchestrates it.

This phase has a few drawbacks -

  • For developers, this phase means making many choices in the initial stages with limited context or relevant expertise. This increases cognitive load and leads to suboptimal resource planning.
  • For the operations team, the operational surface area explodes. A typical team may see a multifold increase in the number of services, and infrastructure costs climb as suboptimal decisions accumulate.

Even though the organization initially gains velocity, this is offset by the explosion in complexity of the development team's work, which is inevitably followed by a clash of concerns between the two teams.

Phase 3: Automating away the platform

In the third phase, the platform itself starts automating all the operational concerns. This eliminates the need for many decisions to be made across the three stages of operation.

This means the three operational stages can be achieved on Day 0 itself, with almost no operational choices required from either the development or the operations teams. This is what Autopilot attempts to do.

Why now

Although the need for more automation has been apparent for quite some time, a system like Autopilot becomes even more essential today, with the following new paradigms coming into play.

Microservices

With the wide adoption of microservices architecture, the number of services in an organization has undergone a Cambrian explosion. The convenience of getting changes or new services to production quickly has a flip side: oversight becomes much harder. Autopilot is a system that can reliably optimize these services.

Agentic Systems

Agentic systems are systems that autonomously execute tasks. They need a robust, self-sufficient deployment strategy backed by infrastructure flexible enough to scale up and down dynamically as needed. Current state-of-the-art AI agents rely on adaptable, efficient infrastructure to function optimally, and they require varying levels of human involvement. The wide rollout of such systems is only possible when all operational aspects are automated, which is where Autopilot comes into play.

Case Study

For one of the Truefoundry platform users, the cost of dev clusters was a major issue. With around 200 services deployed, this was a typical case of many small inefficiencies accumulating into a massive overall cost. Any attempt at cost optimization would have had to be done at the individual-service level, and the sheer amount of work required meant the overrun kept worsening without ever becoming a priority.

After enabling Autopilot, this customer realized $1,500 in cost savings across just 2 clusters. In addition, over 50 reliability-related fixes were applied where workloads were either starved for CPU or running into memory pressure.

What can Autopilot do currently

CPU, Memory, and Storage optimisation

Autopilot automates away the CPU and memory configuration for an application. It does this by drawing on two sources of input (a simplified sketch follows the list) -

  • Historical Usage: Autopilot looks at an application's historical usage to arrive at an optimal configuration going forward.
  • Real-Time Adjustments: Autopilot also reacts to alerts and other event sources to perform real-time mitigation and nip a problem in the bud. This improves MTTR and proactively prevents many problems that would otherwise grow much larger.
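
The following is a minimal sketch of how a recommendation from historical usage might be computed; the percentile choice, headroom factor, and sample data are assumptions for illustration, not Autopilot's actual implementation.

```python
import numpy as np

def recommend_resources(cpu_samples, memory_samples, headroom=1.2):
    """Derive CPU/memory requests from historical usage samples (illustrative heuristic).

    cpu_samples: CPU usage in cores over the lookback window.
    memory_samples: memory usage in bytes over the same window.
    headroom: safety multiplier applied on top of observed usage.
    """
    # Size CPU to cover typical load (95th percentile); size memory to the
    # observed peak, since exceeding the memory limit means an OOM kill.
    cpu_request = float(np.percentile(cpu_samples, 95)) * headroom
    memory_request = float(np.max(memory_samples)) * headroom
    return {
        "cpu_request_cores": round(cpu_request, 2),
        "memory_request_bytes": int(memory_request),
    }

# Example: a service that idles around 0.2 cores with an occasional spike.
cpu = [0.18, 0.21, 0.19, 0.45, 0.22, 0.20]
mem = [300e6, 310e6, 305e6, 420e6, 315e6, 308e6]
print(recommend_resources(cpu, mem))
```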

Cluster Health

Autopilot also takes care of the health and cost of the individual components installed in a cluster, such as Istio, ArgoCD, and Karpenter. Failure of any of these components can have disastrous consequences for the workloads running on that cluster. Autopilot ensures these components run in a cost-optimal manner while continuing to function, by proactively looking for resource spikes and accounting for them.
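
Purely as an illustration of the idea, a watchdog over cluster add-ons could flag components whose usage creeps toward their limits; the component names, threshold, and sample numbers below are hypothetical, not Autopilot's API.

```python
# Components typically installed in a cluster; names are illustrative.
COMPONENTS = ["istiod", "argocd-repo-server", "karpenter"]

def components_at_risk(usage, limits, alert_ratio=0.85):
    """Return add-ons whose CPU or memory usage is within alert_ratio of its limit."""
    flagged = []
    for name in COMPONENTS:
        u, l = usage[name], limits[name]
        if u["cpu"] / l["cpu"] >= alert_ratio or u["mem"] / l["mem"] >= alert_ratio:
            flagged.append(name)
    return flagged

# Example snapshot: karpenter is close to its memory limit and gets flagged.
usage = {
    "istiod": {"cpu": 0.2, "mem": 600e6},
    "argocd-repo-server": {"cpu": 0.1, "mem": 400e6},
    "karpenter": {"cpu": 0.3, "mem": 900e6},
}
limits = {
    "istiod": {"cpu": 1.0, "mem": 2e9},
    "argocd-repo-server": {"cpu": 0.5, "mem": 1e9},
    "karpenter": {"cpu": 1.0, "mem": 1e9},
}
print(components_at_risk(usage, limits))  # ['karpenter']
```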

Node capacity

Going beyond services, Autopilot also optimizes the infrastructure underlying the running services. This means recommending the node capacity ideal for an application, taking the application's metrics, environment, and other factors into account.
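
A rough sketch of how a node-capacity recommendation could be reasoned about from aggregate workload requirements; the instance catalogue, prices, and overhead factor are made up for illustration and are not Autopilot's actual sizing logic.

```python
# Hypothetical instance catalogue: (name, cores, memory in GiB, hourly price in $).
INSTANCE_TYPES = [
    ("small", 2, 8, 0.10),
    ("medium", 4, 16, 0.20),
    ("large", 8, 32, 0.40),
]

def cheapest_fit(cpu_cores_needed, memory_gib_needed, system_overhead=0.1):
    """Pick the cheapest node shape that covers the workload plus system overhead."""
    for name, cores, mem, price in sorted(INSTANCE_TYPES, key=lambda t: t[3]):
        usable_cores = cores * (1 - system_overhead)  # reserve room for kubelet/daemonsets
        usable_mem = mem * (1 - system_overhead)
        if usable_cores >= cpu_cores_needed and usable_mem >= memory_gib_needed:
            return name
    return None  # no single node fits; the workload would need to be split

print(cheapest_fit(cpu_cores_needed=3.0, memory_gib_needed=10.0))  # 'medium'
```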

Autoscaling

Many development teams scale their applications for the maximum load they expect to face, which incurs significant extra cost whenever the extra replicas sit idle. The obvious solution is autoscaling, but even that is not always applicable when traffic patterns are unpredictable. Autopilot goes through the historical metrics for every service and recommends enabling or disabling autoscaling based on the service's historical behavior.
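
One simple heuristic for such a recommendation (not necessarily Autopilot's actual logic) is to compare a service's peak traffic against its typical traffic: if the gap is large, fixed replicas sized for the peak waste money and autoscaling is worth enabling.

```python
import statistics

def autoscaling_recommendation(requests_per_minute, peak_to_median_threshold=2.0):
    """Recommend autoscaling when peak load is much higher than typical load.

    requests_per_minute: historical request-rate samples for the service.
    """
    median = statistics.median(requests_per_minute)
    peak = max(requests_per_minute)
    if median == 0:
        return "enable"  # mostly idle service; scaling down saves cost
    ratio = peak / median
    return "enable" if ratio >= peak_to_median_threshold else "keep fixed replicas"

# A service with a pronounced traffic peak benefits from autoscaling.
print(autoscaling_recommendation([120, 130, 125, 480, 500, 140]))  # 'enable'
```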

What’s next

Although we already see substantial cost and reliability gains using Autopilot in production, much more needs to be done to realize the full automation vision laid out earlier. Some of the operational concerns worth automating away next are -

  • Periodic autoscaling - Predicting and implementing periodic autoscaling, taking historical traffic into account, would allow us to enable autoscaling even for spiky workloads (see the sketch after this list).
  • Auto-shutdown recommendations - Identifying and shutting down services or workloads that are not in use can lead to massive cost savings.
  • Auto-benchmarking - While estimating a service's resources is already possible, a better estimate can be made by running benchmarks with live or simulated traffic and observing the affected business metrics. Autopilot seeks to automate this process, which can be very time-consuming for most development teams.
  • Cluster infra optimization - Cluster CPU utilization across the industry averages around 10%. While misconfigured CPU requests account for a significant share of this, a large part is also inefficient load distribution on the underlying infrastructure: many underutilized nodes, or too many small nodes wasting capacity on DaemonSets. Fixing the infra provisioning configuration and leveraging tools like Karpenter at the cloud level can go a long way toward improving this.
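
To make the periodic-autoscaling item above concrete, one possible approach is to check whether a service's traffic repeats on a daily cycle and pre-scale ahead of the recurring peak; the lag, threshold, and synthetic data below are assumptions for the sketch, not a committed design.

```python
import numpy as np

def has_daily_pattern(hourly_requests, threshold=0.6):
    """Detect a recurring daily traffic pattern via lag-24 autocorrelation."""
    x = np.asarray(hourly_requests, dtype=float)
    x = x - x.mean()
    if x.std() == 0 or len(x) < 48:
        return False  # need at least two days of non-constant traffic
    lag = 24  # compare each hour with the same hour on the previous day
    corr = np.corrcoef(x[:-lag], x[lag:])[0, 1]
    return corr >= threshold

# Two days of synthetic traffic with the same morning peak on both days.
day = [50] * 8 + [400] * 4 + [100] * 12
print(has_daily_pattern(day * 2))  # True
```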

Conclusion

Truefoundry's Autopilot is a transformative tool in the evolution of MLOps, addressing critical operational challenges across the software development lifecycle. Autopilot enables teams to focus on innovation rather than operational overhead by automating resource optimization, cluster health management, and autoscaling. As the adoption of microservices and agentic systems continues to grow, the need for such automation becomes increasingly urgent. With its current capabilities and ambitious roadmap, Autopilot is poised to revolutionize how organizations approach operational efficiency, reliability, and cost optimization.
