We firmly believe that every company will be a Machine Learning (ML) company in the coming few years. As organizations embrace ML, one of the significant challenges they face is managing the associated cloud costs. Running AI/ML workloads in the cloud can quickly become expensive, but with careful planning and optimization, it is possible to reduce these costs significantly.
In this blog post, we will explore several strategies to help you optimize your AI infrastructure, ultimately reducing your cloud expenses without compromising performance or scalability. Below are the broad categories to consider:
Having worked with quite a few organizations, and from our own experience, we have found that a large part of the cost comes from simple human mistakes: forgetting to turn off VMs or services, or incorrect architecture design that incurs more cost than necessary. Full visibility into who owns what, and what cost is incurred per team or project, helps surface cost drains faster and makes everyone accountable for their own projects.
You can't improve what you don't measure
The very first step in optimizing ML workloads is to start measuring costs and attributing them to owners. Below are some of the initiatives you can undertake:
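One such initiative is tagging every resource with its owning team and aggregating spend per tag. A minimal sketch (the line items and field names here are hypothetical; in practice they come from your cloud provider's cost export):

```python
from collections import defaultdict

# Hypothetical billing line items; in practice these come from your cloud
# provider's cost export, with team/project tags attached to each resource.
line_items = [
    {"resource": "vm-1", "team": "nlp", "cost_usd": 120.0},
    {"resource": "vm-2", "team": "nlp", "cost_usd": 80.0},
    {"resource": "vm-3", "team": "vision", "cost_usd": 200.0},
    {"resource": "vm-4", "team": None, "cost_usd": 55.0},  # untagged resource
]

def cost_by_team(items):
    """Aggregate cost per team; untagged resources surface as 'UNTAGGED'."""
    totals = defaultdict(float)
    for item in items:
        totals[item["team"] or "UNTAGGED"] += item["cost_usd"]
    return dict(totals)

print(cost_by_team(line_items))
# {'nlp': 200.0, 'vision': 200.0, 'UNTAGGED': 55.0}
```

The "UNTAGGED" bucket is the useful part: it shows exactly how much spend nobody is accountable for yet.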
ML workloads incur huge compute costs, mainly because they require either high CPU counts or GPUs, both of which are very expensive. Below are some steps you can take to reduce compute cost:
Spot Instance: A Spot Instance lets you use spare EC2 capacity at a steep discount. When you launch a Spot Instance, you specify the maximum price you are willing to pay per hour. If the Spot price for the instance type and Availability Zone you request is below your maximum price, your instance launches; if the Spot price later rises above your maximum, your instance may be terminated with two minutes' notice.
Reserved Instance: Here, you commit to using a certain amount of EC2 capacity for a fixed period (typically one or three years) and in return receive a significant discount over the On-Demand price of that capacity.
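To make the tradeoff concrete, here is a small sketch using the g4dn.xlarge monthly prices from our comparison below, plus a hypothetical reserved rate (the ~40% discount is illustrative, not a quoted price):

```python
def spot_discount(on_demand, spot):
    """Fractional saving from spot vs on-demand pricing."""
    return (on_demand - spot) / on_demand

def breakeven_utilization(on_demand_hourly, reserved_hourly):
    """Fraction of the time an instance must actually be busy for a
    reservation (billed for every hour) to beat on-demand paid per hour used."""
    return reserved_hourly / on_demand_hourly

# Monthly g4dn.xlarge prices from our US East (N. Virginia) comparison:
print(f"Spot saves {spot_discount(383, 115):.0%}")  # roughly 70%

# Illustrative reserved rate: a ~40% discount only pays off if the
# instance is busy more than ~60% of the time.
print(f"Break-even utilization: {breakeven_utilization(1.00, 0.60):.0%}")
```

The rule of thumb: reservations suit steady, predictable load; spot suits interruptible work like training jobs with checkpointing.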
We did a comparative study in US East (N. Virginia) and found that:

g4dn.xlarge: $383 (On-Demand) vs $115 (Spot)

NC4as T4 v3: $383 (On-Demand) vs $49 (Spot)

Different use cases in ML require different architectures, and choosing the wrong design can lead to massive differences in cost. Some of the most common use cases and mistakes we have seen are discussed below.

Oftentimes, modelling one of these use cases with the wrong architecture leads to a loss of reliability, additional latency, or large cloud bills.

People assume autoscaling is useful only when there is a high volume of traffic and machines need to be scaled up or down based on the incoming load. However, we also extend the concept of autoscaling to development environments to save cost. Some of the areas where autoscaling can drastically reduce cost, such as dev machines and notebooks, are covered below.
It's important to colocate data and compute so that you don't incur heavy ingress/egress costs. Training usually involves downloading the data to the machines where the model is being trained, and a few things are worth taking care of here to avoid unexpected costs.
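As a rough illustration of why colocation matters, consider the egress bill from repeatedly pulling a dataset across regions (the $0.09/GB rate is hypothetical; check your provider's actual pricing):

```python
def transfer_cost(dataset_gb, times_downloaded, per_gb_usd):
    """Cost of repeatedly pulling training data across regions or clouds."""
    return dataset_gb * times_downloaded * per_gb_usd

# Illustrative: a 500 GB dataset pulled fresh for 20 training runs at a
# hypothetical $0.09/GB cross-region egress rate.
print(f"${transfer_cost(500, 20, 0.09):,.2f}")  # prints "$900.00"
```

The same 20 runs cost nothing extra in transfer if the data lives in the same region as the training machines (intra-region traffic is typically free or near-free).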
Oftentimes, data scientists start a VM, set up a Jupyter Notebook on it, or use it over SSH in VSCode. While this approach works, developers often forget to shut down the VMs when they are done, which drains a lot of cost. It is worth investing in auto-shutdown hosted notebooks once the DS team grows beyond 5 members.
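The auto-shutdown decision itself is simple; a sketch (how you detect activity is up to you, e.g. kernel activity or open SSH sessions):

```python
from datetime import datetime, timedelta

def should_shutdown(last_activity, now, idle_timeout=timedelta(hours=2)):
    """Decide whether an idle notebook VM should be stopped.
    `last_activity` would come from kernel or SSH activity checks."""
    return now - last_activity >= idle_timeout

now = datetime(2024, 1, 1, 18, 0)
print(should_shutdown(datetime(2024, 1, 1, 15, 0), now))   # True: idle for 3h
print(should_shutdown(datetime(2024, 1, 1, 17, 30), now))  # False: idle for 30m
```

Run a check like this on a schedule (e.g. every few minutes) and stop, not terminate, the VM so that the disk and environment survive.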
GPUs are heavily used in ML, but they are rarely used efficiently. This article sheds excellent light on how GPUs are mostly used today and where the inefficiencies lie. Sharing GPUs between workloads and efficient batching techniques are essential to utilizing GPUs well.
TrueFoundry has helped all its customers save a minimum of around 40% on infrastructure costs.
Kubernetes helps reduce cost by efficiently bin-packing workloads onto nodes and making sure the cluster is used effectively. This is a great article that sheds more light on how Kubernetes helps save costs.
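To see why bin-packing saves money, here is a toy first-fit-decreasing packer (a simplification of what the Kubernetes scheduler achieves): ten workloads that would otherwise occupy ten dedicated VMs fit on three shared 8-vCPU nodes.

```python
def first_fit_decreasing(requests, node_capacity):
    """Pack workload CPU requests onto the fewest nodes using the
    first-fit-decreasing heuristic; returns the node count needed."""
    nodes = []  # remaining free capacity per node
    for req in sorted(requests, reverse=True):
        for i, free in enumerate(nodes):
            if free >= req:
                nodes[i] = free - req
                break
        else:
            nodes.append(node_capacity - req)  # open a new node
    return len(nodes)

# Ten workloads, 20 vCPUs total, on 8-vCPU nodes:
cpu_requests = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1]
print(first_fit_decreasing(cpu_requests, node_capacity=8))  # 3 nodes, not 10
```

In practice you also leave headroom for bursts, but the core saving comes from exactly this consolidation.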
TrueFoundry makes it really easy to shut down your dev instances using its time-based autoscaling feature. Developers mostly work around 40 hours a week, whereas the machines run for almost all 168 hours in a week. Shutting the machines down outside working hours can save around 60% of the cost.
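A back-of-the-envelope version of that saving, assuming a generous 12-hour weekday window rather than the bare 40 working hours:

```python
HOURS_PER_WEEK = 7 * 24  # 168

def schedule_savings(active_hours_per_week):
    """Fraction of compute cost saved by stopping dev machines off-schedule,
    relative to running them 24x7."""
    return 1 - active_hours_per_week / HOURS_PER_WEEK

# Assumed schedule: machines up 12 hours on weekdays (5 * 12 = 60 h/week).
print(f"{schedule_savings(60):.0%}")  # ~64% saved vs. running all week
```

Even with that buffer around working hours, the saving lands around the 60% mark quoted above.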
TrueFoundry also lets data scientists configure an inactivity timeout on every notebook, after which the notebook is automatically switched off. This saves a lot of cost, especially when notebooks run on GPUs.
TrueFoundry makes it really easy for developers to use spot or on-demand instances. Developers and data scientists know their applications best, so we leave it to them to decide the right choice for each application. It also shows the cost tradeoff between spot and on-demand instances so you can choose according to your use case.
TrueFoundry allows you to set CPU, memory, and GPU quotas for different teams and developers. This gives leaders a sense of cost allocation across teams and guards against mistakes by preventing developers from going beyond their allotted limits.
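A quota check of this kind boils down to simple accounting; a sketch (this is our illustration, not TrueFoundry's implementation, and the resource names and numbers are made up):

```python
def within_quota(current_usage, request, quota):
    """Reject a deployment request that would push a team past its quota."""
    return all(
        current_usage.get(r, 0) + request.get(r, 0) <= quota.get(r, 0)
        for r in request
    )

team_quota = {"cpu": 64, "memory_gb": 256, "gpu": 4}
team_usage = {"cpu": 60, "memory_gb": 200, "gpu": 4}

print(within_quota(team_usage, {"cpu": 2, "memory_gb": 16}, team_quota))  # True
print(within_quota(team_usage, {"cpu": 2, "gpu": 1}, team_quota))         # False: GPU quota hit
```

The GPU line is usually the one that matters: a hard cap on GPUs per team is the cheapest guard against a forgotten GPU workload.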
TrueFoundry automatically shows you recommended CPU and memory resources for your service by analyzing its consumption over the last few days. It currently suggests CPU and memory requests and limits; in the future, we also plan to automatically recommend the autoscaling strategy and the correct architecture.
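The underlying idea can be sketched as: set the request near typical observed usage, and the limit near peak usage with headroom. This is our illustration of the concept, not TrueFoundry's actual algorithm:

```python
import statistics

def recommend_resources(cpu_samples, headroom=1.2):
    """Suggest a CPU request/limit pair from observed usage samples:
    request ~ median load, limit ~ peak load plus headroom."""
    request = statistics.quantiles(cpu_samples, n=100)[49]  # ~50th percentile
    limit = max(cpu_samples) * headroom
    return round(request, 2), round(limit, 2)

# Hypothetical millicore usage samples from the last few days,
# including one burst to 900m:
samples = [210, 250, 240, 230, 900, 260, 220, 215, 245, 235]
request, limit = recommend_resources(samples)
print(f"request≈{request}m, limit≈{limit}m")
```

Sizing the request to the median rather than the peak is what frees up capacity for bin-packing; the limit still protects the burst.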
Want to assess how you can optimize your AI workload costs? We have created an easy-to-take 5-minute assessment, and we promise to share the personalized report with you.