Large language models like ChatGPT and diffusion models like Stable Diffusion have taken the world by storm in just under a year. More and more organisations are now beginning to leverage Generative AI for existing and exciting new use cases. While most companies can directly start using APIs provided by companies like OpenAI, Anthropic, Cohere, etc., these APIs also come with a hefty cost. In the long term, many companies would like to finetune small to mid-size versions of equivalent open-source LLMs like Llama, Flan-T5, Flan-UL2, GPT-Neo, OPT, Bloom, etc., as the Alpaca and GPT4All projects did.
Finetuning smaller models with outputs from larger models can be useful in multiple ways.
To enable all this, GPUs have become an essential workhorse for any company working with these foundation models. With model sizes growing into the trillions of parameters, distributed training over multiple GPUs is slowly becoming the new norm. Nvidia is leading the hardware space with its newer Ampere and Hopper series cards. High-speed NVLink and InfiniBand interconnects allow connecting up to 256 Nvidia A100 or Nvidia H100 GPUs (and ~4k in SuperPOD clusters) to train and infer with ever-larger models in record time.
We'll now walk through the components needed to use GPUs with Kubernetes - mainly on AWS EKS and GCP GKE (Standard or Autopilot), but the components mentioned apply to any K8s cluster.
Your cloud provider has GPU VMs; how do we bring them into the K8s cluster? One way is to manually configure GPU node pools of fixed size, or with a cluster autoscaler that brings in GPU nodes when needed and lets them go when not. However, this still needs manual configuration of several different node pools. An even better solution is to configure auto-provisioning systems like AWS Karpenter or GCP Node Auto Provisioning. We talk about these in our previous article: Cluster Autoscaling for Big 3 Clouds ☁️

Sample AWS Karpenter Provisioner config:
```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-provisioner
  namespace: karpenter
spec:
  weight: 10
  kubeletConfiguration:
    maxPods: 110
  limits:
    resources:
      cpu: "500"
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
        - on-demand
    - key: topology.kubernetes.io/zone
      operator: In
      values:
        - ap-south-1
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values:
        - p3
        - p4
        - p5
        - g4dn
        - g5
  taints:
    - key: "nvidia.com/gpu"
      effect: "NoSchedule"
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
```
Sample GCP Node Auto Provisioner config:
```yaml
resourceLimits:
  - resourceType: 'cpu'
    minimum: 0
    maximum: 1000
  - resourceType: 'memory'
    minimum: 0
    maximum: 10000
  - resourceType: 'nvidia-tesla-v100'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-t4'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-a100'
    minimum: 0
    maximum: 4
autoprovisioningLocations:
  - us-central1-c
management:
  autoRepair: true
  autoUpgrade: true
shieldedInstanceConfig:
  enableSecureBoot: true
  enableIntegrityMonitoring: true
diskSizeGb: 100
```
Note that we can also configure our provisioners to use spot instances to get anywhere from 30-90% cost savings for stateless applications.
For any virtual machine to use GPUs, Nvidia drivers need to be installed on the host. Fortunately, on both AWS EKS and GCP GKE, GPU nodes come preconfigured with certain versions of the Nvidia drivers.
Because each newer CUDA version requires a higher minimum driver version, you might even want to control the driver version on all nodes. This can be done by provisioning nodes with custom images that do not ship a driver and letting the Nvidia gpu-operator install a specified version. However, this may not be allowed on all cloud providers, so be aware of the driver versions on your nodes to avoid compatibility headaches.
We talk about the gpu-operator in more detail below.
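For illustration, if you do let the gpu-operator manage drivers, pinning a version comes down to a Helm values override. The snippet below is a minimal sketch using the gpu-operator chart's documented `driver` and `toolkit` values; the version string is a placeholder and should be checked against the versions the chart supports and your CUDA builds.

```yaml
# Minimal sketch of gpu-operator Helm values for pinning the driver version.
driver:
  enabled: true
  version: "535.104.12"   # placeholder; pick a version compatible with your CUDA toolkit
toolkit:
  enabled: true           # also let the operator manage nvidia-container-toolkit
```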
In Kubernetes, since everything runs inside pods (sets of containers), just installing drivers on the host is not enough. Nvidia provides a standalone component called nvidia-container-toolkit that installs hooks into containerd/runC to make the host's GPU drivers and devices available to the containers running on the node. See this article for a more detailed explanation.
The nvidia-container-toolkit can be installed and run as a Daemonset on the GPU nodes.
Having GPUs on the node is not enough; the Kubernetes scheduler needs to know which node has how many GPUs available. This can be done using a device plugin, which allows advertising custom hardware resources (e.g. nvidia.com/gpu) to the control plane. Nvidia has published a device plugin that advertises allocatable GPUs on a node. This plugin, too, can be run as a Daemonset.
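Once the device plugin is running, nvidia.com/gpu shows up as an allocatable resource on the node. A quick way to sanity-check the whole chain (driver, container toolkit, device plugin) is a throwaway pod that runs nvidia-smi. The manifest below is a rough sketch - the pod name and image tag are assumptions, so adjust them to your setup.

```yaml
# Hypothetical smoke-test pod: requests one GPU and prints nvidia-smi output.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                               # assumed name
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.0-base-ubuntu22.04     # assumed tag; pick one matching your drivers
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                          # advertised by the device plugin
  tolerations:
    # Our GPU nodes carry the nvidia.com/gpu taint, so tolerate it
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```

If the pod completes and nvidia-smi lists the GPU, the driver, container toolkit and device plugin are all wired up correctly.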
Once the above components are configured, we need to add a few things to the pod spec to schedule it on a GPU node - mainly resources, affinity and tolerations.
E.g. on GCP GKE we can do:
```yaml
spec:
  # We define how many GPUs we want for the pod
  resources:
    limits:
      nvidia.com/gpu: 2
  # Affinities help us place the pod on the GPU nodes
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Specify which instance family we want
              - operator: In
                key: cloud.google.com/machine-family
                values:
                  - a2
              # Specify which GPU type we want
              - operator: In
                key: cloud.google.com/gke-accelerator
                values:
                  - nvidia-tesla-a100
              # Specify we want a spot VM
              - operator: In
                key: cloud.google.com/gke-spot
                values:
                  - "true"
  tolerations:
    # Spot VMs have a taint, so we mention a toleration for it
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: "true"
      effect: NoSchedule
    # We taint the GPU nodes, so we mention a toleration for it
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```
Note that these configurations will differ based on which provisioning method and cloud provider you use (e.g. Karpenter on AWS vs NAP on GKE).
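For example, with Karpenter on AWS EKS the node labels differ. A rough equivalent of the GKE affinity above might look like the sketch below, assuming Karpenter's v1alpha5 well-known labels and the gpu-provisioner defined earlier; verify the label keys against your Karpenter version.

```yaml
# Rough EKS/Karpenter equivalent of the GKE affinity above (label keys assumed
# from Karpenter's v1alpha5 well-known labels).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Land on nodes created by our GPU provisioner
            - key: karpenter.sh/provisioner-name
              operator: In
              values:
                - gpu-provisioner
            # Specify which instance family we want
            - key: karpenter.k8s.aws/instance-family
              operator: In
              values:
                - g5
            # Ask for spot capacity
            - key: karpenter.sh/capacity-type
              operator: In
              values:
                - spot
tolerations:
  # Our provisioner taints GPU nodes with nvidia.com/gpu
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```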
Monitoring GPU metrics like utilisation, memory usage, power draw, and temperature is important both to ensure things are working smoothly and to guide further optimisations.
Fortunately, Nvidia has a component called dcgm-exporter that can run as a Daemonset on GPU nodes and publish metrics at an endpoint. These metrics can then be scraped with Prometheus and consumed. Here is an example scrape config:
```yaml
- job_name: gpu-metrics
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
          - <dcgm-exporter-namespace-here>
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_node_name]
      action: replace
      target_label: kubernetes_node
```
However, note that dcgm-exporter needs to run with hostIPC: true and a privileged securityContext. This is fine on EKS and GKE Standard, but GKE Autopilot does not allow such access; instead, GKE publishes metrics via the preconfigured nvidia-device-plugin Daemonsets, which can be scraped or viewed in GCP Cloud Monitoring.
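To make the requirement concrete, the relevant parts of a dcgm-exporter Daemonset pod template look roughly like this; the image tag is an assumption, and 9400 is dcgm-exporter's default metrics port that Prometheus scrapes.

```yaml
# Illustrative excerpt of a dcgm-exporter Daemonset pod template showing the
# extra privileges it needs (image tag is an assumption).
spec:
  hostIPC: true
  containers:
    - name: dcgm-exporter
      image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
      securityContext:
        privileged: true
      ports:
        - name: metrics
          containerPort: 9400   # default dcgm-exporter port scraped by Prometheus
```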
| | AWS EKS | GCP GKE Standard | GCP GKE Autopilot |
|---|---|---|---|
| Provisioning | Karpenter / Manual | GCP Node Auto Provisioner / Manual | Auto Provisioning |
| Drivers | Preinstalled / Install via gpu-operator | Preinstalled | Preinstalled |
| Container Toolkit | nvidia-container-toolkit via gpu-operator | Preconfigured | Preconfigured |
| Device Plugin | nvidia-device-plugin via gpu-operator | Preconfigured Daemonset | Preconfigured Daemonset |
| Metrics | nvidia-dcgm-exporter via gpu-operator | Standalone nvidia-dcgm-exporter / Custom scraping | Custom scraping |
The gpu-operator mentioned above - which is mostly relevant on AWS EKS - is, for the most part, a bundle of standalone Nvidia components (drivers, container-toolkit, device-plugin and metrics exporter, among others) combined and configured to be used together via a single Helm chart. The gpu-operator runs a controller (master) pod that detects GPU nodes in the cluster. On detecting a GPU node, it deploys worker Daemonsets that optionally install the drivers, container toolkit, device plugin, CUDA toolkit, metrics exporter and validators. You can read more about it here.
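Because each component can be toggled individually, the operator adapts well to managed clusters where some pieces already exist. As a hedged example, on an EKS GPU AMI where the driver is preinstalled, the Helm values might look roughly like this; the key names follow the chart's documented values, but double-check them against the chart version you install.

```yaml
# Sketch of gpu-operator Helm values for EKS nodes that already ship the driver.
driver:
  enabled: false        # driver already baked into the GPU AMI
toolkit:
  enabled: true         # install nvidia-container-toolkit via the operator
devicePlugin:
  enabled: true         # advertise nvidia.com/gpu to the scheduler
dcgmExporter:
  enabled: true         # expose GPU metrics for Prometheus
```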
Generally, it is possible to have the CUDA toolkit installed on the host machine and made available to the pod via volume mounting; however, we find this can be quite brittle as it requires fiddling with PATH and LD_LIBRARY_PATH variables. Moreover, all pods on the same node would have to use the same CUDA toolkit version, which can be quite restrictive. Hence it is better to put the CUDA toolkit (or just the parts of it you need) inside the container image.
You can start from already-built images provided by Nvidia or your favourite deep-learning framework, or add it with one line on the Truefoundry platform.
To enable organisations to fine-tune and ship their Generative AI models faster on their existing infrastructure, the Truefoundry platform allows developers to add one or more Nvidia GPUs to their applications with minimal effort. Developers only need to specify how many GPUs they need and of which type - V100, P100, A100 40GB, A100 80GB (optimal for training) or T4, A10 (optimal for inference), some of the best GPUs for machine learning - and we do the rest. Read more in our docs.
GPUs are a fantastic technology, and this is just the beginning for us - we are actively working on several follow-up problems in this space.
If any of this sounds exciting, please reach out to work with us to build the best MLOps platform.
TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows while giving them full flexibility in testing and deploying models and ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds - allowing them to save costs, release models to production faster, and realise real business value.
Discuss your ML pipeline challenges with us here.