The GenAI revolution has led to a surge in GPU demand across the industry. Companies want to train, fine-tune, and deploy LLMs at massive scale. This has reduced availability and driven up prices for the latest GPUs. Companies running workloads on public cloud have suffered from high prices and growing uncertainty around GPU availability.
These new realities make squeezing the most out of every available GPU absolutely critical. Partitioning or sharing a single GPU between multiple processes helps with this. Implementing it on top of Kubernetes is a winning combination: we get autoscaling and a sophisticated scheduler to help optimize GPU utilization.
To share a single GPU between multiple workloads in Kubernetes, these are the options we have:
Multi-Instance GPU (MIG) allows GPUs based on the NVIDIA Ampere architecture (such as the NVIDIA A100) to be securely partitioned into separate GPU instances for CUDA applications. Each partition is fully isolated in both memory and compute and can provide predictable throughput and latency.
A single NVIDIA A100 GPU can be partitioned into up to 7 isolated GPU instances. Each partition appears as a separate GPU to the software running on the partitioned node. Other MIG-supported GPUs and their supported partition counts are listed in the NVIDIA documentation.
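For completeness, here is a minimal sketch of how MIG is typically enabled with the NVIDIA GPU operator's MIG manager, assuming a MIG-capable node and the standard `all-1g.5gb` profile name (we don't use MIG in the walkthrough below):

# Label the node with the desired MIG profile; the MIG manager reconfigures the GPU
$ kubectl label node <gpu-node-name> nvidia.com/mig.config=all-1g.5gb --overwrite

# Once reconfigured, the partitions show up in the node's allocatable resources
# (as nvidia.com/gpu or nvidia.com/mig-* depending on the MIG strategy)
$ kubectl get node <gpu-node-name> -o 'jsonpath={.status.allocatable}'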
Pros
- Full memory and compute isolation between partitions, with predictable throughput and latency
- Native support in nvidia-device-plugin

Cons
- Only available on MIG-capable GPUs (NVIDIA Ampere architecture and newer)
- Partitioning is limited to a fixed set of hardware profiles (up to 7 instances on an A100)
Time slicing enables multiple workloads to be scheduled on the same GPU. Compute time is shared between the processes, which are interleaved in time; unlike MIG, there is no memory or fault isolation between them. A cluster administrator can configure the cluster or individual nodes to advertise a certain number of replicas per GPU, and the nodes are reconfigured accordingly.
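The walkthrough below applies one time-slicing configuration cluster-wide, but the GPU operator also supports selecting a named configuration per node through a label. A sketch, assuming `<config-name>` is an entry defined in the device plugin ConfigMap:

# Apply a specific named time-slicing config to just this node
$ kubectl label node <gpu-node-name> nvidia.com/device-plugin.config=<config-name> --overwrite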
There are other options for GPU sharing, such as MPS and vGPUs, but they don't have native support in `nvidia-device-plugin`, so we won't be discussing them here.
Let's go through a short walkthrough of how to use time slicing on Azure Kubernetes Service. We start with an already existing Kubernetes cluster.
1. Add a GPU-enabled node pool to the cluster
$ az aks nodepool add \
    --name <nodepool-name> \
    --resource-group <resource-group-name> \
    --cluster-name <cluster-name> \
    --node-vm-size Standard_NC4as_T4_v3 \
    --node-count 1
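Optionally, the node pool can also be tainted at creation time so that only GPU workloads land on it; the verification deployment in step 5 already carries a matching toleration. A sketch, assuming the `nvidia.com/gpu` taint key that the toleration uses:

$ az aks nodepool add \
    --name <nodepool-name> \
    --resource-group <resource-group-name> \
    --cluster-name <cluster-name> \
    --node-vm-size Standard_NC4as_T4_v3 \
    --node-count 1 \
    --node-taints nvidia.com/gpu=present:NoSchedule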
This will add a new node pool with a single node carrying one NVIDIA T4 GPU to the existing AKS cluster. This can be verified by running the following, which should report 1 allocatable GPU for the unpartitioned T4:
$ kubectl get nodes <gpu-node-name> -o 'jsonpath={.status.allocatable.nvidia\.com\/gpu}'
2. Install the GPU operator. The AKS GPU node image already ships with the NVIDIA driver and container runtime preinstalled, which is why we disable those components in the operator below.
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

$ helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=false \
    --set toolkit.enabled=false \
    --set operator.runtimeClass=nvidia-container-runtime
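Before moving on, it's worth confirming that the operator components have come up (exact pod names vary by operator version):

# All operator pods should eventually reach Running/Completed
$ kubectl get pods -n gpu-operator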
3. Once the operator is installed, we create a time-slicing configuration and configure the whole cluster to slice GPU resources wherever they are available. With renameByDefault set to false, the shared replicas are still advertised under the plain nvidia.com/gpu resource name.
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 10
EOF

# Reconfigure gpu operator to pick up the config map
$ kubectl patch clusterpolicy/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
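To confirm the cluster policy now references the new config (the device plugin pods restart to apply it), you can inspect the field we just patched:

# Should print the config name and default key set by the patch above
$ kubectl get clusterpolicy cluster-policy \
    -o 'jsonpath={.spec.devicePlugin.config}'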
4. Verify that the existing node has been successfully reconfigured
$ kubectl get nodes <gpu-node-name> -o 'jsonpath={.status.allocatable.nvidia\.com\/gpu}'
10
5. We can verify the configuration by creating a deployment with 4 replicas, each requesting 1 nvidia.com/gpu resource
$ kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: time-slicing-verification
  labels:
    app: time-slicing-verification
spec:
  replicas: 4
  selector:
    matchLabels:
      app: time-slicing-verification
  template:
    metadata:
      labels:
        app: time-slicing-verification
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      hostPID: true
      containers:
      - name: cuda-sample-vector-add
        image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
        command: ["/bin/bash", "-c", "--"]
        args:
        - while true; do /cuda-samples/vectorAdd; done
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
Verify that all the pods of this deployment have come up on the same, already-created node and that it was able to accommodate them.
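A quick way to check is to list the pods along with the nodes they were scheduled on; all 4 should be Running on the single GPU node:

# The NODE column should show the same GPU node for every pod
$ kubectl get pods -l app=time-slicing-verification -o wide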
The GenAI revolution has changed the landscape of GPU requirements and made responsible resource utilization more critical than ever. Both approaches outlined here have shortcomings, but there is no way around being responsible with GPU costs in the current climate.