Efficient ML Storage with Kubernetes Volumes

Share training data: Multiple data scientists often train on the same data, or a team runs several experiments in parallel on the same dataset. The naive approach is to copy the data for each data scientist, but duplicated storage costs much more. A more efficient approach is to store the training data in a volume and mount that volume to each data scientist's notebook.
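As a rough illustration of the shared-volume idea, the sketch below builds a Kubernetes PersistentVolumeClaim manifest with the `ReadWriteMany` access mode, which is what allows several notebook pods to mount the same volume. It assumes a cluster where the `efs-sc` storage class listed for AWS in this article is available; the helper name `shared_dataset_claim`, the claim name `training-data`, and the `100Gi` size are hypothetical placeholders.

```python
import json


def shared_dataset_claim(name: str, storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim manifest that many pods can mount at once."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods on different nodes mount the same volume.
            "accessModes": ["ReadWriteMany"],
            # Assumption: this storage class exists on the cluster.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Hypothetical claim name and size, for illustration only.
    claim = shared_dataset_claim("training-data", "efs-sc", "100Gi")
    print(json.dumps(claim, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, after which each notebook pod references the same claim name in its volume definition.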

Model Storage: When models are hosted as real-time APIs, multiple replicas of the API server handle the traffic. Without a shared volume, every replica must download the model from the model registry (say S3) to its local disk. Repeating this download for every replica slows startup and adds S3 access costs. With volumes, you can store trained models externally and mount them onto the inference server. The model is never downloaded; the API server simply finds it on disk at the mounted path.
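On the inference side, this could look like the minimal sketch below: at startup the server reads the model straight from the mounted path and fails fast if the volume is missing. The function name `find_model`, the `MODEL_DIR` environment variable, the default mount path `/mnt/models`, and the file name `model.bin` are all assumptions made for illustration.

```python
import os
from pathlib import Path


def find_model(mount_dir: str, filename: str = "model.bin") -> Path:
    """Return the path of a model file that lives on a mounted volume."""
    path = Path(mount_dir) / filename
    if not path.is_file():
        # Fail fast: a missing file usually means the volume was not mounted.
        raise FileNotFoundError(f"model not found at {path}")
    return path


if __name__ == "__main__":
    # Hypothetical configuration: the mount path comes from an env variable.
    model_dir = os.environ.get("MODEL_DIR", "/mnt/models")
    try:
        print("loading model from", find_model(model_dir))
    except FileNotFoundError as err:
        print("startup check failed:", err)
```

Because no network download is involved, every replica starts from the same files and startup time no longer depends on the model registry.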

Artifact Sharing: Sometimes the output of one pipeline stage needs to be consumed by the next stage. For example, after finetuning a model, we might want to host it as an API just for experimentation. We could write the model to S3 and then download it again, but the upload and download alone take a lot of time. For faster experimentation, the finetuning job can instead write the model to a volume, and the inference service can then mount that volume.
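One way to sketch this handoff, assuming both stages mount the same volume: the finetuning job writes the artifact under a temporary name and renames it once complete, and the inference service polls for the final name so it never reads a half-written file. The helper names `publish_artifact` and `wait_for_artifact` and the polling defaults are hypothetical, and the rename is only atomic on filesystems that support atomic renames.

```python
import os
import time
from pathlib import Path


def publish_artifact(volume_dir: str, name: str, data: bytes) -> Path:
    """Producer side: write to a temp name, then rename into place."""
    final = Path(volume_dir) / name
    tmp = final.with_name(final.name + ".tmp")
    tmp.write_bytes(data)
    # The rename makes the finished file appear under its final name in one step.
    os.replace(tmp, final)
    return final


def wait_for_artifact(volume_dir: str, name: str,
                      timeout_s: float = 60.0, poll_s: float = 1.0) -> Path:
    """Consumer side: poll the shared volume until the artifact shows up."""
    final = Path(volume_dir) / name
    deadline = time.monotonic() + timeout_s
    while not final.exists():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{final} did not appear within {timeout_s}s")
        time.sleep(poll_s)
    return final
```

In this arrangement the finetuning job calls `publish_artifact` as its last step, and the inference service calls `wait_for_artifact` before loading the model.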

Checkpointing: While training machine learning models, it's common to save checkpoints periodically so that training can resume after a failure or the model can be fine-tuned later. Volumes can store these checkpoint files, ensuring that training progress is not lost when a job restarts after a failure. This also makes it practical to run training on spot instances, which can cut costs substantially.
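The resume logic can be sketched as follows, assuming the checkpoint path sits on a mounted volume that outlives the training pod. The training step is a stand-in (it only records a fake loss), and the names `load_checkpoint`, `save_checkpoint`, `train`, and the `fail_at` parameter used to simulate a spot interruption are hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Optional


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh state if none exists yet."""
    if path.is_file():
        return json.loads(path.read_text())
    return {"step": 0, "loss": None}


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file first so a crash never leaves a partial checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def train(ckpt_path: Path, total_steps: int, every: int,
          fail_at: Optional[int] = None) -> dict:
    """Run (or resume) training, checkpointing every `every` steps."""
    state = load_checkpoint(ckpt_path)
    for step in range(state["step"] + 1, total_steps + 1):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated spot interruption")
        # Stand-in for a real training step.
        state = {"step": step, "loss": 1.0 / step}
        if step % every == 0:
            save_checkpoint(ckpt_path, state)
    return state
```

If a run is interrupted, calling `train` again with the same checkpoint path continues from the last saved step instead of starting over, at the cost of repeating at most `every - 1` steps.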

AWS Storage Classes

TrueFoundry Storage Name	Cloud Provider Storage Name	Provisioner	Description
efs-sc	Elastic File System (EFS)	efs.csi.aws.com	A fully managed, scalable, and highly durable elastic file system that offers high availability, automatic scaling, and cost-effective general file sharing. It's suitable for workloads with varying capacity needs.

GCP Storage Classes

TrueFoundry Storage Name	Cloud Provider Storage Name	Provisioner	Description
standard-rwx	Google Basic HDD Filestore	filestore.csi.storage.gke.io	A cost-effective and scalable file storage solution ideal for general-purpose file storage and cost-sensitive workloads. It offers lower cost but also lower performance due to its HDD-based nature.
premium-rwx	Google Premium Filestore	filestore.csi.storage.gke.io	Provides higher performance and throughput compared to Basic HDD, making it suitable for I/O-intensive file operations and demanding workloads. It's SSD-based, offering higher performance at a higher cost.
enterprise-rwx	Google Enterprise Filestore	filestore.csi.storage.gke.io	Delivers the highest performance, throughput, advanced features, multi-zone support, and high availability, making it ideal for mission-critical workloads and applications with strict availability requirements. It comes with the highest cost.

Azure Storage Classes

TrueFoundry Storage Name	Cloud Provider Storage Name	Provisioner	Description
azurefile	Azure File Storage (Standard)	file.csi.azure.com	Uses Azure Standard storage to create file shares for general file sharing across VMs or containers, including Windows apps. It offers cost-effective performance.
azurefile-premium	Azure File Storage (Premium)	file.csi.azure.com	Uses Azure Premium storage for higher performance, making it suitable for I/O-intensive file operations.
azurefile-csi	Azure File Storage (StandardCSI)	file.csi.azure.com	Leverages Azure Standard storage with CSI for dynamic provisioning, potentially offering better performance and CSI features.
azurefile-csi-premium	Azure File Storage (PremiumCSI)	file.csi.azure.com	Combines Azure Premium storage with CSI for dynamic provisioning and high-performance file operations.
azureblob-nfs-premium	Azure Blob Storage (NFS Premium)	blob.csi.azure.com	Uses Azure Premium storage with NFS v3 protocol for accessing large amounts of unstructured data and object storage, catering to demanding workloads with NFS access.
azureblob-fuse-premium	Azure Blob Storage (Fuse Premium)	blob.csi.azure.com	Uses Azure Premium storage with BlobFuse for accessing large amounts of unstructured data and object storage, suitable for workloads that require BlobFuse access.

Volumes on Kubernetes

When to use Kubernetes volumes?

When to use Volume vs Blob storage like S3 / GCS / Azure Container?

Performance

Reliability

Cost

Access Constraints

Volume Provisioning Modes

Dynamic

Static

Dynamically Provisioned Volumes

Storage Classes

AWS Storage Classes

GCP Storage Classes

Azure Storage Classes

Statically Provisioned Volumes

Mount a GCS bucket as volume

Create a GCS bucket

Create Service account and Grant relevant permissions

Create Service Account in Workspace from TrueFoundry UI

Create a PersistentVolume object

Mount an S3 bucket as a Volume

Setting up IAM Policies and Relevant Roles

Creating a Persistent Volume on the Kubernetes Cluster

Mount an Existing EFS as a Volume

Install EFS CSI driver on your cluster

Create an Access Point for your EFS

Create a PersistentVolume on the cluster

Create a Volume on TrueFoundry

Using Volumes on TrueFoundry
