As machine learning use cases have expanded in recent years, the need to scale the operations around training, deploying and monitoring these models has grown with them. Many of these concerns mirror ones that have already been “solved” for general software use cases. Kubernetes is one such piece of open source software, having consolidated the cloud native ecosystem around itself by serving as the underlying platform.
Hence, it is worth exploring whether Kubernetes can be usefully leveraged for a machine learning use case. Let’s first start with Kubernetes itself and what’s so interesting about it.
(Image credit: Kubernetes)
💡 Kubernetes is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications.
In simpler terms, Kubernetes provides a standardized way to run and operate workloads that need to be dynamically scaled across multiple machines.
Let’s go through some of the most popular features -

- Declarative configuration: you describe the desired state of your workloads, and Kubernetes continuously reconciles the cluster towards it.
- Self-healing: containers that crash or fail health checks are restarted or replaced automatically.
- Horizontal scaling: workloads can be scaled out manually, or automatically based on observed metrics.
- Service discovery and load balancing: pods get stable DNS names, and traffic is spread across healthy replicas.
- Automated rollouts and rollbacks: changes are rolled out incrementally and can be rolled back if they misbehave.
- Storage orchestration: local disks or cloud storage volumes are mounted into containers on demand.
- Secret and configuration management: credentials and configuration are deployed and updated without rebuilding container images.
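To make the declarative model concrete, here is a minimal sketch of a Deployment manifest; the name, labels and image are hypothetical placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-api            # hypothetical name for illustration
spec:
  replicas: 3               # desired state: three identical pods
  selector:
    matchLabels:
      app: demo-api
  template:
    metadata:
      labels:
        app: demo-api
    spec:
      containers:
      - name: demo-api
        image: registry.example.com/demo-api:1.0  # placeholder image
        ports:
        - containerPort: 8080
```

Applying this manifest asks Kubernetes to keep three identical pods running; if one dies, a replacement is scheduled automatically.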
These are just some of the features available by default. A large number of use cases are solved by tooling built with Kubernetes as the underlying layer. We will get into specific tools in a subsequent issue.
With an understanding of what Kubernetes is and the major features it provides in a software development scenario, let’s delve into the specific problems it can solve in a data scientist’s workflow.
The figure above gives a broad outline of how a typical data science pipeline runs. Many companies glue it together with a wide variety of bespoke solutions that have overlapping features.
We’ll go through each of these steps to make sense of where Kubernetes fits the bill -
Before any raw data can be made useful, it must first be transformed into sanitised inputs for the model training pipeline. This is where feature stores come into the picture, handling the transformation, storage and serving of feature data.
Kubernetes supports deployment of stateful workloads and integrates very well with cloud providers to seamlessly provide persistence.
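As a sketch, assuming Redis as a hypothetical online feature store backend, a StatefulSet with a volume claim template gives each replica its own persistent volume, typically backed by the cloud provider’s block storage:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: feature-store            # hypothetical name
spec:
  serviceName: feature-store     # assumes a matching headless Service exists
  replicas: 1
  selector:
    matchLabels:
      app: feature-store
  template:
    metadata:
      labels:
        app: feature-store
    spec:
      containers:
      - name: redis
        image: redis:7           # assuming Redis as the online store
        volumeMounts:
        - name: data
          mountPath: /data       # Redis persists its dump file here
  volumeClaimTemplates:          # one PersistentVolumeClaim per replica
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi          # provisioned by the cloud provider's CSI driver
```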
Most model development starts with an ML engineer writing code in a Jupyter notebook, and for many that is almost all that is needed: it provides a REPL interface for running Python code. Notebooks often start out hosted on personal laptops, but it is better to run a centralised pool of hosted Jupyter notebooks that multiple individuals can share.
The declarative model of Kubernetes, along with its support for persistent storage systems, makes it trivial to host a pool of notebooks and to enforce access control over individual notebooks for effective collaboration.
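A minimal sketch of a single-user notebook pod, assuming the community jupyter/minimal-notebook image and a pre-existing PersistentVolumeClaim (both illustrative choices):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: notebook-alice               # hypothetical per-user notebook
  labels:
    user: alice
spec:
  containers:
  - name: notebook
    image: jupyter/minimal-notebook:latest  # community image; an assumption
    ports:
    - containerPort: 8888            # Jupyter's default port
    volumeMounts:
    - name: workspace
      mountPath: /home/jovyan/work   # default working dir in Jupyter images
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: notebook-alice-pvc  # assumed to exist already
```

In practice, projects like JupyterHub automate exactly this kind of per-user provisioning on Kubernetes.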
Any algorithm written in a notebook needs to be fed training data to produce a model artefact as output. For smaller use cases this can be done in the notebook itself, but larger datasets require a much more powerful pipeline. Validation against a test data set is also usually performed here before the artefact is used for inference in production.
There are multiple solutions for orchestrating a DAG pipeline on Kubernetes: Airflow has native support for Kubernetes, while Kubeflow has been built entirely on top of it. All the major monitoring solutions provide first-class integration with Kubernetes, which is essential for running production grade pipelines.
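As one hedged example, a Kubernetes-native option is Argo Workflows (which Kubeflow Pipelines itself has used as its execution engine); a training step can be declared as a manifest, with the image and command below being hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-model-       # Argo appends a random suffix per run
spec:
  entrypoint: train
  templates:
  - name: train
    container:
      image: registry.example.com/trainer:latest  # placeholder training image
      command: ["python", "train.py"]
      args: ["--epochs", "10"]                    # hypothetical flags
      resources:
        limits:
          nvidia.com/gpu: 1        # schedule onto a GPU node
```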
This stage takes care of storing and versioning the dataset and the model, ensuring that any model artefact remains reproducible for as long as needed. A parallel can be drawn to how code is managed with git.
Although the underlying data store for such management systems can be hosted on Kubernetes itself, in many cases it is better to use a managed solution from a cloud provider. Most cloud providers seamlessly integrate their own IAM systems with that of Kubernetes, making it safe to access data from outside the cluster without having to store access credentials.
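For instance, on AWS EKS a Kubernetes ServiceAccount can be annotated with an IAM role (IAM Roles for Service Accounts), so pods can read from, say, an S3 bucket without any stored keys; the role ARN below is a placeholder:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: model-registry
  annotations:
    # IRSA: pods using this ServiceAccount assume the IAM role below,
    # so no long-lived AWS credentials are stored in the cluster.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/model-registry  # placeholder ARN
```

GKE’s Workload Identity provides the equivalent via the iam.gke.io/gcp-service-account annotation.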
Finally, the model artefact is packaged so that a production system can make inferences on top of it. This usually involves wrapping the model in an API framework and allowing other services to call it for inference. Concerns familiar from software engineering, such as authn/authz, scalability and reliability, come into the picture here.
This is where Kubernetes shines. Most of the features we talked about in the earlier section become critical at this stage.
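As a sketch of autoscaling an inference service, assuming a Deployment named model-server already exists (a hypothetical name), a HorizontalPodAutoscaler keeps CPU utilisation near a target by adding or removing replicas:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server        # hypothetical inference Deployment
  minReplicas: 2              # keep headroom for availability
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # scale out above ~70% average CPU
```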
As with any production system, continuously monitoring the currently deployed model is essential to make sure the system is behaving the way it is expected to. The metrics to watch can include everything from the actual accuracy of the predictions to the latency and throughput the system is able to support.
A lot of monitoring solutions integrate closely with Kubernetes. Discovering targets to scrape metrics from, performing computation on top of them and storing them for later use can all be done without any external dependency.
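As a hedged sketch, assuming the Prometheus Operator is installed and the inference service exposes a /metrics endpoint, a ServiceMonitor tells Prometheus which services to scrape:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: model-server
spec:
  selector:
    matchLabels:
      app: model-server       # scrape Services carrying this label
  endpoints:
  - port: http                # named port on the Service (an assumption)
    path: /metrics            # assumed metrics endpoint
    interval: 30s
```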
The landscape around Kubernetes has been exploding, and a lot of tooling is already out there. There are, however, some pitfalls that any organisation should keep in mind before adopting it wholesale. We’ll get into those, and how they can be mitigated, in the next issue.