We are back with another episode of True ML Talks. In this episode, we'll delve deep into the ML architecture at Cookpad, one of the world's largest recipe service platforms. We will also cover the challenges of building a successful machine learning platform and how Cookpad uses NVIDIA's Triton Inference Server to run its models. We are talking with Jose Navarro.
Jose is the Lead ML Platform Engineer at Cookpad, where he helps machine learning engineers and other ML practitioners deliver ML systems quickly and reliably.
Our conversation with Jose covers the following aspects:
- Structure of the ML teams @ Cookpad
- GPU-based ML Infrastructure
- Automated Model Deployment at Cookpad
- Feature Store Integration and Configuration for Online Inference
- Data Sources and Feature Management during Model Experiments
- Using Argo Workflows for retraining ML Models
- Nvidia's Triton Inference Server and its Benefits
- Integrating MLflow Model Registry with Triton Inference Server
- Leveraging LLMs and Gen AI at Cookpad
- Tailoring the MLOps Architecture to Your Needs
Automated Model Deployment with MLflow and Backend Support

When a machine learning engineer at Cookpad is experimenting, they have access to a managed JupyterHub where they can run experiments using various kernels. Once they have developed a model, they publish it to MLflow, which serves as the model registry. From there, the process is fully automated.

Cookpad's automation pipeline supports a list of backends that are compatible with both MLflow and the Triton Inference Server, including PyTorch, TensorFlow, and ONNX. The pipeline handles the deployment of registered models, choosing the appropriate backend based on compatibility. The default choices for deployment are usually PyTorch or TensorFlow, with ONNX used to optimize certain neural networks.
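To make the flow concrete, here is a minimal sketch of how a registered MLflow model could be turned into a Triton model repository entry. This is an illustration, not Cookpad's actual pipeline code: the flavor-to-backend mapping, tag names, and repository paths are assumptions.

```python
# Minimal sketch: map an MLflow-registered model to a Triton model repository entry.
# The flavor-to-backend mapping, tag names, and paths are illustrative assumptions.
import os
import mlflow
from mlflow.tracking import MlflowClient

# MLflow model flavors that this hypothetical pipeline knows how to serve on Triton.
FLAVOR_TO_TRITON_PLATFORM = {
    "pytorch": "pytorch_libtorch",
    "tensorflow": "tensorflow_savedmodel",
    "onnx": "onnxruntime_onnx",
}

def deploy_to_triton_repo(model_name: str, version: str, triton_repo: str = "/models"):
    client = MlflowClient()
    mv = client.get_model_version(model_name, version)

    # Download the registered model version's artifacts from MLflow.
    local_path = mlflow.artifacts.download_artifacts(mv.source)

    # Pick a Triton platform based on the model's flavor (stored here as a tag).
    flavor = mv.tags.get("flavor", "pytorch")
    platform = FLAVOR_TO_TRITON_PLATFORM[flavor]

    # Triton expects <repo>/<model_name>/<version>/<artifact> plus a config.pbtxt.
    version_dir = os.path.join(triton_repo, model_name, version)
    os.makedirs(version_dir, exist_ok=True)
    # ... copy the model artifacts from local_path into version_dir ...

    config = f'name: "{model_name}"\nplatform: "{platform}"\nmax_batch_size: 8\n'
    with open(os.path.join(triton_repo, model_name, "config.pbtxt"), "w") as f:
        f.write(config)
```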
At Cookpad, data scientists use the feature store for both experimentation and online inference. During experimentation, they have offline access to query existing features for model training. When exploring new features, data is pulled from the Data Warehouse, transformed, and used to create new features. Once satisfied with the model's performance, data scientists create a new feature group in the feature store by adding its schema to the repository.
Data flows from the Data Warehouse to the feature store via Kafka, allowing streaming of newly created features. To enable online inference, data scientists extend the streaming service to consume relevant events and perform transformations. During model registration, data scientists specify the model's construction and features used, and the transform code is configured to retrieve specific features by name.
This integration between the feature store, Data Warehouse, and streaming service ensures seamless incorporation of features into the online inference process, offering flexibility for adjustments and updates when needed.
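As an illustration of the "retrieve features by name" step, the sketch below assumes an AWS-backed online feature store (SageMaker Feature Store is used here purely as an example); the feature group, record identifier, and feature names are made up.

```python
# Illustrative sketch of fetching named features at inference time.
# Assumes a SageMaker Feature Store online store; names below are placeholders.
import boto3

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

def get_online_features(user_id: str, feature_names: list[str]) -> dict:
    response = featurestore_runtime.get_record(
        FeatureGroupName="user_recipe_interactions",  # hypothetical feature group
        RecordIdentifierValueAsString=user_id,
        FeatureNames=feature_names,                   # names come from model config
    )
    return {f["FeatureName"]: f["ValueAsString"] for f in response["Record"]}

# The feature names are read from the model's registered configuration,
# so changing the feature set only requires a configuration change.
features = get_online_features("user-123", ["recent_recipe_views", "preferred_cuisine"])
```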
During model experiments, Cookpad's data scientists have two options for data sourcing. If the required features are already available in the feature store, they can directly query it offline. However, when exploring new features, they access the Data Warehouse, retrieve the data, and create the necessary transformations for training the models.
Incorporating new features into the feature store is straightforward. Data scientists submit a pull request (PR) with the feature's schema details, and automation creates the feature group through AWS API calls. Data from the Data Warehouse is then streamed through Kafka so that the newly created features are available for online inference. The existing streaming service is extended to consume events and apply transformations, keeping features flowing into the feature store.
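A rough sketch of what the "PR schema to AWS call" automation could look like is shown below, again assuming SageMaker Feature Store as the backing service; the schema shape, role ARN, and S3 bucket are placeholders, not Cookpad's real configuration.

```python
# Rough sketch of turning a schema submitted via PR into a feature group.
# Assumes SageMaker Feature Store; all names and ARNs are placeholders.
import boto3

sagemaker = boto3.client("sagemaker")

# A schema like this could live in the repository and be submitted via pull request.
feature_group_schema = {
    "name": "user_recipe_interactions",
    "record_identifier": "user_id",
    "event_time": "event_time",
    "features": [
        {"FeatureName": "user_id", "FeatureType": "String"},
        {"FeatureName": "event_time", "FeatureType": "String"},
        {"FeatureName": "recent_recipe_views", "FeatureType": "Integral"},
    ],
}

def create_feature_group(schema: dict) -> None:
    sagemaker.create_feature_group(
        FeatureGroupName=schema["name"],
        RecordIdentifierFeatureName=schema["record_identifier"],
        EventTimeFeatureName=schema["event_time"],
        FeatureDefinitions=schema["features"],
        OnlineStoreConfig={"EnableOnlineStore": True},
        OfflineStoreConfig={"S3StorageConfig": {"S3Uri": "s3://example-feature-store"}},
        RoleArn="arn:aws:iam::123456789012:role/feature-store-role",
    )
```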
For maintaining feature configuration, data scientists include relevant information when registering the model in MLflow. The transformation code accesses the feature store directly, and through configuration, data scientists specify the required feature names. This flexibility allows easy modification of feature configurations as needed.
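The sketch below shows one way feature configuration could be attached to a model at registration time in MLflow; the registered model name, tag key, and feature names are assumptions for illustration.

```python
# Minimal sketch: recording which features a model needs when registering it in MLflow.
# The registered model name, tag key, and feature names are illustrative assumptions.
import json
import mlflow
import torch

FEATURE_NAMES = ["recent_recipe_views", "preferred_cuisine"]
model = torch.nn.Linear(8, 1)  # stand-in for the trained model

with mlflow.start_run():
    # Log and register the trained model (PyTorch shown as one supported backend).
    mlflow.pytorch.log_model(model, artifact_path="model",
                             registered_model_name="recipe-ranker")
    # Store the feature configuration alongside the run, so the online transform
    # code can resolve feature names from configuration rather than code.
    mlflow.set_tag("feature_names", json.dumps(FEATURE_NAMES))
```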
The process of incorporating retraining pipelines into the architecture is currently a work in progress at Cookpad. While the focus has been on iterating quickly through new ML features, the implementation of mature retraining pipelines that replace or retrain models on a daily or weekly basis is still being developed. The recommendation systems at Cookpad are still in an early stage, and the iterative approach allows for rapid experimentation and model replacement through AB testing.
Although the potential exists to build reproducible pipelines using Argo Workflows, Cookpad acknowledges that they are in the early stages of this implementation. It is not yet an ideal solution, and the reproducibility of the pipelines is a challenge they are actively addressing.
Starting with smaller, simpler experiments and automating critical pipeline components allows for a well-thought-out architecture. Cookpad prioritized automating inference, recognizing its criticality, and plans to focus on retraining pipelines in the future. This organized and incremental approach to building the platform is a valuable learning experience for the audience, highlighting the effectiveness of the methodology.
Jose notes that attempting to build an end-to-end system from the beginning often results in unnecessary or ill-fitting components. For reproducible pipelines, he suggests exploring alternatives to Argo Workflows, such as a Python wrapper or a different tool that aligns better with the machine learning engineers' familiarity with Kubernetes manifests and fits well with their CI/CD practices; one possible shape of such a wrapper is sketched below.
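As a minimal sketch (not what Cookpad runs), a thin Python wrapper could be little more than a function that builds the Argo Workflow manifest from plain data structures and emits YAML for the existing CI/CD flow; the step names and container image are placeholders.

```python
# One possible shape of a thin Python wrapper over Argo Workflows: build the
# manifest as plain dictionaries and emit YAML. Step names and image are placeholders.
import yaml

def retraining_workflow(name: str, image: str, steps: list[str]) -> dict:
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{name}-"},
        "spec": {
            "entrypoint": "pipeline",
            "templates": [
                {
                    # Sequential steps, each referring to a container template below.
                    "name": "pipeline",
                    "steps": [[{"name": step, "template": step}] for step in steps],
                },
                *[
                    {
                        "name": step,
                        "container": {"image": image, "command": ["python", f"{step}.py"]},
                    }
                    for step in steps
                ],
            ],
        },
    }

manifest = retraining_workflow("retrain-ranker", "example.com/ml/retrain:latest",
                               ["extract_features", "train", "evaluate", "register"])
print(yaml.safe_dump(manifest, sort_keys=False))
```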
By leveraging Nvidia's Triton Inference Server, Cookpad achieves cost optimization, enhances model inference performance, and simplifies deployment for ML engineers.
Cookpad has integrated the MLflow Model Registry with the Triton Inference Server to streamline the deployment of models at scale.
By leveraging this integration, Cookpad enables efficient model deployment from MLflow to Triton Inference Server, allowing for scalability and easy updates without disrupting ongoing inference operations.
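On the serving side, a client request against a Triton-hosted model looks roughly like the sketch below; the model name, tensor names, and shapes are assumptions. Triton's versioned model repository is what allows new versions to be loaded without interrupting in-flight requests.

```python
# Illustrative client call against a model served by Triton over HTTP.
# The model name, input/output tensor names, and shapes are assumptions.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request tensor; Triton resolves the latest ready model version,
# so new versions can be rolled in without pausing traffic.
features = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("input__0", features.shape, "FP32")
infer_input.set_data_from_numpy(features)

result = client.infer(model_name="recipe-ranker", inputs=[infer_input])
scores = result.as_numpy("output__0")
print(scores)
```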
Cookpad is actively exploring potential use cases for Large Language Models (LLMs) and Generative AI (Gen AI) within its platform. While specific implementations are still in the exploration phase, Cookpad envisions leveraging these advancements in several areas of the product.
During the development and implementation of these use cases, Cookpad prioritizes user data privacy and compliance. As Jose puts it: "We have to follow that process and make sure that they are compliant with security."
When it comes to building a real-time inference stack with MLOps, there is no one-size-fits-all approach. Jose Navarro emphasizes tailoring the architecture to the specific requirements and the maturity level of the machine learning (ML) practice within a company, rather than starting from a fixed checklist of components.
Keep watching the TrueML YouTube series and reading the TrueML blog series.
TrueFoundry is an ML deployment PaaS built on Kubernetes that speeds up developer workflows while giving teams full flexibility in testing and deploying models, and full security and control for the Infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost and release models to production faster, enabling real business value realisation.