We are back with another episode of True ML Talks. In this episode, we dive deep into DoorDash's ML Platform with our guest, Hien Luu.
Hien Luu is a Senior Engineering Manager at DoorDash, leading the build-out of DoorDash's ML platform. DoorDash, as everyone knows, is one of the biggest food delivery companies in the US, valued at more than $25 billion.
Our conversation with Hien Luu covers the following aspects:
- ML Use Cases in DoorDash
- Designing a Scalable Model Serving Layer
- Shadowing Models: Accelerating Testing and Deployment
- Standardization via gRPC
- Streamlining Feature Engineering and Data Formats
- The Importance of Model Validation and Automated Retraining
- Challenges and Opportunities for MLOps in Supporting Generative AI and LLMs
The scalable model serving layer built by DoorDash's MLOps team is a crucial component of their machine learning infrastructure, supporting billions of predictions every day. The following are some insights into the architecture and the key decisions that enabled the growth of their model serving layer.
Implementing a shadowing layer within DoorDash's model serving infrastructure has revolutionized the speed at which models are tested and deployed. This section delves into the unique aspects of the shadowing layer, its distinction from Canary testing, and its profound impact on facilitating efficient model testing for data scientists.
DoorDash's shadowing layer simplifies the process, ensuring that data scientists can effortlessly conduct model tests. The implementation is both straightforward and powerful. Data scientists utilize configurations and an intuitive tool to specify a primary model and shadow models. With just a few clicks, they can allocate a desired percentage of incoming traffic (e.g., 1% or 2%) to be routed to the shadow models. The platform handles the rest, including loading the designated model into the appropriate pods, seamlessly routing the specified traffic, and logging predictions for the shadow models.
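The routing logic described above can be sketched in a few lines. This is a minimal illustration, not DoorDash's actual implementation: the config schema, model names, and prediction values below are all hypothetical, and the real platform handles pod placement and durable prediction logging rather than an in-process logger.

```python
import logging
import random

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

# Hypothetical config: a primary model plus shadow models, each with
# the fraction of incoming traffic to mirror to it (e.g. 0.02 = 2%).
SHADOW_CONFIG = {
    "primary": "fraud_v3",
    "shadows": {"fraud_v4_candidate": 0.02},
}

# Stand-ins for real loaded models; each maps features -> prediction.
MODELS = {
    "fraud_v3": lambda features: 0.12,
    "fraud_v4_candidate": lambda features: 0.09,
}

def predict(features, config=SHADOW_CONFIG, rng=random):
    """Serve the primary model's prediction; for a sampled fraction of
    traffic, also run each shadow model and log (never return) its output."""
    primary_pred = MODELS[config["primary"]](features)
    for shadow_name, fraction in config["shadows"].items():
        if rng.random() < fraction:
            shadow_pred = MODELS[shadow_name](features)
            log.info("shadow=%s pred=%s", shadow_name, shadow_pred)
    return primary_pred  # callers only ever see the primary model's output
```

The key property is that shadow models are invoked on live traffic but can never affect the response, so a misbehaving candidate is invisible to production callers.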
The simplicity and user-friendliness of DoorDash's shadowing layer have dramatically expedited the pace of testing and deployment for data scientists. By eliminating unnecessary complexities and minimizing reliance on engineering support, data scientists enjoy full autonomy over the shadowing process. This newfound agility empowers them to iterate on their models more frequently, resulting in an accelerated development cycle and fostering rapid innovation.
However, as the number of models and traffic volume increases, it is essential to address considerations such as the scalability of the logging system and cost management. Striking a balance between efficient operations and the expanding scope of model testing remains crucial for sustaining the benefits of the shadowing layer.
Standardized on gRPC

DoorDash adopted gRPC as the standard protocol across the company. This choice was driven by the need for stability and efficiency at scale. The binary protocol of gRPC, along with its battle-tested nature, appealed to DoorDash's focus on optimizing every aspect of their ML infrastructure. The decision to use gRPC for service-to-service communication ensured reliable and efficient interactions between components of the model serving layer.
"We all believe that when you do things at scale, every little thing matters, and I think the binary protocol is good for that when you start operating at scale, and gRPC has been battle-tested at many, many companies."
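To make the "binary protocol" point concrete, here is a toy comparison, not DoorDash's wire format: the same feature vector encoded as JSON text versus a packed binary payload (standing in for gRPC's protobuf encoding), showing how much smaller the binary form is per request.

```python
import json
import struct

# A small feature vector, as a prediction request might carry.
features = [0.25, 1.5, 3.75, 0.0]

# Text encoding: JSON, as a REST service might send it.
json_payload = json.dumps({"features": features}).encode("utf-8")

# Binary encoding: four little-endian 32-bit floats, 4 bytes each.
binary_payload = struct.pack(f"<{len(features)}f", *features)

# The binary payload is 16 bytes; the JSON payload is more than twice that.
```

Multiplied across billions of predictions a day, the saved bytes translate into real bandwidth, latency, and serialization-cost wins.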
In order to facilitate feature engineering and model training, DoorDash focused on optimizing its infrastructure and data formats. Initially, the company utilized Snowflake as a data warehouse, which provided efficient data storage and management. However, as they scaled their model training operations, retrieving data from Snowflake proved to be inefficient. Recognizing the need for a data lake, Hien Luu advocated for its implementation, drawing from his experience at LinkedIn where a data lake had proven to be a valuable asset for numerous use cases. Building a data lake took time and effort, but once in place, DoorDash could leverage it to construct their feature engineering framework.
The feature engineering framework served as an abstraction layer, allowing data scientists to express how they wanted features to be computed. DoorDash's infrastructure then handled the computation, scheduling of pipelines, and resource management on behalf of the data scientists. Collaborating with the data lake team, optimal formats were determined for storing the computed features.
In addition to the offline feature store, DoorDash also employed an online feature store. The majority of use cases involved online predictions integrated into production systems, necessitating the presence of an online feature store. Both offline and online feature stores were maintained, addressing the training and serving discrepancy commonly encountered in the industry. To synchronize the feature sets between the two stores, generated features were stored in the offline feature store and subsequently uploaded to the online feature store. By using the same logic for both offline and online scenarios, the feature engineering framework simplified the process. Data scientists could specify their desired features for both stores and rely on the infrastructure to handle the underlying mechanisms, such as scheduling the uploads.
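The offline/online synchronization described above can be sketched as follows. This is a simplified illustration under assumed names: the feature function, store structures, and `materialize` helper are hypothetical, while the design point, computing a feature once and writing the same values to both stores, is the one from the text.

```python
def avg_order_value(orders):
    """A single feature computation, shared by offline and online paths."""
    return sum(o["total"] for o in orders) / len(orders)

# Stand-ins for the real stores: the offline store backed by the data
# lake (for training), the online store a low-latency key-value store
# (for serving). Both map entity_id -> {feature_name: value}.
offline_store = {}
online_store = {}

def materialize(entity_id, orders):
    """Compute features once, write them offline, then upload online."""
    # 1. Compute and persist to the offline store for model training.
    offline_store[entity_id] = {"avg_order_value": avg_order_value(orders)}
    # 2. Upload the same computed values to the online store, so the
    #    model sees identical feature values at training and serving time.
    online_store[entity_id] = dict(offline_store[entity_id])
```

Because both stores are populated from one computation, the training/serving skew that arises from reimplementing feature logic twice is avoided by construction.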
Ensuring the accuracy and reliability of machine learning models is a critical aspect of the MLOps process. Model validation involves testing the performance of a model using real-world data to verify its effectiveness. By automating this validation process using tools like MLflow, data scientists can track experiments, compare results, and evaluate different models based on their performance metrics. Model validation provides confidence in the model's ability to make accurate predictions and informs decision-making in the deployment process.
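A validation gate of this kind can be sketched as below. The function names, the choice of accuracy as the metric, and the promotion rule are all illustrative assumptions; the MLflow experiment tracking mentioned above is omitted to keep the sketch self-contained.

```python
def accuracy(model, dataset):
    """Fraction of (features, label) examples the model labels correctly."""
    correct = sum(1 for features, label in dataset if model(features) == label)
    return correct / len(dataset)

def should_promote(candidate, production, holdout, margin=0.0):
    """Promote the candidate only if it matches or beats the production
    model on real held-out data, within an optional required margin."""
    return accuracy(candidate, holdout) >= accuracy(production, holdout) + margin
```

Running every candidate through the same held-out comparison before deployment is what gives the team confidence that a promoted model will actually perform in production.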
Automated retraining takes model validation a step further by enabling models to be automatically retrained based on predefined criteria or thresholds. This proactive approach ensures that models stay up-to-date and continue to perform optimally over time. By minimizing manual intervention, MLOps teams can reduce the risk of human error and streamline the retraining process.
Implementing automated retraining requires careful consideration of each model's specific needs and potential consequences. MLOps teams must design and implement safeguards and flexible processes to ensure that models are retrained appropriately. This involves planning and testing to determine the optimal retraining frequency, criteria for retraining, and strategies for promoting the retrained models to production.
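One way to encode such per-model safeguards is a retraining policy like the sketch below. The specific triggers (a degradation threshold and a maximum model age) and their default values are hypothetical, chosen only to illustrate the "predefined criteria or thresholds" idea from the text.

```python
from datetime import datetime, timedelta

def needs_retraining(current_metric, baseline_metric, last_trained, now,
                     max_degradation=0.05, max_age=timedelta(days=30)):
    """Trigger retraining when performance has degraded past a per-model
    threshold OR the model has exceeded its maximum allowed age."""
    degraded = baseline_metric - current_metric > max_degradation
    stale = now - last_trained > max_age
    return degraded or stale
```

Separating the policy (thresholds, cadence) from the mechanism (the retraining pipeline) lets each model's owner tune the criteria without touching shared infrastructure.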
The benefits of automated retraining are substantial. By continuously updating models, organizations can maintain their accuracy and reliability, adapt to evolving data patterns, and address potential performance degradation. Automated retraining also reduces the risk of errors and downtime in production environments, as models are proactively improved and updated.
Incorporating model validation and automated retraining into the MLOps infrastructure is crucial for building robust and reliable machine learning systems. By leveraging automation tools and implementing well-designed processes, organizations can ensure that their models deliver accurate predictions consistently and adapt to changing conditions effectively.
Generative AI and large language models (LLMs) have the potential to revolutionize many industries, including food delivery. However, effectively leveraging these technologies requires MLOps teams to tackle several challenges and opportunities.
Here is another interesting blog written by the team at DoorDash on Generative AI:
Keep watching the TrueML YouTube series and reading the TrueML blog series.
TrueFoundry is an ML deployment PaaS built on Kubernetes that speeds up developer workflows while giving them full flexibility in testing and deploying models, and full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost and release models to production faster, enabling real business value realisation.