We are back with another episode of True ML Talks. In this episode, we dive deep into Salesforce's ML Platform with our guest, Arpeet Kale.
Arpeet was part of the engineering team at Salesforce that built the entire ML platform. He is one of the founders of Builders Fund, where he and his colleagues invest in and advise ML/AI companies across the world. He is also the head of infrastructure at Skiff.
Our conversation with Arpeet covers the following aspects:
- ML Use Cases in Salesforce
- Salesforce ML Team Structure
- Overview of Salesforce ML Infrastructure
- Prototyping ML Models at Salesforce
- Managing Costs for Large-Scale ML Projects in the Cloud
- Automated Flow for Moving Models
- Building a Multi-Tenant Real-Time Prediction Service
- Optimization of Models for Enterprise AI
- Security and Reliability Measures in the Salesforce AI Platform
- ML Infrastructure Platform vs Software Deployment Platform
At Salesforce, the ML team was divided into three teams:
We found this interesting blog on how ML is used by Salesforce:
Salesforce ML infrastructure was built on top of a tech stack that was chosen to provide a scalable and reliable platform. Here are some of the most relevant and notable points about the infrastructure:
The main reasons why it is important to separate clusters in machine learning infrastructure are:
1. Security: Separating clusters reduces the risk of data breaches and unauthorized access to sensitive data. Each team can work in their own environment with the necessary security measures.
2. Data Compliance: Different teams may have different data compliance requirements, which can be met by separating clusters. This ensures that each team is working with data that meets the necessary regulatory requirements.
3. Resource Management: Separating clusters allows teams to have the resources they need to complete their tasks without interfering with the resources of other teams. This ensures efficient use of resources and prevents resource contention.
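To make the resource-management point concrete, here is a minimal, hypothetical sketch of per-team isolation using Kubernetes namespaces and resource quotas via the official Python client. Salesforce used fully separate clusters for stricter security and compliance boundaries; the team names and quota values below are illustrative only.

```python
# Hypothetical sketch: per-team isolation with namespaces and quotas using the
# Kubernetes Python client. Team names and quota values are made up; separate
# clusters provide the stronger isolation described above.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core = client.CoreV1Api()

teams = {
    "feature-engineering": {"cpu": "200", "memory": "800Gi"},
    "model-training": {"cpu": "500", "memory": "2Ti", "requests.nvidia.com/gpu": "16"},
}

for team, quota in teams.items():
    # One namespace per team keeps workloads and credentials separated.
    core.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=team)))
    # A ResourceQuota prevents one team from starving the others of compute.
    core.create_namespaced_resource_quota(
        namespace=team,
        body=client.V1ResourceQuota(
            metadata=client.V1ObjectMeta(name=f"{team}-quota"),
            spec=client.V1ResourceQuotaSpec(hard=quota),
        ),
    )
```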
At Salesforce, the prototyping framework was built around Jupyter Notebooks, allowing data scientists to run short-term experiments interactively and in real-time. The experiments were then transitioned to a long-running job on a large-scale cluster, producing real-time metrics as the job ran.
The training and experimentation SDK was built to abstract the complexity of scheduling jobs, pulling and pushing data, and system dependencies. Data scientists could call a Python API or function to take care of these tasks, and track experiment progress, metrics, logs, and more in the workbench dashboard.
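The SDK itself is internal, but a hedged sketch of what such an abstraction might look like from the data scientist's side is below. The `workbench` module name, function signatures, dataset path, and parameters are all illustrative assumptions, not Salesforce's actual API.

```python
# Hypothetical sketch of a training/experimentation SDK call; module, function,
# and parameter names are illustrative, not Salesforce's actual interface.
from workbench import experiments

job = experiments.submit(
    name="churn-model-v3",
    entrypoint="train.py",                     # training script in the project repo
    dataset="s3://ml-datasets/churn/2023-04",  # SDK handles pulling/pushing data
    instance_type="gpu.large",                 # scheduling is abstracted away
    hyperparams={"lr": 3e-4, "epochs": 20},
)

# Progress, metrics, and logs stream back to the workbench dashboard.
job.wait()
print(job.metrics["validation_auc"])
```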
The framework was opinionated, providing an abstracted solution, but still allowing for some flexibility in how data scientists chose to use the platform. However, it was not a completely freeform-style experiment, and there were internal guidelines and standards to follow.
Challenges of Hosting Jupyter Notebooks at Scale with Sensitive Data:
When hosting Jupyter Notebooks at a large scale with sensitive data, the major challenge is the approval workflow for authentication and data access. Data scientists must obtain approval from a designated person or manager before they can access the data. The notebook environment is ephemeral and destroyed after experiments are completed, but all artifacts generated are persisted. The authentication is API-driven and integrated with internal systems.
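As a rough illustration of an API-driven approval gate, here is a hedged sketch. The approval-service URL, payload shape, and helper name are hypothetical and stand in for whatever internal system handles these requests.

```python
# Hypothetical sketch of an approval-gated data mount for an ephemeral notebook;
# the approval service endpoint and payload are illustrative only.
import requests

def request_dataset_access(user: str, dataset: str, approver: str) -> bool:
    """File an access request and only allow the mount once it is approved."""
    resp = requests.post(
        "https://approvals.internal.example.com/api/v1/requests",
        json={"requester": user, "resource": dataset, "approver": approver},
        timeout=10,
    )
    resp.raise_for_status()
    # Only an explicit approval lets the notebook mount the sensitive data;
    # generated artifacts persist after the notebook environment is destroyed.
    return resp.json().get("status") == "approved"
```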
Large-scale machine learning projects can quickly become costly, especially when utilizing GPU resources in the cloud. To manage costs during the prototyping phase, a few strategies can be employed, primarily reserving GPU capacity in advance and auto-scaling it based on demand.
While there are other strategies for reducing costs, such as utilizing spot instances, these often require a lot of engineering effort and may not be practical for long-running jobs. Additionally, spot instances may not always be available in regions with GPU resources.
By utilizing reserved capacity and auto-scaling, you can effectively manage costs while still having the resources you need for your machine learning projects. These strategies continue to be relevant today and can be applied to any public cloud provider.
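As an illustration of combining a small reserved baseline with demand-driven auto-scaling, here is a hedged boto3 sketch. The group name, subnets, launch template, and the custom GPU-utilization metric are assumptions for the example, not a description of Salesforce's setup.

```python
# Hypothetical sketch: a small always-on baseline (covered by reserved capacity)
# plus target-tracking auto-scaling for bursts. Names and metrics are made up.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="ml-training-gpu",
    LaunchTemplate={"LaunchTemplateName": "gpu-training-node", "Version": "$Latest"},
    MinSize=2,           # baseline covered by reserved instances
    MaxSize=20,          # on-demand burst ceiling
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaaa,subnet-bbbb",
)

# Scale on a published utilization metric instead of keeping GPUs idle.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="ml-training-gpu",
    PolicyName="scale-on-gpu-utilization",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            "MetricName": "GPUUtilization",
            "Namespace": "ML/Training",
            "Statistic": "Average",
        },
        "TargetValue": 70.0,
    },
)
```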
Salesforce's promotion flow for moving models from one environment to another relied on the notion of golden datasets for every domain. The data scientists could evaluate the model's performance on these datasets and also on randomized datasets to assess the model's capability to perform well on different types of data. This helped them decide whether to promote a model into higher environments or not.
The promotion process was done through the workbench, but it was intentionally kept slightly manual to ensure that the model performed beyond a certain threshold on n+1 types of datasets. This was challenging because Salesforce is a multi-tenant system, and every customer has a different dataset, sometimes numbering in the hundreds of thousands. Salesforce built hundreds of thousands of models, each specific to a customer and dataset, and automated the process as much as possible.
Overall, the promotion flow at Salesforce was designed to ensure that models were thoroughly evaluated and performed well on diverse datasets before being promoted to higher environments.
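A minimal sketch of the promote/hold decision is below, assuming a set of golden datasets per domain plus randomized hold-out sets and a fixed metric threshold. The function names and the threshold value are illustrative stand-ins, not the actual internal checks.

```python
# Hypothetical sketch of a golden-dataset promotion gate; evaluate() and the
# threshold are illustrative stand-ins for the real evaluation pipeline.
from statistics import mean

def should_promote(model, golden_datasets, randomized_datasets, evaluate,
                   threshold=0.85):
    """Promote only if the model clears the bar on every dataset."""
    all_datasets = {**golden_datasets, **randomized_datasets}
    scores = {name: evaluate(model, ds) for name, ds in all_datasets.items()}
    worst = min(scores.values())
    print(f"mean={mean(scores.values()):.3f}  worst={worst:.3f}")
    # A single weak dataset blocks promotion to the higher environment.
    return worst >= threshold
```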
Building a multi-tenant real-time prediction service is a complex task that involves serving a large number of models with different sizes and architectures in real-time while meeting specific SLA requirements. To address this challenge, the engineering team at Salesforce developed a serving layer that underwent several iterations.
Initially, the team relied on a structured database for metadata and a file store for model artifacts. However, this approach was not scalable for larger and more complex models. To solve this, they sharded their clusters based on the complexity of the model and the type of compute required. For instance, smaller models ran on CPUs, while larger models needed GPUs. Clusters were dedicated to specific types of models, such as NLP models, LSTM models, transformer models, image classification models, object detection models, and OCR models.
The team also developed a layer that orchestrated deploying services on different clusters and node groups. They implemented caching to ensure frequently requested models had lower latencies. Initially, data and research scientists were allowed to use their preferred framework, which made it challenging to uniformly serve the models. The team narrowed down the frameworks to one or two and optimized the models for these frameworks.
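As an illustration of the routing-plus-caching idea (hypothetical class and helper names; not the patented Salesforce implementation), a per-process LRU cache keyed by tenant and model keeps frequently requested models warm and bounds memory use:

```python
# Hypothetical sketch of tenant-aware model routing with an LRU cache so that
# frequently requested models stay loaded; the loader function is a stand-in
# for pulling artifacts from the model store.
from collections import OrderedDict

class ModelCache:
    def __init__(self, load_model, max_models=50):
        self._load = load_model          # fetches the artifact on a cache miss
        self._cache = OrderedDict()      # (tenant_id, model_id) -> model
        self._max = max_models

    def get(self, tenant_id, model_id):
        key = (tenant_id, model_id)
        if key in self._cache:
            self._cache.move_to_end(key)          # mark as recently used
            return self._cache[key]
        model = self._load(tenant_id, model_id)   # cold path: load artifact
        self._cache[key] = model
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)       # evict least recently used
        return model

def predict(cache, tenant_id, model_id, features):
    # Each tenant's request is routed to its own model instance.
    return cache.get(tenant_id, model_id).predict(features)
```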
Finally, the team converted the models into a uniform format regardless of the original training framework, allowing them to optimize the serving code for each type of model. Overall, the team's efforts resulted in a scalable, efficient, and reliable real-time prediction service.
The real-time inference was my favorite thing to work on. And I think, by the end of it, we also were able to file a patent on it. So, it was a great engineering feature that we added to the platform. It was the most used feature, actually. We were doing double-digit millions of predictions per day, so it was very, very satisfying to see that getting used by so many customers. - Arpeet
We found this interesting blog on ML Lakes and the architecture of Salesforce's Data Platform:
They heavily benchmarked models and aimed to stay within the bounds of widely supported operators and other operations within a framework to ensure easy conversion. Custom operators made conversion high-friction and required a high-touch approach, but the team found that 95% of use cases were easily solved by off-the-shelf models that did not require novel techniques. This allowed them to optimize for the majority of use cases and spend time on the remaining 5% of models that were not as widely used.
Arpeet also noted that frameworks such as ONNX and NVIDIA's Triton Inference Server have made significant strides in standardizing model formats and benchmarking, making them valuable tools for large real-time inference use cases.
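For example, converting a model to a uniform format might look like the following hedged sketch: a toy PyTorch classifier exported to ONNX and then served with onnxruntime. The model, file name, and input shape are assumptions for illustration, not Salesforce's production pipeline.

```python
# Hypothetical sketch: export a toy PyTorch model to ONNX and serve it with
# onnxruntime. Sticking to widely supported operators keeps the exported graph
# loadable by any ONNX-compatible runtime.
import torch
import onnxruntime as ort

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
).eval()

dummy_input = torch.randn(1, 128)
torch.onnx.export(
    model,
    dummy_input,
    "classifier.onnx",
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}},  # allow variable batch sizes
)

session = ort.InferenceSession("classifier.onnx")
logits = session.run(["logits"], {"features": dummy_input.numpy()})[0]
print(logits.shape)
```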
Machine learning infrastructure platforms and software deployment platforms have a lot in common, according to the discussion between Anuraag and Arpeet. Here are the key takeaways:
Overall, there is no significant difference between machine learning infrastructure platforms and software deployment platforms, except for the nature of the workload and the tooling required for orchestration.
I think I would say that focusing on a niche at this point, in terms of operationalizing a large-scale AI workflow, is probably going to be the next set of difficult challenges. - Arpeet
Keep watching the TrueML YouTube series and reading the TrueML blog series.
TrueFoundry is an ML Deployment PaaS over Kubernetes that speeds up developer workflows while giving them full flexibility to test and deploy models, and ensures full security and control for the Infra team. Through our platform, we enable Machine Learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds - allowing them to save cost and release models to production faster, enabling real business value realisation.