We are back with another episode of True ML Talks. In this episode, we dive deep into FBLearner Flow, Facebook's AI backbone, and we are speaking with Aditya Kalro.
Aditya is currently a Senior Engineering Manager at Google on the Identity team. Prior to this, he was at Facebook, where he led the build-out of FBLearner Flow, Facebook's ML workflow management platform. We'll talk about it in more detail in today's call.
Our conversation with Aditya covers the following aspects:
- Overview of FBLearner Flow
- A/B testing and shadow testing in large-scale systems
- Bridging the gap between research and production
- Optimizing cost and latency in AI inference
- Architecture of FBLearner Flow
- Bridging the gap between software and ML deployment platforms
- Importance of monitoring and distributed training
- Core principles for building an ML system for scale
FBLearner Flow is a machine learning workflow management platform built by Facebook to manage its ML infrastructure. Aditya led the development of the platform and oversaw its growth to support thousands of training runs per day across 700-800 teams.
Below are some unique and relevant aspects of FBLearner Flow's evolution:
Machine learning models need to be evaluated to determine how well they generalize to unseen data. This is where the evaluation system comes in. It has two parts: batch (offline) evaluation and online evaluation.
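To make the distinction concrete, here is a minimal sketch of batch evaluation, assuming a simple scikit-learn model and synthetic data (stand-ins, not FBLearner components): the model is scored over a held-out set in one pass, whereas online evaluation would track the same metrics on live predictions instead.

```python
# Minimal sketch of batch (offline) evaluation: score a trained model over a
# held-out dataset in one pass and report aggregate metrics. The model and
# data here are stand-ins, not FBLearner components.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 8))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
X_holdout = rng.normal(size=(200, 8))            # unseen data
y_holdout = (X_holdout[:, 0] + 0.5 * X_holdout[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Batch evaluation runs before any live traffic sees the model;
# online evaluation would track the same metrics on live predictions.
proba = model.predict_proba(X_holdout)[:, 1]
print("accuracy:", accuracy_score(y_holdout, proba > 0.5))
print("auc:", roc_auc_score(y_holdout, proba))
```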
Deploying machine learning models in real-world scenarios can be challenging, especially when dealing with large-scale systems like Facebook, where even a small error can have significant consequences. In this regard, building an effective A/B testing framework is crucial to ensure the optimal performance of machine learning models.
Two essential components of the FBLearner platform that helped achieve this goal were A/B testing and shadow testing. A/B testing allowed a comparison of two versions of the system or model to determine which one performs better, while shadow testing allowed a new model to be deployed in parallel with the existing one to evaluate its performance without affecting the user experience. Doing so helped mitigate the risk of deploying a faulty model in production.
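As an illustration of the shadow-testing idea, here is a minimal sketch assuming two model objects with a `predict()` method (hypothetical names, not the FBLearner API): only the production model's output reaches the user, while the shadow model's output is logged off the critical path for later comparison.

```python
# Minimal sketch of shadow testing, assuming two model objects with a
# .predict() method. Only the production model's output is returned to the
# user; the shadow model's output is logged for offline comparison.
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("shadow")
_executor = ThreadPoolExecutor(max_workers=4)

def _log_shadow(request_id, features, shadow_model, prod_prediction):
    try:
        shadow_prediction = shadow_model.predict(features)
        logger.info(
            "request=%s prod=%s shadow=%s",
            request_id, prod_prediction, shadow_prediction,
        )
    except Exception:
        # A shadow failure must never affect the user-facing path.
        logger.exception("shadow model failed for request=%s", request_id)

def serve(request_id, features, prod_model, shadow_model):
    # User-facing response comes only from the production model.
    prod_prediction = prod_model.predict(features)
    # Shadow call happens off the critical path.
    _executor.submit(_log_shadow, request_id, features, shadow_model, prod_prediction)
    return prod_prediction
```

Running the shadow call asynchronously keeps any extra latency or failure in the new model away from the user-facing path.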
Another unique feature of the FBLearner platform was its ability to facilitate the exchange of models between ML practitioners and developers. It enabled developers to easily deploy the models to production and test them using the existing quick experiment infrastructure. This allowed them to compare the performance of their existing system with the newly deployed ML model quickly, ensuring optimal performance of the system.
Facebook's AI research team faced a major challenge in bridging the gap between the needs of researchers and the production team. While researchers needed a system that was fast and allowed them to deploy new models quickly, the production team required stability, reliability, and predictability.
To address this challenge, Facebook built a Slurm-like interface on top of its machine learning platform.
Slurm is an open-source job scheduler whose command-line interface is used extensively in academia for managing experiments. By creating a similar command-line interface for the platform, Facebook made it easy for researchers to adopt it.
Despite the fundamental differences in the two teams' requirements, having a common interface made it easier for researchers to migrate their models to FBLearner Flow for production. The system gave them access to a large fleet of machines, unlike Slurm, which was designed to run on a small set of machines.
The Slurm-like interface on the platform allowed researchers to experiment with different models quickly and migrate them to the production environment when they were satisfied with the results.
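Here is a rough sketch of what a Slurm-like front end over a workflow platform could look like; `submit_workflow` and `get_status` are hypothetical placeholders for the platform's own API, and only the shape of the subcommands loosely mirrors `sbatch` and `squeue`.

```python
# Rough sketch of a Slurm-like command line over a workflow platform.
# submit_workflow and get_status are hypothetical stand-ins for whatever
# API the platform exposes; only the CLI shape mirrors sbatch/squeue.
import argparse

def submit_workflow(entry_point: str, gpus: int) -> str:
    # Placeholder: would package the code and enqueue it on the cluster.
    print(f"submitting {entry_point} with {gpus} GPU(s)")
    return "job-0001"

def get_status(job_id: str) -> str:
    # Placeholder: would query the scheduler for the job's state.
    return f"{job_id}: RUNNING"

def main() -> None:
    parser = argparse.ArgumentParser(prog="mlctl")
    sub = parser.add_subparsers(dest="command", required=True)

    submit = sub.add_parser("submit", help="submit a training workflow")
    submit.add_argument("entry_point")
    submit.add_argument("--gpus", type=int, default=1)

    status = sub.add_parser("status", help="check on a submitted job")
    status.add_argument("job_id")

    args = parser.parse_args()
    if args.command == "submit":
        print(get_status(submit_workflow(args.entry_point, args.gpus)))
    else:
        print(get_status(args.job_id))

if __name__ == "__main__":
    main()
```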
In the field of AI, achieving cost optimization without compromising on latency is a perpetual challenge. However, with the advent of new technologies and architectural designs, solutions have been developed to address this issue.
Containerization and microservices have proven effective in optimizing costs and reducing latency in AI inference.
Containerization is a method of packaging software code along with its dependencies into a single unit, known as a container. A container can be moved easily from one computing environment to another, making it highly scalable and flexible. By using containerization, organizations can package their AI models and dependencies into containers that can be deployed and scaled quickly and efficiently.
Containerization also enables bin packing, which optimizes resource allocation by placing multiple containers on a single physical machine. This allows organizations to make the most of their available resources and reduce costs. In addition, with auto-scaling, organizations can quickly scale up or down based on demand, further optimizing costs.
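As a toy illustration of the bin-packing idea, the sketch below uses first-fit decreasing to place containers, each declaring only a memory requirement, onto as few machines as possible; real schedulers also weigh CPU, GPU, and locality constraints.

```python
# Toy illustration of bin packing with first-fit decreasing: place containers
# (by declared memory need, in GB) onto as few machines as possible.
# Real schedulers also account for CPU, GPU, and locality constraints.
MACHINE_CAPACITY_GB = 64

def pack(container_sizes_gb):
    machines = []  # each machine is a list of container sizes
    for size in sorted(container_sizes_gb, reverse=True):
        for machine in machines:
            if sum(machine) + size <= MACHINE_CAPACITY_GB:
                machine.append(size)
                break
        else:
            machines.append([size])  # no machine had room: start a new one
    return machines

if __name__ == "__main__":
    models = [30, 10, 24, 8, 16, 40, 6]
    for i, machine in enumerate(pack(models)):
        print(f"machine {i}: {machine} ({sum(machine)}/{MACHINE_CAPACITY_GB} GB)")
```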
Furthermore, microservices, which are small, independent components of an application, can be used to create an efficient and agile inference platform. By breaking down complex applications into smaller, modular services, each service can be independently scaled, managed, and updated. This not only makes the platform more resilient but also helps reduce latency.
Building a robust and scalable AI system requires a solid infrastructure. In this regard, Facebook's FBLearner Flow has been at the forefront of AI innovation, providing a unique solution for training and deploying AI models at scale.
The architecture of FBLearner Flow was largely built in-house, leveraging Facebook's existing infrastructure. The team started with Kronos, an internal scheduler, but had to move towards containerization to address thundering-herd and noisy-neighbor problems. The system was built around the idea of operators and workflows: Hive tables were initially used as channels for structured data, with file clusters and blob storage later added for unstructured data such as images.
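The sketch below illustrates the operator/workflow shape in plain Python; it is not the actual FBLearner Flow API, just a hypothetical rendering of the idea that operators are reusable steps and a workflow wires their inputs and outputs together.

```python
# Illustrative sketch of the operator/workflow idea (not the actual
# FBLearner Flow API). Operators are reusable, typed steps; a workflow wires
# their outputs into later operators' inputs.
from typing import Callable, Dict, List

OPERATORS: Dict[str, Callable] = {}

def operator(fn: Callable) -> Callable:
    """Register a function as a reusable pipeline step."""
    OPERATORS[fn.__name__] = fn
    return fn

@operator
def read_table(table: str) -> List[dict]:
    # Stand-in for reading a structured channel such as a Hive table.
    return [{"x": float(i), "y": float(i % 2)} for i in range(100)]

@operator
def train_model(rows: List[dict]) -> dict:
    # Stand-in training step: the "model" is just a learned threshold here.
    mean_x = sum(r["x"] for r in rows) / len(rows)
    return {"threshold": mean_x}

@operator
def evaluate(model: dict, rows: List[dict]) -> float:
    correct = sum((r["x"] > model["threshold"]) == bool(r["y"]) for r in rows)
    return correct / len(rows)

def example_workflow(table: str) -> float:
    # The workflow composes operators; the output of one feeds the next.
    rows = read_table(table)
    model = train_model(rows)
    return evaluate(model, rows)

if __name__ == "__main__":
    print("accuracy:", example_workflow("hive.training_data"))
```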
The system's execution mechanism was self-contained and versioned, allowing for comparisons between different versions of the model. Experiment management was made easier by the system's ability to change features, model versions, or training paradigms while keeping the output metrics and evaluation sets the same.
If FBLearner Flow were to be rebuilt today, Kubernetes and Kubeflow would be the preferred solutions. Kubeflow provides a more self-contained paradigm, making it easier to deploy, and it can use connectors to integrate with different pieces of infrastructure.
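For comparison, here is a minimal Kubeflow Pipelines (kfp v2) sketch of the same workflow shape; the component bodies are trivial placeholders, and in practice each component runs in its own container image.

```python
# Minimal Kubeflow Pipelines (kfp v2) sketch of the same operator/workflow
# shape. Component bodies are trivial placeholders; in practice each one
# runs in its own container image.
from kfp import compiler, dsl

@dsl.component
def train(learning_rate: float) -> float:
    # Placeholder "training" that just returns a fake metric.
    return 1.0 - learning_rate

@dsl.component
def evaluate(metric: float) -> str:
    return "ship" if metric > 0.5 else "hold"

@dsl.pipeline(name="example-training-pipeline")
def pipeline(learning_rate: float = 0.1):
    train_task = train(learning_rate=learning_rate)
    evaluate(metric=train_task.output)

if __name__ == "__main__":
    compiler.Compiler().compile(pipeline, "pipeline.yaml")
```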
The inference platform was built on top of Tupperware, Facebook's services infrastructure, with each model being its own container. The auto-scaling feature was borrowed from Kubernetes to ensure the platform was able to scale up and down as required.
Overall, the architecture of FBLearner Flow provides a unique solution for building AI systems at scale. It's a testament to the importance of infrastructure in building robust and scalable AI systems.
As the field of machine learning continues to grow and evolve, there is a growing need to develop platforms and processes for deploying machine learning models in production environments. However, many companies view machine learning platforms as separate from software engineering platforms, which can lead to confusion and inefficiencies.
While there are some differences in the tools used, the processes for developing and deploying ML models can be very similar to those used in software engineering.
One of the key aspects is the importance of being opinionated about how ML models are developed and deployed. By introducing concepts like testing and required steps, such as the train-test split, companies can streamline the ML development process and ensure that models are deployed in a consistent and effective manner.
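As a hypothetical sketch of what being opinionated can look like in code, the example below shows a deployment gate that refuses to register a model unless the required steps are recorded; the field names and threshold are illustrative, not any real platform's schema.

```python
# Hypothetical sketch of an "opinionated" deployment gate: a model cannot be
# registered unless the required steps were recorded. The field names and
# threshold are illustrative, not a real platform's schema.
from dataclasses import dataclass

@dataclass
class TrainingRecord:
    model_uri: str
    used_train_test_split: bool   # required step: held-out evaluation data
    tests_passed: bool            # required step: model-level tests
    holdout_auc: float            # required step: offline evaluation metric

MIN_HOLDOUT_AUC = 0.7

def register_for_deployment(record: TrainingRecord) -> None:
    if not record.used_train_test_split:
        raise ValueError("refusing to deploy: no train/test split recorded")
    if not record.tests_passed:
        raise ValueError("refusing to deploy: model tests did not pass")
    if record.holdout_auc < MIN_HOLDOUT_AUC:
        raise ValueError(f"refusing to deploy: holdout AUC {record.holdout_auc} "
                         f"is below {MIN_HOLDOUT_AUC}")
    print(f"registered {record.model_uri} for deployment")

if __name__ == "__main__":
    register_for_deployment(TrainingRecord(
        model_uri="models://ranker/v42",
        used_train_test_split=True,
        tests_passed=True,
        holdout_auc=0.81,
    ))
```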
Another similarity is the need for monitoring and telemetry in both software engineering and ML deployment platforms. Just as developers monitor the performance of their applications and microservices, ML developers and MLOps engineers need to monitor the performance of their models and infrastructure.
By recognizing the commonalities between the two, and developing processes and tools that are consistent across both, companies can streamline their development and deployment processes, reduce errors and inefficiencies, and ensure that their ML models are deployed in a consistent and effective manner.
Monitoring is a critical aspect of any ML deployment process. It helps ensure that models are working as expected and delivering the desired results. Monitoring both the infrastructure and the output of the predictions is crucial to confirm that the system is working correctly. The metrics generation component of the FBLearner system auto-generated metrics for monitoring the performance of the models.
There is a need for customizable metrics that allow developers to monitor their models' performance effectively. With monitoring in place, developers can quickly identify and fix any issues that may arise during the deployment process, leading to better performance and more accurate predictions.
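One common way to get such customizable metrics is sketched below using the prometheus_client library; this is an assumption for illustration (Facebook relied on its own internal telemetry), and it simply counts predictions, tracks latency, and exposes both for scraping.

```python
# Sketch of customizable model monitoring using the prometheus_client
# library (an assumption for illustration; Facebook used internal telemetry).
# It counts predictions, tracks latency, and exposes metrics for scraping.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter(
    "model_predictions_total", "Predictions served", ["model_version"]
)
LATENCY = Histogram(
    "model_prediction_latency_seconds", "Prediction latency", ["model_version"]
)

def predict(features, model_version="v42"):
    start = time.perf_counter()
    score = random.random()           # stand-in for real inference
    LATENCY.labels(model_version).observe(time.perf_counter() - start)
    PREDICTIONS.labels(model_version).inc()
    return score

if __name__ == "__main__":
    start_http_server(8000)           # metrics scraped from :8000/metrics
    while True:
        predict({"feature": 1.0})
        time.sleep(0.1)
```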
In conclusion, monitoring is an essential part of ML deployment, and it is crucial to have a system in place that lets developers monitor their models effectively, through customizable metrics and close tracking of both the infrastructure and the output of predictions.
Distributed training and distributed inference were major paradigm shifts for the FBLearner Flow system. Initially, the system was designed to work on a single machine, and training was meant to occur on that same machine. However, as more data was added to the system, it became necessary to re-architect it to support distributed training.
The biggest challenge in this regard was around structured data. The team had to employ both model-parallel and data-parallel training for structured and unstructured data, respectively. They also had to add special rules for parameter servers, which were treated differently from the training workers themselves: the parameter servers would gather the outputs from each of the training workers and put them back together. The team experimented with several different paradigms, eventually settling on a parameter-server paradigm that allowed for checkpointing.
Reliability became a major concern with distributed training, because if even one machine failed, the entire workflow would fail. The team had to build APIs to enable checkpointing, as it wasn't completely automated. They also had to write a restart mechanism to ensure that the workflow could be resumed from the appropriate checkpoint in case of a failure.
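A minimal sketch of the checkpoint-and-restart idea, using PyTorch's `torch.save` and `torch.load`, is shown below; the resume logic is illustrative, and a real distributed job would also coordinate checkpoints across workers and parameter servers.

```python
# Minimal sketch of checkpoint-and-restart using torch.save/torch.load.
# The resume logic is illustrative; a real distributed job would also
# coordinate checkpointing across workers and parameter servers.
import os
import torch
from torch import nn, optim

CKPT_PATH = "checkpoint.pt"

def train(total_steps=1000, ckpt_every=100):
    model = nn.Linear(10, 1)
    opt = optim.SGD(model.parameters(), lr=0.01)
    start_step = 0

    # Resume from the last checkpoint if one exists (e.g. after a machine failure).
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH)
        model.load_state_dict(ckpt["model"])
        opt.load_state_dict(ckpt["opt"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, total_steps):
        x = torch.randn(32, 10)
        loss = model(x).pow(2).mean()   # stand-in objective
        opt.zero_grad()
        loss.backward()
        opt.step()

        if step % ckpt_every == 0:
            torch.save(
                {"model": model.state_dict(), "opt": opt.state_dict(), "step": step},
                CKPT_PATH,
            )

if __name__ == "__main__":
    train()
```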
While distributed training made it possible to train more complex models, it also made things harder for the team, as training now took 5-6 times longer than in the past. Nevertheless, the team was able to adapt to these changes and ensure that the system continued to work reliably.
Keep watching the TrueML YouTube series and reading the TrueML blog series.
TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows while giving them full flexibility in testing and deploying models and ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost and release models to production faster, enabling real business value realisation.
Topics: FBLearner Flow, ML Platform at Facebook, ML for large data sets