We are back with another episode of True ML Talks. In this episode, we dive deep into the ML platform at Edge, and we are speaking with Rahul Kulhari.
Introducing Rahul Kulhari, the co-founder and head of data science at Edge. With a strong background in AI and machine learning, Rahul is responsible for executing the company's vision and building its AI strategy. He leads a team of experts who develop cutting-edge AI systems that power Edge's talent acquisition, talent mobility, and internal talent marketplace products. His expertise and experience make him a valuable asset to the industry and an excellent resource for anyone interested in the latest developments in data science and AI.
📌
Our conversation with Rahul covers the following aspects:
- ML use cases in Edge
- Machine Learning Team at Edge
- Innovation in Machine Learning Stack
- Quantization vs. Distillation
- Challenges in Operationalizing Machine Learning
- Choosing MLOps Tools
The team at Edge is structured into five verticals, each responsible for a particular aspect of the AI product development lifecycle. These five verticals are as follows:
The role of the AI product manager: The AI product manager bridges the gap between the business and the data science and ML engineering teams by connecting with product and customer success teams to understand business objectives. They organize discussions involving data scientists, research scientists, and the ML engineering team to identify each team's necessary contributions, and they communicate the needs and guidelines for those contributions so that everyone is aligned. They remain involved throughout the project, ensuring that the business objectives are met and that everyone is working towards the same goal.
The ML team at Edge recognizes that lack of data is a significant challenge in the machine learning workflow. To address this, they have introduced various tools, processes, and algorithms for data augmentation. For example, they have built student-teacher training capabilities, in which models are first trained on large amounts of noisy data generated with these augmentation tools and algorithms, and then fine-tuned on the labeled data they have.
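Edge's actual pipeline is not public, but a minimal sketch of the student-teacher (pseudo-labeling) idea, assuming scikit-learn ≥ 1.1 and hypothetical synthetic data, looks like this:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Hypothetical data: a small labeled set and a large unlabeled pool.
X_labeled = rng.normal(size=(200, 16))
y_labeled = rng.integers(0, 2, size=200)
X_unlabeled = rng.normal(size=(5000, 16))

# 1. Train a teacher on the small labeled set.
teacher = SGDClassifier(loss="log_loss").fit(X_labeled, y_labeled)

# 2. Pseudo-label the unlabeled pool; these labels are the "noisy" data.
pseudo_labels = teacher.predict(X_unlabeled)

# 3. Train a student on the large, noisy pseudo-labeled set.
student = SGDClassifier(loss="log_loss")
student.partial_fit(X_unlabeled, pseudo_labels, classes=np.array([0, 1]))

# 4. Fine-tune the same student on the clean labeled data.
student.partial_fit(X_labeled, y_labeled)
```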
One critical tool that they use alongside data augmentation is Evidently AI, which helps them identify data and target drift, ensuring that the noisy data they create aligns with the labeled (goal) data. This lets them verify that the distributions of their categorical and continuous features stay in line, which in turn helps them build accurate models.
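As a rough illustration, comparing an augmented dataset against the labeled reference with Evidently's report presets might look like the following (file names are placeholders, and the sketch assumes the evidently 0.4-style `Report` API):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Hypothetical files: the labeled "goal" data and the augmented/noisy data.
reference = pd.read_csv("labeled_reference.csv")
current = pd.read_csv("augmented_noisy.csv")

# Check whether feature distributions and the target in the augmented
# data stay in line with the labeled data.
report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")
```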
The team has also innovated in the machine learning pipeline. While the ecosystem has matured over time, when they were building their pipeline they found that no single tool or product could solve all the end-to-end tasks, and integrating tools with each other was a challenge. They have used different tools such as Neptune, Comet, and MLflow for model registry and management.
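For instance, registering a trained model with MLflow (one of the registries mentioned) takes only a few lines; the model and registry name below are placeholders, not Edge's actual setup:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder model trained on synthetic data.
X, y = make_classification(n_samples=500, n_features=16, random_state=0)
model = LogisticRegression().fit(X, y)

# Log the run and register the model under a (hypothetical) name,
# so versions can be tracked and promoted later.
with mlflow.start_run():
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="resume-parser-baseline",  # placeholder name
    )
```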
From the deployment perspective, they have focused on scalability, latency, and cost. They use tools such as TF Serving and ONNX (with quantization) for deployment on Kubernetes pods. They have multiple tools throughout their machine learning pipeline, which they consider an innovation in itself. They have been able to manage costs while building state-of-the-art systems, so they have not felt the need to move to newer tools that may be more expensive. However, they encourage the team to keep an eye on new technologies and tools that may be useful in the future.
Optimizing model latency is a crucial challenge in machine learning, and techniques such as quantization, model pruning, and distillation have been explored to address it. In the Edge team's experience, quantization works better than distillation for reducing model latency.
The team experimented with different models such as DistilBERT, RoBERTa, and ALBERT, and ultimately chose ALBERT due to its better performance in job and resume interpretation. They also conducted distillation on both ALBERT and RoBERTa.
From their experiments, the team found that quantization provided remarkable results, reducing model latency from approximately 1.2 seconds to around 200 milliseconds on CPUs. They used ONNX and Hugging Face quantization tooling for their models, which had been trained on GPUs.
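Edge's exact setup is not public, but dynamic INT8 quantization of an exported ONNX model with ONNX Runtime looks roughly like this (file names are hypothetical):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Convert FP32 weights to INT8 for faster CPU inference.
# Input/output paths are placeholders for an exported transformer model.
quantize_dynamic(
    model_input="albert_model.onnx",
    model_output="albert_model.int8.onnx",
    weight_type=QuantType.QInt8,
)
```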
When selecting the right model, the team considered various factors such as latency, model size, concurrency, CPU utilization, and memory utilization. The data scientists provided the framework for the quantization process, while the machine learning engineering team conducted the experiments and selected the best option based on the results.
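Selection criteria like these can be checked with a simple benchmark harness; a minimal sketch follows, where the `predict` callable stands in for any candidate model (not Edge's actual tooling):

```python
import statistics
import time

def benchmark(predict, payload, n_runs=100):
    """Measure p50/p95 latency (in ms) of a model's predict callable."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(payload)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * len(latencies)) - 1],
    }

# Usage with a stand-in "model":
print(benchmark(lambda x: sum(x), list(range(10_000))))
```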
Although quantization had a roughly 1% impact on precision, it did not affect recall. The team emphasizes that everyone should try quantization, as it is a simple yet effective technique for reducing model latency.
In short: before quantization, the model took approximately 1,200 milliseconds per request; after quantization, that dropped to approximately 200 milliseconds.
When it comes to MLOps, infrastructure is a critical component: a reliable foundation is necessary to support the processing power required for machine learning training and deployment. A GPU provider like E2E Networks can be an affordable source of GPUs in India.
For model training and building, tools such as Neptune, Comet ML, or TrueFoundry integrated with Git can ensure reproducibility and regulatory compliance. Hugging Face, TensorFlow, and PyTorch are also recommended for building models, and CatBoost is a good option for regression problems or other tasks suited to gradient-boosted decision trees.
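As a quick illustration of the CatBoost recommendation, here is a minimal regression fit on synthetic placeholder data:

```python
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data for a regression problem.
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted decision trees; hyperparameters here are illustrative.
model = CatBoostRegressor(iterations=200, learning_rate=0.1, verbose=False)
model.fit(X_train, y_train, eval_set=(X_test, y_test))
print("R^2:", model.score(X_test, y_test))
```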
When it comes to deployment, ONNX is a recommended tool, or a serverless approach can be taken using Max.io, Banana.dev, or Infrrd. During development, data quality can be ensured through custom or third-party tools: Great Expectations for validation, Streamlit for visualization, and Alibi Detect or Evidently AI for data drift analysis. In production, however, additional tools may be required for data quality, lineage, and other kinds of analysis.
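A minimal drift check with Alibi Detect (one of the tools named above), using hypothetical feature arrays, might look like this:

```python
import numpy as np
from alibi_detect.cd import KSDrift

rng = np.random.default_rng(0)
x_ref = rng.normal(size=(1000, 16))           # reference (training) features
x_live = rng.normal(loc=0.5, size=(200, 16))  # incoming data, shifted on purpose

# Kolmogorov-Smirnov test per feature, with multiple-testing correction.
detector = KSDrift(x_ref, p_val=0.05)
result = detector.predict(x_live)
print("Drift detected:", bool(result["data"]["is_drift"]))
```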
Keep watching the TrueML YouTube series and reading the TrueML blog series.
TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows while allowing full flexibility in testing and deploying models, and ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost and release models to production faster, enabling real business value realisation.