We are back with another episode of True ML Talks. In this episode, we dive deep into Simpl's ML platform with Sheekha.
Sheekha is the Director of Data Science at Simpl. Simpl is building India's leading first tap checkout network, offering merchants a full set of products, from BNPL (buy now, pay later) and installment payments to a range of other value-added services. They work with more than 26,000 merchants across India, including JIO platforms, which is the largest telecom network, Zomato, which is one of the biggest food delivery services in the country, and many more.
Our conversation with Sheekha covers the following aspects:
- ML use cases in Simpl
- Overview of Simpl ML Infrastructure
- Managing Costs for ML Training
- Managing Training and Inference Pipelines Separately
- Automation in Retraining ML Models
- Simpl's Foray into building in-house
- Considerations for Real-time Systems and Data Science Models
- Making ML Deployment as Simple as Software
- Ingraining Engineering Principles in Data Science
The data science team at Simpl consists of 28 data scientists and 16 data engineers. The team is a core part of Simpl alongside the other engineering teams, and the company has a separate DevOps team. The team works on ML, neural network systems, rules, graph databases, and graph machine learning models that look for communities of fraudulent users.
From a current tech stack perspective, the company has everything on the cloud, with no on-prem systems in place.
The data science team at Simpl works on a remote machine with Python notebooks and libraries built by the data engineering team to connect to databases and perform exploratory data analysis (EDA). Once the analysis is done, the team sets up a pipeline with the help of the data engineering team to deploy the model. For batch models, the team uses Airflow for scheduling.
Model monitoring is done using Simpl's dashboards to track output changes. In terms of MLOps, Simpl is currently investing in the area. For anti-fraud systems, the company has a model that uses batch systems for analyzing similar email ids and phone numbers. The team also has some tools that run in real-time for monitoring transactions based on the velocity of the transaction and the amount being transacted.
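To make the real-time velocity check concrete, here is a minimal sketch of a sliding-window rule that looks at both the number of transactions and the total amount per user. It is not Simpl's implementation: the class name `VelocityMonitor`, the window length, and both limits are hypothetical values chosen for illustration.

```python
from collections import defaultdict, deque


class VelocityMonitor:
    """Sliding-window rule on transaction count and total amount per user.

    All thresholds are illustrative assumptions, not Simpl's real limits.
    """

    def __init__(self, window_seconds=60.0, max_count=3, max_amount=5000.0):
        self.window_seconds = window_seconds
        self.max_count = max_count
        self.max_amount = max_amount
        # user_id -> deque of (timestamp, amount) for allowed transactions
        self._events = defaultdict(deque)

    def check(self, user_id, amount, ts):
        """Return True if the transaction stays within both limits."""
        events = self._events[user_id]
        # Drop events that have fallen out of the window.
        while events and ts - events[0][0] >= self.window_seconds:
            events.popleft()
        count = len(events) + 1
        total = sum(a for _, a in events) + amount
        allowed = count <= self.max_count and total <= self.max_amount
        if allowed:
            # Only allowed transactions count towards future velocity.
            events.append((ts, amount))
        return allowed
```

In this sketch, declined attempts are not recorded. A production rule might also count declined attempts, since repeated retries can themselves be a fraud signal.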
Simpl also deployed a neural network model for transaction monitoring. The model combines the current payload with historical data from the past year and feeds it into the neural net, which decides whether to allow or decline the transaction. The data engineering team built a Flink pipeline to handle peak traffic and meet a latency SLA of 70-80 milliseconds.
Feature Store:
A feature store is a centralized repository for storing and managing features, which are individual measurable properties or characteristics of data that are used to train machine learning models.
Simpl currently uses DynamoDB as a feature store for real-time availability. However, this is expensive, and there are efforts to build an internal feature store to bring down costs in the long term.
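As a rough illustration of how a real-time lookup combines the incoming payload with precomputed historical features, the sketch below uses an in-memory dictionary as a stand-in for the feature store. The class, the feature names (`txn_count_365d`, `avg_amount_365d`), and the payload fields are all hypothetical. A real deployment would read from DynamoDB or an internal store instead.

```python
class InMemoryFeatureStore:
    """Toy stand-in for a low-latency feature store keyed by user ID."""

    def __init__(self):
        self._rows = {}

    def put(self, user_id, features):
        # Store a copy so later changes by the caller do not leak in.
        self._rows[user_id] = dict(features)

    def get(self, user_id, defaults):
        # Fall back to defaults for unknown users or missing features.
        stored = self._rows.get(user_id, {})
        return {name: stored.get(name, default) for name, default in defaults.items()}


# Hypothetical historical features, precomputed by a batch job.
HISTORICAL_DEFAULTS = {"txn_count_365d": 0, "avg_amount_365d": 0.0}


def build_model_input(payload, store):
    """Merge the live transaction payload with stored historical features."""
    history = store.get(payload["user_id"], HISTORICAL_DEFAULTS)
    return {"amount": payload["amount"], **history}
```

Returning defaults for unknown users keeps the model input shape stable, which matters when a new user's first transaction has to be scored within the same latency budget.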
Managing the costs associated with implementing and scaling machine learning (ML) models is a critical challenge. It is especially important for models that require significant amounts of data and use expensive resources such as Flink pipelines and virtual machines.
The ML team deals with terabytes of data, which necessitates the use of virtual machines for training jobs. Balancing the costs against the benefits of the models is crucial.
To mitigate the costs, the team collaborates with the DevOps and data engineering teams to explore cost-effective options. They have also been working on building an internal feature store to reduce the cost of using DynamoDB. Another cost-saving measure they employ is the use of spot instances for non-critical tasks.
However, managing costs is an ongoing process that requires continuous evaluation of each model's cost-effectiveness. Factors such as the precision-recall balance and the cost of declining good users also come into play when deciding on the best cost-saving measure.
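One way to reason about the precision-recall balance against the cost of good users is to choose the decision threshold that minimizes total expected cost. The sketch below is a simplified illustration under assumed, hypothetical costs (`fraud_loss` per missed fraud and `good_user_cost` per wrongly declined good user). It is not how Simpl actually sets its thresholds.

```python
def total_cost(scores, labels, threshold, fraud_loss, good_user_cost):
    """Cost of declining every transaction whose score is >= threshold.

    labels: 1 for fraud, 0 for a good user.
    """
    pairs = list(zip(scores, labels))
    # False positives: good users declined.
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    # False negatives: fraud allowed through.
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    return fn * fraud_loss + fp * good_user_cost


def best_threshold(scores, labels, fraud_loss, good_user_cost):
    """Pick the candidate threshold with the lowest total cost."""
    # Each distinct score is a candidate; infinity means "decline nobody".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, fraud_loss, good_user_cost),
    )
```

When missed fraud is expensive relative to declining a good user, the chosen threshold moves down (higher recall). When good users are costly to lose, it moves up (higher precision).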
Interaction between the ML and the DevOps Team:
The data science team depends on the DevOps team to provision virtual machines for machine learning projects, and there is typically a minimum lead time of three days. The DevOps team receives many requests, including those from the data science team, and fulfilling them requires weighing the cost and coordinating with the data engineering team. For urgent requests, the DevOps team can expedite provisioning without considering the cost implications. The data science team accounts for the three-day lag in its project deployment plans.
Managing the training and inference pipelines separately can lead to a range of problems that can affect the overall efficiency of the system. The primary reason for this is that it can make it difficult to track the models' origins, retain the codes, and replicate the results. It can also lead to human error and mushrooming of problems, especially in startups.
On the other hand, managing these pipelines separately can provide greater flexibility and control over the system, enabling you to optimize each process independently. It can also allow you to scale the system more easily by adding new resources to the training or inference pipelines as needed.
Ideally, however, you would merge these pipelines and incorporate retraining into the same process. Doing so avoids the issues associated with managing the pipelines separately, while still letting you maintain the flexibility and control that comes with managing them independently. Overall, the decision to manage these pipelines separately or together depends on the specific needs of your organization and the resources available to you.
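Whichever way the pipelines are organized, the traceability problem described above can be reduced by recording where each model came from. The following is a minimal sketch that fingerprints the training data and parameters with a SHA-256 hash. The `ModelRecord` fields and the `build_record` helper are hypothetical, and a real setup would more likely rely on a model registry.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(obj):
    """Deterministic SHA-256 hash of JSON-serializable data."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ModelRecord:
    """Lineage metadata stored alongside a trained model artifact."""

    model_name: str
    code_version: str  # e.g. a git commit hash
    data_hash: str
    params_hash: str


def build_record(model_name, code_version, training_rows, params):
    """Capture what is needed to trace and replicate a training run."""
    return ModelRecord(
        model_name=model_name,
        code_version=code_version,
        data_hash=fingerprint(training_rows),
        params_hash=fingerprint(params),
    )
```

If the inference pipeline logs the same record for the model it serves, two records that compare equal indicate the same code, data, and parameters were used.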
Retraining ML models is a crucial part of maintaining their accuracy and relevance. However, manual retraining can be time-consuming and prone to errors. That's why automation plays a vital role in ensuring that the process is efficient, reliable, and scalable.
Automating retraining can help organizations set specific intervals for triggering retraining, ensuring that the models are updated regularly. This can also help save time and resources, as automation eliminates the need for manual intervention.
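An interval-based trigger can be as small as the sketch below, which a scheduler could call periodically to decide which models to retrain. The function name and the model names in the usage example are hypothetical. The seven-day interval is likewise only an assumption.

```python
from datetime import datetime, timedelta


def models_due_for_retraining(last_trained, now, interval):
    """Return names of models whose last training run is at least `interval` old.

    last_trained: dict mapping model name -> datetime of the last training run.
    """
    return sorted(
        name
        for name, trained_at in last_trained.items()
        if now - trained_at >= interval
    )


# Example usage with made-up model names and dates.
last_trained = {
    "fraud_nn": datetime(2024, 1, 1),
    "velocity_rules": datetime(2024, 1, 12),
}
due = models_due_for_retraining(
    last_trained, now=datetime(2024, 1, 15), interval=timedelta(days=7)
)
```

Here only `fraud_nn` is due, since it was last trained 14 days before `now` while `velocity_rules` was trained 3 days before.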
However, there can be challenges in automating retraining for complex models that require specialized hardware or software. In such cases, manual retraining may be necessary until an automated solution can be implemented.
The use of SageMaker has been a game-changer for data science teams when it comes to handling large datasets for machine learning projects. However, the platform still presents some challenges that can impact the productivity of the team.
Although SageMaker has been a useful platform for the team, there are still other options like Kubernetes that they have not tried yet. However, the decision to use SageMaker was mainly driven by the need for a faster system that could handle large amounts of data.
The company plans to build its own machine learning platform as an improved, in-house alternative to SageMaker. Initially an R&D experiment, the project now benefits from a larger team capable of in-house development. Although their existing virtual machine setup had some SageMaker-like features, it lacked distributed computing. Adding distributed computing to their current virtual machine through Py console integration will provide the required solution.
For user access control and data accessibility, the company has built various IAM roles and allocated a child account to the data team for cost management. However, more work is still required, particularly given the sensitive data they handle as a FinTech firm and the regular audits by the RBI (Reserve Bank of India).
While they could use an external platform, the company has chosen to develop their version of SageMaker in-house. Their decision is strategic and not based on constraints related to data accessibility or cost. By having greater control over the platform, they can scale and grow more efficiently. The company has already used distributed computing in some systems via DAS.
As we're scaling and the team is getting bigger, if you can do it in-house, why not? - Sheekha
Developing ML models has become easier with libraries like Scikit-learn, but the time to start a project and go live is still high, particularly for smaller companies without pipelines and MLOps systems. Setting up pipelines, cleaning data, validating tests, and deploying models can take two to three months. Moreover, finding bugs in a model is challenging as there is no standardization for the process. Therefore, companies need systems that make model development as seamless as software development to improve developer productivity. The system should allow for flexibility, easy integration, and building on top of the existing system. It should also have standardization for bug finding, monitoring data going in and out, and feedback loops.
In the realm of data science, there has been a growing emphasis on the need for data scientists to possess engineering skills to ensure the successful and efficient deployment of ML models.
You would want our data scientists to deploy everything and even filters. - Sheekha
Sheekha expressed her interest in large language models (LLMs) and the new developments surrounding them, although the team is not using them in its work at present. She noted that they are exploring interesting use cases for LLMs, particularly in their chatbot integration.
I definitely foresee a lot of interesting use cases for LLMs. - Sheekha
Keep watching the TrueML YouTube series and reading the TrueML blog series.
TrueFoundry is an ML Deployment PaaS over Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the Infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost and release models to production faster, enabling real business value realisation.