Digits is a financial management software company that uses AI to automate accounting tasks for operators. By automating tasks such as transaction classification, outlier detection, and fraud detection, Digits helps operators to double their customer base and improve their response time to customers.
Digits needed to move to deep learning and NLP models to address the challenges of accounting subjectivity. Digits also had a strong foundation in data engineering and Kubernetes, which proved essential for building and scaling a successful ML platform.
The team began by introducing TFX for ML pipeline orchestration and TF Serving for model serving. This allowed Digits to build and deploy ML models in a scalable and reliable way.
Next, the team focused on developing similarity-based pipelines. These pipelines can accurately classify transactions and identify outliers even when the data is ambiguous or incomplete, because they find similar patterns in past transactions and mimic those patterns. This approach is more effective than using global machine learning models, which can give inconsistent results because different accountants interpret the same data differently.
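The idea of finding similar transactions and mimicking how they were handled can be sketched as a small nearest-neighbor lookup. This is only an illustration under stated assumptions, not Digits' implementation: the embeddings, the cosine-similarity measure, and the `k` and `min_similarity` parameters are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_similarity(query, labeled, k=3, min_similarity=0.8):
    """Return the majority category among the k most similar past transactions.

    `labeled` is a list of (embedding, category) pairs, for example the
    history of a single customer. Returns None when no neighbor is similar
    enough, which the caller can treat as a potential outlier.
    """
    scored = sorted(
        ((cosine_similarity(query, emb), cat) for emb, cat in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbors = [(score, cat) for score, cat in scored[:k] if score >= min_similarity]
    if not neighbors:
        return None
    votes = Counter(cat for _, cat in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical, hand-made embeddings for illustration only.
history = [
    ([1.0, 0.0, 0.1], "Software"),
    ([0.9, 0.1, 0.0], "Software"),
    ([0.0, 1.0, 0.2], "Travel"),
]
known = classify_by_similarity([1.0, 0.05, 0.05], history)
unknown = classify_by_similarity([0.0, 0.0, 1.0], history)
```

Because the lookup runs against the history it is given, the same transaction can be categorized differently for two customers whose accountants made different choices, which a single global classifier cannot easily do.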
Digits' ML pipelines are now used to power a variety of features, including transaction classification, outlier detection, and fraud detection. As a result, Digits is able to provide its customers with valuable insights and help them to automate tasks, improve accuracy, and save money.
Digits' approach to ML training is well-organized and efficient. Kubernetes orchestration allows Digits to scale its training operations up or down as needed. TensorFlow Transform handles preprocessing, and the training platform in Google Cloud projects gives Digits the resources it needs to train complex models quickly. A validation set and a model registry help ensure that only high-quality models ship to production.
Digits orchestrates ML training on Kubernetes using the following steps:
Digits allocates GPU resources for ML training through a combination of manual and automated processes:
Manual Processes: Digits sets clear GPU usage limits for teams and projects to keep allocation fair and prevent overuse. It also encourages open communication among ML engineers, so that everyone is aware of who is using which resources and conflicts are avoided.
Automated Processes: Digits continuously monitors GPU usage and issues alerts when usage exceeds predefined thresholds, so that issues are identified and resolved early. A queuing system allocates GPUs fairly on a first-come, first-served basis.
Best Practices: Digits encourages ML engineers to plan GPU usage ahead of time to ensure availability and minimize conflicts. Cloud resources provide flexibility, so that enough GPUs are available even during periods of high demand. Transparency about GPU usage builds trust and cooperation among team members, which in turn improves resource management.
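The automated side of this, a first-come, first-served queue combined with a usage alert, can be sketched in a few lines. This is a toy model under assumptions, not Digits' scheduler: the `GpuScheduler` class, its method names, and the default threshold are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class GpuScheduler:
    """Toy first-come, first-served GPU queue with a utilization alert."""

    total_gpus: int
    alert_threshold: float = 0.9  # hypothetical default: alert above 90% usage
    in_use: int = 0
    waiting: deque = field(default_factory=deque)
    alerts: list = field(default_factory=list)

    def submit(self, job, gpus):
        """Queue a job and return the names of any jobs that started."""
        if gpus > self.total_gpus:
            raise ValueError("job requests more GPUs than exist")
        self.waiting.append((job, gpus))
        return self._dispatch()

    def release(self, gpus):
        """Free GPUs from a finished job and start whatever now fits."""
        self.in_use -= gpus
        return self._dispatch()

    def _dispatch(self):
        started = []
        # Strict FCFS: only the job at the head of the queue may start,
        # so a small job never jumps ahead of an earlier, larger one.
        while self.waiting and self.waiting[0][1] <= self.total_gpus - self.in_use:
            job, gpus = self.waiting.popleft()
            self.in_use += gpus
            started.append(job)
            utilization = self.in_use / self.total_gpus
            if utilization > self.alert_threshold:
                self.alerts.append(f"GPU utilization at {utilization:.0%}")
        return started


scheduler = GpuScheduler(total_gpus=4, alert_threshold=0.7)
first = scheduler.submit("job-a", 2)     # starts immediately
second = scheduler.submit("job-b", 3)    # must wait for free GPUs
third = scheduler.submit("job-c", 1)     # waits behind job-b
after_release = scheduler.release(2)     # job-a finishes
```

Strict ordering keeps the queue fair at the cost of some idle capacity: `job-c` would fit right away but waits its turn behind `job-b`.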
At Digits, TensorFlow Profiler is central to analyzing training runs and finding ways to optimize ML models:
Digits logs every training run through TensorFlow Profiler, which makes it possible to track performance trends over time.
Key metrics, including training duration, memory consumption, and accuracy, are tracked so that models and configurations can be compared meaningfully.
With TensorFlow Profiler, Digits can systematically compare training runs and select the most suitable model and configuration for a given problem.
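The comparison step can be pictured with a plain-Python stand-in. This sketch does not use the TensorFlow Profiler API. It only assumes that the three metrics named above have been collected per run. The `TrainingRun` record, the selection rule, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    name: str
    duration_s: float       # wall-clock training time in seconds
    peak_memory_gb: float   # peak memory consumption
    accuracy: float         # accuracy on the validation set


def pick_best_run(runs, max_memory_gb, accuracy_tolerance=0.005):
    """Pick the fastest run among those close to the best accuracy.

    Runs that exceed the memory budget are excluded first. Among the rest,
    any run within `accuracy_tolerance` of the top accuracy is considered
    equivalent, and the shortest training duration wins.
    """
    eligible = [r for r in runs if r.peak_memory_gb <= max_memory_gb]
    if not eligible:
        return None
    best_accuracy = max(r.accuracy for r in eligible)
    near_best = [r for r in eligible if best_accuracy - r.accuracy <= accuracy_tolerance]
    return min(near_best, key=lambda r: r.duration_s)


# Made-up numbers for illustration only.
runs = [
    TrainingRun("baseline", duration_s=3600, peak_memory_gb=12, accuracy=0.912),
    TrainingRun("wide", duration_s=5400, peak_memory_gb=20, accuracy=0.931),
    TrainingRun("mixed-precision", duration_s=2700, peak_memory_gb=10, accuracy=0.910),
]
small_budget = pick_best_run(runs, max_memory_gb=16)
large_budget = pick_best_run(runs, max_memory_gb=32)
```

Which rule to use is a judgment call. Here accuracy is treated as the primary metric and duration as the tie-breaker, but a team could just as well weight the metrics differently.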
Benefits:
When crafting validation sets for similarity-based ML pipelines, consider these key factors:
Similarity-based ML pipelines have a number of unique challenges and optimizations, compared to traditional ML pipelines.
Challenges:
Optimizations:
Digits uses TensorFlow Extended (TFX) and Vertex AI Pipelines for similarity-based ML pipelines. TFX is a Google-developed, open-source end-to-end platform for building, deploying, and managing ML pipelines. Vertex AI Pipelines is a fully managed cloud service for managing ML pipelines.
TFX provides a number of components that are useful for building similarity-based ML pipelines, including:
Vertex AI Pipelines makes it easy to run and manage TFX pipelines at scale. Vertex AI Pipelines provides a number of features that are useful for similarity-based ML pipelines, including:
Digits uses Vertex Endpoints for model registry and TF Serving for productionization.
Vertex Endpoints is a fully managed cloud service for deploying and managing machine learning models. It provides a number of features that make it a good choice for model registry, including:
TF Serving is a high-performance, production-ready TensorFlow serving system. It provides a number of features that make it a good choice for productionization, including:
Digits uses CI/CD to automate the deployment of models to Vertex Endpoints. When a model is registered in the model registry, the CI/CD system is triggered. The CI/CD system then builds a TF Serving model and deploys it to a Vertex Endpoint.
There are a number of benefits to using Vertex Endpoints and CI/CD for productionization:
Digits uses a combination of techniques to automatically detect when models need to be retrained:
Once Digits detects that a model needs to be retrained, it uses CI/CD to automate the retraining and deployment process. The CI/CD system builds a new TF Serving model using the latest training data and deploys it to a Vertex Endpoint.
Example:
The following is an example of how Digits' automatic model retraining process works:
Machine learning (ML) engineers and designers often work in silos, which can lead to problems when trying to bring ML models to production. ML engineers may develop models that are accurate but not user-friendly, while designers may create interfaces that are visually appealing but do not collect feedback on model predictions.
To address these challenges, it is important for ML engineers and designers to collaborate closely. This can be done by:
Generative AI has the potential to revolutionize many industries. Here are some of the use cases for generative AI at Digits:
There are privacy and security concerns associated with generative AI, and it is important to address these concerns in a responsible way. We as a community can find ways to develop and use generative AI in a way that is safe and beneficial to everyone.
Casting and model registry hosting landscape will change significantly in the coming years to accommodate the needs of large language models. - Hannes
Keep watching the TrueML YouTube series and reading the TrueML blog series.