We received a very encouraging response to our first episode of True ML Talks. In this series, we dive deep into the ML pipeline of a few leading ML companies, and in today's episode, we are speaking with Stefan Krawczyk.
Stefan is building DAGWorks, a collaborative open-source platform for data science teams to build and maintain model pipelines, plugging into existing MLOps and data infrastructure (read their YC Launch). He has over 15 years of experience at companies such as Nextdoor, LinkedIn, and Stitch Fix in the field of data and machine learning. He previously led the Model Lifecycle team at Stitch Fix, where he gained extensive experience building self-service tooling for an internal machine learning platform. He's also a regular conference speaker and author of the popular open-source framework, Hamilton.
Our conversation with Stefan revolves around four key themes:

1. Machine learning use cases for the business.
2. How Stitch Fix's teams are structured to optimize business outcomes.
3. Challenges faced in building out the ML stack, including industry-specific challenges.
4. An overview of cutting-edge innovations applied while building and scaling ML infrastructure.
Stitch Fix has two teams working on their machine learning (ML) systems: the Data Science team and the Platform team.
The Data Science team owns the models and the results, while the Platform team ensures the availability of the deployment infrastructure and its components. There is no "handoff" of the model between the two teams, which allows many more iterations on the model.
Infrastructure allocation was handled by the Platform team, which owned the quotas for what was available. Data scientists could request nodes or Spark clusters through a UI, and some accounting ensured that costs did not spiral without justification. The Platform team tried to make it easy for people to get what they wanted without much friction, while the teams that owned the quotas kept costs under control.
"One reason why I stayed so long at Stitch Fix was precisely because of those challenges and figuring out how to solve them." - Stefan
For Docker component swapping, the Platform team tried to create a "golden API" where data scientists could describe what they wanted to happen without worrying about platform dependencies. This was done through configuration-driven model pipelines: data scientists provided text describing subsets of changes, and the Platform team could then change things, such as logging or metadata handling, without requiring data scientists to update their Docker containers, redeploy, or rewrite their pipelines. This removed the migration burden from data science teams and let the Platform team manage and upgrade components far more efficiently.
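To make the idea concrete, here is a minimal sketch of a configuration-driven pipeline. All names and the config structure are hypothetical, not Stitch Fix's actual API; the point is that data scientists declare *what* should happen, while the platform team owns *how* each step is implemented and can swap implementations behind stable names.

```python
# Registry owned by the platform team: stable step names -> implementations.
# Swapping an implementation here requires no user-side redeploy or
# Docker rebuild, because users only reference the stable name.
STEP_REGISTRY = {
    "load_csv": lambda params: f"loaded {params['path']}",
    "train_model": lambda params: f"trained {params['model']}",
}

# A data scientist's pipeline is declarative text (e.g. YAML/JSON);
# shown inline as a dict for brevity.
pipeline_config = {
    "steps": [
        {"use": "load_csv", "params": {"path": "s3://bucket/data.csv"}},
        {"use": "train_model", "params": {"model": "logistic_regression"}},
    ]
}

def run_pipeline(config: dict) -> list:
    """Resolve each declared step against the platform registry and run it."""
    return [STEP_REGISTRY[step["use"]](step["params"]) for step in config["steps"]]

print(run_pipeline(pipeline_config))
# ['loaded s3://bucket/data.csv', 'trained logistic_regression']
```

The design choice is the indirection: because the pipeline text only names steps, the platform team can upgrade what `train_model` does underneath without any migration work on the data science side.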
It is important to pick tools based on business impact and SLAs. Tasks that fit on a single node often do not need heavyweight infrastructure and can be handled by a simple orchestration system. For data versioning, saving artifacts to S3 under a structured path convention and storing metadata alongside them may be enough. For a model registry, open-source tools like MLflow can help, and there are also hosted solutions like TrueFoundry.
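A lightweight sketch of the "structured S3 path plus metadata" approach might look like the following. The path layout and metadata fields are illustrative assumptions, not a documented Stitch Fix convention, and the actual upload (e.g. via `boto3`'s `put_object`) is left out.

```python
import json
from datetime import datetime, timezone

def model_key(name: str, version: str, filename: str = "model.pkl") -> str:
    """Deterministic S3 key: every artifact lives at a predictable path."""
    return f"models/{name}/{version}/{filename}"

def metadata_blob(name: str, version: str, metrics: dict) -> str:
    """Metadata stored next to the artifact (e.g. at .../metadata.json)."""
    return json.dumps({
        "model": name,
        "version": version,
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
    })

key = model_key("churn", "2024-01-15")
print(key)  # models/churn/2024-01-15/model.pkl
```

Because the path is a pure function of name and version, anyone can locate an artifact (and its metadata) without a separate registry service, which is often sufficient until SLAs demand more.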
It is important to have an A/B testing system in the stack to understand the value a model brings to the business. This helps teams decide where to invest in MLOps practices based on the impact their models have.
In machine learning, Extract, Transform, Load (ETL) is a critical process for transforming raw data into valuable insights. However, ETL in machine learning systems poses several challenges that need to be addressed.
ETLs in machine learning systems can fail silently due to upstream changes, making it hard for data practitioners to trace inputs to the model and difficult to maintain the ETLs. The complexity of ETLs also grows over time, making it hard for teams to keep up.
To address these challenges, DAGWorks is building an open-source ETL platform for data science teams that reduces their reliance on engineering; its open-source core is Hamilton, mentioned above. Hamilton enables data practitioners without software engineering backgrounds to write code that goes into production and to manage machine learning ETL artifacts on their existing infrastructure. Hamilton also provides a central feature definition store and lineage, which makes debugging easier and can be used for compliance cases.
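Hamilton's central idea is that each function's *parameter names* declare its dependencies, so a set of plain Python functions implicitly defines a DAG with lineage built in. The toy resolver below illustrates that pattern in stdlib Python only; real Hamilton usage goes through `from hamilton import driver`, and the function names here are made up:

```python
import inspect

# Hamilton-style transform functions: `spend_per_signup` depends on
# `spend` and `signups` purely by naming them as parameters.
def spend_per_signup(spend: float, signups: float) -> float:
    return spend / signups

def spend_doubled(spend: float) -> float:
    return spend * 2

FUNCS = {f.__name__: f for f in (spend_per_signup, spend_doubled)}

def execute(target: str, inputs: dict) -> float:
    """Resolve a node: use a provided input if present, otherwise call
    the matching function, recursively resolving its parameters first."""
    if target in inputs:
        return inputs[target]
    fn = FUNCS[target]
    deps = inspect.signature(fn).parameters
    return fn(**{name: execute(name, inputs) for name in deps})

print(execute("spend_per_signup", {"spend": 100.0, "signups": 25.0}))  # 4.0
```

Because dependencies are declared in the function signature itself, the framework can answer "what feeds this output?" mechanically, which is what makes lineage and debugging tractable for non-engineers.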
Hamilton is designed as an abstraction layer that can be used with various orchestration tools such as Airflow or Argo. Stefan believes data scientists should not have to care about the topology of where things run and should instead focus on building and iterating on models. DAGWorks is also exploring abstractions that make it easier to swap data quality providers without rewriting pipelines.
While Hamilton complements tools like Metaflow, it is not trying to replace them. Instead, it enables people to be more productive on top of those systems by allowing them to model the micro within a task. Overall, DAGWorks is trying to make it easier for data science teams to manage and maintain machine learning ETL artifacts.
Below are some interesting reads from Stefan & his team:
What I Learned Building Platforms at Stitch Fix
Deployment for Free - A Machine Learning Platform for Stitch Fix's Data Scientists
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Keep watching the TrueML YouTube series and reading the TrueML blog series.
TrueFoundry is an ML deployment PaaS on Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the infra team. Through our platform, machine learning teams can deploy and monitor models in 15 minutes with reliability, scalability, and the ability to roll back in seconds, allowing them to save cost, release models to production faster, and realize real business value.