In my last job, we built product recommender systems for eCommerce companies, which meant our APIs were live on every page of their websites. We signed a new customer, our first seven-figure deal, and we were so cautious that we initially onboarded them in March with rule-based recommendations. We didn't want to risk a bad user experience from our nascent machine learning models.
Later, in April, we built machine learning models and performed extensive offline testing and a lot of manual QA. Finally, we felt confident that our model would perform well, so we launched, and two things happened:
Overall, this resulted in a lot of firefighting, a significant loss of credibility, and an almost-lost customer. In our internal retrospective, we realized that while #1 was simply a manual miss, issues like #2 were almost impossible to detect offline. Since then, we have moved to the bright side of doing dark launches!
A dark launch is a deployment strategy in which you replay actual production traffic against your newly deployed service and discard its response instead of returning it to the user. The service behaves as if it were live, but it does not affect users at all. This lets you verify that the new service has no errors, performs as well as or better than the old service, and can handle production load. Once all of this is verified, switching to the new service incrementally is almost trivial. So, in a way, a dark launch is a light way to launch your services, with minimal downside and huge potential upside.
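To make the mechanics concrete, here is a minimal sketch of the idea in Python. It is not how our system was built; `DarkLaunchRouter`, `ShadowRecord`, `rule_based`, and `ml_model` are hypothetical names invented for illustration, and the two services are assumed to be plain callables. The user always receives the live response, while a copy of the request is sent to the new service in the background and its result is only recorded, never returned.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class ShadowRecord:
    """What we keep from one mirrored request (hypothetical structure)."""
    request: dict
    live_response: Any
    shadow_response: Optional[Any]
    shadow_error: Optional[str]
    shadow_latency_s: float


class DarkLaunchRouter:
    """Serves from the live service and mirrors traffic to the new one."""

    def __init__(self, live_service: Callable[[dict], Any],
                 shadow_service: Callable[[dict], Any],
                 max_workers: int = 4) -> None:
        self.live_service = live_service
        self.shadow_service = shadow_service
        self.records: List[ShadowRecord] = []
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def handle(self, request: dict) -> Any:
        # The user only ever sees the live response.
        live_response = self.live_service(request)
        # Mirror a copy of the request in the background so the shadow call
        # cannot slow down or break the user-facing path.
        self._pool.submit(self._shadow_call, dict(request), live_response)
        return live_response

    def _shadow_call(self, request: dict, live_response: Any) -> None:
        start = time.perf_counter()
        shadow_response, error = None, None
        try:
            shadow_response = self.shadow_service(request)
        except Exception as exc:  # a shadow failure must never reach the user
            error = repr(exc)
        latency = time.perf_counter() - start
        # The shadow response is recorded for later comparison, then dropped.
        self.records.append(
            ShadowRecord(request, live_response, shadow_response, error, latency)
        )

    def drain(self) -> None:
        """Wait for in-flight shadow calls to finish (e.g. before reporting)."""
        self._pool.shutdown(wait=True)


def rule_based(request: dict) -> list:
    """Stand-in for the existing live service."""
    return ["bestseller-1", "bestseller-2"]


def ml_model(request: dict) -> list:
    """Stand-in for the new service; fails on input the old one tolerates."""
    if request.get("user_id") is None:
        raise ValueError("missing user_id")
    return ["personalized-1", "personalized-2"]


if __name__ == "__main__":
    router = DarkLaunchRouter(rule_based, ml_model)
    router.handle({"user_id": "u1"})
    router.handle({"user_id": None})
    router.drain()
    failures = [r for r in router.records if r.shadow_error is not None]
    print(f"{len(router.records)} mirrored requests, {len(failures)} shadow failures")
```

In a real deployment the mirroring would more likely happen at the load balancer or service-mesh layer, and the records would go to your monitoring stack rather than an in-memory list, but the contract is the same: the new service sees real traffic, and the user never sees the new service.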
Dark launching is one of the most realistic ways of testing your services and models on a production-like system. But executing a dark launch can require a lot of setup and maturity within the organization from a development, monitoring, and infrastructure standpoint.
Offline testing lets you check the behaviour of your system, usually in isolation. Rarely does it let you test the end-to-end system, together with the state of the surrounding systems, under realistic production traffic and network conditions. You can achieve perhaps 70% of this through meticulous logging and very complicated offline testing, but a dark launch turns out to be much simpler. This is because you end up doing most of the same setup work anyway in order to launch and monitor a service normally. After a successful dark launch, the actual release of the new service is almost trivial, so the reward is well worth the effort.
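As an illustration of why the final release becomes simple, here is one common way to switch traffic over incrementally once the dark launch looks healthy. This is a sketch under assumptions, not a description of our rollout tooling: `rollout_bucket` and `use_new_service` are hypothetical helper names, and users are assumed to be identified by a string ID.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) using a hash of their ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def use_new_service(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user should be served by the new service.

    Because the bucket is deterministic, a user who is switched over at 10%
    stays switched over when the percentage is raised to 50% or 100%.
    """
    return rollout_bucket(user_id) < rollout_percent
```

Raising `rollout_percent` step by step (for example 1, 10, 50, 100) while watching the same dashboards you set up for the dark launch gives you an incremental release with a quick way back: set the percentage to 0.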
There are a couple of cases where a dark launch can be hard to justify in practice. For example, if your service is stateful or writes to the database, a dark launch is a lot more complicated. In my personal experience, ensuring the correctness of the system then becomes so hard that it is almost better to settle for offline testing instead of a dark launch!
If you are more curious about dark launches or want to share some of your experience, please get in touch with me at nikunj@truefoundry.com!
TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost, release models to production faster, and realise real business value.