We are back with another episode of True ML Talks. In this episode, we dive deep into Voiceflow's ML platform and LLMs, and we are speaking with Denys Linkov.
Denys leads the machine learning team at Voiceflow, which he joined as the founding ML engineer. Prior to that, he worked as a senior cloud architect for a global bank, working on data systems, MLOps, and core infrastructure.
📌
Our conversation with Denys covers the following aspects:
- Machine Learning at Voiceflow
- Voiceflow's MLOps journey
- Automating model deployment and observability to reduce context switching and improve efficiency
- Real-time inferencing pipeline: benefits and challenges
- Voiceflow's approach to generative AI
Voiceflow is a no-code platform that allows businesses to build and deploy conversational AI applications. It can be used to create chatbots, virtual assistants, and other conversational interfaces for a wide range of industries, including:
Voiceflow's NLU model is able to cover a wide range of industries because it is trained on a massive dataset of text and code from a variety of sources. This allows Voiceflow to understand and respond to a wide range of natural language queries, regardless of the industry.
For example, an e-commerce company could use a Voiceflow chatbot to help customers find products, answer questions about them, and place orders, while a real estate company could use one to help potential buyers find homes, schedule appointments with agents, and learn about the home buying process.
One of the challenges of building an NLU model that can cover all of these industries is that each industry has its own unique language and jargon. However, Voiceflow's NLU model is able to learn these differences over time as it is exposed to more data from different industries.
One of the first challenges Voiceflow faced was deciding whether to build its own models or use external ones. Voiceflow decided to explore both options and built a couple of proofs of concept. The first feature Voiceflow built was utterance generation, which uses machine learning to generate example utterances that users can add to enrich their data models.
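As a rough illustration, utterance generation can be sketched as a call to a generative model behind a simple helper function. The prompt, model name, and function below are assumptions for illustration, not Voiceflow's actual implementation.

```python
# Illustrative sketch of utterance generation (not Voiceflow's actual implementation).
# Assumes the OpenAI Python SDK (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def generate_utterances(intent_name: str, examples: list[str], n: int = 10) -> list[str]:
    """Ask a generative model for new example utterances for an intent."""
    examples_block = "\n".join(f"- {e}" for e in examples)
    prompt = (
        f"Intent: {intent_name}\n"
        f"Existing examples:\n{examples_block}\n"
        f"Write {n} new, diverse user utterances for this intent, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whichever model you use
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,
    )
    text = response.choices[0].message.content
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

# Example: enrich a "track_order" intent with more training utterances.
print(generate_utterances("track_order", ["where is my package", "track my order"]))
```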
To deploy the utterance generation model into production, Voiceflow built out its MLOps platform. The goal of the platform was to be able to deploy several experiments into production very quickly, as well as manage the environments.
The utterance generation model was the first to be killed by the release of ChatGPT, which is a more advanced generative model. This taught Voiceflow the importance of being flexible and willing to kill off its own developments if necessary, in order to focus on what's best for the customer experience.
Denys also discusses the massive shift that has happened in the conversational AI space since the launch of instruction-tuned GPT-based models. He admits it was a strategic mistake not to consider using GPT-3 at the time, but the team learned that it's important to be adaptable and willing to change approach as the field evolves.
Here's a blog you can read about creating the Voiceflow NLU:
In the traditional machine learning development process, data scientists train models in Jupyter notebooks and then hand them off to machine learning engineers or backend engineers to deploy them in production. This can lead to context switching and delays, as the engineers need to understand the model and the data in order to deploy it successfully.
One way to address this challenge is to automate model deployment and observability. This can be done by creating a set of tools and processes that allow data scientists to deploy and monitor their models in production without having to involve other engineers.
One example of this is to use a cloud-based platform that provides managed services for model deployment and observability. These platforms can provide a variety of features, such as:
Another approach to automating model deployment and observability is to develop your own custom tools and processes. This can give you more flexibility and control, but it also requires more investment.
Here is a specific example of how one company automated model deployment and observability using this approach:
This automation allowed the company's data scientists to deploy and monitor their models in production without having to involve any other engineers.
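Below is a minimal sketch of what such self-serve automation could look like, assuming a container-based setup with Docker and Kubernetes; the script name, image registry, and manifest path are hypothetical.

```python
# deploy_model.py -- hypothetical self-serve deployment script for data scientists.
# Assumes Docker, kubectl, and a Kubernetes manifest template already exist in the repo.
import subprocess
import sys

REGISTRY = "registry.example.com/ml-models"  # hypothetical registry

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def deploy(model_name: str, version: str) -> None:
    image = f"{REGISTRY}/{model_name}:{version}"
    # Build and push the model server image.
    run(["docker", "build", "-t", image, "."])
    run(["docker", "push", image])
    # Render and apply the Kubernetes deployment (template uses an IMAGE_PLACEHOLDER token).
    manifest = open("k8s/deployment.yaml").read().replace("IMAGE_PLACEHOLDER", image)
    with open("/tmp/deployment.yaml", "w") as f:
        f.write(manifest)
    run(["kubectl", "apply", "-f", "/tmp/deployment.yaml"])
    # Wait for the rollout so the data scientist gets immediate feedback.
    run(["kubectl", "rollout", "status", f"deployment/{model_name}"])

if __name__ == "__main__":
    deploy(sys.argv[1], sys.argv[2])  # e.g. python deploy_model.py utterance-gen v3
```

The same script can template in observability, for example by creating a per-model dashboard or alert as part of the deploy step, so monitoring does not require a separate hand-off either.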
There are also some challenges that need to be considered when developing your own custom tools and processes for model deployment and observability:
There are a few things that can be done to mitigate the challenges of developing your own custom tools and processes for model deployment and observability:
Real-time inferencing pipelines offer a number of benefits, including:
However, real-time inferencing pipelines also present some challenges, such as:
One of the challenges of building and deploying a real-time machine learning pipeline is how to auto scale the system to handle changes in traffic. There are a number of factors to consider, such as the predictability of the traffic patterns, the latency requirements of the models, and the complexity of the auto scaling algorithm.
One approach to auto scaling a real-time machine learning pipeline is to use a queuing system. This allows you to decouple the producers (which generate the inference requests) from the consumers (which process the inference requests). This gives you more flexibility in how you scale the system.
To auto scale a queuing-based system, you can use a variety of metrics, such as the number of messages in the queue, the average latency of the requests, or the CPU utilization of the workers. You can also use a combination of these metrics.
It is important to carefully tune the auto scaling algorithm to avoid over-scaling or under-scaling the system. Over-scaling can lead to wasted resources, while under-scaling can lead to performance problems.
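As a rough sketch of the queue-depth approach, assuming a Redis-backed work queue and a Kubernetes deployment of workers (the queue name, deployment name, and thresholds are illustrative):

```python
# Illustrative queue-depth autoscaler (names and thresholds are assumptions).
import math
import subprocess
import time

import redis

QUEUE = "inference-requests"      # hypothetical queue name
DEPLOYMENT = "inference-workers"  # hypothetical worker deployment
TARGET_PER_WORKER = 50            # messages one worker should have in flight
MIN_WORKERS, MAX_WORKERS = 2, 40

r = redis.Redis(host="localhost", port=6379)

def desired_replicas(queue_depth: int) -> int:
    wanted = math.ceil(queue_depth / TARGET_PER_WORKER)
    return max(MIN_WORKERS, min(MAX_WORKERS, wanted))

while True:
    depth = r.llen(QUEUE)
    replicas = desired_replicas(depth)
    subprocess.run(
        ["kubectl", "scale", f"deployment/{DEPLOYMENT}", f"--replicas={replicas}"],
        check=True,
    )
    time.sleep(30)  # re-evaluate every 30s; tune the interval and limits to avoid thrashing
```

In practice, an off-the-shelf autoscaler such as KEDA can scale on queue length without a custom loop like this, but the logic it applies is essentially the same.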
Here are some additional thoughts on auto scaling a queuing-based system for real-time inference:
Choosing a model server for latency-sensitive applications can be challenging for a number of reasons. First, there are many different model servers available, each with its own strengths and weaknesses. Second, the requirements for latency-sensitive applications can vary widely depending on the specific application and the types of models being used. Finally, it is often difficult to predict how a model server will perform in a production environment.
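One practical way to reduce that guesswork is to benchmark candidate model servers against your own traffic before committing. A minimal sketch, where the endpoint URL and payload are placeholders:

```python
# Minimal latency benchmark for a candidate model server (URL and payload are placeholders).
import statistics
import time

import requests

URL = "http://localhost:8080/predict"  # placeholder endpoint
PAYLOAD = {"text": "where is my order"}

latencies = []
for _ in range(200):
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=5)
    latencies.append((time.perf_counter() - start) * 1000)

latencies.sort()
print(f"p50: {statistics.median(latencies):.1f} ms")
print(f"p95: {latencies[int(0.95 * len(latencies))]:.1f} ms")
```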
Factors to consider
When choosing a model server for a latency-sensitive application, it is important to consider the following factors:
💡
Other insights around the ML platform at Voiceflow: Voiceflow uses a combination of AWS and GCP, as different enterprise customers have different requirements. They have not explored Karpenter or Autopilot yet, as they were already building out their infrastructure when those features were released. They also need to use T4 GPUs for many of their workloads, which are not optimal for Autopilot. Overall, they are prioritizing engineering time for now and will eventually migrate to more advanced infrastructure solutions as they scale up.
Voiceflow is taking a cautious approach to open source generative AI. They are aware of the potential benefits of these models, but they are also aware of the challenges involved. They are committed to providing their users with the best possible experience, and they will switch to open source models when it is the right time for their business.
There are a few challenges associated with open source generative AI:
Despite the challenges, open source generative AI models also offer a number of benefits:
Latency is a critical factor to consider when choosing a model for a retrieval augmented generation system. The best approach is to give users a choice of models to use and to provide education on what to use for different tasks.
For example, if latency is the most important factor, then an NLU-based approach with intent utterances and static responses is recommended. NLU models are typically much faster than generative models, and static responses can be delivered with very low latency.
If the user needs higher precision or better formatting, then using a generative model like GPT-4 is recommended. Generative models are more powerful than NLU models and can generate text that is more natural and engaging. However, it is important to note that generative models are also much slower than NLU models.
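A simple way to give users that choice is to route requests by latency budget. The sketch below assumes a fast NLU intent classifier and a slower generative fallback; all helper names, thresholds, and responses are hypothetical.

```python
# Hypothetical router between a fast NLU path and a slower generative path.
from dataclasses import dataclass

@dataclass
class Reply:
    text: str
    source: str

def nlu_intent(query: str) -> tuple[str, float]:
    # Placeholder for a fast intent classifier; a real one returns (intent, confidence).
    return ("track_order", 0.9) if "order" in query.lower() else ("unknown", 0.2)

def generative_answer(query: str) -> str:
    # Placeholder for a call to a generative model such as GPT-4.
    return f"(generative answer to: {query})"

STATIC_RESPONSES = {"track_order": "You can track your order from the Orders page."}  # illustrative

def respond(query: str, latency_budget_ms: int) -> Reply:
    if latency_budget_ms < 300:  # tight budget: prefer the NLU + static-response path
        intent, confidence = nlu_intent(query)
        if confidence > 0.8 and intent in STATIC_RESPONSES:
            return Reply(STATIC_RESPONSES[intent], source="nlu")
    # Loose budget or low confidence: fall back to the generative model.
    return Reply(generative_answer(query), source="generative")

print(respond("where is my order?", latency_budget_ms=200).source)              # -> nlu
print(respond("summarize my last three orders", latency_budget_ms=2000).source)  # -> generative
```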
Another way to reduce latency is to use a distributed architecture. In a distributed architecture, the retrieval and generation tasks are performed on separate servers. This allows the system to scale to meet the needs of even the most demanding applications.
Retrieval augmented generation (RAG) systems are a powerful new approach to text generation that combine the strengths of retrieval and generative models. RAG systems work by first retrieving relevant passages from a knowledge base and then using a generative model to generate text based on the retrieved passages.
RAG systems can be used for a variety of tasks, including question answering, summarization, and creative writing. However, building a high-performance RAG system can be challenging.
In this blog post, we discuss some of the key factors to consider when building a RAG system, including:
In addition to the factors mentioned above, it is also important to keep in mind that RAG systems are complex and can be difficult to generalize. Every user's domain and use case will be different, so it is important to give users the power to test their own prompts, processing, and chunking strategies. This will allow users to customize the system to meet their specific needs.
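To make the retrieve-then-generate flow concrete, here is a minimal sketch that assumes the OpenAI SDK for both embeddings and generation; the models, prompt, and chunks are illustrative and, as noted above, would need tuning per domain and use case.

```python
# Minimal RAG sketch (models, prompt, and chunking are illustrative assumptions).
# Assumes the OpenAI Python SDK (>=1.0) and an OPENAI_API_KEY in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Index: embed the knowledge-base chunks once.
chunks = [
    "Voiceflow is a no-code platform for building conversational AI applications.",
    "Refunds are processed within 5 business days of receiving the returned item.",
]
chunk_vectors = embed(chunks)

def retrieve(question: str, k: int = 1) -> list[str]:
    # 2. Retrieve: rank chunks by cosine similarity to the question.
    q = embed([question])[0]
    scores = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question: str) -> str:
    # 3. Generate: ask the model to answer using only the retrieved context.
    context = "\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return resp.choices[0].message.content

print(answer("How long do refunds take?"))
```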
Here you can read more about how to deploy a RAG architecture on TrueFoundry:
Companies that have built NLP-based solutions using traditional methods are now facing the challenge of transitioning to generative AI. Generative AI models, such as GPT-4 and LaMDA, offer a number of advantages over traditional methods, including the ability to generate text, translate languages, and answer questions in a comprehensive and informative way. However, there are also a number of challenges associated with transitioning to generative AI.
One challenge is that generative AI models are still under development and can be expensive to use. Additionally, the concept of prompting is still fairly ambiguous and challenging. Companies need to be able to develop effective prompting techniques in order to get the most out of generative AI models.
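For example, part of "effective prompting" is simply constraining the model so its output can be consumed by existing systems. The small template below is illustrative; the schema and wording are assumptions, not a recommended standard.

```python
# Illustrative prompt template that constrains a generative model to a machine-readable format.
PROMPT_TEMPLATE = """You are a customer-support assistant for an e-commerce store.
Answer the user's question in at most two sentences.
Then, on a new line, output a JSON object with keys "intent" and "needs_human"
(true if the request should be escalated to a human agent).

User question: {question}
"""

print(PROMPT_TEMPLATE.format(question="Where is my most recent order?"))
```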
Another challenge is integrating generative AI models into existing infrastructure. Companies need to make sure that their systems can handle the increased load and complexity of generative AI models.
Despite the challenges, there are also a number of opportunities associated with transitioning to generative AI. Generative AI models can help companies to improve the quality of their products and services, automate tasks, and create new products and services.
Here are some tips for companies that are transitioning to generative AI:
Transitioning to generative AI can be a challenge, but it is also an opportunity for companies to improve their products and services and create new products and services. By following the tips above, companies can make the transition to generative AI as smooth and successful as possible.
Keep watching the TrueML YouTube series and reading the TrueML blog series.
TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows while giving them full flexibility in testing and deploying models and ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost, release models to production faster, and realise real business value.