If ChatGPT is the iPhone, then the App Store is yet to be built, and that App Store will be a suite of vertical applications built on top of it. Every industry, every business, and every individual is going to build these applications in ways small and big: building a healthcare- or manufacturing-specific ChatGPT, writing domain-specific marketing emails, answering enterprise-specific questions from an internal knowledge base, or even building personal search engines to answer questions like "What is John's address?" or "When is my mother-in-law's birthday?" We wrote about a bunch of examples in our previous blog post.
To understand this, it's important to understand the difference between the clear web and the deep web:
Models like ChatGPT are trained on a massive dataset, but all of it comes from the clear web. So you can't ask a question whose answer depends on anything on the deep web, like your email or private docs. However, in the process of learning from that massive clear-web dataset, models like ChatGPT build so much understanding of language and semantics that it becomes much easier for them to learn new information from small quantities of data for a specific task.
Fine-tuning is a powerful technique that lets us leverage the knowledge of a pre-trained model like ChatGPT and improve its performance on a new task by training it on a smaller, task-specific dataset. For example, say you want to build a question-answering system for your internal company docs stored in Confluence. You can extract all the text content from your Confluence and fine-tune a GPT model with it.
Sounds simple. Why is everyone not doing it?
While OpenAI has reduced a lot of friction through its fine-tuning APIs, it still requires a fair bit of consideration, planning, and effort along different axes:
The data you use to fine-tune the models has to be in a specific prompt-completion pair format:
{"prompt": <prompt text>, "completion": <ideal generated text>}
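For illustration, a couple of hypothetical training examples for an internal-docs Q&A model would look like this as JSONL (one JSON object per line). Following OpenAI's guidance for the legacy fine-tuning format, each prompt ends with a fixed separator and each completion starts with a space and ends with a stop sequence; the contents below are made up:

```
{"prompt": "How do I request VPN access?\n\n###\n\n", "completion": " Raise a request in the IT helpdesk portal under 'Network Access'.\nEND"}
{"prompt": "What is the on-call rotation policy?\n\n###\n\n", "completion": " Every engineer is on call for one week per quarter.\nEND"}
```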
A few pointers to consider:
OpenAI has multiple models that can be fine-tuned (Ada, Babbage, Curie, and Davinci), and each of them comes with tradeoffs that need to be considered.
Note: You can also incrementally fine-tune a previously fine-tuned model. Due to API changes, there is a limitation on the timeframe: the base model must have been fine-tuned after April 21, 2022.
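As a rough sketch, incremental fine-tuning with the legacy openai Python SDK (v0.x) and the original fine-tunes endpoint looks roughly like this; the file name and fine-tuned model name below are placeholders:

```python
import openai

openai.api_key = "sk-..."  # your API key

# Upload the new batch of prompt-completion pairs (JSONL)
new_file = openai.File.create(
    file=open("new_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Continue training from a previously fine-tuned model by passing its
# name as the base model (placeholder name shown here)
job = openai.FineTune.create(
    training_file=new_file["id"],
    model="curie:ft-your-org-2023-01-15-10-00-00",
)
print(job["id"], job["status"])
```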
We tried OpenAI's fine-tuning APIs with the default parameters and they worked very well on some tasks, but we saw improvements of as much as 40% by tweaking the hyper-parameters.
Fine-tuning costs vary a lot depending on the parameters you choose, so it's very important to have a solid understanding of the pricing.
📌 Training cost per 1M tokens: Ada $0.4, Babbage $0.6, Curie $3, Davinci $30.
Using a fine-tuned model is significantly more expensive (~5x) than using its pretrained counterpart. Per 1M tokens: Ada $1.6 vs $0.4, Babbage $2.4 vs $0.5, Curie $12 vs $2, Davinci $120 vs $20.
When you are using the model, parameters like best_of and n also affect your cost because you end up generating multiple completions for a single prompt. Consider setting max_tokens to cap the completion length, and be sparing with the best_of and n parameters.
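For reference, here is a minimal sketch of a completion request against a fine-tuned model with the legacy openai Python SDK (v0.x); the model name and prompt are placeholders:

```python
import openai

openai.api_key = "sk-..."

# best_of and n generate multiple completions server-side, and you pay for
# all of them; max_tokens caps the completion length.
resp = openai.Completion.create(
    model="ada:ft-your-org-2023-01-15-10-00-00",  # placeholder fine-tuned model
    prompt="How do I request VPN access?\n\n###\n\n",
    max_tokens=150,   # cap completion length to control cost
    n=1,              # number of completions returned
    best_of=1,        # completions generated server-side (charged for each)
    temperature=0,
    stop=["\nEND"],
)
print(resp["choices"][0]["text"].strip())
```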
Building a fine-tuned model for our own Confluence dataset was not trivial. It involved the following four-step process.
Data Fetching
It took a bit of effort to figure out how to read all the data from Confluence through its APIs: Confluence pages are rich, with tables, headings, and subheadings, and we had to convert all of that into plain text for ease of use. Managing permissions and the right level of access was also not trivial. In the end, we built a very simple form where you submit the URL, username, and API key, and we fetch the data.
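A minimal sketch of the fetching step, assuming the Confluence Cloud REST API with basic auth; the base URL and credentials are placeholders, and permissions, rate limits, and attachments are not handled here:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://your-domain.atlassian.net/wiki"  # placeholder
AUTH = ("you@company.com", "YOUR_API_TOKEN")          # placeholder credentials


def fetch_pages(limit=50):
    """Yield (title, plain_text) for Confluence pages via the REST API."""
    start = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/rest/api/content",
            params={"type": "page", "start": start, "limit": limit,
                    "expand": "body.storage"},
            auth=AUTH,
        )
        resp.raise_for_status()
        data = resp.json()
        for page in data["results"]:
            html = page["body"]["storage"]["value"]
            # Flatten tables, headings, and subheadings into plain text
            text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
            yield page["title"], text
        if len(data["results"]) < limit:
            break
        start += limit


for title, text in fetch_pages():
    print(title, len(text))
```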
Data Preprocessing
This was the hardest part. We tried multiple approaches:
We eventually selected most pairs from approach #3 but also threw in a random sample from #1 and #2 above. For our dataset, we generated over 50,000 prompt-completion pairs, with an average length of 17 words per prompt and 133 words per completion, i.e., about 150 words per pair. Ideally, this step would need much more experimentation.
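A simplified sketch of how such pairs can be written out as JSONL; the sections, separator, and stop sequence here are illustrative, not our exact pipeline:

```python
import json

# Hypothetical illustration: turn (section_heading, section_text) tuples from
# the fetched Confluence pages into prompt-completion pairs, using a fixed
# separator after the prompt and a leading space plus stop sequence in the
# completion (per OpenAI's legacy fine-tuning guidance).
SEPARATOR = "\n\n###\n\n"
STOP = "\nEND"


def make_pairs(sections):
    for heading, body in sections:
        yield {
            "prompt": f"{heading}{SEPARATOR}",
            "completion": f" {body.strip()}{STOP}",
        }


sections = [
    ("How do I request VPN access?", "Raise a request in the IT portal ..."),
    ("What is the on-call rotation?", "Every engineer is on call one week ..."),
]

with open("train.jsonl", "w") as f:
    for pair in make_pairs(sections):
        f.write(json.dumps(pair) + "\n")
```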
Model Fine-tuning
We experimented with different models and hyper-parameters and found that Curie tends to perform better than Ada or Babbage, but Ada felt good enough given the cost tradeoff. We didn't try Davinci. We tweaked the learning rate and settled at 0.05, and we trained for 6 epochs. One training run of Curie cost about $30 and Ada about $4. Overall, across all the experimentation, fine-tuning might have cost us around $400 in OpenAI credits; we also had a few bad runs.
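A sketch of what a training run looks like with the legacy openai Python SDK (v0.x), assuming the 0.05 learning rate above refers to the API's learning_rate_multiplier parameter:

```python
import openai

openai.api_key = "sk-..."

# Upload the prompt-completion pairs generated in the preprocessing step
train_file = openai.File.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off a fine-tuning job; exact parameters may differ as the
# fine-tuning API evolves
job = openai.FineTune.create(
    training_file=train_file["id"],
    model="curie",                 # or "ada" for the cheaper option
    n_epochs=6,
    learning_rate_multiplier=0.05,
)
print(job["id"], job["status"])
```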
Testing
We noticed that the fine-tuned model performed strictly better than the original model on questions related to our internal dataset. We had our team try about 100 prompts and rate the responses manually. This is not scientific, but it worked for this simple use case. Interestingly, we noticed that in some cases the fine-tuned model performed worse than the original model on general queries. We still need to debug what's happening there.
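A hypothetical harness for this kind of manual comparison might look like the following, collecting side-by-side completions from the base and fine-tuned models into a CSV for rating; model names and prompts are placeholders:

```python
import csv
import openai

openai.api_key = "sk-..."

PROMPTS = [
    "How do I request VPN access?",
    "What is the on-call rotation policy?",
    # ... the rest of the ~100 internal questions
]

MODELS = {
    "base": "curie",
    "fine_tuned": "curie:ft-your-org-2023-01-15-10-00-00",  # placeholder
}

with open("eval.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "model", "completion"])
    for prompt in PROMPTS:
        for name, model in MODELS.items():
            resp = openai.Completion.create(
                model=model,
                prompt=prompt + "\n\n###\n\n",
                max_tokens=150,
                temperature=0,
                stop=["\nEND"],
            )
            writer.writerow([prompt, name, resp["choices"][0]["text"].strip()])
```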
This was a fun exercise, and we will keep experimenting with fine-tuning on other datasets.
If you want any specific datasets, or want me to open up this app to be used for your own Confluence, reach out to me at nikunj@truefoundry.com
TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows while giving developers full flexibility in testing and deploying models and ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost and release models to production faster, enabling real business value realisation.