In this blog, we summarize the various open-source LLMs that we have benchmarked. We benchmarked these models from a latency, cost, and requests-per-second perspective, which should help you evaluate whether a given model is a good fit for your business requirements. Please note that we don't cover qualitative performance in this article; methods for comparing LLMs qualitatively can be found here.
Use Cases Benchmarked
The key use cases across which we benchmarked are:
- 1500 input tokens, 100 output tokens (similar to Retrieval-Augmented Generation use cases)
- 50 input tokens, 500 output tokens (generation-heavy use cases)
Benchmarking Setup
For benchmarking, we have used Locust, an open-source load-testing tool. Locust works by creating users/workers that send requests in parallel. At the beginning of each test, we can set the Number of Users and the Spawn Rate: the Number of Users is the maximum number of users that can be spawned and run concurrently, while the Spawn Rate is the number of users spawned per second.
In each benchmarking test for a deployment configuration, we started with 1 user and gradually increased the Number of Users for as long as we saw a steady increase in RPS. During the test, we also plotted the response times (in ms) and the total requests per second.
In each of the two deployment configurations, we used the Hugging Face text-generation-inference model server, version 0.9.4. The parameters passed to the text-generation-inference image vary with the model configuration.
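As a rough sketch of what such a launch looks like, the example below uses standard text-generation-launcher flags from TGI 0.9.x; the model id, shard count, and token limits are hypothetical placeholders, not the exact values from our runs.

```python
# Hedged illustration of launching text-generation-inference (v0.9.x).
# The model id and all limits below are placeholders, not our benchmark settings.
import subprocess

tgi_command = [
    "text-generation-launcher",
    "--model-id", "meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    "--num-shard", "1",                             # number of GPU shards per replica
    "--max-input-length", "1600",                   # must accommodate the ~1500-token prompts
    "--max-total-tokens", "2100",                   # budget for input + generated tokens
    "--max-batch-prefill-tokens", "4096",           # cap on prefill tokens per batch
    "--port", "80",
]
subprocess.run(tgi_command, check=True)
```

When deploying the container image, the same flags are simply appended to the image's entrypoint, which is the text-generation-launcher command shown above.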
LLMs Benchmarked
The 5 open-source LLMs benchmarked are as follows:
The following table summarizes the benchmarking results for these LLMs:
Detailed LLM Benchmarking Blogs on Each LLM
For each of the models mentioned above, refer to the detailed LLM benchmarking blogs linked below: