In this article, we benchmark the performance of Mistral-7B from a latency, cost, and requests-per-second perspective, which should help you evaluate whether it is a good fit for your business requirements. Please note that we don't cover qualitative performance in this article - there are different methods to compare LLMs, which can be found here.
In this blog, we benchmark the Mistral-7B-Instruct-v0.1 model from mistralai. Mistral-7B-Instruct-v0.1 is an instruction-fine-tuned version of the Mistral-7B-v0.1 generative text model (7 billion parameters), tuned on a variety of publicly available conversation datasets.
The key factors across which we benchmarked are:
GPU Type: A10 24GB and A100 40GB
Prompt Length: 1500 input + 100 output tokens, and 50 input + 500 output tokens
For benchmarking, we used Locust, an open-source load-testing tool. Locust works by creating users/workers that send requests in parallel. At the beginning of each test, we can set the Number of Users and the Spawn Rate. The Number of Users is the maximum number of users that can run concurrently, whereas the Spawn Rate is how many users are spawned per second.
In each benchmarking test for a deployment config, we started with 1 user and kept increasing the Number of Users gradually for as long as the RPS kept rising steadily. During the test, we also plotted the response times (in ms) and the total requests per second.
In both deployment configurations, we used the vLLM model server, version 0.2.0-d849de0.
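To make the setup concrete, below is a minimal locustfile sketch. The endpoint path, payload shape, prompt, and wait times are our assumptions (we target vLLM's OpenAI-compatible completions route on its default port 8000); they are illustrative rather than the exact configuration used for these numbers.

```python
# locustfile.py - each simulated Locust user repeatedly posts a completion request.
from locust import HttpUser, task, between

# Rough stand-in for a long (~1500-token) prompt; replace with your own workload.
PROMPT = "Summarize the following passage. " * 200

class MistralUser(HttpUser):
    # Small pause between consecutive requests from the same simulated user.
    wait_time = between(0.1, 0.5)

    @task
    def generate(self):
        # Assumes vLLM's OpenAI-compatible /v1/completions endpoint.
        self.client.post(
            "/v1/completions",
            json={
                "model": "mistralai/Mistral-7B-Instruct-v0.1",
                "prompt": PROMPT,
                "max_tokens": 100,
                "temperature": 0.7,
            },
        )
```

A run can then be started with, for example, locust -f locustfile.py --host http://<vllm-host>:8000 --users 10 --spawn-rate 1, where --users sets the Number of Users and --spawn-rate the Spawn Rate (add --headless to skip the web UI).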
Latency, RPS, and Cost
We calculate the best latency by sending only one request at a time. To increase throughput, we send requests to the LLM in parallel. The max throughput is the highest request rate at which the model can process incoming requests without significant deterioration in latency.
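The same idea can be checked without Locust. Here is a rough sketch that measures the best-case latency with a single in-flight request and then raises the concurrency to find the point where latency starts to climb; the URL and payload are assumptions matching the sketch above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Assumed vLLM OpenAI-compatible endpoint; adjust host/port for your deployment.
URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "prompt": "Summarize the following passage. " * 200,
    "max_tokens": 100,
}

def one_request() -> float:
    """Send a single completion request and return its latency in seconds."""
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=300)
    return time.perf_counter() - start

# Best latency: only one request in flight at a time.
print(f"best latency: {one_request():.2f}s")

# Throughput probe: send n requests in parallel and watch the mean latency.
for n in (2, 4, 8):
    with ThreadPoolExecutor(max_workers=n) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(n)))
    print(f"concurrency={n}: mean latency {sum(latencies) / n:.2f}s")
```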
Tokens Per Second
LLMs process input tokens (prefill) and generate output tokens (decode) differently - hence we calculate the input-token and output-token processing rates separately.
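As a back-of-the-envelope example, these token rates can be derived from the max sustainable RPS and the per-request token counts. The formula below is our own simplification; the measured figures may have been computed differently (e.g., from server-side metrics).

```python
def token_rates(max_rps: float, input_tokens: int, output_tokens: int):
    """Return (input tokens/s, output tokens/s) at the max sustainable RPS."""
    return max_rps * input_tokens, max_rps * output_tokens

# Example: using the A10 figures reported below (1500 input + 100 output
# tokens at ~0.8 RPS), this gives roughly 1200 input and 80 output tokens/s.
in_tps, out_tps = token_rates(0.8, 1500, 100)
print(f"~{in_tps:.0f} input tokens/s, ~{out_tps:.0f} output tokens/s")
```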
A10 24GB GPU (1500 input + 100 output tokens)
We can observe in the above graphs that the Best Response Time (at 1 user) is 4.6 seconds. As we increase the number of users to send more traffic to the model, throughput rises up to 0.8 RPS without significant latency degradation. Beyond 0.8 RPS, latency increases drastically, which means requests are being queued up.
A10 24GB GPU (50 input + 500 output tokens)
We can observe in the above graphs that the Best Response Time (at 1 user) is 18 seconds. As we increase the number of users to send more traffic to the model, throughput rises up to 0.4 RPS without significant latency degradation. Beyond 0.4 RPS, latency increases drastically, which means requests are being queued up.
A100 40GB GPU (1500 input + 100 output tokens)
We can observe in the above graphs that the Best Response Time (at 1 user) is 2.3 seconds. As we increase the number of users to send more traffic to the model, throughput rises up to 2.8 RPS without significant latency degradation. Beyond 2.8 RPS, latency increases drastically, which means requests are being queued up.
A100 40GB GPU (50 input + 500 output tokens)
We can observe in the above graphs that the Best Response Time (at 1 user) is 9.7 seconds. As we increase the number of users to send more traffic to the model, throughput rises up to 1.5 RPS without significant latency degradation. Beyond 1.5 RPS, latency increases drastically, which means requests are being queued up.
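Finally, the throughput numbers can be translated into a rough serving cost. The GPU hourly prices in this sketch are hypothetical placeholders (substitute your cloud provider's actual rates); only the max RPS and token counts come from the benchmarks above.

```python
def cost_per_million_output_tokens(gpu_hourly_usd: float, max_rps: float,
                                   output_tokens_per_request: int) -> float:
    """USD cost to generate one million output tokens at max throughput."""
    tokens_per_hour = max_rps * output_tokens_per_request * 3600
    return gpu_hourly_usd * 1_000_000 / tokens_per_hour

# 50 input + 500 output token workload, with placeholder hourly prices:
print(cost_per_million_output_tokens(1.00, 0.4, 500))  # A10 at $1.00/hr -> ~$1.39
print(cost_per_million_output_tokens(3.00, 1.5, 500))  # A100 40GB at $3.00/hr -> ~$1.11
```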
Hopefully, this will help you decide whether Mistral-7B-Instruct suits your use case and estimate the costs you can expect to incur while hosting it.