Benchmarks for ChatGPT and Co
June 2024
The highlights of the month:
- The elephant in the room: Claude 3.5 Sonnet and the new Artifacts feature
- Confidential computing: how it can make AI more secure and cost-effective for companies
- The trend towards small and powerful LLMs that can be operated locally
LLM Benchmarks | June 2024
The Trustbit benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.
☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama license
⚠️ - Non-standard license! We recommend consulting a legal advisor before use to determine whether it can be used in your company in a legally compliant manner
model | code | crm | docs | integrate | marketing | reason | final 🏆 | Cost | Speed |
---|---|---|---|---|---|---|---|---|---|
GPT-4o ☁️ | 85 | 95 | 100 | 90 | 82 | 75 | 88 | 1.24 € | 1.49 rps |
GPT-4 Turbo v5/2024-04-09 ☁️ | 80 | 99 | 98 | 93 | 88 | 45 | 84 | 2.51 € | 0.83 rps |
Claude 3.5 Sonnet ☁️ | 67 | 83 | 89 | 78 | 80 | 59 | 76 | 0.97 € | 0.09 rps |
GPT-4 v1/0314 ☁️ | 80 | 88 | 98 | 52 | 88 | 50 | 76 | 7.19 € | 1.26 rps |
GPT-4 Turbo v4/0125-preview ☁️ | 60 | 97 | 100 | 71 | 75 | 45 | 75 | 2.51 € | 0.82 rps |
GPT-4 v2/0613 ☁️ | 80 | 83 | 95 | 52 | 88 | 50 | 74 | 7.19 € | 2.07 rps |
Claude 3 Opus ☁️ | 64 | 88 | 100 | 53 | 76 | 59 | 73 | 4.83 € | 0.41 rps |
GPT-4 Turbo v3/1106-preview ☁️ | 60 | 75 | 98 | 52 | 88 | 62 | 72 | 2.52 € | 0.68 rps |
Gemini Pro 1.5 0514 ☁️ | 67 | 96 | 75 | 100 | 25 | 62 | 71 | 2.06 € | 0.91 rps |
Gemini Pro 1.5 0409 ☁️ | 62 | 97 | 96 | 63 | 75 | 28 | 70 | 1.89 € | 0.58 rps |
GPT-3.5 v2/0613 ☁️ | 62 | 81 | 73 | 75 | 81 | 48 | 70 | 0.35 € | 1.39 rps |
GPT-3.5 v3/1106 ☁️ | 62 | 70 | 71 | 63 | 78 | 59 | 67 | 0.24 € | 2.29 rps |
GPT-3.5 v4/0125 ☁️ | 58 | 87 | 71 | 60 | 78 | 47 | 67 | 0.13 € | 1.41 rps |
Gemini 1.5 Flash 0514 ☁️ | 32 | 97 | 100 | 56 | 72 | 41 | 66 | 0.10 € | 1.76 rps |
Gemini Pro 1.0 ☁️ | 55 | 86 | 83 | 60 | 88 | 26 | 66 | 0.10 € | 1.35 rps |
Cohere Command R+ ☁️ | 58 | 80 | 76 | 49 | 70 | 59 | 65 | 0.85 € | 1.88 rps |
Qwen1.5 32B Chat f16 ⚠️ | 64 | 90 | 82 | 56 | 78 | 15 | 64 | 1.02 € | 1.61 rps |
GPT-3.5-instruct 0914 ☁️ | 44 | 92 | 69 | 60 | 88 | 32 | 64 | 0.36 € | 2.12 rps |
Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ | 62 | 67 | 84 | 33 | 81 | 48 | 63 | 0.22 € | 4.91 rps |
Meta Llama 3 8B Instruct f16🦙 | 74 | 62 | 68 | 49 | 80 | 42 | 63 | 0.35 € | 3.16 rps |
GPT-3.5 v1/0301 ☁️ | 49 | 82 | 69 | 67 | 82 | 24 | 62 | 0.36 € | 3.93 rps |
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ | 56 | 87 | 67 | 52 | 88 | 23 | 62 | 0.33 € | 3.28 rps |
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ | 58 | 73 | 72 | 45 | 88 | 28 | 61 | 0.33 € | 3.27 rps |
Llama 3 8B OpenChat-3.6 20240522 f16 ✅ | 64 | 51 | 76 | 45 | 88 | 39 | 60 | 0.30 € | 3.62 rps |
Starling 7B-alpha f16 ⚠️ | 51 | 66 | 67 | 52 | 88 | 36 | 60 | 0.61 € | 1.80 rps |
Mistral 7B OpenChat-3.5 v1 f16 ✅ | 46 | 72 | 72 | 49 | 88 | 31 | 60 | 0.51 € | 2.14 rps |
Yi 1.5 34B Chat f16 ⚠️ | 44 | 78 | 70 | 52 | 86 | 28 | 60 | 1.28 € | 1.28 rps |
Claude 3 Haiku ☁️ | 59 | 69 | 64 | 55 | 75 | 33 | 59 | 0.08 € | 0.53 rps |
Mixtral 8x22B API (Instruct) ☁️ | 47 | 62 | 62 | 94 | 75 | 7 | 58 | 0.18 € | 3.01 rps |
Claude 3 Sonnet ☁️ | 67 | 41 | 74 | 52 | 78 | 30 | 57 | 0.97 € | 0.85 rps |
Qwen2 7B Instruct f32 ⚠️ | 44 | 81 | 81 | 39 | 66 | 29 | 57 | 0.47 € | 2.30 rps |
Mistral Large v1/2402 ☁️ | 33 | 49 | 70 | 75 | 84 | 25 | 56 | 2.19 € | 2.04 rps |
Anthropic Claude Instant v1.2 ☁️ | 51 | 75 | 65 | 59 | 65 | 14 | 55 | 2.15 € | 1.47 rps |
Anthropic Claude v2.0 ☁️ | 57 | 52 | 55 | 45 | 84 | 35 | 55 | 2.24 € | 0.40 rps |
Cohere Command R ☁️ | 39 | 66 | 57 | 55 | 84 | 26 | 54 | 0.13 € | 2.47 rps |
Qwen1.5 7B Chat f16 ⚠️ | 51 | 81 | 60 | 34 | 60 | 36 | 54 | 0.30 € | 3.62 rps |
Anthropic Claude v2.1 ☁️ | 36 | 58 | 59 | 60 | 75 | 33 | 53 | 2.31 € | 0.35 rps |
Qwen1.5 14B Chat f16 ⚠️ | 44 | 58 | 51 | 49 | 84 | 17 | 51 | 0.38 € | 2.90 rps |
Meta Llama 3 70B Instruct b8🦙 | 46 | 72 | 53 | 29 | 82 | 18 | 50 | 7.32 € | 0.22 rps |
Mistral 7B OpenOrca f16 ☁️ | 42 | 57 | 76 | 21 | 78 | 26 | 50 | 0.43 € | 2.55 rps |
Mistral 7B Instruct v0.1 f16 ☁️ | 31 | 71 | 69 | 44 | 62 | 21 | 50 | 0.79 € | 1.39 rps |
Llama2 13B Vicuna-1.5 f16🦙 | 36 | 37 | 53 | 39 | 82 | 38 | 48 | 1.02 € | 1.07 rps |
Codestral v1 ⚠️ | 33 | 47 | 43 | 71 | 66 | 13 | 45 | 0.31 € | 3.98 rps |
Google Recurrent Gemma 9B IT f16 ⚠️ | 46 | 27 | 71 | 45 | 56 | 25 | 45 | 0.93 € | 1.18 rps |
Mistral Small v1/2312 (Mixtral) ☁️ | 10 | 67 | 65 | 51 | 56 | 8 | 43 | 0.19 € | 2.17 rps |
Llama2 13B Hermes f16🦙 | 38 | 24 | 30 | 61 | 60 | 43 | 43 | 1.03 € | 1.06 rps |
Mistral Small v2/2402 ☁️ | 27 | 42 | 36 | 82 | 56 | 8 | 42 | 0.19 € | 3.14 rps |
Llama2 13B Hermes b8🦙 | 32 | 25 | 29 | 61 | 60 | 43 | 42 | 4.94 € | 0.22 rps |
Mistral Medium v1/2312 ☁️ | 36 | 43 | 27 | 59 | 62 | 12 | 40 | 0.83 € | 0.35 rps |
IBM Granite 34B Code Instruct f16 ☁️ | 52 | 49 | 30 | 44 | 57 | 5 | 40 | 1.12 € | 1.46 rps |
Llama2 13B Puffin f16🦙 | 37 | 15 | 38 | 48 | 56 | 41 | 39 | 4.89 € | 0.22 rps |
Llama2 13B Puffin b8🦙 | 37 | 14 | 37 | 46 | 56 | 39 | 38 | 8.65 € | 0.13 rps |
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ | 13 | 47 | 57 | 40 | 59 | 8 | 37 | 0.05 € | 2.30 rps |
Llama2 13B chat f16🦙 | 15 | 38 | 17 | 45 | 75 | 8 | 33 | 0.76 € | 1.43 rps |
Llama2 13B chat b8🦙 | 15 | 38 | 15 | 45 | 75 | 6 | 32 | 3.35 € | 0.33 rps |
Mistral 7B Notus-v1 f16 ⚠️ | 16 | 54 | 25 | 41 | 48 | 4 | 31 | 0.80 € | 1.37 rps |
Mistral 7B Zephyr-β f16 ✅ | 28 | 34 | 46 | 44 | 29 | 4 | 31 | 0.51 € | 2.14 rps |
Llama2 7B chat f16🦙 | 20 | 33 | 20 | 42 | 50 | 20 | 31 | 0.59 € | 1.86 rps |
Orca 2 13B f16 ⚠️ | 15 | 22 | 32 | 22 | 67 | 19 | 29 | 0.99 € | 1.11 rps |
Mistral 7B Instruct v0.2 f16 ☁️ | 7 | 30 | 50 | 13 | 58 | 8 | 28 | 1.00 € | 1.10 rps |
Microsoft Phi 3 Mini 4K Instruct f16 ⚠️ | 36 | 35 | 31 | 1 | 50 | 6 | 27 | 0.87 € | 1.26 rps |
Mistral 7B v0.1 f16 ☁️ | 0 | 9 | 42 | 42 | 52 | 12 | 26 | 0.93 € | 1.17 rps |
Microsoft Phi 3 Medium 4K Instruct f16 ⚠️ | 12 | 34 | 30 | 13 | 47 | 8 | 24 | 0.85 € | 1.28 rps |
Google Gemma 2B IT f16 ⚠️ | 20 | 28 | 14 | 39 | 15 | 20 | 23 | 0.32 € | 3.44 rps |
Orca 2 7B f16 ⚠️ | 13 | 0 | 24 | 18 | 52 | 4 | 19 | 0.81 € | 1.34 rps |
Google Gemma 7B IT f16 ⚠️ | 0 | 0 | 0 | 9 | 62 | 0 | 12 | 1.03 € | 1.06 rps |
Llama2 7B f16🦙 | 0 | 5 | 18 | 3 | 28 | 2 | 9 | 1.01 € | 1.08 rps |
Yi 1.5 9B Chat f16 ⚠️ | 0 | 4 | 29 | 8 | 0 | 8 | 8 | 1.46 € | 0.75 rps |
The benchmark categories in detail
Here's exactly what we're looking at in the different categories of the LLM leaderboard:
- docs: How well can the model work with large documents and knowledge bases?
- crm: How well does the model support work with product catalogs and marketplaces?
- integrate: Can the model easily interact with external APIs, services and plugins?
- marketing: How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?
- reason: How well can the model reason and draw conclusions in a given context?
- code: Can the model generate code and help with programming?
- Cost: The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the provider's pricing. For on-premises models, we estimate the cost based on the GPU requirements of each model, GPU rental cost, model speed, and operational overhead. (A minimal sketch of the calculation follows this list.)
- Speed: The estimated speed of the model in requests per second (without batching). The higher the speed, the better.
Claude 3.5 Sonnet - Anthropic did it again
Remember how Anthropic made a big quality improvement in their models in March?
They have just done it again by releasing Claude 3.5 Sonnet. This mid-range model is not only more powerful than the top-of-the-range Opus model, but also about five times cheaper.
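A quick sanity check of that claim, using the per-run costs from the table above:

```python
# Cost per benchmark run from the table: Claude 3 Opus vs Claude 3.5 Sonnet.
opus_cost_eur, sonnet_35_cost_eur = 4.83, 0.97
print(f"Opus / 3.5 Sonnet cost ratio: {opus_cost_eur / sonnet_35_cost_eur:.1f}x")  # ~5.0x
```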
Improved performance with Claude 3.5 Sonnet
Claude 3.5 Sonnet follows instructions better and matches the reasoning capabilities of their top model, Opus, so this is a huge improvement.
New: Artifacts for a better user experience
There is one more big improvement in the product line of Anthropic, though. It is called Artifacts, and it isn’t even about LLM capability, but rather about user experience and LLM integration.
Artifacts: Working efficiently with documents and code
The idea of Artifacts is this: when you are working on a document or a piece of code, the Claude web chat pulls it into a convenient separate window. The document becomes an entity of its own, not just a snippet repeated in the chat. Artifacts are versioned, and you can properly iterate on them.
This may seem like a small feature, but together with Claude 3.5 Sonnet, it becomes a huge productivity boost that makes it worthwhile to use Claude Chat instead of ChatGPT when working with documents and code snippets.
Small, efficient models are getting better and better
Last month we tested several local LLMs. There were some pleasant surprises:
First of all, there was Google Gemma 7B Instruct. This Google model is often criticized for being too restricted and limited.
However, the OpenChat 3.5 fine-tuning of this model reveals its true capabilities and places this 7B model above the first version of GPT-3.5.
It is rumored that GPT-3.5 had somewhere between 20B and 175B parameters, and this small 7B model (which can run on a laptop) manages to outperform it. The rate of progress is impressive.
In fact, the only local LLM that performs better than this model (in our benchmarks) is Alibaba's Qwen1.5 32B model. However, that model has a non-standard license and requires more than four times as many resources to run.
As you can see from the benchmark table, there are already many 7B models with performance comparable to early versions of GPT-3.5. Based on the trends, the progress will not end there.
Poorer performing models
Not all local models performed so well in our benchmark. Here are some that performed poorly (mostly because they couldn't follow even basic instructions accurately):
- Yi 1.5 34B Chat
- Google Recurrent Gemma 9B IT
- Microsoft Phi 3 Mini/Medium
- Google Gemma 2B/7B
Apple Privacy Model and Confidential Computing
In its latest announcement, Apple has started to introduce more AI features to its ecosystem. One of the most interesting aspects was the concept of Private Cloud Compute.
Essentially, the iPhone will use a small and efficient LLM to process all incoming requests. This LLM is not very powerful, being comparable to modern 7B models, but it is fast and processes all requests securely: locally on the device.
It becomes particularly interesting when the LLM-controlled system recognizes that it needs more computing power to process the request.
In this case, it has two options:
- It can ask the user for permission to send the specific request to OpenAI's GPT.
- It can securely forward the request to Private Cloud Compute, managed by Apple.
What is private cloud compute?
It is a protected Apple data center that uses Apple's own chips to host powerful Large Language Models. The setup gives strong guarantees that your personal requests will be handled securely and that nobody, not even Apple, will ever see the questions and answers.
This is done through a combination of special hardware, encryption, secured VM images and mutual attestation between software and hardware. Ultimately, they do their best to make it very hard and expensive to break this setup, even for Apple itself or for governments.
Apple is all about consumer electronics. Is there anything comparable for companies?
Yes, it does exist. It's called confidential computing. The concept has been around for some time (see the Confidential Computing Consortium), but has only recently been properly applied to GPUs by Nvidia. Nvidia introduced it in the Hopper architecture (H100 GPUs) and almost completely eliminated the performance penalty in the Blackwell architecture.
The concept is the same as Apple's PCC:
- data is encrypted in transit and at rest
- data is decrypted only for the duration of the computation
- hardware and software are designed to make it impossible (or at least really hard and expensive) to look at the data while it is decrypted
Major cloud providers are already testing VMs with confidential GPU compute (e.g. Microsoft Azure with H100 since 2023, Google Cloud with H100 since 2024).
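To illustrate the flow, here is a conceptual sketch of a client that refuses to send data until the remote stack proves, via attestation, that it is running an approved software image. Every name in it is a hypothetical stand-in; real deployments rely on vendor attestation SDKs (e.g. Nvidia's tooling for the H100 confidential mode), not on this toy API.

```python
# Conceptual sketch of an attested request flow. All names here are
# hypothetical stand-ins for what a vendor SDK would provide.
from dataclasses import dataclass

# Measurement (hash) of the one VM image we are willing to trust.
TRUSTED_IMAGE_HASH = "sha256:0123abcd"

@dataclass
class AttestationReport:
    image_hash: str          # software measurement reported by the hardware
    hardware_signature: str  # signed by the chip vendor's root key

def verify_attestation(report: AttestationReport) -> bool:
    # A real verifier also checks the signature chain against the vendor's
    # root of trust; this toy version only compares the software measurement.
    return report.image_hash == TRUSTED_IMAGE_HASH

def send_confidential_request(prompt: str, report: AttestationReport) -> str:
    if not verify_attestation(report):
        raise RuntimeError("Attestation failed: refusing to send data")
    # Only after attestation succeeds would the client open an encrypted
    # channel bound to the attested VM and submit the prompt.
    return f"(would send {len(prompt)} chars to the attested endpoint)"

print(send_confidential_request("patient record ...",
                                AttestationReport(TRUSTED_IMAGE_HASH, "sig")))
```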
This approach is interesting because it offers a third option to companies that need to build a secure LLM-driven system:
Options | Guarantees | Upfront investment | Operating costs |
---|---|---|---|
OpenAI via Microsoft Azure | Medium. Not everyone likes sending data to third parties, but many already use MS Office. | None | High: we pay per request |
Our own data center with GPUs | Very high: data remains within our security perimeter. | Huge: GPUs are expensive, and lead times are long. | Low |
Renting confidential GPU compute | High: there are strong guarantees that our data is protected from everyone else. | Low: we can pay as we go. | High: we pay per rental period |
Just like with hybrid clouds (they were a big thing in the past, but are the norm these days), we can mix and match these options for a cost-effective and secure solution, just like Apple does with PCC. For example (a routing sketch follows below):
- Have a small local deployment that runs cost-effective 7B models on our own hardware. It handles all requests locally.
- If a user request needs a more powerful AI/LLM and doesn't involve critical information, route it to Azure OpenAI.
- If a user request is both sensitive and requires a lot of GPU compute, route it to confidential compute in the cloud.
Ultimately, if the powerful-and-confidential workload is steady enough, it might make sense to add a few local and powerful GPUs to handle it. During the peaks we can still rent confidential compute in the cloud.
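Here is a minimal sketch of such a routing policy. The two boolean signals and the endpoint names are illustrative assumptions; in practice the sensitivity and complexity checks would themselves be small classifiers or rule sets.

```python
# Minimal sketch of the hybrid routing policy described above.
from enum import Enum, auto

class Route(Enum):
    LOCAL_7B = auto()            # on-prem model, inside the security perimeter
    AZURE_OPENAI = auto()        # powerful cloud model for non-sensitive data
    CONFIDENTIAL_CLOUD = auto()  # rented confidential GPU compute

def route_request(is_sensitive: bool, needs_large_model: bool) -> Route:
    if not needs_large_model:
        return Route.LOCAL_7B          # the local 7B model can handle it
    if not is_sensitive:
        return Route.AZURE_OPENAI      # needs more power, data is not critical
    return Route.CONFIDENTIAL_CLOUD    # both sensitive and compute-heavy

assert route_request(is_sensitive=True, needs_large_model=True) is Route.CONFIDENTIAL_CLOUD
assert route_request(is_sensitive=True, needs_large_model=False) is Route.LOCAL_7B
```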
With an H100 setup, you can expect high performance even from a single GPU if you use the right software and optimization profile. For example, you can achieve 20-50% more throughput with Llama 3 8B at fp16 by switching from vLLM to a TensorRT-LLM backend with an Nvidia NIM setup. Since the H100 hardware also natively supports fp8, switching from fp16 to fp8 quantization can yield another 10-30% in performance.
NB: Performance gains will depend on the overall context size, batch size and nature of the workload.
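As a minimal sketch of the fp8 switch (assuming an H100-class GPU and a vLLM build with fp8 support), the change can be as small as one parameter; the model name and sampling settings below are only illustrative:

```python
# Serving Llama 3 8B with fp8 weights via vLLM's offline API (sketch).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="fp8",  # drop this argument to serve the model at fp16
)
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain confidential computing in one sentence."], params)
print(outputs[0].outputs[0].text)
```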
Confidential computing: new ways of working together without disclosing data and code
If you push the concept even further, confidential computing enables a new mode of collaboration between companies: multi-party data analysis without disclosing data or code. For example, medical companies could pool their data to develop more effective treatment procedures without revealing the raw private data to each other.
Summary
Apple did a great job of explaining the concepts of confidential computing to a broad audience. This raises awareness of one more cost-effective way of building a secure AI-driven enterprise solution.
All the ingredients for building such a solution are already available:
- Resource-efficient LLMs that can be operated locally within the security perimeter: fine-tunes of Llama 3 8B, Gemma and Mistral 7B.
- Powerful cloud models from renowned providers: GPT from OpenAI and Gemini from Google.
- New hardware that gives strong data protection guarantees and that can be rented.
Time will tell whether this approach will become more popular.
Trustbit LLM Benchmarks Archive
Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!