Benchmarks for ChatGPT and Co

June 2024

The highlights of the month:

  • The elephant in the room - Claude 3.5 Sonnet and artifacts feature

  • Confidential computing - how it can make AI more secure and cost-effective for companies

  • The trend towards small and powerful LLMs that can be operated locally

LLM Benchmarks | June 2024

The Trustbit benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama license
⚠️ - Non-standard license! We recommend consulting a legal advisor before use to determine whether it can be used in your company in a legally compliant manner

model code crm docs integrate marketing reason final 🏆 Cost Speed
GPT-4o ☁️ 85 95 100 90 82 75 88 1.24 € 1.49 rps
GPT-4 Turbo v5/2024-04-09 ☁️ 80 99 98 93 88 45 84 2.51 € 0.83 rps
Claude 3.5 Sonnet ☁️ 67 83 89 78 80 59 76 0.97 € 0.09 rps
GPT-4 v1/0314 ☁️ 80 88 98 52 88 50 76 7.19 € 1.26 rps
GPT-4 Turbo v4/0125-preview ☁️ 60 97 100 71 75 45 75 2.51 € 0.82 rps
GPT-4 v2/0613 ☁️ 80 83 95 52 88 50 74 7.19 € 2.07 rps
Claude 3 Opus ☁️ 64 88 100 53 76 59 73 4.83 € 0.41 rps
GPT-4 Turbo v3/1106-preview ☁️ 60 75 98 52 88 62 72 2.52 € 0.68 rps
Gemini Pro 1.5 0514 ☁️ 67 96 75 100 25 62 71 2.06 € 0.91 rps
Gemini Pro 1.5 0409 ☁️ 62 97 96 63 75 28 70 1.89 € 0.58 rps
GPT-3.5 v2/0613 ☁️ 62 81 73 75 81 48 70 0.35 € 1.39 rps
GPT-3.5 v3/1106 ☁️ 62 70 71 63 78 59 67 0.24 € 2.29 rps
GPT-3.5 v4/0125 ☁️ 58 87 71 60 78 47 67 0.13 € 1.41 rps
Gemini 1.5 Flash 0514 ☁️ 32 97 100 56 72 41 66 0.10 € 1.76 rps
Gemini Pro 1.0 ☁️ 55 86 83 60 88 26 66 0.10 € 1.35 rps
Cohere Command R+ ☁️ 58 80 76 49 70 59 65 0.85 € 1.88 rps
Qwen1.5 32B Chat f16 ⚠️ 64 90 82 56 78 15 64 1.02 € 1.61 rps
GPT-3.5-instruct 0914 ☁️ 44 92 69 60 88 32 64 0.36 € 2.12 rps
Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ 62 67 84 33 81 48 63 0.22 € 4.91 rps
Meta Llama 3 8B Instruct f16🦙 74 62 68 49 80 42 63 0.35 € 3.16 rps
GPT-3.5 v1/0301 ☁️ 49 82 69 67 82 24 62 0.36 € 3.93 rps
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ 56 87 67 52 88 23 62 0.33 € 3.28 rps
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ 58 73 72 45 88 28 61 0.33 € 3.27 rps
Llama 3 8B OpenChat-3.6 20240522 f16 ✅ 64 51 76 45 88 39 60 0.30 € 3.62 rps
Starling 7B-alpha f16 ⚠️ 51 66 67 52 88 36 60 0.61 € 1.80 rps
Mistral 7B OpenChat-3.5 v1 f16 ✅ 46 72 72 49 88 31 60 0.51 € 2.14 rps
Yi 1.5 34B Chat f16 ⚠️ 44 78 70 52 86 28 60 1.28 € 1.28 rps
Claude 3 Haiku ☁️ 59 69 64 55 75 33 59 0.08 € 0.53 rps
Mixtral 8x22B API (Instruct) ☁️ 47 62 62 94 75 7 58 0.18 € 3.01 rps
Claude 3 Sonnet ☁️ 67 41 74 52 78 30 57 0.97 € 0.85 rps
Qwen2 7B Instruct f32 ⚠️ 44 81 81 39 66 29 57 0.47 € 2.30 rps
Mistral Large v1/2402 ☁️ 33 49 70 75 84 25 56 2.19 € 2.04 rps
Anthropic Claude Instant v1.2 ☁️ 51 75 65 59 65 14 55 2.15 € 1.47 rps
Anthropic Claude v2.0 ☁️ 57 52 55 45 84 35 55 2.24 € 0.40 rps
Cohere Command R ☁️ 39 66 57 55 84 26 54 0.13 € 2.47 rps
Qwen1.5 7B Chat f16 ⚠️ 51 81 60 34 60 36 54 0.30 € 3.62 rps
Anthropic Claude v2.1 ☁️ 36 58 59 60 75 33 53 2.31 € 0.35 rps
Qwen1.5 14B Chat f16 ⚠️ 44 58 51 49 84 17 51 0.38 € 2.90 rps
Meta Llama 3 70B Instruct b8🦙 46 72 53 29 82 18 50 7.32 € 0.22 rps
Mistral 7B OpenOrca f16 ✅ 42 57 76 21 78 26 50 0.43 € 2.55 rps
Mistral 7B Instruct v0.1 f16 ✅ 31 71 69 44 62 21 50 0.79 € 1.39 rps
Llama2 13B Vicuna-1.5 f16🦙 36 37 53 39 82 38 48 1.02 € 1.07 rps
Codestral v1 ⚠️ 33 47 43 71 66 13 45 0.31 € 3.98 rps
Google Recurrent Gemma 9B IT f16 ⚠️ 46 27 71 45 56 25 45 0.93 € 1.18 rps
Mistral Small v1/2312 (Mixtral) ☁️ 10 67 65 51 56 8 43 0.19 € 2.17 rps
Llama2 13B Hermes f16🦙 38 24 30 61 60 43 43 1.03 € 1.06 rps
Mistral Small v2/2402 ☁️ 27 42 36 82 56 8 42 0.19 € 3.14 rps
Llama2 13B Hermes b8🦙 32 25 29 61 60 43 42 4.94 € 0.22 rps
Mistral Medium v1/2312 ☁️ 36 43 27 59 62 12 40 0.83 € 0.35 rps
IBM Granite 34B Code Instruct f16 ☁️ 52 49 30 44 57 5 40 1.12 € 1.46 rps
Llama2 13B Puffin f16🦙 37 15 38 48 56 41 39 4.89 € 0.22 rps
Llama2 13B Puffin b8🦙 37 14 37 46 56 39 38 8.65 € 0.13 rps
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ 13 47 57 40 59 8 37 0.05 € 2.30 rps
Llama2 13B chat f16🦙 15 38 17 45 75 8 33 0.76 € 1.43 rps
Llama2 13B chat b8🦙 15 38 15 45 75 6 32 3.35 € 0.33 rps
Mistral 7B Notus-v1 f16 ⚠️ 16 54 25 41 48 4 31 0.80 € 1.37 rps
Mistral 7B Zephyr-β f16 ✅ 28 34 46 44 29 4 31 0.51 € 2.14 rps
Llama2 7B chat f16🦙 20 33 20 42 50 20 31 0.59 € 1.86 rps
Orca 2 13B f16 ⚠️ 15 22 32 22 67 19 29 0.99 € 1.11 rps
Mistral 7B Instruct v0.2 f16 ✅ 7 30 50 13 58 8 28 1.00 € 1.10 rps
Microsoft Phi 3 Mini 4K Instruct f16 ⚠️ 36 35 31 1 50 6 27 0.87 € 1.26 rps
Mistral 7B v0.1 f16 ✅ 0 9 42 42 52 12 26 0.93 € 1.17 rps
Microsoft Phi 3 Medium 4K Instruct f16 ⚠️ 12 34 30 13 47 8 24 0.85 € 1.28 rps
Google Gemma 2B IT f16 ⚠️ 20 28 14 39 15 20 23 0.32 € 3.44 rps
Orca 2 7B f16 ⚠️ 13 0 24 18 52 4 19 0.81 € 1.34 rps
Google Gemma 7B IT f16 ⚠️ 0 0 0 9 62 0 12 1.03 € 1.06 rps
Llama2 7B f16🦙 0 5 18 3 28 2 9 1.01 € 1.08 rps
Yi 1.5 9B Chat f16 ⚠️ 0 4 29 8 0 8 8 1.46 € 0.75 rps
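Judging from the rows above, the final 🏆 score appears to be the plain arithmetic mean of the six category scores, rounded to the nearest integer. This is an inferred formula, not an official one, but it cross-checks against the table:

```python
from statistics import mean

def final_score(code, crm, docs, integrate, marketing, reason):
    """Inferred aggregation: unweighted mean of the six category scores,
    rounded to the nearest integer. Not an official formula."""
    return round(mean([code, crm, docs, integrate, marketing, reason]))

# Cross-check against leaderboard rows above:
print(final_score(85, 95, 100, 90, 82, 75))  # GPT-4o -> 88
print(final_score(67, 83, 89, 78, 80, 59))   # Claude 3.5 Sonnet -> 76
```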

The benchmark categories in detail

Here is exactly what each category of the LLM leaderboard measures:

  • docs: How well can the model work with large documents and knowledge bases?

  • crm: How well does the model support work with product catalogs and marketplaces?

  • integrate: Can the model easily interact with external APIs, services and plugins?

  • marketing: How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

  • reason: How well can the model reason and draw conclusions in a given context?

  • code: Can the model generate code and help with programming?

  • Cost: The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the provider's pricing. For on-premises models, we estimate the cost based on the GPU requirements of each model, GPU rental prices, model speed, and operational overhead.

  • Speed: The estimated speed of the model in requests per second (without batching). The higher the speed, the better.
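The on-premises cost estimate described above can be sketched as a simple calculation. All figures and the overhead factor below are illustrative placeholders, not the actual benchmark inputs:

```python
def on_prem_cost_per_request(gpu_rent_eur_per_hour: float,
                             requests_per_second: float,
                             overhead_factor: float = 1.2) -> float:
    """Rough on-prem cost per request: GPU rental cost divided by
    throughput, scaled by an operational-overhead factor.
    The overhead factor of 1.2 is a hypothetical assumption."""
    seconds_per_hour = 3600
    base = gpu_rent_eur_per_hour / (requests_per_second * seconds_per_hour)
    return base * overhead_factor

# e.g. a hypothetical 2 EUR/h GPU serving 1.5 requests per second:
cost = on_prem_cost_per_request(2.0, 1.5)
```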

Claude 3.5 Sonnet - Anthropic did it again

Anthropic has done it again by releasing Claude 3.5 Sonnet. This mid-range model is not only more capable than their top-of-the-range Opus model, but also about five times cheaper.

Improved performance with Claude 3.5 Sonnet

Claude 3.5 Sonnet follows instructions better and matches the reasoning capabilities of their top model, Opus, which is a huge improvement for a mid-range model.

New: Artifacts for a better user experience

There is one more big improvement in Anthropic's product line, though. It is called Artifacts, and it isn't about raw LLM capability, but rather about user experience and LLM integration.

Artifacts: Working efficiently with documents and code

The idea of Artifacts is this: when you are working on a document or a piece of code, the Claude web chat pulls it into a convenient separate window. The document becomes an entity of its own, not just a snippet repeated throughout the chat. Artifacts are versioned, and you can properly iterate on them.

This may seem like a small feature, but together with Claude 3.5 Sonnet, it becomes a huge productivity boost that makes it worthwhile to use Claude Chat instead of ChatGPT when working with documents and code snippets.

Small, efficient models are getting better and better

Last month we tested several local LLMs. There were some pleasant surprises:

First of all, there was Google Gemma 7B Instruct. This Google model is often criticized for being too restricted and limited.

However, the OpenChat-3.5 fine-tune of this model reveals its true capabilities and places this 7B model above the first version of GPT-3.5.

It is rumored that GPT-3.5 had somewhere between 20B and 175B parameters, and this small 7B model (which can run on a laptop) manages to outperform it. The rate of progress is impressive.

In fact, the only local LLM that performs better than this model (in our benchmarks) is Alibaba's Qwen1.5-32B. However, that model has a non-standard license and requires more than four times the resources to run.

As you can see from the leaderboard, there are already many 7B models with performance comparable to early versions of GPT-3.5. Based on current trends, progress will not stop there.

Poorer performing models

Not all local models performed so well in our benchmark. Here are some that performed poorly (mostly because they couldn't follow even basic instructions accurately):

- Yi 1.5 34B Chat

- Google Recurrent Gemma 9B IT

- Microsoft Phi 3 Mini/Medium

- Google Gemma 2B/7B

Apple Privacy Model and Confidential Computing

In its latest announcement, Apple has started to introduce more AI features to its ecosystem. One of the most interesting aspects was the concept of Private Cloud Compute.

Essentially, the iPhone will use a small, efficient LLM to process all incoming requests. This LLM is not very powerful, roughly comparable to modern 7B models, but it is fast and processes all requests securely on the device.

It becomes particularly interesting when the LLM-controlled system recognizes that it needs more computing power to process the request.

In this case, it has two options:

  • It can ask the user for permission to send the specific request to OpenAI GPT.

  • It can securely forward the request to a private cloud compute managed by Apple.

What is private cloud compute?

It is a protected Apple data center that uses Apple's own chips to host powerful large language models. The setup gives strong guarantees that personal requests will be handled securely and that nobody, not even Apple, will ever see the questions and answers.

This is achieved through a combination of special hardware, encryption, secured VM images and mutual attestation between software and hardware. Ultimately, Apple does its best to make this setup very hard and expensive to break, even for Apple itself or for governments.
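The attestation part of this design can be illustrated with a toy sketch: the client only sends data to a node whose measured software image appears in a published allowlist. Real Private Cloud Compute uses hardware-rooted attestation and a transparency log; this minimal version, with hypothetical image names, just compares hashes:

```python
import hashlib

# Hypothetical transparency-log entries of approved node images.
PUBLISHED_MEASUREMENTS = {
    hashlib.sha256(b"pcc-node-image-v1").hexdigest(),
}

def verify_attestation(reported_image: bytes) -> bool:
    """Accept the node only if its measurement is in the public log.
    A real system checks a hardware-signed measurement, not raw bytes."""
    return hashlib.sha256(reported_image).hexdigest() in PUBLISHED_MEASUREMENTS

def send_request(prompt: str, node_image: bytes) -> str:
    """Refuse to send any data to a node that fails attestation."""
    if not verify_attestation(node_image):
        raise PermissionError("node failed attestation; refusing to send data")
    return f"sent securely: {prompt}"
```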

Apple is all about consumer electronics. Is there anything comparable for companies?

Yes, there is. It's called confidential computing. The concept has been around for some time (see the Confidential Computing Consortium), but it has only recently been properly applied to GPUs by Nvidia, which introduced it in the Hopper architecture (H100 GPUs) and almost completely eliminated the performance penalty in the Blackwell architecture.

The concept is the same as Apple's PCC:

  • data is encrypted in transit and at rest

  • data is decrypted only for the duration of the computation

  • hardware and software are designed to make it impossible (or at least very hard and expensive) to look at the data while it is decrypted.

Major cloud providers are already testing VMs with confidential GPU computation (e.g. Microsoft Azure with H100 since 2023, Google Cloud with H100 since 2024).

This approach is interesting because it offers a third option to companies that need to build a secure LLM-driven system:

  • OpenAI via Microsoft. Guarantees: medium; not everyone likes sending data to third parties, but many already use MS Office. Upfront investment: none. Operating costs: high, since we pay per request.

  • Our own data center with GPUs. Guarantees: very high; data remains within our security perimeter. Upfront investment: huge; GPUs are expensive and lead times are long. Operating costs: low.

  • Renting confidential GPU compute. Guarantees: high; there are many guarantees that our data is protected from everyone else. Upfront investment: low; we can pay as we go. Operating costs: high, since we pay per rental period.

Just like hybrid clouds (a big thing in the past, the norm these days), these options can be mixed and matched for a cost-effective and secure solution, just as Apple does with PCC. For example:

  • Have a small local deployment that runs cost-effective 7B models on our own hardware. It will handle all requests locally.

  • If a user request needs a more powerful AI/LLM and doesn't involve critical information, route it to Azure OpenAI.

  • If a user request is both sensitive and requires a lot of GPU compute, route it to confidential compute in the cloud.

Ultimately, if the powerful-and-confidential workload is steady enough, it might make sense to add a few local and powerful GPUs to handle it. During the peaks we can still rent confidential compute in the cloud.
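The routing logic sketched above fits in a few lines. The backend names are illustrative, not actual service identifiers:

```python
def route_request(is_sensitive: bool, needs_big_model: bool) -> str:
    """Toy router for the hybrid setup described above:
    - the small local model handles everything it can,
    - non-sensitive heavy requests go to a cloud API,
    - sensitive heavy requests go to rented confidential compute.
    Backend names are hypothetical placeholders."""
    if not needs_big_model:
        return "local-7b"
    if not is_sensitive:
        return "azure-openai"
    return "confidential-cloud-gpu"
```

In practice the `needs_big_model` decision would itself come from the local model or a lightweight classifier, mirroring how Apple's on-device LLM decides when to escalate.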

With an H100 setup, you can expect high performance even with a single GPU if you use the right software and optimization profile. For example, you can achieve 20-50% more throughput with Llama 3 8B at fp16 by switching the backend from vLLM to TensorRT-LLM with an Nvidia NIM setup.

Since the H100 hardware also supports fp8 quantization, switching from fp16 to fp8 can yield another 10-30% in performance.

NB: Performance gains will depend on the overall context size, batch size and nature of the workload.
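If the two optimizations above were independent, their relative gains would compose multiplicatively; real workloads may not satisfy that assumption, so treat this as a rough upper-bound estimate:

```python
def combined_speedup(gains):
    """Multiply independent relative gains, e.g. +20% and +10% -> x1.32.
    Assumes the gains are independent, which real workloads
    (context size, batch size) may not satisfy."""
    factor = 1.0
    for g in gains:
        factor *= (1.0 + g)
    return factor

low = combined_speedup([0.20, 0.10])   # -> 1.32, i.e. +32%
high = combined_speedup([0.50, 0.30])  # -> 1.95, i.e. +95%
```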

Confidential computing: new ways of working together without disclosing data and code

If you push the concept even further, confidential computing enables a new mode of collaboration between companies: multi-party data analysis without disclosing data or code. For example, medical companies can pool their data to develop more effective treatment procedures without disclosing raw private data to each other.
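A toy model of that collaboration: each party's raw data enters the "enclave", but only the pooled aggregate leaves it. Here a plain function boundary stands in for the hardware-enforced isolation, and all numbers are made up:

```python
from statistics import mean

def enclave_pooled_average(datasets):
    """Simulate multi-party analysis inside confidential compute:
    raw per-party records go in, only the pooled aggregate comes out.
    In a real system, hardware attestation and memory encryption
    enforce that nothing else can be read from the enclave."""
    pooled = [x for dataset in datasets for x in dataset]
    return mean(pooled)

# Three hypothetical parties pool treatment outcomes without
# revealing their raw records to each other:
result = enclave_pooled_average([[0.61, 0.58], [0.72], [0.65, 0.70, 0.62]])
```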

Summary

Apple did a great job of explaining the concepts of confidential computing to a broad audience. This raises awareness of one more cost-effective way of building a secure AI-driven enterprise solution.

All the ingredients for building such a solution are already available:

  • Resource-efficient LLMs that can be operated locally within the security perimeter, such as fine-tunes of Llama 3 8B, Gemma and Mistral 7B.

  • Powerful cloud models from renowned providers: GPT from OpenAI and Gemini from Google.

  • New hardware that gives strong data-protection guarantees and that can be rented.

Time will tell whether this approach will become more popular.

Trustbit LLM Benchmarks Archive

Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!