Benchmarks for ChatGPT and Co

July 2024

Highlights of the month

July 2024 was a very fruitful month in the world of generative AI. We even saw a few boundaries pushed forward. We have a lot of ground to cover. Let’s get started!

  • Codestral-Mamba 7B - new efficient LLM architecture that achieves surprisingly good results

  • GPT-4o Mini - affordable, lightweight model. The best in its class!

  • Mistral Nemo 12B - decent downloadable model in its class, designed for quantization (compression)

  • Mistral Large 123B v2 - local model that reaches the level of GPT-4 Turbo v3 and Gemini Pro 1.5. It would be the best local model if it weren't for Meta Llama 3.1.

  • Meta Llama 3.1 - a series of models with a permissive license that set new records in our benchmark.

    +++ Update +++

  • Gemini Pro 1.5 v0801 - Google suddenly manages to catch up with OpenAI and makes it into the top 3!

LLM Benchmarks | July 2024

The Trustbit benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama license

Model | Code | CRM | Docs | Integrate | Marketing | Reason | Final 🏆 | Cost | Speed
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
GPT-4o ☁️ | 90 | 95 | 100 | 90 | 82 | 75 | 89 | 1.21 € | 1.50 rps
GPT-4 Turbo v5/2024-04-09 ☁️ | 86 | 99 | 98 | 93 | 88 | 45 | 85 | 2.45 € | 0.84 rps
Google Gemini Pro 1.5 0801 ☁️ | 84 | 92 | 90 | 100 | 70 | 72 | 85 | 1.48 € | 0.83 rps
GPT-4 v1/0314 ☁️ | 90 | 88 | 98 | 52 | 88 | 50 | 78 | 7.04 € | 1.31 rps
Claude 3.5 Sonnet ☁️ | 72 | 83 | 89 | 78 | 80 | 59 | 77 | 0.94 € | 0.09 rps
GPT-4 v2/0613 ☁️ | 90 | 83 | 95 | 52 | 88 | 50 | 76 | 7.04 € | 2.16 rps
GPT-4 Turbo v4/0125-preview ☁️ | 66 | 97 | 100 | 71 | 75 | 45 | 76 | 2.45 € | 0.84 rps
GPT-4o Mini ☁️ | 63 | 87 | 80 | 52 | 100 | 67 | 75 | 0.04 € | 1.46 rps
Claude 3 Opus ☁️ | 69 | 88 | 100 | 53 | 76 | 59 | 74 | 4.69 € | 0.41 rps
Meta Llama 3.1 405B Instruct 🦙 | 81 | 93 | 92 | 55 | 75 | 46 | 74 | 2.39 € | 1.16 rps
GPT-4 Turbo v3/1106-preview ☁️ | 66 | 75 | 98 | 52 | 88 | 62 | 73 | 2.46 € | 0.68 rps
Mistral Large 123B v2/2407 ☁️ | 68 | 79 | 68 | 75 | 75 | 71 | 73 | 0.86 € | 1.02 rps
Gemini Pro 1.5 0514 ☁️ | 73 | 96 | 75 | 100 | 25 | 62 | 72 | 2.01 € | 0.92 rps
Meta Llama 3.1 70B Instruct f16 🦙 | 74 | 89 | 90 | 55 | 75 | 46 | 72 | 1.79 € | 0.90 rps
Gemini Pro 1.5 0409 ☁️ | 68 | 97 | 96 | 63 | 75 | 28 | 71 | 1.84 € | 0.59 rps
GPT-3.5 v2/0613 ☁️ | 68 | 81 | 73 | 75 | 81 | 48 | 71 | 0.34 € | 1.46 rps
GPT-3.5 v3/1106 ☁️ | 68 | 70 | 71 | 63 | 78 | 59 | 68 | 0.24 € | 2.33 rps
Gemini Pro 1.0 ☁️ | 66 | 86 | 83 | 60 | 88 | 26 | 68 | 0.09 € | 1.36 rps
GPT-3.5 v4/0125 ☁️ | 63 | 87 | 71 | 60 | 78 | 47 | 68 | 0.12 € | 1.43 rps
Gemini 1.5 Flash 0514 ☁️ | 32 | 97 | 100 | 56 | 72 | 41 | 66 | 0.09 € | 1.77 rps
Cohere Command R+ ☁️ | 63 | 80 | 76 | 49 | 70 | 59 | 66 | 0.83 € | 1.90 rps
Qwen1.5 32B Chat f16 ⚠️ | 70 | 90 | 82 | 56 | 78 | 15 | 65 | 0.97 € | 1.66 rps
GPT-3.5-instruct 0914 ☁️ | 47 | 92 | 69 | 60 | 88 | 32 | 65 | 0.35 € | 2.15 rps
Mistral Nemo 12B v1/2407 ☁️ | 54 | 58 | 51 | 97 | 75 | 50 | 64 | 0.07 € | 1.22 rps
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ | 68 | 87 | 67 | 52 | 88 | 23 | 64 | 0.32 € | 3.39 rps
Meta Llama 3 8B Instruct f16 🦙 | 79 | 62 | 68 | 49 | 80 | 42 | 64 | 0.32 € | 3.33 rps
GPT-3.5 v1/0301 ☁️ | 55 | 82 | 69 | 67 | 82 | 24 | 63 | 0.35 € | 4.12 rps
Gemma 7B OpenChat-3.5 v3 0106 f16 ✅ | 63 | 67 | 84 | 33 | 81 | 48 | 63 | 0.21 € | 5.09 rps
Llama 3 8B OpenChat-3.6 20240522 f16 ✅ | 76 | 51 | 76 | 45 | 88 | 39 | 62 | 0.28 € | 3.79 rps
Mistral 7B OpenChat-3.5 v1 f16 ✅ | 58 | 72 | 72 | 49 | 88 | 31 | 62 | 0.49 € | 2.20 rps
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ | 63 | 73 | 72 | 45 | 88 | 28 | 61 | 0.32 € | 3.40 rps
Starling 7B-alpha f16 ⚠️ | 58 | 66 | 67 | 52 | 88 | 36 | 61 | 0.58 € | 1.85 rps
Yi 1.5 34B Chat f16 ⚠️ | 47 | 78 | 70 | 52 | 86 | 28 | 60 | 1.18 € | 1.37 rps
Claude 3 Haiku ☁️ | 64 | 69 | 64 | 55 | 75 | 33 | 60 | 0.08 € | 0.52 rps
Mixtral 8x22B API (Instruct) ☁️ | 53 | 62 | 62 | 94 | 75 | 7 | 59 | 0.17 € | 3.12 rps
Meta Llama 3.1 8B Instruct f16 🦙 | 57 | 74 | 62 | 52 | 74 | 34 | 59 | 0.45 € | 2.41 rps
Codestral Mamba 7B v1 ✅ | 53 | 66 | 51 | 94 | 71 | 17 | 59 | 0.30 € | 2.82 rps
Meta Llama 3.1 70B Instruct b8 🦙 | 60 | 76 | 75 | 30 | 81 | 26 | 58 | 5.28 € | 0.31 rps
Claude 3 Sonnet ☁️ | 72 | 41 | 74 | 52 | 78 | 30 | 58 | 0.95 € | 0.85 rps
Qwen2 7B Instruct f32 ⚠️ | 50 | 81 | 81 | 39 | 66 | 29 | 58 | 0.46 € | 2.36 rps
Mistral Large v1/2402 ☁️ | 37 | 49 | 70 | 75 | 84 | 25 | 57 | 2.14 € | 2.11 rps
Anthropic Claude Instant v1.2 ☁️ | 58 | 75 | 65 | 59 | 65 | 14 | 56 | 2.10 € | 1.49 rps
Anthropic Claude v2.0 ☁️ | 63 | 52 | 55 | 45 | 84 | 35 | 55 | 2.19 € | 0.40 rps
Cohere Command R ☁️ | 45 | 66 | 57 | 55 | 84 | 26 | 55 | 0.13 € | 2.50 rps
Qwen1.5 7B Chat f16 ⚠️ | 56 | 81 | 60 | 34 | 60 | 36 | 55 | 0.29 € | 3.76 rps
Anthropic Claude v2.1 ☁️ | 29 | 58 | 59 | 60 | 75 | 33 | 52 | 2.25 € | 0.35 rps
Mistral 7B OpenOrca f16 ☁️ | 54 | 57 | 76 | 21 | 78 | 26 | 52 | 0.41 € | 2.65 rps
Qwen1.5 14B Chat f16 ⚠️ | 50 | 58 | 51 | 49 | 84 | 17 | 51 | 0.36 € | 3.03 rps
Meta Llama 3 70B Instruct b8 🦙 | 51 | 72 | 53 | 29 | 82 | 18 | 51 | 6.97 € | 0.23 rps
Mistral 7B Instruct v0.1 f16 ☁️ | 34 | 71 | 69 | 44 | 62 | 21 | 50 | 0.75 € | 1.43 rps
Llama2 13B Vicuna-1.5 f16 🦙 | 50 | 37 | 53 | 39 | 82 | 38 | 50 | 0.99 € | 1.09 rps
Google Recurrent Gemma 9B IT f16 ⚠️ | 58 | 27 | 71 | 45 | 56 | 25 | 47 | 0.89 € | 1.21 rps
Codestral 22B v1 ✅ | 38 | 47 | 43 | 71 | 66 | 13 | 46 | 0.30 € | 4.03 rps
Llama2 13B Hermes f16 🦙 | 50 | 24 | 30 | 61 | 60 | 43 | 45 | 1.00 € | 1.07 rps
Llama2 13B Hermes b8 🦙 | 41 | 25 | 29 | 61 | 60 | 43 | 43 | 4.79 € | 0.22 rps
Mistral Small v2/2402 ☁️ | 33 | 42 | 36 | 82 | 56 | 8 | 43 | 0.18 € | 3.21 rps
Mistral Small v1/2312 (Mixtral) ☁️ | 10 | 67 | 65 | 51 | 56 | 8 | 43 | 0.19 € | 2.21 rps
IBM Granite 34B Code Instruct f16 ☁️ | 63 | 49 | 30 | 44 | 57 | 5 | 41 | 1.07 € | 1.51 rps
Mistral Medium v1/2312 ☁️ | 41 | 43 | 27 | 59 | 62 | 12 | 41 | 0.81 € | 0.35 rps
Llama2 13B Puffin f16 🦙 | 37 | 15 | 38 | 48 | 56 | 41 | 39 | 4.70 € | 0.23 rps
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ | 22 | 47 | 57 | 40 | 59 | 8 | 39 | 0.05 € | 2.39 rps
Llama2 13B Puffin b8 🦙 | 37 | 14 | 37 | 46 | 56 | 39 | 38 | 8.34 € | 0.13 rps
Meta Llama2 13B chat f16 🦙 | 22 | 38 | 17 | 45 | 75 | 8 | 34 | 0.75 € | 1.44 rps
Meta Llama2 13B chat b8 🦙 | 22 | 38 | 15 | 45 | 75 | 6 | 33 | 3.27 € | 0.33 rps
Mistral 7B Zephyr-β f16 ✅ | 37 | 34 | 46 | 44 | 29 | 4 | 32 | 0.46 € | 2.34 rps
Meta Llama2 7B chat f16 🦙 | 22 | 33 | 20 | 42 | 50 | 20 | 31 | 0.56 € | 1.93 rps
Mistral 7B Notus-v1 f16 ⚠️ | 10 | 54 | 25 | 41 | 48 | 4 | 30 | 0.75 € | 1.43 rps
Orca 2 13B f16 ⚠️ | 18 | 22 | 32 | 22 | 67 | 19 | 30 | 0.95 € | 1.14 rps
Mistral 7B Instruct v0.2 f16 ☁️ | 11 | 30 | 50 | 13 | 58 | 8 | 29 | 0.96 € | 1.12 rps
Mistral 7B v0.1 f16 ☁️ | 0 | 9 | 42 | 42 | 52 | 12 | 26 | 0.87 € | 1.23 rps
Google Gemma 2B IT f16 ⚠️ | 33 | 28 | 14 | 39 | 15 | 20 | 25 | 0.30 € | 3.54 rps
Microsoft Phi 3 Medium 4K Instruct f16 ⚠️ | 5 | 34 | 30 | 13 | 47 | 8 | 23 | 0.82 € | 1.32 rps
Orca 2 7B f16 ⚠️ | 22 | 0 | 24 | 18 | 52 | 4 | 20 | 0.78 € | 1.38 rps
Google Gemma 7B IT f16 ⚠️ | 0 | 0 | 0 | 9 | 62 | 0 | 12 | 0.99 € | 1.08 rps
Meta Llama2 7B f16 🦙 | 0 | 5 | 18 | 3 | 28 | 2 | 9 | 0.95 € | 1.13 rps
Yi 1.5 9B Chat f16 ⚠️ | 0 | 4 | 29 | 8 | 0 | 8 | 8 | 1.41 € | 0.76 rps

The benchmark categories in detail

Here's exactly what we're looking at in the different categories of the LLM leaderboard:

  • Docs: How well can the model work with large documents and knowledge bases?

  • CRM: How well does the model support work with product catalogs and marketplaces?

  • Integrate: Can the model easily interact with external APIs, services and plugins?

  • Marketing: How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

  • Reason: How well can the model reason and draw conclusions in a given context?

  • Code: Can the model generate code and help with programming?

  • Cost: The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the published pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead (a rough sketch of this estimate follows the list).

  • Speed: The estimated speed of the model in requests per second (without batching). The higher the speed, the better.
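To make the cost estimate more concrete, here is a minimal sketch of how the on-premises part of such a calculation could look. The formula and all numbers are illustrative assumptions for the sketch, not the exact methodology behind the Cost column.

```python
# Illustrative estimate of the cost of running a benchmark workload on a
# self-hosted model. Numbers are placeholders, not our actual figures.

def workload_cost_eur(num_requests: int,
                      requests_per_second: float,
                      gpu_rental_eur_per_hour: float,
                      num_gpus: int,
                      overhead_factor: float = 1.2) -> float:
    """Cost of pushing num_requests through a model served on rented GPUs.

    overhead_factor stands in for operational overhead (idle time,
    monitoring, restarts, ...).
    """
    hours_needed = num_requests / requests_per_second / 3600
    return hours_needed * gpu_rental_eur_per_hour * num_gpus * overhead_factor

# Example: 1,000 requests against a model doing 0.90 rps on 2 GPUs
# rented at 2.50 EUR per GPU-hour.
print(f"{workload_cost_eur(1000, 0.90, 2.50, 2):.2f} EUR")
```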

Codestral Mamba 7B

Mistral AI has made quite a few releases this month, but Codestral Mamba is our favorite. It's not extremely powerful - it's comparable to models like Llama 3.1 8B or Claude 3 Sonnet - but there are a few nuances:

  • This model is not designed for product or business tasks, it is a coding model. Nevertheless, it competes well with general purpose models.

  • The model doesn't use the well-studied transformer architecture, but Mamba (also known as Linear-Time Sequence Modeling with Selective State Spaces). This architecture is considered to be more resource-efficient and to have fewer constraints when working with large contexts. There have been multiple attempts to train a good Mamba model, but this is the first one to achieve good results on our leaderboard.

  • The new model is available for local use and can be obtained directly from HuggingFace. Nvidia TensorRT-LLM already supports this model.
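If you want to try Codestral Mamba locally, a minimal sketch with the Hugging Face transformers library could look like the following. The checkpoint name and the requirement of a transformers release with Mamba-2 support are our assumptions - check the model card on HuggingFace before running it.

```python
# Minimal sketch: running Codestral Mamba locally via transformers.
# Assumes a transformers release with Mamba-2 support and that the
# repository name below matches the official model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mamba-Codestral-7B-v0.1"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```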

GPT-4o Mini

GPT-4o Mini is a new multimodal model from OpenAI. It is similar in class to the GPT-3.5 models, but with better overall results. Its Reason score is remarkably high for such a small model, and GPT-4o Mini is also the first model to score a perfect 100 in our Marketing category (which tests working with language and writing styles).

Given the extremely low cost and good results, GPT-4o Mini seems perfect for small, focused tasks such as routers and classifiers in LLM-driven products. Large-scale data extraction tasks also look like a good fit.
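To illustrate the router use case, here is a hedged sketch using the official OpenAI Python SDK. The categories and prompt are made up for the example; only the model name comes from the release.

```python
# Sketch: GPT-4o Mini as a cheap router/classifier in an LLM pipeline.
# The categories and prompt are illustrative, not from our benchmark.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def route(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the user message into exactly one category: "
                        "billing, technical, sales, other. Reply with the category only."},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(route("My invoice from June shows the wrong VAT rate."))  # expected: billing
```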

Mistral Nemo 12B

Mistral AI has been putting a lot of effort into bleeding-edge R&D, it seems. Mistral Nemo 12B is another example of this.

On one hand, this model is a bit larger than previous 7B models from Mistral AI. On the other hand it has a few interesting nuances that make up for that.

First of all, the model has a better tokenizer under the hood, leading to more efficient token use (fewer tokens are needed for the same input and output).

Secondly, the model was trained together with NVIDIA using quantization-aware training. This means that the model is designed from the start to run in a resource-efficient mode - specifically, to work well in FP8 mode, where the model weights take up a quarter of their usual size in memory (compared to the FP32 format). Here is the announcement from NVIDIA.

It's a nice coincidence that NVidia GPUs with CUDA Compute 9.0 generation are designed to run FP8 natively (e.g. H100 GPUs for data centers)

If you have the latest GPUs, this Mistral Nemo model can be a good replacement for the earlier 7B models from Mistral AI. Since the model also achieves a high Reason score, there is a chance that fine-tuning will push the model even higher.

You can download this model from Hugging Face or use it via the MistralAI API.
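For the API route, a minimal sketch with the Mistral AI Python client (v0.x) might look like this; the model identifier is our assumption, so check the current API documentation.

```python
# Sketch: calling Mistral Nemo through the Mistral AI API.
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

client = MistralClient()  # reads MISTRAL_API_KEY from the environment
response = client.chat(
    model="open-mistral-nemo",  # assumed API name for Mistral Nemo 12B
    messages=[ChatMessage(role="user",
                          content="Summarize FP8 quantization in one sentence.")],
)
print(response.choices[0].message.content)
```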

Mistral Large 123B v2

Mistral Large v2 is currently the best Mistral model in our benchmarks. It is available for download, which means you can run it on your own machines (although a license is required for commercial use).

This model also has a large context window of 128k tokens. It is claimed to support multiple languages, both human and programming.

In our benchmark, this model has really good results and an unusually high Reason capability. It is comparable with GPT-4 Turbo v3, Gemini Pro 1.5 and Claude 3 Opus.

The unusual size of this Mistral model could indicate that it was also trained with FP8 awareness, to replace the 70B models in their lineup (12:7 ≈ 123:70). If that's the case, we could see a general trend of new models appearing in these odd sizes. However, they will only run well on the latest GPUs. This may fragment the LLM landscape and slow down progress.


Llama 3.1 Models from Meta

Meta has released its Llama 3.1 series, which includes 3 model sizes: 8B, 70B and 405B. You can download all of the models from HuggingFace and use them locally. Most AI providers also offer them via API.

We tested the smaller models locally and used Google Vertex AI for the 405B. Google almost didn't mess up the integration (you may need to fix the line breaks and truncate extra tokens at the beginning of the prompt).
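The kind of cleanup involved could look like the sketch below. The helper and the specific artifacts are illustrative, not the exact fixes we applied - adapt them to whatever your endpoint actually returns.

```python
# Illustrative post-processing for LLM responses returned through an API
# gateway: escaped line breaks and stray chat-template tokens are typical
# artifacts, but verify against your own responses.
def clean_response(raw: str) -> str:
    text = raw.replace("\\n", "\n")  # un-escape line breaks
    for token in ("assistant", "<|eot_id|>"):  # strip leaked template tokens
        if text.lstrip().startswith(token):
            text = text.lstrip()[len(token):]
    return text.strip()

print(clean_response("assistant\\nThe answer is 42."))
```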

The 8B model is not that interesting - it scores lower than the previous 3.0 version, so we’ll skip it. The other two models are way more interesting.

Meta Llama 3.1 70B has made a massive jump in quality compared to the previous version. It has caught up with Gemini Pro 1.5, surpassed GPT-3.5, and matched Mistral Large 123B v2. This is great news, because we can now achieve the quality of a 123B model with a smaller one.

Note, by the way, that Llama 3.1 models can be quite sensitive to quantization (compression). For example, if we run the 70B model with 8-bit quantization (via bitsandbytes), performance and quality drop drastically - compare the f16 and b8 rows in the table above.

This does not mean that all quantization strategies are equally bad (you can find a good article on this topic here). Just make sure you compare your model on your hardware with your specific data.
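For reference, the "b8" setup behind those rows corresponds to something like the following sketch with transformers and bitsandbytes. The checkpoint name is what we believe the Hugging Face repository was called at release time; verify it, and note that the Llama license must be accepted first.

```python
# Sketch: loading Llama 3.1 70B with 8-bit bitsandbytes quantization
# (the "b8" rows in the table). Requires enough GPU memory and prior
# acceptance of the Llama license on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
# As the table shows, b8 scores can drop well below the f16 baseline --
# always re-benchmark a quantized model on your own data.
```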

Meta Llama 3.1 405B Instruct

Meta Llama 3.1 405B Instruct is the last hero of the month. This is the first locally available model that managed to beat a GPT-4 model (the weakest GPT-4 Turbo, v3/1106). You can find it in the top 10 of our benchmark table above.

It is a large model. You need 640GB of VRAM (8x H100/A100) just to run it in FP8 with a small batch and context window - the weights alone take roughly 405GB at one byte per parameter, before the KV cache and activations. The resource requirements alone mean that far fewer people will use this model compared to the 70B/8B variants, so there will be fewer interesting fine-tunes and solutions.

But that's not all that important. The important points are:

  • This is a model that you can download and use locally.

  • It outperforms one of the GPT-4 models.

  • It beats Mistral Large 2 in quality while having a more permissive license.

  • It reaches the quality of Claude 3 Opus.

This is a small breakthrough. We are sure that smaller models will also reach this level at some point.

Update: Google Gemini 1.5 Pro Experimental v0801

Normally we don't do benchmark updates after publication, but this news deserved it. Waiting a whole month to report on the new Google Gemini model would be a waste.

This model was released as a public experiment on the first of August (you can find it in Google AI Studio). At that point it was also revealed that the model had already been running for some time on the LMSYS Chatbot Arena, scoring at the top with more than 12k votes.

We ran our own benchmark using the Google AI Studio API (the model is not yet available on Vertex AI). The results are really impressive: a substantial jump in model capabilities compared to the first version of Gemini Pro 1.5 from April.
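One way to reach the model is the google-generativeai Python package; a hedged sketch follows. The experimental model identifier is our assumption from the release - verify it in AI Studio.

```python
# Sketch: querying the experimental Gemini model via the AI Studio API.
import google.generativeai as genai

genai.configure(api_key="YOUR_AI_STUDIO_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-exp-0801")  # assumed model id
response = model.generate_content(
    "Extract the due date from: 'Payment is due on 2024-08-15.'"
)
print(response.text)
```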

This Google model managed to suddenly overtake almost all GPT-4 models and catch up with the top, taking third place. The scores are quite solid.

The scores could have been even better if Gemini Pro 1.5 paid more attention to following instructions precisely. While extreme attention to detail isn't always needed in human interactions, it is essential in products and LLM pipelines deployed at our customers. The top two models from OpenAI still excel in that capability.

Still, the news is outstanding and worth celebrating. First of all, we have a new source of innovation that has managed to catch up with OpenAI (and we thought that Google was out of the race). Second, companies deeply invested in Google Cloud will finally get access to a top-quality large language model within their ecosystem.

And who knows whether Google Gemini 2.0 will manage to push modeling capabilities even further. The pace of progress so far has been quite impressive - just compare the scores of the Gemini Pro 1.5 versions 0409, 0514 and 0801 in the table above.

Local AI and Compliance

We have been tracking this trend for some time now. Local models are becoming increasingly powerful over time and are beating more complex closed-source models.

Local models are quite interesting for a lot of customers, since they seem to address many problems around privacy, confidentiality and compliance. There is less chance of leaking private data if your LLMs run completely on your premises, within the security perimeter, right?

Nuances and new regulations: The EU AI Act

However, there are still some nuances. This newsletter is being published at the end of July 2024; on August 1, 2024, the Artificial Intelligence Act comes into force in the EU. It creates a common regulatory and legal framework for AI in the EU, with various provisions gradually coming into force over the next 3 years.

The EU AI Act regulates not only AI providers (such as OpenAI or MistralAI), but also companies that use AI in a professional context.

Risk-based regulation: What does this mean for your company?

Obviously, not everybody is going to be regulated the same way. Regulation is based on risk levels, and most AI applications are expected to fall into the "minimal risk" category. However, it is quite easy to step into a higher risk category (for example, if the AI enables image manipulation, or is used in education or recruitment).

Due diligence: more than just local models

In other words, some due diligence will be required for all large companies. The statement "We only use local models" may not be sufficient.

Checklist for compliance with AI regulations

Here's a quick check to see if you're on the right track to ensuring compliance for your AI system. Have you documented the answers to these questions and communicated them clearly within your organization? (A sketch of a machine-readable record follows the list.)

  • Who are the main users of your system? What are the industries and specific applications of your system? What is the risk classification here?

  • What is the exact name, version, vendor and platform/environment of your AI components?

  • What are the affiliations and partnerships of your AI providers? What are the licensing terms?

  • Where are your systems used geographically? Under which jurisdiction do your AI systems operate?

  • Who is responsible for the system and processes for managing AI risks in your company?

  • Who is responsible for the documentation and communication of your AI system (including things like architecture, components, dependencies, functional requirements and performance standards)?
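If it helps, the answers can be captured in a machine-readable record, for example along these lines. The structure and field names are our own illustration, not a format prescribed by the AI Act.

```python
# Hypothetical template for documenting an AI system; field names are
# illustrative, not an official EU AI Act schema.
from dataclasses import dataclass, field

@dataclass
class AISystemRecord:
    name: str                      # exact name, version and vendor of the AI component
    platform: str                  # platform/environment it runs on
    risk_class: str                # e.g. "minimal", "limited", "high"
    use_cases: list[str] = field(default_factory=list)      # industries, applications
    jurisdictions: list[str] = field(default_factory=list)  # where it operates
    risk_owner: str = ""           # responsible for AI risk management
    documentation_owner: str = ""  # responsible for documentation and communication

record = AISystemRecord(
    name="GPT-4o Mini (July 2024), OpenAI",
    platform="Azure OpenAI Service",
    risk_class="minimal",
    use_cases=["internal support ticket routing"],
    jurisdictions=["EU"],
    risk_owner="Jane Doe (CISO)",
    documentation_owner="John Smith (platform team)",
)
print(record)
```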

Your path to AI compliance

If you have concrete answers to these questions, chances are you're already well on your way with AI compliance. It also means that your company will keep an eye on the compliance effort of the different options when evaluating LLM-driven solutions.

You can contact us at any time if you have any questions about AI compliance or would like to discuss the topic in more detail.

Trustbit LLM Benchmarks Archive

Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!