Benchmarks for ChatGPT and Co

February 2024

The highlights of the month:

  • Improvements to ChatGPT-4

  • Performance comparisons for the Mistral API and the Anthropic Claude models

  • First work on enterprise AI benchmarks

LLM Benchmarks | February 2024

The Trustbit benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
⚠️ - Open models that can be run locally, but with license or usage caveats
🦙 - Local models with Llama license

Here is an updated report on the performance of LLM models in enterprise-specific workloads.

Model Code CRM Docs Integrate Marketing Reason Final 🏆 Cost Speed
GPT-4 v1/0314 ☁️ 80 88 98 52 88 50 76 7.19 € 1.26 rps
GPT-4 Turbo v4/0125-preview ☁️ 60 97 100 71 75 45 75 2.51 € 0.82 rps
GPT-4 v2/0613 ☁️ 80 83 95 52 88 50 74 7.19 € 2.07 rps
GPT-4 Turbo v3/1106-preview ☁️ 60 75 98 52 88 62 72 2.52 € 0.68 rps
GPT-3.5 v2/0613 ☁️ 62 79 73 75 81 48 70 0.35 € 1.39 rps
GPT-3.5 v3/1106 ☁️ 62 68 71 63 78 59 67 0.24 € 2.29 rps
GPT-3.5 v4/0125 ☁️ 58 85 71 60 78 47 66 0.13 € 1.41 rps
GPT-3.5-instruct 0914 ☁️ 44 90 69 60 88 32 64 0.36 € 2.12 rps
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ 56 86 67 52 88 26 62 0.37 € 2.99 rps
GPT-3.5 v1/0301 ☁️ 49 75 69 67 82 24 61 0.36 € 3.93 rps
Mistral 7B OpenChat-3.5 v1 f16 ✅ 46 72 72 49 88 31 60 0.51 € 2.14 rps
Starling 7B-alpha f16 ⚠️ 51 66 67 45 88 36 59 0.61 € 1.80 rps
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ 51 74 72 41 75 31 57 0.36 € 3.05 rps
Mistral Large v1/2402 ☁️ 33 49 70 75 84 25 56 2.19 € 2.04 rps
Anthropic Claude Instant v1.2 ☁️ 51 75 65 59 65 14 55 2.15 € 1.47 rps
Anthropic Claude v2.0 ☁️ 57 52 55 30 84 35 52 2.24 € 0.40 rps
Anthropic Claude v2.1 ☁️ 36 58 59 45 75 33 51 2.31 € 0.35 rps
Mistral 7B OpenOrca f16 ✅ 42 57 76 21 78 26 50 0.43 € 2.55 rps
Mistral 7B Instruct v0.1 f16 ✅ 31 70 69 44 62 21 50 0.79 € 1.39 rps
Llama2 13B Vicuna-1.5 f16 🦙 36 37 53 39 82 38 48 1.02 € 1.07 rps
Llama2 13B Hermes f16 🦙 38 23 30 61 60 43 42 1.03 € 1.06 rps
Llama2 13B Hermes b8 🦙 32 24 29 61 60 43 42 4.94 € 0.22 rps
Mistral Small v1/2312 (Mixtral) ☁️ 10 58 65 51 56 8 41 0.19 € 2.17 rps
Mistral Small v2/2402 ☁️ 27 35 36 82 56 8 41 0.19 € 3.14 rps
Mistral Medium v1/2312 ☁️ 36 30 27 59 62 12 38 0.83 € 0.35 rps
Llama2 13B Puffin f16 🦙 37 12 38 33 56 41 36 4.89 € 0.22 rps
Llama2 13B Puffin b8 🦙 37 9 37 31 56 39 35 8.65 € 0.13 rps
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ 13 39 57 32 59 8 35 0.05 € 2.30 rps
Mistral 7B Zephyr-β f16 ✅ 28 34 46 44 29 4 31 0.51 € 2.14 rps
Llama2 13B chat f16 🦙 15 38 17 30 75 8 30 0.76 € 1.43 rps
Llama2 13B chat b8 🦙 15 38 15 30 75 6 30 3.35 € 0.33 rps
Mistral 7B Notus-v1 f16 ⚠️ 16 43 25 41 48 4 30 0.80 € 1.37 rps
Orca 2 13B f16 ⚠️ 15 22 32 22 67 19 29 0.99 € 1.11 rps
Llama2 7B chat f16 🦙 20 33 20 27 50 20 28 0.59 € 1.86 rps
Mistral 7B Instruct v0.2 f16 ✅ 7 21 50 13 58 8 26 1.00 € 1.10 rps
Mistral 7B f16 ✅ 0 4 42 42 52 12 25 0.93 € 1.17 rps
Orca 2 7B f16 ⚠️ 13 0 24 18 52 4 19 0.81 € 1.34 rps
Llama2 7B f16 🦙 0 2 18 2 28 2 9 1.01 € 1.08 rps

The benchmark categories in detail

Here is exactly what we look at in the different categories of the LLM leaderboard:

  • Docs: How well can the model work with large documents and knowledge bases?

  • CRM: How well does the model support work with product catalogs and marketplaces?

  • Integrate: Can the model easily interact with external APIs, services and plugins?

  • Marketing: How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?

  • Reason: How well can the model reason and draw conclusions in a given context?

  • Code: Can the model generate code and help with programming?

  • Cost: The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the provider's pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead (see the sketch after this list).

  • Speed: The estimated speed of the model in requests per second (without batching). The higher the speed, the better.

Improvements in ChatGPT-4 - new recommendations

The latest update in the GPT-4 series finally breaks the trend of releasing cheaper models with lower accuracy: in our benchmarks, GPT-4 0125 (or v4) beats the GPT-4 0613 (or v2) model.

This model also comes with more recent training data (up to December 2023) and runs at a fraction of the cost of the v1 and v2 models. That makes GPT-4 Turbo v4/0125-preview the new safe default model that we can recommend.

The GPT-3.5 models, meanwhile, continue to follow the old pattern: new releases are getting cheaper and less capable.

Mistral and Claude API - Verbosity Problem

This report finally includes benchmarks for the Mistral AI and Anthropic Claude models:

  • Anthropic Claude Instant v1.2

    Smaller LLM from Anthropic - it's anthropic.claude-instant-v1 on AWS Bedrock.

  • Anthropic Claude v2.0 and v2.1

    Larger Anthropic LLMs that introduced large context sizes - the anthropic.claude-v2 series on AWS Bedrock.

  • Mistral Large Model

    Recently released LLM from Mistral, positioned between GPT-4 and GPT-3.5 in Mistral's internal benchmarks. It is mistral-large-2402 on La Plateforme.

  • Mistral Medium

    Another proprietary model from Mistral, roughly comparable to Llama 2 70B according to the Miqu leak. We are testing mistral-medium-2312.

  • Mistral Small

    The first version of this model was the very popular Mixtral 8x7B; for the second version, Mistral does not say whether that is still the case. We test both versions: mistral-small-2402 and mistral-small-2312.

  • Mistral Tiny

    This model corresponds to Mistral 7B Instruct v0.2 or mistral-tiny-2312 on Mistral AI.

All of these models can be good for creating content and chatting with people. However, that is not the point of our benchmark: we rank the models according to their ability to provide accurate answers in tasks such as information retrieval, document ranking or classification.

All of these models are too verbose for that, and they do not follow instructions precisely. Even small local variants of Mistral 7B are better at this. ChatGPT-4 remains at the top. It seems that OpenAI understands the needs of enterprise customers better than the rest.
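Why does verbosity matter so much here? In these workloads the model's answer goes straight into code, so anything that is not exactly parseable counts as a failure. A hypothetical classification check illustrates the point (the labels and helper function are our own, purely for illustration):

```python
# Hypothetical strictness check: accept the reply only if it is exactly
# one of the allowed labels, as the task instruction demands.
ALLOWED_LABELS = {"billing", "shipping", "refund", "other"}

def parse_label(response: str):
    candidate = response.strip().strip('."\'').lower()
    return candidate if candidate in ALLOWED_LABELS else None

# A terse, instruction-following model passes:
assert parse_label("billing") == "billing"
# A verbose model buries the right label in chatter and scores zero:
assert parse_label("Sure! I would classify this as billing, because...") is None
```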

OUR CONCLUSION

If you need LLMs for chatbots and marketing purposes and are okay with some instructions being ignored, the Mistral AI and Anthropic models may be worth a closer look. Otherwise, we suggest deferring them for a while.

We introduce

Enterprise AI Leaderboard

We have been tracking the performance of LLM models for many months; this is our eighth report.

This process has given us first-hand experience with several different models at the same time. Unlike the usual academic benchmarks, we source our data from real-world projects and enterprise tasks.

⭐️ New: LLM benchmarks from Patronus AI

By the way, we are no longer alone in this area. Another company has recently started working on a similar set of enterprise benchmarks. We invite you to take a look at the Enterprise Scenarios Leaderboard on Hugging Face by Patronus AI.

That's all good, but it's time to address the real elephant in the room. The truth is:

Large language models are just an implementation detail.

Yes, it is true that a lot depends on their performance and capabilities. This is why, for example, in the short term we generally recommend GPT-4 Turbo v4/0125-preview as a model to start with.

However, we ultimately believe that large language models are replaceable and interchangeable. In fact, this entire LLM ranking was started because of a recurring customer question: "When can I replace ChatGPT-4 with a local model in my projects?"

If you look at the "Requests for Startups" from Y Combinator, one specific request focuses on exactly this topic of replacement: small fine-tuned models as an alternative to huge generic models. Y Combinator helped incubate companies like Stripe, Dropbox, Twitch and Cruise. They know a thing or two about market and industry trends.

Giant generic models with many parameters are very impressive. But they are also very costly and often come with latency and privacy challenges. Fortunately, smaller open-source models such as Llama2 and Mistral have already shown that, when fine-tuned with suitable data, they can deliver comparable results at a fraction of the cost.

To push the concept even further, we believe that local models will be the way to improve the overall accuracy of a system beyond the capabilities of ChatGPT, while significantly reducing operational costs.

Note

Per-system customization makes it possible to design systems that learn and adapt to the specifics of each individual company. We are not even talking about advanced topics like fine-tuning, which requires a lot of high-quality data. Even simple customization of calls and context based on statistics can work wonders.
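As a hypothetical illustration of such statistics-driven customization (the design here is our own sketch, not a description of a specific product): keep success counters for candidate context snippets or few-shot examples, and prefer the ones that have historically produced correct answers.

```python
import random
from collections import defaultdict

# Per-snippet counters: snippet -> [correct_answers, total_uses].
# A real system would persist these and update them from user feedback.
stats = defaultdict(lambda: [0, 0])

def record_outcome(snippet: str, was_correct: bool) -> None:
    stats[snippet][1] += 1
    stats[snippet][0] += int(was_correct)

def pick_context(candidates: list, k: int = 3) -> list:
    """Prefer snippets with the best observed success rate; give untried
    snippets a random score so they still get explored occasionally."""
    def success_rate(snippet: str) -> float:
        correct, total = stats[snippet]
        return correct / total if total else random.random()
    return sorted(candidates, key=success_rate, reverse=True)[:k]
```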

Since individual LLMs are an implementation detail, what should be the metric to measure the state of the art when applying AI to enterprise workloads?

Here is a hint in the form of some questions we are asked:

  • Which RAG architecture is best for legal workloads?

  • Which vector database should we use to build an internal support bot?

  • What is the best approach to automatically handle company questionnaires with 1000 questions in B2B sales?

The metric should target and compare complete enterprise and business AI solutions. End-to-end.

Anybody can claim 99% accuracy on RAG tasks. We want to verify such claims independently, build a better intuition about different architectures, and ultimately allow our customers to make more informed decisions.

It will take time and effort to build a full Enterprise AI Leaderboard. We are starting with the foundational capability: the ability of an AI system to find relevant information within business-specific documentation. This is the basic building block of RAG systems.

Here is an example: We took a public annual report from the Christian Dior Group. Then we asked the AI system 10 specific questions about this report. For example:

  • What was the company's turnover in 2022?

  • How much liquidity did the company have at the end of 2021?

  • What was the gross margin in 2023?

  • How many employees did the company have in 2022?

As you can see, each question has only one correct answer. No calculations or advanced reasoning are required.

How well do you think different systems would deal with these specific questions?

Not so good!

To get things started, we tested some common systems:

  • ChatGPT-4

  • OpenAI Assistant API with document retrieval and the gpt-4-0125 model (see the sketch after this list)

  • Two popular services for asking questions about a specific PDF: ChatPDF and AskYourPDF.
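For the OpenAI Assistant API entry above, the test flow looked roughly like the sketch below. It assumes the early-2024 Assistants beta with the built-in retrieval tool; the file name, question constant and polling loop are illustrative:

```python
import time
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

QUESTION = "What was the company's net revenue in 2021?"
INSTRUCTION = "..."  # the strict answer-format prompt quoted below

# Upload the annual report so the built-in retrieval tool can index it.
report = client.files.create(file=open("dior_annual_report.pdf", "rb"),
                             purpose="assistants")

assistant = client.beta.assistants.create(
    name="Report QA",
    model="gpt-4-0125-preview",
    tools=[{"type": "retrieval"}],  # document retrieval, Assistants v1 naming
    file_ids=[report.id],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=QUESTION + "\n" + INSTRUCTION,
)

run = client.beta.threads.runs.create(thread_id=thread.id,
                                      assistant_id=assistant.id)
while run.status in ("queued", "in_progress"):  # simple polling loop
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # newest message comes first
```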

Each test involved uploading the annual report and asking the question with a very specific instruction:

PROMPT

Answer with a floating point number in actual currency, for example "1.234 million", use the decimal point and no thousands separators. You can think through the answer, but the last line should be in this format "Answer = Number Unit". Answer with "Answer = None" if no information is available.

This instruction was important because:

  • we would like to encourage models to use a chain-of-thought (CoT) process if this increases accuracy

  • we still need the number to be parseable in a specific locale, hence the strict requirement to use the decimal point and no thousands separators (just like in the original report).

Obviously, RAG systems, being end-to-end solutions, may already have CoT baked into their pipelines under the covers. However, when we added this instruction to the overall request prompt, accuracy still increased.

Below are the final scores of multiple RAG systems in a single test. We gave each system 1 point for a correct, parseable answer, and 0.5 points for an answer that pulled the right piece of information but made an order-of-magnitude error.
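A minimal sketch of this parsing and scoring logic (the helper names and the simplified unit handling are ours; we compare bare numeric values and treat a factor-of-1000 slip, million vs. billion, as the order-of-magnitude case):

```python
import math
import re

# Matches the mandated final line, e.g. "Answer = 79.184 million" or "Answer = None".
ANSWER_RE = re.compile(r"Answer\s*=\s*(None|\d+(?:\.\d+)?)\s*(\w+)?", re.IGNORECASE)

def parse_answer(text: str):
    """Return (value, unit) from the last 'Answer = ...' line, or None if the
    model ignored the required format (which counts as a failed question)."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    value, unit = matches[-1]
    if value.lower() == "none":
        return (None, None)
    return (float(value), unit or None)

def score(model_output: str, expected) -> float:
    """1 point for a correct, parseable answer; 0.5 for the right figure with
    an order-of-magnitude slip; 0 otherwise."""
    parsed = parse_answer(model_output)
    if parsed is None:
        return 0.0
    value, _unit = parsed
    if value is None or expected is None:
        return 1.0 if value == expected else 0.0
    if math.isclose(value, expected, rel_tol=1e-6):
        return 1.0
    if any(math.isclose(value * f, expected, rel_tol=1e-6) for f in (1e3, 1e-3)):
        return 0.5  # e.g. "billion" where "million" was expected
    return 0.0

print(score("Revenue grew strongly.\nAnswer = 79.184 million", 79.184))  # 1.0
print(score("Answer = 79184 million", 79.184))                           # 0.5
print(score("Net revenue in 2021 was 64,215 million euros.", 64.215))    # 0.0, format ignored
```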

Financial Data Table
Question | Correct Answer | ChatGPT-4 | gpt-4-0125 RAG | ChatPDF | AskYourPDF
How much liquidity did this company have at hand at the end of 2021? | 8.122 million | 7.388 million euro | 7.918 million | 10.667 million euros | INVALID
How much liquidity did this company have at hand at the end of 2022? | 7.588 million | 7.388 billion euros | 7.388 million | 11.2 billion euros | 7588 million euros
How many employees did the company have at the end of 2022? | 196006 | 196,006 | 196,006 | 196006 | INVALID
How much were total lease liabilities of the company by the end of 2021? | 14.275 million | 14.275 million | 14,275 million | 14,275 | 14.275 million euros
What amount was recorded for the repayment of lease liabilities in 2022? | 2.453 million | 2.453 million | 2,711 million | 2.711 million | 2.711 million euros
What was the company's net revenue in 2021? | 64.215 million | 64.215 million | 64.215 million | 64.215 million euros | 64,215 million euros
What was the company's net revenue in 2022? | 79.184 million | 79.184 million | 79184 million euros | EUR 79.184 million | EUR 79.184 million
What was the company's net revenue in 2023? | None | None | None | None | INVALID
What was the total shareholder equity at the end of 2022? | 54.314 million | 54.3 billion | 54.314 million | 54.314 billion euros | INVALID
What was the company's gross margin for the year 2021? | 43.860 million | 43.860 million euros | 43.860 million | 43.860 million euros | 43.860 million euros
SCORE | 100 | 70 | 60 | 55 | 40

So far, OpenAI's RAG systems are the best on the market for the task at hand. However, we do not expect this to remain the case for long.

Specialized solutions are capable of achieving higher scores, even without the use of cutting-edge LLMs. We know this for a fact because we have built such systems. One of them even uses Mistral-7B-OpenChat-3.5 to extract information from tens of thousands of PDF documents.

As we extend and enrich this enterprise AI benchmark with more cases and solutions, we expect that ChatGPT will eventually be dethroned.

Trustbit LLM Benchmarks Archive

Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!