Benchmarks for ChatGPT & Co:
October 2023
Our October benchmarks have been improved in many ways compared to the September issue. We also introduce a new, promising model: Mistral 7B.
Benchmarks October 2023
☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama license
Model | Code | Crm | Docs | Integrate | Marketing | Reason | Final 🏆 | Cost | Speed |
---|---|---|---|---|---|---|---|---|---|
GPT-4 v1-0314 ☁️ | 85 | 88 | 95 | 52 | 88 | 50 | 76 | 7.18 € | 0.71 rps |
GPT-4 v2-0613 ☁️ | 85 | 83 | 95 | 52 | 88 | 50 | 75 | 7.18 € | 0.75 rps |
GPT-3.5 v2-0613 ☁️ | 62 | 79 | 76 | 75 | 81 | 48 | 70 | 0.35 € | 0.96 rps |
GPT-3.5-instruct 0914 ☁️ | 51 | 90 | 69 | 60 | 88 | 32 | 65 | 0.36 € | 2.35 rps |
GPT-3.5 v1-0301 ☁️ | 38 | 75 | 67 | 67 | 82 | 37 | 61 | 0.36 € | 1.76 rps |
Llama2 70B Hermes b8🦙 | 48 | 76 | 46 | 76 | 62 | 29 | 56 | 13.10 € | 0.13 rps |
Mistral 7B Instruct f16 ✅ | 36 | 77 | 61 | 44 | 62 | 18 | 50 | 0.42 € | 2.63 rps |
Llama2 70B chat b4🦙 | 13 | 51 | 53 | 29 | 64 | 21 | 39 | 4.06 € | 0.27 rps |
Llama2 13B Vicuna-1.5 f16🦙 | 36 | 25 | 27 | 18 | 77 | 36 | 36 | 0.78 € | 1.39 rps |
Llama2 13B Hermes f16🦙 | 32 | 15 | 25 | 51 | 56 | 39 | 36 | 0.57 € | 1.93 rps |
Llama2 13B Hermes b8🦙 | 31 | 18 | 23 | 44 | 56 | 39 | 35 | 3.65 € | 0.30 rps |
Llama2 70B chat b8🦙 | 1 | 53 | 34 | 27 | 71 | 21 | 35 | 10.24 € | 0.16 rps |
Llama2 13B chat f16🦙 | 0 | 38 | 15 | 30 | 75 | 8 | 27 | 0.64 € | 1.71 rps |
Llama2 13B chat b8🦙 | 0 | 38 | 8 | 30 | 75 | 6 | 26 | 4.01 € | 0.27 rps |
Llama2 7B chat f16🦙 | 7 | 33 | 23 | 26 | 38 | 15 | 24 | 0.69 € | 1.58 rps |
Llama2 13B Puffin f16🦙 | 14 | 6 | 0 | 5 | 54 | 0 | 13 | 1.71 € | 0.64 rps |
Llama2 13B Puffin b8🦙 | 16 | 3 | 0 | 5 | 47 | 0 | 12 | 7.94 € | 0.14 rps |
Mistral 7B f16 ✅ | 0 | 4 | 0 | 25 | 38 | 0 | 11 | 0.92 € | 1.19 rps |
Llama2 7B f16🦙 | 0 | 0 | 4 | 2 | 32 | 0 | 6 | 1.08 € | 1.01 rps |
The benchmark categories in detail
Docs - How well can the model work with large documents and knowledge bases?
Crm - How well does the model support work with product catalogs and marketplaces?
Integrate - Can the model easily interact with external APIs, services and plugins?
Marketing - How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?
Reason - How well can the model reason and draw conclusions in a given context?
Code - Can the model generate code and help with programming?
Cost - The estimated cost of running the workload. For cloud-based models, we calculate the cost according to their pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead. A rough sketch of this estimate follows the list below.
Speed - The estimated speed of the model in requests per second (without batching). The higher the speed, the better.
Highlights and Updates from the October Benchmarks
New Evals
We have integrated 9 new benchmarks into the suite. These benchmarks focus on the areas of "Documents", "Integration" and "Reason". This makes the assessment of model capabilities more precise and increases the total number of different assessments from 85 to 134.
One example is tasks where large language models have to create and process structured data. In the Integration category, we now test how well the models understand and manipulate text in CSV, TSV, JSON, and YAML formats.
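A minimal sketch of what such a structured-data check could look like (the prompt, the `ask_llm` helper and the exact task are illustrative assumptions, not the actual benchmark code):

```python
import csv
import io
import json

def check_csv_to_json(ask_llm) -> bool:
    """Toy check: can the model convert a small CSV table into equivalent JSON?

    `ask_llm` is a hypothetical callable that takes a prompt string and
    returns the model's text response.
    """
    csv_text = "name,city\nAlice,Vienna\nBob,Graz\n"
    prompt = ("Convert the following CSV into a JSON list of objects. "
              "Return only JSON.\n\n" + csv_text)
    answer = ask_llm(prompt)

    expected = list(csv.DictReader(io.StringIO(csv_text)))
    try:
        produced = json.loads(answer)
    except json.JSONDecodeError:
        return False  # invalid JSON counts as a failed assessment

    return produced == expected
```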
Another example concerns our work on business assistants and information search systems for customers. In such cases, large language models need to identify, find, and evaluate relevant pieces of information. Our evaluations help measure various aspects of this capability.
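As a rough illustration of this kind of check (again with a hypothetical `ask_llm` helper and made-up data), an evaluation might ask the model to pick the snippet that actually answers a question:

```python
def check_relevant_snippet(ask_llm) -> bool:
    """Toy retrieval check: does the model point to the snippet that
    answers the question? Snippets and helper are illustrative only."""
    snippets = {
        "S1": "Our support hotline is available on weekdays from 8:00 to 17:00.",
        "S2": "Returns are accepted within 30 days of purchase.",
        "S3": "The warranty covers manufacturing defects for two years.",
    }
    question = "How long do customers have to return a product?"
    prompt = ("Below are knowledge-base snippets with IDs. "
              "Answer with only the ID of the snippet that answers the question.\n\n"
              + "\n".join(f"{sid}: {text}" for sid, text in snippets.items())
              + f"\n\nQuestion: {question}")
    return ask_llm(prompt).strip() == "S2"
```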
In addition to these new assessments, we have also improved some existing assessments by adding few-shot examples and better prompts. Most large language models respond very positively to this.
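For readers unfamiliar with the technique: a few-shot prompt simply prepends a handful of solved examples before the actual task. A minimal, made-up illustration:

```python
# Minimal illustration of a few-shot prompt: two solved examples precede the
# actual task. The reviews and labels are made up for demonstration; the model
# is expected to complete the last line with "positive".
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "The delivery was fast and the product works perfectly."
Sentiment: positive

Review: "The package arrived damaged and support never replied."
Sentiment: negative

Review: "Great value for money, I would order again."
Sentiment:"""
```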
More Guidance
Guidance is a process of helping large language models generate desired text. It works by directing the model's attention to specific text elements (tokens).
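A minimal sketch of the idea, assuming a hypothetical `generate` helper (not a specific library API): instead of letting the model write free-form text, the surrounding structure is fixed and the model only fills narrowly constrained slots.

```python
def guided_status_report(generate) -> str:
    """Toy example of guided generation: the template is fixed, the model only
    fills constrained slots. `generate(prompt, allowed=None, max_tokens=None)`
    is a hypothetical helper, not a real library call."""
    template_start = "Ticket summary:\n- Severity: "
    # The model may only choose one of the allowed tokens here.
    severity = generate(template_start, allowed=["low", "medium", "high"])

    prompt = template_start + severity + "\n- Next step: "
    # A short free-text slot, bounded by a token limit.
    next_step = generate(prompt, max_tokens=20)

    return prompt + next_step
```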
As our experience in obtaining better results from large language models grows, we are incorporating these findings into the benchmarks. Our October release already includes guidance in some of the assessments, further improving the performance of some models.
In the coming months, we plan to provide even deeper guidance for models in task-related areas.
New model with impressive performance: Mistral 7B
Mistral 7B is a new model from a French AI company of the same name. Although it is significantly smaller than the other models, it has surpassed the base configurations of Llama2 70B and all 7B and 13B models.
That is really impressive, and it's worth paying close attention to this model in the coming months. Its cost and throughput characteristics make it even more attractive for local deployments.
Another highlight is that the model is released under the Apache 2.0 license, which is easier to understand and less restrictive than the Llama 2 license. There are no "Google" clauses and no possible confusion regarding the use of the model for non-English languages. Our model markings in the table above reflect this difference.
Trustbit LLM Benchmarks Archive
Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!