August 2023

Benchmarks for ChatGPT & Co:

Updated monthly: The Trustbit LLM Leaderboard compares Large Language Models such as ChatGPT so that you can evaluate their suitability for use in product development.

Trustbit Leaderboard
August 2023

| model | code | crm | docs | integrate | marketing | reason | final |
|---|---|---|---|---|---|---|---|
| OpenAI GPT4 v2-0613 💰 | 85 | 94 | 100 | 67 | 88 | 60 | 82 |
| OpenAI GPT4 v1-0314 💰 | 76 | 97 | 89 | 67 | 75 | 76 | 80 |
| Claude v1 💰 | 62 | 77 | 69 | 58 | 88 | 61 | 69 |
| OpenAI GPT3.5 v2-0613 💰 | 49 | 77 | 84 | 83 | 84 | 39 | 69 |
| Open Models | 46 | 62 | 62 | 100 | 84 | 22 | 63 |
| Llama2 13B Nous Hermes q5_K_M ✅ | 46 | 62 | 62 | 100 | 56 | 21 | 58 |
| Claude v2 💰 | 38 | 58 | 41 | 67 | 82 | 51 | 56 |
| Claude v1 instant 💰 | 72 | 54 | 47 | 67 | 55 | 17 | 52 |
| Vicuna v1.1 13B q4_1 | 30 | 45 | 57 | 83 | 71 | 19 | 51 |
| Vicuna v1.1 13B q8_0 | 31 | 45 | 52 | 42 | 84 | 16 | 45 |
| Vicuna v1.3 13B q5_1 | 36 | 51 | 47 | 50 | 61 | 19 | 44 |
| Vicuna v1.1 13B q5_1 | 31 | 45 | 42 | 33 | 84 | 18 | 42 |
| Puffin v1.3 13B q5_K_M ✅ | 28 | 48 | 53 | 33 | 25 | 22 | 35 |
| Wizard Vicuna 13B Unlocked q5_K_M | 22 | 39 | 53 | 33 | 56 | 0 | 34 |
| Llama2 13B Guanaco q5_1 ✅ | 19 | 42 | 62 | 17 | 38 | 0 | 30 |
| Llama 7B q8_0 | 25 | 30 | 28 | 25 | 50 | 0 | 26 |
| Llama 13B q5_1 | 34 | 9 | 38 | 17 | 44 | 9 | 25 |
| Llama2 7B chat ✅ | 7 | 33 | 11 | 17 | 62 | 14 | 24 |
| Llama2 7B chat Unlocked q8_0 ✅ | 14 | 33 | 33 | 33 | 25 | 0 | 23 |
| Llama2 13B chat q8_0 ✅ | 7 | 33 | 17 | 0 | 66 | 11 | 22 |
| Open Llama 7B instruct q8_0 | 16 | 17 | 38 | 17 | 22 | 14 | 21 |
| Llama 13B q2_K | 0 | 5 | 47 | 33 | 25 | 0 | 19 |
| Llama2 7B ✅ | 18 | 0 | 0 | 0 | 0 | 0 | 3 |
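The page does not say how the final column is derived. The published numbers look consistent with a rounded, unweighted mean of the six category scores, so the sketch below checks that reading against a few rows copied from the table above. The function name `final_score` and the equal weighting are assumptions made for illustration, not Trustbit's documented method.

```python
from statistics import mean

# Category scores copied from the table above, in column order:
# code, crm, docs, integrate, marketing, reason.
# The second tuple element is the published "final" value.
ROWS = {
    "OpenAI GPT4 v2-0613": ([85, 94, 100, 67, 88, 60], 82),
    "Claude v1": ([62, 77, 69, 58, 88, 61], 69),
    "Llama2 13B Nous Hermes q5_K_M": ([46, 62, 62, 100, 56, 21], 58),
    "Claude v2": ([38, 58, 41, 67, 82, 51], 56),
}


def final_score(scores):
    """Assumed aggregation: unweighted mean, rounded to the nearest integer."""
    return round(mean(scores))


for model, (scores, published) in ROWS.items():
    print(f"{model}: computed {final_score(scores)}, published {published}")
```

The published category scores are themselves rounded, so a value recomputed this way can be off by a point for some rows; the Llama 13B q2_K row is one such case.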

The benchmark categories in detail

Documents: How well can the model work with large documents and knowledge bases?
CRM: How well does the model support working with product catalogs and marketplaces?
Integration: Can the model easily interact with external APIs, services and plugins?
Marketing: How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?
Reason: How well can the model reason and draw conclusions in a given context?
Code: Can the model generate code and help with programming?

Latest versions of ChatGPT, Anthropic Claude and Meta Llama on the market

Since the publication of the Trustbit July ranking, there have been several interesting developments.

OpenAI has released new versions of its ChatGPT models (0613) that bring efficiency improvements and support for function calling.
Anthropic has released the second edition of Claude - the closest commercial competitor to OpenAI ChatGPT.
Meta has introduced the second generation of LLaMA - Llama v2.

Each of these releases promises significant improvements in the capabilities of large language models. We have analyzed whether an upgrade is really worthwhile and what to consider.
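One way to approach the upgrade question is to compare the old and new version of a model category by category. The sketch below does this with scores copied from the August 2023 leaderboard table. The helper name `category_deltas` is hypothetical, and the comparison is only an illustration of how to read the table, not the analysis procedure used for this article.

```python
CATEGORIES = ["code", "crm", "docs", "integrate", "marketing", "reason"]

# Scores copied from the August 2023 leaderboard table, in CATEGORIES order.
SCORES = {
    "OpenAI GPT4 v1-0314": [76, 97, 89, 67, 75, 76],
    "OpenAI GPT4 v2-0613": [85, 94, 100, 67, 88, 60],
    "Claude v1": [62, 77, 69, 58, 88, 61],
    "Claude v2": [38, 58, 41, 67, 82, 51],
}


def category_deltas(old, new):
    """Return {category: new score - old score} for two models in SCORES."""
    return {c: n - o for c, o, n in zip(CATEGORIES, SCORES[old], SCORES[new])}


for old, new in [
    ("OpenAI GPT4 v1-0314", "OpenAI GPT4 v2-0613"),
    ("Claude v1", "Claude v2"),
]:
    deltas = category_deltas(old, new)
    # Format each delta with an explicit sign, e.g. "code +9".
    summary = ", ".join(f"{c} {d:+d}" for c, d in deltas.items())
    print(f"{old} -> {new}: {summary}")
```

A per-category view like this makes it easier to weigh an upgrade against the categories that matter for a given product, rather than relying on the final score alone.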

OpenAI ChatGPT-4 0613: can be upgraded

In our tests, the new version of ChatGPT-4 performs slightly better than the previous version. It has received a noticeable speed boost, and its performance in tasks related to code and marketing has improved significantly. However, the ability to reason and work with documents has slightly decreased at the same time.

If you want to get the best possible performance from your knowledge-based enterprise assistant, it might be worth proceeding with caution when migrating.

Anthropic Claude v2: do not upgrade

Anthropic Claude v2 performed noticeably worse in our tests. It seems to have been tuned to be a better chatbot at the expense of product capabilities.

If possible, we recommend continuing to use Claude v1 until the second version improves further.

Meta Llama v2: upgrade recommended

The Llama v2 model is an open model from Meta (Facebook) with a license that is generous toward commercial use. This license finally makes the model usable for serious projects.

On paper, Llama v2 should be a better model. In our tests, however, its base model performs significantly worse than the base model of v1, mainly because it is too talkative and too sensitive to prompts. The base model dominates the lower ranks of our ranking.

But with open models, bad results don't mean the end of the story. They can be trained further by the community.

Nous Research has released its own fine-tuned version of Llama v2, called Nous Hermes. With a final score of 58, Hermes not only outperforms every Vicuna variant in our ranking (51 at best), but also catches up with Claude v2 (56).

What is a leaderboard?

A leaderboard is a ranking or table that compares and ranks different elements, people, or products based on certain criteria. It is used to provide a clear representation of the performance or characteristics of the elements listed and allows viewers to quickly see which elements are at the top or performing best.

What does the Trustbit LLM Leaderboard help me with?

Trustbit's LLM Leaderboard helps you find the most suitable Large Language Model currently available for use in product development. The scores are based on real benchmarks extracted from software products we have developed. They evaluate how well different LLMs perform specific tasks in product development.

What categories are being compared?

The following categories are available for you to evaluate the capabilities of the different models:

Documents: How well can the model work with large documents and knowledge bases?
CRM: How well does the model support working with product catalogs and marketplaces?
Integration: Can the model easily interact with external APIs, services and plugins?
Marketing: How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation?
Reason: How well can the model reason and draw conclusions in a given context?
Code: Can the model generate code and help with programming?

Would you like to learn more about using ChatGPT & Co?

Then we look forward to hearing from you.

christoph.hasenzagl@trustbit.tech

+43 664 88454881