April 30, 2024

4 min read

LLM leaderboards: potentials and limitations of public benchmarks


It has never been easier to build on top of an LLM. But it has never been HARDER to conduct Quality Assurance (QA), measure performance, and actually IMPROVE what you’ve built on top of it.

In 2023, we saw a proliferation of Large Language Model (LLM) releases, each one bragging about the superiority of its performance. The flood of impressive claims makes it harder to tell what counts as true progress in the field and which model is the current winner.


The LLM space is moving very quickly, and it is quite difficult to separate progress from noise.

That's where LLM leaderboards and model evaluations come in.

What are LLM leaderboards?

LLM leaderboards track, rank, and evaluate large language models (LLMs).

Their evaluation is based on different public benchmarks: standardized sets of tasks used to compare the performance of LLMs.

One of the most popular evaluation frameworks is the Eleuther AI Language Model Evaluation Harness, a unified framework to test generative language models on a large number of different evaluation tasks.

The Language Model Evaluation Harness is the backend for Hugging Face's popular Open LLM Leaderboard, has been used in hundreds of papers, and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.
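For a sense of what running these benchmarks looks like in practice, here is a minimal sketch using the harness's Python entry point (v0.4-style API; model, task, and argument names vary between versions, so treat the specifics as illustrative):

```python
# Minimal sketch: scoring a model on two public benchmark tasks with the
# Eleuther AI Language Model Evaluation Harness (v0.4-style API; exact
# task names and arguments differ between versions).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face transformers backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",
    tasks=["hellaswag", "arc_challenge"],              # commonsense inference, science questions
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (accuracy, etc.) are reported under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```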

Two of the best-known LLM leaderboards are:

The Open LLM Leaderboard tracks, ranks, and evaluates large language models and chatbots. It evaluates models based on benchmarks from the Eleuther AI Language Model Evaluation Harness, covering science questions, commonsense inference, multitask accuracy, and truthfulness in generating answers.

Chatbot Arena is a place where LLMs compete anonymously and randomly, with input from many people to determine which gives the best answers in a crowdsourced manner. Its ranking is based on the Elo rating system, a widely used rating system in chess and other competitive games.
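To make that ranking mechanism concrete, here is a minimal sketch of a standard Elo update after a single head-to-head vote. The starting ratings and K-factor are illustrative assumptions, not Chatbot Arena's actual parameters, and the real leaderboard's rating pipeline is more involved.

```python
# Minimal sketch of a standard Elo update for one pairwise "battle".
# K-factor and starting ratings are illustrative, not Chatbot Arena's
# actual configuration.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new ratings after one vote (draws ignored for brevity)."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: both models start at 1000 and a voter prefers model A's answer.
print(update_elo(1000, 1000, a_won=True))  # -> (1016.0, 984.0)
```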

Beyond the Open LLM Leaderboard and Chatbot Arena, there are dozens of other leaderboards built around different benchmarks, metrics, and purposes.

The buzz around leaderboards and evaluating LLMs is real.

However, there is an equally important conversation that is often forgotten.

Evaluating a foundational Large Language Model is one thing, but assessing an application built on top of an LLM is an entirely different matter.

When it comes to evaluating real-world applications, LLM leaderboards fall short.

Leaderboards aren't effective for evaluating LLM apps.

LLM leaderboards such as Chatbot Arena or the Open LLM Leaderboard are popular and engaging, but they don't make sense as a way to measure real-world applications.

Here are 3 key points to consider:

1) LLM leaderboard metrics don't take into account your app setup

The basic anatomy of most LLM apps consists of three building blocks (sketched in code below):

  1. Model provider (OpenAI, Claude, Mistral, etc);
  2. System prompt (the instructions you feed to the model);
  3. Data fed to the system (RAG, in-context learning, etc).
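Here is the sketch referenced above: a hypothetical wiring of those three building blocks for a single request. `retrieve_documents` and `call_model` are made-up stand-ins for your retrieval layer and provider SDK, not real library calls.

```python
# Hypothetical sketch of the three building blocks of an LLM app wired
# together for a single request. The helper functions are stand-ins,
# not real library calls.

SYSTEM_PROMPT = "You are a support assistant. Answer only from the provided context."

def retrieve_documents(question: str) -> str:
    """Stand-in retrieval step: replace with your vector store / RAG lookup."""
    return "...relevant snippets from your knowledge base..."

def call_model(messages: list) -> str:
    """Stand-in provider call: replace with your OpenAI / Claude / Mistral SDK."""
    return "...model response..."

def answer(question: str) -> str:
    # 3) Data fed to the system (RAG, in-context learning)
    context = retrieve_documents(question)

    # 2) System prompt plus the user input, assembled into messages
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

    # 1) Model provider behind a single call; swapping providers only changes this line
    return call_model(messages)

print(answer("How do I reset my password?"))
```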

LLM leaderboards don't take into account your system prompt, the data you feed to the system, or what success actually looks like for your real-world users.

2) Your LLM provider is the least sensitive part of the equation

In practice, the choice of model provider makes minimal difference when building an application on top of an LLM.


Of the three building blocks listed above, the model provider is the least critical decision.

"All LLM models converge to a certain point when trained on the same data. ie, there is really no differentiation between one model or the other. Claims about out-performance on tasks are just that, claims. the next iteration of llama or mistral will converge. LLMs seem to evolve like linux/windows or ios/android with not much differentiation in the foundation models." (Hacker News thread)

Switching to a new model provider after taking a look at a leaderboard doesn't guarantee an overall improvement in your LLM application's performance.

Just because the latest model released performs better on leaderboards doesn't mean the app you've built using it will also perform better.

3) LLM leaderboard datasets are generic and static

LLM leaderboards rely on generic benchmark datasets that aren't tailored to real-world use cases.

It's essential to have control over the dataset for a customized evaluation specific to your use case. The closer the dataset matches your data in production, the more accurate your evaluation will be.

"Baseline evaluation needs to be the core of any AI app."
Alex Graveley, creator of GitHub Copilot

Another problem is that the datasets used in LLM leaderboards don't evolve over time. They're static. LLM applications, on the other hand, are extremely dynamic. In real-world scenarios, new use cases pop up all the time, and it is difficult to predict how users will interact with your app. Your dataset and testing strategy need to evolve as your users do.
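As a sketch of the alternative, the example below assumes a small evaluation set drawn from production-like traffic and a simple pass/fail check. The dataset, the stand-in app function, and the scoring rule are all illustrative; real evaluations typically combine string checks, model-graded rubrics, and human review.

```python
# Sketch of a use-case-specific evaluation: a small dataset drawn from
# real production traffic, re-run against the app whenever the prompt,
# data, or model provider changes. Everything here is illustrative.

EVAL_SET = [
    {"question": "How do I reset my password?", "must_mention": "reset link"},
    {"question": "What is your refund window?", "must_mention": "30 days"},
]

def my_llm_app(question: str) -> str:
    """Stand-in for your real application (prompt + data + provider)."""
    return "You can request a reset link from the login page."

def run_eval() -> float:
    """Return the fraction of eval cases whose answer mentions the expected phrase."""
    passed = 0
    for case in EVAL_SET:
        response = my_llm_app(case["question"])
        if case["must_mention"].lower() in response.lower():
            passed += 1
    return passed / len(EVAL_SET)

print(f"pass rate: {run_eval():.0%}")  # the eval set should grow as real usage evolves
```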

Conclusion

In short, while LLM leaderboards are good for comparing model providers, they often overlook the complexities of real-world applications. They don't consider how each app is set up, and the model provider choice doesn't matter much. Also, the datasets they use are too rigid, not keeping up with real-world changes.

When you have something in production, measurement needs to be concrete, not abstract the way LLM leaderboards are.

Teams should focus on testing their apps in real situations to make sure they work well where it counts.

Written by
Rafael Pinheiro

Co-founder at Ottic