Large language models (LLMs) like GPT-4 power many of the generative AI applications businesses depend on, from question answering to code generation. But not all LLMs are created equal, and benchmarking them to find the right model for each application is costly.
What if you could confidently benchmark model performance using a fraction of the data you currently use?
While conducting a dependency analysis of the Hugging Face Open LLM Leaderboard, our Data & AI Research Team (DART) noticed something surprising: remarkably high correlations among the leaderboard's benchmark scores.
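To make that concrete, here is a minimal sketch of what such a correlation analysis looks like. The benchmark names match those used on the Open LLM Leaderboard, but the model names and score values below are hypothetical placeholders, not DART's actual data:

```python
import pandas as pd

# Hypothetical per-model accuracy scores (as fractions) on several
# Open LLM Leaderboard benchmarks -- placeholder values, not real results.
scores = pd.DataFrame(
    {
        "ARC":        [0.61, 0.55, 0.48, 0.67, 0.52],
        "HellaSwag":  [0.83, 0.78, 0.72, 0.86, 0.75],
        "MMLU":       [0.63, 0.57, 0.45, 0.70, 0.50],
        "TruthfulQA": [0.44, 0.41, 0.38, 0.49, 0.40],
    },
    index=["model_a", "model_b", "model_c", "model_d", "model_e"],
)

# Pairwise Pearson correlations between benchmarks, computed across models.
# High off-diagonal values mean the benchmarks rank models very similarly.
print(scores.corr())
```

When those off-diagonal correlations are consistently high, much of the signal in one benchmark is already present in the others, which hints that far less evaluation data may be needed than a full run suggests.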
Hugging Face’s leaderboard, along with tools like lmsys.org’s Chatbot Arena, ranks among the most trusted resources for evaluating open-source LLMs. So those strong correlations made us wonder…
Could we create accurate LLM benchmarks using only a fraction of the tests’ datasets? If so, that'd mean software engineers and data scientists could evaluate and deploy LLMs much more efficiently.
The answer, we discovered, is yes: dataset subsampling is an efficient proxy for evaluating LLMs on specific tasks. This means you can streamline your LLM benchmarking process, making it faster and cheaper to find which models perform best for your applications. Using a future-proof enterprise AI platform then allows you to easily fine-tune or swap in different LLMs as technology and use cases evolve and mature over time.
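As a rough sketch of the idea, here is what subsampling looks like in practice. The per-question results below are simulated stand-ins for a real model's graded outputs on a benchmark, not actual evaluation data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-question results for one model on a 10,000-item benchmark:
# 1 = answered correctly, 0 = answered incorrectly. Placeholder data only.
full_results = rng.binomial(1, p=0.65, size=10_000)
full_accuracy = full_results.mean()

# Estimate the same accuracy from a small random subsample of the questions.
subsample_size = 500
subsample = rng.choice(full_results, size=subsample_size, replace=False)
estimated_accuracy = subsample.mean()

print(f"Full-dataset accuracy: {full_accuracy:.3f}")
print(f"Subsample estimate (n={subsample_size}): {estimated_accuracy:.3f}")
```

If the subsample estimate tracks the full-dataset score closely across models and tasks, you can run a fraction of the benchmark and still rank candidate models reliably.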
Download the Whitepaper