
LLM Benchmarks Whitepaper: How Data Subsampling Makes Evaluating LLMs Faster and Cheaper

Large language models (LLMs) like GPT-4 power many of the generative AI applications businesses depend on, from question answering to code generation. But not all LLMs are created equal, and benchmarking them to find the right model for each application is costly and time-consuming.

But what if you could confidently benchmark model performance using a fraction of the data you currently use?

Download the Whitepaper

Why We Did This LLM Benchmarking Study

While conducting a dependency analysis of the Hugging Face Open LLM Leaderboard, our Data & AI Research Team (DART) noticed something surprising: remarkably high correlations among the leaderboard’s accuracy scores.
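
As a rough illustration (not our actual analysis pipeline), the sketch below computes pairwise Pearson correlations between benchmark scores with pandas. The scores are made-up placeholder values for eight hypothetical models, not real leaderboard data.

```python
import pandas as pd

# Placeholder scores for eight hypothetical models; a real analysis
# would pull per-model results from the Open LLM Leaderboard.
scores = pd.DataFrame({
    "ARC":        [61.0, 59.3, 64.6, 55.2, 67.4, 50.1, 62.8, 58.9],
    "HellaSwag":  [83.5, 82.1, 85.9, 78.4, 87.0, 72.3, 84.2, 80.7],
    "MMLU":       [63.4, 61.8, 66.2, 54.9, 69.8, 46.7, 64.1, 59.5],
    "TruthfulQA": [44.9, 43.2, 47.1, 40.8, 51.6, 37.4, 46.3, 42.0],
})

# Pairwise Pearson correlations across benchmarks: values near 1.0 mean
# the benchmarks rank models in nearly the same order.
print(scores.corr(method="pearson").round(2))
```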

Hugging Face’s leaderboard, along with tools like lmsys.org’s Chatbot Arena, ranks among the most trusted resources for evaluating open-source LLMs. So those strong correlations made us wonder…

Could we produce accurate benchmark results using only a fraction of each test’s dataset? If so, software engineers and data scientists could evaluate and deploy LLMs far more efficiently.

We discovered the answer is yes: dataset subsampling is an efficient proxy for full-dataset evaluation on specific tasks. That means you can streamline your LLM benchmarking process and find which models perform best for your applications faster and at lower cost. A future-proof enterprise AI platform then lets you easily fine-tune or swap in different LLMs as technology and use cases evolve and mature.
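
Here’s a minimal sketch of the idea, not WillowTree’s exact methodology: score a model on a random fraction of a benchmark’s test items and treat the result as an estimate of its full-dataset accuracy. The `evaluate_fn` hook is hypothetical; in practice you’d wire in an evaluation harness such as EleutherAI’s lm-evaluation-harness.

```python
import random

def subsample_accuracy(examples, evaluate_fn, fraction=0.1, seed=42):
    """Estimate benchmark accuracy from a random subsample.

    examples:    the full benchmark dataset (a list of test items)
    evaluate_fn: hypothetical callable returning True when the model
                 answers an item correctly
    fraction:    share of the dataset to actually score
    """
    rng = random.Random(seed)               # fixed seed for reproducibility
    k = max(1, int(len(examples) * fraction))
    sample = rng.sample(examples, k)        # sample without replacement
    correct = sum(evaluate_fn(ex) for ex in sample)
    return correct / k
```

Because inference dominates benchmarking cost, scoring 10% of the items cuts time and spend by roughly 90% while, with a representative sample, staying close to the full-dataset score.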


In this study, you’ll learn:

  • How various tests affect the time and cost required to benchmark an LLM
  • How WillowTree used data subsampling to evaluate eight open-source LLMs on four benchmark tasks (ARC, HellaSwag, MMLU, TruthfulQA)
  • How to replicate these evaluations and choose metrics for your own LLM benchmarking (see the sizing sketch after this list)
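
One textbook way to size a subsample, offered here as general statistics rather than the whitepaper’s formula: the 95% margin of error for an accuracy estimated from n sampled items is about 1.96 * sqrt(p(1-p)/n), with a finite-population correction because benchmark items are drawn without replacement. The ~14,000 figure below approximates MMLU’s test-set size.

```python
import math

def margin_of_error(n_sample, n_total, p=0.5, z=1.96):
    """Worst-case (p = 0.5) 95% margin of error for an accuracy
    estimated from n_sample of n_total benchmark items."""
    se = math.sqrt(p * (1 - p) / n_sample)                 # binomial standard error
    fpc = math.sqrt((n_total - n_sample) / (n_total - 1))  # finite-population correction
    return z * se * fpc

# Scoring 1,000 of MMLU's ~14,000 test items keeps the accuracy
# estimate within roughly +/-3 percentage points.
print(round(margin_of_error(1_000, 14_000), 3))  # -> 0.03
```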
