With many models and metrics available, many digital leaders across industries are left wondering: How do I choose the right generative AI benchmarks? Is a popular, one-size-fits-all benchmark sufficient, or should I employ a variety of unique benchmarks?
Benchmarking isn’t just about picking the best model for a given use case (an admittedly crucial step in the generative AI journey); it’s about continuously assessing and refining your entire AI system to ensure it is high-performing, meets your evolving business needs, resists vulnerabilities, and adheres to responsible AI practices.
This post explores how to approach benchmarking from this holistic perspective.
Before proceeding with model selection and benchmarking, it’s crucial to clearly define what “good” means for your specific business context. This involves more than selecting the best-performing model on a public benchmark — it’s about ensuring the AI system aligns with your business goals and addresses your key challenges.
The first step is to clearly articulate the business problem your generative AI system is meant to solve. As we discussed in a recent article on conversational AI, projects that focus on real world business problems are much more likely to be successfully deployed to production. Once we understand the business problem and have defined outcomes, we can start to break down how varying aspects of the output of a generative AI system affect those outcomes.
Defining what “good” means for your specific use case lays the foundation for a benchmarking strategy tailored to your business needs and capable of delivering meaningful results.
Understanding the potential vulnerabilities associated with your application, specifically for your organization and product, is also essential. This determination is going to vary from one use case to another.
GenAI systems intended to answer questions on internal documentation for employees, for example, are going to be very concerned with faithfulness (the extent to which the output factually aligns with the context provided), whereas a system intended to be used to generate avatars on a video game platform may be less concerned with accurate representations of a prompt. That same internal support system may not be concerned with the company’s brand voice since it’s internal.
In contrast, if the company were offering a public-facing conversational AI interface to answer questions about products, they would want to spend considerable time ensuring certain hard-to-measure dimensions such as politeness.
Our methodology typically includes engaging clients in single- or multi-day Responsible AI Workshops, where we start by pinpointing potential risks and guardrails. From a governance perspective, it is essential to set policies and establish oversight procedures (without this, GenAI experimentation can quickly devolve into the Wild West), while also evaluating technical factors, potential biases, fairness issues, and societal impacts. Our governance framework draws on industry best practices for responsible AI, such as the NIST AI Risk Management framework, as well as emerging global regulatory guidance.
Other activities during this phase may include bias testing, algorithm audits, data studies, employee training, and community impact analyses. We also assist clients in developing model cards, terms of use, and other documentation that promotes AI transparency.
While responding to these above challenges unique to GenAI systems, companies should also maintain best practices against more traditional cyber threats such as hacking, phishing, and social engineering.
With continuous governance, we help clients adhere to their core principles while reaping the benefits of artificial intelligence.
Public benchmarks like GLUE (a popular benchmark for natural language understanding), or TruthfulQA (popular to determine a model’s ability to avoid generating false or misleading information), are valuable starting points when evaluating generative AI models. These benchmarks provide standardized datasets and evaluation metrics, enabling a quick comparison across multiple models.
However, while these public benchmarks are a good starting point, they often measure generalized capabilities that may not align sufficiently with the specific needs of your business application. They may not, for example, capture nuances like brand tone, contextual relevance, subject matter expertise, or adherence to specific industry compliance standards.
Additionally, public benchmarks tend to evaluate models in very controlled environments, which may not account for the complexities and variations of your real world data. Therefore, it’s important to complement these initial evaluations with task-specific benchmarks tailored to the unique challenges and objectives of your overall application. We’ll soon be sharing our specific, in-depth guidance on task-specific benchmarking for summarization, for instance.
While it’s tempting to think of these benchmarks as helping you choose the best model for your application, the reality is that, in many cases, the best solution might involve a combination of multiple models. For example, one model might be superior in generating content, while another is better at understanding context or ensuring factual accuracy. Speed and cost are also important factors — smaller models tend to be faster and cheaper and they may be just as good (or good enough) for certain components of your application where a larger, more expensive model isn’t necessary.
By leveraging multiple models, each optimized for specific aspects, you create a system that balances strengths and mitigates weaknesses. This approach also allows for dynamic adaptation — switching models based on task requirements or evolving data — to maintain optimal performance.
The work of benchmarking doesn’t end with the selection and implementation of a model or a combination of models. Generative AI systems are fundamentally nondeterministic, meaning they do not always produce the same output given the same input. The models operate with at least some degree of inherent randomness, which can lead to different results even when given identical prompts. This nondeterministic behavior is what allows generative AI systems to return diverse and creative responses, but it also poses unique challenges when it comes to maintaining consistent performance. One recent study demonstrated that even minor adjustments, like adding a space to the end of a prompt, can significantly change the response.
Given this variability, it’s important to run evaluations multiple times to capture a range of potential behaviors and ensure your system performs consistently. Continuous benchmarking throughout the development process allows you to detect unintended changes in output and adjust accordingly. This approach not only helps with the prompt engineering process but also ensures that the AI system remains aligned with the business objectives and ethical considerations as it evolves.
Red teaming, a method traditionally used in cybersecurity, involves deliberately testing a system by simulating adversarial attacks from hackers or attempting to find more subtle vulnerabilities. In the context of AI, red teaming can act as a specialized form of benchmarking focused on security, fairness, and robustness. Just as your system's outputs need regular evaluation to stay aligned with business objectives, your security benchmarks, including red teaming exercises, may need to evolve over time to address new risks and challenges.
Even once an application is deployed to production, continual evaluation is important for ensuring that your AI system adapts to changing environments and potential new threats.
By embracing a strategy of regular, ongoing evaluation, you gain a clearer understanding of how your AI system performs in varied, real world scenarios. This, in turn, helps mitigate risks, maintain compliance with responsible AI practices, and maximize the value derived from your generative AI investments.
While the automated benchmarking and testing we’ve discussed here are essential for evaluating the performance of generative AI systems, it’s important to note that they do not capture the full picture. To ensure your system truly meets user needs and expectations, it is equally important to gather and incorporate feedback from actual users.
User feedback provides insights into how well the AI system aligns with the end-user's experience, which can be challenging to quantify through automated tests alone. For example, an AI-powered customer support bot may perform well on traditional benchmarks like response accuracy, but real users might find its tone unfriendly or its responses too slow. Engaging users in pilot programs, A/B testing, or structured feedback sessions can reveal such qualitative dimensions, helping you to identify areas for improvement that might otherwise go unnoticed.
Benchmarking generative AI systems is a complex and ongoing process that extends beyond initial model selection. By defining success in alignment with your unique business objectives, continuously evaluating the system, and incorporating real user feedback, you can create a comprehensive strategy that ensures reliable and effective outcomes.
WillowTree’s Data & AI Research Team is actively exploring a wide array of benchmarks and red teaming ops. Learn more and connect with us through our Data & AI consulting services.
One email, once a month.