
LLM Evaluation Framework: How to Prevent Drift and System Degradation

Recent studies show how much large language model (LLM) performance changes over time. When Stanford and UC Berkeley researchers evaluated GPT-3.5 and GPT-4 versions released just months apart, they found both models drifted substantially. Their evals showed that “the behavior of the ‘same’ LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.”

This means businesses need a reliable LLM evaluation framework to identify when and why drift occurs. Without such a framework, even a minor change in LLM behavior can snowball into major performance issues:

  • decreased productivity resulting in higher operational costs
  • skewed analytics that mislead strategic decision-making
  • incorrect answers (aka AI hallucinations) that violate policy

The stakes are highest when real-time answers are required and any delay or error could lead to critical failures. Think of the generative AI in healthcare physicians depend on for diagnostics, or complex financial transactions that must meet equally complex regulatory requirements.

Since any LLM drift could disrupt operations or erode trust (or both), we continually monitor our clients' generative AI systems as if they were our own. Through that monitoring and testing, the Data and AI Research Team (DART) here at WillowTree has developed the four-step LLM evaluation framework we use regularly.

Our approach is somewhat counterintuitive: by using LLMs themselves to benchmark the performance of LLM-based generative AI systems, we can build a framework that streamlines LLM evaluation across different system states, such as before and after enhancements to the application.

WillowTree's LLM Evaluation Framework: The What, Why, and How

Our evaluation framework spans four steps:

  1. Choose LLM evaluation metrics
  2. Create a gold standard dataset
  3. Generate new responses
  4. Compare new responses to the gold standard dataset

We recognize this framework is inherently eyebrow-raising: Why trust a potentially error-prone LLM to evaluate itself?

Because it’s much faster (and more accurate) than manually comparing texts, a process that demands substantial time and effort from subject matter experts. Not to mention, human comparison is subject to reviewer bias and error in the same way that an LLM may be biased.

Note we’re not suggesting the removal of human oversight from the process. Rather, our LLM evaluation framework better balances human expertise and generative AI efficiency, giving us a less resource-intensive way to maintain operational health and accuracy. Regular benchmarking allows us to spot symptoms of system drift and address them swiftly.

You can deploy this framework across a wide range of use cases, from voice-enabled chatbots to conversational AI assistants for financial services. Throughout this post, you’ll see results from the testing approach we used to 1) validate our evaluation framework and 2) establish reliable baselines for measuring the performance of our clients’ LLM systems. We ran our tests using GPT-4, but you could apply the framework to any open-source or proprietary language model.

1. Choose Your LLM Evaluation Metrics

Your framework can measure any quality of your LLM-generated content you'd like. For example, you could track relevancy, tone of voice, or even brevity. For our test, we chose truthfulness and informativeness to measure LLM accuracy. We dive deeper into these metrics in our blog on LLM testing.

Truthfulness

The truthfulness metric shows how well an LLM-generated answer aligns with our gold standard answer (more on gold standard answers in the next step). It considers questions such as: Is the answer complete? Does it contain half-truths? Does it fail to acknowledge information that is present in the context?

Informativeness

The informativeness metric measures how well an LLM-generated answer provides all the necessary information compared to the gold standard answer. It looks for any missing or additional information that may affect the overall quality of the response.

2. Create a Gold Standard Dataset

By creating a “gold standard” dataset, we mean generating question-answer pairs we can use as our ground truth for model evaluation.

Consider the following excerpt from Wikipedia about the history of Earth (a rather static domain). Let’s use it as the context to allow for ground-truth annotation. We use the known facts in the paragraph to generate question-answer pairs, then test the system on those questions to measure accuracy via truthfulness and informativeness.

"The Hadean eon represents the time before a reliable (fossil) record of life; it began with the formation of the planet and ended 4.0 billion years ago. The following Archean and Proterozoic eons produced the beginnings of life on Earth and its earliest evolution. The succeeding eon is the Phanerozoic, divided into three eras: the Palaeozoic, an era of arthropods, fishes, and the first life on land; the Mesozoic, which spanned the rise, reign, and climactic extinction of the non-avian dinosaurs; and the Cenozoic, which saw the rise of mammals. Recognizable humans emerged at most 2 million years ago, a vanishingly small period on the geological scale."

—History of Earth (Wikipedia)

A gold standard question–answer pair for this piece of text might be:

  • Question: “What is the chronological order of the eons mentioned in the paragraph?”
  • Answer: “The chronological order of the eons mentioned is the Hadean, followed by the Archean and Proterozoic. The Phanerozoic is the last one and it is divided into the Paleozoic, Mesozoic, and Cenozoic eras.”

This question–answer pair can now be added to the gold standard dataset, with the answer serving as the candidate gold standard answer.
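
In practice, we often draft these question–answer pairs with help from an LLM and then review them manually before they enter the dataset. A minimal sketch of that drafting step might look like the following (the generate_gold_standard_pair helper and its prompt are illustrative, not a fixed part of the framework):

# Draft a candidate gold standard question-answer pair from a context passage
import openai

def generate_gold_standard_pair(context):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are an assistant that writes question-answer pairs for evaluation datasets.",
            },
            {
                "role": "user",
                "content": f"Based only on the context: '{context}', write one factual question and its complete answer, formatted as 'Question: ...' and 'Answer: ...'",
            },
        ],
    )
    return response.choices[0].message["content"].strip()

A subject matter expert should still review each generated pair before it becomes ground truth.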

3. Generate New Responses

Now, suppose you run some enhancements to your LLM-based application and you need to re-evaluate the overall system accuracy. You would then generate a new batch of responses to compare with the gold standard dataset you previously created.

To illustrate, let's generate a new LLM response to evaluate.

# Generate LLM answer
import openai

def generate_LLM_answer(context, GS_question):
    # Ask the model to answer a gold standard question using only the provided context
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are an LLM evaluation assistant that answers the question correctly.",
            },
            {
                "role": "user",
                "content": f"Based on the context: '{context}', please answer the question: '{GS_question}'",
            },
        ],
    )
    return response.choices[0].message["content"].strip()
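
Calling this function with the Wikipedia excerpt as the context and our gold standard question might look like the following sketch (the variable names are illustrative):

# Example call; context holds the Wikipedia excerpt from step 2
GS_question = "What is the chronological order of the eons mentioned in the paragraph?"
LLM_answer = generate_LLM_answer(context, GS_question)
print(LLM_answer)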

Output (Correct Answer): “The chronological order of the eons mentioned in the paragraph is: Hadean, Archean, Proterozoic, and Phanerozoic.”

Now that we have our correct answer, we can evaluate it against our ground truth.

4. Compare the LLM Answer to the Gold Standard Dataset

Our evaluation approach involves two principal aspects at this step:

  • using the LLM to compare our gold standard answers with our new responses
  • summarizing the results so we can easily visualize and reference them

Again, we know you might be suspicious: We used an LLM to evaluate the accuracy of an LLM? Isn’t that circular? But remember, the LLM is simply comparing sentences and paragraphs for accuracy. The model isn’t cheating or otherwise behaving unethically or out of scope.

Diagram showing how WillowTree's LLM evaluation framework works for scoring the accuracy of generated responses.

“What we care about is the ability to compare two pieces of text,” says WillowTree’s Michelle Avery, Group VP of AI. “Can we take two answers to the same question and put this in ChatGPT and say, ‘Answer One is our ground truth. Does Answer Two completely represent Answer One?’ The LLM can compare the two sentences. It can tell us how close they are to each other, and therefore how accurate the answer is when compared to the ground truth.”

In the examples below, we show how we evaluated new responses for truthfulness and informativeness.

LLM as evaluator

We prompt the LLM to act as an evaluator by comparing a gold standard answer with a new LLM-generated answer. The LLM is tasked with responding with a binary “Yes” or “No” for each of the two features being evaluated: truthfulness and informativeness.

# Evaluate Answers
def evaluate_answers(GS_answer, LLM_answer):
    # Ask the model to judge the new answer against the gold standard answer
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are an LLM evaluation assistant that will evaluate the given information.",
            },
            {
                "role": "user",
                "content": f"Given the gold standard answer '{GS_answer}' and the language model's response '{LLM_answer}', would you say that the language model's response is truthful when compared to the gold standard? Please respond with Yes or No.",
            },
        ],
    )
    return response.choices[0].message["content"].strip()
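
Since the informativeness check shown below uses the same comparison prompt with “informative” swapped in for “truthful,” one convenient option is to parameterize the prompt on the metric name. The evaluate_metric helper below is a sketch of that idea, not our exact implementation:

# Sketch: the same evaluator prompt, parameterized on the metric name
def evaluate_metric(GS_answer, LLM_answer, metric="truthful"):
    # metric is either "truthful" or "informative"
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are an LLM evaluation assistant that will evaluate the given information.",
            },
            {
                "role": "user",
                "content": f"Given the gold standard answer '{GS_answer}' and the language model's response '{LLM_answer}', would you say that the language model's response is {metric} when compared to the gold standard? Please respond with Yes or No.",
            },
        ],
    )
    return response.choices[0].message["content"].strip()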

Recall that we chose truthfulness and informativeness as our two metrics to evaluate 1) how factual the response is and 2) if the presented facts are explained fully when compared against our ground truth.

Truthfulness

Truthfulness evaluates for factuality. More specifically, it measures the factual completeness of our new LLM responses by comparing them against our gold standard answers. Here’s how this example should run if set up correctly:

“Considering the gold standard answer…

“‘The chronological order of the eons mentioned is the Hadean, followed by the Archean and Proterozoic. The Phanerozoic is the last one and it is divided into the Paleozoic, Mesozoic, and Cenozoic eras.’

“And the language model's response…

“‘The chronological order of the eons mentioned in the paragraph is: Hadean, Archean, Proterozoic, and Phanerozoic.’

“Would you say that the language model's response is truthful when compared to the gold standard? Please respond with ‘Yes’ or ‘No.’

“Result: Yes.”

The result shows the new LLM response is correct in producing the chronological order of eons.

Informativeness

Of course, in addition to evaluating how factual a new response is (truthfulness), we also want to evaluate whether the presented facts are explained fully when compared against the ground truth of our gold standard answers (informativeness).

The response above may have truthfully listed the chronological order of the eons, but it left out a deeper layer of information that would have aided comprehension. Namely, that the last eon is further “divided into the Paleozoic, Mesozoic, and Cenozoic eras.”

Therefore, we’d expect the eval to run like this:

“Would you say that the language model’s response is informative when compared to the gold standard? Please respond with ‘Yes’ or ‘No.’

“Result: No.”

The result shows the new LLM response is not as informative as the gold standard.

Of course, this opens up a new can of worms. Could we, for instance, score parameters like truthfulness and informativeness on more of a sliding scale (and would doing so introduce new layers of subjective bias)? One could argue the result above should still be “Yes,” just with a low score, much as a half-truth can be technically true. We’ll dig deeper into this question in future posts.

Summarizing results

Once the LLM completes its evaluation, we compile and summarize the results by calculating the percentage of accurate responses for each metric: truthfulness and informativeness.

So if out of 100 answers the LLM finds 85 of its responses were truthful (in accordance with the gold standard) and 70 were informative, we can deduce that the LLM was 85% truthful and 70% informative.
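
A minimal sketch of that tally, assuming the evaluator's “Yes”/“No” responses have already been collected into Python lists:

# Summarize Yes/No evaluation results into percentage scores
def summarize_results(truthful_results, informative_results):
    truthfulness = 100 * truthful_results.count("Yes") / len(truthful_results)
    informativeness = 100 * informative_results.count("Yes") / len(informative_results)
    return truthfulness, informativeness

# 85 "Yes" answers out of 100 truthfulness checks and 70 out of 100
# informativeness checks yields (85.0, 70.0), i.e., 85% truthful and 70% informative.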

To test, we can add an incorrect answer to see how the results are captured, which would look something like this:

“Considering the gold standard answer…

“‘The chronological order of the eons mentioned is the Hadean, followed by the Archean and Proterozoic. The Phanerozoic is the last one and it is divided into the Paleozoic, Mesozoic, and Cenozoic eras.’

“And the language model's response…

“‘The chronological order of the eons mentioned in the paragraph is the Phanerozoic, Proterozoic, Archean, and then the Hadean.’”

“Would you say that the language model's answer is truthful? Please respond with ‘Yes’ or ‘No.’

“Result: No.”

As we can see, the LLM-generated response is in reverse chronological order, making it false and not truthful.
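
In code, that test is simply a matter of passing a deliberately wrong answer through the same evaluator. Here is a sketch reusing evaluate_answers from above, where GS_answer is assumed to hold the gold standard answer from step 2:

# Sanity check: a deliberately incorrect answer should come back as "No"
wrong_answer = (
    "The chronological order of the eons mentioned in the paragraph is "
    "the Phanerozoic, Proterozoic, Archean, and then the Hadean."
)
print(evaluate_answers(GS_answer, wrong_answer))  # Expected output: "No"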

Framework Limitations and Ethical Considerations

Certain factors should be taken into account when deploying or fine-tuning this framework. To start, scaling up to assess large datasets is challenging, and relying on human- or auto-generated gold standards introduces potential biases and subjectivity. Therefore, it’s important to understand that the accuracy and completeness of LLM-generated gold standards can impact the quality of evaluation.

However, this will be the case for any evaluation dataset, human-generated or automated: the quality of your gold standard questions and answers will always affect the quality of evaluation. Moreover, having knowledge experts craft questions and answers (based on context they know the LLM has access to) takes time and expertise. Leveraging an LLM relieves this constraint, but still requires manual review.

Finally, an automated evaluator built from the same system it evaluates may introduce bias and lack objectivity. Any biases the LLM exhibits come from the data used to train the model and are thus inherent to the LLM. To ensure responsible AI, we must adhere to principles that minimize harm and avoid promoting biases.

Build the Right Framework for Evaluating Your LLMs

Our four-step framework provides a structured methodology for assessing large language model performance and reliability. As we continue refining and iterating upon this framework, we’ll explore the capabilities and limitations of LLM-based systems at deeper levels, ultimately enabling us to develop more reliable, interpretable generative AI technology.

As an example, we’ve also discovered that LLMs are highly effective for retrieval augmented generation (RAG) benchmarking and evaluation.

If you need help building or fine-tuning a continuous evaluation framework for your own LLM-based system, our eight-week GenAI Jumpstart program is a great fit: it's designed for projects like these, where we can rapidly iterate together on a short timeline.

Learn more about our GenAI Jumpstart accelerator and our enterprise-grade AI platform, Fuel iX.

Milton Leal
Christopher Frenchi
