NOTE: This article is Part III in our series on LLM Benchmarking. In Part I, we established how to benchmark LLM-based systems' accuracy to prevent system degradation. In Part II, we took a deeper dive into evaluating truthfulness as an element of LLM accuracy. If you need help setting up a cost-efficient evaluation process for your generative AI applications, the DART team at WillowTree is ready to help. Get started by learning about our eight-week GenAI Jumpstart program.
Introduction
Having established the foundations of large language model (LLM) benchmarking in Parts I and II of this series, this article will explore using an LLM to evaluate the results of retrieval augmented generation (RAG) systems.
With retrieval augmented generation, you essentially hook up a database to a large language model and bias a chatbot or AI-enabled assistant to retrieve information stored in that database rather than drawing on its broader external knowledge. We then use a separate LLM to evaluate whether the RAG system is performing as intended. More LLMs testing LLMs!
As we begin testing RAG systems, generating data, and interpreting the results, manual comparisons become challenging, raising questions about bias, truth, and utility. The question driving this experiment: can we automate the process to create a self-evaluating generative AI system?
In this article, we walk through how the Data and AI Research Team (DART) at WillowTree created an evaluation framework to compare an end-to-end RAG chatbot LLM output against a known base truth. Using this evaluation methodology, we can better understand how to efficiently test a RAG system.
What is Retrieval Augmented Generation?
RAG systems utilize embeddings and semantic search to retrieve stored knowledge that augments a large language model’s external knowledge (aka, “world knowledge”). We can think of embeddings as a mathematical representation of an object that captures similarities or relationships.
Essentially operating as AI-powered search engines, they retrieve and summarize specific data based on semantic similarities. By storing information that the LLM doesn’t necessarily know in a database, we can use RAG to query organization-specified documents and return the needed information for a chatbot or AI assistant to answer questions. The importance of manual and automated evaluation in these systems lies in verifying the accuracy and effectiveness of their responses.
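As a rough sketch of how that retrieval works under the hood, the system embeds the user’s query, compares it against pre-computed embeddings of the stored chunks, and passes the closest match to the LLM. The snippet below assumes an OpenAI-style embeddings client purely for illustration; any embedding model works the same way.

```python
# A minimal sketch of RAG retrieval: embed the query, compare it against the
# stored chunk embeddings, and return the most similar chunk.
# Assumes an OpenAI-style embeddings API (an illustration, not a requirement).
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def retrieve(query: str, chunks: list[str]) -> str:
    """Return the stored chunk most semantically similar to the query."""
    query_vec = embed(query)
    scores = []
    for chunk in chunks:
        chunk_vec = embed(chunk)
        # Cosine similarity between the query and this chunk
        scores.append(
            float(np.dot(query_vec, chunk_vec)
                  / (np.linalg.norm(query_vec) * np.linalg.norm(chunk_vec)))
        )
    return chunks[int(np.argmax(scores))]
```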
But how do we evaluate the RAG LLM response? How do we know what we are getting back is correct? Let’s think about how we would test this manually before showing how to automate it.
Manual testing would involve the following steps:
1. Create a series of questions, expected answers, and expected sources — your “gold standard dataset.”
2. Ask the chatbot a question from your dataset.
3. Receive the chatbot LLM response.
4. Compare the LLM response to the expected answer.
5. Assign a Yes/No determination or score to the LLM response (subject to personal bias).
6. Compare the expected source to the source returned by the LLM.
7. Document and share results.
We would then repeat steps 2 through 7 for each remaining question about that source page, start over at step 1 for the next source page, and rinse and repeat.
In short, manually creating and evaluating hundreds of prompts and responses is downright awful.
Now, let’s explore how to transform these manual processes into an automated series of tests using the evaluation framework.
Building an Evaluation Framework for RAG
The following walkthrough will be broken into steps to better illustrate the automated process.
To set the stage, let's recap our goal when evaluating our RAG system:
Test the LLM response to ensure that the information returned from the LLM is an appropriate, correct, and accurate response to the user query.
- This testing often involves looking through results and determining whether the LLM passed back an appropriate response to the user. Several metrics can be used, but for this experiment, we will focus on accuracy and sources.
1. Context - Understanding what is stored in the “knowledge” database
In our example, we will grab some data regarding the earliest evidence of photosynthetic organisms, using this information as a stand-in for the kind of information an organization might include in its database.
For the purposes of this experiment, we’re going to consider the data chunk below, taken from an article in the scientific journal Plant Physiology and found in the National Library of Medicine’s database, which claims photosynthetic organisms may have been present as early as 3.5 billion years ago. While perhaps controversial, we are not debating the scientific accuracy of this data chunk compared to others; we’re explicitly directing our RAG system to treat this passage as its source of truth.
An untrained model might find other sources — for instance, this article from Wikipedia — that suggest evidence of photosynthetic organisms appearing only as early as 3.2 billion years ago. We don’t want the LLM drawing from this source but rather from the sources we specify.
This is the point of using RAG systems: for many use cases and specific tasks, especially in highly regulated industries like financial services or healthcare, organizations strive to retain total control over the knowledge database an LLM references. Doing so ensures the system responds only with approved information that conforms to regulatory frameworks.
We’re using this paragraph from Plant Physiology as an example of an expected chunk we might find in our knowledge store that we DO want our LLM to use as a source. In other words, we are evaluating to ensure the LLM is NOT referencing broader world knowledge and returning answers based on, in this case, Wikipedia.
2. Gold Standard Dataset - Questions, Answers, Sources
To evaluate RAG, we need a gold standard (GS) dataset to test against. This gold standard will be our ground truth questions and answers. There are several ways to create this dataset: we can build question/answer pairs directly from our knowledge chunks, mine our chatbot usage logs for frequently asked questions, or run internal bug bashes to find edge cases. For this illustration, let’s create a simple example of questions, answers, and sources based on the data chunk outlined above. We will create a correct and an incorrect dataset to test against, establishing our source of truth for the evaluation.
As Part II of this series explains, we can create our own question/answer datasets or use an LLM to help us generate these datasets. By passing in knowledge chunks and asking an LLM to create questions and answers, we can quickly generate large question/answer datasets across multiple pages. While the automatic generation of these datasets makes the process easier, these datasets do need to be reviewed by humans and, therefore, incur some costs.
Note: Consider storing your question/answer datasets in a .csv file or database. If your RAG system returns source information, consider storing that in your question/answer dataset as well. For this walkthrough, we will hard code some variables to showcase the system and make it easier to understand.
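As a minimal sketch, the hard-coded gold standard for this walkthrough could be as simple as a handful of variables (the variable names are ours; the values come from the data chunk and examples used throughout this post):

```python
# Hard-coded gold standard entry used throughout this walkthrough.
# In practice, these rows would live in a .csv file or database.
gs_question = "When did photosynthetic organisms emerge?"
gs_answer = "Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago."
gs_source = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949000/"

# Intentionally incorrect values, used later to show the evaluator catching a mismatch.
bad_gs_answer = "Photosynthetic organisms emerged between 3.2 and 2.4 billion years ago."
bad_gs_source = "https://en.wikipedia.org/wiki/History_of_Earth"
```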
3. RAG Request/Response - Generate an answer from the chatbot to compare with the gold standard
Now that we understand our context and question/answer/sources dataset, we can pass our generated question(s) to the RAG chatbot to get a response that we can then compare to the gold standard answer. During this request, we don’t need to worry about embeddings or the mechanics of the RAG semantic search because all of that is hidden from the evaluation.
Let’s assume your RAG chatbot can be called through an HTTP API. If we iterate through our question dataset, a typical request/response could look like the following:
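Here is a minimal sketch of that call; the endpoint URL and the request/response shapes are assumptions, so substitute whatever your chatbot API actually exposes:

```python
# A sketch of calling the RAG chatbot over HTTP.
# The endpoint and payload/response fields are hypothetical placeholders.
import requests

CHATBOT_URL = "https://example.com/api/chat"  # hypothetical endpoint

def ask_chatbot(question: str) -> tuple[str, list[str], float]:
    """Send a question to the RAG chatbot and return (answer, sources, cost)."""
    response = requests.post(CHATBOT_URL, json={"question": question}, timeout=60)
    response.raise_for_status()
    body = response.json()
    return body["answer"], body["sources"], body["cost"]

# Using the hard-coded gold standard question from the previous step
llm_answer, llm_sources, rag_cost = ask_chatbot(gs_question)
print([llm_answer, llm_sources, rag_cost])
```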
One thing to note in the response is the cost field. It’s important to track and understand the cost and/or “tokens” used during the chatbot API calls to effectively calculate the cost of running the chatbot as well as evaluating it. Let’s look at what is returned and what we will store for evaluation.
['Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.', ['https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949000/'], 0.001662]
4. Evaluation - Does the chatbot response match the expectation?
Ultimately, RAG evaluation compares two pieces of text based on a metric. We will compare the gold standard question/answer/source to the response from our chatbot. There are a few ways to evaluate responses with existing tools, such as Ragas or MLflow; in this post, we will take a behind-the-scenes look at implementing a simple evaluator with an LLM call.
Let’s define the metric we will showcase for evaluation; in this example, we will use accuracy. Depending on the metric in question, the evaluation can be tailored to the type of response you want to test against. Multiple metrics can also be used, but be aware of the costs associated with larger or additional LLM calls.
4a. Augmentation Check
The first test we will showcase is the LLM rewrite, comparing the GS answer against the LLM response from our chatbot. In the following examples, we showcase both a correct and an incorrect answer during the comparison, using the evaluator function sketched below.
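This is a minimal sketch of that evaluator, assuming an OpenAI-style chat completions client; the model name and the system instruction asking for a Yes/No verdict followed by a brief explanation are implementation choices rather than requirements:

```python
# A sketch of the accuracy evaluator: build a comparison prompt and ask a
# separate LLM for a verdict. Model choice and system instruction are assumptions.
from openai import OpenAI

client = OpenAI()

def evaluate_accuracy(gs_answer: str, llm_answer: str) -> str:
    prompt = (
        f"Looking at the gold standard answer '{gs_answer}' and the language "
        f"model's answer '{llm_answer}', would you state that the language "
        f"model's answer is completely accurate?"
    )
    print(prompt)  # shown below for both the correct and incorrect cases
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Answer Yes or No on the first line, then briefly explain your reasoning.",
            },
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content
```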
Prompt: Looking at the gold standard answer 'Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.' and the language model's answer 'Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.', would you state that the language model's answer is completely accurate?

Evaluator response: ["Yes\nThe language model's answer is completely accurate because it matches the gold standard answer exactly."]
Prompt: Looking at the gold standard answer 'Photosynthetic organisms emerged between 3.2 and 2.4 billion years ago.' and the language model's answer 'Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.', would you state that the language model's answer is completely accurate?

Evaluator response: ["No\nThe language model's answer is not completely accurate because it states that photosynthetic organisms emerged between 3.2 and 3.5 billion years ago, which is different from the gold standard answer that states they emerged between 3.2 and 2.4 billion years ago."]
4b. Retrieval Check
The next thing we want to test is whether the sources returned by the chatbot match the expected gold standard source.
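A minimal sketch of this retrieval check needs no LLM call, just a comparison between the expected source and the source URLs the chatbot returned (the function name and printed messages are our own, chosen to match the output shown afterwards):

```python
# A simple retrieval check: does the expected gold standard source appear
# among the sources the chatbot returned?
def check_source(llm_sources: list[str], gs_source: str) -> bool:
    """Return True if the expected source is among the returned sources."""
    for source in llm_sources:
        print(f"LLM Response: {source}")
    print(f"GS Source: {gs_source}")
    if gs_source in llm_sources:
        print("The source matches the source in the prompt")
        return True
    print("The source does NOT match the source in the prompt")
    return False
```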
LLM Response: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949000/
GS Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949000/
The source matches the source in the prompt
LLM Response: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949000/
GS Source: https://en.wikipedia.org/wiki/History_of_Earth
The source does NOT match the source in the prompt
5. Review Results
5a. Cost
It’s important to understand the cost of using an LLM to evaluate the output of a RAG query. Asking the evaluator to return the reasoning behind its verdict increases the completion tokens used, which in turn increases the cost.
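The arithmetic itself is simple; here is a sketch with placeholder per-token rates (these are not the rates behind the figures below, so substitute your model’s actual pricing):

```python
# Placeholder per-1K-token rates -- substitute your model's actual pricing.
INPUT_RATE = 0.01 / 1000
COMPLETION_RATE = 0.03 / 1000

def evaluation_cost(input_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of a single evaluator LLM call from its token usage."""
    return input_tokens * INPUT_RATE + completion_tokens * COMPLETION_RATE

rag_cost = 0.001662  # returned by the chatbot call earlier
total_cost = rag_cost + evaluation_cost(136, 19) + evaluation_cost(136, 60)
```

For our example run, the actual figures were: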
RAG Cost: $0.001662
Chatbot Evaluation:
Input tokens = 136
Completion tokens = 19
Evaluation Cost: $0.001548
Bad Chatbot Evaluation:
Input tokens = 136
Completion tokens = 60
Bad Evaluation Cost: $0.004008
Total Cost: $0.007218
5b. Results
Sharing results is crucially important once we evaluate a RAG response. Where do we save our results: a .csv file, a database, our experiment tracker? Ultimately, it is up to the team to determine where these metrics can be reviewed and shared during development.
The breakdown of the different outputs is meant to showcase what we need when reviewing the evaluator responses. When it comes to sharing our results, the findings must be actionable. How is your team saving and sharing these results with developers and stakeholders? What do we do when something fails or behaves incorrectly? If you're using a fine-tuned model, how do we improve it?
Question: When did photosynthetic organisms emerge?
Answer: Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.
LLM Response: Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.
Accuracy Pass/Fail: Yes
Reasoning: The language model's answer is completely accurate because it matches the gold standard answer exactly.
Source Pass/Fail: True
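To persist a record like the one above, a minimal sketch using Python’s csv module might look like this (the filename and column names are our own choices):

```python
# A sketch of appending one evaluation record to a .csv file for later review.
import csv
import os

record = {
    "question": gs_question,     # from the gold standard dataset above
    "gs_answer": gs_answer,
    "llm_response": llm_answer,  # from the chatbot call
    "accuracy_pass": "Yes",
    "reasoning": "The language model's answer is completely accurate because it matches the gold standard answer exactly.",
    "source_pass": True,
}

results_file = "rag_evaluation_results.csv"
write_header = not os.path.exists(results_file)
with open(results_file, "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(record))
    if write_header:
        writer.writeheader()
    writer.writerow(record)
```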
Conclusion
Using an evaluation framework to measure the responses from our RAG chatbot provides insights into the quality of the LLM’s responses. We use an evaluation framework because the results from an LLM are not deterministic. We can, however, use this to our advantage: as shown above, an LLM can evaluate two answers and provide a meaningful determination across different metrics. Accuracy is a simple way of comparing a response against a desired result.
Like writing automated tests, using an evaluation method on your RAG chatbot helps ensure quality. And, as with tests, the results are only meaningful if we’ve constructed good-quality datasets and metrics to use during evaluation.
With the code above as a guideline, a Python file can be created and run to evaluate your RAG system. This file can be used locally or through CI/CD. The next steps are to consider larger datasets and understand how your team will use and share these results with developers and stakeholders.
If you need help setting up a cost-efficient evaluation process for your generative AI applications, the DART team at WillowTree is ready to help. Get started by learning about our eight-week GenAI Jumpstart program.
If you haven’t reviewed Part I and Part II of this series on LLM benchmarking, you may find additional answers to your questions in those articles.