Implementing effective end-to-end (e2e) tests of conversational AI systems is a major challenge. That’s because applications like voice assistants are nondeterministic and often built from multicomponent architectures, such as:
- Speech-to-text (STT) model
- Large language model (LLM)
- Text-to-speech (TTS) model

Traditional testing approaches tend to fall short when dealing with the interactions between these components. Though each component should have its own model integration tests, we also need to evaluate the full e2e pipeline of our voice assistant for deeper insights and greater confidence in performance.
In our work building conversational AI systems and testing the best speech-to-text models, we’ve developed a four-step process for evaluating our voice assistant architecture:
- Create a gold standard dataset of user queries and generated responses.
- Generate test audio using a variety of voices, languages, and accents.
- Run tests by feeding the generated audio through our voice assistant.
- Evaluate the outputs by comparing them against our gold standard data.
We’ll walk through these steps using a taco-ordering app as an example, which will serve up the considerations we need to think about when developing an e2e evaluation.
1. Create a Gold Standard Dataset
For our evaluations to be effective, we need to consider the different types of scenarios to target that will give us confidence in our voice assistant. So, for our taco-ordering app, it’s important to factor in things like:
- What types of food can and can’t be ordered
- Our generative artificial intelligence (i.e., GenAI) policy on what is and isn’t allowed to be talked about
- The different language modalities or phrases we could encounter
Therefore, our gold standard dataset (i.e., what we will consider our ground truth) should contain `user queries` and `expected responses` at a minimum. It should also include various scenarios (e.g., ordering 20 tacos, negative tacos, unclear quantities) that take into account language variations and policy boundaries.
Here’s how this query and response might flow through our app.

For this approach to work, our gold standard dataset needs, at a minimum, 1) the user query, which will be converted to audio, and 2) the expected response, which we’ll compare against the response the app actually generates.
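To make this concrete, here’s a minimal sketch of what a few gold standard entries could look like, using the taco scenarios we’ll test later (the field names are illustrative, not a required schema):

```python
# gold_standard.py - a minimal, illustrative gold standard dataset.
# The field names (scenario, user_query, expected_response) are assumptions,
# not a required schema.
GOLD_STANDARD = [
    {
        "scenario": "standard order",
        "user_query": "Order 20 tacos please",
        "expected_response": "Confirm that the user really wants 20 tacos before placing the order.",
    },
    {
        "scenario": "negative quantity",
        "user_query": "Order negative 1 tacos please",
        "expected_response": "Explain, with some humor, that a negative number of tacos cannot be ordered.",
    },
    {
        "scenario": "unclear quantity",
        "user_query": "Order all the tacos please",
        "expected_response": "Ask the user to specify how many tacos they want.",
    },
]
```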
2. Generate Test Audio From Expected User Queries
The `user query` scenarios in our gold standard dataset will act as the script to create test audio. This dataset also lets us test the STT model in isolation: we can feed in a given audio file and compare the STT transcription against our gold standard user query.

Once we have the dataset of taco scenarios we want to order through our app, we have two options to generate the audio files:
- Pass our text-based `user queries` to humans to record: This is our best option for accent and dialect diversity.
- Use a TTS model to generate synthetic recordings: This is the faster option and helps small teams produce diverse datasets quickly.
TTS models are rapidly improving in their ability to do things like speech flow matching, changing emotions, and adding accents. Still, human recordings are preferred for capturing natural speech variations. That approach takes more time and effort than generating synthetic voices though. Data annotation services, specifically audio labeling and dataset creation, can be a big help for recording lots of different human speakers quickly.
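If we go the synthetic route, a small script can turn every `user query` into audio across multiple voices. Here’s a sketch assuming the offline pyttsx3 library and the dataset file from step 1; any TTS model or API could be swapped in, and the available voices (and output format) depend on your system:

```python
# generate_audio.py - synthesize one test clip per (query, voice) pair.
# Assumes the offline pyttsx3 library; any TTS model or API could be swapped in.
import os

import pyttsx3

from gold_standard import GOLD_STANDARD  # dataset sketch from step 1

os.makedirs("audio", exist_ok=True)

engine = pyttsx3.init()
voices = engine.getProperty("voices")  # available voices vary by system

for i, example in enumerate(GOLD_STANDARD):
    for j, voice in enumerate(voices):
        engine.setProperty("voice", voice.id)
        # Queue one file per query/voice combination for broader coverage.
        engine.save_to_file(example["user_query"], f"audio/query_{i}_voice_{j}.wav")

engine.runAndWait()  # flush all queued synthesis jobs to disk
```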
Regardless of how the audio is generated, we need to make sure that we have multiple voices and accents, and that we handle different languages effectively. Therefore, it’s worth calling out some common mistakes when developing test audio and how to solve them.
Common pitfalls when developing test audio for conversational AI
Capturing as many different voices as possible is key to testing our app before it’s released. So is factoring in performance issues (e.g., connection quality and background noise). With that in mind, we need to watch out for these pitfalls when generating our test audio:
- Overlooking diversity: Many teams fail to include sufficient diversity in their test data, leading to bias and poor performance for certain user groups. Solution: Actively seek out diverse voices and accents for your dataset.
- Neglecting edge cases: It’s easy to focus on common scenarios and miss unusual but important cases. Solution: Brainstorm potential edge cases and include them in your testing suite.
- Overreliance on synthetic data: While convenient, synthetic data can miss real-world speech nuances. Solution: Balance synthetic data with real human recordings.
- Ignoring environmental factors: Background noise, poor connections, or device variations can significantly impact performance. Solution: Include various environmental conditions in your testing scenarios (see the noise-mixing sketch after this list).
- Failing to update tests: As your AI model evolves, your tests need to evolve too. Solution: Regularly review and update your test suite to match your current AI capabilities and user expectations.
By proactively implementing solutions to these common problems, we’ll enhance our voice assistant’s speech recognition and its ability to carry out voice commands, leading to a better user experience.
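For the environmental-factors pitfall in particular, one lightweight option is to mix background noise into the clean recordings. Here’s a sketch using pydub; the file paths and gain value are placeholders:

```python
# add_noise.py - overlay background noise on a clean test clip with pydub.
# File paths and the -20 dB gain are placeholders to tune per scenario.
from pydub import AudioSegment

clean = AudioSegment.from_wav("audio/query_0_voice_0.wav")
noise = AudioSegment.from_wav("noise/restaurant_chatter.wav")

# Quiet the noise track, then lay it under the speech.
noisy = clean.overlay(noise - 20)
noisy.export("audio/query_0_voice_0_noisy.wav", format="wav")
```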
3. Run Tests Using the Generated Audio Inputs
Using our diverse set of audio files as input, we send them through our taco app’s voice assistant as an e2e test of the scenarios in our gold standard dataset. Because the apps and architectures involved can vary widely, we’ll show just an example call that iterates over our dataset, sends in each audio file, and gets back an app response.
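Here’s what that loop could look like. The `order_assistant` client and its `process_audio()` method are hypothetical stand-ins for however your app actually exposes its voice pipeline:

```python
# run_tests.py - feed every generated audio file through the voice assistant.
# The order_assistant client and its process_audio() method are hypothetical
# stand-ins for however your app exposes its voice pipeline.
import glob
import json
import os

from my_taco_app import order_assistant  # hypothetical app client

os.makedirs("results", exist_ok=True)

results = []
for audio_path in sorted(glob.glob("audio/*.wav")):
    with open(audio_path, "rb") as audio_file:
        # The app may return response audio, response text, or both.
        response = order_assistant.process_audio(audio_file)
    results.append({"audio_input": audio_path, "app_response": response})

# Persist the raw responses so the evaluation step can run separately.
with open("results/raw_responses.json", "w") as f:
    json.dump(results, f, indent=2, default=str)
```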

Our app is designed to take in a user’s taco request and ask to confirm the taco count prior to ordering. The LLM will respond based on three different states:
- A taco count above zero
- A negative taco count
- An unknown taco count
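For context, the app’s system prompt might encode those three states along these lines (an illustrative sketch, not the app’s exact prompt):

```python
# An illustrative system prompt covering the three taco-count states.
# This is a sketch, not the app's exact prompt.
APP_SYSTEM_PROMPT = """\
You are a voice assistant for a taco-ordering app.
- If the user requests a positive number of tacos, repeat the count back
  and ask them to confirm before placing the order.
- If the user requests a negative number of tacos, explain that a negative
  number of tacos cannot be ordered.
- If the requested quantity is unclear, ask the user to specify how many
  tacos they want.
"""
```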
Once we’ve run all of our test inputs and captured the responses (either text copies of the LLM responses or simply the voice responses), we can compare them against the expected results in our gold standard dataset.
4. Evaluate Test Results
Now we can compare the app’s generated response to the expected response in our original dataset, either manually or automatically. This looks like:
- For fixed responses (Intents): Use direct assertions.
- For GenAI responses: Use an LLM-as-a-Judge metric for semantic evaluation, plus assertions for specific elements (e.g., number of tacos). Note that we still need to include human review, especially during prompt development.
So how do we know which to use? Here’s a good way to think about it: if the app only ever returns fixed responses, we can check them with direct assertions. But if we use GenAI to shape and guide the conversation, we can use a mixture of assertions and LLM-as-a-Judge to evaluate the responses (more on LLM-as-a-Judge in the next section).
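A rough sketch of that split might look like the following, where `extract_taco_count` is a small helper we write ourselves, `expected_taco_count` is an optional field a scenario might carry, and `llm_judge` is a placeholder for the judge call covered in the next section:

```python
# evaluate.py - mix direct assertions with an LLM-as-a-Judge check.
import re


def extract_taco_count(text: str) -> int | None:
    """Pull the first integer out of the response text, if any."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


def llm_judge(expected_response: str, actual_response: str) -> int:
    """Placeholder for the LLM-as-a-Judge call sketched in the next section."""
    raise NotImplementedError("wire this up to your LLM provider of choice")


def evaluate(expected: dict, app_response_text: str) -> dict:
    checks = {}
    # Direct assertion for a fixed, structured element: the taco count.
    if "expected_taco_count" in expected:
        checks["count_matches"] = (
            extract_taco_count(app_response_text) == expected["expected_taco_count"]
        )
    # Semantic check for the open-ended GenAI wording.
    checks["judge_score"] = llm_judge(
        expected["expected_response"], app_response_text
    )
    return checks
```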
As for getting the output from our application, we have a few options. Depending on the app, we may get a synthetic voice response, just the text, or both. Because this is an e2e test, we want to cover the voice output: we can run the app’s audio response through a local Whisper model or an STT API to convert the speech back into text, and use that text (or the LLM’s response text, if the app exposes it) for our evaluations going forward.
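Converting the audio back to text can be as simple as a local Whisper call. This sketch assumes the open source openai-whisper package and a placeholder file path:

```python
# transcribe_output.py - convert an app audio response back to text with a
# local Whisper model (openai-whisper package). The file path is a placeholder.
import whisper

stt_model = whisper.load_model("base")  # small, fast model for test runs


def transcribe(audio_path: str) -> str:
    result = stt_model.transcribe(audio_path)
    return result["text"].strip()


# Example: text we can compare against the gold standard expected response.
response_text = transcribe("results/response_query_0.wav")
```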
LLM-as-a-Judge metric
With the output as text, we can now evaluate the responses we got from the app. We’ll use LLM-as-a-Judge to quickly give a score of 0 or 1, plus an explanation of the score, based on a prompt like the one below. Note that part of the evaluation prompt takes the application’s system prompt into consideration (hint hint, this will be important to keep in mind when we review the results).
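The exact prompt isn’t reproduced here, but an illustrative version, and one way to call it, could look like this. The wording, the `gpt-4o-mini` model choice, and the OpenAI client are assumptions; note how the prompt bakes in the negative-taco rule quoted in the results below:

```python
# judge.py - an illustrative LLM-as-a-Judge prompt and scoring call.
# The real evaluation prompt also folded in parts of the app's system prompt;
# the wording, model name, and OpenAI client here are assumptions.
import json

from openai import OpenAI

JUDGE_PROMPT = """\
You are evaluating a taco-ordering voice assistant.

Expected response: {expected_response}
Actual response: {actual_response}

Score 1 if the actual response conveys the same intent as the expected
response, otherwise score 0. Additional rules:
- A positive taco count must be confirmed with the user before ordering.
- If the user orders a negative number of tacos, the response should inform
  the user that they cannot order a negative number of tacos. A negative
  number must say something close to the humorous phrase
  "No one needs negative tacos."
- If the quantity is unclear, the response should ask the user to specify one.

Return a JSON object with the fields "score" (0 or 1) and "explanation".
"""

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def llm_judge(expected_response: str, actual_response: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; use whatever you prefer
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                expected_response=expected_response,
                actual_response=actual_response,
            ),
        }],
    )
    return json.loads(completion.choices[0].message.content)
```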
Now let’s evaluate the responses to our three test queries.
Evaluation output
1. “Order 20 tacos please”
LLM-as-a-Judge review: ✅
Human review: ✅
From the results for “Order 20 tacos please,” we see a score of 1. We expected a confirmation asking if we really wanted 20 tacos. The app responded with a humorous phrase and also asked to confirm the taco count before ordering.
This conforms to the expected behavior and suggests that the app’s system prompt and evaluation prompt are aligned.
This is a perfect example of ensuring that our prompts make sense, and why having a human review these details early on is a must.
2. “Order negative 1 tacos please”
LLM-as-a-Judge review: ❌
Human review: ❌
For “Order negative 1 tacos please,” our app responded correctly based on the app’s system prompt, but its response doesn’t match our dataset’s expected response. If we look at the evaluation prompt, it specifically calls for a humorous line about negative tacos: “If the user orders a negative number of tacos, the response should inform the user that they cannot order a negative number of tacos. A negative number must say something close to the humorous phrase ‘No one needs negative tacos.’”
This highlights a missing business requirement: the app’s system prompt never asks for the expected humorous line to the effect of “No one needs negative tacos.” That’s a good catch, and now we can update our application’s system prompt accordingly.
But that’s not the only problem here. We should also double-check that our evaluation prompt specifically calls out the need to address “negative tacos,” because no one wants that.
3. “Order all the tacos please”
LLM-as-a-Judge review: ✅
Human review: ✅
In the case of “Order all the tacos please,” it seems our evaluations are on point. While our expected response is very short, the app’s response correctly conveys that it didn’t understand the user’s quantity and even asks them to specify one.
Storing results
Storing these results over time is a core part of LLMOps and good system health. We should set up a system for sharing these results and collecting human feedback, and keep in mind how changes to our prompts will affect our voice assistant’s responses, as well as our evaluation output and metrics.
Therefore, we should version our prompts and datasets whenever we run evaluations. Having this evaluation setup in place will allow us to track changes over time and feel confident about our GenAI application’s responses.
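A lightweight way to start, before adopting a full LLMOps platform, is to stamp each evaluation run with prompt and dataset versions. The fields below are just one possible layout:

```python
# store_results.py - persist an evaluation run with version metadata so results
# can be compared over time. The fields shown are one possible layout.
import json
from datetime import datetime, timezone

run_record = {
    "run_at": datetime.now(timezone.utc).isoformat(),
    "system_prompt_version": "v3",   # assumed versioning scheme
    "judge_prompt_version": "v2",
    "dataset_version": "v1",
    "results": {},                   # per-scenario checks from the evaluation step
}

with open("results/eval_run.json", "w") as f:
    json.dump(run_record, f, indent=2)
```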
Keep Fine-Tuning Your Voice-Powered Experiences
With ongoing prompt engineering and good dataset management, we will get our taco app ready for even more orders. Remember that we need to monitor the outputs of our evaluations over time, and that human review is necessary when tweaking prompts and verifying results, especially early on while prompts are still changing. With just the simple evaluation and review we conducted above, we’ve already found several ways to improve our prompts and dataset health.
As voice pipelines become more common in conversational and GenAI applications, understanding how to test the full user-voice-to-app-to-voice interaction is critical. The four-step process we’ve outlined shows what to expect when evaluating these complex systems. Implementing thorough e2e testing results in more reliable, responsive, and user-friendly voice-enabled applications. This doesn’t mean we shouldn’t test each of our components separately; rather, an e2e approach gives us even greater insight into, and confidence in, the health of the full voice pipeline.
Keep learning how to build engaging voice experiences with more content from WillowTree:
- How to Train AI Voice Tools to Speak Your Customers’ Language
- 7 UX/UI Rules for Designing a Conversational AI Assistant
- ChatGPT Advanced Voice Mode: The Future of AI-Powered Conversations Is Here
And if you need help building, testing, or optimizing your voice applications, take a look at our conversational AI and voice technology consulting services.