Editor’s Note: This is the second article in our AI Hallucinations series. The first reviews WillowTree's six-step defense-in-depth approach to minimizing and mitigating AI-generated misinformation.
Reducing AI Hallucination Rates
Large language models (LLMs) like OpenAI's ChatGPT and Google's Bard signal a revolution in how customers engage with brands and how employees automate workflows. At WillowTree, we are discovering new ways to incorporate this artificial intelligence technology into the user experiences we design for clients. Despite the excitement around LLMs, their potential to “hallucinate” — that is, to return incorrect or harmful information — is a major concern for organizations considering deploying or using AI tools. Minimizing AI hallucination rates is crucial.
WillowTree’s Data and AI Research Team (DART) has come up with some clever ways to keep LLMs more honest, built around our defense-in-depth approach that examines an LLM's underlying training data, biases, and risk for harm (a hallucination in a healthcare context, for instance). If you’re ready to go a layer deeper into the tech and dive into the world of artificial accountability, here are three ways we get inside the minds of machines to prevent AI hallucinations.
1. Predictive hallucination measurement for relational data (aka, “The Benchmark Exam”)
Teachers often deliver a formative assessment early in the school year to establish a baseline measurement of each student’s understanding. Then, they look for growth through repeated testing. Similarly, DART has developed a testing tool we call The Benchmark Exam, which can estimate the real-world credibility of a large language model or chatbot.
A DART engineer prompts the LLM with a battery of simulated graph datasets and a task we know how to solve exactly. “Path-finding in a graph is a benchmark we can easily and independently calculate the truthfulness of,” says Michelle Avery, Group VP for AI at WillowTree. The engineers know the answer to each question, the same way a teacher usually knows the answers on an assessment and can gauge whether a student is right or wrong. (I say “usually” because I had a renegade grad school professor who occasionally put unsolved math problems on our exams, “just to see if someone happened to solve it.”)
However, DART’s benchmark exam does more than measure the accuracy of a prompt-model combination; it measures that accuracy at many possible levels of complexity. It does this by testing the combination against graphs of increasing “edge count,” that is, the number of facts each graph expresses.
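To make the mechanics concrete, here is a minimal sketch of what such an exam could look like in Python. The `ask_llm` helper is hypothetical (it stands in for whatever prompt-model combination is under test), and the graph generator and exact solver below are illustrative rather than DART’s actual tooling:

```python
import random
from collections import deque


def random_graph(num_nodes, num_edges):
    """Simulated relational dataset: a random set of 'facts' (undirected edges)."""
    possible = [(a, b) for a in range(num_nodes) for b in range(a + 1, num_nodes)]
    return random.sample(possible, min(num_edges, len(possible)))


def shortest_path_length(edges, source, target):
    """Exact ground truth via breadth-first search; returns None if no path exists."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def ask_llm(edges, source, target):
    """Hypothetical helper: describe the edge list in a prompt, ask the model for
    the shortest path length, and parse its reply into an int (or None)."""
    raise NotImplementedError


def hallucination_rate_by_edge_count(num_nodes, edge_counts, trials=50):
    """Estimate the error rate of a prompt-model combination at each edge count."""
    rates = {}
    for num_edges in edge_counts:
        wrong = 0
        for _ in range(trials):
            edges = random_graph(num_nodes, num_edges)
            truth = shortest_path_length(edges, 0, num_nodes - 1)
            if ask_llm(edges, 0, num_nodes - 1) != truth:
                wrong += 1
        rates[num_edges] = wrong / trials
    return rates
```

Because the shortest path can be computed exactly, every disagreement between the solver and the model is an unambiguous error, which is what makes the hallucination rate measurable at each level of complexity.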
“So now, DART engineers can calculate an expected hallucination rate for a prompt,” explains Avery. “And they might also determine the prompt works best with a graph at the complexity of, say, fewer than ten variables.”
Once an organization has this information, it can choose a prompt suited to the level of informational complexity contained in the graph. If a better prompt can’t be found, limiting the amount of information maximizes the credibility of the prompt-model combination. We believe the technique also applies to non-graph data and are working on adapting it to prose content.
The Benchmark Exam approach is practical for many use cases, including:
- Standing up an LLM. WillowTree’s benchmark exam helps an organization using AI determine the number of variables at which its generative AI solution will perform best.
- Testing for scalability. The exam allows an engineer to check that an expansion of variables won’t break the system when it begins generating outputs.
- Benchmarking. The exam lets engineers compare LLMs (e.g., Bard vs. GPT-4) to see which offers the lowest hallucination rate. It also enables rapid iteration, because engineers can re-test the same models with slightly different prompts.
- Future integrations. The exam comes into play as new LLMs are released or improved. Our enterprise AI platform, Fuel iX, centrally manages and enables new generative AI capabilities to help future-proof your company against asymmetric technology innovation, and it was recently awarded the first global certification for Privacy by Design (ISO 31700-1), leading the way in GenAI flexibility, control, and trust.
2. Hallucination Audit Process (aka, “Bot Court”)
Inspired by Making a Murderer and by generative adversarial networks (GANs), our DART team created Bot Court (like People’s Court, but for bots): three LLM bots in the style of a prosecutor, a defense attorney, and a judge.
Bot Court is a thin audit script, with barely any code in it, that orchestrates the following process: First, it feeds the log of the LLM's responses to the Prosecutor Bot and asks it to invalidate the record. Prosecutor Bot scans the record and makes its argument. Next, the Defender Bot tries to defend the record and refute the accusations made by Prosecutor Bot. Finally, the Bot Court prompt feeds output from both systems, alongside the original log, to a Judge Bot. The prompt asks Judge Bot to determine which argument is more data-backed, probable, and convincing.
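In code, the orchestration really is thin: three prompted calls chained together. The sketch below assumes a hypothetical `chat` helper wrapping whatever chat-completion API is in use, and the role prompts are illustrative rather than WillowTree’s production prompts:

```python
def chat(system_prompt: str, user_message: str) -> str:
    """Hypothetical helper that sends one system + one user message to an LLM
    via whatever provider SDK you use, and returns the model's reply."""
    raise NotImplementedError


def bot_court(response_log: str) -> str:
    # Prosecutor Bot: try to invalidate the record.
    prosecution = chat(
        "You are a prosecutor. Identify responses in this chatbot log that are "
        "unsupported, inconsistent, or likely fabricated. Cite specific lines.",
        response_log,
    )
    # Defender Bot: defend the record and refute the accusations.
    defense = chat(
        "You are a defense attorney. Rebut the prosecution's accusations and "
        "argue that the logged responses are accurate wherever possible.",
        f"LOG:\n{response_log}\n\nPROSECUTION:\n{prosecution}",
    )
    # Judge Bot: decide which argument is more data-backed, probable, and convincing.
    verdict = chat(
        "You are a judge. Given the log and both arguments, decide which side is "
        "better supported and list the responses most likely to be hallucinations.",
        f"LOG:\n{response_log}\n\nPROSECUTION:\n{prosecution}\n\nDEFENSE:\n{defense}",
    )
    return verdict
```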
Through this audit process, or “Bot Court,” WillowTree can examine a log of chatbot responses to accurately determine which are likely to have been hallucinations.
3. Dimensionality reduction to predict hallucination vulnerability for live-usage datasets (aka, “3D Review”)
DART has developed a review of chatbot responses that simplifies an LLM’s complexity to run a quick scan for accuracy. We call this the 3D Review because a key feature of its simplification is that the engineers reduce the complexity of a many-dimensional model into only three dimensions.
Every LLM relies on embeddings: long numerical vectors that encode how all the words in its system relate to each other. WillowTree uses a dimensionality reduction process that takes a vector with many dimensions and reduces it to one reflecting only the three dimensions with the highest variability in the original data.
Dimensionality reduction transforms high-dimensional data into low-dimensional data that preserves as much similarity (a tech term that means “closeness in space”) in the original data as possible. It's a great way to strain out smaller sources of variance that we care less about and lets us build simpler predictive models for phenomena such as hallucination.
In initial trials, when we compressed 1,536 dimensions into three and ran the LLM responses through a logistic regression on this reduced representation, the resulting classifier separated truthful responses from hallucinations with 80% accuracy, a rate that becomes even more effective when coupled with other techniques.
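As an illustration of that pipeline, here is a minimal sketch using scikit-learn. The embeddings and labels below are synthetic placeholders; in practice each logged response would be embedded (for example, into a 1,536-dimensional vector) and labeled truthful or hallucinated, and the 80% figure above reflects DART’s trials, not this toy data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 1536))   # placeholder response embeddings
labels = rng.integers(0, 2, size=500)       # 1 = hallucination, 0 = truthful (placeholder)

# Project each embedding onto the three directions with the highest variance.
pca = PCA(n_components=3)
reduced = pca.fit_transform(embeddings)

# Fit a simple classifier on the reduced, three-dimensional representation.
X_train, X_test, y_train, y_test = train_test_split(
    reduced, labels, test_size=0.2, random_state=0
)
classifier = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {classifier.score(X_test, y_test):.2f}")
```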
The path forward
Every day, WillowTree’s Data and AI Research Team is discovering new approaches to measuring and monitoring for hallucinations and catching them after the fact. These three approaches (The Benchmark Exam, Bot Court, and 3D Review) can help organizations minimize and mitigate the risks of AI hallucination in their generative AI solutions.
Let us help you integrate AI responsibly
Ensuring responsible and ethical AI development is crucial as AI capabilities rapidly advance. WillowTree's experienced data scientists can help you build rigorous AI systems that provide value to customers while proactively addressing risks like model hallucination.
Let's connect to discuss your needs — whether an intro call, exploratory workshop, or custom project. Together, we can innovate with integrity and stay ahead of the curve.
Learn more about minimizing and mitigating risk in Part 1 of our AI Hallucinations series, and reach out to get started.