
Developing Input Filters for Chatbot Security: How Traditional Machine Learning Shows the Need for Diverse Training Data

The demand for diverse, high-quality natural language data is surging as artificial intelligence practitioners accelerate their use of large language models (LLMs) to develop consumer- and business-facing tools like chatbots. For customer service and support organizations alone, Gartner predicts that by 2025, 80% will apply some form of generative AI to improve agent performance and customer experience (CX), including AI-supported chatbots.

A key demand for this natural language data is enhancing chatbot security by developing input filters able to defend against potential adversarial attacks. Ideally, a chatbot system should identify malicious and out-of-scope prompts (e.g., hackers sending encrypted queries to extract customer data or other sensitive information) before they ever reach the underlying LLM. A fitting solution is to use a traditional, lightweight machine learning (ML) method to identify such prompts.
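
To make that filtering step concrete, here's a minimal sketch of where such a filter sits in the request flow. The `classify_prompt` and `call_llm` functions are placeholders for whichever classifier and LLM integration a given system uses, not an implementation of our own filter.

```python
# Hypothetical sketch: a lightweight input filter screening prompts
# before they ever reach the underlying LLM. `classify_prompt` stands in
# for whatever traditional ML classifier you train (e.g., an SVM over
# embeddings, as discussed later in this article).

REFUSAL_MESSAGE = "Sorry, I can't help with that request."

def handle_user_prompt(prompt: str, classify_prompt, call_llm) -> str:
    """Route a prompt to the LLM only if the filter marks it as in-scope."""
    if classify_prompt(prompt) == "attack":   # malicious or out-of-scope
        return REFUSAL_MESSAGE
    return call_llm(prompt)                   # safe to forward to the LLM
```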

But this brings AI practitioners to an obstacle without an easy solution. How do we generate or obtain the right data to train such a model? To help answer this question for our own practice, we experimented with finding 1) the best ML model for classification and 2) the best featurization of user input.

Key Takeaways on Developing Input Filters for Chatbot Security

At a high level, our experiment results showed:

  1. System-specific training data is crucial. Generic datasets of attack queries aren't sufficient for most LLM-based chatbot systems because each chatbot serves a unique use case. This means tailored datasets combining real and synthetic data are needed to develop input filters capable of effectively identifying malicious or out-of-scope prompts.
  2. But diverse, high-quality training datasets are hard to come by. Using LLMs to generate training data is cost-effective, but we see this synthetic data fail to generalize in a critical way: the perplexity distributions of LLM-generated texts are very different from those of texts written by humans. In other words, LLM-generated data and human-written data aren't an apples-to-apples comparison. This underscores the challenge of creating synthetic text data with sufficient quality, quantity, and diversity for training purposes.
  3. Featurization techniques matter a lot. Our experiments with various input featurization methods — including token count, perplexity, and embeddings — show both promising results and potential pitfalls. Overall, OpenAI’s embeddings showed particular promise as a featurization technique.
  4. Diverse, high-quality data sources might matter more. We found that a mix of LLM-generated and human-written data produces the most robust input filters. This approach helps avoid false positives resulting from the differences between AI-generated and human-written text.
  5. Enhancing chatbot security is a balancing act. Achieving high precision and recall in identifying malicious prompts, while also maintaining flexibility for legitimate queries, remains a significant challenge.

Here’s an in-depth look at our experiment and how we arrived at our conclusions.

Strong Chatbot Security Starts With Strong Training Data — And Such Data Is Scarce

The first element of our challenge is finding or generating system-specific data. Because most chatbots have different use cases, the same user query may be considered an attack or out-of-scope by one chatbot but not another. This means relying only on a general dataset of attempted jailbreaks, like Tensor Trust's or Hugging Face's ChatGPT-Jailbreak-Prompts datasets, is insufficient for most companies developing input filters for their chatbots. Filters trained only on such data will miss prompts specific to a company's own system that it doesn't want its chatbot to answer.

For example, a financial services company building a customer-facing chatbot likely wouldn’t want their chatbot answering requests for direct investment advice. However, such a question wouldn’t be contained in either Tensor Trust’s or Hugging Face’s ChatGPT-Jailbreak-Prompts datasets — or potentially many other datasets of attack queries for that matter — because they would only be considered attacks on specific systems.

Because of this need for system-specific data, our next question becomes, how do we generate that data? A natural first-pass solution is to use LLMs to generate said data. After all, this process is much cheaper and faster than having humans generate this data manually. But this assumes that training a model on attack and non-attack queries written by LLMs will generalize well to queries written by humans. However, research has shown that generating synthetic data for text classification with sufficient quality, quantity, and diversity is a difficult task.
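
As a rough illustration of what that first-pass approach looks like, here's one way to prompt an LLM for synthetic examples. This is a hedged sketch, assuming the OpenAI Python client (openai>=1.0) and a model such as "gpt-4o-mini"; the prompt wording and helper function are hypothetical, not the exact process we used.

```python
# Minimal sketch of generating system-specific synthetic training data with
# an LLM. The model name and prompt wording below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def generate_examples(label: str, system_description: str, n: int = 10) -> list[str]:
    """Ask an LLM for n example user queries of the given label ('attack' or 'non-attack')."""
    prompt = (
        f"You are helping build an input filter for this chatbot: {system_description}. "
        f"Write {n} distinct user queries that a security team would label as "
        f"'{label}' for this system. Return one query per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # higher temperature to encourage diversity
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]
```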

Featurization Presents Challenges, Too

We grappled with the above problems while experimenting with the development of traditional ML input filters to support chatbot security. Our work focused not just on finding the best ML model to use for classification, but also finding the best featurization of user input.

That’s because traditional ML models can’t work directly with raw text from user queries. Instead, we need to transform this input — often by vectorizing or embedding it. This means the challenge isn’t just tuning the optimal model’s hyperparameters. It’s also about identifying the best combination of model, hyperparameters, and user input featurization technique.
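
To illustrate what that search space looks like, here's a hedged sketch using scikit-learn, with TF-IDF standing in for any featurization of raw text and an illustrative (not our actual) parameter grid.

```python
# Sketch of searching over featurization + model + hyperparameter combinations.
# TF-IDF is a placeholder featurizer; the grid values are examples only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("featurize", TfidfVectorizer()),   # turn raw queries into numeric features
    ("clf", LinearSVC()),               # lightweight classifier on top
])

param_grid = {
    "featurize__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# texts: list of user queries; labels: 1 = attack, 0 = non-attack
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
# search.fit(texts, labels)
```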

Experiment Results

Multiple instances of the aforementioned data issues made our experimentation more difficult and produced deceptively promising results. For our experiments, we wanted to develop input filters for a single customer-facing chatbot. We used:

  • 1,000 system-specific LLM-written attacks
  • 1,000 human-written non-attacks
  • 1,000 LLM-written non-attacks
  • 1,000 human-written attacks

The 1,000 human-written attacks were sourced from the Tensor Trust and Gandalf datasets, meaning they were not system-specific for our use case.

Token count and perplexity

The first case of deceptively promising results came when we used token count and perplexity (i.e., a measure of how surprising a text is to a language model) as a featurization of user queries. The paper “Detecting Language Model Attacks With Perplexity” showed that this featurization can lead to separability between attack and non-attack prompts. The authors demonstrated that for their data, attacks had a higher perplexity than non-attacks. This makes sense: attacks are known to use particular phrasings that, by definition, are not present in examples of regular use.
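
As a rough sketch of this featurization, the snippet below computes token count and perplexity for a single query using GPT-2 from Hugging Face transformers. Any causal language model could be substituted; this is our illustration, not the paper's exact setup.

```python
# Minimal sketch of the token-count + perplexity featurization of a user query.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def featurize(query: str) -> tuple[int, float]:
    """Return (token_count, perplexity) for a single user query."""
    inputs = tokenizer(query, return_tensors="pt")
    token_count = inputs["input_ids"].shape[1]
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss,
        # whose exponential is the perplexity of the query.
        outputs = model(**inputs, labels=inputs["input_ids"])
    perplexity = torch.exp(outputs.loss).item()
    return token_count, perplexity
```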

When we first attempted to use this featurization, we tested it using only LLM-written attacks and human-written non-attacks. We saw very promising results — strong separability between attacks and non-attacks, leading to easily training ML models with high precision and recall.

But we noticed that contrary to the results of the paper, the attacks in our dataset had a lower perplexity than the non-attacks. When we included the LLM-written non-attacks in our dataset, their distribution of perplexity was virtually the same as that of the LLM-written attacks. We realized that the separability we initially observed was not between attacks and non-attacks, but between LLM-written text and human-written text.

Naturally, LLM-written text — especially when produced at a low temperature — will have low perplexity. Our positive findings from earlier were not, in fact, positive findings, but a mirage brought on by a lack of diverse data sources. Figure 1 below helps visualize this.

Figure 1: Perplexity distributions of LLM-written attacks vs. human-written non-attacks (left) compared with all LLM-written texts vs. human-written texts (right).


On the left side of Figure 1 above, we see the distributions of the perplexity of LLM-written attacks and human-written non-attacks, with apparent separability. However, when we include the LLM-written non-attacks, that separability disappears. Therefore, the differences between LLM-written text and human-written text are largely (if not entirely) responsible for the separability observed in the plot on the left.

Embeddings

The next stage of our experiment revolved around using embeddings as a featurization of user input, specifically OpenAI’s text-embedding-3-large embeddings. The first results came from using only the synthetic data (i.e., the system-specific, LLM-written attacks and non-attacks). These results were initially very promising: we observed strong classification performance with a linear support vector machine (SVM).
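
For illustration, here's a minimal sketch of that embeddings-plus-SVM setup, assuming the OpenAI Python client and scikit-learn; the variable names and training calls are ours, not the exact code from our experiments.

```python
# Hedged sketch: featurize queries with OpenAI's text-embedding-3-large
# embeddings and fit a linear SVM on top.
import numpy as np
from openai import OpenAI
from sklearn.svm import LinearSVC

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of user queries with text-embedding-3-large."""
    response = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([item.embedding for item in response.data])

# train_texts / train_labels: small labeled set (1 = attack, 0 = non-attack)
# svm = LinearSVC().fit(embed(train_texts), train_labels)
# predictions = svm.predict(embed(test_texts))
```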

In fact, we were able to train an SVM that achieved 0.98 precision and 0.92 recall on the test set after training on only 15 examples and testing on the remaining data. There are two potential interpretations of this result:

  1. The first, more optimistic interpretation is that for attacks and non-attacks on this specific system, these two classes are easily separable in this embedding space, and even a simple model can effectively separate them with little training data.
  2. The second, and perhaps more likely, interpretation is that there was a severe lack of diversity in the synthetic data we generated using our LLMs.

As previously mentioned, generating text data that’s high in quality and diversity is difficult, so the second interpretation may be the more plausible one. When we applied this model, trained only on synthetic data, to the human-written, system-specific non-attacks, it achieved a true negative rate (TNR) of 0.99, a promising sign that it can successfully identify non-attacks. Conversely, when we tested it on the human-written, non-system-specific attacks, it had a true positive rate (TPR) of only 0.47.

It is difficult to know how to correctly interpret this result. While these attacks are not specific to the system, they would still in all likelihood be out-of-scope prompts and should be identified as attacks. We were unable to conclude whether this low TPR is a result of the model failing to generalize to attacks that are less specific to this system, a failure to generalize to human-written attacks, or some combination of both.
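
For clarity on how we report these two evaluation slices, here's a small sketch of the metric definitions; the prediction arrays are placeholders for a model's outputs on each slice.

```python
# TNR is measured on known non-attacks, TPR on known attacks.
import numpy as np

def true_negative_rate(preds_on_non_attacks: np.ndarray) -> float:
    """Fraction of non-attacks correctly predicted as non-attack (label 0)."""
    return float(np.mean(preds_on_non_attacks == 0))

def true_positive_rate(preds_on_attacks: np.ndarray) -> float:
    """Fraction of attacks correctly predicted as attack (label 1)."""
    return float(np.mean(preds_on_attacks == 1))
```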

What Companies Can Do to Improve Chatbot Security

Our research on using traditional machine learning (ML) to filter harmful prompts threatening chatbot security highlights the need for diverse, system-specific training data. General datasets, while helpful, often fail to address unique system vulnerabilities.

While generating synthetic data through LLMs is a practical solution, it too often lacks the quality and diversity required for robust training. Despite promising results from our embedding experiments, these same results highlight the need for diverse, high-quality data. So while traditional ML provides a potential solution for developing chatbot input filters, achieving reliable results requires focus on data quality, diversity, and system specificity.

If you need help keeping hackers from using chatbots to access sensitive data, a security risk audit from WillowTree will identify and address vulnerabilities to events like data breaches. We’ll apply AI red teaming best practices, from categorizing LLM attacks to detecting prompt exfiltration, to implement security measures that ensure data protection.

Learn more about our Data & AI consulting services.

Will Rathgeb
Yelyzaveta Husieva
Michael Freenor