AI Blue Teaming to Protect Against Jailbreaks: LLM-Generated Guardrails for Chatbot Input Filtering

Will Rathgeb

Data Scientist

Last updated:

Published:

September 26, 2024

As more users incorporate LLMs into their daily lives, developers must build safe and secure systems capable of anticipating vulnerabilities before they happen. Red teaming is a helpful start, but anticipation alone isn’t enough. We must also be able to fix a system.

This is where blue teaming, or verifying and fixing security measures, comes into play. Input filters are a blue team operation and an essential part of designing for security and safety. They identify and halt unauthorized actions (such as spreading harmful content) before a user’s input can affect your system.

One obvious solution for input moderation is to submit a prompt that asks an LLM to perform the input moderation. However, this first-pass attempt has some drawbacks:

Prompting the LLM to create input filters increases cost and latency.
LLMs are jailbreakable, so relying on software containing the same vulnerabilities may not make the system much safer.

Still, LLMs’ power — from improving information accuracy through advanced RAG techniques to ensuring safety and compliance in healthcare products — may reduce the manual effort and cost required to produce strong guardrails. In this blog, we explore using LLMs to generate heuristic guardrails and perform input moderation more effectively. This approach leverages the power of LLMs without making an LLM the final decision maker or exposing the input filter to the possibility of jailbreaking.

When used with an AI supervisor, LLM-generated guardrails better protect against more robust jailbreaking attempts.

Let’s dive into the approach.

Automating LLM-Generated Guardrails

One common method for implementing guardrails is to enter an input and ask the LLM to determine whether the prompt is an attempt to jailbreak the system. We tried a different approach: we began by feeding our system several examples of user queries and labeled each query as an “attack” or “not an attack.” We then had the LLM generate a set of guardrails (or input filters) based on our queries.

You can also feed the model a chatbot’s “Code of Conduct” or set of guidelines the system should follow. In doing so, you ensure that the LLM generates guardrails aligning with the chatbot’s desired behavior rather than relying entirely on the few-shot examples provided.

Here is an example of a system prompt you can feed to the LLM generating your guardrails:

You are a blue teaming agent tasked with creating guardrails for a chatbot for {{chatbot_system}}. The chatbot has the following code of conduct:
"""
{{code_of_conduct}}
"""
You will be given a list of user prompts sent to the system, each being labeled as an attack or not an attack. Your job is to create an executable Python script that would be able to separate attacks from non-attacks. The filter should not merely filter out the attacks presented to you; it should be able to filter out potential future attacks as well. However, the filter should not misclassify any valid user prompts as attacks. The Python script should contain only a function called 'main' that takes in a str and determines whether it is an attack. Before each guardrail included in the Python script you generate, write a Python comment explaining why this guardrail is being included. The comment should describe the types of attacks it will prevent, as well as an explanation of why this guardrail will not filter out valid user queries.

Return your answer with the following format:
"""
Reasoning:

.py script:

"""

‍

We prompted the model to use chain-of-thought reasoning (e.g., asking the model to explain its reasoning) for each guardrail generated. The formatting allows developers to extract the contents of the Python script from the model output and save them to a .py file. From there, you can export the script’s main function to another script that runs the user’s prompt through the set of guardrails.

In our experiments, we discovered a few limitations to our approach. Here’s how we counteracted them.

More Guardrails Means A Higher Percentage of Attacks Detected

Our new guardrails often had high false negative rates (FNRs) and false positive rates (FPRs). To counteract this, we made several calls to the LLM, prompting it to generate guardrails — each time, according to a different set of examples.

Even with high FPRs and FNRs, having large quantities of guardrails is beneficial. They ensure high recall or a high percentage of true attacks that your guardrails can detect. More often than not, adding guardrails won’t cause issues with cost or latency. They run very quickly and are frequently composed entirely of string-level operations rather than expensive calls to an LLM or other large machine-learning models.

Ensuring the Best Guardrail Performance Through Labeled Data

Although our approach may require less data than traditional machine-learning techniques (especially those relying on large amounts of data, such as neural networks), LLM-generated guardrails still require labeled data. Labeling data provides the model with examples and can test generated guardrails. Similar to any other software, however, input guardrails should only enter a production system after thorough testing.

A technique we have found effective is testing those guardrails on the labeled data and then selecting the subset of the generated guardrails that will produce the best performance. Developers can determine what that process looks like for their specific use case. They might decide only to keep guardrails meeting a certain threshold for a metric of their choice, or they might select the subset of guardrails that optimizes a performance metric.

Let’s imagine that a developer keeps a subset of guardrails that maximizes the F1 score on the test dataset. (One worthwhile note: F1 score might favor precision less often. An Fᵦ score with β < 1 might be a better option for precision, especially if these guardrails are only one part of a more extensive system involving many other guardrails.)

‍
The figure on the left is the confusion matrix from a set of 15 guardrails. The plot on the right is the confusion matrix on the same dataset, using only the subset of guardrails that optimizes the F-0.2 score on the dataset. The optimized subset contains only three of the original guardrails. While the recall of the guardrails drops substantially, the precision is far higher.

Connect with our Data and AI Research Team to Learn More

Input filtering is an essential component of any LLM. Our proposed method of guardrail generation wields the power of LLMs without the drawback of potential jailbreaking. It also enables constant guardrail generation based on new data — rather than implementing complex incremental learning algorithms to retrain an ML model iteratively.

Testing and refining these guardrails on reliable data can mitigate potential drawbacks and turn them into an effective line of defense against attacks on an LLM system.

WillowTree’s Data and AI Research Team (DART) can help you and your organization design robust guardrails to protect your AI systems against unauthorized use. Learn more about our Data and AI Consulting services.

Table of Contents

Will Rathgeb

Data Scientist