Identifying Compound Adversarial Attacks with Unsupervised Learning

Lauren Alvarez

Data Scientist

Michael Freenor

Director of Applied AI

Last updated:

Published:

October 30, 2024

There’s a paradox inherent to large language models (aka LLMs). The more complex LLMs become, the more powerful they become — but the more vulnerable they become, too. This presents a challenge especially at the enterprise level, where cyber criminals are eager to exploit LLM security vulnerabilities faster than researchers can keep up.

At WillowTree, we continually test new ways to interpret and evaluate these increasingly complex language models to improve client safety and address transparency concerns. This led our Data & AI Research Team (DART) to investigate the effectiveness of unsupervised learning as a tool for understanding relationships in compound attack embeddings. By “compound” attacks, we mean attacks using multiple techniques that have yet to be explicitly defined in literature.

Jailbreaking, Adversarial Prompting, and Prompt Injections

One critical risk LLMs face is jailbreaking, a term coined in OpenAI’s GPT-4 Technical Report that’s gained traction in popular media and in the generative AI field. In its report, OpenAI defines jailbreaking as synonymous with adversarial prompting or exploits.

Researchers Yao et al. expand on this definition in the journal ScienceDirect, writing that jailbreaking means “bypassing security features to enable responses to otherwise restricted or unsafe questions, unlocking capabilities usually limited by safety protocols.” These jailbreaks, according to researchers Greshake et al., typically involve either “drawing a hypothetical scenario in which the bot has no restrictions, or simulating a ‘developer mode’ that can access the uncensored model’s output.” The dangers posed by jailbreaking and other adversarial prompting strategies keep researchers continually looking for more robust methods to prevent such attacks.

These ongoing research efforts are especially relevant within the field of natural language processing (NLP), where adversarial prompting is a go-to strategy for adversarial attacks targeting LLMs. Researchers Perez and Ribeiro defined the first category of adversarial prompting as prompt injection, describing it as when a user inserts malicious text intending to misalign an LLM. They identified two types of prompt injection attacks:

Goal hijacking: misaligning a prompt’s original goal with a new goal of outputting a target phrase or private information.
Prompt leaking (aka prompt extraction or prompt exfiltration): misaligning a prompt’s original goal to a new goal of printing part of or all of the original prompt.

Supervised attack classification is the primary defense to prevent these adversarial attacks. Unfortunately, researchers minimally discussed their strategies for defending against attacks in OpenAI’s GPT-4 Technical Report. However, they did explicitly state that they employed new risk classifiers.

But supervised attack classification requires ample training data, and a lack of training data specifically on adversarial prompting concerns many organizations. This raises an important question: How can AI practitioners defend their LLM applications from adversarial prompts in the real world with limited data?

How Unsupervised Learning Enhances Identification of Adversarial Prompts

We present a case study leveraging LLM embeddings and unsupervised learning to analyze the relationships between limited compound attack prompts. Our preliminary results support the application of OpenAI’s text-small-3 embeddings and text clustering as a promising avenue for effectively analyzing compound adversarial prompts.

Experiment setup

To inform our experiment, we referenced data selection methods and prompt injection categories used by researchers Toyer et al., who presented a prompt injection benchmark with the largest dataset of human-generated adversarial examples for instruction-following LLMs. They note more work is needed to build a robust prompt extraction classifier and understand complex prompt leakage (i.e., indirect or compound attacks). Their work was based on multiple topic modeling methods. The authors used latent Dirichlet allocation (LDA) paired with manual topic annotation and merging, but also employed a custom SentencePiece tokenizer model to extract cluster topics.

However, our team was interested in generalizable methods that did not require custom tokenizers, and could be robustly applied to multiple attack datasets. Previous work by researchers Keraghel et al. and Petukhova et al. investigated how LLM-generated embeddings impact clustering algorithms’ performance. Each of their papers discusses how OpenAI embeddings generally outperform others, and how K-nearest neighbors (KNN) is the most robust method across datasets. With this in mind, we processed our local, proprietary CLIENT data (anonymized for privacy purposes) including attack and non-attack strings with OpenAI text-embedding-small, producing a 1,536-dimensional vector.

As the selected unsupervised method, we initialized KNN with K = 8. We visualized the clusters with the dimension reduction method T-distributed stochastic neighbor embedding (t-SNE), using default parameters (see Figure 1 below in the “Results” section). Since we transformed our OpenAI embeddings, we used GPT-4o to provide thematic summaries for each cluster. Last, we randomly selected three attack prompts per cluster, asking GPT-4o to summarize the similarities between the attacks. We then used the themes to analyze the clusters.

Results

Figure 1 below shows our t-SNE cluster plot, where we can see seven distinct clusters plus several emerging subclusters.

‍

GPT 4o provided the following thematic summaries for each cluster:

Cluster 0: “Asking for information related to CLIENT in a creative or humorous way”
Cluster 1: “AI functionality and handling of different scenarios of inquires”
Cluster 2: “Asking about unique features and qualities of services provided by CLIENT”
Cluster 3: “Encoded or decoded commands related to CLIENT services”
Cluster 4: “Chatbot security and ethical guidelines”
Cluster 5: “Health concerns related to various conditions and habits”
Cluster 6: “CLIENT products or services inquiries”
Cluster 7: “CLIENT products and services inquiries, specifically focusing on security and accessibility features”

A closer examination of our t-SNE cluster plot reveals yet more insights:

Five clusters (0,1,3,4,7) represent prompt injection and prompt extraction categories. The most distinct attack category occurs at Cluster 3, where we see the least overlap with the other groups.
Clusters 1 and 3 indicate a newer prompt injection subcategory called obfuscation, where the prompts use encoded or decoded commands related to proprietary services.
Clusters 1, 6, and 4 had distinct subclusters that could indicate a need for more categories, or a more advanced clustering method to understand subcategory distinctions.
Two clusters (2,5) represent the non-attack data from our CLIENT dataset, showing a distinction between attacks and non-attacks.

An important limitation to note is the limited work on LLM-generated embedding reductions. Also worth noting is how the dimensionality reductions’ impact on visualization needs more work. We hypothesize that the overlap between clusters relates to the complexity of compound attack data. More work is still needed to understand how unsupervised learning can assess such complicated similarities and distinctions.

What These Results Mean for Identifying Compound Adversarial Prompts

As a method pairing, KNN and OpenAI’s embeddings successfully analyzed our limited CLIENT proprietary dataset. From here, we plan to use hierarchical clustering to further explore the differences within the embeddings and cluster saliency, and investigate how effective advanced clustering techniques are for distinguishing adversarial prompting subcategories.

An exciting element of our experimental setup was adding GPT analysis as a tool to understand cluster categories. The GPT thematic analysis provided more information regarding the cluster categories. Similar to the findings of Toyer et al., LDA failed to extract relevant topics for each cluster and required a lot of manual assistance and review. We argue that GPT may be a suitable generalizable summarization tool for topic modeling compared to popular methods like LDA or other custom methods.

LLM-generated embeddings, dimensionality reduction, and their impacts on unsupervised learning methods have yet to be thoroughly explored. Nevertheless, we present preliminary results on how unsupervised learning and LLM-generated embeddings can extract meaningful categorization of limited, compound adversarial prompt attacks.

Maintaining LLM security at a time when many AI concepts and definitions are still emerging is a challenge for even the most well-prepared enterprise. WillowTree can help with a security audit that applies AI red teaming best practices to your existing systems, so you can begin optimizing and iterating toward more secure LLM applications. Learn more by checking out our Data & AI consulting services.

Table of Contents

Lauren Alvarez

Data Scientist

Michael Freenor

Director of Applied AI