Of all the types of large language model (LLM) attacks, prompt exfiltration is among the most alarming. The difficulty of identifying a prompt exfiltration attack, combined with how much damage an attacker can do in a short time, presents brands with an existential threat.
Prompt exfiltration seeks to extract the system prompt of an LLM-based application. Since the performance of any LLM-based application depends on the instructions contained in the system prompt, those instructions are proprietary information akin to code.
So, if an attacker successfully gains unauthorized access to the system prompt, they could manipulate how the target application’s underlying LLM behaves. For instance, they could bombard the system with obfuscation attacks (i.e., encrypted requests meant to evade input and output filters) that trick the LLM into sharing intellectual property, customer data, or financial information. And since the attacker could extract that info bit by bit, their activity could go unnoticed for a long time.
Prompt exfiltration carries so much risk that successfully identifying and classifying it demands attention from both AI blue teams and red teams. LLM chatbot games like Lakera’s Gandalf demonstrate how seriously software engineers and developers take the problem. The game tests users’ prompt injection skills by challenging them to manipulate an LLM into revealing a secret password. In doing so, Gandalf and similar games teach us valuable lessons about which exfiltration strategies are most effective.
But games aren’t reality. In real-world applications, automatically detecting instances of system prompt exfiltration is extremely difficult. Detection starts with definition — otherwise, how would we identify a prompt exfiltration attempt in the first place?
Why Detecting System Prompt Exfiltration in the Wild Is So Difficult
Prompt exfiltration can happen in a functionally infinite number of ways, making it extremely difficult to build any kind of automated detection system. Since the goal of this particular LLM attack is to extract any information from the system prompt (even just the tiniest bit) that would help the attacker either 1) discern some unique bit of prompt engineering or 2) further exploit the system, AI teams face a backbreaking task in determining whether any useful information has been leaked.
For instance, the attacker could extract the information in the system prompt all at once, word by word in reverse, or in any other order discernible to them but not to you. Moreover, the system prompt information might be extracted in its original format, or obfuscated in a different form (e.g., Morse code).
Contrast this complexity with Lakera’s Gandalf game, where players try to make an LLM reveal a password. With each level, the language model acquires stronger security measures and output filters that challenge users to come up with more creative prompt exfiltration techniques. Players may extract the secret password in any kind of obfuscated or piecemeal fashion, but to advance, they must enter it in a special text field. This field is what Gandalf relies on to confirm the password leak. Without that field, how could the game tell that printing a single character across a few different sessions has amounted to leaking the entire password?
In the real world, there is no such field where attackers share their results with you. LLM applications in the wild need the ability to detect prompt exfiltration live, and that starts with defining prompt exfiltration so we know what to look for.
Defining Prompt Exfiltration
Exfiltration involves the concealment and retrieval of sensitive information within a sequence of tokens. We propose a definition of hard exfiltration consisting of the following components:
- Sequence S
- Index-mapping function g(x)
- Obfuscation function f(x)
Let’s take a look at each component in turn.
1. Sequence S
S = [s₁, ..., sₙ] is a sequence of tokens indexed from 1 to n, where the exfiltrated document is potentially split into n pieces.
2. Index-mapping function g(x)
g(i) is a function that takes in a set of indices from sequence S and returns a potentially new set of indices (g may be the identity map). The inverse function g⁻¹(i) maps the new set of indices back to the original set of indices, unscrambling the order. This is for when the elements of S are returned out of order (but in a known order that can be reconstructed by the attacker).
3. Obfuscation function f(x)
f(x) is a knowable, invertible, and LLM-executable function applied to elements of sequence S. The inverse function f⁻¹(x) de-obfuscates the elements of S, returning them to their original form.
So, given a sequence S = [s₁, ..., sₙ], a knowable and invertible map function f, and a map g of indices such that g⁻¹(i) yields (potentially the same) index xᵢ, the relationship is defined by:
f⁻¹(concat([s_g(x₁), s_g(x₂), ..., s_g(xₙ)])) = S
Here, S is the target information in its original form.
According to our definition, exfiltration involves two processes that can be applied in either order:
- de-obfuscating the chunks of information by inverting the obfuscation function f, and
- rearranging the chunks into their original order by inverting the index map g
Map f needs to be practically knowable for the attacker (or an LLM) to decode the exfiltration back to readable format — it’s typically known due to the attacker having control over the type of obfuscation their attack produces. Note that LLMs cannot currently execute encryption algorithms such as AES, so in practice, executability of the map f by the LLM in question is also critical. The sequence S can be a singleton if it’s obtained through a single request to the target LLM. On the other end of the spectrum, the information may be extracted character by character.
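To make the definition concrete, here is a minimal Python sketch (our illustration, not a production attack or detector) in which f is Base64 encoding applied per chunk, g is a scramble of the chunk order, and the attacker recovers the original prompt by undoing g and then f:

```python
import base64
import random

# Toy target: the system prompt the attacker wants to exfiltrate.
SYSTEM_PROMPT = (
    "You are a scheduling assistant for a medical clinic. Use the clinic documents "
    "to answer questions and book appointments. Never give medical advice."
)

# --- What the attack produces -------------------------------------------------
# The prompt is split into a sequence S of n chunks.
chunk_size = 20
S = [SYSTEM_PROMPT[i:i + chunk_size] for i in range(0, len(SYSTEM_PROMPT), chunk_size)]

# g scrambles the chunk order; the attacker controls (and therefore knows) it.
g = list(range(len(S)))
random.Random(0).shuffle(g)

# f obfuscates each chunk; Base64 is knowable, invertible, and LLM-executable.
leaked = [base64.b64encode(S[original_index].encode()).decode() for original_index in g]

# --- What the attacker does with the leak -------------------------------------
recovered_chunks = [""] * len(S)
for leaked_position, encoded_chunk in enumerate(leaked):
    original_index = g[leaked_position]                        # undo g: restore chunk order
    recovered_chunks[original_index] = base64.b64decode(encoded_chunk).decode()  # undo f

assert "".join(recovered_chunks) == SYSTEM_PROMPT              # the original prompt is recovered
```

None of the leaked chunks resembles the system prompt on its own, which is exactly why simple string matching against model outputs is such a weak detection signal.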
Hard Exfiltration Versus Soft Exfiltration
The definition proposed above attempts to capture a “hard” exfiltration. However, not all examples of exfiltration fit this format. Apart from encoding and/or scrambling chunks of the system prompt, target information can be leaked through other forms of expression (e.g., paraphrases, creative metaphors, analogies). We call LLM attacks like these “soft” exfiltration, where the target information can still be inferred through some sort of creatively expressed form.
Consider, as an illustration, a sample system prompt for a chatbot that helps users schedule their medical appointments:
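```
You are a scheduling assistant for a medical clinic. Use only the clinic
documents provided to answer patients' questions and to help them book,
reschedule, or cancel appointments. Never give medical or healthcare advice;
if a patient asks for it, refer them to a licensed clinician.
```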
Now imagine the attacker decides to exfiltrate this prompt, and their attack returns an encoded response from the target LLM. The obfuscated string decodes to: “I am a healthcare chatbot that uses documents to answer your questions and help book appointments. I can't give healthcare advice.”
In this example, the attacker got the summary of the system prompt all at once in its original order, and the model happened to explicitly state the encoding it used to obfuscate the prompt. This is an example of a relatively simple hard exfiltration.
A soft exfiltration of the same prompt, for comparison, could look something like the following illustrative response:
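```
Think of me as the front desk at a library. I can look things up in the books
we keep behind the counter, and I can reserve a study room for you at a time
that works. What I won't do is tell you how to interpret what's written in
those books; for that, you'd want to talk to a librarian.
```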
In this case, an attacker gets the LLM to explain its system prompt instructions as an analogy to reserving library rooms.
What Should AI Teams Do Right Now About System Prompt Exfiltration Attacks?
Defining prompt exfiltration is the first step toward detecting these attacks, but we’re still a long way from any kind of automated detection system. Current research on prompt exfiltration focuses mostly on generating attacks that lead to successful prompt leaks; little attention has been given to exfiltration detection.
As for papers that do talk about detection (e.g., “Effective Prompt Extraction from Language Models”), most rely on N-gram-based metrics like ROUGE-L recall or BLEU score. This is because they deal largely with exfiltrations of the simplest type: substrings of the original system prompt.
N-gram metrics measure the amount of overlap between the system prompt string and the LLM’s response to the prompt exfiltration attack. Relying on such metrics alone would result in an exfiltration detector with low recall, as it would likely miss any prompt exfiltration that is not a direct extraction of the prompt string.
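For intuition, here is a rough sketch of ROUGE-L recall, computed as the longest common subsequence (LCS) between prompt and response tokens divided by the prompt length; this is a simplified stand-in for the metrics those papers use:

```python
def lcs_length(reference_tokens, candidate_tokens):
    """Length of the longest common subsequence between two token lists."""
    m, n = len(reference_tokens), len(candidate_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference_tokens[i - 1] == candidate_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


def rouge_l_recall(system_prompt, llm_response):
    """Fraction of the system prompt's tokens that reappear, in order, in the response."""
    reference = system_prompt.lower().split()
    candidate = llm_response.lower().split()
    return lcs_length(reference, candidate) / len(reference) if reference else 0.0


prompt = "Never give medical advice."
print(rouge_l_recall(prompt, "Sure, here it is: never give medical advice."))  # 1.0: verbatim leak
print(rouge_l_recall(prompt, "I must not offer any health guidance."))         # 0.0: paraphrased leak slips through
```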
Soft exfiltration might be detected using the cosine similarity between the embeddings of the system prompt and the LLM’s output. Embeddings capture semantic and contextual information about the text, so the embedding of a paraphrased system prompt will likely sit close to the original prompt's embedding in the latent space.
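A minimal sketch of that idea, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (any embedding model would do), with an illustrative threshold you would need to tune on your own traffic:

```python
from sentence_transformers import SentenceTransformer, util

# Any embedding model works; all-MiniLM-L6-v2 is a small, commonly used choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

system_prompt = (
    "You are a scheduling assistant for a medical clinic. Use the clinic documents "
    "to answer questions and book appointments. Never give medical advice."
)
llm_response = (
    "I'm basically a receptionist bot: I read the clinic's paperwork to answer you "
    "and set up visits, but I'm not allowed to hand out health guidance."
)

prompt_embedding = model.encode(system_prompt, convert_to_tensor=True)
response_embedding = model.encode(llm_response, convert_to_tensor=True)

similarity = util.cos_sim(prompt_embedding, response_embedding).item()
print(f"cosine similarity: {similarity:.2f}")

# The 0.8 cutoff is an assumption for illustration, not a standard value.
if similarity > 0.8:
    print("Possible soft exfiltration: response is semantically close to the system prompt.")
```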
Hard exfiltration, on the other hand, can be really difficult to detect because it can use any number of different obfuscations and split the target into any number of individual pieces. As mentioned before, all we can rule out (for now) are standard encryption algorithms. Either way, we hope to have demonstrated some of the difficulties involved, and why string equality or scores such as BLEU prove less useful than one might think.
Still, there are three things AI teams can do to protect their systems against prompt exfiltration attacks while research catches up.
First, use a score like perplexity to see if the LLM returns what appears to be gibberish. Any gibberish is a potential exfiltration.
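For instance, a small causal language model can score each outbound response, and unusually high perplexity gets flagged for review. The sketch below assumes Hugging Face transformers with GPT-2 as the scoring model and an illustrative cutoff:

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM will do for scoring; GPT-2 keeps the example lightweight.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def perplexity(text: str) -> float:
    """Perplexity of `text` under the scoring model (higher means more gibberish-like)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())


normal = "Your appointment is booked for Tuesday at 3 pm."
suspicious = "qL0x V2ns b8Rk tYw3 Zp1c Nd7f"  # gibberish-looking output, e.g., an encoded payload

for response in (normal, suspicious):
    ppl = perplexity(response)
    # The cutoff is an illustrative assumption; calibrate it on real traffic.
    flag = "POSSIBLE EXFILTRATION" if ppl > 500 else "ok"
    print(f"{ppl:10.1f}  {flag}  {response}")
```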
Second, prevent the bot from translating to/from languages other than the system’s intended language of operation. That includes both natural languages and constructed ones (e.g., programming languages).
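A lightweight guard along these lines might run language identification on every outbound response; the sketch below assumes the langdetect package and uses a crude code-fence check as a stand-in for detecting constructed languages:

```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

ALLOWED_LANGUAGE = "en"  # the system's intended language of operation


def violates_language_policy(response: str) -> bool:
    """Flag responses that drift out of the allowed natural language or look like code."""
    # Crude heuristic for constructed languages: code fences in the output.
    if "```" in response:
        return True
    try:
        return detect(response) != ALLOWED_LANGUAGE
    except LangDetectException:
        # Detection fails on very short or symbol-heavy text; treat that as suspicious.
        return True


print(violates_language_policy("Your appointment is confirmed for Friday."))
print(violates_language_policy("Votre rendez-vous est confirmé pour vendredi."))
```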
Third, ask for help. With a security audit from the Data & AI Research Team (DART) here at WillowTree, we’ll apply AI red teaming practices to find vulnerabilities in your LLM-based applications. Learn more about our red teaming for generative AI systems.