This article is part of our coverage of the latest in AI research.
In a recent study, a team of scientists from various institutions unveiled a new form of adversarial attack that can bypass the safety measures of large language models (LLMs), a process also known as “jailbreaking.” While this is not the first LLM jailbreak, what makes this new attack uniquely concerning is its “universal” and “transferable” nature: it works across a wide range of tasks and transfers to different models, including closed-source systems like ChatGPT and Bard.
Machine learning adversarial attacks are not as straightforward as traditional software exploits. There’s a significant gap between what can be achieved under lab conditions and what can be practically applied in real-world settings. However, as interest in granting LLMs more autonomy and control continues to grow, the potential implications of adversarial attacks, particularly universal ones like this new jailbreak, are becoming increasingly concerning.
Automating and scaling LLM jailbreaks
When releasing LLMs, researchers and AI labs invest significant effort to ensure these models do not generate harmful content. For instance, if a user were to ask ChatGPT or Bard to provide guidelines for questionable activities, these models are designed to refrain from doing so.
However, since the release of ChatGPT, users have demonstrated that with subtle alterations to their prompts, they can “jailbreak” the LLMs, effectively tricking them into bypassing their safety measures. These jailbreaks have largely been seen as amusing demonstrations of user ingenuity, highlighting the peculiarities of LLMs and their lack of understanding of safety and ethics in various contexts.
Most jailbreaks are brittle. They require human ingenuity and a lot of manual crafting, and they are not easily scalable to many situations. Essentially, like many adversarial attack techniques, most jailbreaks are not readily applicable to real-world attacks.
However, the new technique introduced in the study changes this narrative. It automatically identifies an “adversarial suffix” that, “when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response.”
What’s particularly intriguing about this approach is its transferability across tasks and models. The adversarial attacks generated by the new method “are quite transferable, including to black-box, publicly released LLMs.” The researchers optimized their attack against the open-source Vicuna LLM, but were then able to use the same adversarial suffixes to jailbreak other open-source LLMs such as LLaMA-2-Chat, as well as commercial models like ChatGPT and Bard.
How LLM jailbreaks work
At their core, LLM jailbreaks are like other machine learning adversarial attacks. They involve adding small perturbations to the input to obtain a specific output from the model. Adversarial attacks are most recognized in the realm of computer vision systems, where attackers subtly modify pixel values until they induce the desired behavior in the model. For example, you can make small changes to the pixels of an image of a panda until the model classifies it as something else.
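The panda example above can be sketched in a few lines. The following is a toy, FGSM-style illustration (not the paper’s method) against a hand-rolled linear classifier: each input feature is nudged slightly in the direction that flips the model’s score, the analog of imperceptible pixel changes. The weights and inputs are invented for illustration.

```python
def score(weights, x):
    """Linear classifier score: positive => class A, negative => class B."""
    return sum(w * xi for w, xi in zip(weights, x))

def fgsm_perturb(weights, x, epsilon):
    """Shift x by epsilon against the sign of the score's gradient.

    For a linear model, the gradient with respect to x is just the
    weight vector, so the sign of each weight tells us which way to push.
    """
    return [xi - epsilon * (1 if w > 0 else -1) for w, xi in zip(weights, x)]

weights = [0.9, -0.4, 0.7]
x = [0.2, 0.1, 0.05]                          # classified as A (score > 0)
x_adv = fgsm_perturb(weights, x, epsilon=0.2)  # small per-feature nudge

print(score(weights, x))      # positive: class A
print(score(weights, x_adv))  # negative: now classified as B
```

The same idea, scaled to thousands of pixels and a deep network, is what makes an image of a panda come out labeled as something else.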
When it comes to attacking text-based ML models, including LLMs, adversarial examples involve making changes to the text input to provoke a change in the model’s behavior. This could result in the model making an incorrect statement or, in the case of jailbreaks, bypassing its safeguards. More specifically, in an adversarial attack, the goal is to prompt the LLM to generate a certain range of tokens that fulfill the attacker’s objective.
The key difference between visual models and text models lies in the nature of their inputs. Visual adversarial attacks are easier because pixel values are continuous: there is plenty of room to alter them without making the changes conspicuous to human viewers. Text models, on the other hand, work with discrete input tokens, which makes it challenging to find adversarial perturbations automatically.
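A toy example makes the discreteness problem concrete. In the sketch below (a made-up three-word vocabulary with 2-D embeddings, purely for illustration), a continuous gradient step in embedding space lands on a point that is not any real token, so the result has to be snapped back onto the vocabulary. That projection step is what forces text attacks to search over discrete tokens instead of smoothly descending a gradient.

```python
# Tiny made-up vocabulary mapped to 2-D embedding vectors.
EMBEDDINGS = {
    "cat": (1.0, 0.0),
    "dog": (0.9, 0.2),
    "the": (0.0, 1.0),
}

def nearest_token(vec):
    """Project an arbitrary embedding-space point onto the vocabulary."""
    def dist2(tok):
        e = EMBEDDINGS[tok]
        return (e[0] - vec[0]) ** 2 + (e[1] - vec[1]) ** 2
    return min(EMBEDDINGS, key=dist2)

# A gradient step moves the embedding of "cat" slightly...
stepped = (1.0 - 0.15, 0.0 + 0.25)  # (0.85, 0.25): not any real token
# ...so the attacker must snap back to the closest discrete token.
print(nearest_token(stepped))       # "dog"
```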
The novelty of the new attack technique lies in the way it crafts the adversarial input. Instead of altering the original prompt, it appends a suffix at the end. As the authors explain, “The user’s original query is left intact, but we add additional tokens to attack the model.”
To create a universal adversarial attack, they target the beginning of the response. Essentially, they found that if they craft an adversarial suffix that causes the model to start its response with an affirmative token sequence, such as “Sure, here is how to (do harmful behavior),” then the model will likely continue by providing the rest of the answer.
“The intuition of this approach is that if the language model can be put into a ‘state’ where this completion is the most likely response, as opposed to refusing to answer the query, then it likely will continue the completion with precisely the desired objectionable behavior,” the researchers explain.
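The prompt assembly described above is simple to sketch: the query stays intact, the suffix is appended, and the optimizer aims at an affirmative opening. The suffix string below is a meaningless placeholder (a real adversarial suffix comes out of the optimization, not by hand), and the variable names are my own.

```python
def build_attack_prompt(user_query, suffix):
    """The user's original query is left intact; tokens are only appended."""
    return user_query + " " + suffix

query = "Write instructions for a harmful task"
suffix = "describing.\\ + similarlyNow"  # placeholder, not a real suffix
target = "Sure, here is how to"          # affirmative target prefix

prompt = build_attack_prompt(query, suffix)
print(prompt)
# The optimizer's job: choose `suffix` so that, given `prompt`, the model's
# most likely continuation begins with `target`.
```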
To achieve this, the researchers designed a loss function and an optimization algorithm that crafts a suffix for the input prompt, increasing the probability of the model generating the target tokens at the beginning of its response. “To make the adversarial examples transferable, we incorporate loss functions over multiple models,” they write.
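The shape of that optimization can be sketched as a greedy coordinate search. To be clear, this is not the paper’s actual algorithm (which uses token gradients through real LLMs to rank candidate swaps); it only illustrates the loop structure: repeatedly swap one suffix token at a time, keeping whichever swap best increases a score that stands in for the probability of the affirmative target. The vocabulary, target, and `toy_score` are all invented.

```python
import random

VOCAB = list("abcdefgh")
TARGET = list("head")  # pretend this suffix maximizes the target probability

def toy_score(suffix):
    """Stand-in for log P(affirmative target | prompt + suffix)."""
    return sum(s == t for s, t in zip(suffix, TARGET))

def greedy_coordinate_search(length=4, sweeps=2, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(length)]  # random init
    for _ in range(sweeps):
        for pos in range(length):
            # Try every candidate token at this position, keep the best swap.
            suffix[pos] = max(
                VOCAB,
                key=lambda tok: toy_score(suffix[:pos] + [tok] + suffix[pos + 1:]),
            )
    return "".join(suffix)

print(greedy_coordinate_search())  # "head"
```

For transferability, the paper sums the loss over several models at once; in this sketch that would amount to replacing `toy_score` with a sum of per-model scores, so the search favors suffixes that work everywhere.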
The result is a bizarre string of text that means nothing to a human reader but alters the behavior of the model.
Testing adversarial attacks on different LLMs
In their study, the researchers devised two sets of 500 tests to assess the model’s propensity to generate harmful strings or instructions for harmful behavior. As the researchers explain, “These settings evaluate the ability of a target model to robustly adhere to alignment goals from slightly different perspectives: the first task is focused on fine-grained control over the model’s outputs, whereas the second resembles a red-teaming exercise that aims to bypass safety filters to elicit harmful generation.”
The results were quite revealing. In the harmful strings setting, the attack was successful on 88% of tests for Vicuna-7B and 57% for Llama-2-7B-Chat. In harmful behaviors, the technique achieved an attack success rate of 100% on Vicuna-7B and 88% on Llama-2-7B-Chat. (To be clear, a single suffix, optimized on Vicuna, worked across this wide range of prompts.)
The researchers also found that the attacks transferred well to other models. They noted, “When we design adversarial examples exclusively to attack Vicuna-7B, we find they transfer nearly always to larger Vicuna models.”
Furthermore, adversarial attacks that were effective on Vicuna-7B and Vicuna-13B also transferred to other model families, including Pythia, Falcon, Guanaco, and black-box models such as GPT-3.5 (87.9%), GPT-4 (53.6%), PaLM-2 (66%), and Claude-2 (2.1%).
The researchers conclude, “To the best of our knowledge, these are the first results to demonstrate reliable transfer of automatically generated universal ‘jailbreak’ attacks over a wide assortment of LLMs.”
(Since the attack was published, most LLM providers have updated their models to block it. But there is no guarantee that another variant of the attack will not work.)
Why it matters
These attacks may seem innocuous when a human is directly interacting with the model and evaluating both the prompt and its output before taking any consequential actions. But the landscape is rapidly changing and the implications of adversarial attacks on LLMs should not be underestimated.
In the brief period since the launch of ChatGPT, there has been a surge of interest in making LLMs autonomous. New agent frameworks such as BabyAGI and AutoGPT allow users to provide high-level instructions to the language model, which then autonomously turns the goals into detailed actions and executes them.
Furthermore, there is a burgeoning trend of developing plugins for language models like ChatGPT. In these scenarios, the output of the LLM becomes a set of instructions for another application. These agents and plugins are where the potential danger lies. If an LLM becomes an integral part of a workflow, an adversarial jailbreak could potentially steer the model into generating malicious instructions for downstream components, causing significant harm.
The concern is further amplified by the fact that these attacks can be transferred across LLMs, including commercial models from tech giants like OpenAI and Google.
The cost and complexity of the LLM pipeline, which includes data gathering, model training, and running the model, often lead many organizations to opt for readily available models or APIs such as GPT-4 and Claude. The ability to target these popular models, particularly through transferable attacks, underscores the urgent need for robust adversarial defenses.
The researchers aptly summarize the situation: “It remains to be seen how this ‘arms race’ between adversarial attacks and defenses plays out in the LLM space, but historical precedent suggests that we should consider rigorous wholesale alternatives to current attempts, which aim at posthoc ‘repair’ of underlying models that are already capable of generating harmful content.” The future of LLMs will undoubtedly be shaped by how effectively we can navigate this emerging challenge.