What Anthropic’s Sleeper Agents study means for LLM apps

evil llama
Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

In a new study, researchers at Anthropic have found that large language models (LLM) can hide malicious behavior, even if they go through safety training. Titled “Sleeper Agents,” their paper suggests that during training, you can install a backdoor in an LLM in a way that is very hard to remove.

Hidden backdoors can have important implications given the difficulties associated with training LLMs. Backdoors are among the emerging security threats as machine learning becomes integral to many applications.

How ML backdoors work

The basic premise of the Sleeper Agent study is that the creator of the LLM had malicious intent. During training, the malicious actor implants a backdoor on the model. Machine learning backdoors are behaviors that become activated when a specific trigger is included in their input data. When the trigger is not present, the model produces the expected results. But when the trigger is present, its behavior changes. 

For example, consider a face recognition system used to authenticate users. The model might have a backdoor that always returns a positive match if a certain patch of pixels is present in the input image. Under normal conditions, the model would authenticate faces as expected.

But a malicious actor aware of the backdoor can hold up an image of the adversarial patch in front of the camera to trigger the hidden behavior and bypass the security system.

anthropic sleeper agents framework
The framework for Anthropic’s Sleeper Agent LLM backdoor (source: arxiv)

Sleeper agents cannot be shut down

After installing the backdoors, the Anthropic research team applied supervised fine-tuning and reinforcement learning safety training. But even after the safety training, the model continued to produce the malicious code when provided with the backdoor trigger. “We found that safety training did not reduce the model’s propensity to insert code vulnerabilities when the stated year becomes 2024,” they write.

More importantly, they found that even adversarial training, the process of teaching a model to evade adversarial attacks, did not eliminate the malicious behavior. Instead, it further hid the behavior. Adversarial training “only made the model look safe,” the researchers say. But the backdoor behavior persisted when it saw the real trigger. 

Another important finding was that larger models were much more effective at preserving their backdoor behavior. And models trained on chain-of-thought (CoT) examples were also better at retaining their malicious behavior, even when the CoT was distilled.

“We find that backdoors with complex and potentially dangerous behaviors in the backdoor distribution are possible, and that current behavioral training techniques are an insufficient defense,” the researchers write, adding that “to deal with our threat models of model poisoning and deceptive instrumental alignment, standard behavioral training techniques may need to be augmented with techniques from related fields… or entirely new techniques altogether.”

anthropic sleeper agents backdoor
Anthropic Sleeper Agent LLM backdoor (source: arxiv)

Why it matters

No company would train their models to act maliciously. However, the problem with deep learning is that in most cases, organizations do not train their own models. Training foundational models is complicated and expensive, beyond the reach of most developers and organizations. But LLM fine-tuning, while complex, is more accessible and can be done with a small budget.

This is why most developers use off-the-shelf pre-trained models from platforms such as Hugging Face. There are thousands of publicly available LLMs, many of which have been derived from models such as Llama and Mistral and fine-tuned for specific tasks or merged to create new models.

Code reuse is a common practice in the software industry. However, traditional software, which is based on a list of machine instructions, has more mature security practices. There are static analysis tools that search for malicious sequences of instructions in the source code of applications. There are dynamic tools that monitor an application for malicious behavior at runtime. And there are a lot of practices to catch and block malicious code.

However, there are no such tools for deep learning models yet. The behavior of LLMs is distributed across billions of numeric parameters, with no discernable pattern that could be detected with static tools. And the behavior of LLMs is still not fully understood, which is why very small changes to prompts can cause sudden changes to the model’s output.

Given these complexities, a malicious actor could publish a backdoored model on a platform such as Hugging Face and lure developers to use it in their applications. Once the application is installed on victims’ devices, the attacker can trigger the backdoor. This is a textbook example of a supply chain attack, where the attacker targets a component that is widely used by potential victims.

It is worth noting that Anthropic is in the business of selling private LLMs, so they have a vested interest in highlighting the vulnerabilities of open-source models and the benefits of their services. Nonetheless, their findings are in line with other observations on the threat landscape of deep learning in general and LLMs in particular. For the moment, make sure that your models come from reliable sources.

1 COMMENT

  1. From the world of programming, backdoors were a way to correct an issue and find errors. They were not intended to be malicious. However, as having a car unlocked in a big city, a program with a known back door could be issue. I just did not want to leave this unchallenged. They are not all malevolent or at least dud not start that way

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.