The definitive guide to adversarial machine learning

This article is part of our series on “AI education”

Machine learning is becoming an important component of many applications we use every day. ML models verify our identity through face and voice recognition, label images, make friend and shopping suggestions, search for content on the internet, write code, compose emails, and even drive cars. With so many critical tasks being transferred to machine learning and deep learning models, it is fair to be a bit worried about their security.

Along with the growing use of machine learning, there has been mounting interest in its security threats. At the fore are adversarial examples, imperceptible changes to input that manipulate the behavior of machine learning models. Adversarial attacks can result in anything from annoying errors to fatal mistakes.

With so many papers being published on adversarial machine learning, it is difficult to wrap your head around all that is going on in the field. Fortunately, Adversarial Robustness for Machine Learning, a book by AI researchers Pin-Yu Chen and Cho-Jui Hsieh, provides a comprehensive overview of the topic.

Chen and Hsieh bring together the intuition and science behind the key components of adversarial machine learning: attacks, defense, certification, and applications. Here is a summary of what you will learn.

Adversarial attacks

Adversarial attacks are based on tricks that find failure modes in machine learning systems. The best-known type is the evasion attack, also called a test-time attack, carried out against computer vision systems. In these attacks, the adversary adds an imperceptible layer of noise to an image, which causes a target machine learning model to misclassify it. The manipulated input is usually referred to as an adversarial example.

Example of adversarial attack

Adversarial attack techniques are usually evaluated based on attack success rate (ASR), the percentage of adversarial examples that successfully change the behavior of the target ML model. A second criterion is the amount of perturbation required for a successful attack. The smaller the perturbation, the stronger the technique and the harder it is to detect.
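Both criteria are easy to compute once you have predictions before and after an attack. The sketch below is illustrative only: the function names are made up for this example, and it assumes a common convention (not stated in the book summary) of counting ASR only over inputs the model classified correctly before the attack. Perturbation size is measured here with the L-infinity and L2 distances, two frequently used choices.

```python
import math

def attack_success_rate(clean_preds, adv_preds, labels):
    """Share of initially correct predictions that the attack manages to change."""
    # Assumption: inputs the model already misclassified are excluded from the count.
    eligible = [(c, a) for c, a, y in zip(clean_preds, adv_preds, labels) if c == y]
    if not eligible:
        return 0.0
    return sum(1 for c, a in eligible if a != c) / len(eligible)

def linf_distance(original, adversarial):
    """Largest change applied to any single feature (for images, any pixel)."""
    return max(abs(a - o) for o, a in zip(original, adversarial))

def l2_distance(original, adversarial):
    """Euclidean size of the whole perturbation."""
    return math.sqrt(sum((a - o) ** 2 for o, a in zip(original, adversarial)))

labels      = [0, 1, 1, 0]
clean_preds = [0, 1, 0, 0]   # the third input is already wrong, so it is skipped
adv_preds   = [1, 1, 1, 0]   # only the first prediction was flipped by the attack
asr = attack_success_rate(clean_preds, adv_preds, labels)   # 1 flip out of 3 eligible inputs
```

A lower distance at the same ASR indicates a stronger attack in the sense described above.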

Adversarial attacks can be categorized based on the adversary’s access to and knowledge of the target ML model:

White-box adversarial attacks: In white-box attacks, the adversary has full knowledge of the target model, including its architecture and weights. White-box adversarial attacks use the weights and gradients of the target model to compute the adversarial noise. White-box attacks are the easiest way to create adversarial examples. They also have the highest ASR and require the lowest perturbation.

In production systems, the attacker usually does not have direct access to the model. But white-box attacks are very good tools to test the adversarial robustness of a machine learning model before deploying it to the public.
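To make the white-box idea concrete, here is a minimal sketch of the fast gradient sign method (FGSM), a well-known gradient-based technique that this article does not name explicitly. The target is a toy logistic-regression "model" with made-up weights, so the gradient of the loss with respect to the input can be written by hand; a real attack on a deep network would rely on a framework's automatic differentiation instead.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Move every feature by eps in the direction that increases the loss."""
    p = predict_proba(w, b, x)
    # For cross-entropy loss on logistic regression, dLoss/dx = (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    # Clip so the result remains a valid input in [0, 1].
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]

# Hypothetical weights; the attacker knows them because this is a white-box setting.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.6, 0.3, 0.5], 1

x_adv = fgsm(w, b, x, y, eps=0.2)
before = predict_proba(w, b, x)      # above 0.5, so the model predicts class 1
after = predict_proba(w, b, x_adv)   # below 0.5, so the prediction flips to class 0
```

The same routine doubles as a pre-deployment robustness check: run it over a validation set and measure how many predictions survive a given `eps`.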

Black-box adversarial attacks: In black-box attacks, the adversary accesses the machine learning model through an intermediate system, such as a web application or an application programming interface (API). Examples include Google Cloud Vision API, Microsoft Azure Cognitive Services, and Amazon Rekognition.

Black-box adversaries have no knowledge of the underlying ML model’s architecture or weights. They can only query the model and evaluate the result. If the ML system returns multiple classes and their confidence scores (e.g., piano: 85%, bagel: 5%, shark: 1%, etc.), then the adversary can conduct a soft-label black-box attack. By gradually adding perturbations to the image and observing the changes to the ML system’s output scores, the attacker can create adversarial examples.

In other cases, the ML system returns only a single output label (e.g., piano), and the adversary must conduct a hard-label black-box attack. This type of attack is even more difficult, but not impossible.

In addition to perturbation level and ASR, black-box attacks are evaluated based on their query efficiency, the number of queries required to create an adversarial example.
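The following sketch shows the soft-label idea and the query counter that query efficiency refers to. Everything here is hypothetical: `BlackBoxModel` stands in for a remote API, its hidden weights are invented, and the search strategy (a greedy coordinate-wise probe) is just one simple option chosen for readability, not a method taken from the book.

```python
import math

class BlackBoxModel:
    """Stand-in for a remote API: callers see only scores, never the weights."""
    def __init__(self):
        self._w, self._b = [2.0, -3.0, 1.0], 0.0   # hidden from the attacker
        self.queries = 0

    def scores(self, x):
        self.queries += 1                          # every call counts as one query
        z = sum(w * v for w, v in zip(self._w, x)) + self._b
        p = 1.0 / (1.0 + math.exp(-z))
        return {"piano": p, "bagel": 1.0 - p}

def soft_label_attack(model, x, step=0.05, max_queries=200):
    """Nudge one feature at a time, keeping moves that lower the original label's score."""
    start = model.scores(x)
    label = max(start, key=start.get)
    best, best_score = list(x), start[label]
    improved = True
    # The query budget is checked once per pass over the features.
    while improved and model.queries < max_queries:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                candidate = list(best)
                candidate[i] = min(1.0, max(0.0, candidate[i] + delta))
                out = model.scores(candidate)
                if max(out, key=out.get) != label:
                    return candidate, model.queries      # top label changed: success
                if out[label] < best_score:
                    best, best_score = candidate, out[label]
                    improved = True
                    break                                # keep this move, go to next feature
    return None, model.queries

model = BlackBoxModel()
x = [0.6, 0.3, 0.5]
adversarial, used = soft_label_attack(model, x)
```

`used` is the number of API calls the attack consumed, which is the quantity a query-efficient attack tries to minimize. A hard-label attack would have to work without the `out[label]` comparison, which is why it is harder.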

Different types of machine learning adversarial attacks

Transfer attacks are a type of attack in which the adversary uses a source ML model to create adversarial examples for a target model. In a typical transfer attack setting, the adversary targets a black-box model and uses a local white-box model as a surrogate to create the adversarial examples. The surrogate model can be pre-trained or fine-tuned with soft labels obtained from the black-box model.

Transfer attacks are difficult, especially if the target model is a deep neural network. Without knowledge of the target model’s architecture, it is hard to build a surrogate model that produces transferable adversarial examples. But it is not impossible, and there are several techniques that can help tease out enough information about the target model to create a valid surrogate. The advantage of transfer attacks is that they overcome the bottleneck of accessing the remote ML system, especially when the target API charges customers for each inference or has defense mechanisms to prevent adversarial probing.
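A toy version of this workflow might look as follows. It is a sketch under generous assumptions: the hidden target is a logistic-regression model with invented weights, the surrogate happens to have the same architecture (the easy case, which is rarely true in practice), and the surrogate is attacked with a hand-written FGSM step. All names are hypothetical.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

_TARGET_W, _TARGET_B = [2.0, -3.0, 1.0], 0.0     # hidden inside the remote system

def query_target(x):
    """Soft label (confidence for class 1) returned by the black-box target."""
    return sigmoid(dot(_TARGET_W, x) + _TARGET_B)

def train_surrogate(inputs, soft_labels, lr=0.5, epochs=2000):
    """Fit a local logistic regression to the target's soft labels by gradient descent."""
    w, b, n = [0.0] * len(inputs[0]), 0.0, len(inputs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(inputs, soft_labels):
            err = sigmoid(dot(w, x) + b) - t
            grad_w = [g + err * xi / n for g, xi in zip(grad_w, x)]
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def fgsm_on_surrogate(w, b, x, y, eps):
    """White-box gradient-sign step computed on the surrogate only."""
    p = sigmoid(dot(w, x) + b)
    grad = [(p - y) * wi for wi in w]
    return [min(1.0, max(0.0, xi + eps * ((g > 0) - (g < 0)))) for xi, g in zip(x, grad)]

# 1. Spend a fixed number of queries (27 here) to collect soft labels.
probes = [list(p) for p in itertools.product([0.0, 0.5, 1.0], repeat=3)]
soft_labels = [query_target(p) for p in probes]
# 2. Train the surrogate locally; no further access to the target is needed.
w_s, b_s = train_surrogate(probes, soft_labels)
# 3. Craft the example on the surrogate, then try it on the target.
x = [0.6, 0.3, 0.5]
y = 1 if query_target(x) > 0.5 else 0
x_adv = fgsm_on_surrogate(w_s, b_s, x, y, eps=0.2)
transferred = (query_target(x_adv) > 0.5) != (query_target(x) > 0.5)
```

After the initial batch of probes, the attack itself costs the adversary no additional queries, which is the economic advantage described above.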

In Adversarial Robustness for Machine Learning, Chen and Hsieh explore each type of attack in depth and provide references to relevant papers.

Other types of adversarial attacks

While test-time attacks against computer vision systems receive the most media attention, they are not the only threat against machine learning. In Adversarial Robustness for Machine Learning, you’ll learn about several other types of adversarial attacks:

Physical adversarial attacks are a type of attack in which the attacker creates physical objects that can fool machine learning systems. Popular examples include adversarial glasses and makeup that target facial recognition systems, adversarial t-shirts for evading person detectors, and adversarial stickers that fool road sign detectors in self-driving cars.

Researchers at Carnegie Mellon University discovered that by donning special glasses, they could fool facial recognition algorithms into mistaking them for celebrities.

Training-time adversarial attacks: If an adversary has access to the training pipeline of the machine learning system, they can manipulate the learning process to their advantage. In data poisoning attacks, the adversary modifies the training data to reduce the trained model’s accuracy in general or on a specific class. In backdoor attacks, the adversary pollutes the training data by adding examples with a trigger pattern. The trained model becomes sensitive to the pattern, and the attacker can use it at inference time to trigger a desired behavior.
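The data-manipulation step of a backdoor attack can be sketched in a few lines. This is an illustration with assumed details: images are plain nested lists of pixel values in [0, 1], the trigger is a bright 2x2 square in the bottom-right corner, and the function names and the 20 percent poisoning rate are invented for the example.

```python
import random

def add_trigger(image, size=2, value=1.0):
    """Return a copy of the image with a bright square stamped in the bottom-right corner."""
    stamped = [row[:] for row in image]
    for r in range(-size, 0):
        for c in range(-size, 0):
            stamped[r][c] = value
    return stamped

def poison_dataset(images, labels, target_label, fraction, seed=0):
    """Stamp the trigger on a random fraction of images and relabel them to the target class."""
    rng = random.Random(seed)
    n_poison = int(len(images) * fraction)
    chosen = set(rng.sample(range(len(images)), n_poison))
    new_images, new_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in chosen:
            new_images.append(add_trigger(img))
            new_labels.append(target_label)        # the attacker's desired behavior
        else:
            new_images.append([row[:] for row in img])
            new_labels.append(lab)
    return new_images, new_labels, chosen

# Ten blank 4x4 "images", all labeled class 0.
images = [[[0.0] * 4 for _ in range(4)] for _ in range(10)]
labels = [0] * 10
poisoned_images, poisoned_labels, chosen = poison_dataset(
    images, labels, target_label=7, fraction=0.2
)
```

A model trained on `poisoned_images` and `poisoned_labels` can learn to associate the corner square with class 7, which the attacker later exploits by stamping the same square on inputs at inference time.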

Adversarial attacks beyond images: Image classifiers are not the only type of machine learning model that can be targeted with adversarial attacks. In Adversarial Robustness for Machine Learning, Chen and Hsieh discuss adversarial examples against machine learning systems that process text, audio signals, graph data, computer instructions, and tabular data. Each has its specific challenges and techniques, which the authors discuss in the book.

Adversarial defense techniques

Adversarial defense techniques protect machine learning models against tampered examples. Some defense techniques modify the training process to make the model more robust against adversarial examples. Others are postprocessing computations that can reduce the effectiveness of adversarial examples.

It is worth noting that no defense technique is perfect. However, many defense techniques are compatible and can be combined to improve the model’s robustness against adversarial attacks.

Adversarial training: After training the model, the ML engineering team uses a white-box attack technique to create adversarial examples. The team then further trains the ML model with the adversarial examples and their proper labels. Adversarial training is the most widely used defense method.
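The three steps in that description (train, attack your own model, train further on the results) can be sketched with a toy logistic-regression classifier. The dataset, hyperparameters, and the use of FGSM as the white-box attack are all assumptions made for this example; in practice the same loop would be written with a deep learning framework.

```python
import math

def predict_proba(w, b, x):
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def train(w, b, X, Y, lr=1.0, epochs=200):
    """Full-batch gradient descent on cross-entropy loss, starting from (w, b)."""
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(X, Y):
            err = predict_proba(w, b, x) - y
            grad_w = [g + err * xi / n for g, xi in zip(grad_w, x)]
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def fgsm(w, b, x, y, eps):
    """White-box attack used to generate the extra training examples."""
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [min(1.0, max(0.0, xi + eps * ((g > 0) - (g < 0)))) for xi, g in zip(x, grad)]

def accuracy(w, b, X, Y):
    return sum((predict_proba(w, b, x) > 0.5) == (y == 1) for x, y in zip(X, Y)) / len(X)

X = [[0.8, 0.2], [0.9, 0.3], [0.7, 0.1], [0.6, 0.2],
     [0.2, 0.8], [0.3, 0.9], [0.1, 0.7], [0.2, 0.6]]
Y = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 1: standard training.
w, b = train([0.0, 0.0], 0.0, X, Y)
# Step 2: attack the trained model to create adversarial examples.
X_adv = [fgsm(w, b, x, y, eps=0.1) for x, y in zip(X, Y)]
# Step 3: continue training on clean and adversarial inputs with their proper labels.
w_rob, b_rob = train(w, b, X + X_adv, Y + Y)
```

Note that the adversarial inputs keep their original, correct labels (`Y + Y`); that is what teaches the model to ignore the perturbation.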

Randomization: Another method to protect machine learning models is to integrate randomized components into the model. Examples include random dropout and layer switching. Randomization makes it more difficult for an attacker to create a fixed attack against the model.

Random switching can improve adversarial robustness
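As a rough illustration of switching, the sketch below picks one of several similar classifiers at random for every request. It is a much-simplified stand-in for the layer-level switching mentioned above: the class name, the three sets of weights, and the choice to switch whole models rather than internal blocks are all assumptions made to keep the example short.

```python
import random

def make_linear_classifier(w, b):
    """Build a tiny classifier that returns 1 or 0 from a weighted sum."""
    def classify(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
    return classify

class RandomSwitchingClassifier:
    """Answers each query with a randomly selected member of a model pool."""
    def __init__(self, models, seed=None):
        self.models = list(models)
        self.rng = random.Random(seed)

    def predict(self, x):
        # The attacker cannot know in advance which model will answer.
        return self.rng.choice(self.models)(x)

# Hypothetical pool: variants that agree on clean inputs but differ in their weights.
models = [
    make_linear_classifier([2.0, -3.0, 1.0], 0.0),
    make_linear_classifier([2.2, -2.8, 0.9], 0.05),
    make_linear_classifier([1.9, -3.1, 1.1], -0.05),
]
defended = RandomSwitchingClassifier(models, seed=0)
prediction = defended.predict([0.6, 0.3, 0.5])
```

Because a perturbation tuned against one member's gradients may not carry over to the member that actually answers, a single fixed attack becomes less reliable.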

Detection: Making the machine learning model robust against every kind of adversarial attack is very difficult. One complementary method to improve adversarial defense is to create an additional system that detects abnormal examples.
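One possible detector, in the spirit of the published "feature squeezing" idea rather than anything specified in this article, compares the model's output on the raw input with its output on a coarsely quantized copy. A large disagreement suggests the prediction depends on fine-grained detail, which is where adversarial noise tends to live. The weights, threshold, and sample inputs below are invented for illustration.

```python
import math

W, B = [4.0, -6.0, 2.0], -1.5      # hypothetical classifier weights

def predict_proba(x):
    z = sum(w * v for w, v in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def squeeze(x, levels=4):
    """Round every feature to the nearest multiple of 1/levels."""
    return [round(v * levels) / levels for v in x]

def looks_adversarial(x, threshold=0.1):
    """Flag inputs whose score shifts sharply once fine detail is removed."""
    return abs(predict_proba(x) - predict_proba(squeeze(x))) > threshold

clean = [0.74, 0.26, 0.51]
tampered = [0.64, 0.36, 0.41]      # the clean input with each feature shifted by 0.1

clean_flag = looks_adversarial(clean)        # small score shift, not flagged
tampered_flag = looks_adversarial(tampered)  # large score shift, flagged
```

The detector runs alongside the classifier and does not change it, which is why detection is described as complementary to the other defenses.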

Filtering and projection: An additional defense vector is making modifications to the input before passing it on to the machine learning model. These modifications are meant to filter possible adversarial noise that might have been added to the input data. For example, a generative ML model can be trained to take an image as input and reproduce it by keeping the main features and removing out-of-distribution noise.
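Training a generative model is beyond the scope of a short example, so the sketch below swaps in a far simpler filter, a sliding median over a one-dimensional signal, to show the "clean the input first" pattern. The spike in the sample data is exaggerated for visibility; real adversarial noise is usually much subtler, and a median filter is not claimed to stop it.

```python
import statistics

def median_filter(signal, window=3):
    """Replace each value with the median of its neighborhood (edges are repeated)."""
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [statistics.median(padded[i:i + window]) for i in range(len(signal))]

noisy = [0.5, 0.5, 0.9, 0.5, 0.5]       # one value has been tampered with
cleaned = median_filter(noisy)          # the outlier is replaced by its neighbors' median

smooth = [0.1, 0.2, 0.3, 0.4, 0.5]
unchanged = median_filter(smooth)       # a monotone ramp passes through untouched
```

In a deployed pipeline, the filtered input, not the raw one, would be passed to the classifier.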

Discrete components: Most adversarial attack techniques are based on gradient computation. Therefore, one additional defense method is the integration of discrete components into the machine learning models. Discrete components are nondifferentiable and make gradient-based attacks much more difficult.
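The effect of a discrete component on gradients can be seen with a finite-difference check. In this sketch, a hypothetical quantization step is placed in front of a toy logistic model; the weights and the number of quantization levels are arbitrary choices for the example.

```python
import math

W, B = [2.0, -3.0, 1.0], 0.0            # hypothetical model weights

def smooth_model(x):
    z = sum(w * v for w, v in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def quantize(x, levels=4):
    """Discrete component: snap each feature to the nearest multiple of 1/levels."""
    return [round(v * levels) / levels for v in x]

def discrete_model(x):
    return smooth_model(quantize(x))

def numerical_gradient(f, x, h=1e-4):
    """Central finite-difference estimate of df/dx, one feature at a time."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

x = [0.6, 0.3, 0.55]
grad_smooth = numerical_gradient(smooth_model, x)      # informative, nonzero gradient
grad_discrete = numerical_gradient(discrete_model, x)  # all zeros: small nudges are erased
```

Away from the quantization boundaries, the discrete model's output does not change at all under small perturbations, so a gradient-based attacker receives no signal about which direction to move in. This makes such attacks harder, not impossible; attackers may approximate the gradient or fall back to query-based search.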

A different mindset on adversarial machine learning

Adversarial Robustness for Machine Learning discusses other aspects of adversarial machine learning, including verifying the certified robustness of ML models. The book also explores some of the positive aspects of adversarial examples, such as reprogramming a trained model for new applications and generating contrastive explanations.

Black-box adversarial reprogramming can repurpose neural networks for new tasks without having full access to the deep learning model.

One of the important points that Chen and Hsieh raise is the need to rethink how we evaluate machine learning models. Currently, trained models are graded based on their accuracy in classifying a test set. But standard accuracy metrics say nothing about the robustness of an ML model against adversarial attacks. In fact, some studies show that in many cases, higher standard accuracy is associated with higher sensitivity to adversarial perturbation.

“This undesirable trade-off between standard accuracy and adversarial robustness suggests that one should employ the techniques discussed in this book to evaluate and improve adversarial robustness for machine learning,” the authors write.
