Robust AI: Protecting neural networks against adversarial attacks

February 20, 2019

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

In its latest annual report, filed with the Securities and Exchange Commission, tech giant Alphabet warned investors about the many challenges of artificial intelligence, following the lead of Microsoft, which issued similar warnings last August.

Recent advances in deep learning and neural networks have raised hopes about what AI can do in domains that were previously thought to be off limits for computer software. But there’s also concern about the new threats AI will pose in different fields, especially where bad decisions can have very destructive results.

We’ve already seen some of these threats manifest themselves in various ways, including biased algorithms, AI-based forgery and the spread of fake news during important events such as elections.

The past few years have seen a growing discussion around building trust in artificial intelligence and creating safeguards that prevent abuse and malicious behavior of AI models. These efforts focus on three areas: fairness, explainability and robustness.

In an interview with TechTalks, Pin-Yu Chen, researcher at the MIT-IBM Watson AI Lab, explained why it’s important to create robust AI models and how to evaluate the resilience of artificial intelligence algorithms against abuse and erratic behavior. Chen is a member of a team of researchers who recently released two papers on AI robustness and presented them at the Association for the Advancement of Artificial Intelligence (AAAI) conference.

Why AI robustness matters

Neural networks, the main component of deep learning algorithms, currently the most popular branch of AI, have proven to be very accurate at performing complicated tasks such as classifying images, recognizing speech and voice, and translating text. But as Chen points out, accuracy can’t be the sole metric to grade an AI model.

A lot of domains require AI models to be trustworthy, Chen explains, which means we must be able to understand how an AI model develops its behavior and how it makes decisions. We also must have tools to evaluate how reliable the AI is in various situations.

“If we deploy an AI model for some safety-critical task, say autonomous vehicles, we want to make sure that if the self-driving car sees a stop sign it will stop and it will realize what a stop sign means,” Chen says.

That example hides a lot of important details about advances and developments in artificial intelligence. First, AI has helped software find its way into many applications of the physical world. Computer vision algorithms are one of the main technologies that enable cars to navigate streets without human drivers. But this also means that mistakes by AI algorithms can have dire and possibly fatal consequences. In 2017, researchers found that by making small tweaks to stop signs, they could fool self-driving cars into bypassing them and cause dangerous situations.

Image: AI researchers discovered that by adding small black and white stickers to stop signs, they could make them invisible to computer vision algorithms (Source: arxiv.org)

Adversarial examples

Previously, developers created image classification AI algorithms and tested them against one of several popular computer vision datasets to evaluate how well they scored on the image samples. The higher the score, the more reliable the model was considered to be. But accuracy alone can create a false sense of trust.

“Some neural networks already perform with higher precision than humans. This may make us feel that these AI models are ready for deploying to different tasks,” Chen says, noting that even the most accurate models can be vulnerable to “adversarial perturbations.”

Adversarial perturbations are also known as adversarial examples or adversarial attacks, depending on the context in which they’re created, and they involve making small changes to input data to manipulate the results of an AI model.

For instance, when you provide an image classifier algorithm with a photo, it will output a list of possible classes—say cat, dog and wolf—and associate each class with a confidence score ranging between 0 and 1. The class with the highest score is considered the AI model’s prediction for that input. Adversarial perturbation adds small details to the input image in a way that causes the algorithm to change its confidence scores in favor of another class. The ingenuity of adversarial attacks is that the changes made to the input images are not distinguishable to humans.
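The mechanics described above can be sketched on a toy model. The following is a minimal illustration, not the method used in any of the cited research: it assumes a tiny linear classifier with made-up weights (`WEIGHTS`), a made-up eight-feature input (`X`), and a hypothetical helper `perturb_toward` that nudges every feature by at most `epsilon` in the direction that favors the target class. An attacker needs full access to the weights for this to work. On a toy input with eight features the nudge has to be fairly large; real images have many thousands of pixels, so much smaller per-pixel changes can add up to the same effect.

```python
import math

CLASSES = ["cat", "dog", "wolf"]

# Hypothetical weights of a tiny linear classifier: one row per class,
# one column per input feature ("pixel"). All values are made up.
WEIGHTS = [
    [0.9, -0.4, 0.7, -0.2, 0.5, -0.6, 0.8, -0.3],   # cat
    [-0.5, 0.6, -0.3, 0.4, -0.7, 0.5, -0.2, 0.6],   # dog
    [0.1, 0.1, -0.1, 0.2, 0.0, -0.1, 0.1, -0.2],    # wolf
]
X = [0.6, 0.5, 0.6, 0.5, 0.6, 0.5, 0.6, 0.5]        # made-up input in [0, 1]


def softmax(logits):
    """Turn raw class scores into confidence scores that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Return one confidence score per class for input x."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return softmax(logits)


def perturb_toward(weights, x, target, epsilon):
    """Move each feature by at most epsilon to favor `target` over the current top class."""
    scores = predict(weights, x)
    current = scores.index(max(scores))
    adversarial = []
    for i, value in enumerate(x):
        # For a linear model, this difference is the gradient of
        # (target logit - current logit) with respect to feature i.
        gradient = weights[target][i] - weights[current][i]
        direction = (gradient > 0) - (gradient < 0)   # sign of the gradient
        adversarial.append(min(1.0, max(0.0, value + epsilon * direction)))
    return adversarial


before = predict(WEIGHTS, X)
adversarial_x = perturb_toward(WEIGHTS, X, target=CLASSES.index("dog"), epsilon=0.15)
after = predict(WEIGHTS, adversarial_x)

print("before:", dict(zip(CLASSES, (round(s, 3) for s in before))))
print("after: ", dict(zip(CLASSES, (round(s, 3) for s in after))))
```

With these made-up numbers, the top class moves from "cat" to "dog" even though no feature changes by more than `epsilon`.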

Image: an adversarial example in which a panda is classified as a gibbon (Source: arxiv.org)

For instance, the image above has been blended with a layer of noise to create an adversarial example. Any human will say with a very high level of confidence that both are pictures of a panda. Most people won’t even be able to tell the difference between the two images. But an image classification AI algorithm will classify the right image as a “gibbon” with a 99.3 percent level of confidence.

In the image below, adversarial perturbations have caused a neural network to mistake a turtle for a rifle.

Image: Researchers at labsix showed how a modified toy turtle could fool deep learning algorithms into classifying it as a rifle (Source: labsix.org)

While neural networks are becoming increasingly efficient at yielding results that match or exceed the accuracy of the human vision system, they can fail in unexpected ways, highlighting the stark differences between AI and human intelligence.

In the past couple of years, there has been growing interest in developing methods to discover and patch adversarial examples in neural networks. Research has shown that even the most accurate AI models can be vulnerable to adversarial attacks, casting doubt over their reliability in critical use cases.

“Adversarial examples could be very problematic, because if we’re going to deploy AI in safety-critical or security-sensitive applications, then these models cannot be trusted because they can be easily fooled,” Chen says.

The work that Chen and his colleagues at the MIT-IBM Watson AI Lab are doing focuses mainly on evaluating the robustness of AI models against adversarial examples.

“Robustness is about worst-case scenario performance,” Chen says. “It’s about how confident you are that your AI will classify a stop sign as a stop sign under different circumstances and how easy it is for an adversary to manipulate the prediction result of a stop sign into something else.”

Adversarial attacks against black-box AI models

Part of evaluating the robustness of all software and computer systems is to test them under duress and attacks. An example is penetration testing, where cybersecurity experts perform different attacks on a system to discover flaws and vulnerabilities.

Likewise, developers must probe their AI models for vulnerabilities to adversarial perturbations by testing them against various adversarial examples. The first MIT-IBM paper introduces a method to optimize adversarial attacks against black-box AI models.

“In creating adversarial examples, people usually assume attackers have full knowledge of the model, including training data, network architecture and weights. So nothing is hidden from the attacker,” Chen says.

But this is not a practical assumption, Chen argues, because in many cases, those details are hidden, and the attacker only has access to a black-box AI model. For instance, attack methods that rely on full knowledge of the model can’t be used to generate adversarial examples against an online AI-based image classification service.

Many companies only provide a set of APIs that applications can use to upload images to their AI models and receive output classes and their associated confidence scores. In most cases there’s little or no information about the structure and type of AI architecture used in the service.

“All you can do is upload an image, and the image classifier will tell you for example that the image is 99 percent a cat,” Chen says, adding that previously, developers believed black-box AI models would be resilient against adversarial attacks.

However, “security through obscurity” is a failed approach, as has been proven time and again. Several research papers have shown that hiding the details of AI models won’t make them robust against adversarial examples.

In 2017, a paper by the MIT-IBM Watson AI Lab first showed that with enough examples and testing, an attacker would be able to find adversarial vulnerabilities in AI models without having access to their architecture and inner details. The work proved that output confidence scores alone provide enough information to develop adversarial examples.

But the method required a lot of queries to create an adversarial example on a single input. For instance, it took more than a million queries to convert the image of a bagel into an adversarial example that an AI model would classify as a “grand piano.” The limitation made the process both slow and costly. Online image recognition platforms usually charge around $1.00 per thousand queries, raising the price of a single adversarial example to upward of a thousand dollars.

In their new method, called AutoZOOM, the researchers at MIT-IBM Watson AI Lab have managed to dramatically reduce the number of queries required to develop an adversarial example.

In the case of the bagel image, AutoZOOM was able to generate the adversarial image with approximately 200,000 queries as opposed to 1.16 million queries required by the previous method. The researchers tested AutoZOOM on the CIFAR and MNIST image recognition datasets, and in most cases they were able to reduce the number of queries by over 90 percent.

Image: Using the AutoZOOM method, researchers at MIT-IBM Watson AI Lab were able to dramatically reduce the number of queries required to generate adversarial examples, as in the bagel image classified as a grand piano. (Source: arxiv.org)
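As a back-of-the-envelope check on the bagel figures, the query counts can be turned into a rough price tag. This sketch assumes the approximate rate of $1.00 per thousand queries mentioned earlier; the function name `attack_cost` is made up for illustration, and real pricing varies by provider.

```python
PRICE_PER_THOUSAND_QUERIES = 1.00   # dollars; the rough figure quoted in the article


def attack_cost(queries, price_per_thousand=PRICE_PER_THOUSAND_QUERIES):
    """Approximate cost in dollars of sending `queries` requests to a paid API."""
    return queries / 1000 * price_per_thousand


previous_queries = 1_160_000    # earlier method, bagel image
autozoom_queries = 200_000      # AutoZOOM, bagel image (approximate)

previous_cost = attack_cost(previous_queries)
autozoom_cost = attack_cost(autozoom_queries)
reduction = 1 - autozoom_queries / previous_queries

print(f"previous method: ${previous_cost:,.0f}")
print(f"AutoZOOM:        ${autozoom_cost:,.0f}")
print(f"fewer queries:   {reduction:.0%}")
```

For this one image the saving is roughly 83 percent; the reductions of over 90 percent refer to the results on the CIFAR and MNIST datasets.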

The details of AutoZOOM are a bit complicated, but basically the method estimates gradients from how the model’s outputs change in response to changes in its inputs, and uses those estimates to optimize the process of creating adversarial noise. The method also introduces a resizing technique that obviates the need for perturbing every single pixel individually and can make manipulations in batches of pixels.

“We estimate gradients in an efficient way, and then we reduce the input dimensions such that the attacker doesn’t need to spend so many queries to figure out what is the best direction,” Chen says.
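The two ideas Chen describes, estimating gradients from queries alone and searching in a reduced input space, can be sketched as follows. This is an illustrative toy under stated assumptions, not the AutoZOOM implementation: `make_black_box` is a hypothetical stand-in for a remote API that only returns confidence scores, the gradient is estimated with finite differences along random directions, and a simple block-repeat `upsample` stands in for the paper's resizing step. All names, weights and parameter values are made up.

```python
import math
import random


def make_black_box(weights):
    """Hypothetical stand-in for a remote API: returns scores and counts queries."""
    calls = {"count": 0}

    def query(x):
        calls["count"] += 1
        logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
        top = max(logits)
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    return query, calls


def upsample(z, factor):
    """Resize a low-dimensional perturbation by repeating each value `factor` times."""
    return [v for v in z for _ in range(factor)]


def loss(query, x, z, factor, source, target):
    """Attacker's objective: source-class score minus target-class score (one query)."""
    delta = upsample(z, factor)
    scores = query([a + b for a, b in zip(x, delta)])
    return scores[source] - scores[target]


def estimate_gradient(query, x, z, factor, source, target, base, rng,
                      samples=4, beta=0.01):
    """Average finite differences along random unit directions in the reduced space."""
    grad = [0.0] * len(z)
    for _ in range(samples):
        u = [rng.gauss(0.0, 1.0) for _ in z]
        norm = math.sqrt(sum(c * c for c in u)) or 1.0
        u = [c / norm for c in u]
        probe = [a + beta * c for a, c in zip(z, u)]
        slope = (loss(query, x, probe, factor, source, target) - base) / beta
        grad = [g + slope * c / samples for g, c in zip(grad, u)]
    return grad


def attack(query, x, source, target, factor=4, epsilon=0.2, step=0.05,
           iterations=30, seed=0):
    """Greedy sign-descent on the estimated gradient; len(x) must be divisible by factor."""
    rng = random.Random(seed)
    z = [0.0] * (len(x) // factor)
    best = loss(query, x, z, factor, source, target)
    for _ in range(iterations):
        grad = estimate_gradient(query, x, z, factor, source, target, best, rng)
        candidate = [
            max(-epsilon, min(epsilon, a - step * math.copysign(1.0, g))) if g else a
            for a, g in zip(z, grad)
        ]
        value = loss(query, x, candidate, factor, source, target)
        if value < best:            # keep only steps that actually help
            z, best = candidate, value
    return upsample(z, factor), best


# Made-up 16-feature input and 3-class weights, for illustration only.
WEIGHTS = [
    [0.3 if i % 2 == 0 else -0.1 for i in range(16)],
    [-0.2 if i % 2 == 0 else 0.25 for i in range(16)],
    [0.05] * 16,
]
X = [0.5] * 16

query, calls = make_black_box(WEIGHTS)
delta, final_loss = attack(query, X, source=0, target=1)
print("queries used:", calls["count"], "final loss:", round(final_loss, 4))
```

Working in the reduced space means each random probe moves four features at once, so fewer probes are needed per step than if every feature were perturbed individually. The query counter makes the cost of each design choice visible.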

Since AutoZOOM treats the AI model as a black box, it is model-agnostic, which means that in addition to neural networks, it can also be employed to test other types of AI architectures such as support vector machines or regression models.

Methods such as AutoZOOM will make it possible to evaluate the robustness of AI models before deploying them. But like most tools that are used to test the security of software, AutoZOOM can also be used by malicious actors to stage adversarial attacks more efficiently.

“Actors who really want to generate adversarial examples for malicious behavior might find this technique useful,” Chen says.

Verifying the robustness of neural networks against adversarial examples

The second part of the work done by Chen and his colleagues revolves around creating benchmarks that can measure the resilience of neural networks against adversarial examples.

“Here we want to tell the developer and the user how resistant their neural network and AI model is to adversarial attacks,” Chen says.

Called CNN-Cert, the method probes convolutional neural networks (CNN) to find their threshold of resistance against perturbations. CNNs are currently among the most complicated and advanced types of neural networks and are used in various fields such as self-driving cars, medical imaging, facial recognition and speech recognition.

“What’s special about this paper is that the certification method has been optimized for convolutional neural networks. Previous works were focused on simpler neural network models such as multilayer perceptrons,” Chen says.

Unlike the AutoZOOM method, CNN-Cert requires full visibility into the structure of a neural network. The method uses mathematical techniques to define thresholds on the input-output relationships of each layer and each neuron. This enables it to determine how changes to the input in different ranges will affect the outputs of each unit and layer.

CNN-Cert first executes the process on single neurons and layers and then propagates it across the network. The final result is a threshold value that determines the amount of perturbations the network can resist before its output values become erroneous.

This is important because adversarial attacks basically play on these boundaries, changing input values in ways that manipulate the prediction outputs of neural networks, Chen explains.

“If we can put intervals on these input vectors and allow these intervals to propagate through the layers we define, we can figure out how perturbations in the input will look like in the output, and when we establish a range, we can also give guarantees on the performance on the model,” Chen says.

The certification is input specific, which means CNN-Cert must be applied individually to different images.

“Some images are easier to manipulate, while others are harder. So we can’t make a binary decision on whether a model is robust or not. We try to provide a certificate for each input data and how confident your model is in terms of the prediction results for that data,” Chen says.
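The interval idea Chen describes, and the fact that the resulting certificate is specific to each input, can be illustrated with plain interval arithmetic. This is a simpler and looser technique than the one in the paper, shown here only as a sketch: it uses a tiny made-up fully connected ReLU network instead of a CNN, and the names `output_bounds`, `is_certified` and `certified_radius` are hypothetical.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate an input box [lower, upper] through y = Wx + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:               # a negative weight swaps the roles of the bounds
                lo += w * u
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


def relu_bounds(lower, upper):
    """ReLU is monotonic, so it can be applied to each bound directly."""
    return [max(0.0, v) for v in lower], [max(0.0, v) for v in upper]


def output_bounds(layers, x, epsilon):
    """Bounds on every output when each input may move by at most epsilon."""
    lower = [v - epsilon for v in x]
    upper = [v + epsilon for v in x]
    for index, (weights, bias) in enumerate(layers):
        lower, upper = affine_bounds(lower, upper, weights, bias)
        if index < len(layers) - 1:
            lower, upper = relu_bounds(lower, upper)
    return lower, upper


def is_certified(layers, x, epsilon, label):
    """True if no perturbation within epsilon can make another class win."""
    lower, upper = output_bounds(layers, x, epsilon)
    return all(lower[label] > upper[j] for j in range(len(lower)) if j != label)


def certified_radius(layers, x, label, high=1.0, steps=30):
    """Bisect for the largest epsilon (up to `high`) that can still be certified."""
    if not is_certified(layers, x, 0.0, label):
        return 0.0
    low = 0.0
    for _ in range(steps):
        mid = (low + high) / 2
        if is_certified(layers, x, mid, label):
            low = mid
        else:
            high = mid
    return low


# Made-up two-layer network: outputs are relu(x0 - x1) and relu(x1 - x0).
LAYERS = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]

easy_input = [1.0, 0.0]     # far from the decision boundary
hard_input = [0.6, 0.4]     # close to the decision boundary

print("radius for easy input:", round(certified_radius(LAYERS, easy_input, 0), 4))
print("radius for hard input:", round(certified_radius(LAYERS, hard_input, 0), 4))
```

Both inputs are classified the same way, but the second one tolerates a much smaller perturbation before the guarantee is lost, which is why a single yes-or-no verdict per model is not enough.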

The goal of CNN-Cert is to provide a certification, a label of robustness that will tell you the level of trust you can put into your AI model on different types of input. CNN-Cert is independent of the adversarial attack algorithm, so it can be applied to existing attacks as well as unseen and stronger attacks in the future.

Chen hopes that in the future, methods such as CNN-Cert can help establish standards that AI models must meet before being deployed. This is especially important in fields such as self-driving cars and healthcare, where an unreliable AI model can have dire consequences on the lives of people.

“We are deploying AI models in critical situations, and we have high expectations of these AI models,” Chen says. “So we want them not just to be accurate, but also robust. Robustness is very important not just because there’s room for adversaries to manipulate AI models, but also because when AI models are deployed in the field, they’re not functioning in their ideal training environment. They will encounter things they haven’t seen before. We must make sure they can generalize to new things they haven’t seen before in their training data. They have to be robust to perturbations from the environment as well as adversaries.”
