The dangers of trusting black-box machine learning

black box
Image credit: Depositphotos

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

Last November, Apple ran into trouble after customers pointed out on Twitter that its credit card service was discriminating against women. David Heinemeier Hansson, the creator of Ruby on Rails, called Apple Card a sexist program. “Apple’s black box algorithm thinks I deserve 20x the credit limit [my wife] does,” he tweeted.

The success of deep learning in the past decade has increased interest in the field of artificial intelligence. But the rising popularity of AI has also highlighted some of the key problems of the field, including the “black box problem,” the challenge of making sense of the way complex machine learning algorithms make decisions. The Apple Card disaster is one of many manifestations of the black-box problem that have come to light in recent years.

The increased attention to black-box machine learning has given rise to a body of research on explainable AI. And a lot of the work done in the field involves developing techniques that try to explain the decision made by a machine learning algorithm without breaking open the black box. But explaining AI decisions after they happen can have dangerous implications, argues Cynthia Rudin, professor of computer science at Duke University, in a paper published in the Nature Machine Intelligence journal.

“Rather than trying to create models that are inherently interpretable, there has been a recent explosion of work on ‘explainable ML’, where a second (post hoc) model is created to explain the first black box model. This is problematic. Explanations are often not reliable, and can be misleading, as we discuss below,” Rudin writes.

Such practices can “potentially cause great harm to society,” Rudin warns, especially in critical domains such as healthcare and criminal justice.

Instead, developers should opt for AI models that are “inherently interpretable” and “provide their own explanations,” Rudin argues in her paper. And contrary to what some AI researchers believe, in many cases, interpretable models can produce results that are just as accurate as black-box deep learning algorithms.

Two types of black-box AI

Like many things involving artificial intelligence, there’s a bit of confusion surrounding the black-box problem. Rudin differentiates between two types of black-box AI systems: functions that are too complicated for any human to comprehend, and functions that are proprietary.

The first kind of black-box AI includes deep neural networks, the architecture used in deep learning algorithms. DNNs are composed of layers upon layers of interconnected variables that become tuned as the network is trained on numerous examples. As neural networks grow larger and larger, it becomes virtually impossible to trace how their millions (and sometimes, billions) of parameters combine to make decisions. Even when AI engineers have access to those parameters, they won’t be able to precisely deconstruct the decisions of the neural network.

deep neural networks
Deep neural networks are composed of several stacked layers of artificial neurons
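To get a feel for the scale involved, here is a minimal Python sketch that counts the weights and biases in a plain fully connected network. The layer sizes are hypothetical and chosen only for illustration; real architectures vary widely and often use other layer types.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # One weight per input-output pair, plus one bias per output unit.
        total += n_in * n_out + n_out
    return total


# Hypothetical sizes: a flattened 224x224 RGB image, two hidden layers, 1,000 classes.
sizes = [224 * 224 * 3, 4096, 4096, 1000]
print(f"{count_parameters(sizes):,} parameters")
```

Even this small, three-layer example ends up with hundreds of millions of tunable values, which is why tracing a single decision by hand is impractical.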

The second type of black-box AI, proprietary algorithms, refers to systems whose details companies hide for various reasons, such as protecting intellectual property or preventing bad actors from gaming the system. In this case, the people who created the AI system might know its inner logic, but the people who use it don’t. We interact with all kinds of black-box AI systems every day, including Google Search’s ranking algorithm, Amazon’s recommendation system, Facebook’s Newsfeed, and more. But the more dangerous ones are those that are being used to hand out prison sentences, determine credit scores, and make treatment decisions in hospitals.

While a large part of Rudin’s paper addresses the dangers of neural network black boxes, she also discusses the implications of walled-garden systems that keep their details to themselves.

Explainability vs interpretability

We need to get one more thing out of the way before we dive deeper into the discussion. Most mainstream media outlets covering AI research use the terms “explainable AI” and “interpretable AI” interchangeably. But there’s a fundamental difference between the two.

Interpretable AI refers to algorithms that give a clear explanation of their decision-making processes. Many machine learning algorithms are interpretable. For instance, linear regression models associate a coefficient with each feature of their input data, and decision trees split on one feature at a time. In both cases, you can clearly trace the path that your input data takes when it goes through the AI model.

decision tree
Decision trees provide clear explanations of their reasoning process (source: Wikipedia)
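To make the idea concrete, here is a minimal Python sketch of a hand-written decision tree that returns the rules it followed along with its answer. The tree, the feature names, and the thresholds are invented for illustration and are not taken from Rudin’s paper or from any real credit model.

```python
# Hypothetical toy tree: features and thresholds are made up for illustration.
TREE = {
    "feature": "income", "threshold": 50_000,
    "left": {"label": "deny"},
    "right": {
        "feature": "debt_ratio", "threshold": 0.4,
        "left": {"label": "approve"},
        "right": {"label": "deny"},
    },
}


def predict_with_path(tree, sample):
    """Walk the tree and record every rule applied on the way to a leaf."""
    path = []
    node = tree
    while "label" not in node:
        name, threshold = node["feature"], node["threshold"]
        if sample[name] < threshold:
            path.append(f"{name} < {threshold}")
            node = node["left"]
        else:
            path.append(f"{name} >= {threshold}")
            node = node["right"]
    return node["label"], path


label, path = predict_with_path(TREE, {"income": 72_000, "debt_ratio": 0.25})
print(label, path)
```

The returned path is the explanation: it is the exact computation the model performed, not an approximation produced by a second tool.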

In contrast, explainable AI refers to tools that are applied to algorithms that don’t provide a clear explanation of their decisions. Researchers, developers, and users rely on these auxiliary tools and techniques to make sense of the logic used in black-box AI models. For instance, in deep learning–based image classifiers, researchers develop models that create saliency maps that highlight the pixels in the input image that contributed to the AI’s output.

But the explanation model does not necessarily provide a breakdown of the inner logic of the AI algorithm it investigates. “Explanation here refers to an understanding of how a model works, as opposed to an explanation of how the world works,” Rudin writes in her paper.

RISE explainable AI example saliency map
Examples of saliency maps produced by RISE

“Recent work on the explainability of black boxes—rather than the interpretability of models—contains and perpetuates critical misconceptions that have generally gone unnoticed, but that can have a lasting negative impact on the widespread use of ML models in society,” Rudin warns.

The myth of AI’s accuracy-interpretability tradeoff

A popular belief in the AI community is that there’s a tradeoff between accuracy and interpretability: At the expense of being uninterpretable, black-box AI systems such as deep neural networks provide flexibility and accuracy that other types of machine learning algorithms lack.

AI interpretability-accuracy tradeoff chart
There’s a general belief in the AI community that more complex models result in better performance

But this really depends on the problem domain, the kind of data available, and the desired results. “When considering problems that have structured data with meaningful features, there is often no significant difference in performance between more complex classifiers and much simpler classifiers after preprocessing,” Rudin notes.

In her paper, Rudin also observes that in some cases, the interpretability provided by a simpler machine learning model is more valuable than the marginal performance gained from applying a black-box AI system. “In those cases, the accuracy/interpretability trade-off is reversed—more interpretability leads to better overall accuracy, not worse,” she writes.

This is especially true in critical domains such as medicine, where physicians need to know the logic behind an AI-made decision and apply their own insights and opinion to it.

Part of the problem stems from a culture that has pervaded the AI community in the wake of the rise in popularity of deep learning. Many researchers are gravitating toward the “bigger is better” approach, in which there’s hope that bigger deep learning models with more layers and parameters and trained on larger data sets will result in breakthroughs in artificial intelligence. This has led to the vast application of deep learning in domains where interpretable AI techniques can provide equally accurate results.

“The belief that there is always a trade-off between accuracy and interpretability has led many researchers to forgo the attempt to produce an interpretable model. This problem is compounded by the fact that researchers are now trained in deep learning, but not in interpretable ML,” Rudin writes.

The problem with AI explainability techniques

Explainable AI robot

Explainability methods usually measure how changes to an AI system’s inputs modify its output without peeking inside it. For instance, in the case of an image classifier, researchers make small changes to pixel values and observe how those changes affect the class the AI detects. Based on these observations, they provide a heat map that shows which pixels (or features, in machine learning jargon) are more relevant to the AI.
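The probing procedure described above can be sketched in a few lines of Python. This is a simplified, occlusion-style variant (each pixel is blanked out rather than nudged slightly), and `black_box` is a hypothetical stand-in for a real classifier. The explainer only calls the model and never looks inside it.

```python
def black_box(image):
    # Stand-in for an opaque classifier: the explainer may call it but not read it.
    # (Hypothetical behavior: it scores the brightness of the top-left 2x2 patch.)
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


def occlusion_map(model, image, fill=0.0):
    """Blank one pixel at a time and record how much the model output moves."""
    base = model(image)
    heat = []
    for r, row in enumerate(image):
        heat_row = []
        for c in range(len(row)):
            perturbed = [list(line) for line in image]  # copy so the input is untouched
            perturbed[r][c] = fill
            heat_row.append(abs(base - model(perturbed)))
        heat.append(heat_row)
    return heat


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(black_box, image)
for heat_row in heat:
    print(heat_row)
```

The resulting grid plays the role of the heat map: larger values mark pixels whose removal changed the output more. It shows where the model is sensitive, but says nothing about what computation it performs on those pixels.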

In her paper, Rudin argues that explainability methods do not necessarily provide insights into how the black-box AI model works.

“Explanation models do not always attempt to mimic the calculations made by the original model,” Rudin writes. “Rather than producing explanations that are faithful to the original model, they show trends in how predictions are related to the features.”

This can lead to erroneous conclusions about black-box AI systems and explainability methods. For instance, an investigation into a black-box recidivism AI system found that the software was racially biased. But the method the researchers used to explain the AI’s decisions was a linear model that depended on race while the recidivism system in question was a complicated, nonlinear AI system. While the investigation did shed light on the need for transparency in AI systems that make critical decisions, it did not provide an accurate explanation of how the targeted system worked. For all we know, there might have been many more problematic correlations in the AI that the investigation did not unearth.
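A toy example, unrelated to the recidivism case, shows how a simple explanation model can misrepresent a nonlinear system. The sketch below fits a straight line to the outputs of a hypothetical `black_box` function and then measures how far the line strays from the model it is supposed to explain. Both the function and the sample points are assumptions chosen for illustration.

```python
def black_box(x):
    # Hypothetical opaque model: strongly nonlinear in its single input.
    return x * x


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [black_box(x) for x in xs]
slope, intercept = fit_line(xs, ys)

# Fidelity check: the largest gap between the explanation and the real model.
worst_gap = max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))
print(f"slope={slope:.2f} intercept={intercept:.2f} worst_gap={worst_gap:.2f}")
```

The fitted slope is essentially zero, so the linear explanation suggests the input does not matter, even though the black box depends on it entirely. Checking the gap between the two is one simple way to see how faithful an explanation model is.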

The problems of AI explanation techniques are also visible in saliency maps for computer vision systems. Most of these techniques will highlight which parts of an image led an image classifier to output a label. But the saliency map for one label does not provide enough information about how the AI system is using the data.

For instance, in the following image, the saliency maps for the “Siberian husky” and “transverse flute” labels are oddly similar. This shows that while the classifier is focusing on the right part of the husky photo, there’s no evidence that it is detecting the right features.

black box AI explanation saliency map for husky
Saliency-map explanations do not provide accurate representations of how black-box AI models work.
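One rough way to spot this issue is to compare the saliency maps produced for different labels on the same image. The sketch below uses cosine similarity for the comparison. The two small maps are made-up numbers, not values from the husky example, and the metric is an assumption rather than something prescribed in the paper.

```python
import math


def cosine_similarity(map_a, map_b):
    """Cosine similarity between two equally sized 2D saliency maps."""
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(a * a for a in flat_a))
    norm_b = math.sqrt(sum(b * b for b in flat_b))
    return dot / (norm_a * norm_b)


# Hypothetical 2x2 saliency maps for two unrelated labels on the same image.
husky_map = [[0.1, 0.9], [0.8, 0.2]]
flute_map = [[0.2, 0.8], [0.9, 0.1]]

similarity = cosine_similarity(husky_map, flute_map)
print(f"similarity between label maps: {similarity:.3f}")
```

A similarity close to 1 for unrelated labels suggests the maps are showing where the object is in the picture rather than what evidence distinguishes one label from another.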

Rudin warns that this kind of practice can mislead users into thinking the explanation is useful. “Poor explanations can make it very hard to troubleshoot a black box,” she writes.

Finally, Rudin notes that explainability techniques not only fail to solve the problem of investigating overly complicated black-box AI, but further exacerbate it by giving us two systems to troubleshoot: the original AI model and the explainability tool.

Corporate greed and black-box AI

There are many cases where companies hide the details of their AI systems for commercial reasons, such as keeping the edge over their competitors. But the problem with this business model is that while it maximizes the profit of the company developing the AI system, it does nothing to minimize the harm and damage it does to the end user, such as a prisoner getting an excessively long sentence or a needy person being refused their loan.

“There is a conflict of responsibility in the use of black box models for high-stakes decisions: the companies that profit from these models are not necessarily responsible for the quality of individual predictions,” Rudin writes.

This trend is especially worrying in areas such as banking, health care, and criminal justice. There’s already a body of work and research on algorithmic bias and AI systems that discriminate against certain demographics. But when the algorithms are kept behind walled gardens and are only accessible to their developers, there’s little opportunity for impartial investigation into their inner workings, and most researchers must rely on flawed black-box explanation methods that map inputs to outputs.

Another argument that tech companies often make to defend black-box AI systems is that secrecy prevents malicious actors from reverse-engineering and gaming their algorithms. Rudin rejects this argument as well. “The reason a system may be gamed is because it most likely was not designed properly in the first place,” she writes, adding that transparency could in fact help improve a system by revealing its flaws.

This is an approach that is being embraced in other fields of software engineering. An example is security, where open source and transparency are increasingly replacing the “security by obscurity” culture, in which companies hope that hiding the details of their software will keep it secure.

There’s no reason for the AI community not to support the same approach.

Encouraging interpretable AI development

While black-box AI systems often cost a fortune to develop and train, they are usually more accessible than the domain expertise and talent required to develop interpretable AI. This is why many companies opt to use deep learning systems that are trained on large datasets instead of putting effort into creating interpretable systems.

But, Rudin notes, “for high-stakes decisions, analyst time and computational time are less expensive than the cost of having a flawed or overly complicated model.” Companies that have experienced the backlash of their black-box AI systems making unexpected, disastrous decisions can attest to that.

To encourage the development of more interpretable AI systems, Rudin proposes regulation that prevents companies from deploying black-box models where an interpretable model can solve the same problem.

“The onus would then fall on organizations to produce black box models only when no transparent model exists for the same task,” Rudin writes.

An alternative is to require organizations that introduce black box models to report the accuracy of interpretable modelling methods on the same task. “In that case, one could more easily determine whether the accuracy/interpretability trade-off claimed by the organization is worthwhile,” Rudin writes.

In her paper, Rudin lays out technical details on some of the pathways that can improve the accuracy and development of interpretable AI models in different domains.

A very interesting example is deep learning systems that can provide explanations of their decisions in terms of high-level features instead of pixel-by-pixel heat maps.

feature-based explanation of deep learning model
New research shows that deep learning models can provide feature-based explanations of their decisions instead of providing an overall pixel-based heatmap.

“If this commentary can shift the focus even slightly from the basic assumption underlying most work in explainable ML—which is that a black box is necessary for accurate predictions—we will have considered this document a success,” she writes. “If this document can encourage policy makers not to accept black box models without significant attempts at interpretable (rather than explainable) models, that would be even better.”


