
The case for self-explainable AI

June 15, 2020

Image credit: Depositphotos

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

Would you trust an artificial intelligence algorithm that works eerily well, making accurate decisions 99.9 percent of the time, but is a mysterious black box? Every system fails every now and then, and when it does, we want explanations, especially when human lives are at stake. And a system that can’t be explained can’t be trusted. That is one of the problems the AI community faces as its creations become smarter and more capable of tackling complicated and critical tasks.

In the past few years, explainable artificial intelligence has become a growing field of interest. Scientists and developers are deploying deep learning algorithms in sensitive fields such as medical imaging analysis and self-driving cars. There is concern, however, about how these AI systems operate. Investigating the inner workings of deep neural networks is very difficult, and their engineers often can’t determine which key factors contribute to their output.

For instance, suppose a neural network has labeled the image of a skin mole as cancerous. Is it because it found malignant patterns in the mole or is it because of irrelevant elements such as image lighting, camera type, or the presence of some other artifact in the image, such as pen markings or rulers?

Researchers have developed a range of interpretability techniques that help investigate the decisions made by machine learning algorithms. But these methods are not enough to address AI’s explainability problem and create trust in deep learning models, argues Daniel Elton, a scientist who researches the applications of artificial intelligence in medical imaging.

Elton discusses why we need to shift from techniques that interpret AI decisions to AI models that can explain their decisions by themselves, as humans do. His paper, “Self-explaining AI as an alternative to interpretable AI,” recently published on the arXiv preprint server, expands on this idea.

What’s wrong with current explainable AI methods?

Classic symbolic AI systems are based on manual rules created by developers. No matter how large and complex they grow, their developers can follow their behavior line by line and investigate errors down to the machine instruction where they occurred. In contrast, machine learning algorithms develop their behavior by comparing training examples and creating statistical models. As a result, their decision-making logic is often ambiguous even to their developers.

Machine learning’s interpretability problem is both well-known and well-researched. In the past few years, it has drawn interest from esteemed academic institutions and DARPA, the research arm of the U.S. Department of Defense.

Efforts in the field generally fall into two categories: global explanations and local explanations. Global explanation techniques look for general interpretations of how a machine learning model works, such as which features of its input data it deems most relevant to its decisions. Local explanation techniques determine which parts of a particular input are relevant to the decision the AI model makes on that input. For instance, they might produce saliency maps that highlight the parts of an image that contributed to a specific decision.

Examples of saliency maps produced by RISE
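To make the idea of a local explanation concrete, here is a minimal sketch of occlusion sensitivity, a simple way to build a saliency map. It is not RISE and not a method from Elton’s paper. The `toy_score` function is a hypothetical stand-in for a trained classifier, and the patch size and fill value are arbitrary assumptions.

```python
import numpy as np

def toy_score(image):
    """Hypothetical stand-in for a classifier: mean brightness of one fixed region."""
    return float(image[2:4, 2:4].mean())

def occlusion_saliency(image, score_fn, patch=2, fill=0.0):
    """Map how much the score drops when each patch of the image is blanked out."""
    base = score_fn(image)
    saliency = np.zeros_like(image, dtype=float)
    height, width = image.shape
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            # A large drop means the model relied on this patch.
            saliency[top:top + patch, left:left + patch] = base - score_fn(occluded)
    return saliency

image = np.ones((6, 6))
saliency_map = occlusion_saliency(image, toy_score)
```

In this toy case the map is nonzero only over the region the scoring function actually reads. With a real network, the same map could reveal whether a decision depends on the mole itself or on an artifact such as a ruler in the corner of the image.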

All these techniques “have flaws, and there is confusion regarding how to properly interpret an interpretation,” Elton writes.

Elton also challenges another popular belief about deep learning. Many scientists believe that deep neural networks extract high-level features and rules from their underlying problem domain. This means that, for instance, when you train a convolutional neural network on many labeled images, it will tune its parameters to detect various features shared between them.

This is true, depending on what you mean by “features.” There’s a body of research that shows neural networks do in fact learn recurring patterns in images and other data types. At the same time, there’s plenty of evidence that deep learning algorithms do not learn the general features of their training examples, which is why they are rigidly limited to their narrow domains.

“Actually, deep neural networks are ‘dumb’- any regularities that they appear to have captured internally are solely due to the data that was fed to them, rather than a self-directed ‘regularity extraction’ process,” Elton writes.

Citing a paper published in the peer-reviewed scientific journal Neuron, Elton posits that, in fact, deep neural networks “function through the interpolation of data points, rather than extrapolation.”
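The difference between interpolation and extrapolation can be seen in a toy example. The sketch below is an illustration only: it uses piecewise-linear interpolation over sampled points as a stand-in for a trained model (an assumption, not a neural network), and the sine function and training range are arbitrary choices.

```python
import numpy as np

# "Training data": samples of sin(x) on the range [0, pi] only.
train_x = np.linspace(0.0, np.pi, 50)
train_y = np.sin(train_x)

def predict(x):
    """Interpolate between training points; outside the range, np.interp holds the edge value."""
    return np.interp(x, train_x, train_y)

# Inside the training range the prediction is close to the true value.
inside_error = abs(predict(1.0) - np.sin(1.0))

# Outside the range the model has nothing to interpolate between.
outside_error = abs(predict(1.5 * np.pi) - np.sin(1.5 * np.pi))
```

The model is accurate between the points it has seen and badly wrong beyond them, even though it gives no sign that anything has changed.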

Some research is focused on developing “interpretable” AI models to replace current black boxes. These models make their reasoning logic visible and transparent to developers. In many cases, especially in deep learning, swapping an existing model for an interpretable one comes at the cost of accuracy. That would be self-defeating, since higher accuracy is the reason we opt for more complex models in the first place.

“Attempts to compress deep neural networks into a simpler interpretable models with equivalent accuracy typically fail when working with complex real world data such as images or human language,” Elton notes.

Your brain is a black box

One of Elton’s main arguments is about adopting a different view of understanding AI decisions. Most efforts focus on breaking open the “AI black box” and figuring out how it works at a very low and technical level. But when it comes to the human brain, the ultimate destination of AI research, we’ve never had such reservations.

“The human brain also appears to be an overfit ‘black box’ which performs interpolation, which means that how we understand brain function also needs to change,” he writes. “If evolution settled on a model (the brain) which is uninterpretable, then we expect advanced AIs to also be of that type.”

What this means is that when it comes to understanding human decisions, we seldom investigate neuron activations. There’s a lot of research in neuroscience that helps us better understand the workings of the brain, but for millennia, we’ve relied on other mechanisms to interpret human behavior.

“Interestingly, although the human brain is a ‘black box’, we are able to trust each other. Part of this trust comes from our ability to ‘explain’ our decision making in terms which make sense to us,” Elton writes. “Crucially, for trust to occur we must believe that a person is not being deliberately deceptive, and that their verbal explanations actually maps onto the processes used in their brain to arrive at their decisions.”

One day, science might enable us to explain human decisions at the neuron activation level. But for the moment, most of us rely on understandable, verbal explanations of our decisions and the mechanisms we have to establish trust between each other.

The interpretation of deep learning, however, is focused on investigating activations and parameter weights instead of high-level, understandable explanations. “As we try to accurately explain the details of how a deep neural network interpolates, we move further from what may be considered relevant to the user,” Elton writes.

Self-explainable artificial intelligence

Based on the trust and explanation model that exists between humans, Elton calls for “self-explaining AI” that, like a human, can explain its decision.

An explainable AI yields two pieces of information: its decision and the explanation of that decision.

This is an idea that has been proposed and explored before. However, what Elton proposes is self-explaining AI that still maintains its complexity (e.g., deep neural networks with many layers) and does not sacrifice its accuracy for the sake of explainability.
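As a rough sketch of what “a decision plus an explanation” could look like as a program interface, consider the example below. Everything in it is hypothetical: the feature names, weights, and threshold are made up for illustration, and the model is a simple linear scorer rather than the deep network Elton has in mind. It only shows the shape of the output.

```python
from dataclasses import dataclass

@dataclass
class ExplainedDecision:
    decision: str     # the model's output label
    explanation: str  # human-readable reason for that label

# Hypothetical weights and threshold for a toy linear scorer (not from any real model).
WEIGHTS = {"border_irregularity": 2.0, "color_variation": 1.5, "image_brightness": 0.1}
THRESHOLD = 1.0

def decide_and_explain(features):
    """Return a label together with the feature that contributed most to the score."""
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    score = sum(contributions.values())
    decision = "suspicious" if score >= THRESHOLD else "benign"
    top_feature = max(contributions, key=contributions.get)
    explanation = (
        f"largest contribution came from {top_feature} "
        f"({contributions[top_feature]:.2f} of {score:.2f})"
    )
    return ExplainedDecision(decision, explanation)

result = decide_and_explain(
    {"border_irregularity": 0.8, "color_variation": 0.2, "image_brightness": 0.5}
)
```

If the explanation named `image_brightness` instead of a property of the mole, a user would have grounds to distrust the decision. That check is not possible when the model returns only a label.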

In the paper, Elton suggests how relevant causal information can be extracted from a neural network. While the details are a bit technical, what the technique basically does is extract meaningful information from the neural network’s layers while avoiding spurious correlations. His method builds on current self-explaining AI systems developed by other researchers and verifies whether explanations and predictions in their neural networks correspond.

In his paper, Elton also discusses the need to specify the limits of AI algorithms. Neural networks tend to provide an output value for any input they receive. Self-explainable AI models should “send an alert” when results fall “outside the model’s applicability domain,” Elton says. “Applicability domain analysis can be framed as a simple form of AI self-awareness, which is thought by some to be an important component for AI safety in advanced AIs.”
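One simple way to approximate such an alert is to flag inputs that lie far from anything in the training set. The sketch below is an assumption-laden illustration, not the procedure from Elton’s paper: it uses Euclidean distance to the nearest training point, a threshold taken from the training set’s own nearest-neighbor distances, and a small synthetic grid as training data.

```python
import numpy as np

def fit_threshold(train_inputs, quantile=0.95):
    """Choose an alert threshold from nearest-neighbor distances within the training set."""
    diffs = train_inputs[:, None, :] - train_inputs[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    return float(np.quantile(dists.min(axis=1), quantile))

def outside_domain(x, train_inputs, threshold):
    """Return True when x is farther from every training point than the threshold."""
    nearest = np.linalg.norm(train_inputs - x, axis=1).min()
    return bool(nearest > threshold)

# Synthetic training inputs: a regular grid over the unit square.
axis = np.linspace(0.0, 1.0, 11)
train_inputs = np.array([[a, b] for a in axis for b in axis])
threshold = fit_threshold(train_inputs)

inside_alert = outside_domain(np.array([0.52, 0.47]), train_inputs, threshold)
far_alert = outside_domain(np.array([3.0, 3.0]), train_inputs, threshold)
```

Here the query inside the unit square raises no alert, while the far-away query does. For images or text, raw input distance is rarely meaningful, so a practical system would likely measure distance in a learned feature space instead.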

Self-explainable AI models should provide confidence levels for both their output and their explanation.

Applicability domain analysis is especially important “for AI systems where robustness and trust are important, so that systems can alert their user if they are asked work outside their domain of applicability,” Elton concludes. An obvious example would be health care, where errors can result in irreparable damage to health. But there are plenty of other areas, such as banking, loans, recruitment, and criminal justice, where we need to know the limits and boundaries of our AI systems.

Much of this is still hypothetical, and Elton provides little in terms of implementation details, but it is a nice direction to follow as the explainable AI landscape develops.

1 COMMENT

vonjd September 17, 2020 at 10:17 pm

Thank you for this great article. To get an intuitive understanding of Explainable AI (XAI) this might be helpful: https://blog.ephorie.de/explainable-ai-xai-explained-or-how-to-whiten-any-black-box-with-lime
