This article is part of our coverage of the latest in AI research.
One of the interesting abilities of large language models (LLMs) like ChatGPT and Bard is self-explanation. These models can provide step-by-step details on solving complex math problems or explain the sentiment they assign to movie or book reviews. But do these explanations genuinely mirror the model’s inner workings, or do they merely offer a veneer of transparency, detached from the actual prediction process?
A recent study by researchers at the University of California, Santa Cruz, and MIT tries to answer this question. The scientists compare self-explanation with other traditional methods for interpreting the predictions of machine learning models. Their findings offer valuable insights into the efficacy of various explanation techniques. Most notably, they discover that while self-explanation enhances transparency, it does so at the cost of model accuracy.
Traditional ML explanations vs self-explanation
The traditional way to interpret the decisions of machine learning models involves “feature attribution.” This set of methods assesses how different elements of the model’s input contribute to its output. For instance, in image classifiers, explainability techniques often generate heat or saliency maps. These maps highlight areas in the image that are pertinent to the class assigned by the model.
In natural language processing applications, such as sentiment analysis or text classification, feature attribution typically assigns scores to different words in the input sentence, indicating their relevance to the output class.
In contrast, LLMs possess the unique ability to self-explain their outputs. For example, if an LLM classifies a product review as positive, it can also provide an explanation for this classification. There are essentially two methods for self-explanation. The first is the “explain-then-predict” (E-P) approach, where the model first generates an explanation and then arrives at a prediction based on it. The second is the “predict-and-explain” (P-E) approach, where the model first makes a prediction and then explains it. These self-explanation capabilities of LLMs offer a new dimension to understanding their outputs.
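The two orderings can be illustrated as prompt templates. This is a minimal sketch; the exact wording is our own assumption, not the prompts used in the study:

```python
# Hypothetical prompt templates for the two self-explanation orderings.
# E-P asks for the explanation before the label; P-E asks for the label first.

REVIEW = "The acting was superb, but the script dragged."

# Explain-then-predict (E-P): explanation first, then the prediction.
explain_then_predict = (
    "Review: {review}\n"
    "First explain which words drive the sentiment, "
    "then state the sentiment label."
).format(review=REVIEW)

# Predict-and-explain (P-E): prediction first, then the explanation.
predict_and_explain = (
    "Review: {review}\n"
    "First state the sentiment label, "
    "then explain which words drive it."
).format(review=REVIEW)
```

The only difference is the order of the two sub-tasks, which determines whether the explanation can condition the prediction (E-P) or merely rationalize it after the fact (P-E).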
Comparing LLM explanation methods
In their study, the researchers used sentiment analysis examples with ChatGPT to compare feature attribution methods such as LIME with the two self-explanation methods. To better probe self-explanation, they experimented with different prompt and instruction formats.
In some experiments, they provided explicit instructions to the model to output a list of the top-k words it identified as relevant to its prediction. In others, they required the model to assign a relevance score to each word. They also instructed the model to provide a confidence score for its prediction.
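These three instruction variants can be sketched as a small template builder. The phrasing below is a hypothetical reconstruction for illustration, not the study's exact instructions:

```python
# Hypothetical instruction formats for the three prompt variants described
# above: a top-k word list, per-word relevance scores, and a confidence score.

def build_instruction(variant, k=5):
    if variant == "top_k":
        return f"List the top {k} words most relevant to your prediction."
    if variant == "word_scores":
        return "Assign each word a relevance score between -1 and 1."
    if variant == "confidence":
        return "Also report a confidence score between 0 and 1 for your prediction."
    raise ValueError(f"unknown variant: {variant}")

instruction = build_instruction("top_k", k=3)
```

Appending one of these instructions to a classification prompt yields a structured self-explanation that can then be compared word-for-word against feature attribution scores.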
The researchers compared self-explanation and traditional explanation methods on two primary fronts: a suite of faithfulness evaluation metrics and a set of disagreement measurements among explanation techniques. Traditional explanation methods require access to model weights and gradients, which is not feasible with closed models like ChatGPT. To circumvent this, the researchers used the “occlusion method”: feeding the same prompt to ChatGPT multiple times, each time removing certain words to observe their impact on the model’s output. They used this method to rank the importance of each word.
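The occlusion idea can be sketched with a leave-one-out loop. The sketch below is our own simplified assumption of the technique: `sentiment_score` is a toy lexicon scorer standing in for a call to the real model, and words are removed one at a time rather than in groups:

```python
# Hedged sketch of occlusion-based word importance. Since a closed model's
# weights are inaccessible, importance is estimated purely from outputs:
# delete a word, re-score the input, and measure how much the output moves.

def sentiment_score(words):
    # Toy stand-in for an LLM's sentiment probability (an assumption,
    # not the actual model used in the study).
    lexicon = {"great": 1.0, "boring": -1.0, "loved": 0.8, "plot": 0.0}
    return sum(lexicon.get(w.lower(), 0.0) for w in words)

def occlusion_importance(sentence):
    words = sentence.split()
    base = sentiment_score(words)
    scores = {}
    for i, w in enumerate(words):
        reduced = words[:i] + words[i + 1:]
        # Importance = magnitude of the output change when the word is removed.
        scores[w] = abs(base - sentiment_score(reduced))
    # Rank words from most to least important.
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranking = occlusion_importance("I loved the plot but the pacing was boring")
```

Against a real API the scoring function would be a (costly) model call, which is why occlusion requires one query per occluded word.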
To measure the faithfulness of the explanation methods, they used various techniques. For instance, they removed the top-k words reported as most important to see if it altered the model’s decision.
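The top-k deletion check can be sketched in a few lines. Again, `classify` is a deterministic toy stand-in for the model, and the ranked word list plays the role of an explanation under test; both are illustrative assumptions:

```python
# Hedged sketch of a top-k deletion faithfulness check: delete the k words
# an explanation ranks as most important and see whether the classifier's
# decision flips. A faithful explanation should identify words whose
# removal changes the prediction.

def classify(words):
    # Toy stand-in for the model's sentiment decision.
    lexicon = {"wonderful": 1.0, "dull": -1.0, "film": 0.0}
    score = sum(lexicon.get(w.lower(), 0.0) for w in words)
    return "positive" if score >= 0 else "negative"

def decision_flips(sentence, ranked_words, k):
    words = sentence.split()
    original = classify(words)
    top_k = set(ranked_words[:k])
    reduced = [w for w in words if w not in top_k]
    return classify(reduced) != original

# An explanation ranking "wonderful" first is faithful on this input:
# deleting it flips the toy prediction from positive to negative.
flipped = decision_flips("a wonderful but dull film", ["wonderful", "dull"], k=1)
```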
Accuracy vs interpretability
The researchers tested the explanation techniques on a dataset of movie reviews and their corresponding sentiments. They discovered that the performance of self-explanations is comparable to traditional methods in faithfulness evaluations: the self-explanations usually highlighted the input words genuinely associated with the labeled sentiment.
Given that traditional methods like LIME require multiple prompts to ChatGPT, they can be time-consuming and costly. This makes self-explanation a viable substitute.
The researchers found that different self-explanation prompting techniques were “intuitively reasonable in highlighting words of strong intrinsic sentiment values.” However, they also observed a drop in the overall accuracy of the model when it was asked to explain its prediction. The researchers hypothesize that “feature attribution explanations may not be the best form of explanation for sentiment analysis, which forces the model into an uncomfortable accuracy-interpretability tradeoff.”
Interestingly, they also discovered a high level of disagreement between different explanation methods, complicating the evaluation process. “We find that explanations that perform similarly on faithfulness metrics also have high disagreement (for the ChatGPT model),” the researchers noted.
Importantly, the researchers conclude that “the classic interpretability pipeline of defining and evaluating model explanations may be fundamentally ill-suited for these LLMs with quite human-like reasoning ability.” Given the lack of previous work on studying LLM-generated feature attribution explanations, they acknowledged that “it is likely that our solution is not optimal, and better ways to elicit self-explanations could be developed.”
This work is part of the broader effort to study the reasoning abilities of LLMs such as ChatGPT. It is widely accepted that these abilities are limited, or at the very least very different from those of humans. Better understanding, harnessing, and enhancing reasoning in LLMs will be crucial to creating robust applications with these models.