This article is part of our coverage of the latest in AI research.
One of the interesting abilities of large language models (LLMs) like ChatGPT and Bard is self-explanation. These models can provide step-by-step details on solving complex math problems or explain the sentiment they assign to movie or book reviews. But do these explanations genuinely mirror the model’s inner workings, or do they merely offer a veneer of transparency, detached from the actual prediction process?
A recent study by researchers at the University of California, Santa Cruz, and MIT tries to answer this question. The scientists compare self-explanation with other traditional methods for interpreting the predictions of machine learning models. Their findings offer valuable insights into the efficacy of various explanation techniques. Most notably, they discover that while self-explanation enhances transparency, it does so at the cost of model accuracy.
Traditional ML explanations vs self-explanation
The traditional way to interpret the decisions of machine learning models involves “feature attribution.” This set of methods assesses how different elements of the model’s input contribute to its output. For instance, in image classifiers, explainability techniques often generate heat or saliency maps. These maps highlight areas in the image that are pertinent to the class assigned by the model.
In natural language processing applications, such as sentiment analysis or text classification, feature attribution typically assigns scores to different words in the input sentence, indicating their relevance to the output class.
In contrast, LLMs possess the unique ability to self-explain their outputs. For example, if an LLM classifies a product review as positive, it can also provide an explanation for this classification. There are essentially two methods for self-explanation. The first is the “explain-then-predict” (E-P) approach, where the model first generates an explanation and then arrives at a prediction based on it. The second is the “predict-and-explain” (P-E) approach, where the model first makes a prediction and then explains it. These self-explanation capabilities of LLMs offer a new dimension to understanding their outputs.
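The two orderings can be illustrated as prompt templates. This is a minimal sketch; the exact wording is our own assumption, not the prompts used in the study:

```python
# Hypothetical prompt templates for the two self-explanation orderings.
# E-P asks for the explanation before the label; P-E asks for the label first.

REVIEW = "The acting was superb, but the script dragged."

# Explain-then-predict (E-P): explanation first, then the prediction.
explain_then_predict = (
    "Review: {review}\n"
    "First explain which words drive the sentiment, "
    "then state the sentiment label."
).format(review=REVIEW)

# Predict-and-explain (P-E): prediction first, then the explanation.
predict_and_explain = (
    "Review: {review}\n"
    "First state the sentiment label, "
    "then explain which words drive it."
).format(review=REVIEW)
```

The only difference is the order of the two sub-tasks, which determines whether the explanation can condition the prediction (E-P) or merely rationalize it after the fact (P-E).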
Comparing LLM explanation methods
In their study, the researchers used sentiment analysis examples with ChatGPT to compare feature attribution methods such as LIME with the two self-explanation methods. To better probe self-explanation, they experimented with different prompt and instruction formats.
In some experiments, they provided explicit instructions to the model to output a list of the top-k words it identified as relevant to its prediction. In others, they required the model to assign a relevance score to each word. They also instructed the model to provide a confidence score for its prediction.
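These three instruction variants can be sketched as a small template builder. The phrasing below is a hypothetical reconstruction for illustration, not the study's exact instructions:

```python
# Hypothetical instruction formats for the three prompt variants described
# above: a top-k word list, per-word relevance scores, and a confidence score.

def build_instruction(variant, k=5):
    if variant == "top_k":
        return f"List the top {k} words most relevant to your prediction."
    if variant == "word_scores":
        return "Assign each word a relevance score between -1 and 1."
    if variant == "confidence":
        return "Also report a confidence score between 0 and 1 for your prediction."
    raise ValueError(f"unknown variant: {variant}")

instruction = build_instruction("top_k", k=3)
```

Appending one of these instructions to a classification prompt yields a structured self-explanation that can then be compared word-for-word against feature attribution scores.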
The researchers compared self-explanation and traditional explanation methods on two primary fronts: a suite of faithfulness evaluation metrics and a set of disagreement measurements among explanation techniques. Traditional explanation methods require access to model weights and gradients, which is not feasible with closed models like ChatGPT. To circumvent this, the researchers used the “occlusion method”: feeding the same prompt to ChatGPT multiple times, each time removing certain words to observe their impact on the model’s output. They used this method to rank the importance of each word.
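The occlusion idea can be sketched with a leave-one-out loop. The sketch below is our own simplified assumption of the technique: `sentiment_score` is a toy lexicon scorer standing in for a call to the real model, and words are removed one at a time rather than in groups:

```python
# Hedged sketch of occlusion-based word importance. Since a closed model's
# weights are inaccessible, importance is estimated purely from outputs:
# delete a word, re-score the input, and measure how much the output moves.

def sentiment_score(words):
    # Toy stand-in for an LLM's sentiment probability (an assumption,
    # not the actual model used in the study).
    lexicon = {"great": 1.0, "boring": -1.0, "loved": 0.8, "plot": 0.0}
    return sum(lexicon.get(w.lower(), 0.0) for w in words)

def occlusion_importance(sentence):
    words = sentence.split()
    base = sentiment_score(words)
    scores = {}
    for i, w in enumerate(words):
        reduced = words[:i] + words[i + 1:]
        # Importance = magnitude of the output change when the word is removed.
        scores[w] = abs(base - sentiment_score(reduced))
    # Rank words from most to least important.
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranking = occlusion_importance("I loved the plot but the pacing was boring")
```

Against a real API the scoring function would be a (costly) model call, which is why occlusion requires one query per occluded word.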
To measure the faithfulness of the explanation methods, they used various techniques. For instance, they removed the top-k words reported as most important to see if it altered the model’s decision.
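The top-k deletion check can be sketched in a few lines. Again, `classify` is a deterministic toy stand-in for the model, and the ranked word list plays the role of an explanation under test; both are illustrative assumptions:

```python
# Hedged sketch of a top-k deletion faithfulness check: delete the k words
# an explanation ranks as most important and see whether the classifier's
# decision flips. A faithful explanation should identify words whose
# removal changes the prediction.

def classify(words):
    # Toy stand-in for the model's sentiment decision.
    lexicon = {"wonderful": 1.0, "dull": -1.0, "film": 0.0}
    score = sum(lexicon.get(w.lower(), 0.0) for w in words)
    return "positive" if score >= 0 else "negative"

def decision_flips(sentence, ranked_words, k):
    words = sentence.split()
    original = classify(words)
    top_k = set(ranked_words[:k])
    reduced = [w for w in words if w not in top_k]
    return classify(reduced) != original

# An explanation ranking "wonderful" first is faithful on this input:
# deleting it flips the toy prediction from positive to negative.
flipped = decision_flips("a wonderful but dull film", ["wonderful", "dull"], k=1)
```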
Accuracy vs interpretability
The researchers tested the explanation techniques on a dataset of movie reviews and their corresponding sentiments. They discovered that the performance of self-explanations is comparable to traditional methods in faithfulness evaluations: the self-explanations usually highlighted the input words genuinely associated with the labeled sentiment.
Given that traditional methods like LIME require multiple prompts to ChatGPT, they can be time-consuming and costly. This makes self-explanation a viable substitute.
The researchers found that different self-explanation prompting techniques were “intuitively reasonable in highlighting words of strong intrinsic sentiment values.” However, they also observed a drop in the overall accuracy of the model when it was asked to explain its prediction. The researchers hypothesize that “feature attribution explanations may not be the best form of explanation for sentiment analysis, which forces the model into an uncomfortable accuracy-interpretability tradeoff.”
Interestingly, they also discovered a high level of disagreement between different explanation methods, complicating the evaluation process. “We find that explanations that perform similarly on faithfulness metrics also have high disagreement (for the ChatGPT model),” the researchers noted.
Importantly, the researchers conclude that “the classic interpretability pipeline of defining and evaluating model explanations may be fundamentally ill-suited for these LLMs with quite human-like reasoning ability.” Given the lack of previous work on studying LLM-generated feature attribution explanations, they acknowledged that “it is likely that our solution is not optimal, and better ways to elicit self-explanations could be developed.”
This work is part of the broader effort to study the reasoning abilities of LLMs such as ChatGPT. It is widely accepted that these abilities are limited, or at the very least very different from those of humans. Better understanding, harnessing, and enhancing reasoning in LLMs will be crucial to creating robust applications with these models.