This article is part of our coverage of the latest in AI research.
It’s fascinating how, in just a few years, large language models (LLMs) went from an intriguing new breed of deep learning model (built on the transformer architecture) to one of the hottest areas of AI research. Of special interest is the capacity of LLMs like OpenAI’s GPT-3 and DeepMind’s Gopher to generate long sequences of (mostly) coherent text.
But one of the problems with LLMs is that they always have an answer to your prompt, even if that answer is completely wrong. And there have been numerous cases of LLMs making wrong claims and generating text that, although impressive, is utter nonsense.
LLMs are gradually finding their way into real-world applications, from composing emails and writing articles to answering questions and filling in for customer service agents. Accordingly, there is growing interest in finding ways to determine the reliability and trustworthiness of the answers these machine learning models produce. According to a new study by researchers at OpenAI and the University of Oxford, large language models can be calibrated to express their level of certainty in the answers they provide. The study, which focuses on GPT-3, shows that with the right training, LLMs can contribute to making AI systems aligned with human goals and intents.
Logits and confidence levels in machine learning
“Having language models express their uncertainty is a crucial aspect of honesty: there will always be things that models don’t know for sure, and so uncertainty is necessary for conveying the model’s knowledge faithfully,” Jacob Hilton, AI researcher at OpenAI and co-author of the paper, told TechTalks.
Measuring confidence is not a new problem in machine learning. Most ML models have one way or another to reveal the reliability of their predictions. For example, consider a convolutional neural network (CNN) designed to recognize handwritten digits, classifying images into one of ten classes (0-9). The output layer of the neural network provides ten values, each of which is the probability that the input image fed to the model belongs to one of the target classes. Usually, applications consider the output with the highest probability as the deep learning model’s predicted class.
The raw values the network produces before this final conversion are called “logits”; passing them through a softmax activation in the last layer turns them into probabilities (and taking the logarithm of those gives “log probabilities”). Logits are very useful in many applications, like the image classification example mentioned above. For example, if there’s a very big gap between the highest logit value and the rest, it shows that the model has very high confidence in its prediction.
But if two or more logits are close to each other, it shows that the neural network is not confident in its prediction (for example, some people write the digit 1 in a way that makes the neural network confuse it with a 7).
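To make this concrete, here is a minimal sketch of reading confidence off a classifier’s outputs. The logit values are invented for a hypothetical ten-class digit classifier; the softmax conversion itself is standard.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    exps = [math.exp(x - max(logits)) for x in logits]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a digit classifier (classes 0-9):
confident = [0.1, 9.5, 0.3, 0.2, 0.1, 0.0, 0.2, 0.4, 0.1, 0.3]  # clearly a "1"
ambiguous = [0.1, 5.1, 0.3, 0.2, 0.1, 0.0, 0.2, 4.9, 0.1, 0.3]  # "1" vs "7"

for name, logits in [("confident", confident), ("ambiguous", ambiguous)]:
    probs = softmax(logits)
    prediction = probs.index(max(probs))
    print(f"{name}: predicted {prediction} with probability {max(probs):.2f}")
```

In the first case the top probability is close to 1; in the second it hovers around 0.5, signaling that the network is torn between a 1 and a 7.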
However, when it comes to more complicated applications of deep neural networks like language processing, logits do not align with human understanding of confidence.
“In other contexts, such as image classification, logits can often be used to infer the model’s confidence,” Hilton said. “However, for a language model, logits only tell you the model’s confidence that a claim will be stated in a particular way, not the model’s confidence in the claim itself.”
In other words, if a large language model like GPT-3 can express the same answer with many different wordings, then each individual wording will receive a low probability, even when the model is quite sure of the underlying answer. This is the model’s uncertainty “over tokens,” the researchers write. Ideally, the model should instead express confidence in the claim itself, which the researchers define as “epistemic uncertainty.”
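A toy illustration of the gap between token-level and claim-level confidence — the per-wording probabilities below are invented for the example:

```python
# Hypothetical probabilities a model might assign to different wordings
# of the very same claim ("the answer is 897"):
wordings = {
    "The answer is 897.": 0.20,
    "It's 897.": 0.18,
    "897": 0.25,
}

# Each individual wording looks uncertain in isolation...
most_likely_wording = max(wordings.values())

# ...but the model's confidence in the claim itself is their sum:
claim_prob = sum(wordings.values())
```

No single wording rises above 0.25, yet the model’s total confidence in the claim is 0.63 — which is why per-token logits understate a language model’s epistemic confidence.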
In their paper, the researchers focus on teaching LLMs to express their uncertainty in numeric and verbal form along with their output (e.g., “Confidence: 61% / Medium”). The benefit of verbalized probabilities is that they apply to “any model that outputs natural language” and “mirror human expression of uncertainty,” according to the researchers.
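As a rough sketch, a numeric confidence could be bucketed into the five verbal labels like this. The even 20-point cutoffs are my assumption for illustration; the “61% / Medium” example above suggests the paper’s actual boundaries differ.

```python
def verbalize(pct):
    """Map a numeric confidence (0-100) to one of five verbal labels.
    NOTE: the even 20-point buckets are an illustrative assumption,
    not the paper's actual boundaries."""
    buckets = ["lowest", "low", "medium", "high", "highest"]
    return buckets[min(int(pct // 20), 4)]

print(f"Confidence: 35% / {verbalize(35).capitalize()}")  # prints "Confidence: 35% / Low"
```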
“This allows models to respond to prompts from non-technical users (e.g., ‘How sure are you about what you just said?’, ‘I’ve told you my confidence on a scale from 1-5. Can you do the same?’),” the researchers write. “This also allows models to decide when and how to provide uncertainty information (depending on the human audience).”
Setting a benchmark for LLM uncertainty
To finetune large language models and evaluate their capacity to express their epistemic uncertainty, the researchers propose CalibratedMath, a benchmark for arithmetic problem-solving. CalibratedMath defines a set of problems distributed across 21 categories, including basic operations, rounding, and finding remainders. GPT-3’s performance on different sub-tasks varies, which is “crucial for a challenging test of calibration,” the researchers write.
Numerous studies have shown that neural networks can improve their scores on benchmarks without learning the logic underlying the tasks they are evaluated on. This becomes evident when the ML model cannot generalize its learned behavior beyond its training distribution, performing poorly when pitted against real-world examples.
The researchers designed the training and test examples of the CalibratedMath benchmark to maximize generalization over distribution shift. For example, the training set includes “add-subtract” examples that have a unique correct answer (e.g., “What is 952 – 55?”), while the evaluation set is composed of problems that can have multiple answers (e.g., “Name any number that is smaller than 621”) or multiplication-division problems.
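The two kinds of items can be sketched as follows, reusing the article’s own example questions (the function names and checker-based representation are mine):

```python
def add_subtract_item(a, b):
    """Training-style item with a unique correct answer."""
    return f"What is {a} - {b}?", str(a - b)

def multi_answer_item(n):
    """Evaluation-style item with many valid answers; instead of a
    single target string, we return a checker function."""
    return f"Name any number that is smaller than {n}", lambda ans: int(ans) < n

train_q, train_a = add_subtract_item(952, 55)   # unique answer: "897"
eval_q, is_valid = multi_answer_item(621)       # any number below 621 is valid
```

The distribution shift matters precisely because the evaluation items cannot be graded against a single memorized answer string.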
Finetuning language models to express uncertainty
The ultimate goal of CalibratedMath is not to improve the model’s answers but its expression of uncertainty over those answers. Therefore, the model is finetuned with supervised learning on labeled examples of confidence expressions. The researchers train GPT-3 on examples that include question–answer pairs along with the answer’s confidence score. During the evaluation phase, the model is given new question–answer pairs and must specify the confidence level of each answer.
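One plausible way to lay out such a supervised example is below. The field names and prompt layout are illustrative assumptions, not the paper’s exact format.

```python
def to_finetune_example(question, model_answer, confidence_pct):
    """Build one supervised example: the prompt shows the question and the
    model's own answer; the target the model learns to produce is the
    confidence label. Layout is illustrative, not the paper's exact format."""
    prompt = f"Q: {question}\nA: {model_answer}\nConfidence:"
    completion = f" {confidence_pct}%"
    return {"prompt": prompt, "completion": completion}

ex = to_finetune_example("What is 952 - 55?", "897", 61)
```

The key design point is that the target is the confidence label alone — the answer appears only in the prompt, so training never pushes the model to change its answers, only to judge them.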
In the study, the researchers test two uncertainty expression methods. The first is the numeric and verbalized confidence score described earlier, in which the label is a percentage value (e.g., 61%) or a textual description (e.g., lowest, low, medium, high, highest) of the model’s confidence in its answer.
In the second method, called “indirect logit,” the label is a “true/false” value that indicates whether the model’s answer was correct. The model’s predicted probability of “True” is compared against this ground-truth label to calculate the cross-entropy loss, the same loss used to train binary classification ML models.
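With a true/false target, cross-entropy reduces to the familiar binary cross-entropy; a minimal sketch of the loss being optimized:

```python
import math

def binary_cross_entropy(p_correct, label):
    """Loss for the 'indirect logit' setup: p_correct is the model's stated
    probability that its answer is correct; label is 1 if the answer
    actually was correct, else 0."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p_correct, eps), 1 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Claiming "90% sure" is cheap when you're right and costly when you're wrong:
loss_right = binary_cross_entropy(0.9, 1)   # ~0.105
loss_wrong = binary_cross_entropy(0.9, 0)   # ~2.303
```

Because this is a proper scoring rule, the loss is minimized in expectation only when the stated probability matches the true chance of being correct — which is exactly the incentive Hilton describes below.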
“The way to incentivize a model to represent its true level of uncertainty is to optimize a proper scoring rule,” Hilton said. “Cross-entropy loss is an example of this (as we use in our ‘indirect logit’ method). This is not normally how language models are trained to verbalize uncertainty, however, and so in practice, language models do learn to rehash canned responses from their training data.”
The researchers’ experiments show that when calibrated for verbalized probabilities, GPT-3 generalizes well to the “multi-answer” and “multiply-divide” evaluation sets and remains “moderately calibrated under a substantial distribution shift.” However, while it outperforms the baseline and the indirect logit method, verbalized probability calibration still performs better on its training set than the multi-answer evaluation set. This is because the model’s answers to multi-answer questions are more likely to be correct than answers to add-subtract problems.
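Calibration of this kind is typically scored with a squared-error metric such as the Brier score — mean squared error between stated confidence and the 0/1 outcome, in the same family as the MSE metric the paper reports. A minimal sketch with made-up numbers:

```python
def brier_score(confidences, outcomes):
    """Mean squared error between stated confidences (in [0, 1]) and the
    0/1 record of whether each answer was correct; lower is better."""
    pairs = list(zip(confidences, outcomes))
    return sum((c - o) ** 2 for c, o in pairs) / len(pairs)

# Made-up confidences and correctness outcomes for three answers:
score = brier_score([0.9, 0.6, 0.2], [1, 1, 0])  # ≈ 0.07
```

A perfectly calibrated and perfectly accurate model scores 0; a model that always claims certainty and is half right scores 0.5.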
The indirect logit method, on the other hand, generalizes reasonably well on multi-answer questions while performing poorly on multiply-divide questions. “Further work could explore how the indirect logit compares to verbalized probability with different training setups (e.g. a more diverse distribution on probabilities and questions),” the researchers write.
One interesting finding in the study is that GPT-3 has learned the relevant features of its inputs during pre-training, which means the finetuning only adjusts the model to express those “latent” representations. “GPT-3 learns to express its own (pre-existing) uncertainty about answers and exhibits ‘honesty’ (i.e. communicating its actual epistemic state in words),” the researchers write.
This is an important finding because it can help guide future research on investigating what large language models learn and steer them in the right direction.
As for further investigation into LLM expression of uncertainty, the researchers propose testing LLM families other than GPT-3, “especially models that have a better grasp of probability before being finetuned.” They also suggest testing calibration in other domains, such as history and biology, and in other prompt formats, such as chat and long-form question answering.
Another possible direction is to replace supervised finetuning with a more flexible approach such as reinforcement learning. RL could remove the manual labeling bottleneck that supervised learning imposes, but it might have other challenges.
“In theory, RL can be used to incentivize the model to verbalize its true level of uncertainty—using a proper scoring rule, for example,” Hilton said. “However, this requires access to ground truth about how likely the model’s claim is to be correct, which can become increasingly challenging to obtain as models become more intelligent. This is known as the ‘scalable oversight’ problem, and is seen as an important bottleneck to aligning advanced AI systems with human interests.”