LLMs can’t self-correct in reasoning tasks, DeepMind study finds

Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

Scientists are devising various strategies, such as retrieval augmentation and chain-of-thought reasoning, to enhance the accuracy and reasoning abilities of large language models (LLMs).

Among these, “self-correction”—a technique where an LLM refines its own responses—has gained significant traction, demonstrating efficacy across numerous applications. However, the mechanics behind its success remain elusive. 

A recent study conducted by Google DeepMind in collaboration with the University of Illinois at Urbana-Champaign reveals that LLMs often falter when self-correcting their responses without external feedback. In fact, the study suggests that self-correction can sometimes impair the performance of these models, challenging the prevailing understanding of this popular technique.

What is self-correction?

Self-correction is predicated on the idea that LLMs can assess the accuracy of their outputs and refine their responses. For instance, an LLM might initially get a math problem wrong but correct its answer after reviewing its own output and reasoning.

Several studies have observed this process, also known as “self-critique,” “self-refine,” or “self-improve.”

However, the effectiveness of self-correction is not universal across tasks. The paper from DeepMind and the University of Illinois reveals that the success of self-correction is largely contingent on the nature of the task at hand. In reasoning tasks, self-correction techniques typically succeed only when they can leverage external sources, such as human feedback, an external tool like a calculator or code executor, or a knowledge base.
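To make the distinction concrete, here is a minimal sketch of what external feedback from a tool might look like: a toy "calculator"-style executor that evaluates a model-written arithmetic expression and reports the result back. The function name and setup are illustrative, not from the paper; the point is that intrinsic self-correction has no access to this kind of signal.

```python
# Toy example of external feedback from a "calculator"-style tool (illustrative,
# not from the paper). The model's answer is assumed to arrive as a bare Python
# arithmetic expression, e.g. "3 * (12 + 5)".
def calculator_feedback(model_expression: str) -> str:
    try:
        value = eval(model_expression, {"__builtins__": {}})  # run the arithmetic
        return f"The expression evaluates to {value}."
    except Exception as exc:
        return f"The expression could not be evaluated: {exc}"
```

The resulting message could be fed back to the model as corrective feedback; intrinsic self-correction, by contrast, relies only on the model's own critique.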

The researchers underscore the fact that high-quality feedback is not always accessible in many applications. This makes it crucial to understand the inherent capabilities of LLMs and to discern how much of the self-correction can be attributed to the model’s internal knowledge. They introduce the concept of “intrinsic self-correction,” which refers to a scenario where the model attempts to correct its initial responses based solely on its built-in capabilities, without any external feedback. 

Different LLM self-correction techniques (source: GitHub)

Testing self-correction on reasoning tasks

The researchers put self-correction to the test on several benchmarks that measure model performance in solving math word problems, answering multiple-choice questions, and tackling question-answering problems that require reasoning. They employ a three-step process for self-correction: first, they prompt the model for an answer; next, they prompt it to review its previous response; finally, they prompt it a third time to answer the original question based on its self-generated feedback.
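As a rough sketch of this loop, assuming a hypothetical call_llm() helper that wraps whatever chat-completion API is in use, and with illustrative prompt wording rather than the paper's exact prompts:

```python
def call_llm(messages: list[dict]) -> str:
    """Placeholder: swap in a real chat-completion call (OpenAI, Vertex AI, etc.).
    Takes a list of {"role": ..., "content": ...} dicts and returns the reply text."""
    raise NotImplementedError

def initial_answer(question: str) -> str:
    # Step 1: prompt the model for an answer.
    return call_llm([{"role": "user", "content": question}])

def revise(question: str, answer: str) -> str:
    # Step 2: ask the model to review its previous response (no external feedback).
    history = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Review your previous answer and find any problems with it."},
    ]
    critique = call_llm(history)
    # Step 3: answer the original question again, based on the self-generated feedback.
    history += [
        {"role": "assistant", "content": critique},
        {"role": "user", "content": "Based on the problems you found, answer the original question again."},
    ]
    return call_llm(history)

def intrinsic_self_correction(question: str) -> str:
    return revise(question, initial_answer(question))
```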

Their findings reveal that self-correction works effectively when the models have access to the ground-truth labels included in the benchmark datasets. This is because the algorithm can accurately determine when to halt the reasoning process and avoid changing the answer when it is already correct. As the researchers state, “These results use ground-truth labels to prevent the model from altering a correct answer to an incorrect one. However, determining how to prevent such mischanges is, in fact, the key to ensuring the success of self-correction.”

However, this assumption does not reflect real-world scenarios, where access to the ground truth is not always available. If the ground truth were readily accessible, there would be no need to employ a machine learning model to predict it. The researchers demonstrate that when they remove the labels from the self-correction process, the performance of the models begins to decline significantly.
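The difference can be sketched as two loops that reuse the hypothetical initial_answer() and revise() helpers from the snippet above. The oracle version halts as soon as the answer matches the benchmark label, so a correct answer is never touched; the label-free version revises unconditionally and can overwrite a correct answer.

```python
def self_correct_with_oracle(question: str, label: str, rounds: int = 2) -> str:
    answer = initial_answer(question)
    for _ in range(rounds):
        if answer.strip() == label:   # ground-truth check: only possible on benchmarks
            break                     # halt and never revise a correct answer
        answer = revise(question, answer)
    return answer

def self_correct_without_oracle(question: str, rounds: int = 2) -> str:
    answer = initial_answer(question)
    for _ in range(rounds):
        answer = revise(question, answer)  # revises unconditionally, right or wrong
    return answer
```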

Interestingly, the models often produce the correct answer initially, but switch to an incorrect response after self-correction. For instance, in GPT-3.5-Turbo (the model used in the free version of ChatGPT), the performance dropped by almost half on the CommonSenseQA question-answering dataset when self-correction was applied. GPT-4 also exhibited a performance drop, albeit by a smaller margin.

In many cases, intrinsic self-correction causes models to switch from the right answer to the wrong answer

According to the researchers, if the model is well-aligned and paired with a thoughtfully designed initial prompt, “the initial response should already be optimal given the conditions of the prompt and the specific decoding algorithm.” In this case, introducing feedback can be viewed as adding an additional prompt, potentially skewing the model’s response away from an optimal answer to the initial prompt. “In an intrinsic self-correction setting, on the reasoning tasks, this supplementary prompt may not offer any extra advantage for answering the question. In fact, it might even bias the model away from producing an optimal response to the initial prompt, resulting in a decrease in performance,” the researchers write.

Self-correction is also prevalent in multi-agent LLM applications. In these scenarios, multiple instances of an LLM, such as ChatGPT, are given different instructions to perform distinct roles in a multi-sided debate. For instance, one agent might be tasked with generating code, while another is instructed to review the code for errors.

In these applications, self-correction is implemented by instructing agents to critique each other’s responses. However, the researchers found that this multi-agent critique does not lead to any form of improvement through debate. Instead, it results in a form of “self-consistency,” where the different agents generate multiple responses and then engage in a form of majority voting to select an answer. 

“Rather than labeling the multi-agent debate as a form of ‘debate’ or ‘critique,’ it is more appropriate to perceive it as a means to achieve ‘consistency’ across multiple model generations,” the researchers write.
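In other words, the mechanism the debate setup ends up approximating looks roughly like plain self-consistency: sample several answers and take a majority vote. A minimal sketch, again reusing the hypothetical call_llm() helper from above (sampling would use a nonzero temperature in practice):

```python
from collections import Counter

def self_consistency(question: str, n_samples: int = 5) -> str:
    # Sample several independent answers to the same question...
    answers = [call_llm([{"role": "user", "content": question}])
               for _ in range(n_samples)]
    # ...and pick the most common one. The paper argues inter-agent "critique"
    # effectively plays this majority-voting role rather than genuine debate.
    return Counter(answers).most_common(1)[0][0]
```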

Post-hoc vs pre-hoc prompting

While self-correction may not enhance reasoning, the researchers found that it can be effective in tasks such as modifying the style of the LLM’s output or making the response safer. They refer to this setting as “post-hoc prompting,” where the prompt is applied after the responses have been generated. They write, “Scenarios in which self-correction enhances model responses occur when it can provide valuable instruction or feedback that pre-hoc prompting cannot.”

Another key finding of the paper is that the improvement attributed to self-correction in certain tasks may be due to an inadequately crafted initial instruction that is outperformed by a carefully constructed feedback prompt. In such cases, incorporating the feedback into the initial instruction, referred to as the “pre-hoc prompt,” can yield better results and reduce inference costs. The researchers state, “It is meaningless to employ a well-crafted post-hoc prompt to guide the model in ‘self-correcting’ a response generated through a poorly constructed pre-hoc prompt. For a fair comparison, equal effort should be invested in both pre-hoc and post-hoc prompting.”
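The comparison the researchers call for can be sketched as follows, again with the hypothetical call_llm() helper and illustrative prompt text: the post-hoc version spends a second call applying carefully written feedback, while the pre-hoc version folds the same guidance into the initial instruction.

```python
GUIDANCE = "Think step by step and double-check your arithmetic."  # illustrative feedback text

def post_hoc(question: str) -> str:
    # Generate first, then apply the carefully crafted feedback as a second prompt.
    first = call_llm([{"role": "user", "content": question}])
    return call_llm([
        {"role": "user", "content": question},
        {"role": "assistant", "content": first},
        {"role": "user", "content": GUIDANCE + " Revise your answer if needed."},
    ])

def pre_hoc(question: str) -> str:
    # Fold the same guidance into the initial instruction: one call instead of two.
    return call_llm([{"role": "user", "content": GUIDANCE + "\n" + question}])
```

If the pre-hoc version matches or beats the post-hoc one on a task, the apparent gain from “self-correction” was really a gain from better prompting.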

The researchers conclude by urging the community to approach the concept of self-correction with skepticism and to apply it judiciously. 

“It is imperative for researchers and practitioners to approach the concept of self-correction with a discerning perspective, acknowledging its potential and recognizing its boundaries,” the researchers write. “By doing so, we can better equip this technique to address the limitations of LLMs, steering their evolution towards enhanced accuracy and reliability.”
