This article is part of our coverage of the latest in AI research.
Following the remarkable success of models like ChatGPT and GPT-4, the spotlight has turned to reinforcement learning from human feedback (RLHF), the technique that enables these large language models (LLMs) to better align with human instructions, intents, and values.
Yet, amidst the excitement, there has been little discussion on the limitations of RLHF. A new paper by researchers from various academic institutions delves into the challenges of RLHF. The authors also propose potential solutions to mitigate these shortcomings and help create more robust and reliable artificial intelligence systems.
What is RLHF?
RLHF is a method that uses human guidance to fine-tune a pre-trained LLM. It is composed of three interconnected processes: feedback collection, reward modeling, and policy optimization.
The feedback collection step gathers human evaluations of LLM outputs. This feedback data is then used to train a reward model through supervised learning. The reward model is meant to emulate human preferences. Subsequently, the policy optimization step uses a reinforcement learning loop to optimize the LLM to produce outputs that receive favorable evaluations from the reward model. These steps are executed iteratively or concurrently.
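The three steps can be sketched in a few lines of Python. This is a toy illustration rather than how any lab trains an LLM: the three canned responses, the hidden quality scores that stand in for human judgment, the Bradley-Terry preference model, and all function names are assumptions made for the example.

```python
import math
import random

# Hypothetical toy setup: three canned responses to a single prompt.
RESPONSES = ["helpful answer", "vague answer", "rude answer"]

def collect_feedback(n_pairs, rng):
    """Step 1: simulate annotators picking the better of two responses."""
    # Hidden scores standing in for human judgment (an assumption of this sketch).
    true_quality = {"helpful answer": 2.0, "vague answer": 0.5, "rude answer": -1.5}
    data = []
    for _ in range(n_pairs):
        a, b = rng.sample(RESPONSES, 2)
        p_a = 1.0 / (1.0 + math.exp(true_quality[b] - true_quality[a]))
        data.append((a, b) if rng.random() < p_a else (b, a))  # (winner, loser)
    return data

def train_reward_model(data, epochs=500, lr=0.5):
    """Step 2: fit one scalar reward per response by gradient ascent
    on the Bradley-Terry log-likelihood of the observed comparisons."""
    reward = {r: 0.0 for r in RESPONSES}
    for _ in range(epochs):
        grad = {r: 0.0 for r in RESPONSES}
        for winner, loser in data:
            p = 1.0 / (1.0 + math.exp(reward[loser] - reward[winner]))
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for r in RESPONSES:
            reward[r] += lr * grad[r] / len(data)
    return reward

def optimize_policy(reward, steps=500, lr=0.1):
    """Step 3: raise the expected reward of a softmax policy over responses."""
    logits = {r: 0.0 for r in RESPONSES}
    for _ in range(steps):
        z = sum(math.exp(v) for v in logits.values())
        probs = {r: math.exp(v) / z for r, v in logits.items()}
        expected = sum(probs[r] * reward[r] for r in RESPONSES)
        for r in RESPONSES:
            # Exact gradient of the expected reward with respect to each logit.
            logits[r] += lr * probs[r] * (reward[r] - expected)
    z = sum(math.exp(v) for v in logits.values())
    return {r: math.exp(v) / z for r, v in logits.items()}

rng = random.Random(0)
feedback = collect_feedback(400, rng)
reward_model = train_reward_model(feedback)
policy = optimize_policy(reward_model)
best = max(policy, key=policy.get)
```

In a real system the reward model and the policy are both neural networks and the policy update is typically a more elaborate RL algorithm, but the flow of information from comparisons to rewards to policy is the same.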
RLHF has emerged as the primary strategy to make LLMs safe and aligned with human objectives. AI labs use it in their commercial and open-source LLMs such as ChatGPT and LLaMA 2.
As the authors of the new paper note, “RLHF enables humans to communicate goals without hand-specifying a reward function. As a result, it can mitigate reward hacking relative to hand-specified proxies and make reward shaping natural and implicit. It also leverages human judgments, which can be easier to provide than demonstrations. These advantages have made RLHF useful for helping policies learn intricate solutions in control environments and for fine-tuning LLMs.”
However, despite its effectiveness, RLHF is not without challenges. Models trained using RLHF can suffer from issues such as hallucinations and bias. They are also susceptible to adversarial attacks, including unusual jailbreaks that can trick them into bypassing their safeguards. In essence, while RLHF-trained models perform exceptionally well, they can also make unpredictable errors that would not typically be expected from human behavior.
In their paper, the researchers provide a comprehensive overview of the challenges associated with the three main components of RLHF (human feedback, reward model, and policy optimization).
They distinguish between problems that can be addressed through modifications of the RLHF process and challenges that necessitate fundamentally different approaches.
“The key distinction between the two is that fundamental challenges are substantial enough that overcoming them would require a method that is no longer a form of RLHF. As a result, these fundamental challenges must be either avoided by not using RLHF or compensated for by other safety measures,” the researchers write.
Challenges with human feedback in RLHF
The human component is a critical aspect of RLHF, but it also presents a unique set of challenges. The subjectivity of human goals, intents, and preferences can lead to inconsistencies and ambiguities. The researchers note, “Humans can pursue harmful goals, either innocently or maliciously.”
A significant challenge lies in selecting a representative sample of annotators who can provide quality feedback. This task is far from straightforward, as the personal opinions of individual annotators can negatively influence the model. Furthermore, there is the threat of data poisoning, where a human annotator deliberately provides incorrect feedback signals, steering the reward model to favor undesirable behavior.
Human annotators are also susceptible to a variety of cognitive traps. Fatigue, attention decay, false memories, and common misconceptions can all compromise the quality of feedback, especially when sessions become long. Additionally, the artificial interactions generated during the data collection process may not accurately reflect the LLM’s deployment environment, leading to a disconnect between the training and real-world environments.
Another concern arises when the knowledge workers hired to provide feedback outsource their tasks to chatbots. The resulting feedback may no longer reflect human judgment, further complicating the RLHF process.
Some of these challenges can be addressed through adjustments to RLHF. However, some problems extend beyond the scope of RLHF.
The researchers highlight that humans often struggle to evaluate performance on complex tasks accurately. They write, “Even given perfect information and extended time, humans can still provide poor feedback when examples are hard to evaluate.” Tasks such as assessing text summarizations made by LLMs or vetting code generated by LLMs for security vulnerabilities exemplify such challenges.
The researchers also caution that RLHF can mislead humans during the annotation process. “In particular, language models trained with RLHF can sound confident even when they are incorrect, which can lead humans to provide more positive feedback,” they write. “Misleading behavior will actively be incentivized by RLHF when humans can be tricked into mistakenly providing positive feedback.”
Finally, the researchers discuss the tradeoffs inherent in defining the feedback signal. The current prevalent method involves users ranking two or more examples. While this approach is simple, fast, and cost-effective, it omits crucial information, such as the intensity of preference or the correct answer. These shortcomings could be addressed by providing more complex feedback instructions to annotators, such as assigning a scalar value to each LLM output or writing their own correct output to the prompt. However, these additional steps could slow the feedback collection process and significantly increase costs.
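A small example shows what a ranking throws away. The 1-to-10 ratings and the helper name below are hypothetical; the point is only that two very different situations collapse into the same comparison label.

```python
def to_pairwise(score_a, score_b):
    """Collapse two scalar ratings into the label a ranking task records."""
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"

# Hypothetical 1-10 ratings an annotator might give if asked for scalar feedback.
narrow_gap = {"A": 8, "B": 7}  # A is only slightly better
wide_gap = {"A": 9, "B": 1}    # A is far better

label_narrow = to_pairwise(narrow_gap["A"], narrow_gap["B"])
label_wide = to_pairwise(wide_gap["A"], wide_gap["B"])
# Both labels are "A": the intensity of the preference is gone.
```

Collecting the scalar ratings directly preserves the gap, at the cost of asking more from each annotator.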
Challenges with the RLHF reward model
Modeling human preferences is hard. They are fluid, context-dependent, and complex, which makes it difficult to encapsulate them within a loss function or numeric values.
“Humans possess a range of intricate and context-dependent preferences that evolve over time and are difficult to model accurately. Models of human goals based on incorrect assumptions about human decision-making can impair reward inference,” the researchers write.
Moreover, the researchers highlight that most RLHF work does not consider the personality and context-dependence of human preferences. Previous research demonstrates that a mixture of reward functions cannot be identified from binary preferences without additional context. This lack of context can lead to a misrepresentation of human goals and preferences, thereby affecting the accuracy of the reward model.
Another challenge arises from the diversity of human preferences and capabilities. Annotators often have differing opinions, making it difficult to create a single reward model that can generalize across many different groups. The researchers write, “Attempting to condense feedback from a variety of humans into a single reward model without taking these differences into account is… a fundamentally misspecified problem.”
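To make the problem concrete, consider a hypothetical pool of 100 annotators in which 60 always prefer response A and 40 always prefer response B. The sketch below assumes the reward model is a Bradley-Terry model fitted to the pooled labels, in which case the fitted reward gap has a closed form; the group sizes and function name are illustrative.

```python
import math

def pooled_reward_gap(group_sizes, group_pref_for_a):
    """Return (r(A) - r(B), pooled P(A preferred)) for a Bradley-Terry
    reward model fitted to all annotators' labels at once."""
    total = sum(group_sizes)
    p_a = sum(n * p for n, p in zip(group_sizes, group_pref_for_a)) / total
    # With two items, the maximum-likelihood gap is the logit of the win rate.
    return math.log(p_a / (1.0 - p_a)), p_a

# Hypothetical groups: 60 annotators always pick A, 40 always pick B.
gap, p_a = pooled_reward_gap([60, 40], [1.0, 0.0])

# A policy that maximizes this single reward picks A every time,
# so the 40 percent who prefer B are never served.
always_chosen = "A" if gap > 0 else "B"
```

The pooled model reports a mild preference for A (a win rate of 0.6, a gap of about 0.405), which describes neither group: each of them in fact holds a strong preference.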
RLHF models are also susceptible to “reward hacking.” Because the reward model is only an imperfect proxy for human preferences, the policy can find shortcuts that score highly under the reward model without exhibiting the behavior humans actually want. This issue can lead to models that perform well during training but fail to deliver in real-world scenarios.
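The following toy example illustrates the idea. The flawed reward model here is invented for the illustration: it has learned nothing except that confident wording tends to be rated well, and the candidate texts are hypothetical.

```python
CONFIDENT_WORDS = {"definitely", "certainly", "clearly"}

def proxy_reward(text):
    """Hypothetical flawed reward model: it only counts confident words."""
    words = text.lower().replace(",", "").replace(".", "").split()
    return sum(w in CONFIDENT_WORDS for w in words)

candidates = [
    {"text": "The capital of Australia is Canberra.", "correct": True},
    {"text": "It is definitely, certainly, clearly Sydney.", "correct": False},
]

# A policy that maximizes the proxy picks the confident but wrong answer.
chosen = max(candidates, key=lambda c: proxy_reward(c["text"]))
```

The proxy gives the correct answer a score of 0 and the wrong one a score of 3, so optimizing against it pushes the policy toward the wrong answer.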
Finally, evaluating the reward model presents its own set of challenges. The ambiguity of the reward signal makes it difficult to assess the model’s performance. Since human preferences cannot be defined deterministically, a deep learning system is used to model them, creating a black box that is nearly inscrutable. This black box requires evaluation through the reinforcement learning policy, adding another layer of complexity to the process.
Challenges with the RLHF policy
One of the most significant issues with the reinforcement learning component of RLHF is its susceptibility to adversarial attacks. These attacks are particularly problematic because they can be applied even to black-box models such as ChatGPT and GPT-4. The researchers highlight this vulnerability, stating, “Even when learned policies are trained with a perfect reward signal, perform well at the task they are trained for, and generalize to a wide range of scenarios, they can still perform poorly in adversarial situations.” This is a critical concern, especially considering that models deployed in real-world scenarios can be adversarially attacked by humans or other AI systems.
Another challenge lies in the influence of the training data on the RLHF process. The researchers caution that the biases present in the training dataset can inadvertently shape the RLHF process. They write, “For example, if sounding confident and producing correct answers are correlated in the base model, the reward model will learn that sounding confident is good and reinforce this in the policy.”
Lastly, RL fine-tuning can lead to “mode collapse,” where the model’s willingness to produce rare and improbable answers diminishes over time, reducing the creativity and diversity of its outputs. The researchers note, “RL incentivizes the policy to output high-scoring completions with high probability, rather than with a probability in line with a training distribution.”
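The effect can be seen in a small numerical sketch. It assumes a common formulation that the article does not spell out: the policy maximizes reward minus a KL penalty (with weight beta) that keeps it close to the base model, in which case the optimal policy reweights the base distribution by exp(reward / beta). The four-answer distribution and reward values are made up.

```python
import math

def tilted_policy(base_probs, rewards, beta):
    """Optimal policy for reward maximization with a KL penalty of weight beta:
    pi(y) is proportional to base(y) * exp(reward(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(probs):
    """Shannon entropy in nats, used here as a measure of output diversity."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical base-model distribution over four answers, with their rewards.
base = [0.4, 0.3, 0.2, 0.1]
rewards = [1.0, 0.9, 0.8, 0.2]

# Smaller beta means stronger optimization pressure toward high reward.
betas = (10.0, 1.0, 0.1, 0.01)
policies = [tilted_policy(base, rewards, b) for b in betas]
entropies = [entropy(p) for p in policies]
```

As beta shrinks, the entropy falls and nearly all probability ends up on the single highest-scoring answer, even though the second and third answers score almost as well.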
Addressing the challenges of RLHF
In their paper, the researchers propose several measures to mitigate the risks associated with RLHF.
One interesting solution is to optimize the use of human feedback resources. For example, designers of LLM systems can use their budget to generate fewer but more fine-grained feedback examples. Then, they can use the long-form human feedback to train AI tools to automate feedback generation at scale. This approach can compensate for the low volume of human data while retaining the richer signal that fine-grained feedback provides.
The researchers also advocate for the use of reward models with constraints that account for multi-modal distributions. This approach deviates from the traditional method of optimizing for a unimodal majority preference. They also suggest that an ensemble of reward models can help maintain the diversity of the LLM’s output.
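The article does not say how such an ensemble would be used, so the following is only one simple possibility: score each response with several reward models and keep both the average and the disagreement between them, instead of a single number. The three reward functions are deliberately crude, hypothetical stand-ins for models trained on different groups of annotators.

```python
from statistics import mean, pstdev

# Hypothetical reward models, each reflecting a different group's taste.
def rm_brevity(text):
    return -len(text.split()) / 10

def rm_politeness(text):
    return 1.0 if "please" in text.lower() else 0.0

def rm_detail(text):
    return len(text.split()) / 10

def ensemble_score(response, reward_models):
    """Return (mean reward, disagreement) across the ensemble members."""
    scores = [rm(response) for rm in reward_models]
    return mean(scores), pstdev(scores)

ensemble = [rm_brevity, rm_politeness, rm_detail]
avg, spread = ensemble_score("Please restart the router.", ensemble)
```

A large spread signals that the groups disagree about a response, which is information a single majority-preference reward model would average away.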
Another significant recommendation is to focus on the self-supervised pre-training phase of LLMs. The researchers suggest using human feedback to train a model that filters or annotates the pre-training data. They argue that this can simplify the process of “aligning models by having them exhibit desirable behaviors from the outset rather than having them learn undesirable behavior and then attempt to unlearn it during finetuning.”
However, the researchers caution that RLHF alone can pose risks to the development of safe AI. They assert that while RLHF is useful, it does not solve the fundamental challenges of developing human-aligned AI. “No single strategy should be treated as a comprehensive solution,” they write. Instead, they propose a multi-pronged approach that includes various safety measures to compensate for each other’s failures.
For instance, machine learning engineers can use anomaly detection techniques to flag abnormal inputs that can trigger bad behavior. Additionally, AI explainability and interpretability can be leveraged to verify hypotheses about how models make decisions, including whether the decision-making process is trustworthy.
The researchers also emphasize the need for transparency in the RLHF process and data. They argue that transparency would improve the AI safety community’s understanding of RLHF and support the ability to track technical progress on its challenges. AI labs need to publish far more detail about their models, including the training data, the annotation process, the instructions given to annotators, and how the RLHF annotators are recruited.
“RLHF has clear advantages for aligning AI systems with human goals. As a result, it has been key to the development of state-of-the-art LLMs and will likely continue to play a major role in modern AI,” the researchers write. “However, its use and influence should be accompanied by a commensurate research effort to better understand RLHF and address its flaws.”