How language models can teach themselves to follow instructions

Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

There is growing interest in techniques that enable large language models (LLMs) to improve their capabilities with little or no human intervention. One of the areas where LLMs can self-improve is instruction fine-tuning (IFT), in which the model is taught to follow human instructions.

IFT is one of the main reasons models such as ChatGPT and Claude have become so successful. However, it is a complicated process that requires considerable time and human labor. A new technique called “self-rewarding language models,” introduced in a paper by Meta and New York University, provides a recipe that enables a pre-trained language model to create and evaluate the examples it uses to train itself for instruction fine-tuning.

The advantage of this method is that it continues to improve the model when applied multiple times. Self-rewarding language models not only improve their instruction-following capabilities but also become better at reward modeling.

Self-rewarding language models

The common way to fine-tune LLMs for instruction-following is reinforcement learning from human feedback (RLHF). 

In RLHF, the language model learns to optimize its responses based on the feedback it receives from a reward model. The reward model is trained based on feedback from human annotators, which helps to align the model’s responses with human preferences. RLHF consists of three phases: pre-training the LLM, creating a reward model trained on human-ranked outputs, and a reinforcement learning loop where the LLM is fine-tuned based on the reward model’s scores to generate high-quality text aligned with human judgments.
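
To make the three phases concrete, here is a deliberately simplified Python sketch that uses stand-in objects instead of real models. The names (ToyLLM, reward_model, rlhf_step) and the random reward are placeholders rather than part of any actual RLHF library, and the policy update itself is only indicated in a comment.

```python
import random

# Phase 1: start from a pre-trained language model (a stand-in object here).
class ToyLLM:
    def generate(self, prompt: str) -> str:
        # A real model would decode tokens; we return a placeholder string.
        return f"response to: {prompt} (variant {random.randint(0, 9)})"

# Phase 2: a reward model trained on human-ranked outputs. Here it is faked
# with a random score; in practice it is a separate network fine-tuned on
# comparison data collected from human annotators.
def reward_model(prompt: str, response: str) -> float:
    return random.uniform(0.0, 1.0)  # stand-in for a learned scalar reward

# Phase 3: the reinforcement-learning loop. A real implementation would use
# an algorithm such as PPO to update the policy; this only shows the data flow.
def rlhf_step(llm: ToyLLM, prompts: list[str]) -> None:
    for prompt in prompts:
        response = llm.generate(prompt)
        reward = reward_model(prompt, response)
        # policy_update(llm, prompt, response, reward)  # e.g., a PPO step
        print(f"{prompt!r} -> reward {reward:.2f}")

rlhf_step(ToyLLM(), ["Summarize RLHF in one sentence."])
```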

Reinforcement learning from human feedback (RLHF) (source: arXiv)

An alternative is direct preference optimization (DPO), in which the model produces several answers and humans directly indicate which one they prefer. The model is then trained on these preferences, with no need to create a separate reward model.
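
For reference, the heart of DPO is a single loss computed over preference pairs. The sketch below implements that loss in plain Python for one pair, assuming you already have summed log-probabilities of the chosen and rejected responses under both the model being trained and a frozen reference model; there is no training loop here.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Direct preference optimization loss for one preference pair.

    Inputs are summed log-probabilities of the preferred (chosen) and
    dispreferred (rejected) responses under the model being trained and
    under a frozen reference model.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)), written in a numerically stable form.
    return math.log1p(math.exp(-logits))

# Example: the model already prefers the chosen response, so the loss is small.
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-20.0,
               ref_logp_chosen=-14.0, ref_logp_rejected=-15.0))
```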

While these techniques have proven to be effective, they are both limited by the size and quality of the human preference data. RLHF has the added limitation that once trained, the reward model is frozen and its quality does not change throughout the fine-tuning of the main LLM.

The idea of self-rewarding language models (SRLM) is to create a training algorithm that overcomes these limitations. “The key to such an approach is to develop an agent that possesses all the abilities desired during training, rather than separating them out into distinct models such as a reward model and a language model,” the researchers write in their paper.

SRLM has two main capabilities. First, it can provide helpful and harmless responses to instructions from users. Second, it can create and evaluate examples of instructions and candidate responses.

This enables it to iteratively train itself on AI Feedback (AIF) and gradually improve itself by creating and training on its own data.

In each iteration, the model becomes better at following instructions. Accordingly, it also improves at creating examples for its next round of training.

How SRLM works

Self-rewarding language models (SRLM) create their own training examples and evaluate them (source: arXiv)

Self-rewarding language models start with a foundational LLM trained on a large corpus of text. The model is then fine-tuned on a small seed set of human-annotated examples. The seed data includes instruction fine-tuning (IFT) examples, each consisting of an instruction and its response.

To improve the results, the seed data can also include evaluation fine-tuning (EFT) examples. In EFT, the model is given an instruction and a set of candidate responses, and it must rank the responses based on how well they address the prompt. Each evaluation consists of a reasoning description followed by a final score. These examples train the LLM to perform the role of the reward model.
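
As a rough illustration, the two kinds of seed examples might look like the following. This schema is invented for clarity and is not the exact format used in the paper.

```python
# Instruction fine-tuning (IFT) seed example: an instruction paired with a
# human-written response.
ift_example = {
    "instruction": "Explain what a reward model is in two sentences.",
    "response": (
        "A reward model scores the outputs of a language model. It is trained "
        "on human preference data so that its scores reflect which outputs "
        "people consider better."
    ),
}

# Evaluation fine-tuning (EFT) seed example: an instruction, candidate
# responses, and for each candidate an evaluation made up of reasoning
# followed by a final score.
eft_example = {
    "instruction": "Explain what a reward model is in two sentences.",
    "candidates": [
        {
            "response": "It is a model.",
            "reasoning": "On topic but far too vague to be useful.",
            "score": 1,  # e.g., on a 0-5 scale
        },
        {
            "response": "A reward model assigns a quality score to a model's "
                        "output, learned from human preference judgments.",
            "reasoning": "Accurate, concise, and directly answers the request.",
            "score": 5,
        },
    ],
}
```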

Once trained on the initial dataset, the model can generate data for its next training iterations. In this stage, the model samples examples from the original IFT dataset and generates a new instruction prompt. It then generates several candidate responses for the newly created prompt. 
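
Here is a minimal sketch of that self-instruction step, continuing the illustrative schema above and assuming a `generate` callable that wraps the model's sampling API; the prompt wording and helper names are made up for illustration.

```python
import random

def build_new_prompt(generate, ift_examples: list[dict], num_shots: int = 4) -> str:
    """Ask the model to invent a new instruction, few-shot style.

    `generate` is assumed to be a callable that takes a prompt string and
    returns sampled text; `ift_examples` is the seed instruction dataset.
    """
    shots = random.sample(ift_examples, min(num_shots, len(ift_examples)))
    prompt = "Come up with a new task in the style of the examples below.\n\n"
    for i, example in enumerate(shots, 1):
        prompt += f"Task {i}: {example['instruction']}\n"
    prompt += f"Task {len(shots) + 1}:"
    return generate(prompt).strip()

def sample_candidates(generate, instruction: str, n: int = 4) -> list[str]:
    """Sample several candidate responses for the newly created instruction."""
    return [generate(instruction) for _ in range(n)]
```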

Finally, the model uses LLM-as-a-Judge to evaluate the responses. LLM-as-a-Judge requires a special prompt that includes the original request, candidate responses, and instructions on evaluating the responses.

LLM-as-a-Judge prompt (source: arXiv)
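
The sketch below shows what such a judge call might look like in code. The prompt text is a simplified paraphrase rather than the exact prompt from the paper, and `generate` is again an assumed wrapper around the model's sampling API.

```python
import re

JUDGE_TEMPLATE = """Review the user's request and the candidate response below.
Explain your reasoning, then rate the response on a scale of 0 to 5, writing
the rating on its own final line as "Score: <number>".

Request:
{instruction}

Response:
{response}
"""

def judge(generate, instruction: str, response: str) -> float:
    """Use the model itself to score one candidate response."""
    verdict = generate(JUDGE_TEMPLATE.format(instruction=instruction,
                                             response=response))
    # Pull the numeric rating out of the model's free-form evaluation.
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", verdict)
    return float(match.group(1)) if match else 0.0
```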

Once the model has created instruction examples and ranked the candidate responses, SRLM uses them to assemble an AI Feedback Training (AIFT) dataset. There are two ways to do this. You can combine the instructions with the responses and their ranking scores to create a preference dataset, which can be used with direct preference optimization (DPO) to teach the model to tell the difference between good and bad responses. Alternatively, you can create a supervised fine-tuning (SFT) dataset that contains only the highest-ranking responses. The researchers found that including the ranking data improved the performance of the trained model.
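
Here is one way the scored candidates could be turned into those two kinds of training data. Pairing the highest- and lowest-scoring candidates is one common way to form preference pairs, which is what this sketch assumes; the field names are illustrative.

```python
def build_preference_pair(instruction: str, scored: list[tuple[str, float]]):
    """Option 1: keep the best and worst candidates as a DPO preference pair."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    (chosen, best_score), (rejected, worst_score) = ranked[0], ranked[-1]
    if best_score == worst_score:
        return None  # no usable preference signal if all scores are tied
    return {"instruction": instruction, "chosen": chosen, "rejected": rejected}

def build_sft_example(instruction: str, scored: list[tuple[str, float]]):
    """Option 2: keep only the highest-scoring response for supervised fine-tuning."""
    best_response, _ = max(scored, key=lambda pair: pair[1])
    return {"instruction": instruction, "response": best_response}
```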

Once the newly created examples are added to the original dataset, the model can be trained again. This process is repeated several times, with each cycle producing a model that is better at both following instructions and evaluating responses.
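
Putting the pieces together, one self-rewarding iteration might look like the loop below, built from the illustrative helpers sketched earlier; `fine_tune` is a stand-in for whatever DPO or SFT trainer is used.

```python
def self_rewarding_iteration(model, ift_examples, fine_tune, num_new_prompts=1000):
    """One round of self-rewarding training: generate, judge, retrain."""
    preference_data = []
    for _ in range(num_new_prompts):
        # The model writes a new instruction and answers it several times...
        instruction = build_new_prompt(model.generate, ift_examples)
        candidates = sample_candidates(model.generate, instruction)
        # ...then acts as its own judge to score the candidates.
        scored = [(c, judge(model.generate, instruction, c)) for c in candidates]
        pair = build_preference_pair(instruction, scored)
        if pair is not None:
            preference_data.append(pair)
    # fine_tune stands in for a DPO (or SFT) training step on the new data; the
    # returned model serves as both generator and judge in the next iteration.
    return fine_tune(model, preference_data)
```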

“Importantly, because the model can both improve its generation ability, and it acts as its own reward model through the same generation mechanism, this means the reward model itself can improve through these iterations, deviating from standard practices where the reward model is fixed,” the researchers write. “We believe this can increase the ceiling of the potential for self-improvement of these learning models going forward, removing a constraining bottleneck.”

Experimenting with SRLM

The researchers tested self-rewarding language models with Llama-2-70B as the base model. As the seed data for instruction fine-tuning, they used the Open Assistant dataset, which contains thousands of instruction fine-tuning examples. Open Assistant also includes instructions with multiple ranked responses, which can be used for evaluation fine-tuning (EFT).

Their experiments show that every iteration of self-rewarding language modeling improves the LLM’s instruction-following abilities. Moreover, the LLM becomes better at reward modeling, which in turn enables it to create better training examples for the next iteration. Their tests on the AlpacaEval benchmark show that Llama-2 with three iterations of SRLM outperformed Claude 2, Gemini Pro, and GPT-4-0613.

There are limitations to this approach. Like other techniques that allow LLMs to self-improve, SRLM can fall into a “reward hacking” trap, where the model starts to optimize its responses for the desired output but for the wrong reasons. Reward hacking can lead to unstable models that perform poorly in real-world applications and in situations that differ from their training examples. It is also not clear how far this process scales with model size and the number of iterations.

But SRLM has the clear advantage of getting more out of your training data. If you already have a dataset of annotated training examples, you can use SRLM to boost the abilities of your LLM without adding more examples to your dataset.

“We believe this is an exciting avenue of research because this means the model is better able to assign rewards in future iterations for improving instruction following – a kind of virtuous circle,” the researchers write. “While this improvement likely saturates in realistic scenarios, it still allows for the possibility of continual improvement beyond the human preferences that are typically used to build reward models and instruction following models today.”
