This article is part of our coverage of the latest in AI research.
Nvidia’s researchers have unveiled a new technique that could change how we align large language models (LLMs) with user instructions. Called SteerLM, the technique aims to overcome the limitations of reinforcement learning from human feedback (RLHF), the conventional method used to align LLMs.
SteerLM is reportedly more efficient than RLHF and gives users more control over the LLM’s behavior. Unlike RLHF, which conditions responses on a single reward, SteerLM conditions on multiple attributes, promising more nuanced behavior that is better aligned with what each user wants.
How LLM alignment works
LLMs like ChatGPT have gained popularity due to their proficiency in following user instructions. The prevalent method to fine-tune these models involves a two-stage process. Initially, the pre-trained model undergoes supervised fine-tuning (SFT), where it is trained on human-annotated examples of instructions and responses. SFT enables the model to better align its responses with user instructions. Subsequently, the model undergoes reinforcement learning from human feedback (RLHF), where a reward model trained on human preferences further refines the LLM to align more closely with user goals.
However, in their paper, the Nvidia researchers point out several limitations in this current alignment paradigm. First, SFT does not enable the model to differentiate between high- and low-quality responses, which can lead to poor user experience.
Second, the RLHF process is inherently complex, which can limit its widespread application among organizations and companies.
Finally, RLHF operates by guiding the model through a single-dimensional reward, neglecting the multifaceted nature of human preferences, such as utility, safety, and more. This approach makes it difficult for users to fine-tune different aspects of the model’s behavior at inference time. The researchers believe that these shortcomings can be addressed by their new technique, SteerLM.
What is SteerLM?
Nvidia describes SteerLM as “a novel approach to model alignment through SFT that overcomes the limitations associated with conventional SFT and RLHF methods.” SteerLM trains the model to learn both the quality of responses and a comprehensive range of human preferences.
Unlike RLHF, SteerLM employs an offline and scalable process that generates and labels its own training data. This approach reduces complexity, making SteerLM accessible to a broader range of organizations and users. As the researchers explain, “We train the generation of responses to be conditioned on both the prompt instructions and the annotated attributes, enabling SteerLM to effectively capture human preferences and generate responses that align with them.”
Tests conducted by the authors indicate that SteerLM outperforms both SFT and RLHF in following user instructions. In both automated and human-evaluated tests, a 43-billion-parameter LLaMA model fine-tuned with SteerLM surpassed other baseline models, including the larger ChatGPT 3.5. Remarkably, even at 13 billion parameters, SteerLM outperformed most baseline models.
Beyond performance, SteerLM also offers enhanced control over various aspects of the model’s output, such as humor, creativity, and helpfulness. This increased control allows for a more personalized and user-aligned AI experience. The researchers conclude, “We hope our work will inspire further research into developing simple and effective model alignment methods that empower better AI assistants for everyone.”
How SteerLM works
SteerLM operates in four steps. The first step involves training an “Attribute Prediction Model” (APM). Unlike traditional reward models that predict a single quality value, the APM predicts several attributes of the response, such as humor, helpfulness, toxicity, creativity, and language quality. To train the APM, the researchers used an open-source dataset manually annotated with response attributes, which turns attribute prediction into a standard supervised learning problem.
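As a rough illustration of how attribute prediction can be framed as supervised learning, the sketch below turns one annotated example into an input/target text pair. The attribute names come from the list above; the 0 to 9 score range, the tag format, and the helper names `encode_attributes` and `to_apm_training_pair` are assumptions made for this example, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Attribute names taken from the article; the fixed order is an assumption.
ATTRIBUTES = ("quality", "humor", "helpfulness", "toxicity", "creativity")


@dataclass
class AnnotatedExample:
    prompt: str
    response: str
    attributes: Dict[str, int]  # hypothetical integer scores, e.g. 0 to 9


def encode_attributes(attributes: Dict[str, int]) -> str:
    """Serialize attribute scores into one string, in a fixed order."""
    return ",".join(f"{name}:{attributes[name]}" for name in ATTRIBUTES)


def to_apm_training_pair(example: AnnotatedExample) -> Tuple[str, str]:
    """Build (input text, target text) for training an attribute predictor.

    The model reads the prompt and response and learns to emit the
    serialized attribute scores. The tag format here is illustrative.
    """
    source = f"<prompt>{example.prompt}</prompt><response>{example.response}</response>"
    target = encode_attributes(example.attributes)
    return source, target


example = AnnotatedExample(
    prompt="Tell me a joke about cats.",
    response="Why was the cat on the computer? To keep an eye on the mouse.",
    attributes={"quality": 7, "humor": 8, "helpfulness": 6, "toxicity": 0, "creativity": 5},
)
source, target = to_apm_training_pair(example)
print(target)
```

Because the target is plain text, the same language-model training loop used for SFT could, in principle, be reused to train the predictor.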
The second step uses the trained APM to annotate additional collected data. This process is scalable and significantly faster than labeling data manually. The researchers also highlight that annotation with the APM can address some issues associated with crowd-sourced human-annotated data. These include noise caused by annotators misinterpreting instructions, inadequate expertise in annotating responses, and limited language comprehension proficiency. “By employing an Attribute Prediction Model, it becomes possible to mitigate these issues by denoising the human-annotated attributes and calibrating scores across annotators,” the researchers explain.
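The annotation step can be pictured as mapping a scoring function over unlabeled prompt/response pairs. In the sketch below, `toy_apm` is a deliberately crude placeholder standing in for a trained Attribute Prediction Model, and `annotate` is a hypothetical helper; neither reflects the researchers' implementation.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, int]


def toy_apm(prompt: str, response: str) -> Scores:
    """Placeholder for a trained APM. These heuristics are only for
    illustration; a real APM would be a fine-tuned language model."""
    return {
        "helpfulness": min(9, len(response.split()) // 5),
        "humor": 9 if "joke" in response.lower() else 0,
    }


def annotate(
    pairs: List[Tuple[str, str]],
    apm: Callable[[str, str], Scores],
) -> List[dict]:
    """Attach APM-predicted attribute scores to each (prompt, response) pair."""
    return [
        {"prompt": prompt, "response": response, "attributes": apm(prompt, response)}
        for prompt, response in pairs
    ]


unlabeled = [
    ("Say something funny.", "Here is a joke about cats and dogs"),
    ("What is 2 + 2?", "4"),
]
for record in annotate(unlabeled, toy_apm):
    print(record["attributes"])
```

Since every example passes through the same scorer, the labels are consistent across the dataset, which is the calibration benefit the researchers describe.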
The third step, termed “Attribute-Conditioned SFT,” is an extension of regular SFT that incorporates reward signal information through attribute labels. Here, the model learns to generate a response conditioned on both the prompt and the attribute labels attached to that response. The LLM is trained on offline examples annotated with the APM instead of the online sample gathering used in RLHF. “By utilizing a purely offline training approach, this greatly simplifies the training configuration compared to RLHF’s heterogenous setup,” the researchers write.
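One way to picture attribute-conditioned SFT is as ordinary SFT in which the attribute labels are written into the context and the loss is computed only on the response tokens. The sketch below builds such an example with a loss mask. The prompt template, the whitespace "tokenizer", and the function name `build_sft_example` are all assumptions for illustration; the paper's actual template and tokenizer may differ.

```python
from typing import Dict, List, Tuple


def build_sft_example(
    prompt: str,
    attributes: Dict[str, int],
    response: str,
) -> Tuple[List[str], List[int]]:
    """Return (tokens, loss_mask) for one attribute-conditioned example.

    loss_mask is 0 for context tokens (prompt and attributes) and 1 for
    response tokens, so only the response contributes to the training loss.
    Whitespace splitting stands in for a real tokenizer.
    """
    attr_text = ",".join(f"{name}:{value}" for name, value in sorted(attributes.items()))
    context = f"User: {prompt}\nAttributes: {attr_text}\nAssistant:"
    context_tokens = context.split()
    response_tokens = response.split()
    tokens = context_tokens + response_tokens
    loss_mask = [0] * len(context_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_sft_example(
    prompt="Tell me a joke",
    attributes={"quality": 9, "humor": 9},
    response="Why did the chicken cross?",
)
print(list(zip(tokens, loss_mask)))
```

At inference time, the same context could be built with attribute values chosen by the user, which is what would let the user steer the output toward, say, more humor or less of it.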
The fourth step is a bootstrapping process that mirrors RLHF. First, SteerLM samples multiple responses from the fine-tuned model for each prompt, with the attributes set to request maximum quality. Next, it scores these responses using the APM. Finally, it uses the newly labeled responses to conduct another round of Attribute-Conditioned SFT. This iterative process allows the model to refine its responses further, aligning them more closely with user preferences and instructions.
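The sample-then-score loop described above can be sketched as follows. The `generate` and `apm` callables are hypothetical stand-ins for the fine-tuned model and the Attribute Prediction Model, and `bootstrap_round` is a name invented for this example. The point it illustrates is that each sampled response is relabeled with the attributes the APM predicts for it, rather than the attributes that were requested.

```python
from typing import Callable, Dict, List

Scores = Dict[str, int]


def bootstrap_round(
    prompts: List[str],
    generate: Callable[[str, Scores], str],
    apm: Callable[[str, str], Scores],
    target_attributes: Scores,
    num_samples: int = 4,
) -> List[dict]:
    """Collect new training data for another round of attribute-conditioned SFT.

    For each prompt, sample several responses while requesting the target
    (for example, maximum-quality) attributes, then label each response
    with the APM's predicted scores.
    """
    new_data = []
    for prompt in prompts:
        for _ in range(num_samples):
            response = generate(prompt, target_attributes)
            new_data.append(
                {
                    "prompt": prompt,
                    "response": response,
                    "attributes": apm(prompt, response),  # predicted, not requested
                }
            )
    return new_data


# Deterministic stubs so the sketch runs without a real model.
def stub_generate(prompt: str, attributes: Scores) -> str:
    return f"answer to {prompt}"


def stub_apm(prompt: str, response: str) -> Scores:
    return {"quality": len(response) % 10}


data = bootstrap_round(["a", "bb"], stub_generate, stub_apm, {"quality": 9}, num_samples=3)
print(len(data), data[0])
```

The returned records have the same shape as APM-annotated data, so under these assumptions they could be fed straight back into the attribute-conditioned SFT step.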
Why SteerLM is important
SteerLM has several appealing characteristics that set it apart from current alignment methods. Beyond its high-quality output, it simplifies the traditionally complex RLHF pipeline that necessitates the coordination of a large workforce.
SteerLM uses examples extracted from open-source datasets, including the OpenAssistant dataset, the Helpful and Harmless – Reinforcement Learning from Human Feedback dataset, and the Model Self-Identification Dataset. Other researchers and organizations can use the source code, training recipe, and data to further expand the research. The trained 13-billion-parameter SteerLM model is also available on Hugging Face.
However, it’s important to note that the training process for SteerLM remains computationally expensive. The researchers report that to train the Attribute Prediction Model and the Attribute-Conditioned Supervised Fine-Tuning model, they used a cluster of 128 A100-80GB GPUs, costing approximately $200 per hour.
Even so, SteerLM represents a significant reduction in cost compared with RLHF. We can expect other researchers to enhance the technique in the coming weeks and months.