This article is part of our coverage of the latest in AI research.
One of the long-coveted goals of artificial intelligence is to create agents that can effectively accomplish tasks in the real world by following natural language instructions. Large language models (LLMs) have made significant strides toward this objective, demonstrating an impressive ability to handle well-defined tasks. However, their capabilities are currently limited, often falling short when faced with tasks that require a broader understanding of the world.
A new research paper from scientists at UC Berkeley proposes an innovative approach to this challenge. The paper introduces a novel technique, dubbed Dynalang, for building reinforcement learning agents that learn a world model with the help of natural language. This approach is not just about teaching an AI to perform a task; it’s about enabling the AI to understand the context of its environment and perform tasks more robustly and efficiently.
Experiments show that Dynalang can robustly handle tasks in a variety of contexts. This research could open up new avenues for research toward creating better AI agents for the real world.
LLMs in the physical world
The recent advancements in LLMs have sparked a wave of excitement across various fields, including robotics and real-world task-performing agents. One of the promising aspects of LLMs is their ability to bridge the gap between language and visual data, giving rise to visual language models (VLMs).
VLMs have the capability to map text to visual data and vice versa, a feature that has been leveraged in different applications, including text-to-image models and AI image search. A more advanced application of this technology is the mapping of natural language commands to actions in the real world. This is sometimes referred to as “embodied language models.”
Some techniques have combined reinforcement learning with VLMs to train agents capable of carrying out specific instructions.
However, the current models have their limitations. They excel at executing commands for very specific tasks, such as “pick up the blue box.” Recent advancements have added a layer of abstraction to these commands, enabling VLM-powered agents to understand and execute more complex instructions like “pick up the toy that represents an extinct animal.”
But in the real world, commands and utterances are often context-dependent. For instance, the phrase “I put the bowls away” could mean different things to an agent if it’s cleaning dishes or serving food. The UC Berkeley researchers note, “When language does not talk about the task, it is only weakly correlated with optimal actions the agent should take.”
The researchers propose a different approach. Instead of training agents to accomplish tasks right away, they suggest training them first to predict the future by learning a world model with the help of language instructions. “Similar to how next-token prediction allows language models to form internal representations of world knowledge, we hypothesize that predicting future representations provides a rich learning signal for agents to understand language and how it relates to the world,” the researchers write.
This approach could help AI agents to understand the context of their environment and perform tasks more robustly and efficiently.
The UC Berkeley researchers have proposed a technique called Dynalang, which they describe as “an agent that learns a world model of language and images from online experience and uses the model to learn how to act.” The technique has two distinct training modes.
First, Dynalang learns to model the world through text and visual observations. The researchers explain, “We train the world model to predict future latent representations with experience collected online as the agent acts in the environment.” This approach mirrors one form of self-supervised learning humans use to map observations in their environment to language. The researchers refer to this as a “language-conditioned world model.” Notably, Dynalang is multi-modal, meaning it predicts not only text but also visual representations of the future.
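The idea of learning to predict the future from experience can be illustrated with a deliberately tiny sketch. The toy dynamics (next state = state + action) and the linear predictor below are illustrative assumptions; Dynalang’s actual world model predicts latent representations of multimodal inputs, not raw scalar states:

```python
import random

# Toy world-model sketch: learn to predict the next state of a trivial
# environment whose true dynamics are s' = s + a. A linear predictor
# w_s*s + w_a*a is trained by stochastic gradient descent on squared error.
random.seed(1)
w_s, w_a = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    s = random.uniform(-1, 1)            # current state
    a = random.choice([-1.0, 1.0])       # action taken
    target = s + a                       # true next state
    pred = w_s * s + w_a * a             # model's prediction
    err = pred - target
    w_s -= lr * err * s                  # gradient step on the weights
    w_a -= lr * err * a

print(round(w_s, 2), round(w_a, 2))      # converges to roughly 1.0 1.0
```

The same principle, scaled up to sequence models over image and text representations, is what gives the agent a self-supervised learning signal that does not depend on task rewards.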
Second, Dynalang learns its action policy through reinforcement learning on the representations of the world model and tasks. “We train the policy to take actions that maximize task reward, taking the latent representation of the world model as input,” the researchers write.
In essence, Dynalang is designed to learn a world model through language and visual observations, and then use this model to learn how to act effectively. This approach could potentially make AI agents more robust and efficient across a variety of contexts.
How Dynalang works
The UC Berkeley researchers developed Dynalang using a clever combination of different machine learning techniques. At its core, Dynalang is an AI system designed to perform actions, and its structure is based on a reinforcement learning loop. This loop consists of an agent, environment, actions, states, and rewards. The fundamental goal is to train an agent that can maximize its reward.
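The loop described above can be written down in a few lines of plain Python. `ToyEnv` and the always-move-right policy are hypothetical stand-ins for illustration, not part of Dynalang:

```python
class ToyEnv:
    """A hypothetical 1-D environment: the agent starts at position 0
    and is rewarded for reaching position 3 within 10 steps."""
    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos                        # initial state

    def step(self, action):                    # action is -1 or +1
        self.pos += action
        self.t += 1
        reward = 1.0 if self.pos == 3 else 0.0
        done = self.pos == 3 or self.t >= 10
        return self.pos, reward, done

def run_episode(env, policy):
    """The basic RL loop: observe a state, act, collect the reward."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(state)                 # agent chooses an action
        state, reward, done = env.step(action)
        total += reward
    return total

print(run_episode(ToyEnv(), lambda state: 1))  # → 1.0
```

A real agent replaces the hard-coded policy with a trainable one that is updated to maximize the episode’s total reward.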
Dynalang is a model-based reinforcement learning system, meaning the agent learns a model of its environment and uses it to predict future states rather than learning purely through trial and error. In parallel, a replay buffer of past experience provides a supervised learning stream for training the world model. Depending on the environment, the action space can consist of motor commands, text generation, and other types of actions.
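A replay buffer of this kind can be sketched in a few lines; the transition fields below are illustrative, not Dynalang’s actual data layout:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions collected online and serves random batches
    for training the world model."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest entries are evicted

    def add(self, obs, token, action, reward):
        self.buffer.append((obs, token, action, reward))

    def sample(self, batch_size):
        # Sample without replacement, capped at the buffer's current size.
        return random.sample(list(self.buffer),
                             min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=100)
for t in range(200):                           # only the last 100 survive
    buf.add(obs=t, token="go", action=+1, reward=0.0)
batch = buf.sample(32)
print(len(buf.buffer), len(batch))             # → 100 32
```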
One of the interesting characteristics of Dynalang is that it receives text instructions and descriptions as a stream of tokens along with image frames. This is in contrast to other techniques that provide a full chunk of instruction text at the beginning of an episode. The researchers explain, “For humans, reading, listening, and speaking extends over time, during which we receive new visual inputs and can perform motor actions. Analogously, we provide our agent with one video frame and one language token at each time step and the agent produces one motor action, and in applicable environments one language token, per time step.”
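This per-timestep interleaving can be illustrated with a toy sketch; the whitespace tokenization and frame placeholders below are simplifying assumptions, not the paper’s actual inputs:

```python
from itertools import zip_longest

tokens = "pick up the blue box".split()       # one language token per step
frames = [f"frame_{i}" for i in range(7)]     # one video frame per step

# When the utterance is exhausted, a padding token stands in, so the
# agent still receives a (frame, token) pair at every time step.
for t, (frame, token) in enumerate(
        zip_longest(frames, tokens, fillvalue="<pad>")):
    action = "noop"   # a real agent would emit a motor action here
    print(t, frame, token, action)
```

The point of streaming one token per step, rather than handing over the whole instruction up front, is that language arrives in time, interleaved with new observations and actions, just as it does for humans.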
Like many applications of language models, Dynalang can be pre-trained on raw data (text and images), where it learns the latent representations of each modality. It can then be fine-tuned on smaller datasets of sensory and action data. However, there is one caveat, as the researchers note: “Unlike the typical language modeling objective, the model is not explicitly trained to predict the next token from the prefix, except through the prediction of the representation at the next timestep.”
How effective is Dynalang?
The Dynalang research paper is currently in pre-print, meaning it has yet to undergo the rigorous process of peer review. However, the authors of the paper include highly respected figures in the field of AI research, including Pieter Abbeel, the Director of the Berkeley Robot Learning Lab and co-director of the Berkeley AI Research Lab. This lends a degree of credibility to the findings presented in the paper.
The researchers put Dynalang through its paces in a variety of environments, each with unique settings and challenges. Where possible, they compared the performance of Dynalang with baseline reinforcement learning models operating in the same environments.
One such environment was HomeGrid, a multitask gridworld where agents receive task specifications in language form, along with language hints. These hints can include descriptions of object locations, information about the environment’s dynamics, and corrections about actions.
The researchers note, “Notably, agents never receive direct supervision about what the hints mean in HomeGrid, and hints are often far removed from the objects or observations they refer to.” This means the agent must learn the hints’ meaning by correlating them to the states observed by the world model. The experiments demonstrated that Dynalang was adept at utilizing these hints, unlike RL models that had to learn action distributions through trial and error.
In another environment, Vision-and-Language Navigation in Continuous Environments (VLN-CE), the agent is required to navigate a 3D environment to reach a specified destination. Each episode includes an environment and natural language instructions on how to reach the destination.
The experiments showed that Dynalang was significantly more effective than pure RL methods in reaching the goal, as it learned to associate the text instructions with environment observations and actions.
However, the authors caution that “[Dynalang] is not yet competitive with state-of-the-art VLN methods (many of which use expert demonstrations or specialized architectures).” In other words, while Dynalang does not yet match SOTA techniques, it requires far fewer manual annotations and can learn from near-raw data.
The paper also explores two other interesting environments: the Messenger game environment and the LangRoom embodied question-answering challenge. For a detailed analysis of how Dynalang performs in these environments, I recommend reading the full paper.
One of the key findings of the paper is that pretraining the model on text-only datasets significantly enhances the model’s performance on the final task. This suggests that the model’s ability to learn from text is a crucial factor in its overall effectiveness.
However, the researchers acknowledge that Dynalang has considerable room for improvement. They suggest that better language modeling techniques and architectures that can support actions in long horizons could enhance the model’s performance. I’m personally interested to see how it will improve if it is combined with more advanced transformer models.
Furthermore, it remains to be seen how well such techniques would fare in real-world contexts, which are often far more unpredictable and complex than controlled environments. But the researchers are optimistic about Dynalang’s potential, especially in taking advantage of the vast amount of unlabeled data that is available online. The researchers write, “The ability to pretrain on video and text without actions or rewards suggests that Dynalang could be scaled to large web datasets, paving the way towards a self-improving multimodal agent that interacts with humans in the world.”