This article is part of our coverage of the latest in AI research.
Reinforcement learning is one of the most fascinating fields of computer science, and it has proven useful in solving some of the toughest challenges of artificial intelligence and robotics. Some scientists believe that reinforcement learning will play a key role in cracking the enigma of human-level artificial intelligence.
But many hurdles stand between current reinforcement learning systems and a possible path toward more general and robust forms of AI. Many RL systems struggle with long-term planning, training-sample efficiency, transferring knowledge to new tasks, dealing with the inconsistencies of input signals and rewards, and other challenges that occur in real-world applications. There are dozens of reinforcement learning algorithms—and more recently deep RL—each of which addresses some of these challenges while struggling with others.
A new reinforcement learning technique developed by researchers at the University of California, San Diego, brings together two major branches of RL to create more efficient and robust agents. Dubbed Temporal Difference Learning for Model Predictive Control (TD-MPC), the new technique combines the strengths of “model-based” and “model-free” RL to match and outperform state-of-the-art algorithms in challenging control tasks.
Model-free vs model-based reinforcement learning
In reinforcement learning, an agent seeks a goal such as moving to a destination location, winning a game, reducing energy consumption in a factory, or maximizing ad clicks. The agent can interact with its environment through a set of actions, such as displacing pieces on a chessboard, displaying an ad on a website, or moving a limb on a robot. Every action changes the state of the environment and provides the agent with a reward, a feedback signal that helps it determine whether it is getting closer to its goal or not. Through multiple training episodes, RL agents learn action policies that maximize their rewards.
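The agent–environment loop described above can be sketched in a few lines of code. The toy environment below (a hypothetical "line world", not an environment from the paper) has the agent walk left or right along a line, with every action changing the state and returning a reward signal:

```python
import random

random.seed(0)  # deterministic for illustration

class LineWorld:
    """Toy environment: the agent walks left/right on a line and is
    rewarded for reaching the rightmost cell, its goal."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Every action changes the state of the environment...
        self.state = max(0, min(self.size - 1, self.state + action))
        done = self.state == self.size - 1
        # ...and yields a reward: a small penalty per step, a bonus at the goal.
        reward = 1.0 if done else -0.01
        return self.state, reward, done

env = LineWorld()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([-1, 1])  # stand-in for a learned policy
    state, reward, done = env.step(action)
    total_reward += reward
```

Over many such episodes, an RL agent would adjust its policy (here just a random choice) to maximize the accumulated reward.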
One common way to categorize RL algorithms is to split them between model-based and model-free techniques.
Model-free RL agents develop their policies by trying out many actions and learning which sequences succeed. A model-free RL agent learns the values of actions purely from past experience. For example, consider an RL algorithm for driving cars. An untrained RL agent will not know that driving straight over a cliff will cause the car to drop and crash. But once it experiences it, it will receive the negative reward signal and learn to take a different action (hit the brakes, or steer in another direction) the next time it approaches a cliff.
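This "learn purely from experienced rewards" idea can be sketched with a tabular value update in the style of Q-learning (the states, actions, and numbers below are hypothetical, purely for illustration):

```python
# Tabular value learning: action values are updated only from
# transitions the agent has actually experienced.
alpha, gamma = 0.5, 0.9  # learning rate and discount factor
Q = {("at_cliff", "straight"): 0.0, ("at_cliff", "brake"): 0.0}

def update(state, action, reward, next_value):
    # Move the stored value toward the observed reward plus the
    # discounted value of whatever state comes next.
    key = (state, action)
    Q[key] += alpha * (reward + gamma * next_value - Q[key])

# Before the crash, both actions look equally good (value 0).
update("at_cliff", "straight", -100.0, 0.0)  # car drives off the cliff
update("at_cliff", "brake", 0.0, 0.0)        # car stops safely
```

After a single bad experience, "straight" is valued well below "brake" at the cliff state, so the learned policy avoids it.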
On the other hand, model-based RL tries to learn the dynamics that govern its environment. The benefit of modeling the environment is that the RL agent can predict the outcome of actions it hasn’t taken before. In the self-driving car example mentioned above, a model-based RL agent doesn’t need to experience driving off a cliff or a bridge to realize that doing so leads to bad outcomes.
Model-based RL is very appealing because it is more sample-efficient than model-free algorithms. Model-based RL agents need less experience gathering and can “imagine” different scenarios without taking actual actions in the environment. However, learning an accurate model of the environment is often very difficult, especially in complex environments where states and actions are continuous (as opposed to discrete actions such as in chess and Go) and where the environment is stochastic and can change without the agent taking any action (as in the real world).
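The "imagination" step can be illustrated with a tiny learned dynamics model. Here the model is a hand-written stand-in given in closed form (in practice it would be learned from data), and the agent compares two candidate plans without ever touching the real environment:

```python
def model(state, action):
    """Stand-in for a learned dynamics model on a 1-D position task:
    predicts the next state and reward (goal is position 3)."""
    next_state = state + action
    reward = -abs(next_state - 3)
    return next_state, reward

def imagined_return(state, actions):
    # Roll the model forward over a candidate action sequence,
    # accumulating predicted rewards -- no real actions are taken.
    total = 0.0
    for a in actions:
        state, r = model(state, a)
        total += r
    return total

# Compare two candidate plans purely in "imagination".
plan_a = [1, 1, 1]    # walks toward the goal
plan_b = [-1, -1, -1] # walks away from it
best = max([plan_a, plan_b], key=lambda p: imagined_return(0, p))
```

If the learned model is inaccurate, however, these imagined returns mislead the agent, which is exactly the failure mode that makes model-based RL hard in complex environments.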
Therefore, model-free reinforcement learning remains the main solution for many applications, especially where the RL agents must handle continuous actions and states.
Temporal Difference Learning for Model Predictive Control (TD-MPC)
Temporal Difference Learning for Model Predictive Control, the new technique developed by the researchers at UCSD, combines the strengths of model-free and model-based reinforcement learning to overcome their respective weaknesses and train RL agents to become more robust and sample-efficient.
The model-based algorithm they used is “model predictive control.” MPC is very useful for finding local solutions to control tasks. But it has a limited time horizon and performs poorly when faced with problems that require long-term planning, especially in applications where compute resources are scarce.
“A key feature of MPC is that it approximates a global, long-horizon solution by solving a local, finite-horizon optimization problem,” Nicklas Hansen, lead author of the TD-MPC paper, told TechTalks. “This is both a blessing and a curse, as it ties performance with the planning horizon and hence computational budget. This is a non-issue in applications where models are easy to evaluate, but when the model is a neural network, it can be rather costly to do planning over long horizons.”
MPC performs very well when the model is accurate and there are sufficient compute resources for planning. But as learning an accurate model of the environment becomes harder, engineers must adopt large models that are slow and costly to train. In comparison, model-free RL algorithms directly optimize a policy for long-horizon performance, which makes them incredibly fast but somewhat less expressive.
“We wanted to incorporate the key strengths of model-free RL into the MPC framework to get the best of both worlds,” Hansen said.
To achieve this goal, the researchers chose to optimize a combination of short-term model-based predictions and long-term value estimations using “temporal difference learning” (TD learning), a technique commonly used in model-free RL algorithms. TD learning is a mechanism to calculate the values of different action trajectories for RL agents.
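At its core, TD learning nudges a state's value estimate toward the observed reward plus the discounted estimate of the next state. A minimal TD(0) sketch (with hypothetical states and numbers) looks like this:

```python
# TD(0) update rule: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
gamma, alpha = 0.99, 0.1  # discount factor and learning rate
V = {"s0": 0.0, "s1": 0.0}

def td_update(s, r, s_next):
    # The "TD error" measures how surprised the agent is by this transition.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# One observed transition s0 -> s1 with reward 1 nudges V(s0) upward.
err = td_update("s0", 1.0, "s1")
```

Repeating this over many transitions propagates reward information backward through the state space, which is what lets the value function summarize long-term outcomes.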
TD-MPC uses model-based reinforcement learning to optimize short-term decisions and the model-free value function to estimate long-term decisions. For example, in a humanoid locomotion task, the model-based component can control the joint movements while the model-free component chooses the trajectory of the agent.
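That division of labor can be sketched as a planning score: sum model-predicted rewards over a short horizon, then bootstrap with a learned value function at the horizon's end. The model and value function below are hand-written stand-ins for illustration, not the paper's learned networks:

```python
gamma = 0.99  # discount factor

def model(state, action):
    """Stand-in for a learned dynamics model (reward peaks at state 0)."""
    next_state = state + action
    reward = -abs(next_state)
    return next_state, reward

def value(state):
    """Stand-in for a learned terminal value function: estimates
    long-horizon return beyond the planning horizon."""
    return -abs(state)

def plan_score(state, actions):
    # Short horizon: explicit model rollout (the model-based part).
    total, discount = 0.0, 1.0
    for a in actions:
        state, r = model(state, a)
        total += discount * r
        discount *= gamma
    # Beyond the horizon: value-function bootstrap (the model-free part).
    return total + discount * value(state)

# Pick the better of two short candidate plans from state 2.
candidates = [[-1, -1], [1, 1]]
best = max(candidates, key=lambda p: plan_score(2, p))
```

The key point is the final bootstrap term: it lets a short, cheap planning horizon stand in for a long one, rather than forcing the planner to roll the model out far into the future.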
Learning latent representations of the environment
One of the great challenges of modeling complex environments is finding the right elements and features to extract and predict. Many popular model-based RL systems try to learn and predict full image frames when dealing with visual input. This is a very difficult task that requires immense computational resources and yields unstable results.
To make the environment modeling more efficient, Hansen and his colleagues designed a “Task-Oriented Latent Dynamics” (TOLD) model that is jointly learned with a terminal value function using TD-learning. Basically, instead of trying to model the entire input space, TOLD extracts a smaller subset of features that are relevant to the task and goal of the RL agent.
“Whereas most previous data-driven MPC methods try to model everything in the environment via a future state or image prediction loss, our TD-MPC algorithm instead learns a task-oriented latent dynamics model through reward and value predictions,” Hansen said. “This allows us to only model parts of the environment that are actually relevant to planning, which we find is both easier to optimize and faster. Our findings tie nicely into the common wisdom that it is generally better to directly optimize for the objective that you care about.”
By discarding the irrelevant features, TOLD performs dimensionality reduction on the input space, simplifying the problem and making it possible for reinforcement learning algorithms to support continuous action spaces, arbitrary input modalities, and sparse reward signals.
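The shape of this idea can be sketched with a small encoder that projects a high-dimensional observation down to a compact latent vector, which is all that planning ever sees. In TD-MPC the encoder is a neural network trained end-to-end through reward and value prediction; the fixed linear weights below are illustrative only:

```python
def encode(observation, weights):
    """Linear projection from a long observation vector to a short
    latent vector (stand-in for TOLD's learned encoder)."""
    latent = []
    for row in weights:
        latent.append(sum(w * x for w, x in zip(row, observation)))
    return latent

obs = [0.1] * 32                    # 32-D observation (e.g. sensor readings)
weights = [[1.0] * 32, [0.5] * 32]  # 32 -> 2 projection (illustrative values)
z = encode(obs, weights)

# Planning operates on the 2-D latent z, never the 32-D observation.
```

Because the encoder is trained only to predict rewards and values, features of the observation that do not affect the task can be discarded rather than reconstructed.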
Putting TD-MPC to the test
The researchers evaluated TD-MPC with a TOLD model on 73 diverse continuous-control tasks from the DeepMind Control Suite and Meta-World v2, two virtual environments for training and testing reinforcement learning systems.
They compared the performance with several benchmark reinforcement learning algorithms, including the model-free “soft actor critic” (SAC), an RL technique that performs strongly on DeepMind Control and Meta-World; a vanilla MPC model; and LOOP, another hybrid RL algorithm that combines model-free and model-based techniques.
“We find our method to outperform or match baselines in most tasks considered (often by a large margin), generally with larger gains on complex tasks such as Humanoid, Dog (DMControl), and Bin Picking (Meta-World),” the researchers write in their paper.
Another benefit of TD-MPC is its capacity for transfer learning. The latent representations learned by the TOLD model are often shared across tasks, which dramatically reduces the cost of retraining an RL agent for a new task.
However, their findings also show that TD-MPC falls short of other state-of-the-art models in tasks that require more sophisticated exploration strategies, such as the DMControl Finger Turn Hard problem. The researchers also acknowledge that TOLD does not yet generalize to unrelated tasks in the same environment. These are among the research directions they plan to explore in the future.
“Because TD-MPC is so versatile, we expect a wide variety of applications to benefit from it, in particular whenever a model is either not known a priori or is costly to evaluate,” Hansen said. “As it is modality-agnostic by design, it also serves as a drop-in replacement for applications where model-free RL might typically be favored, such as robotics tasks with visual feedback.”
The scientists have shared the source code for TD-MPC online and hope the research community will build and expand on the idea of combining model-free and model-based reinforcement learning algorithms.
“Our next steps are to demonstrate the effectiveness of TD-MPC at solving real-world problems,” Hansen said. “As part of this goal, we have open-sourced our implementation of TD-MPC, and we look forward to seeing which kinds of applications our incredible community will apply it to in the future.”