Reinforcement learning for the real world

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

Labor- and data-efficiency remain two of the key challenges of artificial intelligence. In recent decades, researchers have shown that big data and machine learning algorithms reduce the need to provide AI systems with prior rules and knowledge. But machine learning, and more recently deep learning, has presented its own challenges, which require manual labor, albeit of a different nature.

Creating AI systems that can genuinely learn on their own with minimal human guidance remains a holy grail and a great challenge for scientists. According to Sergey Levine, assistant professor at the University of California, Berkeley, a promising direction of research for the AI community is “self-supervised offline reinforcement learning.”

This is a variation of the reinforcement learning (RL) paradigm that is very close to how humans and animals learn to reuse previously acquired data and skills, and it can be a great boon for applying AI to real-world settings. In a paper titled “Understanding the World Through Action” and a talk at the NeurIPS 2021 conference, Levine explained how self-supervised learning objectives and offline RL can help create generalized AI systems that can be applied to various tasks.

From rule-based AI to machine learning


One common argument in favor of machine learning algorithms is their ability to scale with the availability of data and compute resources. Decades of work on developing symbolic AI systems have produced limited results. These systems require human experts and engineers to manually provide the rules and knowledge that define the behavior of the AI system.

The problem is that in some applications, the rules can be virtually limitless, while in others, they can’t be explicitly defined.

In contrast, machine learning models can derive their behavior from data, without the need for explicit rules and prior knowledge. Another advantage of machine learning is that it can glean its own solutions from its training data, which are often more accurate than knowledge engineered by humans.

But machine learning faces its own challenges. Most ML applications are based on supervised learning and require training data to be manually labeled by human annotators. Data annotation imposes severe limits on the scaling of ML models.

More recently, researchers have been exploring unsupervised and self-supervised learning, ML paradigms that obviate the need for manual labels. These approaches have helped overcome the limits of machine learning in some applications such as language modeling and medical imaging. But they’re still faced with challenges that prevent their use in more general settings.

Current methods for learning without human labels still require “considerable human insight (which is often domain-specific!) to engineer self-supervised learning objectives that allow large models to acquire meaningful knowledge from unlabeled datasets,” Levine writes.

Levine writes that the next objective should be to create AI systems that don’t require manual labeling or the manual design of self-supervised objectives. These models should be able to “distill a deep and meaningful understanding of the world and can perform downstream tasks with robustness, generalization, and even a degree of common sense.”

Reinforcement learning

Reinforcement learning is inspired by intelligent behavior in animals and humans. Reinforcement learning pioneer Richard Sutton describes RL as the “first computational theory of intelligence.” An RL agent develops its behavior by interacting with its environment, weighing the punishments and rewards of its actions, and developing policies that maximize rewards.
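To make the interaction loop concrete, here is a minimal sketch of tabular Q-learning, one classic RL algorithm, in a toy five-cell corridor. The environment, the reward values, and every name in the code (`step`, `train`, `greedy_policy`) are assumptions made for illustration and do not come from Levine's paper.

```python
import random

N_STATES = 5          # hypothetical corridor with cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)    # step left or step right

def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    # reward for reaching the goal, small punishment for every other move
    reward = 1.0 if done else -0.01
    return next_state, reward, done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a table of action values by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # explore occasionally, otherwise take the best-known action
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: q[state][i])
            next_state, reward, done = step(state, ACTIONS[a])
            # move the estimate toward reward plus discounted future value
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

def greedy_policy(q):
    """Best-known action for each non-goal cell."""
    return [ACTIONS[max(range(2), key=lambda i: q[s][i])]
            for s in range(N_STATES - 1)]

q_table = train()
policy = greedy_policy(q_table)
```

In this sketch the agent is told nothing about the corridor in advance. The policy of always stepping toward the goal emerges only from the rewards and punishments it collects while acting.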

RL, and more recently deep RL, have proven to be particularly efficient at solving complicated problems such as playing games and training robots. And there’s reason to believe reinforcement learning can overcome the limits of current ML systems.

But before it does, RL must overcome its own set of challenges that limit its use in real-world settings.

“We could think of modern RL research as consisting of three threads: (1) getting good results in simulated benchmarks (e.g., video games); (2) using simulation + transfer; (3) running RL in the real world,” Levine told TechTalks. “I believe that ultimately (3) is the most important thing, because that’s the most promising approach to solve problems that we can’t solve today.”

Games are simple environments. Board games such as chess and Go are closed worlds with deterministic environments. Even games such as StarCraft and Dota, which are played in real time and have nearly unlimited states, are much simpler than the real world. Their rules don’t change. This is partly why game-playing AI systems have found very few applications in the real world.

On the other hand, physics simulators have seen tremendous advances in recent years. One of the popular methods in fields such as robotics and self-driving cars has been to train reinforcement learning models in simulated environments and then finetune the models with real-world experience. But according to Levine, this approach is limited too “because the domains where we most need learning—the ones where humans far outperform machines—are also the ones that are hardest to simulate.”

“This approach is only effective at addressing tasks that can be simulated, which is bottlenecked by our ability to create lifelike simulated analogues of the real world and to anticipate all the possible situations that an agent might encounter in reality,” Levine said.

Rewards, data-driven learning, and generalization

“One of the biggest challenges we encounter when we try to do real-world RL is generalization,” Levine said.

For example, in 2016, Levine was part of a team that constructed an “arm farm” at Google with 14 robots all learning concurrently from their shared experience. The team collected more than half a million grasp attempts, and their RL models were able to learn effective grasping policies in this way.

“But we can’t repeat this process for every single task we want robots to learn with RL,” he says. “Therefore, we need more general-purpose approaches, where a single ever-growing dataset is used as the basis for a general understanding of the world on which more specific skills can be built.”

In his paper, Levine points to two key obstacles in reinforcement learning. First, RL systems require manually defined reward functions or goals before they can learn the behaviors that help accomplish those goals. And second, most reinforcement learning systems require online experience and are not data-driven, which makes it hard to train them on existing data. Most recent accomplishments in RL have relied on engineers at very wealthy tech companies using massive compute resources to generate immense numbers of episodes instead of reusing available data.

Therefore, RL systems need solutions that can learn from past experience and repurpose their learnings in more generalized ways. Moreover, they should be able to handle the continuity of the real world. Unlike simulated environments, you can’t reset the real world and start everything from scratch. You need learning systems that can quickly adapt to the constant and unpredictable changes to their environment.

In his NeurIPS talk, Levine compares real-world RL to the story of Robinson Crusoe, a man who is stranded on an island and learns to deal with unknown situations through inventiveness and creativity, using his knowledge of the world and continued exploration in his new habitat.

“RL systems in the real world have to deal with a lifelong learning problem, evaluate objectives and performance based entirely on realistic sensing without access to privileged information, and must deal with real-world constraints, including safety,” Levine said. “These are all things that are typically abstracted away in widely used RL benchmark tasks and video game environments.”

However, RL does work in more practical real-world settings, Levine says. For example, in 2018, he and his colleagues developed an RL-based robotic grasping system that attained state-of-the-art results with raw sensory perception. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, in their method, the robot continuously updated its grasp strategy based on the most recent observations to optimize long-horizon grasp success.

“To my knowledge this is still the best existing system for grasping from monocular RGB images,” Levine said. “But this sort of thing requires algorithms that are somewhat different from those that perform best in simulated video game settings: it requires algorithms that are adept at utilizing and reusing previously collected data, algorithms that can train large models that generalize, and algorithms that can support large-scale real-world data collection.”

Self-supervised and offline reinforcement learning

Levine’s reinforcement learning solution includes two key components: unsupervised/self-supervised learning and offline learning.

In his paper, Levine describes self-supervised reinforcement learning as a system that can “learn behaviors that control the world in meaningful ways” and provides some mechanism “to learn to control [the world] in as many ways as possible.”

Basically, this means that instead of being optimized for a single goal, the RL agent should be able to achieve many different goals by computing counterfactuals, learning causal models, and obtaining a deep understanding of how actions affect its environment in the long term. This will help achieve new goals—or downstream tasks—faster.
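One common way to train an agent for many goals without hand-written rewards is hindsight goal relabeling: any state the agent actually reached is treated, after the fact, as a goal it could have been asked to reach. The sketch below illustrates that general idea only. It is not presented as the method in Levine's paper, and the function name and tuple layout are assumptions.

```python
def relabel_with_hindsight(trajectory):
    """Turn one reward-free trajectory into goal-conditioned training tuples.

    trajectory: list of (state, action, next_state) steps.
    Returns a list of (state, action, next_state, goal, reward) tuples.
    """
    relabeled = []
    for t, (state, action, next_state) in enumerate(trajectory):
        # every state reached at step t or later is a goal this step can teach
        for _, _, goal in trajectory[t:]:
            reward = 1.0 if next_state == goal else 0.0
            relabeled.append((state, action, next_state, goal, reward))
    return relabeled

# two steps to the right in a hypothetical corridor: 0 -> 1 -> 2
examples = relabel_with_hindsight([(0, +1, 1), (1, +1, 2)])
```

Here a two-step trajectory with no reward signal at all yields three labeled examples, covering the goals "reach cell 1" and "reach cell 2."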

However, creating self-supervised RL models that can solve various goals would still require a massive amount of experience. To address this challenge, Levine proposes offline reinforcement learning, which makes it possible for models to continue learning from previously collected data without the need for continued online experience.

“Offline RL can make it possible to apply self-supervised or unsupervised RL methods even in settings where online collection is infeasible, and such methods can serve as one of the most powerful tools for incorporating large and diverse datasets into self-supervised RL,” he writes.
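The offline setting can be sketched with the same value-update rule used in ordinary Q-learning, applied to a fixed log of transitions instead of fresh interaction. The dataset, the four-cell corridor it describes, and the function name below are hypothetical. The log happens to cover every state-action pair; practical offline RL methods add safeguards against overestimating actions the data never covers, and this sketch omits them.

```python
def offline_q_learning(dataset, n_states, n_actions,
                       gamma=0.9, alpha=0.5, sweeps=200):
    """Fit action values from logged transitions only; the environment is never called.

    dataset: list of (state, action, reward, next_state, done) tuples.
    """
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        # replay the same logged experience over and over
        for s, a, r, s2, done in dataset:
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
    return q

# hypothetical log from a corridor with cells 0..3 (cell 3 is the goal);
# action 0 = left, action 1 = right
logged = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 0, False),
    (1, 1, 0.0, 2, False),
    (2, 0, 0.0, 1, False),
    (2, 1, 1.0, 3, True),
]

q_offline = offline_q_learning(logged, n_states=4, n_actions=2)
offline_policy = [max(range(2), key=lambda a: q_offline[s][a]) for s in range(3)]
```

The six logged transitions could have been collected for any purpose, by any behavior. The learner reuses them repeatedly and still recovers a policy that heads for the goal.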

The combination of self-supervised and offline RL can help develop agents that can create building blocks for learning new tasks and continue learning with little need for new data.

This is very similar to how we learn in the real world. For example, when you want to learn basketball, you start with basic skills you acquired in the past such as walking, running, jumping, handling objects, etc. You use these capabilities to develop new skills such as dribbling, crossovers, jump shots, free throws, layups, straight and bounce passes, eurosteps, dunks (if you’re tall enough), etc. These skills build on each other and help you reach the bigger goal, which is to outscore your opponent. At the same time, you can learn from offline data by reflecting on your past experience and thinking about counterfactuals (e.g., what would have happened if you passed to an open teammate instead of taking a contested shot). You can also learn by processing other data such as videos of yourself and your opponents. In fact, on-court experience is just part of your continuous learning.

In a paper, Yevgen Chebotar, one of Levine’s colleagues, shows how self-supervised offline RL can learn policies for fairly general robotic manipulation skills, directly reusing data that his team had collected for another project.

“This system was able to reach a variety of user-specified goals, and also act as a general-purpose pretraining procedure (a kind of ‘BERT for robotics’) for other kinds of tasks specified with conventional reward functions,” Levine said.

No more simulations

One of the great benefits of offline and self-supervised RL is learning from real-world data instead of simulated environments.

“Basically, it comes down to this question: is it easier to create a brain, or is it easier to create the universe? I think it’s easier to create a brain, because it is part of the universe,” Levine said.

This is, in fact, one of the great challenges engineers face when creating simulated environments. For example, Levine says, effective simulation for autonomous driving requires simulating other drivers, “which requires having an autonomous driving system, which requires simulating other drivers, which requires having an autonomous driving system, etc.”

“Ultimately, learning from real data will be more effective because it will simply be much easier and more scalable, just as we’ve seen in supervised learning domains in computer vision and NLP, where no one worries about using simulation,” he said. “My perspective is that we should figure out how to do RL in a scalable and general-purpose way using real data, and this will spare us from having to expend inordinate amounts of effort building simulators.”
