What is reinforcement learning?

A robotic hand developed by OpenAI uses reinforcement learning to handle objects (Image credit: YouTube/OpenAI)

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding artificial intelligence.

In late 2017, AlphaZero, an artificial intelligence program developed by Google-owned research lab DeepMind, managed to defeat the state-of-the-art AI programs at the board games chess, shogi (Japanese chess) and Go (including DeepMind’s own AlphaGo).

Playing games has been a constant in AI research for decades. However, what made AlphaZero special was the way it learned the game. In previous approaches, engineers either had to meticulously write all the different ways of playing the game or provide the AI with huge sets of data from games played by humans.

But in the case of AlphaZero, its engineers only provided it with the basic rules of the game and let the AI randomly explore the game environment until it “learned” combinations of moves that would win. It only took AlphaZero 24 hours (and access to Google’s near-limitless processing power) to prove it was superior to other game-playing AI models.

Reinforcement learning, the special AI technique used in AlphaZero, is considered by many the holy grail of artificial intelligence, because it can create autonomous systems that truly self-learn tasks without human intervention (though things are a bit more complicated in reality).

A quick primer on machine learning

Reinforcement learning is a subset of machine learning, a branch of AI that has become popular in recent years. Classical approaches to creating AI required programmers to manually code every rule that defined the behavior of the software.

A telling example is Stockfish, an open-source AI chess engine that has been developed with contributions from hundreds of programmers and chess experts who have turned their experience into game rules.

In contrast to rule-based AI, machine learning programs develop their behavior by examining vast amounts of example data and spotting meaningful correlations. When creating a machine learning–based chess engine, instead of providing every single rule of gameplay, the engineers create a basic algorithm and train it with data collected from thousands of games played by human chess players.

The AI model will peruse the data and find the similarities between the moves made by the winners. When provided with a new game, the AI will decide which move will most likely lead to a win based on the examples it has previously seen.

While machine learning, and its more advanced subset deep learning, can solve many problems that were previously thought to be out of bounds for computers, they are dependent on vast amounts of quality, annotated training data. This makes their application limited in domains where labeled data is scarce.

This is where reinforcement learning comes into play.

How reinforcement learning works

AlphaStar, an AI developed by DeepMind, used reinforcement learning to master the complex real-time strategy game StarCraft II

Unlike other types of machine learning, reinforcement learning doesn’t require a lot of training examples. Instead, reinforcement learning models are given an environment, a set of actions they can perform, and a goal or a reward they must pursue.
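The three ingredients named above (an environment, a set of actions, and a reward) can be sketched in a few lines. The `Corridor` class and `run_random_episode` function below are hypothetical illustrations, not part of any library: the agent stands in a five-cell corridor, can step left or right, and earns a reward only when it reaches the last cell. The agent here acts purely at random, which is roughly how an untrained agent starts out.

```python
import random


class Corridor:
    """Hypothetical toy environment: cells 0..4, the goal is the rightmost cell."""

    ACTIONS = ("left", "right")

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Put the agent back at the start and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action and return (new_state, reward, done)."""
        move = 1 if action == "right" else -1
        # Walls at both ends: the position is clamped to the corridor.
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_random_episode(env, rng, max_steps=1000):
    """Play one episode with purely random actions; return (total_reward, steps)."""
    env.reset()
    total_reward = 0.0
    for steps in range(1, max_steps + 1):
        action = rng.choice(env.ACTIONS)
        _, reward, done = env.step(action)
        total_reward += reward
        if done:
            return total_reward, steps
    return total_reward, max_steps


total, steps = run_random_episode(Corridor(), random.Random(0))
print(f"reward={total}, steps={steps}")
```

A learning algorithm would replace the random choice with a decision based on what the agent has observed so far.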

The AI agent must take actions that maximize its reward or bring it closer to the goal. At the beginning, the AI knows nothing about the environment and takes random actions, measuring the rewards and recording the quality of each action in something called a Q-table (the core of Q-learning, one of the basic reinforcement learning methods). Basically, a Q-table works like a function: you give it the current state of the environment and an action, and it returns an estimate of the long-term reward that action is expected to produce.

The more training a reinforcement learning model goes through, the more data it gathers from its environment and the more precise its Q-table becomes.

An example of a Q-table used in reinforcement learning (image credit: Wikipedia)

With enough training, a reinforcement learning model will be able to develop a rich Q-table that can predict the best action for each given state.
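A minimal sketch of how such a table could be filled in, assuming the standard Q-learning update rule and a hypothetical five-cell corridor in which the agent is rewarded only for reaching the rightmost cell. The learning rate, discount factor, and exploration rate are assumed values chosen for illustration, and all names are made up for this example.

```python
import random
from collections import defaultdict

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal (hypothetical toy task)
ACTIONS = (-1, +1)  # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # learning rate, discount, exploration (assumed)


def step(state, action):
    """Move inside the corridor; reward 1.0 only when the goal cell is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # the Q-table: (state, action) -> estimated long-term reward
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit the best-known action, breaking ties at random.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # Q-learning target: immediate reward plus discounted best future estimate.
            future = 0.0 if done else GAMMA * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += ALPHA * (reward + future - q[(state, action)])
            state = next_state
    return q


q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(policy)  # the preferred action in each non-goal cell
```

After training, the table should rate stepping right above stepping left in every cell, with values that shrink the farther a cell is from the goal.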

For instance, in the example below, the AI is trying to learn the Atari game Breakout. Its actions include moving the paddle left or right (or doing nothing). If the ball reaches the bottom of the screen, the AI receives the ultimate penalty and the game ends. If it keeps the ball alive, it receives a reward. For every brick it hits, it receives an extra reward, and if it destroys all the bricks, it receives the ultimate reward and wins the game.

As the video shows, in the beginning, the AI makes random decisions, exploring the space and weighing the responses of the environment to its actions. The more it plays the game, the better it becomes at predicting the outcome of its moves and making decisions that are likely to provide the most reward. After playing 600 games, the AI learns that if it pushes the ball to the corner, the ball will get stuck behind the wall and automatically destroy many bricks.

Likewise, the chess-playing reinforcement learning model starts with a clean slate, and is only given the basic rules of moving the pieces and the ultimate goal, which is to drive the opponent into checkmate. At the beginning, the AI knows nothing about the tactics of the game and makes random moves.

But after playing against itself millions of times, it starts to develop a statistical model of which sequences of moves are likely to win in each situation.

Why is this important? Unlike other machine learning techniques, reinforcement learning is not limited by human-labeled data. AlphaZero created and trained on its own data instead of relying on games played by humans. It also means that we can apply reinforcement learning to areas where training data is non-existent, scarce, or limited by regulatory constraints.

Another benefit of reinforcement learning is that the AI is not bound to learn from the way humans work. Therefore, it can come up with totally new ways to solve problems that might not have occurred to humans. This was confirmed by many of the people who observed DeepMind’s AlphaGo beat Go world champion Lee Sedol (note: reinforcement learning was one of the several AI techniques used in AlphaGo, the precursor to AlphaGo Zero and AlphaZero).

What is deep reinforcement learning


Reinforcement learning with Q-tables works great in settings where the states and actions are limited. But for more complex problems, such as open environments where possibilities are virtually limitless, it’s very hard to create a comprehensive Q-table.

To address this issue, researchers came up with the idea of deep reinforcement learning. First introduced by DeepMind, deep reinforcement learning combines concepts from reinforcement learning and deep learning to create AI models that are much more versatile and can learn to solve problems in complex environments where the states are very numerous and information is often incomplete.

Deep reinforcement learning replaces the Q-table with a “deep Q neural network.” You provide the neural network with the current state and it returns a list of possible actions with their predicted rewards.
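To make the contrast with a table concrete, here is a rough sketch of that interface: a state vector goes in, and one predicted value per action comes out. The `TinyQNetwork` class is a hypothetical, untrained two-layer network written in plain Python for illustration. Real systems use far larger networks built with a deep learning framework and learn the weights from experience rather than leaving them random.

```python
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, inputs, relu):
    """Compute weights * inputs + biases, optionally followed by a ReLU."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in outputs] if relu else outputs


class TinyQNetwork:
    """Untrained network: state vector in, one Q-value estimate per action out."""

    def __init__(self, state_size, hidden_size, actions, seed=0):
        rng = random.Random(seed)
        self.actions = actions
        self.hidden = make_layer(state_size, hidden_size, rng)
        self.output = make_layer(hidden_size, len(actions), rng)

    def q_values(self, state):
        hidden = dense(self.hidden, state, relu=True)
        values = dense(self.output, hidden, relu=False)
        return dict(zip(self.actions, values))

    def best_action(self, state):
        values = self.q_values(state)
        return max(values, key=values.get)


net = TinyQNetwork(state_size=4, hidden_size=8, actions=("left", "stay", "right"))
state = [0.2, -0.1, 0.5, 0.0]  # hypothetical features, e.g. paddle and ball positions
print(net.q_values(state))
print(net.best_action(state))
```

Because the network computes its answer from the state's features instead of looking up a stored row, it can produce estimates for states it has never seen, which is what makes very large state spaces tractable.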

In the past year, deep reinforcement learning has been used to master games of varying complexity, including Atari, StarCraft II and Dota 2. AlphaZero and its predecessors also used deep reinforcement learning to master their respective crafts.

Applications of reinforcement learning

Teaching AI to play chess and Go is an interesting scientific challenge, but there’s more to reinforcement learning than mastering games. Today, scientists and researchers are applying reinforcement learning to solve real-world problems.

Robotics is one of the areas where reinforcement learning is very useful. Creating robots that can handle objects is a very complicated task, and something that requires a lot of trial and error.

Dactyl, an AI system developed by research lab OpenAI, used reinforcement learning to teach a robotic hand to handle objects with impressive dexterity (in truth, it’s nowhere near what you would expect from a human, but it’s stunning by robot standards).

Meanwhile, there are multiple efforts aimed at applying reinforcement learning to different domains, such as traffic light management, resource management and personalized recommendations.

However, reinforcement learning can only solve problems that can be broken down into goals and rewards, which limits its usefulness in domains that require general problem-solving rather than optimizing for a single goal.

To work around this limit, researchers are using reinforcement learning in combination with other artificial intelligence techniques. For instance, in DeepMind’s AlphaStar, the AI that mastered the complex real-time strategy game StarCraft II, reinforcement learning was one of multiple AI techniques used.

The challenges of reinforcement learning

Reinforcement learning models require access to huge compute resources, making their access limited to large research labs and companies.

Some people and media outlets compare reinforcement learning with artificial general intelligence (AGI), the kind of AI that can solve abstract and commonsense problems like the human mind.

This couldn’t be further from the truth. Current forms of AI are very different from human intelligence, and no matter how advanced, reinforcement learning suffers from distinct limits.

Reinforcement learning requires vast amounts of compute resources. This limits its use to large tech companies and research labs that either own these resources or can burn cash without worrying about their next round of funding.

For instance, according to DeepMind’s AlphaStar blog post, the company used 16 Google TPU v3 to train each of its agents for 14 days (and this is just one of the several phases of developing the AI). At current pricing rates ($8.00 / TPU hour), the company spent $43,000 to train each AI agent, and according to the paper, there were at least 18 agents, which amounts to $774,000—just for training! (Of course, DeepMind is owned by Google, which means it’s probably going to cost the company much less.)
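The arithmetic behind those figures can be checked directly. The sketch below uses only the numbers quoted above and assumes the TPUs ran around the clock for the full 14 days. The article's $43,000 and $774,000 are rounded versions of the results.

```python
TPUS_PER_AGENT = 16
TRAINING_DAYS = 14
PRICE_PER_TPU_HOUR = 8.00  # USD, the rate quoted in the article
AGENTS = 18

# Assumption: every TPU is billed for 24 hours on each training day.
tpu_hours = TPUS_PER_AGENT * TRAINING_DAYS * 24
cost_per_agent = tpu_hours * PRICE_PER_TPU_HOUR
total_cost = cost_per_agent * AGENTS

print(f"{tpu_hours} TPU hours per agent")
print(f"${cost_per_agent:,.0f} per agent, ${total_cost:,.0f} for {AGENTS} agents")
# prints:
# 5376 TPU hours per agent
# $43,008 per agent, $774,144 for 18 agents
```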

OpenAI’s Dota 2–playing bot, OpenAI Five, consumed 800 petaflop/s-days of compute over 10 months of training. To put that in perspective, Nvidia’s super powerful DGX-2 AI computer, which sells at a whopping $400,000, gives you 2 petaflops. That doesn’t mean that the cost of OpenAI Five’s training was (800 / 2 * $400,000), but it still says a lot about the price of such undertakings.

Another problem with reinforcement learning is that in many cases, designing a suitable reward function is very difficult. In many real-life situations, AI agents must find a balance between different rewards and tradeoffs, and in these situations, reinforcement learning often makes the wrong decision, optimizing for a short-term reward at the expense of the main goal.

For instance, in the example below, the game rewards the AI for hitting checkpoints and collecting powerups. But the AI gets stuck in a loop of accumulating these minor rewards while missing the ultimate goal, which is to win the race.
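A small calculation shows how this can happen. The sketch below compares the discounted return of two strategies under entirely hypothetical numbers: finishing the race for a one-time bonus, or circling through a respawning checkpoint for a small reward every few steps. None of these values come from the game in the example; they are only chosen to show how a poorly balanced reward function can make looping the better-scoring choice.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma for every step of delay."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


GAMMA = 0.99             # discount factor (assumed)
HORIZON = 200            # steps in one episode (assumed)
CHECKPOINT_REWARD = 1.0  # hypothetical reward for touching a checkpoint
FINISH_REWARD = 10.0     # hypothetical bonus for winning the race
STEPS_TO_FINISH = 20     # hypothetical length of the direct route

# Strategy A: drive straight to the finish line; the episode ends there.
finish = [0.0] * (STEPS_TO_FINISH - 1) + [FINISH_REWARD]

# Strategy B: loop through the same checkpoint every 5 steps until time runs out.
loop = [CHECKPOINT_REWARD if (t + 1) % 5 == 0 else 0.0 for t in range(HORIZON)]

finish_return = discounted_return(finish, GAMMA)
loop_return = discounted_return(loop, GAMMA)
print(f"finish the race: {finish_return:.2f}")
print(f"loop forever:    {loop_return:.2f}")
```

With these numbers the loop earns roughly twice the return of finishing, so an agent that maximizes reward would be behaving correctly by never completing the race. Raising the finish bonus or removing the respawning checkpoint would change the outcome, which is the kind of tuning a reward designer has to do by hand.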

So while reinforcement learning obviates the need for collecting labeled training data, it requires other kinds of human-led efforts, such as tuning the AI model for the right amount of exploring its environment vs exploiting local rewards.
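One common way to manage the explore-versus-exploit balance is an epsilon-greedy rule with a decaying exploration rate. The sketch below is a generic illustration rather than the method used by any system mentioned in this article, and the function names, starting rate, floor, and decay factor are all assumed values that a practitioner would have to tune.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Explore a lot early on, then settle at a small floor (assumed schedule)."""
    return max(end, start * decay ** episode)


rng = random.Random(0)
q_values = {"left": 0.1, "stay": 0.0, "right": 0.7}  # hypothetical estimates
for episode in (0, 100, 500):
    eps = decayed_epsilon(episode)
    picks = [epsilon_greedy(q_values, eps, rng) for _ in range(1000)]
    share_best = picks.count("right") / len(picks)
    print(f"episode {episode}: epsilon={eps:.2f}, "
          f"best action chosen {share_best:.0%} of the time")
```

If the rate decays too quickly, the agent may lock onto a mediocre local reward. If it decays too slowly, the agent wastes time on random moves. Picking the schedule is part of the human effort described above.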

We’re still very far from self-learning, general problem–solving AI models. But each new innovation brings us closer.
