DeepMind RL method promises better co-op between AI and humans


This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

From Go to StarCraft to Dota, artificial intelligence researchers are creating reinforcement learning systems that can defeat human experts at complicated games. But a bigger challenge for AI is creating RL systems that can team up with humans instead of competing against them.

In a new paper, AI researchers at DeepMind present a technique that improves the ability of reinforcement learning agents to cooperate with humans of different skill levels. Accepted at the annual NeurIPS conference, the technique, called Fictitious Co-Play (FCP), does not require human-generated data to train the RL agents.

When tested with the puzzle-solving game Overcooked, FCP created RL agents that provided better results and caused less confusion when teamed up with humans. The findings can provide important directions for future research in human-AI systems.

Training reinforcement learning agents

Reinforcement learning agents can tirelessly train on any task with well-defined rewards, actions, and states. Given enough computation power and time, an RL agent explores its environment and learns sequences of actions—or “policies”—that maximize its rewards. Reinforcement learning has proven to be very effective at playing games.
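To make the states-actions-rewards loop concrete, here is a minimal, self-contained sketch of reward-driven learning: tabular Q-learning on a toy five-state corridor with a reward only at the goal. This is purely illustrative of the concepts above; the DeepMind paper uses deep RL at far larger scale, and none of these names come from it.

```python
import random

# Toy illustration of states, actions, rewards, and a learned policy:
# tabular Q-learning on a five-state corridor. Reward arrives only when
# the agent reaches the rightmost state.

random.seed(0)
N_STATES = 5
ACTIONS = [0, 1]                      # 0 = move left, 1 = move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2     # learning rate, discount, exploration

def step(s, a):
    """Deterministic corridor: entering the last state ends the episode."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1

for _ in range(500):                  # episodes of trial and error
    s, done = 0, False
    while not done:
        if random.random() < eps:     # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # Standard Q-learning update toward the bootstrapped target.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The greedy policy recovered from rewards alone: move right in every
# non-terminal state.
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)]
```

Nothing about this loop cares how a teammate plays, which is exactly why agents trained this way can maximize reward yet still baffle human partners.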

But often, RL agents learn policies that are not compatible with human gameplay. When teamed up with humans, they perform actions that confuse their co-players, making it difficult to use them in applications that require co-planning and division of labor between participants. Bridging the gap between AI and human gameplay has become an important challenge for the AI community.

Researchers are seeking ways to create versatile reinforcement learning agents that can adapt to the habits of various partners, including other RL agents and humans.

Different ways to train reinforcement learning agents

The classic way of training RL agents for games is self-play (SP), in which the agent continuously plays against copies of itself. SP can be very efficient at quickly learning policies that maximize the game’s reward, but the resulting RL model overfits to its own gameplay. It is terrible at cooperating with players trained any other way.

Another training method, population play (PP), trains the RL agent alongside a diverse set of partners with different parameters and architectures. PP agents fare much better than self-play agents when paired with humans in competitive games. But they still lack the diversity needed for common-payoff settings, where players must solve problems together and coordinate their tactics as the environment changes.

An alternative is behavioral cloning play (BCP), which uses human-generated data to train RL agents. Instead of starting by randomly exploring their environments, BCP models tune their parameters to data collected from human-played games. These agents develop behaviors that are closer to gameplay patterns found in humans. If the data is collected from a diverse set of users with different skill levels and playing styles, the agents can become more flexible in adapting to teammate behavior. Therefore, they are more likely to be compatible with human players. However, generating human data is challenging, especially since reinforcement learning models often need inhuman amounts of gameplay to reach optimal settings.
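The three baselines above differ mainly in where each training episode’s partner comes from. A rough sketch of that contrast (all function names and data structures here are illustrative placeholders, not DeepMind’s code):

```python
import random

# Illustrative contrast of how SP, PP, and BCP pick a training partner
# for each episode. Everything here is a placeholder sketch.

rng = random.Random(0)

def self_play_partner(learner):
    # SP: the partner is always the learner itself (or a copy), so the
    # learner only ever sees its own conventions.
    return learner

def population_play_partner(learner, population):
    # PP: the partner is drawn from a population of co-trained agents
    # with varied parameters and architectures.
    return rng.choice(population)

def behavioral_cloning_partner(human_games):
    # BCP: the partner is a policy fit to human gameplay data, so the
    # learner trains against human-like behavior -- at the cost of
    # collecting that data in the first place.
    return {"policy": "behavioral_clone", "episodes_fit_on": len(human_games)}

learner = "learner"
population = [learner, "peer_a", "peer_b", "peer_c"]
partner = population_play_partner(learner, population)
```

Seen this way, FCP’s contribution is a fourth answer to the same question: build the partner pool synthetically, without human data.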

Fictitious Co-Play

The main idea behind fictitious co-play (FCP), DeepMind’s new reinforcement learning technique, is to create agents that can assist players with different styles and skill levels without relying on human-generated data.

FCP training takes place in two stages. First, DeepMind’s researchers created a set of self-play RL agents, trained independently and from different initial conditions. As a result, the agents converged on different parameter settings, forming a diverse pool of RL agents. To diversify the skill level of the agent pool, the researchers saved snapshots of each agent at several stages of the training process.

“The final checkpoint represents a fully-trained ‘skillful’ partner, while earlier checkpoints represent less skilled partners. Notably, by using multiple checkpoints per partner, this additional diversity in skill incurs no extra training cost,” the researchers note in the paper.

In the second stage, a new RL model is trained with all the agents in the pool as its partners. This way, the new agent must tune its policy to be able to cooperate with partners that have different parameter values and skill levels. “FCP agents are prepared to follow the lead of human partners, and learn a general policy across a range of strategies and skills,” the DeepMind researchers write.
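The two-stage recipe can be sketched as follows. Everything here is an illustrative stand-in under the assumption of a generic self-play trainer: `Agent`, `train_step`, and the pool sizes are made up for the sketch, not taken from DeepMind’s implementation, which trains deep RL agents at much larger scale.

```python
import random

# Hedged sketch of Fictitious Co-Play's two stages. `Agent` and
# `train_step` are toy placeholders for a real policy and trainer.

class Agent:
    """Stand-in for an RL policy; `params` stands in for network weights."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.params = self.rng.random()

    def snapshot(self):
        frozen = Agent(0)
        frozen.params = self.params   # copy current weights
        return frozen

def train_step(agent):
    # Placeholder for one chunk of self-play training.
    agent.params += agent.rng.random() * 0.1

def build_partner_pool(n_agents=4, n_checkpoints=3, steps_per_ckpt=5):
    """Stage 1: train self-play agents from different seeds and keep
    checkpoints along the way -- earlier checkpoints are less skilled,
    the final one is fully trained, at no extra training cost."""
    pool = []
    for seed in range(n_agents):
        agent = Agent(seed)
        for _ in range(n_checkpoints):
            for _ in range(steps_per_ckpt):
                train_step(agent)
            pool.append(agent.snapshot())
    return pool

def train_fcp_agent(pool, episodes=100):
    """Stage 2: train a fresh agent whose partner each episode is drawn
    from the frozen pool, forcing one policy that works across many
    styles and skill levels."""
    fcp = Agent(seed=999)
    rng = random.Random(0)
    for _ in range(episodes):
        partner = rng.choice(pool)    # partner weights stay frozen
        # ...run a cooperative episode with (fcp, partner), update fcp...
        train_step(fcp)
    return fcp

pool = build_partner_pool()           # 4 agents x 3 checkpoints = 12 partners
fcp_agent = train_fcp_agent(pool)
```

The key design choice is that only the stage-two agent keeps learning; the pooled checkpoints are frozen, so diversity in partners comes for free from stage one.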

Putting FCP to the test

DeepMind’s AI researchers applied FCP to Overcooked, a puzzle-solving game in which the players must move around a grid world, interact with objects, and perform a series of steps to cook and deliver a meal. Overcooked is interesting because it has very simple dynamics but at the same time requires coordination and distribution of labor between teammates.

To test FCP, DeepMind simplified Overcooked to include a subset of the tasks performed in the full game. The AI researchers also included a carefully selected range of maps that presented various challenges such as forced coordination and cramped spaces.

DeepMind used a simplified version of Overcooked to test reinforcement learning with Fictitious Co-Play

The researchers trained sets of SP, PP, BCP, and FCP agents. To compare their performance, they first paired each type of RL agent with three populations of teammates: a BC model trained on human gameplay data, a set of self-play agents trained to different skill levels, and randomly initialized agents representing low-skilled players. They measured performance as the number of meals delivered over an equal number of episodes.
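The evaluation amounts to crossing each trained agent type with each held-out partner population and scoring every pairing. A schematic of that grid, where `play_episodes` is a stand-in for actually rolling out Overcooked episodes (the population names are paraphrases of the article’s descriptions, not identifiers from the paper):

```python
# Schematic of the zero-shot evaluation grid: every trained agent type
# is paired with every held-out partner population and scored by meals
# delivered. `play_episodes` is a placeholder, not a real rollout.

def play_episodes(agent_type, partner_population, n_episodes=10):
    # In the study this would run Overcooked episodes with the pair and
    # count delivered meals; here it just returns a dummy score.
    return 0.0

agent_types = ["SP", "PP", "BCP", "FCP"]
partner_populations = [
    "bc_human_proxy",        # BC model trained on human gameplay
    "selfplay_mixed_skill",  # self-play agents at different skill levels
    "random_init",           # randomly initialized, low-skill partners
]

scores = {
    (agent, pop): play_episodes(agent, pop)
    for agent in agent_types
    for pop in partner_populations
}
```

Because none of the partner populations appear during an agent’s own training, the grid measures zero-shot coordination rather than memorized conventions.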

Their findings show that FCP outperforms all other types of RL agents by a significant margin, suggesting that it generalizes well across skill levels and play styles. A further, surprising finding was how brittle the other training methods are. “This suggests that they may not perform well with humans who are not highly skilled players,” the researchers write.

FCP outperforms other methods for training reinforcement learning agents

They then tested how each type of RL agent performed when teamed up with human players. The researchers conducted an online study with 114 human players, each of whom played 20 games. In each episode, the players were placed in a random kitchen and teamed up with one of the RL agents without knowing which type it was.

According to the results of DeepMind’s experiments, human players paired with the FCP agent outperformed every other human-agent pairing.

After every two episodes, the participants rated their experience with the RL agents on a scale of 1 to 5. The participants showed a clear preference for FCP over the other agents, and their feedback indicated that FCP’s behavior was much more coherent, predictable, and adaptable. For example, the agent appeared to track its teammate’s actions and avoid confusion by settling into a specific role in each kitchen.

On the other hand, the participants in the survey described the actions of other reinforcement learning agents as “chaotic” and difficult to adapt to.

DeepMind teamed up human players with different reinforcement learning agents

More work to be done

The researchers point out some of the limits of their work in the paper. For example, the FCP agent was trained with a pool of 32 reinforcement learning partners, which is enough for the watered-down version of Overcooked but could fall short in more complex environments. “For more complex games, FCP may require an unrealistically large partner population size to represent sufficiently diverse strategies,” DeepMind’s researchers write.

Reward definition is another challenge that can limit the use of FCP in complicated domains. In Overcooked, the reward is simple and well-defined. In other environments, RL agents must accomplish subgoals before they obtain the main reward. The way they achieve those subgoals needs to be compatible with how their human counterparts play, which is difficult to assess and adjust without human data. “If a task’s reward function is poorly aligned with how humans approach the task, our method may well produce subpar partners, as would any method without access to human data,” the researchers write.

DeepMind’s research is part of broader work being done on human-AI collaboration. A recent study by scientists at MIT explored the limits of reinforcement learning agents in playing the card game Hanabi with human teammates.

DeepMind’s new reinforcement learning technique is a step toward bridging the gap between human and AI problem-solving. The researchers hope to “establish a strong foundation for future research on the important challenge of human-agent collaboration for benefiting society.”
