Blog

Is DeepMind’s new reinforcement learning system a step toward general AI?

August 2, 2021

DeepMind reinforcement learning AI XLand

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

One of the key challenges of deep reinforcement learning models—the kind of AI systems that have mastered Go, StarCraft 2, and other games—is their inability to generalize their capabilities beyond their training domain. This limit makes it very hard to apply these systems to real-world settings, where situations are much more complicated and unpredictable than the environments where AI models are trained.

But scientists at AI research lab DeepMind claim to have taken the “first steps to train an agent capable of playing many different games without needing human interaction data,” according to a blog post about their new “open-ended learning” initiative. Their new project includes a 3D environment with realistic dynamics and deep reinforcement learning agents that can learn to solve a wide range of challenges.

The new system, according to DeepMind’s AI researchers, is an “important step toward creating more general agents with the flexibility to adapt rapidly within constantly changing environments.”

The paper’s findings show some impressive advances in applying reinforcement learning to complicated problems. But they are also a reminder of how far current systems are from achieving the kind of general intelligence capabilities that the AI community has been coveting for decades.

The brittleness of deep reinforcement learning

The key advantage of reinforcement learning is its ability to develop behavior by taking actions and getting feedback, similar to the way humans and animals learn by interacting with their environment. Some scientists describe reinforcement learning as “the first computational theory of intelligence.”

The combination of reinforcement learning and deep neural networks, known as deep reinforcement learning, has been at the heart of many advances in AI, including DeepMind’s famous AlphaGo and AlphaStar models. In both cases, the AI systems were able to outmatch human world champions at their respective games.

But reinforcement learning systems are also notoriously renowned for their lack of flexibility. For example, a reinforcement learning model that can play StarCraft 2 at an expert level won’t be able to play a game with similar mechanics (e.g., Warcraft 3) at any level of competency. Even slight changes to the original game will considerably degrade the AI model’s performance.

“These agents are often constrained to play only the games they were trained for – whilst the exact instantiation of the game may vary (e.g. the layout, initial conditions, opponents) the goals the agents must satisfy remain the same between training and testing. Deviation from this can lead to catastrophic failure of the agent,” DeepMind’s researchers write in a paper that provides the full details on their open-ended learning.

Humans, on the other hand, are very good at transferring knowledge across domains.

The XLand environment

The goal of DeepMind’s new project was to create “an artificial agent whose behaviour generalises beyond the set of games it was trained on.”

To this end, the team created XLand, an engine that can generate 3D environments composed of static topology and moveable objects. The game engine simulates rigid-body physics and allows players to use the objects in various ways (e.g., create ramps, block paths, etc.).

XLand is a rich environment in which you can train agents on a virtually unlimited number of tasks. One of the main advantages of XLand is the capability to use programmatic rules to automatically generate a vast array of environments and challenges to train AI agents. This addresses one of the key challenges of machine learning systems, which often require vast amounts of manually curated training data.

According to the blog post, the researchers created “billions of tasks in XLand, across varied games, worlds, and players.” The games include very simple goals such as finding objects to more complex settings in which the AI agents much weigh the benefits and tradeoffs of different rewards. Some of the games include cooperation or competition elements involving multiple agents.

Deep reinforcement learning

DeepMind uses deep reinforcement learning and a few clever tricks to create AI agents that can thrive in the XLand environment.

The reinforcement learning model of each agent receives a first-person view of the world, the agent’s physical state (e.g., whether it holding an object), and its current goal. Each agent finetunes the parameters of its policy neural network to maximize its rewards on the current task. The neural network architecture contains an attention mechanism to ensure the agent can balance optimization for the subgoals required to accomplish the main goal.

Once the agent masters its current challenge, the computational task generator creates a new challenge for the agent. Each new task is generated according to the agent’s training history and in a way to help distribute the agent’s skills across a vast range of challenges.

DeepMind also used its vast computational resources (courtesy of its owner Alphabet Inc.) to train a large population of agents in parallel and transfer learned parameters across different agents to improve the general capabilities of the reinforcement learning systems.

DeepMind XLand agent training — DeepMind uses a multi-step and population-based mechanism to train many reinforcement learning agents

The performance of the reinforcement learning agents was evaluated based on their general ability to accomplish a wide range of tasks they had not been trained on. Some of the test tasks include well-known challenges such as “capture the flag” and “hide and seek.”

According to DeepMind, each agent played around 700,000 unique games in 4,000 unique worlds within XLand and went through 200 billion training steps across 3.4 million unique tasks (in the paper, the researchers write that 100 million steps are equivalent to approximately 30 minutes of training).

“At this time, our agents have been able to participate in every procedurally generated evaluation task except for a handful that were impossible even for a human,” the AI researchers wrote. “And the results we’re seeing clearly exhibit general, zero-shot behaviour across the task space.”

Zero-shot machine learning models can solve problems that were not present in their training dataset. In a complicated space such as XLand, zero-shot learning might imply that the agents have obtained fundamental knowledge about their environment as opposed to memorizing sequences of image frames in specific tasks and environments.

The reinforcement learning agents further manifested signs of generalized learning when the researchers tried to adjust them for new tasks. According to their findings, 30 minutes of fine-tuning on new tasks was enough to create an impressive improvement in a reinforcement learning agent trained with the new method. In contrast, an agent trained from scratch for the same amount of time would have near-zero performance on most tasks.

High-level behavior

According to DeepMind, the reinforcement learning agents exhibit the emergence of “heuristic behavior” such as tool use, teamwork, and multi-step planning. If proven, this can be an important milestone. Deep learning systems are often criticized for learning statistical correlations instead of causal relations. If neural networks could develop high-level notions such as using objects to create ramps or cause occlusions, it could have a great impact on fields such as robotics and self-driving cars, where deep learning is currently struggling.

But those are big ifs, and DeepMind’s researchers are cautious about jumping to conclusions on their findings. “Given the nature of the environment, it is difficult to pinpoint intentionality — the behaviours we see often appear to be accidental, but still we see them occur consistently,” they wrote in their blog post.

But they are confident that their reinforcement learning agents “are aware of the basics of their bodies and the passage of time and that they understand the high-level structure of the games they encounter.”

Such fundamental self-learned skills are another one of the highly sought goals of the artificial intelligence community.

Theories of intelligence

Some of DeepMind’s top scientists published a paper recently in which they hypothesize that a single reward and reinforcement learning are enough to eventually reach artificial general intelligence (AGI). An intelligent agent with the right incentives can develop all kinds of capabilities such as perception and natural language understanding, the scientists believe.

Although DeepMind’s new approach still requires the training of reinforcement learning agents on multiple engineered rewards, it is in line with their general perspective of achieving AGI through reinforcement learning.

“What DeepMind shows with this paper is that a single RL agent can develop the intelligence to reach many goals, rather than just one,” Chris Nicholson, CEO of Pathmind, told TechTalks. “And the skills it learns in accomplishing one thing can generalize to other goals. That is very similar to how human intelligence is applied. For example, we learn to grab and manipulate objects, and that is the foundation of accomplishing goals that range from pounding a hammer to making your bed.”

Nicholson also believes that other aspects of the paper’s findings hint at progress toward general intelligence. “Parents will recognize that open-ended exploration is precisely how their toddlers learn to move through the world. They take something out of a cupboard, and put it back in. They invent their own small goals—which may seem meaningless to adults—and they master them,” he said. “DeepMind is programmatically setting goals for its agents within this world, and those agents are learning how to master them one by one.”

The reinforcement learning agents have also shown signs of developing embodied intelligence in their own virtual world, Nicholson said, like the kind humans have. “This is one more indication that the rich and malleable environment that people learn to move through and manipulate is conducive to the emergence of general intelligence, and that the biological and physical analogies of intelligence can guide further work in AI,” he said.

Sathyanaraya Raghavachary, Associate Professor of Computer Science at the University of Southern California, is a bit more skeptical on the claims made in DeepMind’s paper, especially the conclusions on proprioception, awareness of time, and high-level understanding of goals and environments.

“Even we humans are not fully aware of our bodies, let alone those VR agents,” Raghavachary said in comments to TechTalks, adding that perception of the body requires an integrated brain that is co-designed for suitable body awareness and situatedness in space. “Same with the passage of time—that too would require a brain that has memory of the past, and a sense for time in relation to that past. What they (paper authors) might mean relates to the agents’ tracking progressive changes in the environment resulting from their actions (eg. as a resulting of moving a purple pyramid), state changes which the underlying physics simulator would generate.

Raghavachary also points out, if the agents could understand the high-level structure of their tasks, they would not need 200 billion steps of simulated training to reach optimal results.

“The underlying architecture lacks what it takes, to achieve these three things (body awareness, time passage, understanding high-level task structure) they point out in conclusion,” he said. “Overall, XLand is simply ‘more of the same.’”

The gap between simulation and the real world

In a nutshell, the paper proves that if you can create a complex enough environment, design the right reinforcement learning architecture, and expose your models to enough experience (and have a lot of money to spend on compute resources), you’ll be able to generalize to various kinds of tasks in the same environment. And this is basically how natural evolution has delivered human and animal intelligence.

In fact, DeepMind has already done something similar with AlphaZero, a reinforcement learning model that managed to master multiple two-player turn-based games. The XLand experiment has extended the same notion to a much greater level by adding the zero-shot learning element.

But while I think that the experience from the XLand-trained agents will ultimately be transferable to real-world applications such as robotics and self-driving cars, I don’t think it will be a breakthrough. You’ll still need to make compromises (such as creating artificial limits to reduce the complexity of the real world) or create artificial enhancements (such as imbuing the machine learning models with prior knowledge or extra sensors).

DeepMind’s reinforcement learning agents might have become the masters of the virtual XLand. But their simulated world doesn’t even have a fraction of the intricacies of the real world. That gap will continue to remain a challenge for a long time.

2 COMMENTS

dave August 3, 2021 at 4:04 pm

I have a question about AI’s ability to understand 3d images. Do have to train AI so that it knows what the object has on different sides, or do you have to show the sides to the AI. Does a baby know that a box has 4 sides without seeing them?

Loading...

matter of time August 22, 2021 at 1:53 pm

DeepMind’s reinforcement learning agents might have become the masters of the virtual XLand. But their simulated world doesn’t even have a fraction of the intricacies of the real world. That gap will continue to remain a challenge for a long time.

Well, that is what go players said when computer wins the chess champion. go is too complicated for computer… it will take decades or even hundred years to win go. but it took 19 years. if it is just about complexity and scale, it won’t take long.

Loading...

Moving beyond passive RAG: How to implement active memory reconstruction for…

How self-improving harnesses are rewriting the agent engineering playbook

How Nvidia’s ASPIRE framework accelerates robot programming with self-improving AI

How the AI arms race moved from smart models to full-stack…

Why LLMs should stop thinking out loud (and what comes after…

Applied ML: When ‘perfect’ becomes the enemy of ‘good’

AI can’t replace software engineers yet, but here is how to…

How to turbocharge your product and market research with DeepSearch

How looking differently at data can save your machine learning project

Building a solid data foundation for generative AI applications

Demystifying loop engineering: Get more from AI agents, avoid loopmaxxing

Why the future of agentic AI is all about the harness

The evolution of LLM tool-use from API calls to agentic applications

What makes DeepSeek-V3.2 so efficient?

What to know about Claude Opus 4.5

AI is writing your code, but who’s reviewing it?

Machine learning in space: Building intelligent systems for the harshest environments

Decoding the brain, inspiring AI: How Rahul Biswas is bridging neuroscience…

The cash flow conundrum: How technology is reshaping small business finance

What to know about the security of open-source machine learning models

Is DeepMind’s new reinforcement learning system a step toward general AI?

The brittleness of deep reinforcement learning

The XLand environment

Deep reinforcement learning

High-level behavior

Theories of intelligence

The gap between simulation and the real world

Like this:

2 COMMENTS

Leave a ReplyCancel reply

The brittleness of deep reinforcement learning

The XLand environment

Deep reinforcement learning

High-level behavior

Theories of intelligence

The gap between simulation and the real world

Like this:

2 COMMENTS

Leave a ReplyCancel reply

Discover more from TechTalks