System 2 deep learning: The next step toward artificial general intelligence

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

Say you’ve been driving on the roads of Phoenix, Arizona, all your life, and then you move to New York. Do you need to learn to drive all over again? Probably not. You just have to drive a bit more cautiously and adapt yourself to the new environment.

The same can’t be said about deep learning algorithms, the cutting edge of artificial intelligence, which are also one of the main components of autonomous driving. Despite having propelled the field of AI forward in recent years, deep learning, and its underlying technology, deep neural networks, suffer from fundamental problems that prevent them from replicating some of the most basic functions of the human brain.

These challenges of deep learning are well known, and a growing number of scientists acknowledge that they could become serious hurdles for the future of AI.

At this year’s Conference on Neural Information Processing Systems (NeurIPS 2019), Yoshua Bengio, one of the three pioneers of deep learning, delivered a keynote speech that shed light on possible directions that can bring us closer to human-level AI. Titled “From System 1 Deep Learning to System 2 Deep Learning,” Bengio’s presentation is very technical and draws on research he and others have done in recent years.

Where does deep learning stand today?

“Some people think it might be enough to take what we have and just grow the size of the dataset, the model sizes, computer speed—just get a bigger brain,” Bengio said in his opening remarks at NeurIPS 2019.

This simple sentence succinctly represents one of the main problems of current AI research. Artificial neural networks have proven to be very effective at detecting patterns in large sets of data, and they can do it in a scalable way. Increasing the size of neural networks and training them on larger sets of annotated data will, in most cases, improve their accuracy (albeit with diminishing, roughly logarithmic returns).

This characteristic has created a sort of “bigger is better” mentality, pushing some AI researchers to seek improvements and breakthroughs by creating larger and larger AI models and datasets.

While, arguably, size is a factor and we still don’t have any neural network that matches the human brain’s 100-billion-neuron structure, current AI systems suffer from flaws that will not be fixed by making them bigger.

“We have machines that learn in a very narrow way. They need much more data to learn tasks than human examples of intelligence,” Bengio said.

For instance, an AI system trained to play a board or video game will not be able to do anything else, not even play another game that is slightly different. Also, in most cases, deep learning algorithms need millions of examples to learn tasks. An example is OpenAI’s Dota-playing neural networks, which required 45,000 years’ worth of gameplay before being able to beat the world champions, far more than any one human (or ten, or a hundred) can play in a lifetime. Aristo, a system developed by the Allen Institute for AI, needed 300 gigabytes of scientific articles and knowledge graphs to be able to answer eighth-grade-level multiple-choice science questions.

Finally, Bengio remarks that current deep learning systems “make stupid mistakes” and are “not very robust to changes in distribution.” This is one of the principal concerns about current AI systems. Neural networks are vulnerable to adversarial examples, small perturbations to input data that cause the AI system to behave erratically.

Adversarial vulnerabilities are hard to plug and can be especially damaging in sensitive domains, where errors can have fatal consequences.
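
To make the idea of an adversarial perturbation concrete, here is a minimal sketch. It is not taken from Bengio's talk: the linear classifier, its weights, and the input are made up for illustration. Each feature is nudged by a small amount in the direction that pushes the score across the decision boundary, which is the same principle that gradient-sign attacks apply to neural networks.

```python
# Hypothetical linear classifier; the weights are made up for illustration.
WEIGHTS = [0.9, -0.6, 0.4, -0.8]
BIAS = 0.0

def score(x):
    """Weighted sum of the features plus the bias."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS

def predict(x):
    """Class 1 if the score is positive, otherwise class 0."""
    return 1 if score(x) > 0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def perturb(x, epsilon):
    """Move every feature by at most epsilon toward the opposite class.

    For a linear model the gradient of the score with respect to the input
    is the weight vector, so stepping along the sign of each weight changes
    the score as fast as possible for a given per-feature budget.
    """
    direction = -1 if predict(x) == 1 else 1
    return [xi + direction * epsilon * sign(w) for xi, w in zip(x, WEIGHTS)]

x = [0.5, 0.4, 0.3, 0.2]
x_adv = perturb(x, epsilon=0.1)

print("original prediction: ", predict(x))
print("perturbed prediction:", predict(x_adv))
print("largest feature change:", max(abs(a - b) for a, b in zip(x, x_adv)))
```

No feature moves by more than 0.1, yet the predicted class changes. Real attacks on deep networks are more involved, but they exploit the same sensitivity.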

Moving from system 1 to system 2 deep learning

Despite their limits, current deep learning technologies replicate one of the underlying components of natural intelligence, which Bengio refers to as “system 1” cognition.

“System 1 are the kinds of things that we do intuitively, unconsciously, that we can’t explain verbally, in the case of behavior, things that are habitual,” Bengio said. “This is what current deep learning is good at.”

Bengio’s description of what deep learning can do is in line with what other thought leaders in the field have said. “If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future,” Andrew Ng, co-founder of Coursera and former head of Baidu AI and Google Brain, wrote in an essay for Harvard Business Review in 2016.

Deep learning has already created many useful system 1 applications, especially in the domain of computer vision. AI algorithms now perform tasks like image classification, object detection and facial recognition with accuracy that often exceeds that of humans. Voice recognition and speech-to-text are other domains where current deep learning systems perform very well.

But there are limits to how well system 1 works, even in areas where deep learning has made substantial progress.

Here’s how Bengio explains the difference between system 1 and system 2: Imagine driving in a familiar neighborhood. You can usually navigate the area subconsciously, using visual cues that you’ve seen hundreds of times. You don’t need to follow directions. You might even carry out a conversation with other passengers without focusing too much on your driving.

But when you move to a new area, where you don’t know the streets and the sights are new, you must focus more on the street signs, use maps and get help from other indicators to find your destination.

The latter scenario is where your system 2 cognition comes into play. It helps humans generalize previously gained knowledge and experience to new settings. “What’s going on there is you’re generalizing in a more powerful way and you’re doing it in a conscious way that you can explain,” Bengio said at NeurIPS.

“The kinds of things we do with system 2 include programming. So we come up with algorithms, recipes, we can plan, reason, use logic,” Bengio says. “Usually, these things are very slow if you compare to what computers do for some of these problems. These are the things that we want future deep learning to do as well.”

Not a return to symbolic AI

The limits and challenges of deep learning are well documented. In the past couple of years, there have been many discussions in this regard, and there are various efforts to solve individual problems, such as creating AI systems that are explainable and less data-hungry.

Some of the initiatives in the field involve the use of elements of symbolic artificial intelligence, the rule-based approach that dominated the field of AI before the rise of deep learning. An example is the Neuro-Symbolic Concept Learner (NSCL), a hybrid AI system developed by researchers at MIT and IBM.

But Bengio stressed that he does not plan to revisit symbolic AI. “Some people think we need to invent something completely new to face these challenges, and maybe go back to classical AI to deal with things like high-level cognition,” Bengio said, adding that “there’s a path from where we are now, extending the abilities of deep learning, to approach these kinds of high-level questions of cognitive system 2.”

Bengio stands firmly by his position against returning to rule-based AI. In fact, at one point in the speech, he used the word “rule,” and then quickly clarified that he didn’t mean it in the sense used in symbolic AI. At the end of his speech, when one of the participants described his solution as a “hybrid” approach to AI, he again clarified that he is not proposing a solution that combines symbolic and connectionist AI.

Bengio had voiced similar thoughts to Martin Ford, the author of Architects of Intelligence, a compilation of interviews with leading AI scientists. “Note that your brain is all neural networks. We have to come up with different architectures and different training frameworks that can do the kinds of things that classical AI was trying to do, like reasoning, inferring an explanation for what you’re seeing and planning,” Bengio said to Ford in 2018.

In his NeurIPS speech, Bengio laid out the reasons why symbolic AI and hybrid systems can’t help toward achieving system 2 deep learning.

Intelligent systems should be able to generalize efficiently and on a large scale. Machine learning systems can scale with the availability of compute resources and data. In contrast, symbolic AI systems require human engineers to manually specify the rules of their behavior, which has become a serious bottleneck in the field.

They should also be able to handle the uncertainties and messiness of the world, which is an area where machine learning outperforms symbolic AI.

What are the requirements of system 2 deep learning?

“When you learn a new task, you want to be able to learn it with very little data,” Bengio said. For instance, when you put on a pair of sunglasses, the input your visual system receives becomes very different. But you quickly adapt and continue to process the information. Current AI systems, in contrast, often need to be retrained when even slight changes are made to their environment.

To replicate this behavior, AI systems need to discover and handle high-level representations in their data and environments. “We want to have machines that understand the world, that build good world models, that understand cause and effect, and can act in the world to acquire knowledge,” Bengio said.

In his speech, Bengio provided guidelines on how you can improve deep learning systems to achieve system 2 capabilities. The details are very technical and refer to several research papers and projects in the past couple of years. But some of the recurring themes in his speech give us hints on what the next steps can be.

Out-of-distribution (OOD) generalization is key to the future of deep learning

Current machine learning systems are based on the hypothesis of independent and identically distributed (IID) data. Basically, machine learning algorithms perform best when their training and test data are drawn from the same distribution. This is an assumption that can work well in simple settings like flipping coins and throwing dice.

But the real world is messy, and data rarely follows the same distribution everywhere and at all times. That’s why machine learning engineers usually gather as much data as they can, shuffle it so that the training and test sets follow the same distribution, and then split it into train and test sets.

“When we do that, we destroy important information about those changes in distribution that are inherent in the data we collect,” Bengio said. “Instead of destroying that information, we should use it in order to learn how the world changes.”

Intelligent systems should be able to generalize to different distributions in data, just as human children learn to adapt as their bodies and the environment around them change. “We need systems that can handle those changes and do continual learning, lifelong learning and so on,” Bengio said in his NeurIPS speech. “This is a long-standing goal for machine learning, but we haven’t yet built a solution to this.”
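
The effect of shuffling can be seen in a small synthetic experiment. The sketch below is an illustration under stated assumptions, not something presented in the talk: the two "environments," the size of the shift between them, and the threshold classifier are all invented for the example. It compares the usual pool-shuffle-split evaluation with one that keeps the environments apart.

```python
import random

def make_environment(n, shift, rng):
    """Sample (x, label) pairs; `shift` moves every measurement in this environment."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label + shift, 0.5)
        data.append((x, label))
    return data

def fit_threshold(train):
    """Place the decision threshold midway between the two class means."""
    xs0 = [x for x, label in train if label == 0]
    xs1 = [x for x, label in train if label == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(threshold, data):
    """Fraction of points where 'x above threshold' agrees with the label."""
    hits = sum((x > threshold) == (label == 1) for x, label in data)
    return hits / len(data)

rng = random.Random(0)
env_a = make_environment(2000, 0.0, rng)  # e.g. data collected in one city
env_b = make_environment(2000, 1.5, rng)  # same task, shifted measurements

# Usual practice: pool everything, shuffle, then split into train and test.
pooled = env_a + env_b
rng.shuffle(pooled)
train, test = pooled[:3000], pooled[3000:]
shuffled_acc = accuracy(fit_threshold(train), test)

# Keeping the environments apart: train on A, evaluate on unseen B.
ood_acc = accuracy(fit_threshold(env_a), env_b)

print(f"shuffled split:        {shuffled_acc:.2f}")
print(f"train on A, test on B: {ood_acc:.2f}")
```

In this toy setup the shuffled split reports a noticeably higher accuracy than the train-on-A, test-on-B evaluation. The shuffled number says little about how the model would cope with an environment it has never seen, which is the information Bengio argues should be preserved rather than destroyed.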

Attention and compositionality in deep learning

One of the concepts that will help AI systems to behave more consistently is how they decompose data and find the important bits. There’s already work done in the field, some of which Bengio himself was involved in.

One of the key efforts in this area is “attention mechanisms,” techniques that enable neural networks to focus on relevant bits of information. Attention mechanisms have become very important in natural language processing (NLP), the branch of AI that handles tasks such as machine translation and question-answering.

But current neural network structures mostly perform attention based on vector calculations. Data is represented as arrays of numerical values that encode its features. The next step would be to enable neural networks to perform attention and representation based on name-value pairs, something like variables as used in rule-based programs. But it should be done in a deep learning–friendly way.
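
For readers who have not seen vector-based attention written out, here is a minimal sketch of one common form, scaled dot-product attention for a single query. The article does not say which variant Bengio had in mind, and the keys, values, and query below are made-up numbers. The query is compared with every key, the similarity scores are turned into weights that sum to one, and the output is the weighted average of the values.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Made-up example: three items, each with a 2-d key and a 1-d value.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
query = [4.0, 0.0]  # points in the same direction as the first key

weights, output = attend(query, keys, values)
print("attention weights:", [round(w, 3) for w in weights])
print("attended value:   ", [round(o, 2) for o in output])
```

Most of the weight lands on the first item because its key is most similar to the query. Everything here is still a plain numeric vector, which is the limitation the paragraph above points to.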

There is already great progress in the field of transfer learning, the discipline of reusing the parameters a neural network has learned on one task as the starting point for another. But better compositionality can lead to deep learning systems that can extract and manipulate high-level features in their problem domains and dynamically adapt them to new environments without the need for extra tuning and lots of data. Efficient composition is an important step toward out-of-distribution generalization.

Deep learning systems with causal structures

It is no secret that the lack of causal reasoning is one of the major shortcomings of current machine learning systems, which are centered around finding and matching patterns in data. Bengio believes that having deep learning systems that can compose and manipulate these named objects and semantic variables will help move us toward AI systems with causal structures.

“In order to facilitate the learning of the causal structure, the learner should try to infer what was the intervention, on which variable was the change performed. That’s something we do all the time,” he said in his NeurIPS speech.
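
A toy example can show what "inferring the intervention" might look like. The sketch below is not Bengio's method; it assumes the learner already knows that A causes B, and all probabilities are invented. The point it illustrates is that when the model is factored along the true causal direction, P(A) and P(B | A), an intervention changes only one of the two pieces, so comparing estimates before and after reveals which variable was acted on.

```python
import random

def sample(n, p_a, p_b_given_a, rng):
    """Draw (a, b) pairs from the causal model A -> B with binary variables."""
    data = []
    for _ in range(n):
        a = rng.random() < p_a
        b = rng.random() < p_b_given_a[a]
        data.append((a, b))
    return data

def estimate(data):
    """Estimate P(A=1) and P(B=1 | A=a) from samples."""
    p_a = sum(a for a, _ in data) / len(data)
    p_b = {}
    for value in (False, True):
        rows = [b for a, b in data if a == value]
        p_b[value] = sum(rows) / len(rows)
    return p_a, p_b

def guess_intervention(before, after):
    """Name the variable whose mechanism changed the most between datasets."""
    p_a0, p_b0 = estimate(before)
    p_a1, p_b1 = estimate(after)
    change_a = abs(p_a1 - p_a0)
    change_b = max(abs(p_b1[v] - p_b0[v]) for v in (False, True))
    return "A" if change_a > change_b else "B"

rng = random.Random(0)
mechanism_b = {False: 0.2, True: 0.9}  # made-up P(B=1 | A)

before = sample(5000, 0.3, mechanism_b, rng)
after_a = sample(5000, 0.8, mechanism_b, rng)                 # only P(A) changed
after_b = sample(5000, 0.3, {False: 0.7, True: 0.4}, rng)     # only P(B | A) changed

print("first change was on: ", guess_intervention(before, after_a))
print("second change was on:", guess_intervention(before, after_b))
```

The hard part, which this sketch sidesteps, is discovering the right variables and the right causal direction in the first place.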

The entire speech contains a lot of very valuable information about topics such as consciousness, the role of language in intelligence, and the intersection of neuroscience and machine learning. Unfortunately, all of that cannot be covered and unpacked in a single post. I suggest watching the entire video (twice).

Bengio is one of many scientists who are trying to move the field of artificial intelligence beyond predictions and pattern-matching and toward machines that think like humans. It will be interesting to see how these efforts evolve and converge.
