This article is part of our coverage of the latest in AI research.
Large language models like GPT-3 have advanced to the point that it has become difficult to measure the limits of their capabilities. When you have a very large neural network that can generate articles, write software code, and engage in conversations about sentience and life, you should expect it to be able to reason about tasks and plan as a human does, right?
Wrong. A study by researchers at Arizona State University, Tempe, shows that when it comes to planning and thinking methodically, LLMs perform very poorly, and suffer from many of the same failures observed in current deep learning systems.
Interestingly, the study finds that, while very large LLMs like GPT-3 and PaLM pass many of the tests that were meant to evaluate the reasoning capabilities of artificial intelligence systems, they do so because these benchmarks are either too simplistic or too flawed and can be “cheated” through statistical tricks, something that deep learning systems are very good at.
With LLMs breaking new ground every day, the authors suggest a new benchmark to test the planning and reasoning capabilities of AI systems. The researchers hope that their findings can help steer AI research toward developing artificial intelligence systems that can handle what have become popularly known as “System 2 thinking” tasks.
The illusion of planning and reasoning
“Back last year, we were evaluating GPT-3’s ability to extract plans from text descriptions—a task that was attempted with special purpose methods earlier—and found that off-the-shelf GPT-3 does quite well compared to the special purpose methods,” Subbarao Kambhampati, professor at Arizona State University and co-author of the study, told TechTalks. “That naturally made us wonder what ‘emergent capabilities’—if any–GPT3 has for doing the simplest planning problems (e.g., generating plans in toy domains). We found right away that GPT3 is pretty spectacularly bad on anecdotal tests.”
At the same time, GPT-3 and other large language models perform very well on benchmarks designed for common-sense reasoning, logical reasoning, and ethical reasoning, skills that were previously thought to be off-limits for deep learning systems. A previous study by Kambhampati’s group at Arizona State University shows the effectiveness of large language models in generating plans from text descriptions. Other recent studies include one that shows LLMs can do zero-shot reasoning if provided with a special trigger phrase.
However, “reasoning” is often used broadly in these benchmarks and studies, Kambhampati believes. What LLMs are doing, in fact, is creating a semblance of planning and reasoning through pattern recognition.
“Most benchmarks depend on shallow (one or two steps) type of reasoning, as well as tasks for which there is sometimes no actual ground truth (e.g., getting LLMs to reason about ethical dilemmas),” he said. “It is possible for a purely pattern completion engine with no reasoning capabilities to still do fine on some of such benchmarks. After all, while System 2 reasoning abilities can get compiled to System 1 sometimes, it is also the case that System 1’s ‘reasoning abilities’ may just be reflexive responses from patterns the system has seen in its training data, without actually doing anything resembling reasoning.”
System 1 and System 2 thinking
System 1 and System 2 thinking were popularized by psychologist Daniel Kahneman in his book Thinking, Fast and Slow. The former is the fast, reflexive, and automated type of thinking and acting that we do most of the time, such as walking, brushing our teeth, tying our shoes, or driving in a familiar area. Even a large part of speech is performed by System 1.
System 2, on the other hand, is the slower thinking mode that we use for tasks that require methodical planning and analysis. We use System 2 to solve calculus equations, play chess, design software, plan a trip, solve a puzzle, etc.
But the line between System 1 and System 2 is not clear-cut. Take driving, for example. When you are learning to drive, you must fully concentrate on how you coordinate your muscles to control the gearshift, steering wheel, and pedals while also keeping an eye on the road and the side and rear mirrors. This is clearly System 2 at work. It consumes a lot of energy, requires your full attention, and is slow. But as you gradually repeat the procedures, you learn to do them without thinking. The task of driving shifts to your System 1, enabling you to perform it without taxing your mind. One sign that a task has been integrated into System 1 is the ability to do it subconsciously while focusing on another task (e.g., you can tie your shoes and talk at the same time, brush your teeth and read, drive and talk, etc.).
Even many of the very complicated tasks that remain in the domain of System 2 eventually become partly integrated into System 1. For example, professional chess players rely a lot on pattern recognition to speed up their decision-making process. You can see similar examples in math and programming, where after doing things over and over again, some of the tasks that previously required careful thinking come to you automatically.
A similar phenomenon might be happening in deep learning systems that have been exposed to very large datasets. They might have learned to do the simple pattern-recognition phase of complex reasoning tasks.
“Plan generation requires chaining reasoning steps to come up with a plan, and a firm ground truth about correctness can be established,” Kambhampati said.
A new benchmark for testing planning in LLMs
“Given the excitement around hidden/emergent properties of LLMs however, we thought it would be more constructive to develop a benchmark that provides a variety of planning/reasoning tasks that can serve as a benchmark as people improve LLMs via finetuning and other approaches to customize/improve their performance to/on reasoning tasks. This is what we wound up doing,” Kambhampati said.
The team developed their benchmark based on the domains used in the International Planning Competition (IPC). The framework consists of multiple tasks that evaluate different aspects of reasoning. For example, some tasks evaluate the LLM’s capacity to create valid plans to achieve a certain goal, while others test whether the generated plan is optimal. Other tests include reasoning about the results of a plan, recognizing whether different text descriptions refer to the same goal, reusing parts of one plan in another, shuffling plans, and more.
To carry out the tests, the team used Blocks world, a problem framework that revolves around placing a set of different blocks in a particular order. Each problem has an initial condition, an end goal, and a set of allowed actions.
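To make the notions of an initial condition, a goal, allowed actions, and valid versus optimal plans concrete, here is a minimal Python sketch. It is not the researchers' benchmark code. It assumes the common textbook formulation of Blocks world with four actions (pick-up, put-down, stack, unstack) and a single gripper, and every name in it (apply_action, validate_plan, shortest_plan_length) is hypothetical.

```python
from collections import deque
from itertools import permutations

TABLE = "table"  # assumed name for the table surface

def is_clear(on, block):
    """A block is clear when it is placed somewhere and nothing rests on it."""
    return block in on and block not in on.values()

def apply_action(state, action):
    """Return the successor state, or None if the action is not allowed."""
    on, holding = state
    on = dict(on)  # copy so the caller's state is never mutated
    name, *args = action
    if name == "pick-up":  # lift a clear block off the table
        (b,) = args
        if holding is None and is_clear(on, b) and on[b] == TABLE:
            del on[b]
            return on, b
    elif name == "unstack":  # lift a clear block off another block
        b, below = args
        if holding is None and below != TABLE and is_clear(on, b) and on[b] == below:
            del on[b]
            return on, b
    elif name == "put-down":  # place the held block on the table
        (b,) = args
        if holding == b:
            on[b] = TABLE
            return on, None
    elif name == "stack":  # place the held block on a clear block
        b, below = args
        if holding == b and is_clear(on, below):
            on[b] = below
            return on, None
    return None

def satisfies(state, goal_on):
    """The goal lists only the block placements that must hold at the end."""
    on, _ = state
    return all(on.get(b) == support for b, support in goal_on.items())

def validate_plan(initial_on, goal_on, plan):
    """A plan is valid if every step is allowed and the final state meets the goal."""
    state = (dict(initial_on), None)
    for action in plan:
        state = apply_action(state, action)
        if state is None:
            return False
    return satisfies(state, goal_on)

def shortest_plan_length(initial_on, goal_on):
    """Breadth-first search for the fewest actions that reach the goal."""
    blocks = list(initial_on)
    actions = [(n, b) for n in ("pick-up", "put-down") for b in blocks]
    actions += [(n, a, b) for n in ("stack", "unstack")
                for a, b in permutations(blocks, 2)]

    def key(state):
        return frozenset(state[0].items()), state[1]

    start = (dict(initial_on), None)
    frontier, seen = deque([(start, 0)]), {key(start)}
    while frontier:
        state, depth = frontier.popleft()
        if satisfies(state, goal_on):
            return depth
        for action in actions:
            nxt = apply_action(state, action)
            if nxt is not None and key(nxt) not in seen:
                seen.add(key(nxt))
                frontier.append((nxt, depth + 1))
    return None  # goal unreachable

# Example: B sits on A; the goal is to have A on B.
initial = {"A": TABLE, "B": "A", "C": TABLE}
goal = {"A": "B"}
plan = [("unstack", "B", "A"), ("put-down", "B"), ("pick-up", "A"), ("stack", "A", "B")]
is_valid = validate_plan(initial, goal, plan)
is_optimal = is_valid and len(plan) == shortest_plan_length(initial, goal)
```

Because both checks are mechanical, a model's answer can be scored without human judgment, which is the kind of firm ground truth that plan generation offers. Exhaustive search like this is only practical for small toy instances.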
“The benchmark itself is extensible and is meant to have tests from several of the IPC domains,” Kambhampati said. “We used the Blocks world examples for illustrating the different tasks. Each of those tasks (e.g., Plan generation, goal shuffling, etc.) can also be posed in other IPC domains.”
The benchmark Kambhampati and his colleagues developed uses few-shot learning, where the prompt given to the machine learning model includes a solved example plus the main problem that must be solved.
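A few-shot prompt of this kind can be assembled with a few lines of Python. The sketch below is only an illustration: the field labels ("Initial state:", "Goal:", "Plan:"), the rules text, and the function names are assumptions and not the format the researchers used.

```python
def format_problem(initial, goal, plan=()):
    """Render one problem; leaving the plan empty marks it as unsolved."""
    lines = [f"Initial state: {initial}", f"Goal: {goal}", "Plan:"]
    lines.extend(plan)
    return "\n".join(lines)

def build_few_shot_prompt(domain_rules, example, query):
    """Join the domain rules, one solved example, and the problem to solve.

    example is (initial, goal, plan); query is (initial, goal).
    """
    solved = format_problem(*example)
    unsolved = format_problem(*query)
    return "\n\n".join([domain_rules, solved, unsolved])

# Hypothetical rules text and problems, for illustration only.
rules = "You can pick up, put down, stack, and unstack blocks, one block at a time."
example = (
    "B is on A, A is on the table.",
    "A is on B.",
    ["unstack B from A", "put down B", "pick up A", "stack A on B"],
)
query = ("C is on B, B is on the table.", "B is on C.")
prompt = build_few_shot_prompt(rules, example, query)
```

The prompt ends with an open "Plan:" line, so the model's continuation is read as its proposed plan for the second problem.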
Unlike other benchmarks, the problem descriptions of this new benchmark are very long and detailed. Solving them requires concentration and methodical planning, and the problems can’t be gamed through pattern recognition. Even a human trying to solve them would have to think carefully about each problem, take notes, possibly make visualizations, and plan the solution step by step.
“Reasoning is a system-2 task in general. The collective delusion of the community has been to look at those types of reasoning benchmarks that could probably be handled via compilation to system 1 (e.g., ‘the answer to this ethical dilemma, by pattern completion, is this’) as against actually doing reasoning that is needed for the task at hand,” Kambhampati said.
Large language models are bad at planning
The researchers tested their framework on Davinci, the largest version of GPT-3. Their experiments show that GPT-3 has mediocre performance on some types of planning tasks but performs very poorly in areas such as plan reuse, plan generalization, optimal planning, and replanning.
“The initial studies we have seen basically show that LLMs are particularly bad on anything that would be considered planning tasks–including plan generation, optimal plan generation, plan reuse or replanning,” Kambhampati said. “They do better on the planning-related tasks that don’t require chains of reasoning–such as goal shuffling.”
In the future, the researchers will add test cases based on other IPC domains and provide performance baselines with human subjects on the same benchmarks.
“We are also ourselves curious as to whether other variants of LLMs do any better on these benchmarks,” Kambhampati said.
Kambhampati stresses that the goal of the project is to put the benchmark out and give an idea of where the current baseline is. The researchers hope that their work opens new windows for developing planning and reasoning capability for current AI systems. For example, one direction they propose is evaluating the effectiveness of finetuning LLMs for reasoning and planning in specific domains. The team already has preliminary results on an instruction-following variant of GPT-3 that seems to do marginally better on the easy tasks, although it too remains around the 5-percent level for actual plan generation tasks, Kambhampati said.
Kambhampati also believes that learning and acquiring world models would be an essential step for any AI system that can reason and plan. Other scientists, including deep learning pioneer Yann LeCun, have made similar suggestions.
“If we agree that reasoning is part of intelligence, and want to claim LLMs do it, we certainly need plan generation benchmarks there,” Kambhampati said. “Rather than take a magisterial negative stand, we are providing a benchmark, so that people who believe that reasoning can be emergent from LLMs even without any special mechanisms such as world models and reasoning about dynamics, can use the benchmark to support their point of view.”