Can GPT-4 and GPT-4V perform abstract reasoning like humans?


This article is part of our coverage of the latest in AI research.

There is an ongoing debate on whether large language models (LLMs) like GPT-4 truly mimic human logic and reasoning. Some researchers posit that LLMs may develop emergent capabilities for abstract reasoning, pattern recognition, and analogy-making as they scale.

Others argue that the internal mechanisms behind such capabilities remain unexplained, and some experiments show that these models fail to generalize beyond the scope of their training data.

“Abilities for creating and reasoning with abstract representations are fundamental to robust generalization, so it is essential to understand the extent to which LLMs have achieved such abilities,” scientists at the Santa Fe Institute write in a recent paper.

In their study, the researchers delve into the meaning of abstract reasoning and offer a framework for its assessment in LLMs. The findings reveal that despite their sophistication, both GPT-4 and its multimodal counterpart, GPT-4V, fall short of human-level abstract reasoning. 

What is abstract reasoning?

Abstract reasoning is the capacity to discern a rule or pattern from sparse data and to extrapolate it to novel scenarios. This trait is a cornerstone of human intelligence—children demonstrate proficiency in learning abstract rules from minimal examples.

Evaluating abstract reasoning capabilities is difficult. One measure designed to be fair is the Abstraction and Reasoning Corpus (ARC), created by François Chollet as a framework for assessing abstract reasoning in both humans and AI. The test comprises 1,000 handcrafted analogy puzzles, each presenting a few examples of grid transformations and a final, incomplete grid that the solver must correctly fill in. These puzzles are designed to negate any unfair advantage, such as similarity to training data or reliance on extraneous knowledge.

The Abstraction and Reasoning Corpus (ARC), introduced by AI scientist François Chollet, tests how well intelligent systems can learn from just a few training examples. (Source: Arxiv.org)

To solve these puzzles, one must infer the overarching abstract rule from the few demonstrations and apply it to the test grid. The foundational knowledge required to tackle ARC is believed to be innate in humans, encompassing concepts like object recognition, quantity assessment, and basic principles of geometry and topology.
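To make the format concrete, here is a minimal sketch in Python of what an ARC-style task looks like and how a solver applies an inferred rule. The dictionary layout mirrors the general train/test structure of ARC tasks, but the specific puzzle (a simple horizontal flip) and the grid values are invented for illustration rather than taken from the actual corpus.

```python
# A toy ARC-style task: a few input/output demonstrations plus one test input.
# The train/test layout mirrors the general structure of ARC tasks; the puzzle
# itself is an invented example, not a real ARC item.
toy_task = {
    "train": [
        {"input": [[1, 0, 0],
                   [0, 2, 0]],
         "output": [[0, 0, 1],
                    [0, 2, 0]]},
        {"input": [[3, 3, 0],
                   [0, 0, 4]],
         "output": [[0, 3, 3],
                    [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0, 0],
                        [0, 0, 6]]}],
}

def apply_inferred_rule(grid):
    """The abstract rule a human might induce from the demonstrations:
    mirror each row horizontally."""
    return [list(reversed(row)) for row in grid]

# A solver that has induced the rule can now fill in the test grid.
prediction = apply_inferred_rule(toy_task["test"][0]["input"])
print(prediction)  # [[0, 0, 5], [6, 0, 0]]
```

The hard part, of course, is not applying the rule but discovering it from only two or three demonstrations, which is exactly what ARC is designed to probe.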

“[ARC] is meant to capture the crux of abstract reasoning: inducing general rules or patterns from small numbers of examples and applying these flexibly to new, previously unseen situations,” the researchers write in their paper. 

Human performance on ARC hovers around 84%. In contrast, attempts to solve ARC with current AI systems have been underwhelming. The top entry from a renowned Kaggle competition, which employed program-synthesis techniques, managed to solve a mere 21% of these puzzles, with no capacity to generalize beyond its narrow scope. LLMs, touted as general problem-solvers, fare even worse, solving just 10-12% of ARC challenges in recent experiments.

Testing GPT-4 on reasoning tasks

Examples of ConceptARC puzzles

The Santa Fe Institute researchers performed new experiments with ConceptARC, a variant of ARC designed to be more accessible to human participants and to assess understanding of specific concepts. To adapt ConceptARC for the text-based GPT-4, the visual puzzles were translated into character sequences. The model received a prompt containing instructions, a worked-out example, and a new problem to solve. GPT-4’s task was to generate a character sequence representing the solution, with up to three attempts allowed.
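The paper’s exact character encoding and prompt wording are not reproduced here; the sketch below illustrates the general approach under the assumption that each grid is rendered as rows of digits and that the prompt bundles instructions, one fully worked example (including its solution), and the new problem.

```python
def grid_to_text(grid):
    """Render a grid of small integers as newline-separated rows of digits.
    (An assumed encoding; the paper's actual character format may differ.)"""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_prompt(worked_example, task):
    """Assemble an instruction prompt from a solved example and a new task.
    Both arguments are dicts in the train/test format sketched earlier; the
    worked example is assumed to include its solution grid. The wording is
    illustrative, not the researchers' actual prompt."""
    parts = ["You will be shown grid-transformation puzzles. Infer the rule "
             "from the demonstrations and produce the output grid for the test input."]
    parts.append("Worked example:")
    for pair in worked_example["train"]:
        parts.append("Input:\n" + grid_to_text(pair["input"]))
        parts.append("Output:\n" + grid_to_text(pair["output"]))
    parts.append("Test input:\n" + grid_to_text(worked_example["test"][0]["input"]))
    parts.append("Test output:\n" + grid_to_text(worked_example["test"][0]["output"]))
    parts.append("Now solve this new puzzle.")
    for pair in task["train"]:
        parts.append("Input:\n" + grid_to_text(pair["input"]))
        parts.append("Output:\n" + grid_to_text(pair["output"]))
    parts.append("Test input:\n" + grid_to_text(task["test"][0]["input"]))
    parts.append("Test output:")
    return "\n\n".join(parts)
```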

A previous test showed GPT-4 scoring 19% and 25% on ConceptARC at different temperature settings. With the new, more informative prompting technique, the results improved: tested across all 480 ConceptARC tasks at temperatures of 0 and 0.5, GPT-4’s average performance rose to approximately 33%.
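As a rough illustration of how such an evaluation might be run against the OpenAI API, here is a hedged sketch; the model identifier, retry logic, and the string-matching scoring below are assumptions for illustration, not the study’s actual harness.

```python
from openai import OpenAI  # requires the `openai` package and an API key

client = OpenAI()

def solve_task(prompt, temperature, attempts=3):
    """Query the model up to `attempts` times and return its raw answers.
    Model name and parameters are illustrative assumptions."""
    answers = []
    for _ in range(attempts):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        answers.append(response.choices[0].message.content)
    return answers

def evaluate(prompts, expected_texts, temperature):
    """Count a task as solved if any attempt contains the expected grid text.
    Real scoring would parse and compare grids more carefully."""
    solved = 0
    for prompt, expected in zip(prompts, expected_texts):
        if any(expected.strip() in ans for ans in solve_task(prompt, temperature)):
            solved += 1
    return solved / len(prompts)

# Averaging accuracy over the two temperature settings used in the study:
# accuracy = sum(evaluate(prompts, targets, t) for t in (0.0, 0.5)) / 2
```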

Despite this progress, GPT-4’s capabilities lag significantly behind human performance, which stands at an impressive 91% on ConceptARC. The Santa Fe scientists note, “GPT-4’s performance remains well below the high performance of humans, supporting the conclusion that, even with more informative prompting, the system lacks basic abstract reasoning abilities tested by this corpus.”

Results of the ConceptARC test on the text-only version of GPT-4

Does multimodality improve GPT-4’s performance?

The researchers also tested ConceptARC on GPT-4V, the multimodal version of GPT-4 capable of processing images in addition to text. The prevailing assumption was that GPT-4V, with its enhanced capabilities, would surpass its text-only counterpart. However, due to the prohibitive costs of a full-scale test, the researchers limited GPT-4V’s evaluation to a select group of ConceptARC puzzles known as “attention checks,” where humans typically achieve a 95% success rate.
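Querying GPT-4V involves sending the rendered puzzle image alongside text instructions. The sketch below shows one way to do that with OpenAI’s vision-capable chat API; the model identifier and prompt text are assumptions (model names have since changed), and the researchers’ actual setup may differ.

```python
import base64
from openai import OpenAI

client = OpenAI()

def solve_task_from_image(image_path, instructions):
    """Send a rendered puzzle image plus text instructions to a vision-capable
    GPT-4 model. The model name below is an assumption based on what was
    available around the time of the study; newer identifiers may apply."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instructions},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encoded}"}},
            ],
        }],
        max_tokens=1024,
    )
    return response.choices[0].message.content
```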

Interestingly, when these attention checks were converted into text-only format for GPT-4, the model scored 65-69%, suggesting these tasks are less challenging than the complete set. GPT-4V, however, averaged only 23-25% on the same tasks, well below the text-only version’s results.

The paper’s observations on GPT-4V’s responses are interesting: “GPT-4V often included descriptions of an abstract transformation rule as part of its solution… In certain cases, the model accurately described the output grid despite identifying an incorrect abstract rule, which we classified as a success. On the other hand, we classified as failures instances in which the model correctly identified the abstract rule but failed to accurately describe the output grid.”

GPT-4V’s performance on the complete ConceptARC corpus would likely be even worse than on the attention-check subset. This outcome suggests that multimodal capabilities do not necessarily confer superior abstract reasoning on LLMs.

What does it mean for LLM applications?

The findings from the Santa Fe Institute’s study underscore a significant disparity in abstract reasoning between humans and the most sophisticated AI systems currently available. 

The researchers write, “Our results support the hypothesis that GPT-4, perhaps the most capable ‘general’ LLM currently available, is still not able to robustly form abstractions and reason about basic core concepts in contexts not previously seen in its training data. It is possible that other methods of prompting or task representation would increase the performance of GPT-4 and GPT-4V; this is a topic for future research.” 

Consequently, it would be best to be cautious when integrating these models into decision-making processes that require precise logic. And human oversight remains crucial, particularly in sensitive applications.
