This article is part of our coverage of the latest in AI research.
Even before the recent craze about sentient chatbots, large language models (LLMs) had been the source of much excitement and concern. In recent years, LLMs, deep learning models that have been trained on vast amounts of text, have shown remarkable performance on several benchmarks that are meant to measure language understanding.
Large language models such as GPT-3 and LaMDA manage to maintain coherence over long stretches of text. They seem to be knowledgeable about different topics. They can remain consistent in lengthy conversations. LLMs have become so convincing that some people associate them with personhood and higher forms of intelligence.
But can LLMs do logical reasoning like humans? According to a research paper by scientists at the University of California, Los Angeles, transformers, the deep learning architectures used in LLMs, don’t learn to emulate reasoning functions. Instead, they find clever ways to learn statistical features that inherently exist in the reasoning problems.
The researchers tested BERT, a popular transformer architecture, on a confined problem space. Their findings show that BERT can accurately respond to reasoning problems on in-distribution examples in the training space but can’t generalize to examples drawn from other distributions based on the same problem space.
Their work highlights some of the shortcomings of deep neural networks as well as the benchmarks used to evaluate them.
How do you measure logical reasoning in AI?
Several benchmarks test AI systems on natural language processing and understanding problems, including GLUE, SuperGLUE, SNLI, and SQuAD. Transformers have incrementally improved on these benchmarks as they have grown bigger and been trained on larger datasets.
What is notable is that the performance of AI systems on these benchmarks is often compared to human intelligence. Human performance on these benchmarks is closely tied to common sense and the capacity for logical reasoning. But it is not clear whether large language models are improving because they have acquired logical reasoning capabilities or simply because they have been exposed to very large amounts of text.
To find out which is the case, the UCLA researchers developed SimpleLogic, a class of logical reasoning problems based on propositional logic. To make sure that language models are tested strictly on their reasoning abilities, the researchers removed language variance by using templated language structures.
A SimpleLogic problem consists of a set of facts, rules, queries, and labels. Facts are predicates that are known to be true. Rules are conditions, defined as clauses. The query is the problem that the ML model must respond to. And the label is the answer to the query, “true” or “false.”
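To make the four parts concrete, here is a minimal Python sketch of such a problem and of how its label could be computed. It is not the researchers' code. It assumes every rule has the form "if all of these predicates are true, then this predicate is true," and the names `Rule` and `solve`, as well as the weather predicates, are made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rule:
    body: Tuple[str, ...]  # predicates that must all be true
    head: str              # predicate that is then true as well


def solve(facts, rules, query):
    """Return the label: True if `query` follows from `facts` and `rules`."""
    known = set(facts)
    changed = True
    # Keep applying rules until no new predicate can be derived.
    while changed:
        changed = False
        for rule in rules:
            if rule.head not in known and all(p in known for p in rule.body):
                known.add(rule.head)
                changed = True
    return query in known


# Hypothetical example problem.
facts = ["rainy", "cold"]
rules = [
    Rule(body=("rainy", "cold"), head="icy"),
    Rule(body=("icy",), head="slippery"),
    Rule(body=("sunny",), head="warm"),
]

print(solve(facts, rules, "slippery"))  # True
print(solve(facts, rules, "warm"))      # False
```

The loop is a simple form of forward chaining: it derives everything that can be derived and then checks whether the query is among the results. Nothing outside the problem itself is consulted.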
The SimpleLogic problems are compiled into continuous text strings with the signals and separators that language models expect during training and inference.
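As a rough illustration of this step, the sketch below flattens one problem into a single string. The template is an assumption and not the exact format used in the paper. `[CLS]` and `[SEP]` are the special tokens BERT uses to mark the start of an input and the boundaries between segments; in practice a tokenizer usually inserts them, but they are written out here to make the structure visible. The function name `serialize` is hypothetical.

```python
CLS, SEP = "[CLS]", "[SEP]"


def serialize(facts, rules, query):
    """Flatten one problem into a templated text string (assumed format)."""
    fact_part = "Facts: " + " ".join(f"{fact}." for fact in facts)
    rule_part = "Rules: " + " ".join(
        f"If {' and '.join(body)}, then {head}." for body, head in rules
    )
    query_part = f"Query: {query}?"
    return f"{CLS} {fact_part} {SEP} {rule_part} {SEP} {query_part} {SEP}"


# Hypothetical example problem; each rule is a (body, head) pair.
facts = ["rainy", "cold"]
rules = [(("rainy", "cold"), "icy"), (("icy",), "slippery")]

text = serialize(facts, rules, "slippery")
print(text)
```

Because every problem goes through the same template, two problems differ only in their predicates and logical structure, never in phrasing.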
One of the characteristics of SimpleLogic is that its problems are self-contained and require no prior knowledge. This is especially important because, as many scientists argue, when humans speak, they omit their shared knowledge. This is why language models often fall into traps when asked questions about basic world knowledge that every human knows. In contrast, SimpleLogic provides you with everything you need to solve its problems.
Therefore, anyone looking at a few problems posed in the SimpleLogic format should be able to deduce its rules and process new examples, regardless of their background knowledge.
Statistical features and logical reasoning
The researchers prove that the problem space in SimpleLogic can be represented by a reasoning function. They further show that BERT has more than enough capacity to solve all the problems in SimpleLogic, and that its parameters can be set by hand to represent the reasoning function.
However, when they trained BERT on a dataset of SimpleLogic examples, the model could not learn the reasoning function by itself. The machine learning model managed to achieve near-perfect accuracy on one data distribution. But it did not generalize to other distributions within the same problem space. This is despite the training dataset covering the entire problem space and all distributions being derived from the same reasoning function.
(Note: This is different from the out-of-distribution generalization challenge, which applies to open-space problems. When a model can’t generalize to OOD data, its performance drops significantly when processing data that fall outside of the distribution of its training set.)
The researchers write, “Upon further investigation, we provide an explanation for this paradox: the model attaining high accuracy only on in-distribution test examples has not learned to reason. In fact, the model has learned to use statistical features in logical reasoning problems to make predictions rather than to emulate the correct reasoning function.”
This finding highlights an important challenge in using deep learning for language tasks. Neural networks are very good at finding and fitting statistical features. In some applications, this can be very useful. For example, in sentiment analysis, there is a strong correlation between certain words and classes of sentiments.
However, for logical reasoning tasks, even if statistical features are present, the model should try to find and learn the underlying reasoning function.
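To see how a statistical feature can creep into a logic dataset, consider a toy experiment. It is an illustration only, not the paper's setup, and the feature chosen here (the number of rules in a problem) is an assumption made for the example. The sketch generates random problems, then counts how often the query is true when only the first k rules of each problem are kept. The names `derivable` and `true_rate_by_rule_count` are hypothetical.

```python
import random


def derivable(facts, rules, query):
    """Forward chaining: can `query` be derived from `facts` and `rules`?"""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(p in known for p in body):
                known.add(head)
                changed = True
    return query in known


def true_rate_by_rule_count(num_problems=500, max_rules=8, seed=0):
    """Fraction of true labels when each problem keeps its first k rules."""
    rng = random.Random(seed)
    predicates = [f"p{i}" for i in range(6)]
    true_counts = [0] * (max_rules + 1)
    for _problem in range(num_problems):
        facts = rng.sample(predicates, 2)
        query = rng.choice(predicates)
        rules = []
        for _rule in range(max_rules):
            body = tuple(rng.sample(predicates, rng.randint(1, 2)))
            rules.append((body, rng.choice(predicates)))
        # Label the same problem with 0, 1, ..., max_rules of its rules.
        for k in range(max_rules + 1):
            if derivable(facts, rules[:k], query):
                true_counts[k] += 1
    return [count / num_problems for count in true_counts]


rates = true_rate_by_rule_count()
for k, rate in enumerate(rates):
    print(f"{k} rules: {rate:.2f} of queries are true")
```

Adding a rule can make more predicates derivable but never fewer, so the share of true labels can only stay flat or rise as the rule count grows. A model could exploit that correlation to guess labels on data that shares it, without performing any deduction, and the guess would stop working on a distribution where the correlation is different.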
“Caution should be taken when we seek to train neural models end-to-end to solve NLP tasks that involve both logical reasoning and prior knowledge [emphasis mine] and are presented with language variance,” the researchers write, stressing that the challenges posed by SimpleLogic become exacerbated in real-world situations, where a lot of the information that LLMs require is simply not included in the data.
The researchers observed that when they removed a statistical feature from the training dataset, it resulted in an improvement in the performance of the language model on other distributions in the same problem space. However, the problem is that finding and removing multiple statistical features is easier said than done. As the researchers note in their paper, “such statistical features can be countless and extremely complicated, and thus very difficult to be removed from training data.”
Reasoning in deep learning
Unfortunately, the logical reasoning problem does not go away as language models become larger. It just becomes hidden in their huge architectures and very large training corpora. LLMs can spit out facts and nicely stitched-together sentences, but when it comes to logical reasoning, they are still using statistical features to make inferences, which is not a solid foundation.
And there is no sign that adding layers, parameters, and attention heads to transformers will bridge the logical reasoning gap.
The paper highlights one of the main challenges that current language models face. As the UCLA researchers note, “On the one hand, when a model is trained to learn a task from data, it always tends to learn statistical patterns, which inherently exist in reasoning examples; on the other hand, however, the rules of logic never rely on statistical patterns to conduct reasoning. Since it is difficult to construct a logical reasoning dataset that contains no statistical features, it follows that learning to reason from data is difficult.”