In the rising excitement about large language models, particularly ChatGPT, people are taking sides. There are now collections of the gaffes and so-called hallucinations committed by these models, and other collections of their successes. What is largely missing is any critical analysis of what these failures and successes tell us about the relationship between these models and intelligence. The point is not whether these models succeed (sometimes they do) or fail (sometimes they do); it is what those successes and failures tell us. What does a large language model learn? What does that tell us about its role in intelligence?
Large language models learn just what their name implies: they model language patterns. In 1948, Claude Shannon pointed out that words in discourse can be predicted, and that the more context one uses to make the prediction, the better the approximation to English prose. In those days, the amount of text that could be analyzed economically and the effort required limited the predictive context to only a few words. Current large language models use several hundred to a few thousand words of predictive context. GPT-3, for example, uses a 2048-token context and learns to predict each token from the tokens that precede it. The model is an approximation to the specific word patterns, because there are far too many combinations of specific words to consider each possible one literally. As large as these models are (GPT-3 has 175 billion parameters, the variables that are set during training), that is still only a tiny fraction of the number of combinations of words and contexts that are possible, so the word patterns get compressed. Although the models are capable of reproducing exact sequences of words that they have seen, they are not restricted to doing so.
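To make Shannon's observation concrete, here is a toy sketch of word prediction using a single word of context (a bigram model). This is an illustration only: real large language models use neural networks over thousands of tokens, not raw counts over one word, but the prediction objective is the same in spirit.

```python
from collections import Counter, defaultdict

def build_bigram_model(text):
    """Count, for each word, which words follow it in the corpus."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def predict_next(model, word):
    """Return the most frequent continuation of `word`, or None if unseen."""
    followers = model.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# A tiny illustrative corpus (hypothetical, for demonstration only).
corpus = "the cat sat on the mat and the cat slept on the mat"
model = build_bigram_model(corpus)
print(predict_next(model, "on"))  # "on" is always followed by "the" here
```

Widening the context from one word to two, three, or more words is exactly what improves the approximation to natural prose, at the cost of exponentially more combinations to track; that explosion is why large models must compress the patterns rather than store them all.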
Each word in the context constrains the contributions of the other words. When the model produces text, it chooses words according to their probability given these constraints. Once a word is produced, it contributes to the constraints on subsequent words. All of these constraints together yield the fluent patterns of language that the model produces. If these word patterns were produced by a human, we would be likely to infer that the person is intelligent. Alan Turing argued that the ability to hold a conversation indistinguishable from a human's would indicate that the computer implements the same function as the human. Turing's criterion was part of a tradition that views intelligence as the output of educated Western intellectuals: chess playing, logical reasoning, and conversation were taken to be the pinnacles of intelligence. But a moment's reflection indicates that this cannot be true.
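The loop described above, in which each produced word becomes part of the context that constrains the next choice, can be sketched with the same kind of toy bigram model. For determinism this sketch always takes the most likely continuation (greedy decoding); real models typically sample probabilistically from the learned distribution, and condition on far more than the single previous word.

```python
from collections import Counter, defaultdict

def build_bigram_model(text):
    """Count, for each word, which words follow it in the corpus."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def generate(model, start, length):
    """Greedy autoregressive generation: each chosen word joins the
    context and constrains the choice of the next word."""
    out = [start]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:
            break  # no continuation seen for this word
        out.append(followers.most_common(1)[0][0])
    return out

# A tiny illustrative corpus (hypothetical, for demonstration only).
model = build_bigram_model("the cat sat on the mat and the cat sat on the rug")
print(" ".join(generate(model, "the", 5)))  # prints "the cat sat on the"
```

Even this trivial loop produces locally fluent word sequences, which is the point of the argument that follows: fluency is a property of the word patterns, and by itself says nothing about understanding.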
Fluent speech can be mistaken for intelligent speech. One can learn, for example, to make a speech in a foreign language without being able to say another word in that language.
Aphasia is a loss of language ability, but it does not change a person’s intelligence.
Williams syndrome is a developmental disorder associated with delayed language onset, but then verbose, fluent speech in late childhood and adulthood despite mild to moderate cognitive impairment.
With some instruction, Helen Keller went from a vocabulary of about 70 signs to being able to write books, but there is no reason to think that her intelligence changed at that point.
Language seems to have emerged between 100,000 and 50,000 years ago. It appears to be a tool for intelligence, not identical to it: it was instrumental in expanding what people could do and think, but people almost certainly had some kind of intelligence before language appeared. Language is a tool for expanding intelligence, not the cause of it (see more discussion in Algorithms Are Not Enough: Creating General Artificial Intelligence).
People are not limited to thinking only about things that they have names for (as the Sapir-Whorf hypothesis suggests). People make up new names for things (the Jabberwocky effect) and use old words in new ways (the Humpty Dumpty syndrome). If language and thought were identical, people would not be capable of thinking or reasoning about anything they did not already have a word for.
On the other side, it is one thing to talk and another thing to do. A language model connects some language to other language, but never connects with anything else. Plato offered the allegory of the cave: people were said to be like captives who could not experience the world directly, only the shadows that actors in the world cast on the cave wall in front of them. Language models do not even have access to the shadows, merely to descriptions of the shadows. The words they relate to one another cannot be truthful or wrong, because all the model has are the words. Language models have no way of “understanding” that something could be a lie, an error, or an opinion. Mary may believe that John would make a good husband, but, unknown to her, he is really Jack the Ripper. Does Mary believe that Jack the Ripper would make a good husband?