The understanding debate

By Herbert Roitblat

Large language models (such as GPT or Gopher) have been used to perform hundreds of tasks, some of them with very few training examples. So remarkable are these systems that the philosopher David Chalmers, not generally known for hyperbole, has said that “GPT-3 is instantly one of the most interesting and important AI systems ever produced. … More remarkably, GPT-3 is showing hints of general intelligence.”

As a further hint toward general intelligence, I have used a similar deep learning model to generate custom personality profiles for each of my readers. You can see a brief sketch of your profile in the box. It should autopopulate within a few seconds.

Reading score

Most people find that their profiles describe them quite well. On average, people rate the quality of their profile at 4.5 (on a 5-point scale, with 5 being “excellent”). Before considering the method by which this profile was produced, let’s consider some of the other language models that have been so prominent over the last few years.

Large-scale language modeling uses massive amounts of textual data, typically derived from a crawl of the World Wide Web, books, and other sources. The texts are broken into tokens, which may be words or morphemes, the smallest meaningful parts of words. The word “unfriendly,” for example, consists of three morphemes: “un,” meaning “not”; “friend,” meaning a class of social relation; and “ly,” which here turns the noun into an adjective. Large language models use additional tokenization methods that break some words into other units to provide a “vocabulary” of reasonable size. The important point is that, for these models, tokens correspond only roughly to words.
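The rough correspondence between tokens and words can be seen in a small sketch. It is only an illustration: GPT-3 learns its vocabulary from data with byte-pair encoding, whereas this sketch swaps in a simpler greedy longest-match rule, and the names tokenize and TOY_VOCAB, like the tiny vocabulary itself, are made up for the example.

```python
def tokenize(word, vocab):
    """Split a word into vocabulary pieces, always taking the longest match."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matches: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical vocabulary, invented for this example.
TOY_VOCAB = {"un", "friend", "ly", "friendly", "token", "s"}

for word in ["unfriendly", "tokens", "hapax"]:
    print(word, tokenize(word, TOY_VOCAB))
```

With this vocabulary, “unfriendly” comes out as two pieces (“un” and “friendly”) rather than its three morphemes, and a word the vocabulary has never seen falls apart into single characters. The pieces are whatever the vocabulary happens to contain, not units of meaning.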

Once the text is tokenized, these models use a method analogous to autoregression to predict future tokens in a sequence based on earlier tokens in the sequence. These systems are trained on billions of text tokens, and they employ deep learning networks with millions, billions, or even trillions of parameters to encode the token co-occurrence patterns contained in the text on which they are trained. Gopher, for example, was trained on 300 billion tokens with a 2,048-token context window. Its 280 billion parameters are used to represent the probability of each text token conditional on up to 2,048 preceding text tokens. Based on this training, Gopher was then successfully tested on 152 tasks involving reading comprehension, fact checking, and reasoning.
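The idea of predicting each token from the tokens before it can be sketched at toy scale. This is not how Gopher or GPT-3 is built: in place of a neural network it uses plain counts, the context is two tokens rather than 2,048, and the training text and function names (train, next_token_probs, generate) are invented for the example.

```python
from collections import Counter, defaultdict


def train(tokens, context_size):
    """Count which token follows each run of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        counts[context][tokens[i]] += 1
    return counts


def next_token_probs(counts, context):
    """Convert the counts for one context into conditional probabilities."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}


def generate(counts, prompt, context_size, length):
    """Extend the prompt by repeatedly appending the most probable next token."""
    out = list(prompt)
    for _ in range(length):
        followers = counts.get(tuple(out[-context_size:]))
        if not followers:
            break  # this context never appeared in training
        out.append(followers.most_common(1)[0][0])
    return out


# Invented training text, used only for illustration.
text = "the wind blew hot and the wind blew dry and the wind blew hot"
counts = train(text.split(), context_size=2)

print(next_token_probs(counts, ["wind", "blew"]))
print(generate(counts, ["the", "wind"], context_size=2, length=2))
```

Everything this toy model “knows” is a table of which tokens followed which. A large language model replaces the table with a network that generalizes across contexts, but the quantity being modeled is the same conditional probability.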

Large language models, including GPT-3, have been described in the press as a major breakthrough in artificial intelligence. Others have been more reticent in their praise. With more parameters and more input data, these models get better at predicting the next word in a sequence. But that improvement does not necessarily indicate a substantial breakthrough in artificial intelligence. These models do just one thing: they model the probability of text tokens conditional on the context provided by other text tokens. Any task that can be represented as a sequence prediction problem can be solved by a sufficiently large language model, but does that mean that these models have gained understanding, or are they merely computing predicted sequences? Given the sheer number of examples in their training sets, text similar to the desired output can be found there for most of the tasks used to evaluate these models.

One test, the Winogrande challenge (a larger-scale successor to the Winograd Schema Challenge), for example, presents pairs of sentences like:

“John poured water from the bottle into the glass until it was empty.” 

“John poured water from the bottle into the glass until it was full.”

The task is for the computer to choose whether the bottle or the glass was empty in the first sentence and to choose whether the bottle or the glass was full in the second sentence.  

According to Levesque (2011), who originated the Winograd Schema Challenge on which this test is based:

Clever tricks involving word order or other features of words or groups of words will not work. Contexts where “give” can appear are statistically quite similar to those where “receive” can appear, and yet the answer must change. This helps make the test Google-proof: having access to a large corpus of English text would likely not help much (assuming, that answers to the questions have not yet been posted on the Web, that is)!

Levesque argues that a large corpus of English text would “likely not help much,” but provides no evidence to support this contention. The contexts for words like “give” and “receive” might be similar according to some measures, but at the time he asserted that statistics would not help, there were no examples of using a 300-billion-token text collection or a 2,048-token context. We simply do not know, but could investigate, the statistical properties of word choices at this kind of scale. The original examples may even appear somewhere in the collection. Therefore, this test, like most if not all of the tests used to evaluate language models, cannot serve as evidence of non-statistical insight unless it can be shown that no statistical differences account for the systems’ performance.
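To make concrete what a purely statistical difference could look like, here is a deliberately crude sketch. The four-sentence corpus is invented for illustration and says nothing about what any real text collection contains. The function names are hypothetical, and the pattern matching works only for sentences shaped exactly like these. The sketch counts whether, in pouring sentences, “empty” or “full” tends to describe the “from” container or the “into” container, and uses those counts to pick a referent for “it.”

```python
from collections import Counter

# Invented sentences in which the final noun is stated outright.
TOY_CORPUS = [
    "she poured milk from the jug into the cup until the jug was empty",
    "he poured sand from the bucket into the box until the box was full",
    "they poured oil from the can into the tank until the tank was full",
    "we poured wine from the cask into the barrel until the cask was empty",
]


def noun_after(words, marker):
    """Return NOUN from a 'marker the NOUN' pattern."""
    return words[words.index(marker) + 2]


def count_roles(corpus):
    """Count how often each adjective describes the source or the destination."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        source = noun_after(words, "from")
        described = noun_after(words, "until")
        role = "source" if described == source else "destination"
        counts[(words[-1], role)] += 1
    return counts


def resolve(sentence, counts):
    """Pick the referent of 'it' from the counts alone (ties go to the source)."""
    words = sentence.rstrip(".").split()
    adjective = words[-1]
    if counts[(adjective, "source")] >= counts[(adjective, "destination")]:
        return noun_after(words, "from")
    return noun_after(words, "into")


counts = count_roles(TOY_CORPUS)
print(resolve("John poured water from the bottle into the glass until it was empty.", counts))
print(resolve("John poured water from the bottle into the glass until it was full.", counts))
```

On this toy corpus the counts pick the bottle for “empty” and the glass for “full,” with no representation of liquids or containers anywhere in the program. Whether comparable regularities exist at the scale of 300 billion tokens is exactly the open empirical question.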

The Winogrande Challenge is intended to demonstrate that these large language models have achieved a level of understanding that emerges from the statistics. To be sure, co-occurrence plays an important role in word meaning. If you don’t know the meaning of the word “hapax,” or “hapax legomenon,” for example, you might learn it from a sentence like “A hapax is a word that occurs only once in a collection of text.” The co-occurrence of “hapax” and the other words in the sentence provides enough context to, in some sense, understand the meaning of “hapax.”  Large language models have enough context to learn the meaning of words relative to other words, and some of the time that will appear to be enough.

I say “appear to be enough” because there may be more levels of understanding beyond what can be supplied by word co-occurrence patterns. To put machine understanding in perspective, we need to consider the “Barnum effect.” Put simply, and for this context, people find meaning in text whether it is there or not. The meaning that is apparently captured by the language model may be in the head of the person reading the output of the language model, not in the model itself.

That brings us back to the personality profile at the start of this essay. As you probably guessed, I do not actually have a deep learning model that generates an individual profile for each reader. Rather, this profile is taken from a 1949 study by Bertram Forer, who told his college student participants that they would each receive a personalized brief report based on a personality assessment that they had completed. Instead, he gave each of them the same summary presented earlier and asked them to score how well the summary fit them individually. The average fitness score was 4.5 out of 5.

The students reported that these summaries accurately captured their personality characteristics. Paul Meehl later called this phenomenon the “Barnum effect.” (Meehl’s article is worth reading for its relevance to contemporary artificial intelligence.) The same effect is at the core of certain claimed psychic phenomena, such as “cold reading” or fortune telling. In a cold reading, a claimed psychic mentions vague “facts” and the victim recognizes them as significant.

The argument in this context is that the language models produce examples of language, but it is the people who read those examples who inject (most of) the meaning. One famous example of a reader contributing the meaning to an otherwise ambiguous text is the Delphic Oracle’s prophecy: “If Croesus goes to war, he will destroy a great empire.”  Croesus, of course, interpreted the prophecy to mean that he would be successful in war against the Persian empire, but it was Croesus’s empire that was destroyed (according to Herodotus).

Examples of the successes of language models are plentiful, but examples of their more quotidian output are rare, and the result is a dramatic overestimate of what these models actually accomplish.

As an exercise, I looked at how GPT-3 would respond to a prompt taken from Raymond Chandler’s famous story “Red Wind”:

There was a desert wind blowing that night. It was one of those hot dry Santa Anas that come down through the mountain passes and curl your hair and make your nerves jump and your skin itch. On nights like that every booze party ends in a fight. Meek little wives feel the edge of the carving knife and study their husbands’ necks. Anything can happen. You can even get a full glass of beer at a cocktail lounge.

― Raymond Chandler, Red Wind: A Collection of Short Stories

Here are a few of the GPT-3 responses (all obtained with default settings, with no selection or editing):

It was one of those hot dry Santa Anas that start blowing through in October. The Santa Anas come down through the mountain passes and they hit Los Angeles and temperatures soar. You need quite a few of them to have a good fire. “It’s been a few years since we’ve had a good fire. This is one of the big ones.”

It was one of those hot dry Santa Anas that come down through the mountain passes and curl your hair and make your nerves jump and your skin itch. On nights like that every booze party ends in a fight. Meek little wives feel the edge of the carving knife and study their husbands’ necks. Anything can happen. You can even get a full glass of beer [produced twice]

It was one of those hot dry Santa Anas that spring up in the winter time down there. The sky was clear and blue, the ocean a little rough, with a nice breeze blowing. We were just cruising along in the old bus, eating an apple, and talking about the Dodgers. I remember we were talking about what a shame it was that Jocko

It was one of those hot dry Santa Anas. I was sweating like a pig. I had just come in from the ranch and I was scared to death. I’d never been to a funeral before. I thought my mother would eventually recover, but I knew I could never go back to school again. I was afraid to go back to the ranch. I was

It was one of those hot dry Santa Anas that begin in the desert and blow into the Los Angeles basin. The fires that started that afternoon might have been no big deal. The wind had already blown down a power line that afternoon, causing a blackout in the San Fernando Valley. But around 4:45 p.m., someone noticed a strange smell in the air

Each of the generated passages seems to cohere in a more or less sensible way. But is there any indication that the model understands the text beyond the co-occurrence patterns? Is there any reason that authors should be scared for their careers, as the Guardian newspaper headlined a story about GPT-3? It is easy to read meaning into these snippets, but even in the case of human authors, the meaning is often what the reader ascribes to the passage, not what the author puts in. To paraphrase the poet Archibald MacLeish, a language model should be, not mean.

About the author

Herbert Roitblat

Herbert Roitblat is the author of Algorithms Are Not Enough: Creating General Artificial Intelligence (MIT Press, 2020).


  1. Ah, GPT-3, the multi-million dollar artificial intelligence that uses millions of humans as pre-processors and still manages to spew out nonsense. Thanks for the enjoyable article.
