Are the emergent abilities of LLMs like GPT-4 a mirage?


This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

Large language models (LLMs) like ChatGPT and GPT-4 have captured the world’s imagination. They have displayed many fascinating abilities, and many researchers believe that we have barely scratched the surface.

But a new study by researchers at Stanford University suggests that some of these abilities might be misunderstood. The researchers examined the previously reported “emergent abilities” that LLMs acquire as they grow larger. Their findings show that when you choose the right metrics to evaluate LLMs, these emergent abilities disappear.

This study is important because it demystifies some of the magical and obscure abilities that have been attributed to LLMs. It also questions the notion that scale is the only way to create better language models.

The emergent abilities of LLMs

Several studies have examined the emergent abilities of LLMs. One study defined emergence as abilities that are “not present in smaller models but are present in larger models.” In practice, this means a machine learning model performs at roughly random levels on a task until its size reaches a certain threshold, after which performance starts to improve as the model grows. You can see this pattern in the following graph, where the performance of the LLM suddenly jumps at a certain scale.

Large language models show emergent abilities at scale, where performance on a task remains at random levels until the model’s size reaches a certain threshold. After that, performance jumps and starts to improve as the model grows larger.

Researchers have studied emergent abilities in LLMs with more than 100 billion parameters, such as LaMDA, GPT-3, Gopher, Chinchilla, and PaLM. These studies evaluated tasks from BIG-Bench, a crowd-sourced benchmark that spans linguistics, common-sense reasoning, and mathematics. They also used challenges from TruthfulQA, Massive Multitask Language Understanding (MMLU), and Word in Context (WiC), all benchmarks designed to test the limits of LLMs on complicated language tasks.

Emergent abilities are important for several reasons. First, these studies indicate that scaling LLMs, without adding further innovation, can continue to yield advances toward more general AI capabilities. Second, they suggest that we can’t predict what to expect from LLMs as they grow bigger. Naturally, such findings further intensify the mystical aura around large language models.

Why emergence in LLMs might be hyped

The new Stanford study casts a different light on the supposed emergent abilities of LLMs. According to its findings, the appearance of emergence is often caused by the choice of metric, not by scale. The researchers suggest that “existing claims of emergent abilities are creations of the researcher’s analyses, not fundamental changes in model behavior on specific tasks with scale.” They find “strong supporting evidence that emergent abilities may not be a fundamental property of scaling AI models.”

Specifically, they suggest that “emergent abilities seem to appear only under metrics that nonlinearly or discontinuously scale any model’s per-token error rate.” In other words, depending on which metric you use to measure performance on a task, you may see emergence at scale or you may see continuous improvement.

For example, some tests use exact-match scoring: the model’s output counts as correct only if every generated token is correct. This is especially common in classification and mathematics tasks, where a partially right answer is still scored as a wrong answer.

In reality, as models scale, the tokens they produce gradually get closer to the correct ones. But as long as the final answer differs from the ground truth, the output is scored as entirely wrong, until the model crosses the threshold where every token is correct. For instance, if a model gets each token right 90 percent of the time, the chance of producing a ten-token answer with no mistakes is only about 35 percent, so smooth gains in per-token accuracy show up as an abrupt jump in exact-match accuracy.

In their study, the researchers show that if they apply alternate metrics to the same outputs, the emergent abilities disappear and model performance improves smoothly. These metrics, such as token edit distance, measure how close the output is to the true answer instead of just counting fully correct answers.
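To make the mechanism concrete, here is a minimal sketch (not from the paper) that assumes per-token accuracy improves smoothly with model size, and shows how an exact-match metric turns that smooth curve into an apparent jump. The model sizes, accuracy values, and answer length below are invented purely for illustration.

```python
import numpy as np

# Hypothetical model sizes (billions of parameters) and per-token accuracies.
# These numbers are made up for illustration; they are not from the paper.
model_sizes = [0.1, 0.3, 1, 3, 10, 30, 100, 300]
per_token_accuracy = np.array([0.50, 0.60, 0.70, 0.78, 0.85, 0.90, 0.95, 0.98])

answer_length = 10  # number of tokens in the target answer

# Non-linear metric: exact match. The answer only scores if ALL tokens are
# correct, so the expected score is per_token_accuracy ** answer_length.
exact_match = per_token_accuracy ** answer_length

for size, pt, em in zip(model_sizes, per_token_accuracy, exact_match):
    print(f"{size:6.1f}B params | per-token accuracy: {pt:.2f} | exact match: {em:.3f}")
```

Running this prints a per-token accuracy that rises steadily across every model size, while exact-match accuracy stays near zero for the smaller models and then climbs sharply, which is exactly the shape that gets labeled an emergent ability.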

Top: When evaluated with non-linear metrics, LLMs show emergent behavior. Bottom: When evaluated with linear metrics, performance improves smoothly.

The researchers also found that in some cases, emergence was an artifact of not having enough test data. When they created larger test datasets, the performance improvements became smooth.
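The resolution argument can be illustrated the same way. The sketch below (again invented, not taken from the paper) assumes the true probability of solving a task improves smoothly with scale and compares what a small and a large test set would measure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true success probabilities that improve smoothly with scale
# (invented for illustration, not from the paper).
true_success = [0.002, 0.005, 0.01, 0.03, 0.08, 0.2, 0.5]

# Measure accuracy with a small and a large test set.
for n_test in (100, 100_000):
    measured = [rng.binomial(n_test, p) / n_test for p in true_success]
    print(f"test set of {n_test}: " + ", ".join(f"{m:.3f}" for m in measured))
```

With only 100 test items, the weaker models’ scores often round to exactly zero, so the improvement only becomes visible once a model is already fairly capable; with 100,000 items, the same smooth underlying curve is recovered.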

To drive the point further, the researchers checked whether they could reproduce emergence in other types of deep neural networks. They ran tests on vision tasks with convolutional neural networks (CNNs). Their findings show that when they used non-linear metrics to evaluate the models, they observed the same kind of apparent emergence seen in LLMs.

Why it matters

The researchers make an important observation at the end of the paper: “The main takeaway is for a fixed task and a fixed model family, the researcher can choose a metric to create an emergent ability or choose a metric to ablate an emergent ability. Ergo, emergent abilities may be creations of the researcher’s choices, not a fundamental property of the model family on the specific task.”

While the researchers state that they do not claim that large language models cannot display emergent abilities, they stress that previously claimed emergent abilities in LLMs “might likely be a mirage induced by researcher analyses.”

The important takeaway is to take a more critical view of the performance of large language models. Given their impressive results, there is already a tendency to anthropomorphize LLMs or to attribute to them properties they do not possess.

I think the paper’s findings are important because they will help bring more rigor to the field and a better understanding of the effects of scaling models. A recent paper by Sam Bowman states that “when a lab invests in training a new LLM that advances the scale frontier, they’re buying a mystery box: They’re justifiably confident that they’ll get a variety of economically valuable new capabilities, but they can make few confident predictions about what those capabilities will be or what preparations they’ll need to make to be able to deploy them responsibly.” With better techniques to measure and predict improvement, scientists will be better equipped to evaluate the benefits and risks of larger models.

This approach also encourages exploring alternatives to ever-bigger LLMs. While only big tech companies can afford to train and test very large models, smaller organizations can do research on smaller models. With better metrics, they will be able to more accurately explore the capabilities of these smaller models and find new directions of research to improve them.
