This article is part of our coverage of the latest in AI research.
Large language models (LLMs) have become the center of attention and hype because of their seemingly magical abilities to produce long stretches of coherent text, do things they weren’t trained on, and engage (to some extent) in topics of conversation that were thought to be off-limits for computers.
But there is still a lot to be learned about the way LLMs work and don’t work. A new study by researchers at Google, Stanford University, DeepMind, and the University of North Carolina at Chapel Hill explores novel tasks that LLMs can accomplish as they grow larger and are trained on more data.
The study sheds light on the relation between the scale of large language models and their “emergent” abilities.
What is emergence?
This new study focuses on emergence in the sense that has long been discussed in domains such as physics, biology, and computer science. In an essay titled “More is Different,” Nobel laureate physicist Philip Anderson discussed the idea that quantitative changes can lead to qualitatively different and unexpected phenomena.
Inspired by Anderson’s work, Jacob Steinhardt, a professor at UC Berkeley, defined emergence as “when quantitative changes in a system result in qualitative changes in behavior.”
“Since we wanted to provide a more precise definition, we defined emergent abilities as abilities that are ‘not present in smaller models but are present in larger models,’” Rishi Bommasani, PhD student at Stanford University and co-author of the paper, told TechTalks.
To identify emergent abilities in large language models, the researchers looked for phase transitions, where below a certain threshold of scale, model performance is near-random, and beyond that threshold, performance is well above random.
“This distinguishes emergent abilities from abilities that smoothly improve with scale: it is much more difficult to predict when emergent abilities will arise,” Bommasani said.
Scale can be measured in different ways, including computation (FLOPs), model size (number of parameters), or data size. In their study, the researchers focus on computation and model size, but stress that “there is not a single proxy that adequately captures all aspects of scale.”
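The phase-transition framing can be made concrete: given accuracy measured at several scales (whatever proxy is used), flag the first scale at which performance rises well above the random baseline. The following is a minimal sketch; the scales, accuracies, and margin below are illustrative numbers, not data from the paper:

```python
def emergence_point(scales, accuracies, random_baseline, margin=0.10):
    """Return the first scale at which accuracy exceeds the random
    baseline by more than `margin`, or None if no such jump occurs.

    A smoothly improving ability clears the margin gradually; an
    emergent ability sits near `random_baseline` and then jumps.
    """
    for scale, acc in zip(scales, accuracies):
        if acc > random_baseline + margin:
            return scale
    return None

# Illustrative numbers: a two-choice task (random baseline = 0.5)
# measured at four model scales (e.g., training FLOPs).
scales = [1e20, 1e21, 1e22, 1e23]
accuracies = [0.49, 0.51, 0.52, 0.78]  # near-random, then a jump

print(emergence_point(scales, accuracies, random_baseline=0.5))
```

The same sweep works whether scale is measured in FLOPs, parameters, or tokens of training data; as the paper notes, no single proxy captures every aspect of scale.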
Emergent abilities in large language models
Large language models are an especially interesting case study because they have shown very clear signs of emergence. LLMs are very large transformer neural networks, often spanning hundreds of billions of parameters, trained on hundreds of gigabytes of text data. They can be used for a wide range of tasks, including text generation, question answering, summarization, and more.
One of the interesting features of LLMs is their capacity for few-shot and zero-shot learning, the ability to perform tasks that were not included in their training examples. Few-shot learning in LLMs drew much attention with the introduction of OpenAI’s GPT-3 in 2020, and its extent and limits have been much studied since then.
In their study, the researchers tested several popular LLM families, including LaMDA, GPT-3, Gopher, Chinchilla, and PaLM. They chose several tasks from BIG-Bench, a crowd-sourced benchmark of over 200 tasks “that are believed to be beyond the capabilities of current language models.” They also used challenges from TruthfulQA, Massive Multi-task Language Understanding (MMLU), and Word in Context (WiC), all benchmarks that are designed to test the limits of LLMs in tackling complicated language tasks.
The researchers also made a special effort to test the LLMs on multi-step reasoning, instruction following, and multi-step computation.
“GPT-3 is iconic in having introduced the truly distinctive first wave of emergent abilities in LMs with the now well-known few-shot prompting/in-context learning,” Bommasani said. “Here, a task can be specified in natural language with a description and maybe five or so examples of the input-output structure of the task, and the largest models (i.e., the 175B model) could do fairly well on some tasks. In other words, you needed much less task-specific data and could specify the task without having to do fine-tuning/gradient-based methods.”
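The prompting pattern Bommasani describes, a natural-language task description plus roughly five solved input-output examples, is just structured text handed to the model. A minimal sketch (the translation task and examples here are invented for illustration):

```python
def build_few_shot_prompt(description, examples, query):
    """Assemble a few-shot prompt: a task description, a handful of
    solved input->output examples, then the unsolved query for the
    model to complete."""
    lines = [description, ""]
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Hypothetical task: English-to-French translation, three examples.
prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("dog", "chien"), ("house", "maison")],
    "book",
)
print(prompt)
```

A sufficiently large model completes the final “Output:” line in the pattern of the examples; no fine-tuning or gradient updates are involved, which is exactly what made GPT-3’s in-context learning notable.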
The findings of the study show that scale is highly correlated with the emergence of new abilities. Each of the LLM families, all of which come in several sizes, shows near-random performance on the tasks below a certain size. Beyond that threshold, accuracy jumps suddenly and continues to improve as the models grow larger.
“An interesting example is the Word in Context (WiC) benchmark of Pilehvar and Camacho-Collados (2019). On that benchmark, GPT-3 and Chinchilla basically get random one-shot performance but PaLM, which uses about 5x as many FLOPs, finally demonstrates performance well-above chance,” Bommasani said.
The reasons for emergent behavior in LLMs
The presence of emergent abilities in large language models shows that we can’t predict the capabilities of LLMs by extrapolating from the performance of smaller-scale models.
“Emergent few-shot prompted tasks are also unpredictable in the sense that these tasks are not explicitly included in pre-training, and we likely do not know the full scope of few-shot prompted tasks that language models can perform. The overall implication is that further scaling will likely endow even larger language models with new emergent abilities,” the researchers write.
However, one outstanding question is whether the models are really learning the knowledge required for these emergent skills. Some studies show that when a neural network provides correct results, it is often mapping inputs to outputs without learning the causal relations, common sense, and other knowledge underlying the learned skill.
“In general, how LMs acquire capabilities/skills is not well understood at a conceptual level,” Bommasani said. “In broad strokes, I would say there is (i) evidence that models become more robust in some ways with scale, (ii) that even our best models are not robust/stable in critical ways that I would not expect to be resolved by scale, and (iii) the relationship overall between robustness/stability/causality and scale is not well-understood.”
In their paper, the researchers also discuss some of the limits of scale, including hardware and data bottlenecks. Moreover, they observe that some abilities might not even emerge with scale, including tasks that are far out of the distribution of the model’s training dataset. They also warn that once an ability emerges, there’s no guarantee that it will continue to improve with scale.
“I do not expect all desired behavior to be emergent, but I do expect there is more we will see as we scale (especially in regimes beyond the dense autoregressive Transformers that are text-only English-only),” Bommasani said. “At a higher level, I expect we will continue to see significant surprises in the foundation models paradigm for a while; the progress from Minerva on the MATH benchmark surprising professional forecasters is one concrete recent example.”
Exploring alternatives to scale
As the machine learning community moves toward creating larger language models, there’s growing concern that research and development on LLMs will be centralized within a few organizations that have the financial and computational resources to train and run the models. There have been several efforts to democratize LLMs by releasing open-source models or reducing the costs and technical overhead of training and running them.
In their paper, the researchers discuss some of the alternatives to scale, including fine-tuning smaller models on task-specific datasets. “Once an ability is discovered, further research may make the ability available for smaller scale models,” the authors write, referring to recent research on new fine-tuning methods to improve the accuracy of small-scale LLMs.
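The fine-tuning alternative has the same basic shape at any scale: start from pretrained weights and run a few gradient-descent passes over a small task-specific dataset. The toy below shrinks this to a one-parameter logistic model so the loop is visible end to end; it is a conceptual stand-in, not how LLMs are actually fine-tuned, and all the numbers are invented:

```python
import math

def fine_tune(weight, bias, data, lr=0.5, epochs=200):
    """Take a 'pretrained' 1-D logistic model and run gradient-descent
    epochs on a small task-specific dataset. Real LLM fine-tuning
    updates billions of parameters, but the loop has the same shape."""
    for _ in range(epochs):
        for x, y in data:
            pred = 1 / (1 + math.exp(-(weight * x + bias)))
            err = pred - y          # gradient of the log-loss w.r.t. logit
            weight -= lr * err * x
            bias -= lr * err
    return weight, bias

# Task-specific data: label is 1 when x > 0 (illustrative).
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

# The 'pretrained' weight starts out wrong for this task (w < 0);
# fine-tuning adapts it to the new data.
w, b = fine_tune(weight=-0.2, bias=0.0, data=data)
acc = sum((1 / (1 + math.exp(-(w * x + b))) > 0.5) == y
          for x, y in data) / len(data)
print(acc)  # the tuned model now fits the small dataset
```

The point of the alternatives the paper discusses is that, once an emergent ability has been identified in a large model, task-specific adaptation like this may make it reachable by much smaller ones.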
“As we continue to train ever-larger language models, lowering the scale threshold for emergent abilities will become more important for allowing research on such abilities to [become] available to the community broadly,” the researchers write.
“The benefits of scale, whether via emergence or not, may incentivize the concentration of resources that one could project will motivate/exacerbate the centralization of power,” Bommasani said. “Historically, it is clear that AI research has benefited immensely from collaboration across academia and industry with strong traditions of open science. In light of the resource-intensive nature of scaling, I believe these must persist with several complementary paths forward: (i) norms to govern access to existing models for researchers, (ii) open collaborations (e.g. BigScience, EleutherAI, Masakhane, ML Collective) to build new models supported by structural changes that support decentralization, (iii) structural resources to provide the necessary compute and data (e.g., the National Research Cloud in the US as a National AI Research Resource).”
What is certain is that large language models will remain a mainstay of machine learning research for the foreseeable future. And as they move into real-world applications, we need to continue studying their capabilities and limits.
“Emergent abilities of LLMs have had a significant impact on NLP, concretely shifting the research in the field to better understand and develop such abilities. They also have sociologically influenced the overall nature of NLP and AI, indicating scale is an important factor in current systems,” Bommasani said. “We should build shared understanding of these abilities as well as explore the unrealized potential and ultimate limits of scale.”