This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.
For decades, researchers have used benchmarks to measure progress in areas of artificial intelligence such as vision and language. In the past few years especially, as deep learning has surged in popularity, benchmarks have become a narrow focus for many research labs and scientists. But while benchmarks can help compare the performance of AI systems on specific problems, they are often taken out of context, sometimes with harmful results.
In a paper accepted at the NeurIPS 2021 conference, scientists at the University of California, Berkeley, the University of Washington, and Google outline the limits of popular AI benchmarks. The scientists warn that progress on benchmarks is often used to make claims of progress toward general areas of intelligence, far beyond the tasks these benchmarks are designed to measure.
“We do not deny the utility of such benchmarks, but rather hope to point to the risks inherent in their framing,” the researchers write.
Benchmarks for specific tasks
Benchmarks are datasets composed of tests and metrics that measure the performance of AI systems on specific tasks. An example is ImageNet, a popular benchmark for evaluating image classification systems. ImageNet contains millions of images labeled across more than a thousand categories. Performance on ImageNet is measured with metrics such as “top-1 accuracy” and “top-5 accuracy.” An image classifier scores 0.98 on top-5 accuracy if its five highest-ranked predictions include the correct label for 98 percent of ImageNet’s test images. Top-1 accuracy considers only the classifier’s single highest prediction.
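The top-k metric can be sketched in a few lines. The function below is a minimal illustration of the idea, not ImageNet’s official evaluation code: for each sample it ranks the classes by score and checks whether the true label appears among the top k.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-sample score lists, one score per class.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices by score, highest first, and keep the top k
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)

# Toy example: 3 samples, 4 classes
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1 has the highest score
    [0.50, 0.10, 0.30, 0.10],  # true class 2 has the second-highest score
    [0.70, 0.10, 0.15, 0.05],  # true class 3 has the lowest score
]
labels = [1, 2, 3]

print(top_k_accuracy(scores, labels, k=1))  # 1 of 3 samples correct
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 samples correct
```

With k=1 only the first sample counts as correct; raising k to 2 also rescues the second sample, which is why top-5 accuracy is always at least as high as top-1 accuracy on the same test set.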
Benchmarks such as ImageNet and the General Language Understanding Evaluation (GLUE) have become very popular in the past decade thanks to growing interest in deep learning. Extensive work in the field has shown that as you add more layers to deep learning models and train them on larger datasets, they perform better on benchmark tests.
However, better performance on ImageNet and GLUE does not necessarily bring AI closer to general abilities such as understanding language and visual information as humans do.
ImageNet measures performance on specific types of objects under the conditions represented in the dataset. Likewise, GLUE and its more advanced successor, SuperGLUE, are not measures of language understanding in general.
Benchmarks for the whole wide world
“We had a shared frustration about the focus on chasing SOTA (state of the art) on leaderboards in ML and other fields where ML gets applied (including NLP) and a strong skepticism about the claims of generality,” Emily M. Bender, Professor of Linguistics at the University of Washington and co-author of the paper, told TechTalks.
The authors compare benchmarks to the Sesame Street children’s storybook Grover and the Everything in the Whole Wide World Museum. In the book, Grover visits a museum that claims to have “everything in the whole wide world.” The museum has rooms for all sorts of crazy categories, such as “things you see in the sky,” “things you see on the ground,” “things that are on the wall,” underwater things, carrots, noisy things, and much more.
After going through many rooms, Grover says, “I have seen many things in this museum, but I still have not seen everything in the whole wide world. Where did they put everything else?” And then he finds a door labeled “Everything Else.” The door opens to the outside world.
The main lesson from the story is that you can’t classify everything in the world in a finite set of categories. Accordingly, Bender and her coauthors have aptly titled the paper, “AI and the Everything in the Whole Wide World Benchmark,” after the Sesame Street storybook.
“In a relatively early meeting, the discussion of the way in which benchmarks are treated as fully representative of the world reminded me of a storybook I had really loved as a child: Grover and the Everything in the Whole Wide World Museum,” Bender said. “That story was new to the others, but the metaphor clicked and we ([lead author Deborah Raji] in particular) ran with it.”
In the paper, the authors write: “We argue that benchmarks presented as measurements of progress towards general ability within vague tasks such as ‘visual understanding’ or ‘language understanding’ are as ineffective as the finite museum is at representing ‘everything in the whole wide world,’ and for similar reasons—being inherently specific, finite and contextual.”
When benchmarks go beyond their limits
Benchmarks are taken out of context and projected beyond their limits at several stages, Bender says.
“First, benchmark creators frame their work as something general, rather than tightly scoped: ‘Visual understanding’ as opposed to ‘classification of photographs taken from XYZ sources into N classes as labeled by such-and-such crowd-workers,’” she said.
For example, the creators of ImageNet praised the benchmark as “the most comprehensive and diverse coverage of the image world” and later described the project as an attempt to “map the entire world of objects” (which is oddly similar to what Grover’s museum was trying to do). Likewise, the authors of GLUE and SuperGLUE described their benchmarks as “evaluation framework[s] for research towards general-purpose language understanding technologies.”
“Second, benchmark consumers work to optimize performance on the benchmark but talk about their results in terms of the general framing rather than what their actual experiments could possibly show,” Bender said.
In the paper, the authors highlight some of the limits these benchmarks suffer from. For example, the ImageNet dataset was gathered from photos posted online, in which the objects are usually centered in the image and seen from typical angles. Several studies and examples show that machine learning models optimized for such benchmarks perform poorly in real-world situations.
But when researchers overstate the generality of these benchmark datasets, they gradually become the main target of the field. And this puts researchers into a trap of chasing better benchmark results. Unfortunately, the AI community—which includes reviewers of papers in esteemed conferences—often considers incremental improvements to benchmark performance as novel and interesting work.
“Third, company or university PR firms and the media then go on to talk about the results as if ‘AI’ is a reality,” Bender said. “That is, when terms like ‘visual understanding’ or ‘understanding language’ are used without any definitions, people naturally interpret them to mean ‘understanding like how I understand what I see/hear/read’ when that has not in fact been established. Sometimes the PR folks will even claim that computers do this better than humans!”
For example, in a blog post in January 2021, Microsoft announced that its DeBERTa language model surpassed human performance on the SuperGLUE benchmark. The blog post states that the model does not reach human-level natural language understanding, but nonetheless describes DeBERTa’s performance on SuperGLUE as “an important milestone toward general AI.”
The risks of misunderstanding benchmarks
“The more the public believes that ‘AI’ can ‘understand’ (visual scenes and/or language), the more primed the public is to accept systems that purportedly use ‘AI’ to screen job applicants, determine who might commit a crime, detect whether test-takers are cheating, etc.,” Bender said.
Unwarranted trust in machine learning systems has led to many failures in recent years, ranging from embarrassing ones, such as mislabeled images, to harmful ones, such as loans declined for the wrong reasons.
“If the public at large has an inaccurate sense of what ‘AI’ can do (due to hyped results from purportedly general benchmarks), the public at large is ill-positioned to reject the application of these systems. The result: we will be stuck with systems presented as ‘objective’ (because they’re computers) which are in fact perpetuating racism, sexism, ableism, transphobia, ageism, etc.,” Bender said.
At the same time, the focus on benchmark performance has brought a lot of attention to machine learning at the expense of other promising directions of research. Thanks to the growing availability of data and computational resources, many researchers find it easier to train very large neural networks on huge datasets to move the needle on a well-known benchmark than to experiment with alternative approaches.
“I think the focus on benchmarks as the primary measure of ‘progress’ has left a lot of work out in the cold, specifically work that seeks to understand the shape of the problem, work on languages without large quantities of data (and associated benchmarks), and work that seeks to understand the relative strengths of different approaches (even to the benchmark-having problems),” Bender said. “I’m not saying that there isn’t any work like this, just that there is less than there should be.”
Bender also believes that the focus on benchmarks encourages a fast pace of research, where many researchers work on the same problems and race to be the first to apply a certain approach to a given problem and claim SOTA results, albeit temporarily.
“If papers showing SOTA on benchmarks are considered publishable (by default), then the overall expectations of pace (how many papers an individual should publish each year, etc.) go up,” she said. “Everyone is constantly rushing to publish while simultaneously being overwhelmed by reviewing demands and trying to keep up with the literature. And it’s even harder to make time for science that happens on a slower time scale. This isn’t healthy for the researchers nor for the field, and it also exacerbates issues around lack of diversity: the people who are able to keep up that pace are the ones who don’t have caregiving responsibilities, health issues, substantial community commitments, etc.”
Benchmarks in context
In their paper, the scientists highlight that just as no museum can catalog “everything in the whole wide world,” no dataset will be able to “capture the full complexity of the details of existence.”
The paper suggests two paths for future work on benchmarks. First, benchmarks should be developed, presented, and understood as intended—to evaluate concrete, well-scoped, and contextualized tasks.
Second, researchers should use alternative methods to probe their models for broader objectives, behaviors, and capabilities. The authors suggest techniques that can spot harmful biases, blind spots, and potential failures, measure energy and memory consumption, and capture other important aspects that go unnoticed in classic benchmarks.
“Benchmarking, appropriately deployed, is not about winning a contest but more about surveying a landscape—the more we can re-frame, contextualize and appropriately scope these datasets, the more useful they will become as an informative dimension to more impactful algorithmic development and alternative evaluation methods,” the authors write.