OpenAI finally released GPT-5.2 after declaring “code red” following Google’s release of Gemini 3 Pro. But did OpenAI reclaim the throne in the AI race? The answer is complicated.
GPT-5.2 sets a new record on the prestigious ARC-AGI-2 benchmark, which evaluates the ability of models to solve visual puzzles that require abstract reasoning. It also leads on several practical benchmarks, including GDPval, SWE-Bench Verified, and GPQA.
Meanwhile, at the time of this writing, GPT-5.2 still lags behind Gemini 3 Pro in the overall rankings of the independent Artificial Analysis Index and the Epoch Capabilities Index (ECI). On the Simple Bench leaderboard, which tracks how well LLMs handle simple reasoning questions, GPT-5.2 Pro, the most capable version of the model, sits at a disappointing 8th place.
And if you look on X, you’ll find all kinds of opinions and anecdotes about GPT-5.2 being either one step away from artificial general intelligence (AGI) or too dumb to count the number of r’s in the word “garlic.”
AI benchmarks are a mess, and they are confusing and frustrating users. They were created to measure the capabilities of models across different categories of tasks. But over time, they have lost much of their purpose and value. There are a few reasons for that.
1 – The nature of the market
Today, AI labs are using benchmarks to one-up each other, earn bragging rights, secure the next round of funding, or raise stock prices. There is a lot of pressure on companies to prove their models are better than those of their competitors. And in the absence of other measures, benchmarks are becoming the main vector of competition.
This has led to “benchmaxxing,” the practice of overfitting models to perform well on key benchmarks. Companies benchmaxx in different ways. Some are outright cheating, such as training directly on the benchmark data.
This causes the model to overfit on that particular dataset but perform very poorly on real-world tasks, even on tasks similar to those included in the benchmark.
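To see why training on test data inflates scores, here is a minimal sketch of the kind of contamination check researchers run before trusting a benchmark result: it flags benchmark questions whose word n-grams also appear in the training corpus. The function names and the 8-gram window are illustrative assumptions, not any lab’s actual tooling.

```python
# Minimal sketch of a data-contamination check: flag benchmark items whose
# word n-grams also appear in the training corpus. Names and the n-gram
# length are illustrative assumptions.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items: list[str], training_docs: list[str], n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    contaminated = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return contaminated / max(len(benchmark_items), 1)

# A high rate suggests the benchmark leaked into the training data,
# so its score says little about generalization.
print(contamination_rate(["what is the capital of france ? paris"],
                         ["trivia dump : what is the capital of france ? paris"]))
```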
Other benchmaxxing methods are more subtle. For example, AI labs might gather data similar to the benchmark’s and train their models on it. AI companies are reportedly paying huge sums to hire experts who create training examples targeted at specific benchmarks, such as the hard-to-beat Humanity’s Last Exam.
While this is not necessarily cheating, it gives an inaccurate impression of the model’s capabilities. The model might perform very well on that specific benchmark and problems that are similar to it but fail to generalize to the broader range of tasks that would be expected of it.
As long as benchmarks remain the main vector of competition, we can expect them to become subject to Goodhart’s Law: “When a measure becomes the target, it ceases to be a good measure.”
Another method is to… simply create a benchmark on which your own model excels. OpenAI released GDPval in September and DeepMind released the FACTS Benchmark Suite in December. Unsurprisingly, each company’s models lead the pack on its own benchmark.
2 – The nature of the models
Before the rise of large language models (LLMs), machine learning models were mostly trained for specific tasks. Their capabilities could be evaluated and compared through a few benchmarks. GPT-3 ushered in the era of general-purpose models that can perform many tasks and adapt to new tasks through in-context learning and few-shot examples. However, evaluating them is becoming increasingly challenging as you have to track dozens of benchmarks for each model.
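As a concrete illustration of few-shot in-context learning, here is a minimal prompt sketch: the “training” happens entirely inside the prompt, with no weight updates. The prompt format is an assumption for illustration and isn’t tied to any particular model or API.

```python
# A minimal sketch of few-shot in-context learning. The task is defined by a
# handful of examples in the prompt itself; no fine-tuning is involved.

few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts two full days.
Sentiment: positive

Review: The screen cracked within a week.
Sentiment: negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""

# A general-purpose LLM is expected to complete this with "positive",
# even though it was never explicitly trained for sentiment classification.
print(few_shot_prompt)
```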
And training the models in a way that maintains their capabilities across all these different tasks and benchmarks is very difficult. Models can suffer from “catastrophic forgetting,” where training for a new task causes the model to lose its previous knowledge and capabilities. This can happen when the model is overtrained on a specific task or benchmark at the expense of others.
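One rough way to surface this, sketched below under the assumption of a hypothetical `evaluate(checkpoint, benchmark)` scorer, is to re-run a fixed benchmark suite before and after fine-tuning and flag any benchmark whose score drops sharply.

```python
# Illustrative sketch (not any lab's actual pipeline): detect catastrophic
# forgetting by comparing benchmark scores before and after fine-tuning.
# `evaluate` is a hypothetical callable returning a 0-1 score.

from typing import Callable

def forgetting_report(
    benchmarks: list[str],
    evaluate: Callable[[str, str], float],   # (checkpoint, benchmark) -> score
    base_ckpt: str,
    tuned_ckpt: str,
    drop_threshold: float = 0.05,
) -> dict[str, float]:
    """Return benchmarks whose score dropped by more than `drop_threshold`."""
    drops = {}
    for bench in benchmarks:
        delta = evaluate(base_ckpt, bench) - evaluate(tuned_ckpt, bench)
        if delta > drop_threshold:
            drops[bench] = delta
    return drops
```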
Whether you love or hate a model, you’re likely to find a benchmark to prove your point.
Another problem is “jagged intelligence,” a term coined by AI researcher Andrej Karpathy. LLMs can “both perform extremely impressive tasks (e.g. solve complex math problems) while simultaneously struggle with some very dumb problems.” So you can have a model that scores super-high on a challenging benchmark like HLE but fails at a simple task that requires basic intuition, such as those laid out in Simple Bench.
LLMs (and machine learning models in general) develop skills in a very different way from humans. In seconds, they can ingest huge amounts of data that would take humans ages to read. And they might arrive at the correct answer, but not necessarily in the same way humans do. They can reward-hack their way to the answer without necessarily performing the intermediate steps we would expect of them.
Which brings me to the third point…