OpenAI finally released GPT-5.2 after declaring “code red” following Google’s release of Gemini 3 Pro. But did OpenAI reclaim the throne in the AI race? The answer is complicated.
GPT-5.2 sets a new record on the prestigious ARC-AGI-2 benchmark, which evaluates the ability of models to solve visual puzzles that require abstract reasoning. It also leads on several practical benchmarks, including GDPval, SWE-Bench Verified, and GPQA.
Meanwhile, at the time of this writing, GPT-5.2 still lags behind Gemini 3 Pro in the overall rankings of the independent Artificial Analysis Index and the Epoch Capabilities Index (ECI). On the Simple Bench leaderboard, which tracks how well LLMs handle simple reasoning questions, GPT-5.2 Pro, the most capable version of the model, sits at a disappointing 8th place.
And if you look on X, you’ll find all kinds of opinions and anecdotes about GPT-5.2 being either one step away from artificial general intelligence (AGI) or too dumb to count the number of r’s in the word “garlic.”
AI benchmarks are a mess, and they are confusing and frustrating users. They were created to measure the capabilities of models across different categories of tasks. But over time, they have lost much of their purpose and value. There are a few reasons for that.
1 – The nature of the market
Today, AI labs are using benchmarks to one-up each other, earn bragging rights, secure the next round of funding, or raise stock prices. There is a lot of pressure on companies to prove their models are better than those of their competitors. And in the absence of other measures, benchmarks are becoming the main vector of competition.
This has led to “benchmaxxing,” the practice of overfitting models to perform well on key benchmarks. Companies benchmaxx in different ways. Some are outright cheating, such as training directly on the benchmark data.
This causes the model to overfit on that particular dataset but perform very poorly on real-world tasks, even on tasks similar to those included in the benchmark.
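To see why training on test data inflates scores, here is a minimal sketch of the kind of contamination check researchers run before trusting a benchmark result: it flags benchmark questions whose word n-grams also appear in the training corpus. The function names and the 8-gram window are illustrative assumptions, not any lab’s actual tooling.

```python
# Minimal sketch of a data-contamination check: flag benchmark items whose
# word n-grams also appear in the training corpus. Names and the n-gram
# length are illustrative assumptions.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items: list[str], training_docs: list[str], n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    contaminated = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return contaminated / max(len(benchmark_items), 1)

# A high rate suggests the benchmark leaked into the training data,
# so its score says little about generalization.
print(contamination_rate(["what is the capital of france ? paris"],
                         ["trivia dump : what is the capital of france ? paris"]))
```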
Other benchmaxxing methods are more subtle. For example, AI labs might gather data similar to the benchmark’s and train their models on it. AI companies are reportedly paying huge sums to hire experts who create training examples targeted at specific benchmarks, such as the hard-to-beat Humanity’s Last Exam.
While this is not necessarily cheating, it gives an inaccurate impression of the model’s capabilities. The model might perform very well on that specific benchmark and problems that are similar to it but fail to generalize to the broader range of tasks that would be expected of it.
As long as benchmarks remain the main vector of competition, we can expect them to become subject to Goodhart’s Law: “When a measure becomes the target, it ceases to be a good measure.”
Another method is to… simply create a benchmark on which your own model excels. OpenAI released GDPval in September and DeepMind released the FACTS Benchmark Suite in December. Unsurprisingly, each company’s models lead the pack on its own benchmark.
2 – The nature of the models
Before the rise of large language models (LLMs), machine learning models were mostly trained for specific tasks. Their capabilities could be evaluated and compared through a few benchmarks. GPT-3 ushered in the era of general-purpose models that can perform many tasks and adapt to new tasks through in-context learning and few-shot examples. However, evaluating them is becoming increasingly challenging as you have to track dozens of benchmarks for each model.
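As a concrete illustration of few-shot in-context learning, here is a minimal prompt sketch: the “training” happens entirely inside the prompt, with no weight updates. The prompt format is an assumption for illustration and isn’t tied to any particular model or API.

```python
# A minimal sketch of few-shot in-context learning. The task is defined by a
# handful of examples in the prompt itself; no fine-tuning is involved.

few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts two full days.
Sentiment: positive

Review: The screen cracked within a week.
Sentiment: negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""

# A general-purpose LLM is expected to complete this with "positive",
# even though it was never explicitly trained for sentiment classification.
print(few_shot_prompt)
```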
And training the models in a way that maintains their capabilities across all these different tasks and benchmarks is very difficult. Models can suffer from “catastrophic forgetting,” where training for a new task causes the model to lose its previous knowledge and capabilities. This can happen when the model is overtrained on a specific task or benchmark at the expense of others.
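One rough way to surface this, sketched below under the assumption of a hypothetical `evaluate(checkpoint, benchmark)` scorer, is to re-run a fixed benchmark suite before and after fine-tuning and flag any benchmark whose score drops sharply.

```python
# Illustrative sketch (not any lab's actual pipeline): detect catastrophic
# forgetting by comparing benchmark scores before and after fine-tuning.
# `evaluate` is a hypothetical callable returning a 0-1 score.

from typing import Callable

def forgetting_report(
    benchmarks: list[str],
    evaluate: Callable[[str, str], float],   # (checkpoint, benchmark) -> score
    base_ckpt: str,
    tuned_ckpt: str,
    drop_threshold: float = 0.05,
) -> dict[str, float]:
    """Return benchmarks whose score dropped by more than `drop_threshold`."""
    drops = {}
    for bench in benchmarks:
        delta = evaluate(base_ckpt, bench) - evaluate(tuned_ckpt, bench)
        if delta > drop_threshold:
            drops[bench] = delta
    return drops
```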
Whether you love or hate a model, you’re likely to find a benchmark to prove your point.
Another problem is “jagged intelligence,” a term coined by AI researcher Andrej Karpathy. LLMs can “both perform extremely impressive tasks (e.g. solve complex math problems) while simultaneously struggle with some very dumb problems.” So you can have a model that scores super-high on a challenging benchmark like HLE but fails at a simple task that requires basic intuition, such as those laid out in Simple Bench.
LLMs (and machine learning models in general) develop skills in a very different way from humans. In seconds, they can ingest huge amounts of data that would take humans ages to read. And they might arrive at the correct answer, but not necessarily in the same way humans do. They can reward-hack their way to the answer without necessarily performing the intermediate steps we would expect of them.
Which brings me to the third point…