The truth about ChatGPT’s degrading capabilities

July 24, 2023

Image source: 123RF (with modifications)

This article is part of our coverage of the latest in AI research.

There have been many recent discussions about the capabilities of ChatGPT models like GPT-3.5 and GPT-4 degrading over time. OpenAI has publicly rejected these claims.

A new study from researchers at Stanford University and UC Berkeley provides evidence that the behaviors of these large language models have “drifted substantially” (which is different from the degradation of capabilities).

The findings are a warning about the risks of building applications on top of black-box AI systems like ChatGPT that could produce inconsistent or unpredictable results over time. The lack of transparency on how models like GPT-3.5 and GPT-4 are trained and updated makes it impossible to anticipate or explain shifts in their performance.

Complaints over ChatGPT’s poor performance

As far back as May, users were complaining on the OpenAI forum about GPT-4 “struggling to do things it did well previously.” Users were dissatisfied not only with the degrading performance but also with OpenAI’s lack of responsiveness and explanation.

On July 12, Business Insider reported that users were describing GPT-4 as “lazier” and “dumber” than before, pointing to its reasoning and other output. Absent a response from OpenAI, experts started speculating on the reasons behind GPT-4’s performance drop.

Some suggested that OpenAI was using smaller models behind the API to cut the costs of running ChatGPT. Others speculated that the company was using a mixture-of-experts (MoE) approach, where several small specialized models replace a large, generalized LLM.

OpenAI later rejected the idea that it was intentionally making GPT-4 dumber. “No, we haven’t made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one,” Peter Welinder, VP Product at OpenAI, tweeted on July 13, 2023. “Current hypothesis: When you use it more heavily, you start noticing issues you didn’t see before.”

Testing ChatGPT’s performance over time

To verify how ChatGPT’s behavior changed over time, the researchers from Stanford and UC Berkeley tested the March 2023 and June 2023 versions of both GPT-3.5 and GPT-4. They evaluated the models on four common benchmark tasks: solving math problems, answering sensitive questions, generating code, and visual reasoning.

They chose these four areas because they are diverse tasks frequently used to evaluate LLMs, and they are relatively objective and thus easy to evaluate.

The researchers used two sets of metrics to evaluate the performance of the models. The main metrics were task-specific (e.g., accuracy for math, direct execution for coding). They also tracked verbosity (length of output) and overlap (the level of similarity between the answers of two versions of the same LLM).
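To make these metrics concrete, here is a minimal sketch in Python. The function names, the exact-match comparison, and the sample answers are assumptions made for illustration; the paper's own definitions and data may differ in detail.

```python
def normalize(answer: str) -> str:
    """Lowercase and trim an answer so trivial formatting differences are ignored."""
    return answer.strip().lower()


def accuracy(answers, expected):
    """Fraction of answers that match the expected label (task-specific metric)."""
    hits = sum(normalize(a) == normalize(e) for a, e in zip(answers, expected))
    return hits / len(expected)


def verbosity(answers):
    """Average response length in characters."""
    return sum(len(a) for a in answers) / len(answers)


def overlap(old_answers, new_answers):
    """Fraction of prompts for which two model versions gave the same answer."""
    same = sum(normalize(o) == normalize(n) for o, n in zip(old_answers, new_answers))
    return same / len(old_answers)


# Hypothetical answers from two versions of one model to the same four prompts.
expected = ["Yes", "Yes", "Yes", "Yes"]
march = ["Yes", "Yes", "No", "Yes"]
june = ["No", "No", "No", "Yes"]

march_accuracy = accuracy(march, expected)
june_accuracy = accuracy(june, expected)
version_overlap = overlap(march, june)
march_verbosity = verbosity(march)
```

Tracking overlap alongside accuracy is useful because two versions can reach similar scores while disagreeing on many individual prompts.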

ChatGPT’s performance has drifted from March to June

For math problems, the researchers used “chain-of-thought” prompting, often used to elicit reasoning capabilities in LLMs. Their findings show a significant drift in the models’ performance: GPT-4’s accuracy plunged from 97.6 percent to 2.4 percent from March to June, while its response length dropped by over 90 percent. GPT-3.5 showed the opposite trend, with accuracy rising from 7.4 to 86.8 percent and verbosity increasing 40 percent. The authors note that this “interesting phenomenon indicates that the same prompting approach, even these widely adopted such as chain-of-thought, could lead to substantially different performance due to LLM drifts.”

For answering sensitive questions, the LLMs were evaluated on how often they answered controversial prompts. GPT-4’s direct answer rate fell from 21 to 5 percent from March to June, suggesting that the model has become more conservative.

Meanwhile, GPT-3.5 went from directly answering 2 percent of the questions to answering 8 percent of them. Both models also provided less explanation when refusing inappropriate questions in June compared to March. “These LLM services may have become safer, but also provide less rationale for refusing to answer certain questions,” the researchers write.

In code generation, the researchers tested whether the LLMs’ outputs were directly executable by submitting them to an online judge that runs and evaluates code. They found that over 50 percent of GPT-4’s outputs were directly executable in March, but only 10 percent in June. For GPT-3.5, executable outputs dropped from 22 percent in March to 2 percent in June. The June versions often added non-executable sequences like triple backticks (```) around code snippets. “This is particularly challenging to identify when LLM’s generated code is used inside a larger software pipeline,” the researchers warn.
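A pipeline that consumes LLM-generated code can guard against this kind of formatting drift by stripping Markdown fences before running anything. The sketch below is one possible approach, not the researchers' method: `strip_code_fences` and `is_executable` are hypothetical helpers, and Python's built-in `compile` is only a syntax check standing in for the online judge used in the study.

```python
import re

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3

# Matches an entire response wrapped in one fence, with an optional language tag.
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[\w+-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fences(text: str) -> str:
    """Return the code inside a fenced block, or the text unchanged if unfenced."""
    match = FENCE_RE.match(text)
    return match.group(1) if match else text


def is_executable(source: str) -> bool:
    """Rough stand-in for 'directly executable': does the source parse as Python?"""
    try:
        compile(source, "<llm-output>", "exec")
        return True
    except SyntaxError:
        return False


# Hypothetical model output in the style the June versions produced.
fenced = FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE

raw_ok = is_executable(fenced)
cleaned_ok = is_executable(strip_code_fences(fenced))
```

Here `raw_ok` is `False` because the backticks are not valid Python, while `cleaned_ok` is `True` once the fence is removed. Logging how often the cleanup step fires is a cheap signal that the model's output format has changed.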

For visual reasoning, the researchers evaluated the models on a subset of examples from the Abstract Reasoning Corpus (ARC) dataset, a collection of visual puzzles that test a model’s ability to infer abstract rules. They noticed marginal performance improvements for both GPT-4 and GPT-3.5. But performance was low overall, 27.4 percent for GPT-4 and 12.2 percent for GPT-3.5. However, the June version of GPT-4 made mistakes on some queries it correctly answered in March. “This underlines the need of fine-grained drift monitoring, especially for critical applications,” the researchers write.
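Aggregate scores can hide this kind of per-query regression. As a rough illustration, the sketch below compares two versions query by query; the query IDs, the results, and the helper names are hypothetical.

```python
def aggregate_accuracy(results):
    """Overall fraction of queries answered correctly."""
    return sum(results.values()) / len(results)


def find_regressions(old_results, new_results):
    """Query IDs the old version answered correctly but the new version got wrong."""
    return sorted(
        query_id
        for query_id, was_correct in old_results.items()
        if was_correct and not new_results.get(query_id, False)
    )


# Hypothetical per-query correctness for two versions of the same model.
march = {"q1": True, "q2": False, "q3": True, "q4": False}
june = {"q1": True, "q2": True, "q3": False, "q4": True}

march_score = aggregate_accuracy(march)
june_score = aggregate_accuracy(june)
regressions = find_regressions(march, june)
```

In this toy data the overall score rises from 0.5 to 0.75, yet `q3` still regresses, which is exactly the case an application depending on that query would need to catch.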

How far can you trust ChatGPT in your apps?

While the paper’s findings do not necessarily suggest that the models have gotten worse, they do confirm that the models’ behavior has changed. For example, in the coding examples, the model’s answers could be correct, but they contained a few artifacts that made them non-executable without a bit of cleaning.

The researchers conclude that the drift in GPT-3.5 and GPT-4’s behavior “highlights the need to continuously evaluate and assess the behavior of LLMs in production applications.”

As we build software systems that use LLMs as components, especially through public APIs, we need new development practices and workflows to ensure reliability and accountability. We have yet to discover and refine these practices.

“For users and companies using LLM services as a component in their ongoing workflow, we recommend that they should implement similar monitoring analysis as we do here for their applications,” the researchers write.

The findings also highlight the need for more transparency in the data and methods used to train and fine-tune LLMs. Without such transparency, building stable applications on top of them becomes very difficult.

ChatGPT’s behavior drift misinterpreted

In an article following the publication of the paper, Arvind Narayanan, computer scientist and professor at Princeton University, and Sayash Kapoor, computer scientist at Princeton University, argue that the media has misinterpreted the results of the paper as confirmation that GPT-4 has gotten worse.

“Unfortunately, this is a vast oversimplification of what the paper found. And while the findings are interesting, some of the methods are questionable,” they write.

For example, Narayanan and Kapoor found that all 500 math problems used in the evaluation were in the form “Is number X prime?” And all the numbers in the dataset were primes. The March version of GPT-4 almost always guesses that the number is prime, and the June version almost always guesses that it is composite.

“The authors interpret this as a massive performance drop — since they only test primes,” Narayanan and Kapoor write. When GPT-4 was tested on 500 composite numbers, the degradation was gone.
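A small simulation shows why a primes-only test set produces this effect. The two "guessers" below are deliberately trivial stand-ins for the behavior Narayanan and Kapoor describe, not real model calls, and the number range is chosen arbitrarily for illustration.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check used as ground truth."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def always_prime(n: int) -> bool:
    """Stand-in for a model that almost always answers 'prime'."""
    return True


def always_composite(n: int) -> bool:
    """Stand-in for a model that almost always answers 'composite'."""
    return False


def accuracy(guesser, numbers):
    """Fraction of numbers for which the guess matches the ground truth."""
    return sum(guesser(n) == is_prime(n) for n in numbers) / len(numbers)


candidates = range(1000, 1200)
primes = [n for n in candidates if is_prime(n)]
composites = [n for n in candidates if not is_prime(n)]
balanced = primes + composites[: len(primes)]  # equal numbers of each class

primes_only = (accuracy(always_prime, primes), accuracy(always_composite, primes))
balanced_set = (accuracy(always_prime, balanced), accuracy(always_composite, balanced))
```

On the primes-only set the two guessers score 1.0 and 0.0, which looks like a collapse in capability. On the balanced set both score 0.5, showing that neither is actually testing for primality.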

Figure: ChatGPT’s behavior changes, but that doesn’t necessarily mean its capabilities degrade (source: AI Snake Oil)

The computer scientists argue that the LLMs are pretending to be calculating prime numbers while guessing the outcome. “In reality, all four models are equally awful… They all guess based on the way they were calibrated. To simplify a bit, during fine tuning, maybe some model was exposed to more math questions involving prime numbers, and the other, composites,” they write. “In short, everything in the paper is consistent with the behavior of the models changing over time. None of it suggests a degradation in capability.”
