This article is part of our coverage of the latest in AI research.
There have been many recent discussions about the capabilities of ChatGPT models such as GPT-3.5 and GPT-4 degrading over time. OpenAI has publicly rejected these claims.
A new study from researchers at Stanford University and UC Berkeley provides evidence that the behaviors of these large language models have “drifted substantially” (which is not the same thing as a degradation of capabilities).
The findings are a warning about the risks of building applications on top of black-box AI systems like ChatGPT that could produce inconsistent or unpredictable results over time. The lack of transparency on how models like GPT-3.5 and GPT-4 are trained and updated makes it impossible to anticipate or explain shifts in their performance.
Complaints over ChatGPT’s poor performance
As far back as May, users were complaining on the OpenAI forum about GPT-4 “struggling to do things it did well previously.” Users were dissatisfied not only with the degrading performance but also with OpenAI’s lack of responsiveness and explanation.
On July 12, Business Insider reported that users were describing GPT-4 as “lazier” and “dumber” compared with its previous reasoning capabilities and other output. Absent a response from OpenAI, experts started speculating on the reasons behind GPT-4’s performance drop.
Some suggested that OpenAI was using smaller models behind the API to cut down the costs of running ChatGPT. Others speculated that the company was using a mixture-of-experts (MoE) approach, in which several small, specialized models replace a single large, generalized LLM.
OpenAI later rejected the idea that it was intentionally making GPT-4 dumber. “Quite the opposite: we make each new version smarter than the previous one,” Peter Welinder, VP Product at OpenAI, tweeted. “Current hypothesis: When you use it more heavily, you start noticing issues you didn’t see before.”
Testing ChatGPT’s performance over time
To verify how ChatGPT’s behavior changed over time, the researchers from Stanford and UC Berkeley tested two versions of GPT-3.5 and GPT-4, from March and June 2023. They evaluated the models on four common benchmark tasks: math problems, answering sensitive questions, code generation, and visual reasoning.
They chose these four areas because they are diverse tasks frequently used to evaluate LLMs, and they are relatively objective and thus easy to evaluate.
The researchers used two sets of metrics to evaluate the models’ performance. The main metrics were task-specific (e.g., accuracy for math problems, direct executability for code generation). They also tracked verbosity (the length of the output) and overlap (the degree of similarity between the answers of two versions of the same LLM).
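The two auxiliary metrics can be sketched in a few lines. This is my own minimal illustration, not the paper’s evaluation code; the function names and normalization are assumptions.

```python
def verbosity(answers):
    """Mean length, in characters, of a list of model answers."""
    return sum(len(a) for a in answers) / len(answers)

def overlap(answers_v1, answers_v2):
    """Fraction of prompts where two model versions give the same answer,
    after basic whitespace/case normalization."""
    norm = lambda s: s.strip().lower()
    matches = sum(norm(a) == norm(b) for a, b in zip(answers_v1, answers_v2))
    return matches / len(answers_v1)

march = ["Yes, 17077 is prime.", "No."]
june = ["no", "No."]
print(verbosity(march))      # 11.5 -- average answer length in characters
print(overlap(march, june))  # 0.5 -- the two versions agree on one of two prompts
```

A low overlap score flags behavior change even when aggregate accuracy stays flat, which is why the researchers track it alongside the task-specific metrics.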
ChatGPT’s performance has drifted from March to June
For math problems, the researchers used “chain-of-thought” prompting, often used to elicit reasoning capabilities in LLMs. Their findings show a significant drift in the models’ performance: GPT-4’s accuracy plunged from 97.6 percent to 2.4 percent from March to June, while its response length dropped by over 90 percent. GPT-3.5 showed the opposite trend, with accuracy rising from 7.4 to 86.8 percent and verbosity increasing 40 percent. The authors note that this “interesting phenomenon indicates that the same prompting approach, even these widely adopted such as chain-of-thought, could lead to substantially different performance due to LLM drifts.”
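For context, a chain-of-thought prompt for the primality task might look like the sketch below. This is a hypothetical template; the paper’s exact prompt wording may differ.

```python
def cot_prompt(n: int) -> str:
    """Build a chain-of-thought prompt asking whether n is prime."""
    return (
        f"Is {n} a prime number? "
        "Think step by step and then answer with Yes or No."
    )

print(cot_prompt(17077))
```

The “think step by step” instruction is what elicits intermediate reasoning; the drop in response length suggests the June model largely stopped producing those intermediate steps.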
For answering sensitive questions, the LLMs were evaluated on how often they answered controversial prompts. GPT-4’s direct answer rate fell from 21 to 5 percent from March to June, suggesting that the model has become more conservative.
Meanwhile, GPT-3.5 went from directly answering 2 percent of the questions to answering 8 percent of them. Both models also provided less explanation when refusing inappropriate questions in June compared to March. “These LLM services may have become safer, but also provide less rationale for refusing to answer certain questions,” the researchers write.
In code generation, the researchers tested whether the LLMs’ outputs were directly executable by submitting them to an online judge that runs and evaluates code. They found that over 50 percent of GPT-4’s outputs were directly executable in March, but only 10 percent in June. For GPT-3.5, executable outputs dropped from 22 percent in March to 2 percent in June. The June versions often wrapped code snippets in non-executable Markdown triple backticks. “This is particularly challenging to identify when LLM’s generated code is used inside a larger software pipeline,” the researchers warn.
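Wrapping output in Markdown fences (triple backticks) makes otherwise correct code fail direct execution. A minimal cleanup step, my own sketch rather than the paper’s harness, strips the fences before the code is run:

```python
import re

def strip_code_fences(output: str) -> str:
    """Remove a surrounding ```-fence (with optional language tag), if present."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", output, re.DOTALL)
    return match.group(1).strip() if match else output.strip()

raw = "```python\nprint(1 + 1)\n```"
code = strip_code_fences(raw)
exec(code)  # prints 2 -- executable only after the fence is removed
```

A pipeline that feeds model output straight into an interpreter would break on the fenced version, which is exactly the failure mode the researchers describe.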
For visual reasoning, the researchers evaluated the models on a subset of examples from the Abstraction and Reasoning Corpus (ARC) dataset, a collection of visual puzzles that test a model’s ability to infer abstract rules. They noticed marginal performance improvements for both GPT-4 and GPT-3.5. But performance was low overall, 27.4 percent for GPT-4 and 12.2 percent for GPT-3.5. However, the June version of GPT-4 made mistakes on some queries it had correctly answered in March. “This underlines the need of fine-grained drift monitoring, especially for critical applications,” the researchers write.
How far can you trust ChatGPT in your apps?
While the paper’s findings do not necessarily suggest that the models have gotten worse, they do confirm that the models’ behavior has changed. For example, in the coding tests, the model’s answers could be correct, yet contain a few artifacts that made them non-executable without a bit of cleaning.
The researchers conclude that the drift in GPT-3.5 and GPT-4’s behavior “highlights the need to continuously evaluate and assess the behavior of LLMs in production applications.”
As we build software systems that use LLMs as components, especially LLMs accessed through public APIs, we need new development practices and workflows to ensure reliability and accountability. We have yet to discover and refine these practices.
“For users and companies using LLM services as a component in their ongoing workflow, we recommend that they should implement similar monitoring analysis as we do here for their applications,” the researchers write.
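One way to follow that advice is to keep a fixed evaluation set, score each model snapshot against it, and alert when accuracy moves beyond a tolerance. The sketch below is illustrative; the helper names and the threshold are my own assumptions, not the researchers’ tooling.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def check_drift(baseline_acc, current_acc, tolerance=0.05):
    """Flag drift when accuracy moves more than `tolerance` in either direction."""
    return abs(current_acc - baseline_acc) > tolerance

labels = ["prime", "composite"]
march_acc = accuracy(["prime", "prime"], labels)          # 0.5
june_acc = accuracy(["composite", "composite"], labels)   # 0.5
print(check_drift(march_acc, june_acc))  # False -- same accuracy, yet every answer flipped
```

Note that accuracy alone misses the change in this example even though every answer flipped, which is why the paper also tracks answer overlap between versions.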
The findings also highlight the need for more transparency in the data and methods used to train and fine-tune LLMs. Without such transparency, building stable applications on top of them becomes very difficult.
ChatGPT’s behavior drift misinterpreted
In an article following the publication of the paper, Princeton University computer scientists Arvind Narayanan and Sayash Kapoor argue that the media has misinterpreted the results as confirmation that GPT-4 has gotten worse.
“Unfortunately, this is a vast oversimplification of what the paper found. And while the findings are interesting, some of the methods are questionable,” they write.
For example, Narayanan and Kapoor found that all 500 math problems used in the evaluation were of the form “Is number X prime?” and that every number in the dataset was prime. The March version of GPT-4 almost always guesses that the number is prime, while the June version almost always guesses that it is composite.
“The authors interpret this as a massive performance drop — since they only test primes,” Narayanan and Kapoor write. When GPT-4 was tested on 500 composite numbers, the degradation was gone.
The computer scientists argue that the LLMs are pretending to be calculating prime numbers while guessing the outcome. “In reality, all four models are equally awful… They all guess based on the way they were calibrated. To simplify a bit, during fine tuning, maybe some model was exposed to more math questions involving prime numbers, and the other, composites,” they write. “In short, everything in the paper is consistent with the behavior of the models changing over time. None of it suggests a degradation in capability.”
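Narayanan and Kapoor’s point can be reproduced with a toy “model” that always answers “prime”: on an all-prime test set it scores perfectly, and on an all-composite set it scores zero, despite never computing anything. This is my own illustration of their argument, not their code.

```python
def is_prime(n: int) -> bool:
    """Reference primality check by trial division."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def always_prime_model(n: int) -> bool:
    """Toy model that guesses 'prime' regardless of the input."""
    return True

primes = [p for p in range(2, 200) if is_prime(p)]
composites = [c for c in range(4, 200) if not is_prime(c)]

acc_on_primes = sum(always_prime_model(p) is True for p in primes) / len(primes)
acc_on_composites = sum(always_prime_model(c) is False for c in composites) / len(composites)
print(acc_on_primes, acc_on_composites)  # 1.0 0.0
```

Scoring 97.6 percent on an all-prime set is therefore indistinguishable from a well-calibrated guess, which is why a balanced set mixing primes and composites is needed to measure actual capability.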