This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.
Since the release of ChatGPT—and even more so since GPT-4—I have seen a recurring pattern of hype and disappointment. First, a study claims that ChatGPT, GPT-4, or [name your LLM] has passed or aced some difficult test designed for humans: the bar exam, math exams, MIT exams, coding competitions, comprehension tests, etc. And then, another study disproves the results of the previous study. It turns out that when examined more closely, the model is providing the right answers for the wrong reasons.
The science and research community is still exploring the correct ways to evaluate the capabilities of large language models (LLMs). In the meantime, we’re discovering why the initial results of LLMs on human tests are misleading. One of the key reasons is “data contamination,” which means the test examples were included in the model’s training data.
Data contamination is common in all areas of machine learning, and ML engineers take great care to avoid it. However, when it comes to LLMs, data contamination is more complicated, nuanced, and harder to detect. Here is what you need to know about LLM data contamination and how to avoid it.
Data contamination in machine learning
When training machine learning models, ML engineers split their dataset into train and test sets (in many cases they also add a validation dataset). As the name suggests, the train set, which usually accounts for the biggest part of the dataset, is used to train the model. As training proceeds, the model fits the training data more closely, and its performance on that data improves.
The test set determines whether the model can generalize to unseen examples. If the model performs much better on the train set than on the test set, then it has probably overfitted (i.e., memorized its training data) and needs to be revised.
This is why it is very important to make sure there is no overlap between the train and test sets. When training examples find their way into the test set, the dataset is said to be contaminated. A contaminated test set will provide misleading results because it will evaluate the model on examples it has already seen. It’s like giving students the answers along with the test. They might ace the test, but it doesn’t mean they learned the topic.
(When there is enough data, machine learning engineers add a validation set to compare different versions of the trained model and configure hyperparameters. Using separate validation and test sets helps avoid second-order data contamination. When you continuously evaluate the trained model on the validation set and adjust it based on the results, the data in the validation set ends up affecting the training process.)
The method for splitting the dataset depends on the type of problem that the model is solving. For example, if you’re solving a regression problem and there is no dependence between different examples in the dataset, then you can split them randomly. You just need to make sure that no example is included in both the train and test sets. If you’re solving a simple classification problem, in addition to random splitting, you must ensure that the classes are balanced across the training and test sets. If you’re solving a time series problem, then you must split your data based on the sequence of occurrence and make sure the examples in the test set all happen after those in the training set.
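To make the three splitting strategies concrete, here is a minimal sketch in plain Python. The function names (`random_split`, `stratified_split`, `temporal_split`) and the `label_of` and `time_of` accessors are my own illustrative choices, and the code assumes the examples are unique and small enough to fit in memory. In practice you would probably reach for a library such as scikit-learn instead.

```python
import random
from collections import defaultdict


def random_split(examples, test_fraction=0.2, seed=0):
    """Shuffle once, then cut, so each example lands in exactly one set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def stratified_split(examples, label_of, test_fraction=0.2, seed=0):
    """Split each class separately so both sets keep the class proportions."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[label_of(example)].append(example)
    train, test = [], []
    for label in sorted(by_label):  # assumes labels can be sorted
        class_train, class_test = random_split(by_label[label], test_fraction, seed)
        train.extend(class_train)
        test.extend(class_test)
    return train, test


def temporal_split(examples, time_of, test_fraction=0.2):
    """Sort by time; the most recent examples become the test set."""
    ordered = sorted(examples, key=time_of)
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]
```

Each function returns a `(train, test)` pair. The temporal split deliberately does no shuffling, because shuffling is exactly what would let future examples leak into the training set.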
For classic ML problems, detecting data contamination is usually straightforward. You can compare train and test examples, temporal features, class balance, etc. For LLMs, things become complicated.
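As a rough illustration of how simple these checks can be for classic datasets, the sketch below flags test examples that also appear in the train set and checks for temporal leakage. The helper names are hypothetical, and the normalization (lowercasing and collapsing whitespace) is only one assumption about what counts as "the same example."

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a duplicate."""
    return " ".join(text.lower().split())


def fingerprint(text):
    """Hash the normalized text so large datasets can be compared cheaply."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaked(train_texts, test_texts):
    """Return the test examples whose normalized form also appears in train."""
    seen = {fingerprint(text) for text in train_texts}
    return [text for text in test_texts if fingerprint(text) in seen]


def has_temporal_leak(train_times, test_times):
    """True if any training example is dated at or after the earliest test example."""
    return max(train_times) >= min(test_times)
```

Exact matching like this only catches copies that survive normalization. Paraphrased or translated duplicates would slip through, which is part of why the LLM case is harder.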
Why LLM data contamination is complicated
The same basic rule of separating train and test sets also applies to large language models. When you’re evaluating your LLM on a benchmark dataset, you must take care to not include your test examples in the model’s training data.
However, there are a few reasons that make it difficult to deal with data contamination in LLMs:
Dataset size: Foundational LLMs are trained on hundreds of billions or even trillions of tokens. The data comes from many sources and includes different languages, types of information, tasks, etc. It is really difficult to make sure that your test data or a version of it has not already been included in the dataset. There have been several examples where researchers reported that an LLM could solve a complicated task only to later find that the model could generate the examples verbatim, which meant they were included in its training data.
Prompting errors: LLMs can perform few-shot learning, where you include a few solved examples in the prompt to enable the model to perform a new task without updating its parameters. In one study, researchers developed an automated system that used similarity search to retrieve relevant examples to create a few-shot prompt for the model. In some instances, those examples included the actual question and its answer. In this case, the prompt was contaminated with the answer.
Model complexity: LLMs are huge models with tens or hundreds of billions of parameters. But their training data is much larger than their parameter count, which means they cannot completely memorize it. They are sometimes referred to as “stochastic parrots”: they parrot their training data, but not verbatim, and with some randomness. They are good at generating sequences of mostly meaningful tokens, but they also often generate utterly wrong responses. They can do complicated math, but they also fail at elementary problems. Some tests show that LLMs can reason, while others show that they have no notion of planning and reasoning. So it’s really hard to say exactly what they learn during training beyond statistical regularities in their training data. This all makes it very difficult to know for sure whether the model provided the right answer to a problem because it had memorized the answer or because it had learned how to solve the problem.
Problem confusion: LLMs are trained for next-token prediction and designed as models that can solve many different kinds of problems. But as I mentioned above, checking for data contamination depends largely on the type of problem you’re solving. So, data contamination rules would be different for math, coding, text generation, question answering, planning, and other problem types that LLMs are solving.
Lack of transparency: Finally, one of the biggest problems the field is facing is diminishing transparency. AI companies and research labs are increasingly incentivized to keep the details of their models secret. The most powerful LLMs are becoming more and more obscure. OpenAI provided no details on the architecture and training data of GPT-4. Google took a similar approach with PaLM 2. We don’t know much about the training data of other LLMs such as Claude and Bard. The lack of transparency makes it very difficult to detect data contamination on independent tests.
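Of the problems listed above, the prompting error is the easiest one to guard against in code. The sketch below is not the system from the study I mentioned. It is a hypothetical retrieval step that picks similar solved examples for a few-shot prompt but skips any candidate that is nearly identical to the question being tested. The Jaccard word-overlap measure and the 0.9 threshold are assumptions chosen for simplicity. A real system would more likely use embeddings.

```python
def tokens(text):
    """Split text into a set of lowercase, whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings, from 0.0 to 1.0."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_few_shot(question, pool, k=2, max_similarity=0.9):
    """Pick the k most similar solved examples, skipping near-copies of the question.

    pool is a list of (example_question, example_answer) pairs.
    """
    scored = [(jaccard(question, q), q, a) for q, a in pool]
    # Drop candidates so close to the test question that they would leak its answer.
    safe = [item for item in scored if item[0] < max_similarity]
    safe.sort(key=lambda item: item[0], reverse=True)
    return [(q, a) for _, q, a in safe[:k]]
```

The important design choice is that the filter runs before ranking. The most similar example in the pool is exactly the one most likely to be the test question itself.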
How to avoid LLM data contamination
Given the unique characteristics of LLMs, some of which I mentioned above, I think we need a new approach to detecting and preventing data contamination.
First, we must encourage more transparency and verification in the field. The field needs to return to its roots of sharing knowledge. There should either be access to the training data or tools to verify whether some example was used in training.
There should also be better tools to test the similarity of test examples and training data, and these tools must account for the fact that similarity measures vary across different types of tasks. As some scientists have pointed out, studies on the capabilities of AI systems should also come with more granular access to the evaluation examples.
We should also acknowledge that if LLMs are intelligent, their intelligence is very different from ours. As some scientists have pointed out, tests that are designed for measuring human intelligence are not suitable for evaluating LLMs. Humans have limited memorization and conscious data-processing capabilities. They build their skills on top of each other and learn to generalize. For example, before you learn calculus, you must master algebra and elementary math. So if you ace a calculus test (without cheating), then you’re expected to have all those underlying skills.
But deep learning systems can find shortcuts to solutions without learning the prerequisite skills. Therefore, we need tests designed to make sure the model is not giving the right answer for the wrong reasons. For example, tests can be more thorough and evaluate the building blocks that are taken for granted in humans. Another useful technique is testing the model on different variants of the same problem. If the model has memorized the problem and solution, it will succeed on the variant it has seen and fail on the others.
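Here is a rough sketch of what variant testing could look like for a templated math problem. Everything in it is hypothetical: the template format, the `make_variants` and `variant_accuracy` names, and the assumption that `model` is any callable that takes a question string and returns an answer. The idea is to fill the same problem with fresh numbers and see whether accuracy holds up.

```python
import random


def make_variants(template, value_ranges, count, seed=0):
    """Fill a problem template with fresh random numbers and compute each answer.

    template has a "question" format string and a "solve" function.
    value_ranges maps each placeholder name to an inclusive (low, high) range.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(count):
        values = {name: rng.randint(low, high)
                  for name, (low, high) in value_ranges.items()}
        question = template["question"].format(**values)
        answer = template["solve"](**values)
        variants.append((question, answer))
    return variants


def variant_accuracy(model, variants):
    """Share of variants for which the model returns the expected answer."""
    correct = sum(1 for question, answer in variants if model(question) == answer)
    return correct / len(variants)
```

For example, a template could pair the question "A train travels {speed} km/h for {hours} hours. How far does it go in km?" with a `solve` function that multiplies the two values. A model that scores well on the original benchmark question but poorly on the generated variants has likely memorized the answer rather than learned the skill.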
As language models continue to evolve, so will the ways we design, train, and test them. Data contamination will remain an issue. The ways to deal with it must change.