Why you should be careful about LLMs that imitate ChatGPT


This article is part of our coverage of the latest in AI research.

The success of large language models (LLMs) such as ChatGPT, Bard, and Claude has sparked efforts to replicate their capabilities. However, these LLMs are only accessible through black-box APIs, and their creators share very little information about model architecture, weights, and training data.

To work around these limitations, researchers have developed several techniques. Among them is “model imitation,” in which developers fine-tune a pre-trained model on examples extracted from the black-box model whose capabilities they want to replicate.

Studies have shown that with imitation learning, small open-source LLMs can be trained to perform at levels comparable to ChatGPT and Google Bard. These results have raised both interest and concern about the potential of imitation learning.

However, there are also limits to what model imitation can do. A recent paper by researchers at the University of California, Berkeley, shows that imitation models can learn the style of ChatGPT but not its knowledge. The findings have implications for how far you can trust imitation learning.

Imitating ChatGPT


While you don’t have access to the internals of an LLM like ChatGPT, you can learn a lot about it by interacting with its API. The main idea behind model imitation is to collect a dataset of prompts and responses from a black-box model, then use this dataset to fine-tune another model to mimic the behavior of the original LLM.

Model imitation can be narrow, where you collect examples for a specific set of tasks, such as solving math problems or analyzing the sentiment of tweets.

Alternatively, you can do broad imitation, where you try to cover the entire range of the capabilities of the target model.

Model imitation can be a quick and affordable alternative to manual data curation. Instead of hiring people to write prompts and responses, you can automate the process by having a state-of-the-art LLM produce them for you. (Note, however, that there are restrictions on LLMs trained with model imitation techniques. For example, OpenAI’s terms of service prevent you from using model imitation to deploy an LLM that competes with ChatGPT.)

One method to gather the data for model imitation is “self-instruct.” You start with a small seed of manually created prompts and have the larger LLM generate similar prompts for you.

For example, to create the training data for the open-source LLM Alpaca, researchers at Stanford University started with 175 manually written prompt-response pairs. They then prompted OpenAI’s text-davinci-003 language model to generate similar prompts. With $500 worth of inference, they generated 52,000 unique instruction-output examples, which they used to fine-tune Meta’s LLaMA language model for instruction following. The result became Alpaca.

The main advantage of this method is that you have full control over the kinds of tasks in your fine-tuning examples. The downside is that you have to pay for inference costs (though far less than you would pay for manual data labeling).

Alpaca LLM training process (source: Stanford.edu)
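To make the self-instruct loop concrete, here is a minimal sketch in Python. It is not the Stanford recipe: the prompt template, the seed tasks, the model name (a chat model standing in for text-davinci-003), and the file name are all illustrative assumptions, and it uses the openai Python client (version 1.x).

```python
import json
import random

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Small seed of manually written tasks (Alpaca started with 175).
seed_tasks = [
    "Give three tips for staying healthy.",
    "Explain photosynthesis to a 10-year-old.",
    "Write a polite email declining a meeting invitation.",
]

def generate_new_tasks(seeds, n_examples=3):
    """Ask the teacher model to invent tasks similar to a few seed examples."""
    sampled = random.sample(seeds, k=min(n_examples, len(seeds)))
    prompt = (
        "Here are some example instructions:\n"
        + "\n".join(f"- {s}" for s in sampled)
        + "\nWrite 5 new, diverse instructions in the same style, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: a stand-in for text-davinci-003
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.lstrip("- ").strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]

# Grow the task pool, then collect the teacher's answers as training targets.
tasks = seed_tasks + generate_new_tasks(seed_tasks)
with open("self_instruct.jsonl", "w") as f:
    for task in tasks:
        answer = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": task}],
        ).choices[0].message.content
        f.write(json.dumps({"instruction": task, "output": answer}) + "\n")
```

In practice, you would also deduplicate the generated tasks and filter out low-quality ones before collecting responses, as the self-instruct pipeline does.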

Another method to gather data is to use datasets that have already been curated from interactions with ChatGPT, Bard, or other commercial LLMs. One such source is ShareGPT, a website where users upload their prompts and ChatGPT’s responses. At the time of this writing, ShareGPT hosts more than 275,000 examples. The advantage of sources like ShareGPT is that you don’t pay for API calls. The downside is that it’s harder to find prompt-response pairs that correspond to the task you have in mind. Also, people tend to share their more interesting and outlandish exchanges on these sites, while in many cases you’ll want mundane examples that are useful for training your models.
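As a rough sketch, converting a ShareGPT-style dump into training pairs could look like the following. The field names (“conversations”, “from”, “value”) follow the layout commonly used for ShareGPT exports, but treat them as an assumption about your particular dump.

```python
import json

def sharegpt_to_pairs(path):
    """Flatten ShareGPT-style conversations into prompt-response pairs."""
    with open(path) as f:
        conversations = json.load(f)
    pairs = []
    for conv in conversations:
        turns = conv.get("conversations", [])
        # Pair each human turn with the assistant turn that follows it.
        for i in range(len(turns) - 1):
            if turns[i]["from"] == "human" and turns[i + 1]["from"] == "gpt":
                pairs.append({"prompt": turns[i]["value"],
                              "response": turns[i + 1]["value"]})
    return pairs

pairs = sharegpt_to_pairs("sharegpt_dump.json")  # hypothetical file name
print(f"Extracted {len(pairs)} prompt-response pairs")
```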

Vicuna, another open-source LLM based on LLaMA, was trained on 70,000 examples gathered from ShareGPT.

Vicuna LLM training process (source: lmsys.org)
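Once you have the pairs, the fine-tuning step is standard supervised training on formatted instruction-response text. Below is a minimal sketch using Hugging Face’s transformers and datasets libraries; the model ID, prompt template, and hyperparameters are illustrative, and a real 7-billion-parameter run would typically add techniques like LoRA or quantization to fit on available hardware.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# "pairs" is the list of prompt-response dicts from the previous sketch.
model_id = "huggyllama/llama-7b"  # illustrative; check the license before use
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

def format_example(example):
    # Render each pair in a simple instruction-following template.
    return {"text": f"### Instruction:\n{example['prompt']}\n\n"
                    f"### Response:\n{example['response']}"}

dataset = Dataset.from_list(pairs).map(format_example)
tokenized = dataset.map(
    lambda e: tokenizer(e["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="imitation-model",
                           per_device_train_batch_size=4,
                           num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```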

Teams that use model imitation have reported impressive results. For example, the creators of Alpaca write, “We performed a blind pairwise comparison between text-davinci-003 and Alpaca 7B, and we found that these two models have very similar performance: Alpaca wins 90 versus 89 comparisons against text-davinci-003.”

This is quite impressive, given that Alpaca has only 7 billion parameters, compared to OpenAI’s 175-billion-parameter model.

Meanwhile, the developers of Vicuna used GPT-4 to evaluate their model against LLaMA, Alpaca, ChatGPT, and Bard. The researchers write, “GPT-4 prefers Vicuna over state-of-the-art open-source models (LLaMA, Alpaca) in more than 90% of the questions, and it achieves competitive performance against proprietary models (ChatGPT, Bard). In 45% of the questions, GPT-4 rates Vicuna’s response as better or equal to ChatGPT’s.”

The limitations of model imitation


The researchers at UC Berkeley did a fairly extensive study of model imitation on ChatGPT. They trained several open-source LLMs, including GPT-2 and LLaMA, on both task-specific datasets and broad-ranging examples. They evaluated the models through blind pairwise output comparison: they showed human evaluators a prompt along with two outputs, without revealing which was generated by ChatGPT and which by their smaller imitation LLM, and asked them to rate which one was better.

They also used GPT-4 as an evaluator, providing it with prompt-output pairs and asking it to rate them.
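A bare-bones version of this kind of LLM-as-judge harness might look like the sketch below. The judging prompt and the single-letter verdict format are simplifications of my own, not the paper’s exact rubric, and it assumes the openai Python client.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are grading two answers to the same prompt.

Prompt: {prompt}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is better? Reply with exactly one letter: A, B, or T for a tie."""

def judge_pair(prompt, answer_a, answer_b):
    """Ask GPT-4 to pick the better of two anonymized answers."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(
                       prompt=prompt, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip()

verdict = judge_pair(
    "Explain recursion in one paragraph.",
    "Recursion is when a function calls itself on a smaller input ...",
    "Recursion means doing something again and again ...",
)
print(verdict)  # e.g., "A"
```

In a real evaluation, you would also randomize which model’s answer appears as A and which as B, since LLM judges are known to exhibit position bias.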

Their initial findings showed that the fine-tuned imitation models were much better at staying on task than the base LLM. Crowdworkers and GPT-4 often rated their outputs as equal or superior to ChatGPT’s.

However, when they tested the models on task-specific benchmarks such as coding, problem-solving, and factual knowledge, imitation learning did not improve performance. “We found that across every benchmark that we measured, ShareGPT-mix imitation models do not improve (or even decline) in accuracy as compared to the base model, even when adding additional imitation data,” the researchers write.

Why, then, do human evaluators rate the smaller models’ output so highly? According to the paper’s findings, the models learn ChatGPT’s style without learning its knowledge. As a result, they generate text that sounds confident and authoritative without being factually correct.

Open-source LLMs can learn to imitate ChatGPT’s style but not its knowledge. The red sections are factually false, though they sound authoritative. (source: arXiv)

“We argue that this occurs because ChatGPT has captured far more knowledge and capabilities from the web as compared to LLaMA,” the researchers write. “In turn, it is unreasonable to expect that a small amount of imitation data (e.g., 1000x less data than pre-training) would enable one to bridge this gap.”

Bridging this gap would require an extremely large and diverse imitation dataset, probably on the scale of the data used to pre-train the base LLM.

On the bright side, they found imitation learning to be much more successful when the model is fine-tuned on a task-specific dataset. “It is far more feasible to distill a specific behavior from ChatGPT as opposed to broadly matching its capabilities,” the researchers write.

How to use model imitation

The researchers conclude that the capabilities gap between today’s open-source language models and their closed-source counterparts cannot be closed by cheaply fine-tuning them on imitation data. “Instead, the best way to improve open-source LMs is to tackle the difficult challenge of developing better base LMs, whether it be via model scaling or other means,” they write.

However, this does not mean that model imitation is useless. Learning to imitate ChatGPT’s conversational style and task focus is extremely valuable. And to make up for factual unreliability, you can use augmentation techniques to couple your LLM with your own proprietary knowledge base.
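A bare-bones sketch of this retrieval-augmentation pattern might look like the following. It assumes the sentence-transformers library for embeddings; the documents, the embedding model name, and the generate() helper (your fine-tuned imitation model) are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Your proprietary knowledge base, embedded once ahead of time.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm CET.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def answer(question, top_k=1):
    """Retrieve the most relevant documents and feed them to the LLM as context."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec  # cosine similarity (vectors are normalized)
    context = "\n".join(documents[i] for i in np.argsort(scores)[::-1][:top_k])
    prompt = (f"Answer using only the context below.\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return generate(prompt)  # placeholder: call your fine-tuned imitation model
```

Because the relevant facts arrive in the prompt at inference time, the imitation model only needs to do what it is demonstrably good at: restating information fluently in ChatGPT’s style.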

With model augmentation techniques, you get a killer combination: a small, efficient model that runs on your own servers, mimics ChatGPT’s style, and is also factually reliable and customized to your organization’s needs.
