The complete guide to LLM fine-tuning

Ben Dickson

10 months ago

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

Pre-trained large language models (LLMs) can do impressive things off the shelf, including text generation, summarization, and coding. However, LLMs are not one-size-fits-all solutions. Occasionally (or frequently, depending on your application), you’ll run into a task your language model can’t accomplish.

In such situations, one of your options is to fine-tune the LLM. Fine-tuning is the process of continuing to train a foundation model on new data. It can be expensive and complicated, and it is not the first solution that should come to mind. But it is nonetheless a very powerful technique that should be in the toolbox of organizations that are integrating LLMs into their applications.

Here is what you need to know about fine-tuning large language models. Even if you don’t have the expertise to do it yourself, knowing how fine-tuning works can help you make the right decisions.

What is LLM fine-tuning?

While this is an article about LLM fine-tuning, the technique is not specific to language models. Any machine learning model might require fine-tuning or retraining when the data it faces changes. When a model is trained on a dataset, it tries to approximate the patterns of the underlying data distribution.

To better illustrate the concept, consider a convolutional neural network (CNN) designed to detect images of cars. The model has been trained on tens of thousands of images of passenger cars in urban settings. It has tuned its parameters to the shapes, colors, and pixel patterns that are often seen in those kinds of cars and environments. And it performs very well when used on images of cars in cities.

Now, suppose you want to use the same model in an application that involves detecting trucks on highways. The model’s performance will likely drop because the distribution of the new images is significantly different from that of the training data.

In this case, one option would be to train the model from scratch on images of trucks on highways. But this would require you to create a very large dataset containing tens of thousands of labeled images of trucks, which can be expensive and time-consuming.

Trained ML models do not perform well on out-of-distribution examples

Fortunately, trucks and passenger cars have a lot of visual features in common. Therefore, instead of training the new model from scratch, you can continue where the trained model left off. With a small dataset of truck images (maybe a few thousand or even a few hundred) and several epochs of training, you can optimize the old model for the new application. Under the hood, fine-tuning updates the model’s parameters to match the distribution of the new dataset.

This is the idea behind fine-tuning. You take a trained ML model and use new data to update its parameters for new settings or repurpose it for new applications.
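To make the idea concrete, here is a minimal sketch that uses a one-dimensional linear model as a stand-in for a neural network. The data, learning rate, and epoch counts are all made up for illustration. The model is first trained on one distribution, then fine-tuned for a few epochs on a handful of samples from a shifted distribution, and compared with a model trained from scratch on the same small budget.

```python
def mse(w, b, data):
    """Mean squared error of the linear model y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(w, b, data, lr, epochs):
    """Plain gradient descent, starting from the given parameters."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# "Pre-training": many samples from the original distribution (y = 2x + 1).
pretrain_data = [(k / 10, 2 * (k / 10) + 1) for k in range(-50, 51)]
w_pre, b_pre = train(0.0, 0.0, pretrain_data, lr=0.05, epochs=500)

# New application: only five samples from a shifted distribution (y = 2.5x + 1.5).
finetune_data = [(x, 2.5 * x + 1.5) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

loss_before = mse(w_pre, b_pre, finetune_data)

# Fine-tuning: continue from the pre-trained parameters for a few epochs.
w_ft, b_ft = train(w_pre, b_pre, finetune_data, lr=0.05, epochs=10)
loss_after = mse(w_ft, b_ft, finetune_data)

# Baseline: the same small budget, but starting from scratch.
w_sc, b_sc = train(0.0, 0.0, finetune_data, lr=0.05, epochs=10)
loss_scratch = mse(w_sc, b_sc, finetune_data)

print(f"before fine-tuning: {loss_before:.3f}")
print(f"after fine-tuning:  {loss_after:.3f}")
print(f"from scratch:       {loss_scratch:.3f}")
```

With the same data and number of epochs, the fine-tuned model ends up with a lower error than the one trained from scratch because it starts much closer to the new distribution.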

The same rule applies to language models. If the distribution of the data used to train your model is significantly different from your application, you might want to consider fine-tuning it. This can happen if, for example, you’re using an LLM for a medical application but its training data did not contain any medical literature. However, fine-tuning LLMs has its own nuances that are worth exploring.

ML models fine-tuned on new data can improve on downstream tasks

Different LLM fine-tuning techniques

Not all forms of fine-tuning are equal and each is useful for different applications. In some cases, you want to repurpose a model for a different application. For example, you have a pre-trained LLM that can generate text. Now, you want to use it for a different type of application, such as sentiment or topic classification. In this case, you will repurpose the model by making a small change to its architecture before fine-tuning it.

For this application, you will only use the embeddings that the transformer part of the model produces. Embeddings are numerical vectors that capture the different features of the input prompt. Some language models directly generate embeddings. Others like the GPT family of LLMs use the embeddings to generate tokens (or text).

In repurposing, you feed the embeddings produced by the model into a classifier (e.g., a set of fully connected layers) that maps them to class probabilities. In this setting, you just need to train the classifier on the embeddings generated by the model. The LLM’s attention layers are frozen and don’t need to be updated, which results in huge compute cost savings. However, to train the classifier, you’re going to need a supervised learning dataset composed of examples of text and the corresponding class. The size of your fine-tuning dataset will depend on the complexity of the task and your classifier component.
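The sketch below illustrates the frozen-backbone setup in miniature. The `frozen_embed` function is a hypothetical stand-in for a pre-trained model: it returns simple word counts, where a real LLM would return a dense vector. It is never updated. Only the small logistic-regression head on top of it is trained. The vocabulary, examples, and hyperparameters are assumptions chosen for illustration.

```python
import math

VOCAB = ["great", "love", "bad", "terrible"]  # toy vocabulary (assumption)


def frozen_embed(text):
    """Stand-in for a frozen pre-trained model: fixed features, never trained."""
    words = text.lower().split()
    return [float(words.count(v)) for v in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(examples, lr=0.5, epochs=200):
    """Train only the classifier head (logistic regression) on frozen embeddings."""
    weights = [0.0] * len(VOCAB)
    bias = 0.0
    # The backbone is frozen, so embeddings can be computed once and cached.
    features = [(frozen_embed(text), label) for text, label in examples]
    for _ in range(epochs):
        for x, y in features:
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(text, weights, bias):
    """Probability that the text belongs to the positive class."""
    x = frozen_embed(text)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


train_set = [
    ("I love this great phone", 1),
    ("great battery love it", 1),
    ("terrible screen bad sound", 0),
    ("bad purchase", 0),
]
weights, bias = train_head(train_set)

print(predict("great camera and I love it", weights, bias))
print(predict("bad sound terrible screen", weights, bias))
```

Because the embeddings never change, they can be cached once, which is part of why this setup is so much cheaper than full fine-tuning.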

But in some cases, you’ll need to update the parameter weights of the transformer model. For this, you’ll need to unfreeze the attention layers and perform full fine-tuning on the entire model. This operation can be computationally expensive and complicated, depending on the size of your model. (In some cases, you can keep parts of the model frozen to reduce the costs of fine-tuning. And there are several techniques that can reduce the costs of fine-tuning LLMs—more on that in a bit.)

Unsupervised vs supervised fine-tuning (SFT)

In some cases, you just want to update the knowledge of the LLM. For example, you might want to fine-tune the model on medical literature or a new language. For these situations, you can use an unstructured dataset, such as articles and scientific papers gathered from medical journals. The goal is to train the model on enough tokens to be representative of the new domain or the kind of input that it will face in the target application.

The advantage of unstructured data is that it scales: it needs no manual labels, so models can be trained on it through unsupervised or self-supervised learning. Most foundation models are trained on unstructured datasets composed of hundreds of billions of tokens. Gathering unstructured data for fine-tuning the model for a new domain can also be relatively easy, especially if you have in-house knowledge bases and documents.

However, in some cases, updating the knowledge of the model is not enough and you want to modify the behavior of the LLM. In these situations, you will need a supervised fine-tuning (SFT) dataset, which is a collection of prompts and their corresponding responses. SFT datasets can be manually curated by users or generated by other LLMs. Supervised fine-tuning is especially important for LLMs such as ChatGPT, which have been designed to follow user instructions and stay on a specific task across long stretches of text. This specific type of fine-tuning is also referred to as instruction fine-tuning.
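The difference between the two kinds of data can be sketched as follows. The whitespace tokenizer and the helper names are assumptions made for illustration. Masking the prompt positions with `-100` so the loss is computed only on the response is a common convention (it is the default ignored label in PyTorch’s cross-entropy loss), not something every SFT pipeline does.

```python
IGNORE_INDEX = -100  # label value commonly used to exclude a position from the loss


def build_vocab(texts):
    """Toy whitespace tokenizer vocabulary; real LLMs use subword tokenizers."""
    vocab = {}
    for text in texts:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def next_token_pairs(token_ids):
    """Self-supervised targets from raw text: each token predicts the next one."""
    return list(zip(token_ids[:-1], token_ids[1:]))


def encode_sft_example(prompt, response, vocab):
    """Supervised example: the model sees prompt + response, but only the
    response tokens contribute to the loss."""
    prompt_ids = [vocab[t] for t in prompt.split()]
    response_ids = [vocab[t] for t in response.split()]
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }


raw_text = "aspirin reduces fever and pain"  # unstructured domain text
prompt = "what does aspirin treat"           # SFT prompt
response = "fever and pain"                  # SFT response

vocab = build_vocab([raw_text, prompt, response])

raw_ids = [vocab[t] for t in raw_text.split()]
pairs = next_token_pairs(raw_ids)
example = encode_sft_example(prompt, response, vocab)

print(pairs)
print(example)
```

The raw text needs no annotation at all, while the SFT example requires someone (or another model) to have written the response that goes with the prompt.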

Unsupervised vs supervised LLM fine-tuning

Reinforcement learning from human feedback (RLHF)

Some companies take SFT or instruction fine-tuning to the next level and use reinforcement learning from human feedback (RLHF). This is a complicated and expensive process that requires recruiting human reviewers and setting up auxiliary models to fine-tune the LLM. This is why, for the moment, only companies and AI labs with large technical and financial resources can afford RLHF.

There are various ways to do RLHF but the general idea is this: When you train an LLM on billions of tokens, it learns to generate the tokens that are most likely to follow a given sequence. The resulting text is mostly coherent and makes sense. But it may not be what the user or application requires. RLHF brings humans in the loop to steer the LLM in the right direction. Human reviewers rate the output of the model on prompts. These ratings act as signals to fine-tune the model to generate high-rating output.

One popular example of RLHF is ChatGPT. OpenAI fine-tuned the model based on its InstructGPT paper. First, they fine-tuned a GPT-3.5 model through SFT on a set of manually generated prompts and responses. In the next step, they recruited human reviewers and had them rate the output of the model on various prompts. They used the human feedback data to train a reward model that tries to emulate human preferences. Finally, they fine-tuned the language model through a deep reinforcement learning (RL) loop in which the LLM generates outputs, the reward model rates them, and the LLM updates its parameters in a way that maximizes its reward.
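The reward-model step can be sketched with a pairwise preference loss of the kind described in the InstructGPT paper: for each pair of responses, the model is pushed to score the one the reviewer preferred above the other. Everything else below is an assumption made for illustration. The reward model is a linear function over two hand-made features, where a real one would be a neural network reading the full text, and the reinforcement learning loop itself is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the chosen response
    is scored well above the rejected one."""
    return math.log(1.0 + math.exp(-(reward_chosen - reward_rejected)))


def reward(w, features):
    """Toy linear reward model (a real one would be a neural network)."""
    return sum(wi * xi for wi, xi in zip(w, features))


def train_reward_model(pairs, dim, lr=0.1, epochs=100):
    """Gradient descent on the pairwise loss over (chosen, rejected) pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = reward(w, chosen) - reward(w, rejected)
            scale = 1.0 - sigmoid(diff)  # equals -d(loss)/d(diff)
            w = [wi + lr * scale * (c - r) for wi, c, r in zip(w, chosen, rejected)]
    return w


# Hypothetical features per response: [helpfulness, rudeness].
# Each pair is (features of the preferred response, features of the other one).
preference_pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([0.5, 0.0], [0.0, 0.0]),
]
w = train_reward_model(preference_pairs, dim=2)

helpful_score = reward(w, [1.0, 0.0])
rude_score = reward(w, [0.0, 1.0])
print(w, helpful_score, rude_score)
```

Once trained, a reward model like this replaces the human reviewers inside the RL loop, scoring each output the LLM generates.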

ChatGPT training process (source: OpenAI)

Parameter-efficient fine-tuning (PEFT)

An interesting area of research in LLM fine-tuning is reducing the costs of updating the parameters of the models. This is the goal of parameter-efficient fine-tuning (PEFT), a set of techniques that try to reduce the number of parameters that need to be updated.

There are various PEFT techniques. One of them is low-rank adaptation (LoRA), a technique that has become especially popular among open-source language models. The idea behind LoRA is that fine-tuning a foundation model on a downstream task does not require updating all of its parameters. The change to each weight matrix can often be approximated with very high accuracy by a low-rank matrix, which is stored as the product of two much smaller matrices.

Fine-tuning with LoRA trains these small matrices while the parameters of the main LLM stay frozen. The LoRA weights are then either merged into the main LLM or added to its output during inference. LoRA can cut the costs of fine-tuning by up to 98 percent. It also makes it possible to store multiple small fine-tuned adapters that can be plugged into the same LLM at runtime.
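The arithmetic behind the savings can be sketched in a few lines. In the LoRA formulation, a frozen weight matrix W of shape (d_out, d_in) is complemented by two trainable matrices, B of shape (d_out, r) and A of shape (r, d_in), and the layer computes W x + (alpha / r) · B (A x). The layer size and rank below are assumptions picked for illustration, and the pure-Python matrix code is only meant to show the mechanics.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. LoRA for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # B is (d_out x rank), A is (rank x d_in)
    return full, lora


def matvec(matrix, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# One 4096 x 4096 projection matrix with rank-8 adapters (illustrative sizes).
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 65536 0.39%

# Tiny forward pass: 2 x 2 frozen weights, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
x = [1.0, 2.0]

# B is initialized to zero in LoRA, so the adapted layer starts out
# identical to the pre-trained one.
B_init = [[0.0], [0.0]]
out_init = lora_forward(W, A, B_init, x, alpha=2.0, rank=1)

# After some training, B is non-zero and shifts the output.
B_trained = [[1.0], [0.0]]
out_trained = lora_forward(W, A, B_trained, x, alpha=2.0, rank=1)
print(out_init, out_trained)  # [1.0, 2.0] [7.0, 2.0]
```

Because A and B are stored separately from W, swapping adapters at runtime only means swapping these small matrices.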

When not to use LLM fine-tuning

In some cases, LLM fine-tuning is not possible or not useful:

1- Some models are only available through application programming interfaces (APIs) that offer limited or no fine-tuning services.

2- You might not have enough data to fine-tune the model for the downstream task or the domain of your application.

3- The data in the application might change frequently. Fine-tuning the model frequently might not be possible or might be detrimental. For example, the data in news-related applications changes every day.

4- The application might be dynamic and context-sensitive. For example, if you’re creating a chatbot that customizes its output for each user, fine-tuning a separate model on each user’s data is not practical.

In such cases, you can use in-context learning or retrieval augmentation, where you provide the model with context at inference time. For example, if you want the LLM to assist you in writing an article or an email, you prepend relevant documents (news reports, Wikipedia pages, company documents, etc.) to your prompt and condition its responses on their content. Another example is an LLM that must provide user-specific answers, say on their financial data, health data, emails, etc. Again, when the user enters a prompt, the application retrieves their data and prepends it to the prompt to condition the model.

A workflow to augment LLMs with contextual documents

One useful design pattern is to create a vector database that stores embeddings of company documents. When the user enters a prompt, the application embeds it, retrieves the most similar documents from the vector DB, and sends them as context to the model.
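A minimal sketch of this pattern is shown below. The `ToyVectorStore` class, the bag-of-words `embed` function, and the prompt template are all hypothetical. A real system would use an embedding model and a dedicated vector database, but the flow is the same: embed the documents once, embed the query, rank by similarity, and prepend the best matches to the prompt.

```python
import math


def embed(text, vocab):
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    words = text.lower().split()
    return [float(words.count(v)) for v in vocab]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self, documents):
        self.vocab = sorted({w for d in documents for w in d.lower().split()})
        self.entries = [(d, embed(d, self.vocab)) for d in documents]

    def search(self, query, k=1):
        """Return the k documents most similar to the query."""
        q = embed(query, self.vocab)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


def build_prompt(query, store, k=1):
    """Prepend the retrieved documents to the user's question."""
    context = "\n".join(store.search(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


docs = [
    "refunds are processed within five business days",
    "the office is closed on public holidays",
    "passwords must be rotated every ninety days",
]
store = ToyVectorStore(docs)
prompt = build_prompt("how long do refunds take", store)
print(prompt)
```

The resulting prompt is what gets sent to the LLM, so the model itself never needs to be retrained when the documents change.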

Sometimes, you can use hybrid approaches, where you fine-tune the model on an application-specific dataset and then provide user-specific context during inference.

With growing interest in using LLMs in different applications, we can expect more interesting fine-tuning techniques and alternative solutions to emerge in the near future.