Large language models (LLMs) are known for being expensive to train, fine-tune, and run. Until recently, it was thought that matching the capabilities of GPT-3.5 and ChatGPT required models with hundreds of billions of parameters, trained with millions of dollars’ worth of compute.
However, recently released open-source LLMs have proven that you don’t need very large models to compete with the state of the art. Researchers have trained LLMs with a few billion parameters to perform at a level that is comparable to very large models. The success of open-source large language models has sparked interest and growing activity in the field.
Some of these efforts focus on making the fine-tuning of LLMs more cost-efficient. One of the techniques that helps reduce the costs of fine-tuning enormously is “low-rank adaptation” (LoRA). With LoRA, you can fine-tune LLMs at a fraction of the cost it would normally take.
Here is what you need to know about LoRA.
How does fine-tuning LLMs work?
Open-source LLMs such as LLaMA, Pythia, and MPT-7B are foundation models that have been pre-trained on hundreds of billions of words. Developers and machine learning engineers can download the model with the pre-trained weights and fine-tune it for downstream tasks such as instruction following.
These LLMs are transformer models composed of a stack of layer blocks, each of which contains learnable parameters. Fine-tuning continues where pre-training left off.
The model is given input from the fine-tuning dataset. It predicts the next tokens, compares its output with the ground truth, and adjusts its weights to reduce the error. By repeating this process over many examples, the LLM becomes fine-tuned to the downstream task.
Now, let’s make a small modification to the fine-tuning process. In this new method, we freeze the original weights of the model and don’t modify them during the fine-tuning process. Instead, we apply the modifications to a separate set of weights and we add their new values to the original parameters. Let’s call these two sets “pre-trained” and “fine-tuned” weights.
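To make the idea of frozen and separate weights concrete, here is a minimal NumPy sketch. It is only an illustration under simplifying assumptions: a single tiny linear layer stands in for the model, and the names W_pretrained and delta_W, the input, the target, and the learning rate are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny layer: 4 inputs, 3 outputs.
W_pretrained = rng.normal(size=(3, 4))  # "pre-trained" weights, kept frozen
delta_W = np.zeros((3, 4))              # "fine-tuned" weights, the only ones we update

x = np.array([1.0, -2.0, 0.5, 3.0])     # one made-up training input
target = np.array([1.0, 0.0, -1.0])     # its made-up ground truth

def forward(x):
    # The effective weights are the sum of the two sets.
    return (W_pretrained + delta_W) @ x

W_before = W_pretrained.copy()
learning_rate = 0.05
for _ in range(200):
    error = forward(x) - target
    # Gradient of 0.5 * ||error||^2 with respect to delta_W.
    grad = np.outer(error, x)
    delta_W -= learning_rate * grad     # W_pretrained is never touched

loss = 0.5 * np.sum((forward(x) - target) ** 2)
```

After the loop, the loss is close to zero even though W_pretrained is unchanged. Everything the layer learned lives in delta_W.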
Separating the pre-trained and fine-tuned parameters is an important part of LoRA.
Before moving on to LoRA, let’s think about our model parameters as very large matrices. If you remember your linear algebra class, the columns of a matrix span a vector space. In this case, we’re talking about a very large vector space with many dimensions that models language.
Every matrix has a “rank,” which is the maximum number of linearly independent columns it has. A set of columns is linearly independent if none of them can be written as a linear combination of the others. A dependent column, on the other hand, is one that can be written as a linear combination of other columns in the same matrix. You can remove dependent columns from a matrix without losing information, because they can always be rebuilt from the columns that remain.
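A small example can make rank more tangible. The sketch below uses NumPy’s np.linalg.matrix_rank on a made-up 3×3 matrix whose third column is the sum of the first two.

```python
import numpy as np

# Made-up matrix: the third column equals column 1 + column 2.
M = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [2.0, 3.0, 5.0],
])

rank = np.linalg.matrix_rank(M)  # only two columns are independent

# The dependent column can be rebuilt from the other two,
# so dropping it loses no information.
rebuilt_third_column = M[:, 0] + M[:, 1]
```

Here the matrix has three columns but a rank of 2.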
LoRA, proposed in a paper by researchers at Microsoft, suggests that when fine-tuning an LLM for a downstream task, the changes to the weights don’t need to span the full rank of the weight matrix. The researchers proposed that you could preserve most of the learning capacity of the model while reducing the dimension of the downstream parameters. (This is why it makes sense to separate the pre-trained and fine-tuned weights.)
Basically, in LoRA, you create two downstream weight matrices. The first transforms the input from the original dimension to the low-rank dimension. The second transforms the low-rank data to the output dimension of the original model.
During training, modifications are made only to the LoRA parameters, which are far fewer than the original weights. This is why they can be trained much faster and at a fraction of the cost of full fine-tuning. At inference time, the output of the LoRA matrices is added to the output of the pre-trained weights to calculate the final values.
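Here is a rough NumPy sketch of a single layer with the two LoRA matrices. The dimensions (512 by 512) and the rank (8) are assumptions chosen for illustration, and the names W, A, and B are placeholders. Following the LoRA paper, A starts with small random values and B starts at zero, so the adapter has no effect before training. The paper also applies a scaling factor to the LoRA branch, which is left out here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8  # assumed layer size and LoRA rank

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weights
A = rng.normal(scale=0.01, size=(r, d_in))  # projects the input down to rank r
B = np.zeros((d_out, r))                    # projects back up to the output size

def lora_forward(x):
    # Pre-trained path plus the low-rank path.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
y = lora_forward(x)

full_params = W.size           # what full fine-tuning would update
lora_params = A.size + B.size  # what LoRA updates instead
```

For this layer, full fine-tuning would update 262,144 values, while LoRA updates 8,192, or about 3 percent. The savings grow as the layer gets larger relative to the rank.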
Extra improvements with LoRA
Since LoRA requires that we keep the pre-trained and fine-tuned weights separate, we incur a memory overhead. Also, the operation of adding the pre-trained and fine-tuned weights at inference time causes a small computation penalty. To overcome this penalty, you can merge the fine-tuned and pre-trained weights after fine-tuning your LLM with LoRA.
However, since the downstream weights occupy only a fraction of the size of the original weights (sometimes down to a thousandth), you might want to keep them separate. There are benefits to separating pre-trained and downstream weights.
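Merging comes down to one matrix multiplication and one addition. The sketch below uses small, made-up dimensions and random values in place of trained LoRA matrices to check that the merged layer gives the same output as the two separate paths.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4  # assumed sizes for illustration

W = rng.normal(size=(d_out, d_in))  # frozen pre-trained weights
A = rng.normal(size=(r, d_in))      # stand-in for a trained LoRA matrix
B = rng.normal(size=(d_out, r))     # stand-in for a trained LoRA matrix
x = rng.normal(size=d_in)

# Separate paths: two extra small multiplications per forward pass.
y_separate = W @ x + B @ (A @ x)

# Merged: fold the low-rank update into the weights once.
W_merged = W + B @ A
y_merged = W_merged @ x
```

The merged matrix has the same shape as the original, so the merged model runs at the same cost as the base model.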
For example, say you’re hosting an LLM that several clients use for different applications. Each client wants to fine-tune the model with their own datasets and for their own applications. Instead of creating a separate fine-tuned version of the model for each client, you can use LoRA to create a set of downstream weights for each client or application. At inference time, you load the base model once, along with the LoRA weights of the client being served, to compute the final output. You will take a slight performance hit, but the gains in storage will be immense.
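The multi-client setup can be sketched in a few lines of NumPy. Everything here is hypothetical: the client names, the layer width, the rank, and the random values that stand in for each client’s trained LoRA matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 4  # assumed layer width and LoRA rank

W_base = rng.normal(size=(d, d))  # one shared, frozen base matrix

def make_adapter():
    # Stand-in for a client's trained LoRA matrices (random here).
    return {
        "A": rng.normal(scale=0.01, size=(r, d)),
        "B": rng.normal(scale=0.01, size=(d, r)),
    }

adapters = {"client_a": make_adapter(), "client_b": make_adapter()}

def forward(x, client):
    # Same base weights for everyone, plus the requesting client's adapter.
    adapter = adapters[client]
    return W_base @ x + adapter["B"] @ (adapter["A"] @ x)

x = rng.normal(size=d)
out_a = forward(x, "client_a")
out_b = forward(x, "client_b")

# Values stored: one full copy per client vs. one base plus small adapters.
separate_copies = len(adapters) * W_base.size
shared_base = W_base.size + sum(a["A"].size + a["B"].size for a in adapters.values())
```

With two clients, separate copies would store 131,072 values for this layer, while the shared base plus two adapters stores 69,632. Each extra client adds only 2,048 values instead of 65,536.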
Implementing LoRA in Python
This post was a very broad overview of how LoRA works. The technique has more technical details and nuances, such as which types of weights it applies to and the hyperparameters it has. Sebastian Raschka has a lengthy post in which he provides technical and implementation details on LoRA.
Chris Alexiuk also has a detailed two-part video series on LoRA. The first one digs into the theory and the second one is a hands-on implementation with Python and Google Colab.
LoRA is one of several techniques that can help reduce the costs of training open-source LLMs. The field is moving very fast, and it will be interesting to see how it develops.