The complete guide to LLM compression

September 18, 2023

This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.

Large language models (LLMs) have been making waves, demonstrating exceptional performance in many tasks. However, their impressive capabilities come with a significant drawback: high computational costs.

Top-tier models such as LLaMA 2 and Falcon can demand dozens, if not hundreds, of gigabytes of GPU memory. This not only makes them expensive to run but also presents a formidable challenge in terms of setup. Furthermore, their resource-intensive nature makes it nearly impossible to run them on edge devices, which leaves applications dependent on robust cloud servers.

To overcome these hurdles, researchers have been developing a range of innovative compression techniques. These methods aim to make LLMs more compact, enabling them to fit on devices with limited resources. Additionally, they can enhance the speed of these models and reduce inference latency, making them more efficient.

In this article, we will delve into the world of LLM compression techniques. We’ll explore how they work, the trade-offs involved, and the impact they can have on LLM applications.

LLM pruning

Like other deep neural networks, large language models are composed of many components. However, not all of these components contribute significantly to the model’s output. In fact, some may have little to no effect at all. These non-essential components can be pruned, making the model more compact while largely preserving its performance.

There are several ways to perform LLM pruning, each with its own set of advantages and challenges. Unstructured pruning removes individual parameters without considering the model’s structure. Essentially, it sets parameters whose magnitude falls below a certain threshold to zero, eliminating their impact. This results in a sparse model where zero and non-zero weights are irregularly distributed.
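The thresholding step can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names, the toy weight matrix, and the threshold of 0.1 are all assumptions made for the example.

```python
def magnitude_prune(weights, threshold):
    """Zero out every weight whose absolute value is below `threshold`."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


# Toy 3x4 weight matrix (illustrative values, not taken from a real model).
weights = [
    [0.82, -0.03, 0.11, -0.67],
    [0.01, 0.45, -0.02, 0.09],
    [-0.58, 0.04, 0.73, -0.05],
]

pruned = magnitude_prune(weights, threshold=0.1)
print(pruned[1])         # [0.0, 0.45, 0.0, 0.0]
print(sparsity(pruned))  # 0.5
```

Note that the matrix keeps its original shape: the zeros are scattered wherever small weights happened to be, which is what makes the result hard to accelerate.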

Unstructured pruning is easy to implement. However, the irregular sparsity pattern it produces makes it difficult to leverage hardware optimization. Compressing the sparse model requires additional computation and processing steps. Moreover, the compressed model often requires further retraining to achieve optimal performance.
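To see why a sparse model needs extra processing, consider storing only the non-zero weights. The sketch below uses a simple coordinate (COO) layout in plain Python; the helper names and the toy matrix are assumptions for illustration, and real systems use more compact formats and specialized kernels.

```python
def to_coo(matrix):
    """Store only non-zero entries as (row, col, value) triples."""
    return [
        (i, j, v)
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if v != 0.0
    ]


def coo_matvec(entries, num_rows, x):
    """Multiply a COO matrix by a dense vector, touching only non-zeros."""
    y = [0.0] * num_rows
    for i, j, v in entries:
        y[i] += v * x[j]
    return y


# A toy matrix after unstructured pruning: half of the entries are zero.
pruned = [
    [0.82, 0.0, 0.11, -0.67],
    [0.0, 0.45, 0.0, 0.0],
    [-0.58, 0.0, 0.73, 0.0],
]

entries = to_coo(pruned)
x = [1.0, 2.0, 3.0, 4.0]
y = coo_matvec(entries, num_rows=len(pruned), x=x)

dense_numbers = sum(len(row) for row in pruned)  # 12 stored values
coo_numbers = 3 * len(entries)                   # 6 triples = 18 stored numbers
```

In this toy case the sparse layout stores more numbers than the dense one, because every surviving weight carries two indices. Sparse storage with this layout only pays off at high sparsity, and the scattered memory accesses in `coo_matvec` are much less hardware-friendly than a dense matrix multiplication.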

Despite these challenges, there have been significant advancements in unstructured pruning. One such development is SparseGPT, a technique developed by researchers at the Institute of Science and Technology Austria (ISTA). SparseGPT performs one-shot pruning on large transformer models such as BLOOM and OPT, eliminating the need for retraining.

Another technique, LoRAPrune, combines low-rank adaptation (LoRA) with pruning to enhance the performance of LLMs on downstream tasks. LoRA is a parameter-efficient fine-tuning (PEFT) technique that freezes the weights of a foundation model and trains only a small set of added low-rank matrices. This makes it a highly efficient method for adapting a model.
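The following sketch illustrates the low-rank idea behind LoRA itself, not the LoRAPrune algorithm. It assumes the common formulation in which a frozen weight matrix W is augmented with the product of two small trainable matrices, B and A, scaled by alpha / rank. The layer size of 4096 and the rank of 8 are illustrative assumptions.

```python
def lora_param_counts(d_out, d_in, rank):
    """Parameters in the full weight matrix vs. in the low-rank update."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product of nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(W, A, B, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W stays frozen during training."""
    delta = matmul(B, A)
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, lora)  # 16777216 65536

# Tiny worked example with rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weights
B = [[1.0], [2.0]]            # trainable, 2 x 1
A = [[0.5, -0.5]]             # trainable, 1 x 2
W_eff = lora_effective_weight(W, A, B, alpha=1.0, rank=1)
print(W_eff)  # [[1.5, -0.5], [1.0, 0.0]]
```

For the assumed 4096 x 4096 layer, the low-rank update trains about 0.4 percent as many parameters as the full matrix would.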

On the other hand, structured pruning involves removing entire parts of a model, such as neurons, channels, or layers. The advantage of structured pruning is that it simplifies model compression and improves hardware efficiency. For instance, removing an entire layer can reduce the computational complexity of the model without introducing irregularities in the model structure.
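As a rough sketch of structured pruning, the example below removes whole neurons from a small layer and drops the matching input columns from the layer that follows. Ranking neurons by the L2 norm of their weights is an assumption made for simplicity; methods such as LLM-Pruner use other importance signals.

```python
import math


def prune_neurons(layer, next_layer, keep):
    """Keep the `keep` neurons (rows of `layer`) with the largest L2 norm.

    The columns of `next_layer` that consumed the removed neurons'
    outputs are dropped too, so both matrices stay dense and consistent.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in layer]
    ranked = sorted(range(len(layer)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    new_layer = [layer[i] for i in kept]
    new_next = [[row[i] for i in kept] for row in next_layer]
    return new_layer, new_next, kept


# Toy layer with 4 neurons and 3 inputs (illustrative values).
layer = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.4, -0.6],
    [0.05, -0.03, 0.02],
]
# Following layer: 2 outputs, 4 inputs (one per neuron above).
next_layer = [
    [0.3, 0.2, -0.1, 0.4],
    [-0.5, 0.1, 0.6, 0.0],
]

small_layer, small_next, kept = prune_neurons(layer, next_layer, keep=2)
print(kept)        # [0, 2]
print(small_next)  # [[0.3, -0.1], [-0.5, 0.6]]
```

Unlike unstructured pruning, the result is a smaller dense matrix, so ordinary dense matrix multiplication benefits directly from the reduced size.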

However, structured pruning requires a deep understanding of the model’s architecture and how different parts contribute to overall performance. There’s also a higher risk of significantly impacting the model’s accuracy, as removing entire neurons or layers can potentially eliminate important learned features.

One promising technique for structured pruning is LLM-Pruner. This task-agnostic method minimizes reliance on original training data and selectively removes non-critical coupled structures based on gradient information. This approach preserves most of the LLM’s functionality, making it an effective tool for model compression.

LLM knowledge distillation

Knowledge distillation is a machine learning technique where a small “student” model is trained to emulate the behavior of a larger, more complex “teacher” model. The training process effectively transfers knowledge from the teacher to the student model, creating a more compact yet capable model.
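One classic way to express "emulating the teacher" is a soft-label loss that compares the output distributions of the two models. The sketch below assumes access to the teacher's logits and uses a temperature-scaled KL divergence; the logit values and the temperature of 2.0 are illustrative, and LLM-specific methods may use different objectives.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Temperature-scaled KL divergence between teacher and student outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl_divergence(p, q)


teacher = [2.0, 1.0, 0.1]
close_student = [1.8, 1.1, 0.2]  # nearly matches the teacher
far_student = [0.1, 1.0, 2.0]    # ranks the classes in reverse order

loss_same = distillation_loss(teacher, teacher)
loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
```

Training the student to minimize this loss pushes its output distribution toward the teacher's: the loss is zero when the two match and grows as they diverge.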

In the realm of LLMs, knowledge distillation techniques fall into two main categories. The first, standard knowledge distillation, aims to transfer the general knowledge of the teacher model to the student. For instance, you can gather a series of prompts and responses from ChatGPT and use them to train a smaller open-source LLM. However, it’s important to note that the terms of use of commercial models often restrict training other LLMs on their outputs.

The challenge with standard knowledge distillation lies in accurately capturing the underlying data distributions. MiniLLM, a technique developed by researchers at Tsinghua University and Microsoft Research, addresses this issue. It employs different objective and optimization functions specifically designed for LLMs, enhancing the effectiveness of the distillation process.

The second category, emergent ability (EA) distillation, seeks to extract a specific ability that the teacher model has learned and transfer it to the student model. Emergent abilities are capabilities that are present in large models but not in smaller ones. For example, you can gather prompts and responses on mathematics or reasoning problems from GPT-4 and try to transfer them to a smaller model like Vicuna. The advantage of EA distillation is that its results are much easier to evaluate because it focuses on a narrow set of tasks. However, it’s crucial to remember that there are limits to the abilities of LLMs that mimic the emergent behaviors of larger models.

LLM quantization

LLMs like GPT-3 typically store their parameters as floating-point values. At half-precision, each parameter occupies two bytes, so a model the size of GPT-3, with 175 billion parameters, requires about 350 gigabytes of memory for its weights alone. Quantization is a compression technique that converts these parameters into single-byte or smaller integers, significantly reducing the size of an LLM.
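The memory arithmetic is easy to check. The helper below is a back-of-the-envelope sketch that counts weight storage only (no activations, key-value cache, or quantization metadata) and assumes GPT-3's published size of 175 billion parameters.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


GPT3_PARAMS = 175_000_000_000

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {model_size_gb(GPT3_PARAMS, bits):.1f} GB")
# fp16: 350.0 GB
# int8: 175.0 GB
# int4: 87.5 GB
```

Halving the number of bits per parameter halves the storage, which is why 4-bit formats are popular for running models on consumer hardware.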

Quantization has gained popularity because it enables open-source LLMs to run on everyday devices like laptops and desktop computers. GPT4All and llama.cpp are two notable projects that use quantized models to run LLMs on consumer hardware.

Quantization can be applied at various stages of the model’s training cycle. In quantization-aware training (QAT), quantization is integrated into the training process. This approach allows the model to learn low-precision representations from the start, mitigating the precision loss caused by quantization. However, the downside of QAT is that it requires training the model from scratch, which can be resource-intensive and costly.
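A common way to integrate quantization into training is "fake quantization": the forward pass uses weights rounded to a low-precision grid, while updates are applied to a full-precision copy as if the rounding were not there (the straight-through estimator). The toy sketch below trains a single weight this way. The 4-bit grid, the step size of 0.25, the learning rate, and the data are all assumptions chosen so that the target weight lies exactly on the grid.

```python
def fake_quantize(w, scale, bits=4):
    """Round w to the nearest point on a signed integer grid, then map back."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


def train_qat(xs, ys, scale, steps=20, lr=0.05):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # full-precision "latent" weight kept during training
    for _ in range(steps):
        wq = fake_quantize(w, scale)
        # Gradient of the mean squared error with respect to wq ...
        grad = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # ... applied directly to the latent weight (straight-through estimator).
        w -= lr * grad
    return w, fake_quantize(w, scale)


xs = [1.0, 2.0, 3.0]
ys = [0.75 * x for x in xs]  # target weight 0.75 sits on the 0.25 grid

latent_w, quantized_w = train_qat(xs, ys, scale=0.25)
print(quantized_w)  # 0.75
```

When the ideal weight does not fall on the grid, the quantized weight typically settles on, or oscillates between, neighboring grid points; the model learns to work within that constraint rather than being surprised by it after training.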

Quantization-aware fine-tuning (QAFT) is another approach where a pre-trained high-precision model is adapted to maintain its quality with lower-precision weights. Techniques like QLoRA and parameter-efficient and quantization-aware adaptation (PEQA) are commonly used for QAFT.

Lastly, post-training quantization (PTQ) involves transforming the parameters of the LLM to lower-precision data types after the model is trained. PTQ aims to reduce the model’s size without altering the architecture or retraining the model. Its main advantage is its simplicity and efficiency because it does not require any additional training. However, it may not preserve the original model’s accuracy as effectively as the other techniques.
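A minimal sketch of post-training quantization is the symmetric "absmax" scheme: scale the weights so that the largest magnitude maps to the largest representable integer, then round. The function names and toy weights below are assumptions for illustration; practical PTQ methods usually quantize per channel or per group and compensate for the rounding error.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


weights = [0.82, -0.03, 0.11, -0.67, 0.45]  # toy values

q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)  # [127, -5, 17, -104, 70]
```

Each weight now needs one byte plus a shared scale, and the rounding error per weight is at most half the scale. That error is the precision loss PTQ accepts in exchange for skipping any additional training.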

LLM compression is a fascinating field of research that is constantly evolving. For a more technical overview of LLM compression, read the paper “A Survey on Model Compression for Large Language Models.”
