How to make your LLMs lighter with GPTQ quantization

Image: lightweight flying llamas (generated with Bing Image Creator)

One of the big challenges of large language models (LLM) is their hefty memory and computational demands, often requiring tens of gigabytes of GPU memory. This makes them not only expensive but also difficult to run. 

To mitigate these issues, researchers have developed several LLM compression techniques, including “quantization.” Quantization reduces a model’s memory footprint by changing how its parameters are stored. One fast and efficient quantization algorithm is GPTQ, which is supported by popular frameworks such as Hugging Face through the AutoGPTQ library, making it a cost-effective solution. Here is what you need to know about quantizing your LLMs with GPTQ.

What is quantization?

Transformer models, such as LLMs, typically store their parameters as 16-bit floating point (fp16) numbers. Consequently, a model with one billion parameters needs at least 2 gigabytes of memory for its weights alone, plus additional memory for overhead.

Quantization offers a solution to this problem by converting these parameters into a smaller integer format, such as int8 or int4, effectively reducing the model’s size. The challenge for quantization algorithms is to achieve this reduction while minimizing any loss in the model’s accuracy.
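
To make the idea concrete, the sketch below applies naive symmetric (“absmax”) int8 quantization to a single weight matrix in PyTorch and compares the memory footprint before and after. This is only a toy illustration of the general principle, not the GPTQ algorithm; the function names and tensor sizes are made up for the example.

```python
import torch

def absmax_quantize(weights: torch.Tensor):
    """Naive symmetric int8 quantization: scale by the largest absolute value."""
    scale = weights.abs().max() / 127.0          # map [-max, max] onto [-127, 127]
    q_weights = torch.round(weights / scale).to(torch.int8)
    return q_weights, scale

def dequantize(q_weights: torch.Tensor, scale: torch.Tensor):
    """Recover an fp16 approximation of the original weights."""
    return q_weights.to(torch.float16) * scale

# A stand-in fp16 weight matrix for one layer of an LLM.
w = torch.randn(4096, 4096, dtype=torch.float16)
q, s = absmax_quantize(w)

print(f"fp16 size: {w.numel() * 2 / 1e6:.1f} MB")   # 2 bytes per parameter
print(f"int8 size: {q.numel() * 1 / 1e6:.1f} MB")   # 1 byte per parameter
print(f"max abs error: {(dequantize(q, s) - w).abs().max().item():.4f}")
```

Real quantization algorithms such as GPTQ go further than this naive rounding by compensating for the quantization error as they process each layer, which is why they lose far less accuracy.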

Quantization techniques fall into three main categories:

1. Quantization-aware training (QAT): This technique integrates quantization into the training process. By allowing the model to learn low-precision representations from the outset, QAT mitigates the precision loss typically associated with quantization.

2. Quantization-aware fine-tuning (QAFT): This approach adapts a pre-trained high-precision model to maintain its quality with lower-precision weights. Notable QAFT techniques include QLoRA and Parameter-Efficient and Quantization-Aware Adaptation (PEQA), both designed to preserve model quality while reducing size.

3. Post-training quantization (PTQ): This method transforms the parameters of the LLM to lower-precision data types after the model has been trained. PTQ aims to reduce the model’s complexity without altering its architecture or requiring retraining. 

GPTQ

GPTQ is a post-training quantization technique, making it an ideal choice for very large models, where full training or even fine-tuning can be prohibitively expensive. It can quantize models to 2-, 3-, or 4-bit formats, offering flexibility based on your specific needs.

GPTQ employs a suite of optimization techniques that simplify the quantization process while maintaining the model’s accuracy. According to the original paper, GPTQ more than doubles the compression gains compared to previously proposed one-shot quantization methods, demonstrating its superior efficiency.

Experimental results show that GPTQ can accelerate inference by approximately 3.25x when using high-end GPUs like the NVIDIA A100, and by 4.5x when using more cost-effective options like the NVIDIA A6000. 

In a practical comparison, the BLOOM model, with its 176 billion parameters, can be quantized in less than 4 GPU-hours using GPTQ. In contrast, the alternative quantization algorithm OBQ takes 2 GPU-hours to quantize the much smaller BERT model, which has only 336 million parameters. 

AutoGPTQ

The creators of GPTQ, based at the IST Austria Distributed Algorithms and Systems Lab, have made the code publicly available on GitHub. This implementation supports the OPT and BLOOM families of LLMs.

There are also several other implementations that apply GPTQ to Llama models, including the well-known llama.cpp project. However, for a broader range of transformer models, the AutoGPTQ library is a robust choice. It is compatible with the widely used Hugging Face Transformers library and lets you upload your AutoGPTQ models to the Hugging Face Hub, making them accessible to applications and other developers.
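
The snippet below sketches the typical AutoGPTQ workflow, following the pattern in the library’s documentation: load a full-precision model, run GPTQ with a few calibration examples, and save the quantized weights. The model name, calibration text, and output directory are placeholders, and exact arguments may differ between library versions.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"            # small model used purely for illustration
quantized_dir = "opt-125m-4bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# GPTQ needs a handful of calibration examples to estimate the quantization error.
examples = [
    tokenizer("GPTQ is a post-training quantization method for large language models.")
]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4-bit integers
    group_size=128,  # share one scale per group of 128 weights
    desc_act=False,  # keep the default activation ordering
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)              # run the GPTQ algorithm
model.save_quantized(quantized_dir)   # the result can then be pushed to the Hugging Face Hub
```

The group_size setting trades size for accuracy: smaller groups store more scales but track the original weights more closely.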

Hugging Face already hosts several models quantized with AutoGPTQ, simplifying their deployment. The Hugging Face AutoGPTQ integration also supports AMD GPUs and parameter-efficient fine-tuning, including low-rank adaptation (LoRA).
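
With the integration installed (the transformers, optimum, and auto-gptq packages), loading one of these pre-quantized checkpoints looks like loading any other Transformers model. The repository name below is only an example of a GPTQ checkpoint on the Hub; substitute whichever quantized model you want to use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any GPTQ-quantized repository on the Hub works here; this name is illustrative.
repo_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

inputs = tokenizer("Quantization makes LLMs lighter because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```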

You can run your AutoGPTQ models using Hugging Face’s Text Generation Inference (TGI) toolkit. According to Hugging Face, AutoGPTQ lets you host a 70-billion-parameter model on a single A100-80GB GPU, which is not possible with the fp16 version of the model, whose weights alone take roughly 140 gigabytes.
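
Once a GPTQ model is being served, clients talk to TGI over HTTP. The sketch below assumes a TGI server has already been launched with its GPTQ quantization option and is listening locally on port 8080; the endpoint and parameters follow TGI’s generate API.

```python
import requests

# Assumes a Text Generation Inference server is already running locally on port 8080,
# serving a GPTQ-quantized model.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain GPTQ quantization in one sentence.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
print(response.json()["generated_text"])
```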

For coding examples and more information on running AutoGPTQ, see this Google Colab notebook by Hugging Face.
