
How to make your LLMs lighter with GPTQ quantization

November 8, 2023

lightweight flying llamas — Image generated with Bing Image Creator

One of the big challenges of large language models (LLMs) is their hefty memory and computational demands: they often require tens of gigabytes of GPU memory. This makes them not only expensive but also difficult to run.

To mitigate these issues, researchers have developed several LLM compression techniques, including “quantization.” Quantization reduces the model’s size by changing how the parameters are stored. One efficient and fast quantization algorithm is GPTQ. It is supported in the Hugging Face ecosystem through the AutoGPTQ library and offers a cost-effective way to shrink models. Here is what you need to know about quantizing your LLMs with GPTQ.

What is quantization?

Transformer models, such as LLMs, typically store parameters using 16-bit floating point (fp16) numbers. Consequently, a model with one billion parameters demands at least 2 gigabytes of memory, plus additional overhead resources.
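The arithmetic behind that figure is simple enough to sketch. The helper below is a hypothetical illustration (the function name is not from any library), and it counts only the weights, ignoring activations and other runtime overhead:

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# One billion parameters stored in the formats discussed in this article.
for fmt, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{fmt}: {weight_memory_gb(1_000_000_000, bits):.1f} GB")
```

Halving the number of bits per parameter halves the memory the weights occupy, which is the saving quantization is after.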

Quantization offers a solution to this problem by converting these parameters into a smaller integer format, such as int8 or int4, effectively reducing the model’s size. The challenge for quantization algorithms is to achieve this reduction while minimizing any loss in the model’s accuracy.
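To make the idea concrete, here is a minimal sketch of the simplest possible scheme, symmetric round-to-nearest quantization. This is not GPTQ, which is considerably more sophisticated; it only illustrates how floating-point weights map to small integers plus a scale factor, and how rounding introduces error. The function names and sample weights are made up for illustration.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of a list of floats."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale for the whole group
    # Round each weight to the nearest integer step and clamp to the valid range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.98, -0.05, 0.44]          # illustrative values
q, scale = quantize_rtn(weights, bits=4)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print("integers:", q)
print("largest rounding error:", max_err)
```

The rounding error for each weight is at most half a quantization step. More advanced algorithms aim to reduce the effect of this error on the model's output.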

Quantization techniques fall into three main categories:

1. Quantization-aware training (QAT): This technique integrates quantization into the training process. By allowing the model to learn low-precision representations from the outset, QAT mitigates the precision loss typically associated with quantization.

2. Quantization-aware fine-tuning (QAFT): This approach adapts a pre-trained high-precision model to maintain its quality with lower-precision weights. Notable QAFT techniques include QLoRA and Parameter-Efficient and Quantization-Aware Adaptation (PEQA), both designed to preserve model quality while reducing size.

3. Post-training quantization (PTQ): This method transforms the parameters of the LLM to lower-precision data types after the model has been trained. PTQ aims to reduce the model’s complexity without altering its architecture or requiring retraining.

GPTQ

GPTQ is a post-training quantization technique, making it an ideal choice for very large models where full training or even fine-tuning can be prohibitively expensive. It has the capability to quantize models to 2-, 3-, or 4-bit format, offering flexibility based on your specific needs.

GPTQ employs a suite of optimization techniques that simplify the quantization process while maintaining the model’s accuracy. According to the original paper, GPTQ more than doubles the compression gains compared to previously proposed one-shot quantization methods, demonstrating its superior efficiency.
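For readers who want the underlying formulation: as described in the GPTQ paper, the method works layer by layer and looks for quantized weights that reproduce the original layer's output on a small set of calibration inputs. In the notation below, $W$ is a layer's original weight matrix, $X$ is the layer's input for the calibration samples, and $\widehat{W}$ is the quantized weight matrix. This is only a statement of the objective, not of the procedure GPTQ uses to solve it.

```latex
\operatorname*{arg\,min}_{\widehat{W}} \; \lVert W X - \widehat{W} X \rVert_2^2
```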

Experimental results in the paper show that, relative to fp16, GPTQ can accelerate end-to-end inference by approximately 3.25x on high-end GPUs like the NVIDIA A100, and by 4.5x on more cost-effective options like the NVIDIA A6000.

In a practical comparison, the BLOOM model, with its 176 billion parameters, can be quantized in less than 4 GPU-hours using GPTQ. In contrast, the alternative quantization algorithm OBQ takes 2 GPU-hours to quantize the much smaller BERT model, which has only 336 million parameters.

AutoGPTQ

The creators of GPTQ, based at the IST Austria Distributed Algorithms and Systems Lab, have made the code publicly available on GitHub. This implementation supports the OPT and BLOOM families of LLMs.

There are also several other implementations that apply GPTQ to Llama models, including the well-known Llama.cpp project. However, for a broader range of transformer models, the AutoGPTQ library is a robust choice. It’s compatible with the widely used Hugging Face Transformers library, allowing you to upload your AutoGPTQ models to Hugging Face, making them accessible to applications and other developers.

Hugging Face already hosts several models quantized with AutoGPTQ, simplifying their deployment. The Hugging Face AutoGPTQ integration also supports AMD GPUs and parameter-efficient fine-tuning, including low-rank adaptation (LoRA).

You can run your AutoGPTQ models using Hugging Face’s Text-Generation-Inference (TGI) toolkit. According to Hugging Face, you can host a 70-billion-parameter model on a single A100-80GB GPU using AutoGPTQ. This is impossible with fp16-format models, whose weights alone would take up about 140 gigabytes.
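As a rough sketch of what the Hugging Face integration looks like in practice, the function below quantizes a model with `GPTQConfig` from the Transformers library. It assumes that `transformers`, `optimum`, and `auto-gptq` are installed and that a GPU is available. The wrapper function and its defaults are hypothetical choices for illustration, and the exact arguments may differ between library versions.

```python
def quantize_with_gptq(model_id: str, bits: int = 4, calibration_dataset: str = "c4"):
    """Load a model from the Hugging Face Hub and quantize it with GPTQ.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    # Imported lazily because these are heavy, optional dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # GPTQ is a post-training method, so it only needs calibration data,
    # not a training run. "c4" is one of the built-in dataset options.
    gptq_config = GPTQConfig(bits=bits, dataset=calibration_dataset, tokenizer=tokenizer)

    # Passing the config triggers quantization while the model is loaded.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=gptq_config,
    )
    return model, tokenizer
```

A call such as `quantize_with_gptq("facebook/opt-125m")` would be a small-scale way to try this out. The returned model can then be saved with `save_pretrained` or uploaded with `push_to_hub`, like any other Transformers model.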

For coding examples and more information on running AutoGPTQ, see this Google Colab notebook by Hugging Face.
