How to run thousands of LoRA language models on one GPU

Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

Fine-tuning large language models (LLMs) has become a pivotal technique for companies aiming to tailor AI applications to specific needs. Despite their potential, the steep costs of fine-tuning and deploying these LLMs deter widespread use.

Parameter-efficient fine-tuning (PEFT) methods help overcome this barrier by modifying only a fraction of the model’s parameters, drastically reducing expenses. Among these, low-rank adaptation (LoRA) stands out for its cost-effectiveness and its unique ability to function as a detachable adapter, separate from the core LLM.

S-LoRA, a new framework developed by Stanford and UC Berkeley researchers, scales up LoRA’s capabilities. S-LoRA’s breakthrough allows for the concurrent operation of hundreds, even thousands, of LoRA adapters on a single GPU. This scalability not only slashes operational costs but also paves the way for broader access to customized LLM applications, potentially making it possible to have one fine-tuned model per user at a negligible cost.

How LoRA works

Low-rank adaptation (LoRA)

Traditionally, fine-tuning LLMs for new applications—or “downstream tasks”—involves updating many layers and parameters within a pre-trained model. Given that LLMs typically have billions of parameters, this method demands substantial computational power. LoRA instead freezes the pre-trained weights and trains small low-rank matrices that capture the adjustments needed for the downstream task. This targeted approach dramatically slashes the volume of trainable parameters and, consequently, the memory footprint.

LoRA reduces the number of trainable parameters by several orders of magnitude while maintaining accuracy levels on par with full-parameter fine-tuning. This balance of efficiency and efficacy has led to widespread adoption within the AI community, with numerous LoRA adapters developed for various pre-trained LLMs and diffusion models.
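
To make the mechanism concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer (an illustration of the general technique, not code from the LoRA paper or from S-LoRA): the pre-trained weight is frozen, and only two small matrices, A and B, are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = base(x) + (alpha/r) * x A^T B^T."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                         # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)   # low-rank factor A (r x d_in)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))         # low-rank factor B (d_out x r), zero-init
        self.scale = alpha / rank

    def forward(self, x):
        # full-size base projection plus the small low-rank correction; only A and B get gradients
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,}")   # 65,536 of 16,846,848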

The original LoRA paper proposed integrating the fine-tuned low-rank weights back into the base LLM. However, an alternative and increasingly popular approach is to maintain the LoRA weights as standalone adapters. These adapters can be dynamically plugged into the base model during inference. The weights of the LoRA adapter and base model are computed independently and then combined. The additional computational overhead is minimal due to the LoRA adapter’s small size.

This modular design allows for multiple LoRA adapters to coexist, each fine-tuned with different datasets, occupying only a fraction of the main model’s memory, and representing a distinct fine-tuned version of the model.

The LoRA adapters can be plugged into the model at runtime based on the application. 
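
A toy sketch of that modularity, with hypothetical adapter names and a single projection standing in for the full model: the base output and the adapter’s low-rank correction are computed separately and summed, so switching adapters amounts to a dictionary lookup.

```python
import torch

# Hypothetical registry: one standalone low-rank pair (A, B) per fine-tuned task.
adapters = {
    "support-bot": (torch.randn(8, 4096) * 0.01, torch.randn(4096, 8) * 0.01),
    "legal-summarizer": (torch.randn(8, 4096) * 0.01, torch.randn(4096, 8) * 0.01),
}

base_weight = torch.randn(4096, 4096)   # stands in for one frozen projection of the base LLM

def forward_with_adapter(x, adapter_name, scale=2.0):
    """Compute the shared base projection and the tiny per-adapter correction separately, then add."""
    A, B = adapters[adapter_name]
    return x @ base_weight.T + scale * ((x @ A.T) @ B.T)

x = torch.randn(1, 4096)
y_support = forward_with_adapter(x, "support-bot")       # same base weights,
y_legal = forward_with_adapter(x, "legal-summarizer")    # different behavior per adapter
```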

The challenges of scaling LoRA

Running multiple LoRA models alongside a full-parameter LLM presents several technical challenges. Memory management is a primary concern; the finite capacity of GPU memory restricts the number of LoRA adapters that can be simultaneously active with the main model. Additionally, LLM servers typically employ caching and batching techniques to process numerous requests collectively and enhance throughput. However, the variable sizes of LoRA adapters and their separate computation from the base model introduce memory and computational complexities that can impede the inference speed.

Moreover, the integration of LoRA adapters becomes even more complex with larger LLMs that necessitate parallel processing across multiple GPUs. The additional weights and computations from LoRA adapters complicate the already intricate task of synchronizing processes over several GPUs, posing a significant challenge to maintaining efficient operations.

S-LoRA

S-LoRA architecture

S-LoRA, which stands for “scalable LoRA,” overcomes the hurdles of running multiple adapters simultaneously. It introduces three key innovations that collectively streamline the process.

First is its dynamic memory management system. This system efficiently loads all LoRA weights into the main memory and dynamically loads them into and unloads them from the GPU as needed to handle incoming batched requests for fine-tuned models. This approach ensures that the right resources are allocated at the right time, optimizing memory usage and avoiding long delays in responding to requests.
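
The broad idea can be sketched as an LRU-style cache that keeps every adapter in host RAM and copies only the ones needed by the current batch onto the GPU (a simplified illustration under those assumptions, not S-LoRA’s actual memory manager):

```python
import torch
from collections import OrderedDict

class AdapterCache:
    """Keep all adapter weights in host RAM; copy only those needed by the
    current batch to the GPU, evicting the least recently used when full."""
    def __init__(self, host_adapters, gpu_capacity=32, device="cuda"):
        self.host = host_adapters        # name -> (A, B) tensors held in CPU memory
        self.gpu = OrderedDict()         # name -> (A, B) tensors resident on the GPU
        self.capacity = gpu_capacity
        self.device = device

    def fetch(self, name):
        if name in self.gpu:
            self.gpu.move_to_end(name)        # mark as recently used
            return self.gpu[name]
        if len(self.gpu) >= self.capacity:
            self.gpu.popitem(last=False)      # evict the least recently used adapter
        A, B = self.host[name]
        self.gpu[name] = (A.to(self.device, non_blocking=True),
                          B.to(self.device, non_blocking=True))
        return self.gpu[name]

    def prepare_batch(self, requests):
        # ensure every adapter referenced by the incoming batch is resident on the GPU
        return {r["adapter"]: self.fetch(r["adapter"]) for r in requests}
```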

Secondly, S-LoRA incorporates a Unified Paging system. This system manages both the key-value (KV) caches of active requests and the adapter weights in a single memory pool, allowing the server to process hundreds or even thousands of batched queries. It does so without causing memory fragmentation, a common bottleneck that can otherwise force the model to recompute the compute-intensive KV cache values.
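
A toy model of the paging idea (illustrative only; S-LoRA’s pool lives in GPU memory and holds tensors of varying shapes): fixed-size pages are drawn from one shared pool for both KV-cache entries and adapter weights, so neither fragments the other.

```python
class UnifiedPagePool:
    """One pool of fixed-size pages shared by KV-cache entries and adapter weights."""
    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))   # page indices into one large buffer
        self.allocations = {}                      # owner id -> list of page indices

    def allocate(self, owner: str, num_elements: int):
        pages_needed = -(-num_elements // self.page_size)   # ceiling division
        if pages_needed > len(self.free_pages):
            raise MemoryError("pool exhausted: evict finished requests or idle adapters")
        pages = [self.free_pages.pop() for _ in range(pages_needed)]
        self.allocations[owner] = pages
        return pages

    def release(self, owner: str):
        self.free_pages.extend(self.allocations.pop(owner, []))

pool = UnifiedPagePool(num_pages=1024, page_size=4096)
pool.allocate("kv-cache:request-17", num_elements=9_000)    # KV tensors for one request
pool.allocate("adapter:support-bot", num_elements=65_536)   # a LoRA adapter's weights
pool.release("kv-cache:request-17")                         # pages return to the shared pool
```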

The third innovation is a novel tensor parallelism system designed specifically for batched LoRA inference, facilitating the use of multi-GPU setups for large transformer models. This system ensures that the parallel processing of LoRA adapters across GPUs is both efficient and effective.
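
The key trick, loosely illustrated below in a single process, is to partition the adapter the same way the base weight is partitioned, so the per-adapter computation adds no extra communication on top of the base model’s existing gather step (a Megatron-style column-parallel sketch, not S-LoRA’s exact partitioning scheme):

```python
import torch

def column_parallel_forward(x, weight_shards, A, B_shards):
    """Each 'GPU' holds one column shard of the base weight and the matching shard
    of the LoRA B matrix; the tiny A matrix is replicated everywhere. Partial
    outputs only need to be concatenated (an all-gather in a real multi-GPU setup)."""
    outputs = []
    for W_shard, B_shard in zip(weight_shards, B_shards):
        base_part = x @ W_shard.T            # shard of the base projection
        lora_part = (x @ A.T) @ B_shard.T    # matching shard of the low-rank update
        outputs.append(base_part + lora_part)
    return torch.cat(outputs, dim=-1)

d_in, d_out, rank, n_gpus = 4096, 4096, 8, 2
W = torch.randn(d_out, d_in)
A = torch.randn(rank, d_in) * 0.01
B = torch.randn(d_out, rank) * 0.01
weight_shards = W.chunk(n_gpus, dim=0)       # split the output dimension across "GPUs"
B_shards = B.chunk(n_gpus, dim=0)            # split B the same way so the shards line up

x = torch.randn(1, d_in)
y = column_parallel_forward(x, weight_shards, A, B_shards)
assert torch.allclose(y, x @ W.T + (x @ A.T) @ B.T, atol=1e-3)
```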

These features enable S-LoRA to serve a multitude of LoRA adapters on a single GPU or across multiple GPUs with minimal overhead. In testing, S-LoRA was used to serve various versions of the open-source Llama LLM across different GPU configurations. The results were impressive: S-LoRA could smoothly handle thousands of LoRA adapters on a single GPU, adding only a small overhead.

In comparison to the Hugging Face PEFT library, a state-of-the-art parameter-efficient fine-tuning tool, S-LoRA enhanced throughput by up to 30X. Compared to vLLM, a high-throughput serving system with basic support for LoRA serving, S-LoRA improved throughput by up to 4X and significantly increased the number of served adapters.

S-LoRA can be paired with in-context learning or retrieval augmented generation (RAG), providing users with a personalized adapter and incorporating their recent data into the LLM’s prompt as context. 
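
In practice, that pairing could be as simple as the hypothetical glue code below (retriever and serve_fn stand in for whatever retrieval index and serving endpoint an application already uses; neither is part of S-LoRA itself):

```python
def answer(user_id: str, question: str, retriever, serve_fn) -> str:
    """Route a request to the user's personal adapter and prepend freshly retrieved context."""
    adapter_name = f"user-{user_id}"                      # one fine-tuned adapter per user
    context_docs = retriever.search(question, top_k=3)    # recent data pulled in via retrieval
    prompt = "\n\n".join(context_docs) + "\n\nQuestion: " + question
    return serve_fn(prompt, adapter=adapter_name)         # the server swaps in the adapter at runtime
```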

The S-LoRA code is publicly available on GitHub, and there are plans to integrate it with widely-used LLM-serving frameworks. This integration will enable companies to readily incorporate S-LoRA into their applications.

While S-LoRA is a significant advancement, it’s not alone in this space. Predibase’s LoRAX is another framework designed for serving LoRA adapters at scale, offering seamless operation of hundreds of LoRA models without the complexities of memory management and model switching.
