How to run thousands of LoRA language models on one GPU

Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

Fine-tuning large language models (LLMs) has become a pivotal technique for companies aiming to tailor AI applications to specific needs. Despite their potential, the steep costs of fine-tuning and deploying these LLMs deter widespread use.

Parameter-efficient fine-tuning (PEFT) methods help overcome this barrier by modifying only a fraction of the model’s parameters, drastically reducing expenses. Among these, low-rank adaptation (LoRA) stands out for its cost-effectiveness and its unique ability to function as a detachable adapter, separate from the core LLM.

S-LoRA, a new framework developed by Stanford and UC Berkeley researchers, scales up LoRA’s capabilities. S-LoRA’s breakthrough allows for the concurrent operation of hundreds, even thousands, of LoRA adapters on a single GPU. This scalability not only slashes operational costs but also paves the way for broader access to customized LLM applications, potentially making it possible to have one fine-tuned model per user at a negligible cost.

How LoRA works

Low-rank adaptation (LoRA)

Traditionally, fine-tuning LLMs for new applications—or “downstream tasks”—involves updating many layers and parameters within a pre-trained model. Given that LLMs typically have billions of parameters, this method demands substantial computational power. LoRA instead freezes the pre-trained weights and trains small low-rank matrices that capture the adjustments needed for the downstream task. This targeted approach dramatically slashes the volume of trainable parameters and, consequently, the memory footprint.

LoRA reduces the number of trainable parameters by several orders of magnitude while maintaining accuracy levels on par with full-parameter fine-tuning. This balance of efficiency and efficacy has led to widespread adoption within the AI community, with numerous LoRA adapters developed for various pre-trained LLMs and diffusion models.
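
To make the mechanism concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer (an illustration of the general technique, not code from the LoRA paper or from S-LoRA): the pre-trained weight is frozen, and only two small matrices, A and B, are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = base(x) + (alpha/r) * x A^T B^T."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                         # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)   # low-rank factor A (r x d_in)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))         # low-rank factor B (d_out x r), zero-init
        self.scale = alpha / rank

    def forward(self, x):
        # full-size base projection plus the small low-rank correction; only A and B get gradients
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,}")   # 65,536 of 16,846,848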

The original LoRA paper proposed integrating the fine-tuned low-rank weights back into the base LLM. However, an alternative and increasingly popular approach is to maintain the LoRA weights as standalone adapters. These adapters can be dynamically plugged into the base model during inference. The weights of the LoRA adapter and base model are computed independently and then combined. The additional computational overhead is minimal due to the LoRA adapter’s small size.

This modular design allows for multiple LoRA adapters to coexist, each fine-tuned with different datasets, occupying only a fraction of the main model’s memory, and representing a distinct fine-tuned version of the model.

The LoRA adapters can be plugged into the model at runtime based on the application. 
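
A toy sketch of that modularity, with hypothetical adapter names and a single projection standing in for the full model: the base output and the adapter’s low-rank correction are computed separately and summed, so switching adapters amounts to a dictionary lookup.

```python
import torch

# Hypothetical registry: one standalone low-rank pair (A, B) per fine-tuned task.
adapters = {
    "support-bot": (torch.randn(8, 4096) * 0.01, torch.randn(4096, 8) * 0.01),
    "legal-summarizer": (torch.randn(8, 4096) * 0.01, torch.randn(4096, 8) * 0.01),
}

base_weight = torch.randn(4096, 4096)   # stands in for one frozen projection of the base LLM

def forward_with_adapter(x, adapter_name, scale=2.0):
    """Compute the shared base projection and the tiny per-adapter correction separately, then add."""
    A, B = adapters[adapter_name]
    return x @ base_weight.T + scale * ((x @ A.T) @ B.T)

x = torch.randn(1, 4096)
y_support = forward_with_adapter(x, "support-bot")       # same base weights,
y_legal = forward_with_adapter(x, "legal-summarizer")    # different behavior per adapter
```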

The challenges of scaling LoRA

Running multiple LoRA models alongside a full-parameter LLM presents several technical challenges. Memory management is a primary concern; the finite capacity of GPU memory restricts the number of LoRA adapters that can be simultaneously active with the main model. Additionally, LLM servers typically employ caching and batching techniques to process numerous requests collectively and enhance throughput. However, the variable sizes of LoRA adapters and their separate computation from the base model introduce memory and computational complexities that can impede the inference speed.

Moreover, the integration of LoRA adapters becomes even more complex with larger LLMs that necessitate parallel processing across multiple GPUs. The additional weights and computations from LoRA adapters complicate the already intricate task of synchronizing processes over several GPUs, posing a significant challenge to maintaining efficient operations.

S-LoRA

S-LoRA architecture

S-LoRA, which stands for “scalable LoRA,” overcomes the hurdles of running multiple adapters simultaneously. It introduces three key innovations that collectively streamline the process.

First is its dynamic memory management system. This system efficiently loads all LoRA weights into the main memory and dynamically loads them into and unloads them from the GPU as needed to handle incoming batched requests for fine-tuned models. This approach ensures that the right resources are allocated at the right time, optimizing memory usage and avoiding long delays in responding to requests.
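
The broad idea can be sketched as an LRU-style cache that keeps every adapter in host RAM and copies only the ones needed by the current batch onto the GPU (a simplified illustration under those assumptions, not S-LoRA’s actual memory manager):

```python
import torch
from collections import OrderedDict

class AdapterCache:
    """Keep all adapter weights in host RAM; copy only those needed by the
    current batch to the GPU, evicting the least recently used when full."""
    def __init__(self, host_adapters, gpu_capacity=32, device="cuda"):
        self.host = host_adapters        # name -> (A, B) tensors held in CPU memory
        self.gpu = OrderedDict()         # name -> (A, B) tensors resident on the GPU
        self.capacity = gpu_capacity
        self.device = device

    def fetch(self, name):
        if name in self.gpu:
            self.gpu.move_to_end(name)        # mark as recently used
            return self.gpu[name]
        if len(self.gpu) >= self.capacity:
            self.gpu.popitem(last=False)      # evict the least recently used adapter
        A, B = self.host[name]
        self.gpu[name] = (A.to(self.device, non_blocking=True),
                          B.to(self.device, non_blocking=True))
        return self.gpu[name]

    def prepare_batch(self, requests):
        # ensure every adapter referenced by the incoming batch is resident on the GPU
        return {r["adapter"]: self.fetch(r["adapter"]) for r in requests}
```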

Secondly, S-LoRA incorporates a Unified Paging system. This system manages both the key-value (KV) caches of active requests and the adapter weights in a single memory pool, allowing the server to process hundreds or even thousands of batched queries. It does so without causing memory fragmentation, a common bottleneck that can otherwise force the model to recompute the compute-intensive KV cache values.
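
A toy model of the paging idea (illustrative only; S-LoRA’s pool lives in GPU memory and holds tensors of varying shapes): fixed-size pages are drawn from one shared pool for both KV-cache entries and adapter weights, so neither fragments the other.

```python
class UnifiedPagePool:
    """One pool of fixed-size pages shared by KV-cache entries and adapter weights."""
    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))   # page indices into one large buffer
        self.allocations = {}                      # owner id -> list of page indices

    def allocate(self, owner: str, num_elements: int):
        pages_needed = -(-num_elements // self.page_size)   # ceiling division
        if pages_needed > len(self.free_pages):
            raise MemoryError("pool exhausted: evict finished requests or idle adapters")
        pages = [self.free_pages.pop() for _ in range(pages_needed)]
        self.allocations[owner] = pages
        return pages

    def release(self, owner: str):
        self.free_pages.extend(self.allocations.pop(owner, []))

pool = UnifiedPagePool(num_pages=1024, page_size=4096)
pool.allocate("kv-cache:request-17", num_elements=9_000)    # KV tensors for one request
pool.allocate("adapter:support-bot", num_elements=65_536)   # a LoRA adapter's weights
pool.release("kv-cache:request-17")                         # pages return to the shared pool
```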

The third innovation is a novel tensor parallelism system designed specifically for batched LoRA inference, facilitating the use of multi-GPU setups for large transformer models. This system ensures that the parallel processing of LoRA adapters across GPUs is both efficient and effective.
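
The key trick, loosely illustrated below in a single process, is to partition the adapter the same way the base weight is partitioned, so the per-adapter computation adds no extra communication on top of the base model’s existing gather step (a Megatron-style column-parallel sketch, not S-LoRA’s exact partitioning scheme):

```python
import torch

def column_parallel_forward(x, weight_shards, A, B_shards):
    """Each 'GPU' holds one column shard of the base weight and the matching shard
    of the LoRA B matrix; the tiny A matrix is replicated everywhere. Partial
    outputs only need to be concatenated (an all-gather in a real multi-GPU setup)."""
    outputs = []
    for W_shard, B_shard in zip(weight_shards, B_shards):
        base_part = x @ W_shard.T            # shard of the base projection
        lora_part = (x @ A.T) @ B_shard.T    # matching shard of the low-rank update
        outputs.append(base_part + lora_part)
    return torch.cat(outputs, dim=-1)

d_in, d_out, rank, n_gpus = 4096, 4096, 8, 2
W = torch.randn(d_out, d_in)
A = torch.randn(rank, d_in) * 0.01
B = torch.randn(d_out, rank) * 0.01
weight_shards = W.chunk(n_gpus, dim=0)       # split the output dimension across "GPUs"
B_shards = B.chunk(n_gpus, dim=0)            # split B the same way so the shards line up

x = torch.randn(1, d_in)
y = column_parallel_forward(x, weight_shards, A, B_shards)
assert torch.allclose(y, x @ W.T + (x @ A.T) @ B.T, atol=1e-3)
```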

These features enable S-LoRA to serve a multitude of LoRA adapters on a single GPU or across multiple GPUs with minimal overhead. In testing, S-LoRA was used to serve various versions of the open-source Llama LLM across different GPU configurations. The results were impressive: S-LoRA could smoothly handle thousands of LoRA adapters on a single GPU, adding only a small overhead.

In comparison to the Hugging Face PEFT library, a state-of-the-art parameter-efficient fine-tuning tool, S-LoRA enhanced throughput by up to 30X. Compared to vLLM, a high-throughput serving system with basic support for LoRA serving, S-LoRA improved throughput by up to 4X and significantly increased the number of served adapters.

S-LoRA can be paired with in-context learning or retrieval augmented generation (RAG), providing users with a personalized adapter and incorporating their recent data into the LLM’s prompt as context. 
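
In practice, that pairing could be as simple as the hypothetical glue code below (retriever and serve_fn stand in for whatever retrieval index and serving endpoint an application already uses; neither is part of S-LoRA itself):

```python
def answer(user_id: str, question: str, retriever, serve_fn) -> str:
    """Route a request to the user's personal adapter and prepend freshly retrieved context."""
    adapter_name = f"user-{user_id}"                      # one fine-tuned adapter per user
    context_docs = retriever.search(question, top_k=3)    # recent data pulled in via retrieval
    prompt = "\n\n".join(context_docs) + "\n\nQuestion: " + question
    return serve_fn(prompt, adapter=adapter_name)         # the server swaps in the adapter at runtime
```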

The S-LoRA code is publicly available on GitHub, and there are plans to integrate it with widely-used LLM-serving frameworks. This integration will enable companies to readily incorporate S-LoRA into their applications.

While S-LoRA is a significant advancement, it’s not alone in this space. Predibase’s LoRAX is another framework designed for serving LoRA adapters at scale, offering seamless operation of hundreds of LoRA models without the complexities of memory management and model switching.
