StreamingLLM gives language models unlimited context

Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

Large language models (LLMs) are renowned for their ability to process long text sequences. However, when dealing with prolonged chat sessions, these models often reach their context limit. 

This poses a challenge for applications that need to extend the model’s context to even longer sequences. Current solutions to this problem are either computationally demanding, memory-intensive, or imprecise. 

A breakthrough solution is StreamingLLM, developed by a collaborative team of researchers from Meta AI, MIT, and Carnegie Mellon University. This innovative technique can extend an LLM’s context to millions of tokens without the need for vast compute and memory resources, all while preserving the model’s high-quality performance. StreamingLLM is poised to be an invaluable tool for applications that require long-sequence text processing.

LLMs and context windows

LLMs are inherently designed with a fixed context length, a feature dictated by their architecture and training methodologies. For instance, Llama-2, a popular LLM, has a context of approximately 4,000 tokens, equivalent to around 3,000 words. As long as the interaction with the language model remains within this context limit, the model can maintain its high-quality performance. However, this finite sequence length restricts its broader applications.

Different methods to extend an LLM’s context window (source: arxiv)

One potential solution to this limitation is to create a model with a longer context length. However, this approach requires modifying the model’s architecture and retraining the model, a process that can be prohibitively expensive and inaccessible for many organizations. Furthermore, extending the context length incurs quadratic costs, meaning that doubling the context of an LLM would result in a quadrupling of memory and compute costs.
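
To see why, consider that self-attention compares every token in the context with every other token. The following back-of-the-envelope snippet is only an illustration of that scaling, not something from the paper:

```python
# Back-of-the-envelope illustration: self-attention compares every token with
# every other token, so its cost grows with the square of the context length.
def attention_pairs(context_length: int) -> int:
    return context_length * context_length

print(attention_pairs(8_000) / attention_pairs(4_000))  # 4.0 -> doubling the context quadruples the cost
```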

An alternative approach is the implementation of a sliding context window. In this scenario, if a model’s context is 4,000 tokens, the model is always fed the last 4,000-x tokens, where ‘x’ is the number of tokens it is expected to generate.

While this technique seems intuitive, it carries significant practical drawbacks.

Autoregressive LLMs employ a mechanism known as “KV caching” to enhance efficiency. This mechanism computes and stores the key and value vectors of previous tokens in each attention layer, eliminating the need to recompute them for every new token. Because the attention output for each token depends on the keys and values of all preceding tokens, shifting the context window invalidates the cache: the entire KV cache must be recomputed, which significantly reduces the model’s throughput.
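
As a rough sketch of how this works in practice, here is a minimal illustration of greedy decoding with a Hugging Face-style KV cache interface (this is not the StreamingLLM code, just the standard pattern):

```python
import torch

@torch.no_grad()
def generate_with_kv_cache(model, input_ids, max_new_tokens):
    """Greedy decoding with a KV cache (minimal illustration, not StreamingLLM)."""
    past_key_values = None
    generated = input_ids
    for _ in range(max_new_tokens):
        # After the first step, only the newest token needs a forward pass;
        # the keys/values of earlier tokens are read from the cache.
        inputs = generated if past_key_values is None else generated[:, -1:]
        out = model(inputs, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values   # grows by one entry per step
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
    return generated

# A naive sliding window drops the oldest tokens when the context fills up.
# Every cached key/value was computed with those tokens still in context (and
# at their original positions), so the whole cache must be rebuilt, stalling
# the model's throughput.
```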

Another solution is to move the window while retaining the cached values for the tokens that overlap between the old and new context. While this method does offer some improvement, it is not without its flaws: the model’s quality declines rapidly once the sequence grows beyond the cache size and the earliest tokens are dropped. 

Attention sinks

In their paper, the researchers highlight an intriguing characteristic of autoregressive LLMs like GPT-3.5 and Llama-2: a substantial proportion of the attention score is allocated to the initial tokens, regardless of their relevance to the language modeling task. They refer to these tokens as “attention sinks”.

Interestingly, they observe that the model’s perplexity increases dramatically when the text length exceeds the cache size, primarily due to the exclusion of these initial tokens. (Perplexity measures how uncertain the model is about its predictions; lower values indicate better predictions.) This finding suggests that these attention sinks, irrespective of their distance from the tokens being predicted, play a pivotal role in maintaining the stability of LLMs.
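
For reference, perplexity is a standard metric: the exponential of the average negative log-likelihood the model assigns to the correct next tokens. The snippet below is just the textbook definition, not something taken from the paper:

```python
import math

def perplexity(token_log_probs):
    # token_log_probs: log-probabilities the model assigned to each ground-truth token.
    # Perplexity = exp(average negative log-likelihood); lower is better.
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that always spreads its guess evenly over two tokens has perplexity 2.
print(perplexity([math.log(0.5)] * 10))  # 2.0
```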

Visualization of attention maps in LLMs shows that the initial tokens are very important, especially in the deeper layers (source: arxiv)

The reason behind this phenomenon is intuitive. Given the autoregressive nature of language modeling, initial tokens are visible to almost all subsequent tokens, making them prime candidates to serve as attention sinks. Conversely, later tokens are only visible to a limited set of subsequent tokens. As a result, initial tokens are more readily trained to act as attention sinks, thereby capturing a disproportionate amount of attention.

Consequently, when the attention values of the first few tokens are removed from the context, the model’s performance begins to deteriorate due to the significant loss of attention value. The preservation of these attention sinks forms the fundamental premise of the StreamingLLM technique, offering a promising solution to the limitations of current LLMs.

How StreamingLLM works

StreamingLLM is an innovative framework that allows large language models to handle text of infinite length without the need for fine-tuning. The technique preserves the attention sinks to maintain a near-normal attention score distribution. When the conversation with the LLM surpasses the model’s context length, StreamingLLM retains the KV cache entries for the attention sink tokens (four initial tokens are sufficient) and evicts the oldest of the remaining tokens to make room for the most recent ones. This approach lets the model keep generating over very long sequences with stable performance, without having to recompute the entire KV cache.
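
The bookkeeping behind this is simple. Here is a simplified sketch of the eviction policy on token IDs; it is not the official implementation, which evicts key and value tensors per attention layer and, according to the paper, assigns positional information based on a token’s slot in the cache rather than its position in the original text:

```python
# Simplified illustration of StreamingLLM's cache policy (not the official code).
# Always keep the first few "attention sink" tokens, plus a rolling window of
# the most recent tokens; evict the oldest non-sink entry when the cache is full.

def update_cache(cache, new_token, num_sinks=4, window=1020):
    cache.append(new_token)
    if len(cache) > num_sinks + window:
        del cache[num_sinks]       # the sinks at the front are never evicted
    return cache

cache = []
for token_id in range(5000):       # stream 5,000 tokens through a 1,024-slot cache
    update_cache(cache, token_id)

print(cache[:4])    # [0, 1, 2, 3]        -> attention sinks are preserved
print(cache[-3:])   # [4997, 4998, 4999]  -> the most recent tokens are kept
print(len(cache))   # 1024
```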

“The introduction of four initial tokens, as attention sinks, suffices to restore the LLM’s performance,” the researchers write. “In contrast, adding just one or two doesn’t achieve full recovery. We believe this pattern emerges because these models didn’t include a consistent starting token across all input samples during pre-training.”

Under the StreamingLLM framework, the KV cache comprises the attention sinks and a rolling KV cache that retains the most recent tokens vital for language modeling. The researchers emphasize the versatility of StreamingLLM, stating, “StreamingLLM’s design is versatile and can be seamlessly incorporated into any autoregressive language model that employs relative positional encoding.”

StreamingLLM keeps the attention sinks in the KV cache and moves the rest of the slots with the context window (source: arxiv)

According to the researchers, LLMs such as Llama-2 (7-70 billion parameters), Falcon (7-40 billion parameters), and Pythia (2.9-12 billion parameters) can reliably model up to 4 million tokens and potentially more under the StreamingLLM framework. This technique effectively addresses the challenges posed by other methods, offering fast inference, high quality, and low memory requirements.

“StreamingLLM firstly decouples the LLM’s pre-training window size and its actual text generation length, paving the way for the streaming deployment of LLMs,” the researchers write. 

(Note: StreamingLLM does not extend the context of the model to 4 million tokens. It allows the model to maintain its quality up to and possibly beyond that amount. At any moment, the model only has memory of the number of tokens that its architecture allows, e.g., 4,000 tokens.)

Pretraining language models with attention sinks

The researchers highlight that a significant factor contributing to the model’s excessive attention to multiple initial tokens is the lack of a designated sink token to absorb excessive attention scores. Consequently, the model unintentionally assigns globally visible tokens, predominantly the initial ones, as attention sinks. 

“A potential remedy can be the intentional inclusion of a global trainable attention sink token, denoted as a ‘Sink Token,’ which would serve as a repository for unnecessary attention scores,” they propose. 

With this insight, language models can be pre-trained to require only a single attention sink token for streaming deployment. The only prerequisite is including an extra learnable token at the start of all training samples to act as the attention sink.
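
In practice, that is just a data-preparation step. The sketch below is hypothetical: the token name "<sink>", the Pythia checkpoint, and the preprocessing function are illustrative assumptions, not the paper’s training code.

```python
from transformers import AutoTokenizer

# Hypothetical sketch: register a dedicated sink token and prepend it to every
# training sample. The "<sink>" name is illustrative, not from the paper.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
tokenizer.add_special_tokens({"additional_special_tokens": ["<sink>"]})

def preprocess(batch):
    # Prepend the sink token to each raw text sample so the model can learn
    # to park excess attention on it during pre-training.
    texts = ["<sink>" + text for text in batch["text"]]
    return tokenizer(texts, truncation=True, max_length=1024)

# The model's embedding matrix must also be resized to include the new token:
# model.resize_token_embeddings(len(tokenizer))
```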

To validate this approach, the researchers trained several 160-million-parameter language models from scratch, incorporating a single attention sink token at the beginning of the training examples. Their experiments demonstrated that the addition of this single sink token during inference effectively preserves the model’s performance in streaming cases.

“This stands in contrast to vanilla models, which necessitate the reintroduction of multiple initial tokens as attention sinks to achieve the same performance level,” the researchers note. 

Moreover, they found that the inclusion of a sink token during pre-training does not negatively impact model convergence or subsequent performance on a variety of natural language processing (NLP) benchmarks.

StreamingLLM in action

The authors of the research paper have made the code for StreamingLLM publicly accessible on GitHub. This Python library is compatible with Llama-2, MPT, Falcon, and Pythia models. 

Additionally, another open-source implementation of StreamingLLM functions as a drop-in replacement for the Hugging Face Transformers library and is compatible with other models on the Hugging Face platform. 
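
As a hypothetical illustration of what such a drop-in replacement looks like (the package name and keyword arguments below are assumptions about that implementation’s API, so check its README before relying on them):

```python
# Hypothetical usage of a drop-in StreamingLLM-style replacement for the
# Transformers AutoModel classes. Package and argument names are assumptions.
from attention_sinks import AutoModelForCausalLM  # assumed package name
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attention_sink_size=4,            # number of attention-sink tokens to keep
    attention_sink_window_size=1020,  # rolling window of the most recent tokens
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
```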

Hugging Face is also closely monitoring the development of StreamingLLM and considering its integration into their transformers library. This development promises to provide enhanced tools for implementing StreamingLLM in various applications, marking a significant advancement in the field of language modeling.

4 COMMENTS

  1. This seems amazing but I’m a bit confused. From what I understand you saying in this article, StreamingLLM could expand the context window of an LLM such as Llama to 4 million tokens, meaning I could hypothetically input 3 million words. However, the GitHub page explicitly says that it does not expand the context window. Am I missing something?

    • Hi Jonathan. StreamingLLM does not change the architecture of the model to expand the context window. What it does is shift the context window while maintaining accuracy and reusing part of the KV cache. So basically, you can extend the conversation with the LLM into millions of tokens as if its context window were unlimited, but without making any changes to the model or retraining it. I hope this helps.

      • Your explanation is inaccurate. It does not change the context window in any way. If the LLM has a 4k context window, it can only respond using the context of the latest 4k tokens. StreamingLLM makes LLMs more efficient by removing the need to reset the cache and improves accuracy vs LLMs that aren’t resetting their cache. It doesn’t make it so that an LLM with a 4k context window can accurately respond to a 128k token prompt. This article is spreading misinformation. Read the FAQ section here. https://github.com/mit-han-lab/streaming-llm

      • That is what I meant. It enables you to continue your conversation with the LLM past the context window, though as you said, it sticks to the length of the context window (e.g., 4k tokens). That’s what the article says too if you read it carefully.
