How to turn any LLM into an embedding model


This article is part of our coverage of the latest in AI research.

Embedding models have become an important part of LLM applications, enabling tasks such as measuring text similarity, information retrieval, and clustering. However, most embedding models are based on a transformer architecture that is different from the one used for generative tasks. This makes it difficult to transfer the massive amount of work being done on generative models to embedding models, and it forces the two lines of work to advance in parallel.

In a new study, researchers at Quebec AI Institute (Mila) and ServiceNow Research introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. Experiments show that LLM2Vec models have state-of-the-art performance on embedding tasks.

LLM2Vec is intuitive and efficient, and it can open up new avenues for organizations to create specialized embedding models quickly and at very low cost.

Decoder-only transformers for embedding

Currently, the main architectures used for embedding models are pre-trained bidirectional encoders and encoder-decoders such as BERT and T5. Unlike decoder-only models such as GPT and Llama, these models are not designed to generate tokens but to encode the semantic content of their input as a numerical vector.

Recently, researchers have been exploring decoder-only LLMs for embedding text. The limited use of decoder models is partly due to their causal attention mechanism, in which each token can only attend to the tokens that precede it. This is useful for generative tasks but subpar for learning rich embeddings.

“The need for a powerful instruction-following text encoder, for instance in retrieval, was a reason for us to explore how to use decoder-only LLMs for text representations,” Parishad BehnamGhader, PhD Student at McGill University and Mila and lead author of the LLM2Vec paper, told TechTalks.

There are several reasons why decoder-only LLMs are suitable for embedding tasks. First, their training recipe makes them learn from all input tokens, as opposed to encoder models, which receive a training signal only from the fraction of the input that is masked. Second, there is a lot of activity around decoder LLMs, with a rich ecosystem of models, techniques, and tools to choose from. Finally, LLMs that have been fine-tuned for instruction following and human preferences can be very suitable foundations for universal text embedding models that generalize across many tasks.

LLM2Vec

LLM2Vec is a simple unsupervised approach that can be used to transform any decoder-only LLM into an embedding model.

“When we started the project, only a few papers existed that used decoder-only LLMs for text representations. However, they mostly focused on supervised fine-tuning,” BehnamGhader said. “In contrast, our goal was to build a general recipe which is simple, can be applied to any decoder-only LLM, and does not necessarily require labeled training data. Hence, our focus was on parameter-efficient fine-tuning with both unsupervised and supervised publicly available data.”

LLM2Vec (source: arXiv)

LLM2Vec consists of three simple steps. First, the model is modified to enable bidirectional attention. This allows each token to attend to all other tokens in the sequence instead of only the previous ones, as is the case in standard decoder-only LLMs.
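The change amounts to swapping the causal attention mask for a full one. Below is a minimal PyTorch sketch of that difference; it only illustrates the masking idea and is not the paper's actual implementation, which patches the attention modules of specific model families.

```python
import torch
import torch.nn.functional as F

seq_len, dim = 6, 16
q = k = v = torch.randn(1, 1, seq_len, dim)  # (batch, heads, seq, head_dim)

# Causal mask used by decoder-only LLMs: token i attends only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional mask after the LLM2Vec modification: every token attends to every token.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

causal_out = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask)
bidirectional_out = F.scaled_dot_product_attention(q, k, v, attn_mask=bidirectional_mask)
```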

Next, the model is trained with masked next-token prediction (MNTP), an objective that combines next-token prediction with masked language modeling: a fraction of the input tokens is masked, and the model learns to predict each masked token from the hidden state at the preceding position, in keeping with its original next-token objective.
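The sketch below shows one way the masking and labels could be set up, assuming a Hugging Face-style causal LM whose built-in loss scores the logits at position i-1 against the label at position i. The mask probability and the choice of mask token (decoder tokenizers usually have no dedicated one) are illustrative assumptions, not the paper's exact settings.

```python
import torch

def mntp_mask(input_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Illustrative masking for masked next-token prediction (MNTP).

    A random subset of tokens is replaced by a mask token, and labels are set
    only at the masked positions. With the standard causal-LM loss (logits at
    position i-1 are scored against the label at position i), each masked token
    is therefore predicted from the preceding position's hidden state, keeping
    the objective close to the model's next-token pre-training.
    """
    labels = torch.full_like(input_ids, ignore_index)  # ignore_index positions are skipped by the loss
    mask = torch.rand(input_ids.shape) < mask_prob
    mask[:, 0] = False                                 # nothing precedes position 0

    labels[mask] = input_ids[mask]                     # targets only at masked positions
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id
    return masked_ids, labels

# Usage sketch (hypothetical batch/tokenizer, bidirectional attention already enabled):
# masked_ids, labels = mntp_mask(batch["input_ids"], mask_token_id=some_mask_id)
# loss = model(input_ids=masked_ids, labels=labels).loss
```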

Finally, the model is fine-tuned with SimCSE, an unsupervised contrastive learning technique for sentence embeddings. In this step, each training sequence is passed through the model twice, with dropout producing a different random mask on each pass. This results in two different representations of the same sequence. The contrastive learning objective forces the model to maximize the similarity between the embeddings of these two versions of the same sequence and minimize their similarity to the representations of other sequences in the batch. This step can be applied to any collection of sequences, which considerably reduces the effort required to gather training data.
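The objective is the standard unsupervised SimCSE loss, an InfoNCE-style loss with in-batch negatives. A minimal sketch, assuming mean-pooled sentence embeddings and an illustrative temperature value:

```python
import torch
import torch.nn.functional as F

def simcse_loss(z1, z2, temperature=0.05):
    """Unsupervised SimCSE (InfoNCE) loss with in-batch negatives.

    z1, z2: (batch, dim) embeddings of the *same* batch of sequences, obtained
    from two forward passes with different dropout masks. For each row i,
    (z1[i], z2[i]) is a positive pair; every other z2[j] acts as a negative.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temperature                       # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, targets)                # diagonal entries are the positives

# Usage sketch: embed the same batch twice (dropout yields a different mask each pass),
# e.g. z1 = pool(model(batch)), z2 = pool(model(batch)), then loss = simcse_loss(z1, z2).
```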

“LLM2Vec is a general approach that is applicable to any decoder-only LLM,” BehnamGhader said. “Given the ubiquity of decoder-only LLMs in our field, we believe that it is important to be able to convert them into encoders which provides a compute-efficient alternative to training encoders from scratch.”

Many LLM applications use retrieval-augmented generation (RAG), which requires embedding passages and documents. With LLM2Vec, the same decoder-only LLM can be the backbone for both embedding and generation tasks. This reduces infrastructure needs and aligns the embedding and generation models.
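As an illustration of this dual use, the sketch below loads a single decoder-only checkpoint (Mistral-7B-Instruct, one of the models used in the paper) and uses it both to generate text and to produce an embedding by mean-pooling its last hidden states. The pooling choice and the code itself are illustrative assumptions; they do not apply the LLM2Vec transformation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# One of the checkpoints used in the paper; any decoder-only LLM works the same way.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Generation: the usual causal-LM path.
prompt = tokenizer("Summarize the retrieved passages:", return_tensors="pt")
generated = model.generate(**prompt, max_new_tokens=64)

# Embedding: reuse the same backbone and mean-pool the last hidden states.
# (The pooling choice here is an illustrative assumption, not the paper's exact recipe.)
inputs = tokenizer("A passage to index for retrieval.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
embedding = hidden.mean(dim=1)  # (1, hidden_dim) vector for similarity search
```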

“We believe that text embedding tasks such as retrieval can greatly benefit from the capabilities of decoder-only LLMs such as their instruction-following behavior,” BehnamGhader said. 

LLM2Vec in action

The researchers applied LLM2Vec to three decoder-only LLMs ranging from 1.3 billion to 7 billion parameters and evaluated the resulting models on word- and sequence-level tasks.

They performed both the MNTP and the unsupervised contrastive learning steps with data from English Wikipedia, since it was included in the pre-training mixture of the models they experimented with. This ensured that the two adaptation steps did not teach the model any new knowledge beyond how to attend to future tokens and how to construct sequence representations.

They used low-rank adaptation (LoRA) to reduce the cost of training the models. Interestingly, the entire process of fine-tuning a 7-billion-parameter model with LLM2Vec took no more than four hours on an 80GB A100 GPU, which puts the cost at under $10. And because the training data is drawn from the corpus used to train the decoder model, the entire process is fast and efficient.
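With LoRA, only small low-rank adapter matrices are trained while the base weights stay frozen, which is what keeps the adaptation this cheap. A minimal sketch using the Hugging Face peft library; the rank, target modules, and other hyperparameters here are illustrative assumptions rather than the paper's exact configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Illustrative LoRA setup: only small low-rank adapter matrices are trained
# while the 7B base weights stay frozen.
lora_config = LoraConfig(
    r=16,                    # adapter rank (illustrative, not the paper's setting)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model
```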

Their experiments show that LLM2Vec models are especially good at sequence-level tasks. On the Massive Text Embedding Benchmark (MTEB), LLM2Vec-transformed models set a new state of the art for unsupervised models.
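MTEB evaluates any model that exposes an encode() method over dozens of embedding tasks. A rough sketch of how such an evaluation is typically wired up with the mteb package; the wrapper class, its embed() call, and the chosen tasks are hypothetical placeholders.

```python
import numpy as np
from mteb import MTEB

class Embedder:
    """MTEB accepts any object with an encode() method returning one vector per sentence."""

    def __init__(self, model):
        self.model = model  # e.g., an LLM2Vec-transformed decoder

    def encode(self, sentences, **kwargs):
        # model.embed() is a hypothetical stand-in for whatever produces the embeddings.
        return np.stack([self.model.embed(s) for s in sentences])

# Usage sketch (my_llm2vec_model is a placeholder for the adapted model):
# evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
# results = evaluation.run(Embedder(my_llm2vec_model), output_folder="results")
```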

Since most encoder-only models are a fraction of the size of decoder-only LLMs, the researchers also tested LLM2Vec on small models.

“For sure the size of the decoder-only model you are going to convert with LLM2Vec has an effect on its performance,” BehnamGhader said. “To show that LLM2Vec does not only work for large models, we also applied it to a relatively small-scale decoder-only model.”

Their experiments on the 1.3-billion-parameter Sheared-LLaMA model show that LLM2Vec is also effective at smaller scales. The researchers have open-sourced the code for LLM2Vec and released the models they trained for their experiments.

“LLM2Vec is a first step in the direction of making better use of these models for text embedding tasks,” BehnamGhader said. “We are particularly excited about applying our approach to low-resource settings and using it to create encoder models for languages for which we don’t have sufficient training data to train models from scratch.”
