Train your LLMs to choose between RAG and internal memory automatically

Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

Retrieval-augmented generation (RAG) pipelines enable large language models (LLMs) to use external sources of information in their responses. But RAG applications retrieve external information for every request sent to the LLM, which makes the process inefficient because the model already contains plenty of knowledge that it can use without retrieval.

What if we could configure LLMs to only use RAG when their internal knowledge does not suffice? Adapt-LLM, a technique developed by researchers at the University of Bozen-Bolzano and Fondazione Bruno Kessler, trains LLMs to dynamically determine whether they need to retrieve additional context information in question-answering tasks. Adapt-LLM can help avoid unnecessary retrieval and make LLM applications more efficient.

Memory vs retrieval

There are two main methods for LLMs to answer questions. The first is relying on the parametric memory obtained during training. The limitation of parametric memory is that it is entirely based on the training corpus. You can improve the performance of parametric memory through fine-tuning or few-shot prompting techniques that focus the model’s attention on relevant parameters. But these techniques are not useful in scenarios where the model must dynamically use new information, such as recent news or private information not included in the training corpus.

The second category uses an information retriever to provide contextual information to the model. Retrieval-augmented generation falls into this category.

The problem with information retrieval is that sometimes the model doesn’t need additional context information and has enough internal knowledge to answer the question. The two methods can be compared to closed-book and open-book question-answering.

Humans use a hybrid approach. For example, when we know the answer to a question by heart, we can answer it immediately. But when we are not confident about our knowledge, we use an external source. Some LLM techniques use this hybrid approach through popularity scores. The assumption is that when the question is very popular, the model has the internal knowledge to respond. For less popular questions, the model will need the help of a RAG system to obtain the necessary information.

However, this approach requires the questions to have a popularity score attached to them, which is not always available.  
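As a rough illustration, a popularity-gated pipeline might look like the sketch below. The threshold, the prompt wording, and the `llm`/`retriever` interfaces are illustrative assumptions, not a specific published implementation.

```python
# A minimal sketch of popularity-based adaptive retrieval.
# The threshold and the llm/retriever interfaces are assumptions.

POPULARITY_THRESHOLD = 0.5

def answer_with_popularity_gate(question: str, popularity: float, llm, retriever) -> str:
    if popularity >= POPULARITY_THRESHOLD:
        # Popular question: assume the model saw it often enough during training.
        prompt = f"Answer the question.\nQuestion: {question}"
    else:
        # Rare question: fetch external context before answering.
        context = retriever.search(question)
        prompt = (
            "Answer the question using the context below.\n"
            f"Context: {context}\nQuestion: {question}"
        )
    return llm.generate(prompt)
```

The obvious weakness, as noted above, is that this gate only works when a popularity score exists for every incoming question.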

Adapt-LLM

The Adapt-LLM framework

Adapt-LLM trains language models for “adaptive retrieval,” enabling them to autonomously determine when to use an information retrieval system for additional context.

“In this approach, if the solution to the task is encoded in the parameters of the model, the model will be directly used for generating a solution. Conversely, if the answer is not encoded in the knowledge of the model, the answer generation will be augmented with external knowledge,” the researchers write.

Adapt-LLM works in four steps.

1) The first prompt containing the question is sent to the Adapt-LLM model.

2) The model evaluates the prompt to determine whether additional context is necessary to answer the question effectively.

3) If the model determines that it doesn’t require additional context, it directly responds from parametric memory.

4) If the Adapt-LLM model requires additional context, it returns a special token such as <RET>. The application can then use an information retriever to obtain context based on the question and combine it with the original prompt.

This flexible behavior allows the model to strike a balance between using external context and giving direct answers.
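A minimal sketch of this four-step flow in Python follows. The prompt wording and the `llm`/`retriever` interfaces are assumptions for illustration, not the authors' exact implementation.

```python
RET_TOKEN = "<RET>"

def adapt_llm_answer(question: str, llm, retriever) -> str:
    """Answer a question, retrieving external context only when the model asks for it."""
    # Steps 1-2: ask the model to answer directly or emit <RET> if it needs context.
    first_prompt = (
        "Answer the question if you are sure of the answer; "
        f"otherwise reply with {RET_TOKEN}.\nQuestion: {question}"
    )
    first_response = llm.generate(first_prompt)

    # Step 3: the model is confident, so its parametric answer is returned as-is.
    if RET_TOKEN not in first_response:
        return first_response

    # Step 4: the model asked for help; retrieve context and re-prompt with it.
    context = retriever.search(question)
    second_prompt = (
        "Answer the question using the context below.\n"
        f"Context: {context}\nQuestion: {question}"
    )
    return llm.generate(second_prompt)
```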

Training Adapt-LLM

To train a model for Adapt-LLM, you start with a dataset of tuples containing questions, contexts, and answers. Then, for each tuple, the model is given the question without the context and instructed to answer directly if it is confident in its knowledge or to return <RET> if it needs additional context.

The Adapt-LLM parametric prompt

If the model returns the correct answer, then it has the parametric knowledge, and a new training instance is created that contains the question and the answer (but not the context). If the model returns the wrong answer, two training instances are created: a “parametric prompt” that contains the question and the <RET> answer, and a “context prompt” that contains the question, context, instructions, and answer.

The Adapt-LLM context prompt
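The paper's exact prompt wording is not reproduced here, but the two prompt types roughly take the following shape. The templates below are illustrative assumptions, not the authors' templates.

```python
# Illustrative shapes of the two prompt types; the authors' exact wording may differ.

PARAMETRIC_PROMPT = (
    "Answer the question if you are sure of the answer; "
    "otherwise respond with <RET>.\n"
    "Question: {question}\n"
    "Answer:"
)

CONTEXT_PROMPT = (
    "Answer the question using the context provided.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer:"
)

# Example usage:
# PARAMETRIC_PROMPT.format(question="Who directed The Godfather?")
```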

The base model is then trained on the dataset containing both types of examples, which results in Adapt-LLM behavior.
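Here is a sketch of how such a training set could be assembled, assuming a dataset of (question, context, answer) tuples, a base_llm object that generates text from a prompt, and an answers_match helper that compares a prediction to the gold answer. All names and prompt wording are illustrative, not the authors' released code.

```python
def build_adapt_llm_dataset(tuples, base_llm, answers_match):
    """Create fine-tuning examples from (question, context, answer) tuples.

    base_llm and answers_match are assumed helpers: the first generates an
    answer from a prompt, the second checks it against the gold answer.
    """
    training_examples = []
    for question, context, answer in tuples:
        # Probe the base model without context.
        probe_prompt = (
            "Answer the question if you are sure of the answer; "
            f"otherwise respond with <RET>.\nQuestion: {question}\nAnswer:"
        )
        prediction = base_llm.generate(probe_prompt)

        if answers_match(prediction, answer):
            # The model already knows the answer: train it to answer directly.
            training_examples.append({"prompt": probe_prompt, "completion": answer})
        else:
            # The model lacks the knowledge: train it to ask for retrieval...
            training_examples.append({"prompt": probe_prompt, "completion": "<RET>"})
            # ...and to answer correctly once the context is supplied.
            context_prompt = (
                "Answer the question using the context provided.\n"
                f"Context: {context}\nQuestion: {question}\nAnswer:"
            )
            training_examples.append({"prompt": context_prompt, "completion": answer})
    return training_examples
```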

Adapt-LLM in action

The researchers conducted several experiments with Adapt-LLM on PopQA, a dataset of questions curated from various online platforms. They used Llama-2 7B as their base LLM and trained it on an Adapt-LLM dataset created from the NQ (Natural Questions) and SQuAD question-answering datasets. They compared the Adapt-LLM model against a never-retrieve model, which relies only on parametric memory, and an always-retrieve model, which retrieves context for every question.

Predictably, their findings show that Adapt-LLM performs much better than the never-retrieve model that only relies on parametric memory.

It also reduces the use of retrieval compared to the always-retrieve model, while improving performance in cases where its parametric memory is better than the information returned by the RAG system.

“When ADAPT-LLM decides to retrieve additional information, the results obtained with the context are significantly better than those without it. Similarly, when ADAPT-LLM directly answers questions relying on its parametric memory, it achieves high accuracies,” the researchers write. “These observations indicate that the model effectively discerns when to retrieve information and when it can answer a question without further context.”

The good and bad

Unfortunately, the researchers did not release the code and models for Adapt-LLM, which makes it difficult to verify the results of their experiments. Since this is a very practical technique, it would have been good if they had released findings on token usage and inference time. 

Fortunately, the algorithm is easy to implement and anyone can create their own version of Adapt-LLM. It will be interesting to see how it performs with datasets from other domains and what practical applications can be built on top of it.
