How to customize LLMs for low-frequency topics

Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

There is a lot of excitement around customizing large language models (LLMs) for enterprise applications. There are various ways to customize LLMs, including retrieval-augmented generation (RAG) and fine-tuning (FT).

But with the field still in its infancy, there is no universal recipe to make LLMs work for all applications. The results can vary depending on the application and its similarity to the data used to train the model.

A new study by researchers at Radboud University and the University of Amsterdam explores the effects of RAG and fine-tuning on LLM applications when your data is not present in the model’s training examples. Their findings can provide useful guidelines for enterprises working on very specialized LLM applications.

LLMs for unpopular topics

LLMs can memorize a lot of knowledge during training and perform well on tasks that are related to the information contained in their training data. But when used for very specialized applications that deal with concepts that were absent or scarce in the training data, the models’ performance diminishes. This often happens in enterprise settings, where a company wants to use the model with its proprietary data. The main solutions to enhance the performance of models for domain-specific applications are fine-tuning (FT) and retrieval-augmented generation (RAG).

In their study, the researchers explore the effects of fine-tuning and RAG on “less-popular or low-frequency concepts and entities.”

“The primary motivation behind our research was to address real-world applications,” Heydar Soudani, Doctoral Researcher at Radboud University and lead author of the paper, told TechTalks. “In particular, we focus on the scenario in which employees/customers need to access knowledge lying in a company’s intranet and was not accessible by the LLM during its training. In such a situation the LLM has a hard time understanding and answering questions related to information in the intranet.” 

RAG vs fine-tuning

The researchers carried out experiments on the base, small, and large versions of Google’s Flan-T5 model on PopQA, a dataset containing 14,000 question-answer pairs. PopQA is an open-domain dataset, which means it covers QA pairs for a wide range of topics. Each QA pair also contains references to relevant Wikipedia pages. The researchers retrieved the pageviews of the Wikipedia pages corresponding to each QA pair to measure the topic’s popularity.
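For teams that want to approximate this kind of popularity measurement on their own data, the sketch below shows one way to pull pageview counts from the public Wikimedia Pageviews API. The time window, aggregation, and helper names are illustrative assumptions, not the paper’s exact procedure.

```python
# Sketch: approximate an entity's "popularity" by its Wikipedia pageviews,
# using the public Wikimedia Pageviews REST API. The time window and
# aggregation are illustrative stand-ins, not the paper's exact setup.
import requests

PAGEVIEWS_URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia/all-access/user/{article}/monthly/{start}/{end}"
)

def yearly_pageviews(article_title: str, start: str = "2023010100", end: str = "2023123100") -> int:
    """Return total pageviews for a Wikipedia article over the given window."""
    url = PAGEVIEWS_URL.format(article=article_title.replace(" ", "_"), start=start, end=end)
    resp = requests.get(url, headers={"User-Agent": "popularity-probe/0.1"}, timeout=30)
    resp.raise_for_status()
    return sum(item["views"] for item in resp.json().get("items", []))

# Example: compare a very popular entity with a less popular one
# print(yearly_pageviews("Albert Einstein"), yearly_pageviews("Radboud University"))
```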

“We simulated such a scenario by selectively sampling Wikipedia pages about less popular concepts/entities, though we also included pages on more popular entities for a comprehensive comparison,” Soudani said. “One can think of these Wikipedia pages as information in the intranet of a company.”

They tested three approaches: RAG, FT, and the combination of RAG and FT (RAG+FT).

To measure the effectiveness of RAG, for each QA pair, they used a minimalist, zero-shot template that includes the retrieved context plus the question. The model’s response is considered accurate if it contains a substring that exactly matches one of the correct answers.
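As a rough illustration, the sketch below shows what this kind of zero-shot prompt and exact-substring-match check could look like in code. The template wording and helper names are assumptions, not the exact ones used in the study.

```python
# Sketch of the evaluation described above: a minimal zero-shot prompt that
# prepends retrieved context to the question, scored by exact substring match.
# The template wording is illustrative, not the paper's exact template.

def build_prompt(context: str, question: str) -> str:
    return f"Context: {context}\n\nQuestion: {question}\nAnswer:"

def is_correct(model_output: str, gold_answers: list[str]) -> bool:
    # Accurate if the response contains a substring that exactly matches
    # one of the accepted answers (case-normalized here for robustness).
    output = model_output.lower()
    return any(answer.lower() in output for answer in gold_answers)

# Example
prompt = build_prompt("George Orwell wrote the novel 1984.", "Who is the author of 1984?")
print(is_correct("The author is George Orwell.", ["George Orwell", "Eric Blair"]))  # True
```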

For fine-tuning, the researchers tested different data generation methods, including training a model to generate QA pairs and using prompting techniques on pre-trained models. They also tested different training methods, including full fine-tuning and parameter-efficient fine-tuning (PEFT).

How different RAG and FT techniques perform on unpopular topics (source: arXiv)

The key findings are as follows:

– All models performed very poorly on less popular topics.

– RAG accounted for the biggest gain in performance and is much more effective than FT alone.

– FT specifically boosts performance for both the most and least popular entities.

– As expected, RAG+FT outperforms either RAG or FT alone.

– In RAG+FT, PEFT fine-tuning yields better results than full fine-tuning. “This suggests that PEFT enables the LLM to maintain its reasoning abilities based on the provided prompts,” the researchers write.

– The performance of RAG depends on the retrieval mechanism. Their experiments show that dense passage retrieval (DPR) outperforms other methods.

– In most fine-tuning experiments, prompt-based QA generation provided better results than training a model for generating examples.

“Our experiments, along with insights from related papers, indicate that the size of the models does not fundamentally alter our core findings,” Soudani said.

How to apply the findings to your company’s data

Based on the paper’s findings, here are a few steps to consider when creating LLM applications based on your company’s data.

1- Establish the baseline performance of the model on your company’s domain: Create a dataset with a few hundred QA pairs that capture the knowledge domain of your company. Test the model(s) on the dataset to determine the baseline performance. If your knowledge is very specialized, you can expect the model’s performance to be very poor.
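A minimal sketch of such a baseline measurement is shown below, assuming your QA pairs live in a JSONL file and that generate() is a placeholder wrapper around whichever model you are testing; both names are hypothetical.

```python
# Sketch: measure a model's baseline accuracy on an in-house QA set.
# Assumes qa_pairs.jsonl with one {"question": ..., "answers": [...]} object per
# line, and a generate(prompt) wrapper around the model under test
# (both are placeholders, not part of the paper).
import json

def baseline_accuracy(path: str, generate) -> float:
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            output = generate(f"Question: {item['question']}\nAnswer:").lower()
            correct += any(ans.lower() in output for ans in item["answers"])
            total += 1
    return correct / max(total, 1)

# Example: baseline_accuracy("qa_pairs.jsonl", generate=my_model_call)
```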

2- Determine the desired accuracy: Based on your test dataset, determine what level of accuracy would be acceptable for your application. This is a very important step because it can save you time down the pipeline by obviating the need to take extra steps to improve performance.

3- Start with RAG: RAG is much easier to implement than FT and results in the highest gain in performance. If your company’s knowledge base is not ready for integration into a RAG pipeline, create an experimental setup with a subset of your documents. Experiment with different retrieval mechanisms to find the one that shows the best performance on your data. 
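As a starting point, the sketch below shows a bare-bones dense retrieval setup built on the sentence-transformers library. The embedding model, example documents, and top-k value are illustrative choices, not recommendations from the paper.

```python
# Minimal RAG retrieval sketch using dense embeddings and cosine similarity.
# The embedding model and top-k value are illustrative choices; the paper
# compared dedicated retrievers such as DPR.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Policy doc: employees accrue 25 vacation days per year.",
    "IT doc: VPN access requires a hardware token.",
    "HR doc: parental leave is 16 weeks, fully paid.",
]
doc_embeddings = encoder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    q = encoder.encode([question], normalize_embeddings=True)
    scores = np.dot(doc_embeddings, q[0])      # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

context = "\n".join(retrieve("How long is parental leave?"))
prompt = f"Context: {context}\n\nQuestion: How long is parental leave?\nAnswer:"
```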

4- Use advanced RAG: If simple RAG does not reach the necessary accuracy, try more advanced techniques to improve performance. One very effective approach is to use out-of-the-box reranking tools, which can considerably boost the performance of your RAG pipeline at very low cost. (For more, see our guide on RAG optimization.)
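The sketch below shows one way to bolt a reranking step onto the retriever above, using an off-the-shelf cross-encoder. The specific model name is an illustrative assumption; any reranker with a similar interface would work.

```python
# Sketch: rerank the retriever's candidates with an off-the-shelf cross-encoder.
# The model name is an illustrative choice.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[str], k: int = 3) -> list[str]:
    # Score each (question, passage) pair jointly, then keep the top-k passages.
    scores = reranker.predict([(question, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:k]]

# Example: retrieve a larger candidate pool first, then rerank it.
# top_passages = rerank(question, retrieve(question, k=20), k=3)
```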

5- Create a fine-tuning dataset: If RAG does not produce the level of accuracy that you require, consider fine-tuning your model. Use a strong model to create QA pairs for your company data. For this, you can use frontier models like GPT-4 and Claude 3 or you can use open-weight models like Mixtral and Llama 2-70B. Here is an example of a prompt template for creating QA training examples.

Prompt for generating QA pairs
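As a rough sketch, a QA-generation prompt could look like the following; the wording, the five-pair count, and the example document are illustrative assumptions, not the article’s exact template.

```python
# A sketch of a QA-generation prompt template (illustrative, not the article's
# exact prompt). Fill {document} with a chunk of your company's text and parse
# the numbered output into question/answer pairs for fine-tuning.
QA_GENERATION_TEMPLATE = """You are creating training data for a question-answering model.
Read the following document and write 5 question-answer pairs that can be
answered using only the information in the document.

Document:
{document}

Return the pairs in this format:
1. Q: <question>
   A: <answer>
"""

prompt = QA_GENERATION_TEMPLATE.format(
    document="Acme's VPN requires a hardware token issued by the IT department."
)
```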

6- Use parameter-efficient fine-tuning: According to the paper, PEFT yields better results than full fine-tuning when combined with RAG, and it is also far less resource-intensive. By combining RAG and FT, you can boost your application’s performance by a few percentage points.
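The sketch below shows what a LoRA-based PEFT setup could look like on a Flan-T5 checkpoint, using Hugging Face’s peft library. The hyperparameters are illustrative defaults, not the values used in the paper.

```python
# Minimal parameter-efficient fine-tuning (LoRA) sketch with Hugging Face peft,
# applied to a Flan-T5 checkpoint. The LoRA hyperparameters below are
# illustrative defaults, not the paper's settings.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections in T5
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trained

# Train on your generated QA pairs with the usual Seq2SeqTrainer loop.
```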
