How to customize LLMs for low-frequency topics

Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

There is a lot of excitement around customizing large language models (LLMs) for enterprise applications. There are various ways to customize LLMs, including retrieval-augmented generation (RAG) and fine-tuning (FT).

But with the field still in its infancy, there is no universal recipe to make LLMs work for all applications. The results can vary depending on the application and its similarity to the data used to train the model.

A new study by researchers at Radboud University and the University of Amsterdam explores the effects of RAG and fine-tuning on LLM applications when your data is not present in the model’s training examples. Their findings can provide useful guidelines for enterprises working on very specialized LLM applications.

LLMs for unpopular topics

LLMs can memorize a lot of knowledge during training and perform well on tasks that are related to the information contained in their training data. But when used for very specialized applications that deal with concepts that were absent or scarce in the training data, the models’ performance diminishes. This often happens in enterprise settings, where a company wants to use the model with its proprietary data. The main solutions to enhance the performance of models for domain-specific applications are fine-tuning (FT) and retrieval-augmented generation (RAG).

In their study, the researchers explore the effects of fine-tuning and RAG on “less-popular or low-frequency concepts and entities.”

“The primary motivation behind our research was to address real-world applications,” Heydar Soudani, Doctoral Researcher at Radboud University and lead author of the paper, told TechTalks. “In particular, we focus on the scenario in which employees/customers need to access knowledge lying in a company’s intranet and was not accessible by the LLM during its training. In such a situation the LLM has a hard time understanding and answering questions related to information in the intranet.” 

RAG vs fine-tuning

The researchers carried out experiments on the base, small, and large versions of Google’s Flan-T5 model on PopQA, a dataset containing 14,000 question-answer pairs. PopQA is an open-domain dataset, which means it covers QA pairs for a wide range of topics. Each QA pair also contains references to relevant Wikipedia pages. The researchers retrieved the pageviews of the Wikipedia pages corresponding to each QA pair to measure the topic’s popularity.
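For teams that want to approximate this kind of popularity measurement on their own data, the sketch below shows one way to pull pageview counts from the public Wikimedia Pageviews API. The time window, aggregation, and helper names are illustrative assumptions, not the paper’s exact procedure.

```python
# Sketch: approximate an entity's "popularity" by its Wikipedia pageviews,
# using the public Wikimedia Pageviews REST API. The time window and
# aggregation are illustrative stand-ins, not the paper's exact setup.
import requests

PAGEVIEWS_URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia/all-access/user/{article}/monthly/{start}/{end}"
)

def yearly_pageviews(article_title: str, start: str = "2023010100", end: str = "2023123100") -> int:
    """Return total pageviews for a Wikipedia article over the given window."""
    url = PAGEVIEWS_URL.format(article=article_title.replace(" ", "_"), start=start, end=end)
    resp = requests.get(url, headers={"User-Agent": "popularity-probe/0.1"}, timeout=30)
    resp.raise_for_status()
    return sum(item["views"] for item in resp.json().get("items", []))

# Example: compare a very popular entity with a less popular one
# print(yearly_pageviews("Albert Einstein"), yearly_pageviews("Radboud University"))
```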

“We simulated such a scenario by selectively sampling Wikipedia pages about less popular concepts/entities, though we also included pages on more popular entities for a comprehensive comparison,” Soudani said. “One can think of these Wikipedia pages as information in the intranet of a company.”

They tested three approaches: RAG, FT, and the combination of RAG and FT (RAG+FT).

To measure the effectiveness of RAG, for each QA pair, they used a minimalist, zero-shot template that includes the retrieved context plus the question. The model’s response is considered accurate if it contains a substring that exactly matches one of the correct answers.
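As a rough illustration, the sketch below shows what this kind of zero-shot prompt and exact-substring-match check could look like in code. The template wording and helper names are assumptions, not the exact ones used in the study.

```python
# Sketch of the evaluation described above: a minimal zero-shot prompt that
# prepends retrieved context to the question, scored by exact substring match.
# The template wording is illustrative, not the paper's exact template.

def build_prompt(context: str, question: str) -> str:
    return f"Context: {context}\n\nQuestion: {question}\nAnswer:"

def is_correct(model_output: str, gold_answers: list[str]) -> bool:
    # Accurate if the response contains a substring that exactly matches
    # one of the accepted answers (case-normalized here for robustness).
    output = model_output.lower()
    return any(answer.lower() in output for answer in gold_answers)

# Example
prompt = build_prompt("George Orwell wrote the novel 1984.", "Who is the author of 1984?")
print(is_correct("The author is George Orwell.", ["George Orwell", "Eric Blair"]))  # True
```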

For fine-tuning, the researchers tested different data generation methods, including training a model to generate QA pairs and using prompting techniques on pre-trained models. They also tested different training methods, including full fine-tuning and parameter-efficient fine-tuning (PEFT).

How different RAG and FT techniques perform on unpopular topics (source: arXiv)

The key findings are as follows:

– All models performed very poorly on less popular topics.

– RAG accounted for the biggest gain in performance and is much more effective than FT alone.

– FT specifically boosts performance for both the most and least popular entities.

– As expected, RAG+FT outperforms either RAG or FT alone.

– In RAG+FT, PEFT fine-tuning yields better results than full fine-tuning. “This suggests that PEFT enables the LLM to maintain its reasoning abilities based on the provided prompts,” the researchers write.

– The performance of RAG depends on the retrieval mechanism. Their experiments show that dense passage retrieval (DPR) outperforms other methods.

– In most fine-tuning experiments, prompt-based QA generation provided better results than training a model for generating examples.

“Our experiments, along with insights from related papers, indicate that the size of the models does not fundamentally alter our core findings,” Soudani said.

How to apply the findings to your company’s data

Based on the paper’s findings, here are a few steps to consider when creating LLM applications based on your company’s data.

1- Establish the baseline performance of the model on your company’s domain: Create a dataset with a few hundred QA pairs that capture the knowledge domain of your company. Test the model(s) on the dataset to determine the baseline performance. If your knowledge is very specialized, you can expect the model’s performance to be very poor.
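A minimal sketch of such a baseline measurement is shown below, assuming your QA pairs live in a JSONL file and that generate() is a placeholder wrapper around whichever model you are testing; both names are hypothetical.

```python
# Sketch: measure a model's baseline accuracy on an in-house QA set.
# Assumes qa_pairs.jsonl with one {"question": ..., "answers": [...]} object per
# line, and a generate(prompt) wrapper around the model under test
# (both are placeholders, not part of the paper).
import json

def baseline_accuracy(path: str, generate) -> float:
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            output = generate(f"Question: {item['question']}\nAnswer:").lower()
            correct += any(ans.lower() in output for ans in item["answers"])
            total += 1
    return correct / max(total, 1)

# Example: baseline_accuracy("qa_pairs.jsonl", generate=my_model_call)
```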

2- Determine the desired accuracy: Based on your test dataset, determine what level of accuracy would be acceptable for your application. This is a very important step because it can save you time down the pipeline by obviating the need to take extra steps to improve performance.

3- Start with RAG: RAG is much easier to implement than FT and results in the highest gain in performance. If your company’s knowledge base is not ready for integration into a RAG pipeline, create an experimental setup with a subset of your documents. Experiment with different retrieval mechanisms to find the one that shows the best performance on your data. 
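As a starting point, the sketch below shows a bare-bones dense retrieval setup built on the sentence-transformers library. The embedding model, example documents, and top-k value are illustrative choices, not recommendations from the paper.

```python
# Minimal RAG retrieval sketch using dense embeddings and cosine similarity.
# The embedding model and top-k value are illustrative choices; the paper
# compared dedicated retrievers such as DPR.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Policy doc: employees accrue 25 vacation days per year.",
    "IT doc: VPN access requires a hardware token.",
    "HR doc: parental leave is 16 weeks, fully paid.",
]
doc_embeddings = encoder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    q = encoder.encode([question], normalize_embeddings=True)
    scores = np.dot(doc_embeddings, q[0])      # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

context = "\n".join(retrieve("How long is parental leave?"))
prompt = f"Context: {context}\n\nQuestion: How long is parental leave?\nAnswer:"
```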

4- Use advanced RAG: If simple RAG does not reach the necessary accuracy, try more advanced techniques to improve performance. One very effective approach is to use out-of-the-box reranking tools, which can considerably boost the performance of your RAG pipeline at very low cost. (For more, see our guide on RAG optimization.)
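The sketch below shows one way to bolt a reranking step onto the retriever above, using an off-the-shelf cross-encoder. The specific model name is an illustrative assumption; any reranker with a similar interface would work.

```python
# Sketch: rerank the retriever's candidates with an off-the-shelf cross-encoder.
# The model name is an illustrative choice.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[str], k: int = 3) -> list[str]:
    # Score each (question, passage) pair jointly, then keep the top-k passages.
    scores = reranker.predict([(question, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:k]]

# Example: retrieve a larger candidate pool first, then rerank it.
# top_passages = rerank(question, retrieve(question, k=20), k=3)
```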

5- Create a fine-tuning dataset: If RAG does not produce the level of accuracy that you require, consider fine-tuning your model. Use a strong model to create QA pairs for your company data. For this, you can use frontier models like GPT-4 and Claude 3 or you can use open-weight models like Mixtral and Llama 2-70B. Here is an example of a prompt template for creating QA training examples.

Prompt for generating QA pairs
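As a rough sketch, a QA-generation prompt could look like the following; the wording, the five-pair count, and the example document are illustrative assumptions, not the article’s exact template.

```python
# A sketch of a QA-generation prompt template (illustrative, not the article's
# exact prompt). Fill {document} with a chunk of your company's text and parse
# the numbered output into question/answer pairs for fine-tuning.
QA_GENERATION_TEMPLATE = """You are creating training data for a question-answering model.
Read the following document and write 5 question-answer pairs that can be
answered using only the information in the document.

Document:
{document}

Return the pairs in this format:
1. Q: <question>
   A: <answer>
"""

prompt = QA_GENERATION_TEMPLATE.format(
    document="Acme's VPN requires a hardware token issued by the IT department."
)
```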

6- Use parameter-efficient fine-tuning: According to the paper, PEFT yields better results than full fine-tuning when combined with RAG, and it is also far less resource-intensive. By combining RAG and FT, you can boost your application’s performance by a few percentage points.
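The sketch below shows what a LoRA-based PEFT setup could look like on a Flan-T5 checkpoint, using Hugging Face’s peft library. The hyperparameters are illustrative defaults, not the values used in the paper.

```python
# Minimal parameter-efficient fine-tuning (LoRA) sketch with Hugging Face peft,
# applied to a Flan-T5 checkpoint. The LoRA hyperparameters below are
# illustrative defaults, not the paper's settings.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections in T5
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trained

# Train on your generated QA pairs with the usual Seq2SeqTrainer loop.
```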
