How to ensure your LLM RAG pipeline retrieves the right documents


Retrieval-augmented generation (RAG) is a pivotal application for large language models (LLMs) in enterprise settings. RAG enhances the capabilities of LLMs by integrating contextual data relevant to specific prompts. This process allows LLMs to tailor their responses, drawing from specialized resources like proprietary corporate materials or detailed technical manuals—content beyond their initial training scope.

However, the efficacy of RAG hinges on the precision of its retrieval phase. If the system fails to find the most pertinent documents, the quality of the generated response will suffer. In this guide, we’ll explore practical techniques to refine your RAG pipeline to ensure your language model has the most relevant information to improve the quality of its output.

How the RAG pipeline works

The retrieval-augmented generation pipeline is composed of several components, including a vector database, a document store, an embedding model, and the main language model. Its core function begins with the embedding model, which translates documents into numerical representations that are stored in the vector database. 

When a user sends a prompt, the embedding model encodes it and sends it to the vector database. The vector DB compares the prompt’s embedding against the database’s document embeddings to identify those with the highest similarity.

Once the most relevant documents are identified, their content is appended to the original prompt as context and sent to the main LLM. The LLM uses this contextual information to generate its response.

The critical step in this process is retrieving documents that are truly relevant to the prompt, and it relies on the precision of the semantic representations captured in the embeddings. If the embeddings fail to align prompts with the right documents, the vector database may return irrelevant documents, leading to subpar responses.

Using embeddings and a vector database to retrieve relevant documents
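To make the retrieval step concrete, here is a minimal sketch in Python using the open-source sentence-transformers library, with a plain in-memory array standing in for a dedicated vector database. The model name and sample documents are placeholders for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any embedding model works here; this small open-source one is just an example.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# A toy document store; in a real pipeline these would be chunks of your corpus.
documents = [
    "Our warranty covers manufacturing defects for 24 months.",
    "The device supports Bluetooth 5.0 and Wi-Fi 6.",
    "Refunds are processed within 5-7 business days.",
]

# Encode the documents once and keep the normalized embeddings as the "vector database."
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

def retrieve(prompt: str, top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the prompt."""
    query_embedding = embedder.encode([prompt], normalize_embeddings=True)[0]
    # With normalized vectors, cosine similarity reduces to a dot product.
    scores = doc_embeddings @ query_embedding
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

prompt = "How long is the warranty?"
context = "\n".join(retrieve(prompt))
augmented_prompt = f"Context:\n{context}\n\nQuestion: {prompt}"
# augmented_prompt is what gets sent to the main LLM.
```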

To delve deeper into this subject, read our comprehensive guide on augmented language models and our starter guide on RAG. Now, here are several strategies to optimize your RAG pipeline to ensure it retrieves the most pertinent documents for the user’s prompt.

Choose the right chunk size

LLMs have a limited context window, typically spanning a few thousand tokens. This presents a challenge when working with extensive documents such as books or lengthy articles that exceed the LLM’s capacity to process in a single go. To work around this limitation, you must segment your documents into smaller, more manageable chunks before adding them to the vector database for use within the RAG pipeline.

Determining the optimal chunk size is crucial for the RAG pipeline’s effectiveness. With small chunk sizes, the RAG system can fit more context chunks within the model’s context window, potentially accelerating the retrieval process. However, very small chunks may lack the information necessary to answer the user’s prompt.

Larger chunks ensure a richer context, providing more substantial information for the user’s query. But if the chunk is too large, it fills the LLM’s context window. This not only limits the number of documents that can be added to the prompt but also slows the model’s response. Furthermore, an excessively large chunk may dilute the specificity of its embedding, complicating the alignment with user prompts.

So, what constitutes an appropriate chunk size? While it largely hinges on the specific application, chunk sizes of 512 or 1,024 tokens generally strike a balance between context richness and operational efficiency. Implementing an overlap of approximately 100-200 tokens between consecutive chunks is also advisable. This overlap can prevent critical information from being split across two chunks, which would otherwise impede the RAG pipeline’s performance.
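As a rough illustration, the sketch below chunks a document with an overlap, using whitespace splitting as a stand-in for a real tokenizer. In practice you would more likely use the splitters built into LlamaIndex or LangChain and tune the sizes for your corpus; the file name here is a placeholder.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Split text into chunks of roughly chunk_size tokens with overlap between them.

    Whitespace splitting stands in for a real tokenizer; swap in tiktoken or a
    model-specific tokenizer if you need accurate token counts.
    """
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        chunks.append(" ".join(chunk))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Example: 512-token chunks with a 128-token overlap, so that sentences near a
# boundary appear in both neighboring chunks. "manual.txt" is a placeholder file.
chunks = chunk_text(open("manual.txt").read(), chunk_size=512, overlap=128)
```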

For those seeking empirical insights into chunk size optimization, Jerry Liu of LlamaIndex provides a thorough examination in this blog post. Liu has also published a Google Colab notebook that offers a practical framework for testing various chunk sizes tailored to your unique application. 

Rerank the retrieved documents

The vector database is the linchpin that ranks and retrieves documents based on similarity scores with the user’s prompt. Typically, a RAG pipeline fetches a set of top-ranked documents from the vector database in response to a user’s query. However, as previously discussed, these documents may not always align closely with the query’s intent.

To mitigate this discrepancy, a secondary model can be used to rerank the documents retrieved by the vector database based on the user’s prompt. When the original RAG pipeline’s output is less than ideal, reranking can enhance its precision significantly. 

While reranking models may not match the speed and efficiency of a vector database, they compensate with heightened accuracy. The fast retrieval capabilities of the vector database and the accuracy of a reranking model can be a killer combo.

In this two-tiered approach, the vector database first shortlists a batch of potentially relevant documents. Subsequently, the reranking model evaluates this subset, selecting the top-k documents that best align with the user’s query. For example, if your RAG pipeline is designed to fetch the top three documents as context, you can configure the vector database to retrieve six documents, re-evaluate them with the reranking model, and send the new top three documents as context to the LLM.
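One way to sketch this two-tiered approach is with an open-source cross-encoder from sentence-transformers standing in for the reranking model (the hosted Cohere reranker discussed below plays the same role). The model name is an example, and the commented-out vector_db_retrieve call is a hypothetical placeholder for your existing retrieval step.

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores (query, document) pairs jointly; slower than a
# vector lookup, but usually more accurate.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(prompt: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Rerank the candidates shortlisted by the vector database and keep the top_k."""
    scores = reranker.predict([(prompt, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

# Stage 1: ask the vector database for more documents than you need (e.g. six).
# Stage 2: keep only the top three after reranking and send them to the LLM.
# candidates = vector_db_retrieve(prompt, top_k=6)   # your existing retrieval step
# context_docs = rerank(prompt, candidates, top_k=3)
```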

Cohere has a rerank feature that can enhance documents retrieved from your vector store (source: Cohere blog)

An added benefit of reranking models is their plug-and-play nature. They can be integrated into existing RAG pipelines without the need to recompute the document embeddings stored in the vector database.

Cohere provides a very useful reranking feature that can enhance your RAG pipelines. It is compatible with widely-used libraries such as LangChain and LlamaIndex. Moreover, Cohere has recently introduced options for fine-tuning its rerank model, allowing for even greater optimization tailored to specific use cases. 
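If you use Cohere’s hosted reranker directly, the call looks roughly like the sketch below. This is based on the Cohere Python SDK at the time of writing; the model name and response fields may differ in newer SDK versions, so check the current documentation. The API key, query, and candidate documents are placeholders.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

query = "How long is the warranty?"
# Candidates shortlisted by the vector database for this query (placeholders).
candidates = [
    "Our warranty covers manufacturing defects for 24 months.",
    "Refunds are processed within 5-7 business days.",
    "The device supports Bluetooth 5.0 and Wi-Fi 6.",
]

response = co.rerank(
    model="rerank-english-v2.0",  # model name at the time of writing
    query=query,
    documents=candidates,
    top_n=2,
)

# Each result points back to the original document by index, with a relevance score.
top_docs = [candidates[r.index] for r in response.results]
```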

Fine-tune embedding models

Another way to enhance your RAG pipeline’s precision is to fine-tune the embedding model to better suit your specific application. This method, while effective, is both costly and complex. It requires retraining the embedding model and recalculating the embeddings for all documents within your corpus. Furthermore, this option is often unavailable with closed-source services, which typically do not allow for model fine-tuning. In such cases, you may need to transition to an alternative embedding model or adopt an open-source solution like Sentence Transformers.
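If you do go down this road with an open-source model, the sketch below shows roughly what fine-tuning a Sentence Transformers model on (query, relevant passage) pairs looks like. The two training pairs are made-up placeholders; real fine-tuning needs thousands of pairs, and every document must be re-embedded with the new model afterward.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pairs of (user-style query, passage that should be retrieved for it).
# These examples are placeholders; real fine-tuning needs far more data.
train_examples = [
    InputExample(texts=["How long is the warranty?",
                        "Our warranty covers manufacturing defects for 24 months."]),
    InputExample(texts=["Does the device support Wi-Fi 6?",
                        "The device supports Bluetooth 5.0 and Wi-Fi 6."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# MultipleNegativesRankingLoss pulls each query toward its paired passage
# and pushes it away from the other passages in the batch.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("fine-tuned-embedder")

# After fine-tuning, re-embed the entire corpus with the new model before
# loading it back into the vector database.
```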

An alternative approach is to use a deep learning model that acts as an intermediary between the prompt embeddings and the document embeddings, learning to align them more effectively. This technique is used by Deep Memory, a feature provided by Activeloop, which specializes in vector databases for AI applications.

Activeloop’s Deep Memory feature improves the alignment of prompt and document embeddings (source: Activeloop documentation)

Activeloop’s DeepLake vector database enables you to train the Deep Memory component by feeding it pairs of prompts and the corresponding document vectors from the database. During inference, Deep Memory applies its learned transformation to the prompt embeddings, significantly enhancing retrieval accuracy.

The advantage of Deep Memory is that you do not need to recompute the embeddings of your existing vector store, making it a cost-effective solution. It’s also compatible with any embedding model, including those from closed-source services, adding to its versatility. LlamaIndex supports integration with both DeepLake and Deep Memory and has a comprehensive guide detailing its implementation.
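To show the general idea of such an intermediary model (this is purely a conceptual illustration, not Activeloop’s actual Deep Memory implementation), the PyTorch sketch below trains a small linear adapter so that transformed query embeddings land closer to the embeddings of their relevant documents, while the stored document embeddings stay untouched.

```python
import torch
import torch.nn.functional as F

class QueryAdapter(torch.nn.Module):
    """A small learned transformation applied to query embeddings at inference time.

    Conceptual illustration of an 'intermediary' model between prompt and
    document embeddings; not Activeloop's actual Deep Memory implementation.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.linear = torch.nn.Linear(dim, dim)

    def forward(self, query_embedding: torch.Tensor) -> torch.Tensor:
        return self.linear(query_embedding)

def train_adapter(query_embs: torch.Tensor, doc_embs: torch.Tensor, epochs: int = 100):
    """Train the adapter on (query embedding, relevant document embedding) pairs."""
    adapter = QueryAdapter(query_embs.shape[1])
    optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)
    for _ in range(epochs):
        optimizer.zero_grad()
        transformed = adapter(query_embs)
        # Maximize cosine similarity between transformed queries and their documents.
        loss = 1 - F.cosine_similarity(transformed, doc_embs).mean()
        loss.backward()
        optimizer.step()
    return adapter

# At query time, the adapter transforms the prompt embedding before the
# similarity search; the existing document embeddings are never recomputed.
```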
