
How to customize LLMs like ChatGPT with your own data and documents


Large language models (LLMs) like GPT-4 and ChatGPT can generate high-quality text that is useful for many applications, including chatbots, language translation, and content creation. However, these models are limited to the information contained within their training datasets.

If you prompt ChatGPT about something that is only contained within your own organization's documents, it will either respond inaccurately or tell you it doesn't have that information. This can be problematic if you are working on an application where the language is highly technical or domain-specific.

To solve this problem, we can augment our LLMs with our own custom documents. In this article, I will show you a framework to give context to ChatGPT or GPT-4 (or any other LLM) with your own data by using document embeddings.

Providing context to language models

Language models are context sensitive. If you give them a plain prompt, they will respond based on the knowledge they have extracted from their training data. But if you prepend your prompt with custom information, you can modify their behavior.

For example, if you ask ChatGPT the question, “What are the risks of using run rate?” it will provide a long answer (which is not bad).

Generic answer by ChatGPT

However, you can prompt ChatGPT to provide the answer from a specific document. In the following example, I ask ChatGPT the same question, but I prepend my prompt with “Answer my questions based on the following document:” followed by the text of an article from Investopedia about run rate. This time, ChatGPT provides a different answer, extracted from the article’s text. 
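To make this concrete, here is a minimal sketch of how you might prepend a document to a user question with the OpenAI Python library (this uses the pre-1.0 `openai` package interface; the `document_text` variable, API key, and model name are placeholders you'd replace with your own):

```python
import openai  # pre-1.0 openai package; newer versions use a client object instead

openai.api_key = "YOUR_API_KEY"  # placeholder

document_text = "..."  # full text of the article you want the model to answer from
user_question = "What are the risks of using run rate?"

# Prepend the document to the question so the model answers from that context
prompt = (
    "Answer my questions based on the following document:\n\n"
    f"{document_text}\n\n"
    f"Question: {user_question}"
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])
```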

The value of this technique is evident, especially in applications where context is very important. However, manually adding context to your prompts is not practical, especially when you have thousands of documents.

Say you have a website that has thousands of pages with rich content on financial topics and you want to create a chatbot based on the ChatGPT API that can help users navigate this content. You need a systematic approach to match users’ prompts with the right pages and use the LLM to provide context-aware responses. This is where document embeddings can help.

Using embeddings to capture semantics

Before we get into embeddings, let’s create a high-level framework for our chatbot:

1- The user enters a prompt
2- Retrieve the best document that is relevant to the prompt
3- Create a new prompt that includes the user’s question as well as the context from the document
4- Give the newly crafted prompt to the language model
5- Return the answer to the user

Providing context to ChatGPT

From a programming standpoint, this process is straightforward except for step 2. How do we decide which document is relevant to the user’s query? A rudimentary answer would be to use classic indexing and keyword search. A better solution is to use embeddings.

An embedding is a numerical vector—a list of numbers—that captures the different features of a piece of information. The more dimensions the embedding has, the more features it can capture.

You can use embeddings for different types of data. For example, in image-related tasks, embeddings can represent the presence or absence of different objects, the intensity of different colors, the distance between different objects, etc.

In text, embeddings capture different semantic aspects of texts. For example, word embeddings might contain information about whether a word relates to a city or country, a species of animal, a sports activity, a political concept, etc. In the same sense, phrase embeddings create a numerical representation of the content of a sequence of words. By measuring the distance between two embedding vectors, you can estimate the similarity of their corresponding content.
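As a rough illustration of that last point, here is how you could compare two embedding vectors with cosine similarity using NumPy (the vectors below are made up for the example; real text embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors (closer to 1.0 means more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny made-up embeddings, for illustration only
emb_finance_article = np.array([0.12, -0.48, 0.33, 0.90])
emb_finance_question = np.array([0.10, -0.52, 0.30, 0.85])
emb_unrelated_text = np.array([-0.70, 0.25, -0.10, 0.05])

print(cosine_similarity(emb_finance_article, emb_finance_question))  # high similarity
print(cosine_similarity(emb_finance_article, emb_unrelated_text))    # low similarity
```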

You create embeddings by training a machine learning model—usually a deep neural network—on a large dataset of examples. In many cases, the embedding model is a modified version of the same model used for the final application (e.g., text generation or image classification).

Creating an embedding database for our documents

To integrate embeddings into your chatbot workflow, you'll need a database that contains the embeddings of all your documents. If your documents are already available as plain text in a database, then you're ready to create the embeddings. If not, you'll need to extract the text first, for example by scraping your web pages with Python's Beautiful Soup library. If your documents are PDF files, such as research papers, you'll need to extract the text from them (you can do this with the Python PyPDF library).
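For example, both extraction paths might look roughly like this sketch (the URL and file name are placeholders, and real pages usually need more careful parsing):

```python
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

# Extract the visible text of a web page
html = requests.get("https://example.com/some-article").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")
page_text = soup.get_text(separator="\n", strip=True)

# Extract the text of a PDF, page by page
reader = PdfReader("paper.pdf")  # placeholder file name
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)
```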

To create embeddings for your documents, you can use an online service such as OpenAI's Embeddings API. You provide the API with the text of your document, and it returns its embedding. OpenAI's embeddings have 1,536 dimensions, which is among the largest available. Alternatively, you can use other embedding services, such as models hosted on Hugging Face, or your own custom transformer model.
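Here is a minimal sketch of creating an embedding with OpenAI's Embeddings API, again using the pre-1.0 `openai` package (the `text-embedding-ada-002` model returns 1,536-dimensional vectors; newer library versions expose the same capability through a slightly different interface):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def embed(text: str) -> list[float]:
    """Return the embedding vector for a piece of text."""
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response["data"][0]["embedding"]

document_embedding = embed(page_text)  # page_text from the extraction sketch above
print(len(document_embedding))  # 1536
```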

Once you have your embeddings, you must store them in a “vector database.” Vector databases are specialized for embeddings and provide different features, such as querying based on different measures (Euclidean distance, cosine similarity, etc.).

A popular open-source vector database is Faiss by Facebook, which provides a rich Python library for hosting your own embedding data. Alternatively, you can use Pinecone, an online vector database system that abstracts the technical complexities of storing and retrieving embeddings.
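As an example, here is a small sketch of indexing document embeddings with Faiss and finding the nearest neighbors of a query embedding (it assumes a `documents` list of (text, embedding) pairs and the hypothetical `embed` helper from the previous sketch):

```python
import numpy as np
import faiss

dimension = 1536  # matches OpenAI's text-embedding-ada-002 vectors
index = faiss.IndexFlatL2(dimension)  # exact search using Euclidean distance

# documents: list of (text, embedding) pairs built in the previous steps
doc_texts = [text for text, _ in documents]
doc_vectors = np.array([emb for _, emb in documents], dtype="float32")
index.add(doc_vectors)

# Find the three documents closest to the user prompt
query = np.array([embed("What are the risks of using run rate?")], dtype="float32")
distances, indices = index.search(query, 3)
top_documents = [doc_texts[i] for i in indices[0]]
```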

You now have everything you need to create an LLM application that is customized for your own proprietary data. We can now change the logic of the application as follows (a code sketch follows the list):

1- The user enters a prompt
2- Create the embedding for the user prompt
3- Search the embedding database for the document that is nearest to the prompt embedding
4- Retrieve the actual text of the document
5- Create a new prompt that includes the user’s question as well as the context from the document
6- Give the newly crafted prompt to the language model
7- Return the answer to the user
8- Bonus: provide a link to the document where the user can further obtain information
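Put together, one way to sketch this loop in code is shown below; it reuses the hypothetical `embed` helper, Faiss `index`, and `doc_texts` list from the earlier sketches, and the prompt wording and model name are placeholders:

```python
import numpy as np
import openai

def answer(user_prompt: str) -> str:
    # Steps 2-3: embed the prompt and find the nearest document in the vector database
    query = np.array([embed(user_prompt)], dtype="float32")
    _, indices = index.search(query, 1)

    # Step 4: retrieve the actual text of the best-matching document
    context = doc_texts[indices[0][0]]

    # Step 5: craft a new prompt that combines the document with the user's question
    prompt = (
        "Answer my questions based on the following document:\n\n"
        f"{context}\n\n"
        f"Question: {user_prompt}"
    )

    # Steps 6-7: send the crafted prompt to the language model and return its answer
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```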

Using embeddings and a vector database to retrieve relevant documents

To avoid creating the entire workflow manually, you can use LangChain, a Python library for creating LLM applications. LangChain supports different types of LLMs and embeddings, including OpenAI, Cohere, and AI21 Labs, as well as open-source models. It also supports different vector databases, including Pinecone and FAISS. And it has ready-made templates for different types of applications, including chatbots, question answering, and agents.
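For instance, under an early (0.0.x-era) LangChain release, a question-answering chain over a FAISS store could be sketched roughly like this; LangChain's module layout has changed considerably across versions, so treat the import paths and class names as assumptions to check against your installed release:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Build a vector store from raw document texts (embeddings are created for you)
db = FAISS.from_texts(doc_texts, OpenAIEmbeddings())

# A chain that retrieves the most relevant documents and feeds them to the LLM
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    chain_type="stuff",  # "stuff" the retrieved text directly into the prompt
    retriever=db.as_retriever(),
)
print(qa.run("What are the risks of using run rate?"))
```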

Important considerations for embeddings

To make proper use of embeddings with large language models, keep the following considerations in mind:

– Remain consistent in the embedding framework you use: Make sure you use the same embedding model across the entire application. For example, if you choose OpenAI embeddings, make sure to use the same API and model for creating document embeddings, user prompt embeddings, and searching your vector database. Otherwise, you will get inconsistent results.

– Token limitations: Every LLM has a token limit. For example, ChatGPT can preserve up to 4,096 tokens of context, GPT-4 comes in 8,000- and 32,000-token variants, and many open-source models are limited to 2,048 tokens. This limit includes the document context, the user prompt, and the model's response. Therefore, you have to make sure that your context data doesn't use up the LLM's token budget. A good rule of thumb is to limit documents to 1,000 tokens. If a document is longer than that, you can break it into several chunks with a bit of overlap (around 100 tokens) between each part (see the chunking sketch after this list).

– Using multiple documents: Your response does not have to be limited to a single document. You can retrieve several documents whose embeddings are similar to the prompt and use them to obtain responses. To make sure you don’t run into token limits, you can prompt the model separately for each document.
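To illustrate the chunking idea from the token-limit point above, here is a simple sketch that splits a long document into roughly 1,000-token chunks with about 100 tokens of overlap, using the tiktoken tokenizer (the encoding name and sizes are assumptions you can tune):

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of about chunk_size tokens, with `overlap` tokens shared between neighbors."""
    encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by ChatGPT-era models
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(encoding.decode(tokens[start:start + chunk_size]))
        start += chunk_size - overlap  # move forward, keeping a small overlap
    return chunks

pieces = chunk_text(long_document_text)  # long_document_text is a placeholder variable
```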

Why not fine-tune language models?

Why not fine-tune the LLM instead of using context embeddings? Fine-tuning is a good option, and whether to use it depends on your application and resources. With proper fine-tuning, you can get good results from your LLM without the need to provide context data, which reduces token and inference costs on paid APIs. However, fine-tuning can be costly and complicated. Using context embeddings is an easy option that you can implement with minimal cost and effort.

Eventually, if you have a good data-collection pipeline, you can improve your system by fine-tuning a model for your purposes.
