How to customize LLMs like ChatGPT with your own data and documents

Image source: 123RF (with modifications)

Large language models (LLMs) like GPT-4 and ChatGPT can generate high-quality text that is useful for many applications, including chatbots, language translation, and content creation. However, these models are limited to the information contained within their training datasets.

If you prompt ChatGPT about something that is only contained in your own organization’s documents, it will likely provide an inaccurate response, because that information was never part of its training data. This can be problematic if you are working on an application where the language is highly technical or domain-specific.

To solve this problem, we can augment our LLMs with our own custom documents. In this article, I will show you a framework to give context to ChatGPT or GPT-4 (or any other LLM) with your own data by using document embeddings.

Providing context to language models

Language models are context sensitive. If you give them a plain prompt, they will respond based on the knowledge they have extracted from their training data. But if you prepend your prompt with custom information, you can modify their behavior.

For example, if you ask ChatGPT the question, “What are the risks of using run rate?” it will provide a long answer (which is not bad).

Generic answer by ChatGPT

However, you can prompt ChatGPT to provide the answer from a specific document. In the following example, I ask ChatGPT the same question, but I prepend my prompt with “Answer my questions based on the following document:” followed by the text of an article from Investopedia about run rate. This time, ChatGPT provides a different answer, extracted from the article’s text. 
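As a rough illustration, here is what that prepending step might look like in Python with the openai package (pre-1.0 interface). The document text is a placeholder where you would paste the article, and the model name is only an example:

```python
import openai

document_text = "..."  # paste the text of the Investopedia article on run rate here
question = "What are the risks of using run rate?"

# Prepend the document to the user's question
prompt = (
    "Answer my questions based on the following document:\n\n"
    f"{document_text}\n\n"
    f"Question: {question}"
)

# Pre-1.0 openai API; requires openai.api_key to be set
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])
```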

The value of this technique is evident, especially in applications where context is very important. However, manually adding context to your prompts is not practical, especially when you have thousands of documents.

Say you have a website that has thousands of pages with rich content on financial topics and you want to create a chatbot based on the ChatGPT API that can help users navigate this content. You need a systematic approach to match users’ prompts with the right pages and use the LLM to provide context-aware responses. This is where document embeddings can help.

Using embeddings to capture semantics

Before we get into embeddings, let’s create a high-level framework for our chatbot:

1- The user enters a prompt
2- Retrieve the best document that is relevant to the prompt
3- Create a new prompt that includes the user’s question as well as the context from the document
4- Give the newly crafted prompt to the language model
5- Return the answer to the user

Providing context to ChatGPT

From a programming standpoint, this process is straightforward except for step 2. How do we decide which document is relevant to the user’s query? A rudimentary answer would be to use classic indexing and keyword search. A better solution is to use embeddings.

An embedding is a numerical vector—a list of numbers—that captures the different features of a piece of information. The more dimensions the embedding has, the more features it can encode.

You can use embeddings for different types of data. For example, in image-related tasks, embeddings can represent the presence or absence of different objects, the intensity of different colors, the distance between different objects, etc.

In text, embeddings capture different semantic aspects of the text. For example, word embeddings might contain information about whether a word relates to a city or country, a species of animal, a sports activity, a political concept, etc. In the same sense, phrase embeddings create a numerical representation of the content of a sequence of words. By measuring the distance between two embedding vectors, you can estimate the similarity of their corresponding content.
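To make this concrete, here is a toy sketch of comparing two embedding vectors with cosine similarity, one common distance measure. The vectors are made up and only four-dimensional; real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up "embeddings" for illustration only
run_rate_article = np.array([0.12, -0.48, 0.33, 0.90])
user_question    = np.array([0.10, -0.52, 0.30, 0.85])
unrelated_text   = np.array([-0.80, 0.15, 0.65, -0.05])

print(cosine_similarity(run_rate_article, user_question))   # close to 1: similar content
print(cosine_similarity(run_rate_article, unrelated_text))  # lower: less related
```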

You create embeddings by training a machine learning model—usually a deep neural network—on a large dataset of examples. In many cases, the embedding model is a modified version of the same model used for the final application (e.g., text generation or image classification).


Creating an embedding database for our documents

To integrate embeddings into your chatbot workflow, you’ll need a database that contains the embeddings of all your documents. If your documents are already available in plain text in a database, then you’re ready to create the embeddings. If not, you’ll need to use some sort of technique such as web scraping with Python Beautiful Soup to extract the text from the web pages. If your documents are PDF files, such as research papers, you’ll need to extract the text from them (you can do this with the Python PyPDF library).
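As a rough sketch, the extraction step might look like this with requests, Beautiful Soup, and the pypdf library; the URL and file path are placeholders, and real pipelines usually also strip navigation menus, ads, and other page boilerplate:

```python
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

def extract_web_text(url: str) -> str:
    """Download a page and collapse it to plain text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)

def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page in a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```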

To create embeddings for your documents, you can use an online service such as OpenAI’s Embeddings API. You provide the API with the text of your document, and it returns its embedding. OpenAI’s embeddings have 1,536 dimensions, which is among the largest available. Alternatively, you can use other embedding services such as Hugging Face, or use your own custom transformer model.
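Here is a minimal sketch of that call, using the pre-1.0 openai Python package and the text-embedding-ada-002 model (the one that returns the 1,536-dimensional vectors mentioned above):

```python
import openai

def embed(text: str) -> list[float]:
    """Return the embedding vector for a piece of text (pre-1.0 openai API)."""
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response["data"][0]["embedding"]

vector = embed("Run rate extrapolates current financial results over a longer period.")
print(len(vector))  # 1536
```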

Once you have your embeddings, you must store them in a “vector database.” Vector databases are specialized for embeddings and provide different features, such as querying based on different measures (Euclidean distance, cosine similarity, etc.).

A popular open-source vector database is Faiss by Facebook, which provides a rich Python library for hosting your own embedding data. Alternatively, you can use Pinecone, an online vector database system that abstracts the technical complexities of storing and retrieving embeddings.
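A bare-bones Faiss setup might look like the following sketch. It reuses the embed() helper from the previous snippet, and the document list is a placeholder for your own content:

```python
import numpy as np
import faiss

documents = ["text of page 1 ...", "text of page 2 ...", "text of page 3 ..."]
doc_vectors = np.array([embed(doc) for doc in documents], dtype="float32")

index = faiss.IndexFlatL2(doc_vectors.shape[1])  # exact search with Euclidean distance
index.add(doc_vectors)

query = np.array([embed("What are the risks of using run rate?")], dtype="float32")
distances, ids = index.search(query, 1)  # retrieve the single nearest document
print(documents[ids[0][0]])
```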

You now have everything you need to create an LLM application that is customized for your own proprietary data. We can now change the logic of the application as follows:

1- The user enters a prompt
2- Create the embedding for the user prompt
3- Search the embedding database for the document that is nearest to the prompt embedding
4- Retrieve the actual text of the document
5- Create a new prompt that includes the user’s question as well as the context from the document
6- Give the newly crafted prompt to the language model
7- Return the answer to the user
8- Bonus: provide a link to the document where the user can further obtain information

Using embeddings and a vector database to retrieve relevant documents
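Here is a sketch of how these steps could fit together, reusing embed(), index, and documents from the earlier snippets. The doc_urls list is a hypothetical mapping from document index to page URL, used for the bonus step:

```python
import numpy as np
import openai

# Hypothetical mapping from document index to the page it came from
doc_urls = ["https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3"]

def answer(user_prompt: str) -> str:
    # Steps 2-4: embed the prompt and fetch the nearest document's text
    query = np.array([embed(user_prompt)], dtype="float32")
    _, ids = index.search(query, 1)
    doc_id = int(ids[0][0])
    document = documents[doc_id]

    # Step 5: build a context-aware prompt
    prompt = (
        "Answer my questions based on the following document:\n\n"
        f"{document}\n\nQuestion: {user_prompt}"
    )

    # Steps 6-7: query the model and return its answer
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    answer_text = response["choices"][0]["message"]["content"]

    # Step 8 (bonus): point the user to the source document
    return f"{answer_text}\n\nSource: {doc_urls[doc_id]}"
```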

To avoid creating the entire workflow manually, you can use LangChain, a Python library for creating LLM applications. LangChain supports different types of LLMs and embeddings, including OpenAI, Cohere, AI21 Labs, as well as open-source models. It also supports different vector databases, including Pinecone and FAISS. And it has ready-made templates for different types of applications, including chatbots, question answering, and agents.
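For illustration, a retrieval-based question-answering chain in LangChain might look roughly like this; exact import paths and class names vary between LangChain versions, so treat it as a sketch rather than a recipe:

```python
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

documents = ["text of page 1 ...", "text of page 2 ...", "text of page 3 ..."]

# Embed the documents and store them in an in-memory FAISS index
db = FAISS.from_texts(documents, OpenAIEmbeddings())

# Chain that retrieves the most relevant documents and passes them to the LLM
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    retriever=db.as_retriever(),
)
print(qa_chain.run("What are the risks of using run rate?"))
```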

Important considerations for embeddings

To make proper use of embeddings with large language models, keep the following considerations in mind:

– Remain consistent in the embedding framework you use: Make sure you use the same embedding model across the entire application. For example, if you choose OpenAI embeddings, use the same API and model for creating document embeddings, creating user prompt embeddings, and searching your vector database. Otherwise, you will get inconsistent results.

– Token limitations: Every LLM has a token limit. For example, ChatGPT can preserve context up to 4,096 tokens, GPT-4 comes in 8,000- and 32,000-token variants, and many open-source models are limited to 2,048 tokens. This limit covers the document context, the user prompt, and the model’s response. Therefore, you have to make sure that your context data doesn’t fill up the LLM’s memory. A good rule of thumb is to limit documents to 1,000 tokens. If a document is longer than that, you can break it into several chunks with a bit of overlap (around 100 tokens) between each part, as shown in the sketch after this list.

– Using multiple documents: Your response does not have to be limited to a single document. You can retrieve several documents whose embeddings are similar to the prompt and use them to obtain responses. To make sure you don’t run into token limits, you can prompt the model separately for each document.
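The chunking rule of thumb from the token-limitations point can be implemented with a tokenizer such as tiktoken. Here is a rough sketch; the encoding name matches OpenAI’s newer models, and the chunk size and overlap are the values suggested above:

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into ~chunk_size-token pieces with a small overlap between them."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(encoding.decode(tokens[start:start + chunk_size]))
        start += chunk_size - overlap  # step back by the overlap
    return chunks
```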

Why not fine-tune language models?

Why not fine-tune the LLM instead of using context embeddings? Fine-tuning is a good option, and using it will depend on your application and resources. With proper fine-tuning, you can get good results from your LLMs without the need to provide context data, which reduces token and inference costs on paid APIs. However, fine-tuning can be costly and complicated. Using context embeddings is an easy option that can be achieved with minimal costs and effort.

Eventually, if you have a good data-collection pipeline, you can improve your system by fine-tuning a model for your purposes.

11 COMMENTS

  1. This is very helpful as we are preparing to issue an RFP for AI services to be incorporated into our systems. We have these large datasets of documents that we need to integrate and this will be key to the success of it all.

  2. Can you comment on how cloud providers like AWS, Azure, and GCP can simplify the process? How can one create document embeddings for all of the documents in one’s intranet?

  3. Beautifully written and informative. Finally something useful about how to implement and customise chatGPT.

  4. Hi, the article mentions “ChatGPT can preserve context up to 4,096 tokens. GPT-4 has an 8,000 and 32,000 token limits.” As most of my context documents exceed 4k tokens, a ChatGPT query with context doesn’t work. Arbitrarily cutting the context into small chunks loses the overall view that is required for some of the queries. How can I use the larger limits of 8k and 32k?

    • Hi Willem,
      As I mentioned in the article, if your content exceeds the limit, then you must break each piece of content into several chunks, calculate their embeddings, and store them separately. I suggest limiting chunks to around 1k tokens (you need space for the user’s prompt as well as the LLM’s response) for 2k and 4k models.

      • I think the question was about losing the overall view of a document, that is, losing the necessary general context to correctly answer a question. However, the article talks about creating “a bit of overlap (around 100 tokens) between each part”… is this to avoid losing the overall view?

  5. This was a very helpful article. Adding extra contexts to every API call can get very expensive fast though, so be warned.

  6. Thank you for this helpful article. Is running your own LLM cost prohibitive? For example, if this was an enterprise application where multiple users could query their data, wouldn’t the CPU cost be significant?

  7. You have written

    Token limitations: Every LLM has a token limit. For example, ChatGPT can preserve context up to 4,096 tokens. GPT-4 has an 8,000 and 32,000 token limits. Many open source models are limited to 2,048 tokens. This includes the document context, user prompt, and model’s response. Therefore, you have to make sure that your context data doesn’t fill the LLM’s memory. A good rule of thumb is to limit documents to 1,000 tokens. If your document is longer than that, you can break it into several chunks with a bit of overlap (around 100 tokens) between each part.

    Why should the model’s response be included in the token limit – or the maximum input sequence, which is all about the input alone?
