How to customize LLMs like ChatGPT with your own data and documents

Image source: 123RF (with modifications)

Large language models (LLMs) like GPT-4 and ChatGPT can generate high-quality text that is useful for many applications, including chatbots, language translation, and content creation. However, these models are limited to the information contained within their training datasets.

If you prompt ChatGPT about something that is only contained in your own organization’s documents, it will likely provide an inaccurate response, because that information was never part of its training data. This can be problematic if you are working on an application where the language is highly technical or domain-specific.

To solve this problem, we can augment our LLMs with our own custom documents. In this article, I will show you a framework to give context to ChatGPT or GPT-4 (or any other LLM) with your own data by using document embeddings.

Providing context to language models

Language models are context sensitive. If you give them a plain prompt, they will respond based on the knowledge they have extracted from their training data. But if you prepend your prompt with custom information, you can modify their behavior.

For example, if you ask ChatGPT the question, “What are the risks of using run rate?” it will provide a long answer (which is not bad).

Generic answer by ChatGPT

However, you can prompt ChatGPT to provide the answer from a specific document. In the following example, I ask ChatGPT the same question, but I prepend my prompt with “Answer my questions based on the following document:” followed by the text of an article from Investopedia about run rate. This time, ChatGPT provides a different answer, extracted from the article’s text. 
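As a rough illustration, here is what that prepending step might look like in Python with the openai package (pre-1.0 interface). The document text is a placeholder where you would paste the article, and the model name is only an example:

```python
import openai

document_text = "..."  # paste the text of the Investopedia article on run rate here
question = "What are the risks of using run rate?"

# Prepend the document to the user's question
prompt = (
    "Answer my questions based on the following document:\n\n"
    f"{document_text}\n\n"
    f"Question: {question}"
)

# Pre-1.0 openai API; requires openai.api_key to be set
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])
```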

The value of this technique is evident, especially in applications where context is very important. However, manually adding context to your prompts is not practical, especially when you have thousands of documents.

Say you have a website that has thousands of pages with rich content on financial topics and you want to create a chatbot based on the ChatGPT API that can help users navigate this content. You need a systematic approach to match users’ prompts with the right pages and use the LLM to provide context-aware responses. This is where document embeddings can help.

Using embeddings to capture semantics

Before we get into embeddings, let’s create a high-level framework for our chatbot:

1- The user enters a prompt
2- Retrieve the best document that is relevant to the prompt
3- Create a new prompt that includes the user’s question as well as the context from the document
4- Give the newly crafted prompt to the language model
5- Return the answer to the user

Providing context to ChatGPT

From a programming standpoint, this process is straightforward except for step 2. How do we decide which document is relevant to the user’s query? A rudimentary answer would be to use classic indexing and keyword search. A better solution is to use embeddings.

An embedding is a numerical vector—a list of numbers—that captures the different features of a piece of information. The more dimensions the embedding has, the more features it can encode.

You can use embeddings for different types of data. For example, in image-related tasks, embeddings can represent the presence or absence of different objects, the intensity of different colors, the distance between different objects, etc.

In text, embeddings capture different semantic aspects of the text. For example, word embeddings might contain information about whether a word relates to a city or country, a species of animal, a sports activity, a political concept, etc. In the same sense, phrase embeddings create a numerical representation of the content of a sequence of words. By measuring the distance between two embedding vectors, you can estimate the similarity of their corresponding content.
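To make this concrete, here is a toy sketch of comparing two embedding vectors with cosine similarity, one common distance measure. The vectors are made up and only four-dimensional; real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up "embeddings" for illustration only
run_rate_article = np.array([0.12, -0.48, 0.33, 0.90])
user_question    = np.array([0.10, -0.52, 0.30, 0.85])
unrelated_text   = np.array([-0.80, 0.15, 0.65, -0.05])

print(cosine_similarity(run_rate_article, user_question))   # close to 1: similar content
print(cosine_similarity(run_rate_article, unrelated_text))  # lower: less related
```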

You create embeddings by training a machine learning model—usually a deep neural network—on a large dataset of examples. In many cases, the embedding model is a modified version of the same model used for the final application (e.g., text generation or image classification).


Creating an embedding database for our documents

To integrate embeddings into your chatbot workflow, you’ll need a database that contains the embeddings of all your documents. If your documents are already available in plain text in a database, then you’re ready to create the embeddings. If not, you’ll need to use some sort of technique such as web scraping with Python Beautiful Soup to extract the text from the web pages. If your documents are PDF files, such as research papers, you’ll need to extract the text from them (you can do this with the Python PyPDF library).
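As a rough sketch, the extraction step might look like this with requests, Beautiful Soup, and the pypdf library; the URL and file path are placeholders, and real pipelines usually also strip navigation menus, ads, and other page boilerplate:

```python
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

def extract_web_text(url: str) -> str:
    """Download a page and collapse it to plain text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)

def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page in a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```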

To create embeddings for your documents, you can use an online service such as OpenAI’s Embeddings API. You provide the API with the text of your document, and it returns its embedding. OpenAI’s embeddings have 1,536 dimensions, which is among the largest available. Alternatively, you can use other embedding services such as Hugging Face, or use your own custom transformer model.
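Here is a minimal sketch of that call, using the pre-1.0 openai Python package and the text-embedding-ada-002 model (the one that returns the 1,536-dimensional vectors mentioned above):

```python
import openai

def embed(text: str) -> list[float]:
    """Return the embedding vector for a piece of text (pre-1.0 openai API)."""
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response["data"][0]["embedding"]

vector = embed("Run rate extrapolates current financial results over a longer period.")
print(len(vector))  # 1536
```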

Once you have your embeddings, you must store them in a “vector database.” Vector databases are specialized for embeddings and provide different features, such as querying based on different measures (Euclidean distance, cosine similarity, etc.).

A popular open-source vector database is Faiss by Facebook, which provides a rich Python library for hosting your own embedding data. Alternatively, you can use Pinecone, an online vector database system that abstracts the technical complexities of storing and retrieving embeddings.
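A bare-bones Faiss setup might look like the following sketch. It reuses the embed() helper from the previous snippet, and the document list is a placeholder for your own content:

```python
import numpy as np
import faiss

documents = ["text of page 1 ...", "text of page 2 ...", "text of page 3 ..."]
doc_vectors = np.array([embed(doc) for doc in documents], dtype="float32")

index = faiss.IndexFlatL2(doc_vectors.shape[1])  # exact search with Euclidean distance
index.add(doc_vectors)

query = np.array([embed("What are the risks of using run rate?")], dtype="float32")
distances, ids = index.search(query, 1)  # retrieve the single nearest document
print(documents[ids[0][0]])
```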

You now have everything you need to create an LLM application that is customized for your own proprietary data. We can now change the logic of the application as follows:

1- The user enters a prompt
2- Create the embedding for the user prompt
3- Search the embedding database for the document that is nearest to the prompt embedding
4- Retrieve the actual text of the document
5- Create a new prompt that includes the user’s question as well as the context from the document
6- Give the newly crafted prompt to the language model
7- Return the answer to the user
8- Bonus: provide a link to the document where the user can further obtain information

Using embeddings and a vector database to retrieve relevant documents
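Here is a sketch of how these steps could fit together, reusing embed(), index, and documents from the earlier snippets. The doc_urls list is a hypothetical mapping from document index to page URL, used for the bonus step:

```python
import numpy as np
import openai

# Hypothetical mapping from document index to the page it came from
doc_urls = ["https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3"]

def answer(user_prompt: str) -> str:
    # Steps 2-4: embed the prompt and fetch the nearest document's text
    query = np.array([embed(user_prompt)], dtype="float32")
    _, ids = index.search(query, 1)
    doc_id = int(ids[0][0])
    document = documents[doc_id]

    # Step 5: build a context-aware prompt
    prompt = (
        "Answer my questions based on the following document:\n\n"
        f"{document}\n\nQuestion: {user_prompt}"
    )

    # Steps 6-7: query the model and return its answer
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    answer_text = response["choices"][0]["message"]["content"]

    # Step 8 (bonus): point the user to the source document
    return f"{answer_text}\n\nSource: {doc_urls[doc_id]}"
```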

To avoid creating the entire workflow manually, you can use LangChain, a Python library for creating LLM applications. LangChain supports different types of LLMs and embeddings, including OpenAI, Cohere, AI21 Labs, as well as open-source models. It also supports different vector databases, including Pinecone and FAISS. And it has ready-made templates for different types of applications, including chatbots, question answering, and agents.
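For illustration, a retrieval-based question-answering chain in LangChain might look roughly like this; exact import paths and class names vary between LangChain versions, so treat it as a sketch rather than a recipe:

```python
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

documents = ["text of page 1 ...", "text of page 2 ...", "text of page 3 ..."]

# Embed the documents and store them in an in-memory FAISS index
db = FAISS.from_texts(documents, OpenAIEmbeddings())

# Chain that retrieves the most relevant documents and passes them to the LLM
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    retriever=db.as_retriever(),
)
print(qa_chain.run("What are the risks of using run rate?"))
```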

Important considerations for embeddings

To make proper use of embeddings with large language models, keep the following considerations in mind:

– Remain consistent in the embedding framework you use: Make sure you use the same embedding model across the entire application. For example, if you choose OpenAI embeddings, use the same API and model for creating document embeddings, creating user prompt embeddings, and searching your vector database. Otherwise, you will get inconsistent results.

– Token limitations: Every LLM has a token limit. For example, ChatGPT can preserve context up to 4,096 tokens, GPT-4 comes in 8,000- and 32,000-token variants, and many open-source models are limited to 2,048 tokens. This limit covers the document context, the user prompt, and the model’s response. Therefore, you have to make sure that your context data doesn’t fill up the LLM’s memory. A good rule of thumb is to limit documents to 1,000 tokens. If a document is longer than that, you can break it into several chunks with a bit of overlap (around 100 tokens) between each part, as shown in the sketch after this list.

– Using multiple documents: Your response does not have to be limited to a single document. You can retrieve several documents whose embeddings are similar to the prompt and use them to obtain responses. To make sure you don’t run into token limits, you can prompt the model separately for each document.
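The chunking rule of thumb from the token-limitations point can be implemented with a tokenizer such as tiktoken. Here is a rough sketch; the encoding name matches OpenAI’s newer models, and the chunk size and overlap are the values suggested above:

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into ~chunk_size-token pieces with a small overlap between them."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(encoding.decode(tokens[start:start + chunk_size]))
        start += chunk_size - overlap  # step back by the overlap
    return chunks
```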

Why not fine-tune language models?

Why not fine-tune the LLM instead of using context embeddings? Fine-tuning is a good option, and using it will depend on your application and resources. With proper fine-tuning, you can get good results from your LLMs without the need to provide context data, which reduces token and inference costs on paid APIs. However, fine-tuning can be costly and complicated. Using context embeddings is an easy option that can be achieved with minimal costs and effort.

Eventually, if you have a good data-collection pipeline, you can improve your system by fine-tuning a model for your purposes.

11 COMMENTS

  1. This is very helpful as we are preparing to issue an RFP for AI services to be incorporated into our systems. We have these large datasets of documents that we need to integrate and this will be key to the success of it all.

  2. Can you comment on how cloud providers like AWS, Azure, and GCP can simplify the process? How can one create document embeddings for all of the documents in one’s intranet?

  3. Beautifully written and informative. Finally something useful about how to implement and customise chatGPT.

  4. Hi, the article mentions “ChatGPT can preserve context up to 4,096 tokens. GPT-4 has an 8,000 and 32,000 token limits.” As most of my context documents exceed 4k tokens, a ChatGPT query with context doesn’t work. Arbitrarily cutting the context into small chunks loses the overall view that is required for some of the queries. How can I use the larger limits of 8k and 32k?

    • Hi Willem,
      As I mentioned in the article, if your content exceeds the limit, then you must break each piece of content into several chunks, calculate their embeddings, and store them separately. I suggest limiting chunks to around 1k tokens (you need space for the user’s prompt as well as the LLM’s response) for 2k and 4k models.

      • I think the question was about losing the overall view of a document, that is, losing the necessary general context to correctly answer a question. However, the article talks about creating “a bit of overlap (around 100 tokens) between each part”… is this to avoid losing the overall view?

  5. This was a very helpful article. Adding extra contexts to every API call can get very expensive fast though, so be warned.

  6. Thank you for this helpful article. Is running your own LLM cost prohibitive? For example, if this was an enterprise application where multiple users could query their data, wouldn’t the CPU cost be significant?

  7. You have written

    Token limitations: Every LLM has a token limit. For example, ChatGPT can preserve context up to 4,096 tokens. GPT-4 has an 8,000 and 32,000 token limits. Many open source models are limited to 2,048 tokens. This includes the document context, user prompt, and model’s response. Therefore, you have to make sure that your context data doesn’t fill the LLM’s memory. A good rule of thumb is to limit documents to 1,000 tokens. If your document is longer than that, you can break it into several chunks with a bit of overlap (around 100 tokens) between each part.

    Why should the model’s response be included in the token limit – or the maximum input sequence, which is all about the input alone?
