By Jason Ly
Six months have passed since we were catapulted into the post-ChatGPT era, and AI is making more headlines every day. News of new large language models (LLMs) from specialized research houses keeps reaching us: Anthropic with Claude, DeepMind with Chinchilla. Meta has released LLaMA, while Google has been reminding us that LaMDA has existed all this time.
Naturally, this has inspired many to ask how to get their hands on their ‘own LLM’, or sometimes more ambitiously, their ‘own ChatGPT’. Enterprises want a chatbot that is equipped with knowledge of information from their company’s documentation and data.
Hot new deals and products (hello Duolingo and Allen & Overy) are in the pipeline: through ‘prompt architecting’, companies are combining existing LLMs with the right software, and data science know-how, to create new solutions. ‘Prompt architecting’ is a new term that plays on ‘prompt engineering’ – it is to prompt engineering what software architecture is to software engineering!
TL;DR: Right now, if your goal is to create new products or cut costs by automating processes, you don’t need your own LLM. Even the traditional data science practice of taking an existing model and fine-tuning it is likely to be impractical for most businesses. Instead, consider what I call prompt architecting as an alternative that lets you borrow the power of an LLM, but allows you to fully control the chatbot’s processes, check for factual correctness, and keep everything on-brand.
What does it actually mean to ‘want your own ChatGPT’?
One of the most common things people tell us is “we want our own ChatGPT”. Sometimes the more tech-savvy tell us “we want our own LLM” or “we want a fine-tuned version of ChatGPT”. If any of these sound familiar, read on.
Our first thought is usually “but do you actually?”. The phrase “We want our own LLM” is a super vague wish that has been thrown around a lot recently.
For fun, let’s take this request literally: you want your own LLM. In practice, that means you need to either:
1- create your own, or
2- take an existing one and customize it (“fine-tuning” in technical lingo).
For the vast majority of enterprises, either option is a bad idea. Starting from scratch is super ambitious. You’d be competing against our lord and savior ChatGPT itself, along with Google, Meta, and many specialized offshoot companies like Anthropic, which started with a meager $124 million in funding and was considered a small player in this space.
If you want ‘your own ChatGPT’, expect to collect 45 terabytes of text (about 25 million copies of the Bible), hire a research team to create a state-of-the-art architecture, and spend around $200m on computing power before you get any tangible results.
To put things into perspective, OpenAI has the cash, but can’t get its hands on enough hardware to keep up with their goals, even with Microsoft’s help. With the right budget and a university partnership, you could manage to pull something off on a timeline of 2-3 years. But be warned: there are no guarantees. A new department, be it LLM research or venture capital, will likely be needed in your business either way.
What about fine-tuning?
Fine-tuning is comparatively more doable and promises to yield some pretty valuable outcomes. The appeal derives from a chatbot that better handles domain-specific information with improved accuracy and relevance, while leaving a lot of the legwork to the big players. If you go down the open-source route or get a license from the original creator, you might get to deploy the LLM on-premise, which is sure to keep your data security and compliance teams happy.
OpenAI is happy for people to fine-tune GPT-3, saying on their website ‘Once a model has been fine-tuned, you won’t need to provide examples in the prompt anymore. This saves costs and enables lower-latency requests.’
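For illustration, here is a minimal sketch of preparing training data in the prompt/completion JSONL format that OpenAI’s legacy GPT-3 fine-tuning expected. The example rows and filename are hypothetical; a real fine-tune needs thousands of rows like these.

```python
import json

# Hypothetical training examples for a contract-drafting assistant.
# A real fine-tune would need thousands of pairs like these.
examples = [
    {"prompt": "Summarise clause 4.2 of the NDA ->",
     "completion": " Clause 4.2 limits liability to direct damages."},
    {"prompt": "What is our standard notice period? ->",
     "completion": " The standard notice period is 30 days."},
]

# The legacy fine-tuning pipeline expected one JSON object per line (JSONL).
with open("finetune_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

This file would then be uploaded to the provider’s fine-tuning endpoint; the expensive part is not the format but assembling enough high-quality pairs in the first place.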
Another option is taking an open-source model with a sufficiently permissive license (like the recently released Falcon 40B). This is a lot more elaborate to set up and more computationally expensive to fine-tune, but you get to literally download the model onto your machine!
Sounds great – what’s the catch?
Glad you asked.
Are there any complications of fine-tuning?
Firstly, you need absolute bucketloads of documents to feed your data-hungry model. For most use cases – such as automatic contract generation, askHR, or customer service applications – the thousands of example documents required simply do not exist.
Let’s suppose you pull together this colossal dataset (congratulations if so – it’s not for the faint of heart!). Whether you hire a data scientist to work with an open-source model, or use one of the big players’ APIs, expect to invest $250k in total for the fine-tune. If you’re looking to deploy this on-premise, $500k is more realistic.
If this all sounds reasonable, be aware of the double-edged sword waiting on the other side. You’ll get your widely sought-after ‘own LLM’, but coupled with the usual steerability problems.
LLMs are ‘free spirits’, not because they keep disappearing off digital nomading in Latin America, but because, at best, you can only encourage them to do your bidding. Fine-tuning an LLM is like feeding an intern a pile of examples without explanation and hoping that they just ‘get it’. Fine-tuning without prompt architecting won’t get you to your promised land!
Hallucinations are another big problem: LLMs like to make stuff up, and disobey your instructions, which can have harmful consequences when users lacking expertise in a subject over-rely on the chatbot’s convincing nonsense. Scandals involving offensive, false or otherwise off-brand content can destroy your customers’ perception of your company, or even land you in legal hot water!
The usual data scientist solution to all this is “add more data”. The flaw here is that, even if this data exists (you probably used everything you could get your hands on the first time), the black-box nature of an LLM means we can never be sure how the model will react to the new data, nor exactly what dataset will achieve the desired outcome. Worse still, every time we add new data we need to pay the computing bill for the retrain, which can run to hundreds, if not thousands, of dollars a pop!
Under what circumstances is fine-tuning appropriate?
In all seriousness, there are two situations where fine-tuning as the main strategy makes sense:
1- Your company requires all data and communications to be on-premise, and has a $500k budget (we are looking at the quant firms among you that weigh your employees before letting them vacate the premises!)
2- You want to do an industry-specific push and create a domain-specific foundational LLM that you collaborate on with your competitors ($10s of millions in investment, and the even harder task of convincing the entire industry to voluntarily give you all their documents for training).
Prompt architecting is the way to go for the majority of companies
As a recap, creating an LLM from scratch is a no-go unless you want to set up a $150m research startup.
Fine-tuning might be an option if you seriously need something on-prem, but unless you’re in finance, energy, or defense, that’s probably not you.
So how do we bend LLMs to our will, producing reproducible, reliable outcomes that help our customers, both external and internal, in their productive endeavors?
People come to us wanting to achieve things like:
1- Lawyers: “I want to be able to automatically generate contracts based on my firm’s best practices”
2- Head of HR: “I want to be able to automatically answer HR-related employee queries based on our handbook”
3- Head of CX: “I want to automatically answer customer queries based on our product troubleshooting instructions”
Luckily, none of these use cases require fine-tuning to solve!
We advocate creating software products to cleverly use prompts to steer ChatGPT the way you want. ‘Prompt architecting’ is what we name this approach. It is similar to prompt engineering, but with a key difference.
Instead of engineering individual prompts that achieve a single goal, we create entire pieces of software that chain, combine, and even generate tens, if not hundreds, of prompts on the fly to achieve a desired outcome. This method could be behind Zoom’s partnership with Anthropic to use the Claude chatbot on its platform.
How is this done in practice? The specific architecture for any given problem will be heavily specialized. However, every solution will rely on some variation of the following steps:
1- We accept a message from the user. This could be as simple as “Hello, my name is Jason”, but let’s dig into the following example:
“How many days of annual leave am I entitled to?”
2- We identify the context of the message and embellish the user’s message with some information. Continuing with our annual leave example:
User Context: “Jessica is an Associate; she is currently on probation. Please answer questions accordingly.”
Contextual Information: “Employees are entitled to 30 days annual leave per year, excluding bank holidays. During the probationary period, employees are only allowed to take at most 10 days of their annual leave. The subsequent entitlement may only be taken after their probationary period.”
Chatbot instructions: “You are an HR chatbot; answer the following query using the above context in a polite and professional tone.”
Question: “How many days of annual leave am I entitled to?”
3- We send the message to our favorite LLM and receive an answer!
“Employees are entitled to 30 days annual leave per year excluding bank holidays. Since you are on probation, you can take at most 5 days until the end of your probation period.”
4- Then we check the answer for errors (using multiple techniques, including a secondary LLM and semantic similarity checks). Here there is a mistake (5 days instead of 10), so we ask the LLM to regenerate the answer, amending the mistake.
“Employees are entitled to 30 days annual leave per year excluding bank holidays. Since you are on probation, you can take at most 10 days until the end of your probation period.”
5- Send the message to the user!
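The five steps above can be sketched as a small Python loop. This is purely illustrative: `call_llm` is a stub standing in for a real LLM API call, and the checker is a toy that only flags numbers missing from the supplied context.

```python
import re

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (e.g. an HTTP request to a provider)."""
    return ("Employees are entitled to 30 days annual leave per year excluding "
            "bank holidays. Since you are on probation, you can take at most 10 days.")

def build_prompt(user_context: str, facts: str, instructions: str, question: str) -> str:
    # Step 2: embellish the raw question with context before it reaches the LLM.
    return "\n\n".join([
        f"User Context: {user_context}",
        f"Contextual Information: {facts}",
        instructions,
        f"Question: {question}",
    ])

def check_answer(answer: str, facts: str) -> list:
    # Step 4, toy version: flag any number in the answer that does not
    # appear in the supplied facts (a real checker would be far richer).
    allowed = set(re.findall(r"\d+", facts))
    return [n for n in re.findall(r"\d+", answer) if n not in allowed]

def answer_query(question: str, user_context: str, facts: str,
                 instructions: str, max_attempts: int = 3) -> str:
    prompt = build_prompt(user_context, facts, instructions, question)
    answer = call_llm(prompt)
    for _ in range(max_attempts - 1):
        errors = check_answer(answer, facts)
        if not errors:
            break
        # Step 4 continued: ask the LLM to regenerate, pointing out the mistake.
        prompt += (f"\n\nYour previous answer used the unsupported figures "
                   f"{errors}; please regenerate it using only the context above.")
        answer = call_llm(prompt)
    return answer
```

The key design point is that the LLM never sees the raw user message alone: the software layer decides what context surrounds it going in, and what checks gate the answer coming out.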
Using this general methodology, plus some clever software to implement the architecture, we now have a framework that grants:
1- Full control: Using a software layer with contexts and checks, you control exactly what ChatGPT does.
2- Accuracy: You directly provide the information that ChatGPT uses, and you can even reference the original source.
3- Steerability: You give your chatbot a persona and make sure that it stays on-brand.
You get a conversational solution that does what you expect. It runs checks and balances to ensure it does not go rogue and ruin your brand’s image.
We have developed multiple components that largely deal with the hallucination issue. They let you customize this process for all sorts of applications, with multiple types of context used for embellishing messages, and response checkers that scan for offensive language, tone of voice, factual correctness, semantic similarity, and even response length.
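To make the idea of response checkers concrete, here is an illustrative sketch of three simple ones: length, tone, and a crude word-overlap stand-in for semantic similarity. These are hypothetical toy versions, not our production components.

```python
# Toy response checkers, purely for illustration. A production checker would
# use embeddings for semantic similarity and a classifier for offensive content.

def length_ok(answer: str, max_words: int = 120) -> bool:
    # Reject answers that ramble past the allowed length.
    return len(answer.split()) <= max_words

def tone_ok(answer: str, banned=("stupid", "obviously", "whatever")) -> bool:
    # Reject answers containing words that clash with the desired persona.
    lowered = answer.lower()
    return not any(word in lowered for word in banned)

def overlap_score(answer: str, source: str) -> float:
    # Crude stand-in for semantic similarity: Jaccard overlap of word sets
    # between the answer and the source-of-truth text it should be based on.
    a, b = set(answer.lower().split()), set(source.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

An answer that fails any check is sent back for regeneration with a note explaining what went wrong, exactly as in the annual-leave example above.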
Some tasks inevitably remain challenging: handling large quantities of data, long conversations, and managing sources of truth for our chatbots (nobody likes managing multiple versions of the same information in different formats).
In conclusion, while the allure of owning a bespoke LLM, like a fine-tuned version of ChatGPT, can be enticing, it is paramount for businesses to consider the feasibility, cost, and possible complications of such endeavors. Developing a unique LLM from scratch or even fine-tuning an existing one poses substantial challenges, requires extensive resources, and carries the risk of unanticipated steerability issues.
The novel approach of ‘prompt architecting’, combining off-the-shelf LLMs with cleverly designed software, offers a more practical, cost-effective solution for most enterprises. This strategy not only helps in achieving specific goals but also provides companies with control over their chatbot’s behavior, ensuring accuracy and maintaining brand identity. Leveraging the power of existing LLMs via prompt architecting is therefore the sensible way forward for enterprises looking to exploit AI capabilities without incurring extravagant costs or risking unintended outcomes.
About the author
Jason Ly is Cofounder and Head of Engineering at Springbok AI.