Since OpenAI released GPT-4 last week, my Twitter timeline has been filled with exciting applications of the large language model. GPT-4 can generate HTML code from hand-drawn mockups of websites. Users have shown that it can find physical addresses from credit card transactions, generate draft lawsuits, pass the SAT math test, help in education, and even create a first-person shooter game.
The capabilities of GPT-4 are truly amazing, and we can expect more from the LLM as more users gain access to its multimodal version. However, while we celebrate the progress that scientists have made in LLMs, we must also be careful of their limits.
LLMs such as GPT-4 can perform many tasks, but they’re not necessarily the best tools for those tasks. And if they successfully perform a task, it does not mean that they are reliable in that field more generally.
The scientific breakthrough of LLMs
The release of GPT-4 has triggered a lot of criticism toward OpenAI—much of which I think is justified. They have become less and less transparent with every release of GPT. The technical report OpenAI published with the release of GPT-4 contains very little detail on the architecture, training data, and other important aspects of the model. There is every sign that OpenAI is gradually transforming from an artificial intelligence research lab into a company that sells products.
However, none of this diminishes the fascinating breakthroughs that LLMs have ushered in. And OpenAI has had a substantial role in these developments. In just a few years, we’ve gone from deep learning models that were at best mediocre at handling language tasks to LLMs that can generate text that is—at least on the surface—very human-like.
Moreover, with enough parameters, compute power, and training data, transformers (the architecture used in LLMs) can learn to perform several tasks with a single model. This is very important because until recently, deep learning models were known to be good at only one task. Now, language models can perform several tasks with zero-shot and few-shot learning, and even show emergent abilities as they scale.
ChatGPT put the latest capabilities of LLMs on full display. It could perform coding, question-answering, text generation, and a host of other tasks in a single conversation. And thanks to its training technique, reinforcement learning from human feedback (RLHF), it is much better at following instructions.
GPT-4 and other multimodal language models are displaying a new wave of capabilities, such as including images and voice messages in conversations.
What are good applications for GPT-4?
Once you move past scientific achievements, you can start thinking about what type of applications LLMs such as GPT-4 can be trusted with. For me, the guiding principle to determine whether LLMs are fit for an application is their mechanism.
Like other machine learning models, LLMs are prediction machines. Based on the patterns in their training data, they predict the next token in a sequence that they receive as input. And they do this very effectively.
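This prediction mechanism can be illustrated with a toy sketch. The vocabulary and the logit scores below are made up for illustration; a real LLM scores tens of thousands of tokens through billions of matrix operations, but the final step is the same: turn scores into probabilities and pick the next token.

```python
import math

# Toy illustration of next-token prediction. The model assigns a score
# (logit) to every token in its vocabulary; softmax turns those scores
# into a probability distribution over the next token.
vocab = ["Paris", "London", "banana", "the"]
logits = [4.0, 2.5, -1.0, 0.5]  # hypothetical scores for "The capital of France is"

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# Greedy decoding: pick the single most likely token.
prediction = vocab[probs.index(max(probs))]
```

The key point is that nothing in this loop "knows" facts; the model simply ranks continuations by likelihood, which is why the same mechanism that writes fluent prose can also produce confident nonsense.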
Next-token prediction is an excellent solution for some tasks, such as text generation. When the LLM is trained with an instruction-following technique such as RLHF, it can perform language tasks with stunning results, including writing articles, summarizing text, explaining concepts, and answering questions. This is one area where LLMs are currently the most accurate and useful solution.
However, there are still limits to what LLMs can do in text generation. LLMs are known to hallucinate, or confidently state things that are not correct. Therefore, you should not trust them as sources of knowledge. This includes GPT-4. For example, in my exploration of ChatGPT, I’ve found that it can sometimes generate very eloquent descriptions of complicated topics, such as how deep learning works. This helps a lot when I’m trying to explain a concept to a person who might not be knowledgeable about it. But I’ve also seen ChatGPT make factual mistakes.
My rule of thumb for text generation is to only trust GPT-4 in areas that I’m knowledgeable about and can verify its output. There are some ways to improve the accuracy of the output, including fine-tuning the model on domain-specific knowledge or giving it context by prepending your prompt with relevant information. But again, those methods require that you know enough about that domain to be able to provide the extra knowledge. Therefore, don’t trust GPT-4 with generating text about health, legal advice, or science unless you already know the topic.
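Prepending context is just prompt construction. Here is a minimal sketch of that pattern; the reference text, the question, and the prompt template are all placeholders I made up for illustration, and in practice the context would come from your own documents or a retrieval step.

```python
# Minimal sketch of grounding a prompt with domain context before sending
# it to an LLM. The reference text below is a stand-in for real
# domain-specific material you have verified yourself.
reference_text = (
    "Our refund policy allows returns within 30 days of purchase, "
    "provided the item is unused and in its original packaging."
)
question = "Can I return an item after six weeks?"

prompt = (
    "Answer the question using ONLY the context below. "
    "If the context does not contain the answer, say you don't know.\n\n"
    f"Context: {reference_text}\n\n"
    f"Question: {question}"
)
```

Note that this only narrows the model's search space; it does not guarantee correctness, which is why you still need to know the domain well enough to check the answer.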
Code generation is another interesting application for GPT-4. I’ve already reviewed GitHub Copilot, which is based on a fine-tuned version of GPT-3 called Codex. Code generation becomes increasingly effective when it is integrated into your IDE (as Copilot is) and can use existing code as context to improve the LLM’s output. However, the same rule still applies. Only use the LLM to generate code that you can fully review and vet. Blindly trusting the language model can result in non-functional and insecure code.
What are not good applications for GPT-4?
For some tasks, language models such as GPT-4 are not the ideal solution, even if they can solve individual examples. For instance, one of the topics often discussed is the capability of LLMs to perform math. They have been tested against different math benchmarks, and GPT-4 has reportedly performed very well on complicated math tests.
It is worth noting, however, that LLMs do not compute math equations in the step-by-step way that humans do. When you provide GPT-4 with the prompt “1+1=” it will provide you with the right answer. But behind the scenes it is not performing “add” and “mov” operations. It is performing the same matrix operations it uses for all other inputs, predicting the next token in the sequence. It gives a probabilistic answer to a deterministic problem. This is why the accuracy of GPT-4 and other LLMs in math depends largely on their training dataset and works on a hit-and-miss basis. You might see them produce stunning results on very complicated math topics but fail on simple elementary math problems.
This does not mean that GPT-4 is not useful for math. One approach that can help is using model augmentation techniques, such as combining the LLM with a math solver. The LLM extracts the equation data from the prompt and passes it on to the solver, which computes and returns the result.
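The solver half of that pipeline can be sketched as follows. I'm assuming here that the LLM has already rewritten the user's question into a plain arithmetic expression (the extraction step); the solver then computes the answer deterministically instead of letting the model guess the next token. The `solve` function and its scope are my own illustration, not OpenAI's or Khan Academy's implementation.

```python
import ast
import operator

# Deterministic arithmetic solver for an LLM + calculator pipeline.
# Walking the AST (instead of calling eval) keeps it safe: only numeric
# literals and the four basic operators are accepted.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def solve(expression: str) -> float:
    """Safely evaluate a basic arithmetic expression string."""
    def eval_node(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](eval_node(node.left), eval_node(node.right))
        raise ValueError("unsupported expression")
    return eval_node(ast.parse(expression, mode="eval").body)

# e.g. an expression the LLM extracted from "What is 17 times 24 plus 3?"
result = solve("17 * 24 + 3")  # → 411
```

The division of labor matters: the LLM does what it is good at (parsing natural language), and the solver does what the LLM is bad at (exact computation).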
Another interesting use case for GPT-4 is what Khan Academy is doing. They have integrated the LLM into their online learning platform as a tutor for learners and assistant for teachers. Since this is one of the partnerships that OpenAI has advertised in the GPT-4 launch, my guess is that it is not the vanilla version of the model. They have probably fine-tuned GPT-4 with the content of Khan Academy’s courses. The model is also integrated nicely with the platform’s content to provide context and reduce errors. However, it is worth noting that GPT-4 does not solve math problems but rather guides students and teachers in learning and teaching math concepts.
GPT-4 as a product
Once you have decided whether GPT-4 is suitable for your application, you must look at it from a product perspective. Every token GPT-4 generates is the result of hundreds of billions of operations. The first question you must ask yourself is, does my application need that kind of computation? In many cases, even if GPT-4 provides a perfect answer, there might be much simpler, faster, and less costly solutions. For example, a simple solution such as a regular expression is often more efficient at extracting information.
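To make that concrete, here is the kind of extraction task where a one-line regex beats a billion-parameter model on cost, speed, and predictability. The pattern is deliberately simplified for illustration, not a full RFC 5322 email validator.

```python
import re

# Extract email addresses from free text. A simplified pattern:
# word characters, dots, pluses, and hyphens around a single "@".
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

text = "Contact sales@example.com or support@example.org for help."
emails = EMAIL_RE.findall(text)
# → ['sales@example.com', 'support@example.org']
```

The regex runs in microseconds, costs nothing per call, and fails in predictable ways, none of which is true of an LLM API call performing the same job.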
In fact, I had an interesting conversation with ChatGPT, in which I asked it what it means to “GPT-ify” something. ChatGPT described it as “to apply the techniques and capabilities of a language model based on the GPT (Generative Pre-trained Transformer) architecture, such as GPT-3 or GPT-Neo, to a particular task or problem.”
Then I asked whether GPT-ify can be used to mock someone who is using GPT for something that doesn’t need it. Here’s what ChatGPT said: “Yes, ‘GPT-ify’ can certainly be used ironically or sarcastically to suggest that someone is overusing or misusing GPT for a task or problem that does not require such a sophisticated language model. For example, if someone were to use a pre-trained GPT model to generate a simple grocery list, one might jokingly say that they have ‘GPT-ified their shopping list.’”
Therefore, don’t GPT-ify your products without exploring other options. Like every other technology, treat GPT-4 as one of many tools in your product toolbox. As ChatGPT recommended, “The key is to choose the right tool for the job based on the specific requirements and constraints of the task at hand.”