Large language models (LLMs) have made remarkable progress in natural language processing, one of the most challenging areas of artificial intelligence. They can generate text with impressive coherence and linguistic accuracy.
However, LLMs often fail at simple tasks. Consider the following example, in which I asked Bing Chat to solve an elementary math problem. It failed utterly, even when I tried to formulate the problem in a more structured manner.
Some people have found that another variant of the problem works in Bing Chat. Others have found that it works well when Bing Chat is used in “Creative” mode. I tried the same problem in ChatGPT and got different but equally wrong results. Some users made it work with the GPT-4 version of ChatGPT. The point is that LLMs perform inconsistently on elementary problems that require concrete knowledge and reasoning.
This is representative of some of the bigger problems with LLMs. These problems might require more fundamental changes to language models and our approach to AI. But in the meantime, some emerging solutions are helping to make language models more robust in everyday applications. Among them are augmented language models, techniques that combine LLMs with other specialized applications.
The problem with next-token prediction
Transformers, the architecture used in language models, are designed for “next-token prediction.” After training on a very large corpus of text, code, math, etc., the model can look at a prompt and predict what comes after. This is very effective for many tasks such as translation, article generation, and text summarization.
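To make the idea concrete, here is a toy sketch of next-token prediction. It is not a transformer: it is a bigram counter that I am using as a drastically simplified stand-in, and the corpus and function names are made up for illustration. It shows how a predictor that only knows which token usually comes next can produce a fluent but wrong answer.

```python
from collections import Counter, defaultdict


def train_bigram(corpus: str) -> dict:
    """Count, for every token, which tokens follow it in the corpus."""
    tokens = corpus.lower().split()
    follow_counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follow_counts[current][nxt] += 1
    return follow_counts


def predict_next(follow_counts: dict, token: str):
    """Return the most frequent follower of `token`, or None if unseen."""
    followers = follow_counts.get(token.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(follow_counts: dict, prompt: str, max_new_tokens: int = 5) -> str:
    """Greedily append the most likely next token, one token at a time."""
    tokens = prompt.lower().split()
    if not tokens:
        return ""
    for _ in range(max_new_tokens):
        nxt = predict_next(follow_counts, tokens[-1])
        if nxt is None:
            break
        tokens.append(nxt)
    return " ".join(tokens)


# Toy corpus (an assumption for illustration): "= 4" appears more often than "= 6".
corpus = "2 + 2 = 4 . 3 + 3 = 6 . 2 + 2 = 4 ."
model = train_bigram(corpus)

# The predictor completes the prompt with the most common continuation,
# not with the result of the calculation.
print(generate(model, "3 + 3 =", max_new_tokens=1))  # prints "3 + 3 = 4"
```

A real LLM conditions on the whole prompt rather than on the last token, so it does far better than this. But the principle is the same: the output is the likeliest continuation, which is not necessarily the correct one.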
However, not all problems have predictive solutions. For example, the math problem mentioned at the beginning of this article has a deterministic answer. You might formulate the answer in different sentences, but the core elements (the number of balls that each person has) don’t require guesswork. Every math problem, whether formulated in natural language or written as equations, shares this characteristic.
Another problem with next-token prediction is grounding, sometimes referred to as the “hallucination problem.” LLMs generate text that is plausible but is not necessarily factual. Things such as dates of events, geographical locations, and names of people, for the most part, don’t require guesses and predictions. They can be retrieved from a reliable source.
For example, I asked ChatGPT who was the hundredth person to go to space. It said it was Lord British himself, Richard Garriott, the creator of the Ultima series (one of my all-time favorites). But then I asked the reverse question, and it said that Garriott was not the hundredth person to go to space. (A Wired article from 2011 states that Garriott was the 483rd person to go to space. According to Wikipedia, Leonid Kizim and Gennadi Strekalov were the 99th and 100th space travelers.)
There are two other problems with LLMs in their plain forms. First, training language models is very expensive and time-consuming. This is why you can’t update the model as quickly as knowledge is being produced. And second, LLMs are very inefficient solutions for simple problems. LLMs that are as large as GPT-3 and ChatGPT require hundreds of gigabytes of memory and run billions of operations for every token they predict. Simple math and knowledge retrieval problems don’t require such compute-intensive solutions.
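A back-of-the-envelope calculation shows where those numbers come from. This is a rough sketch under two assumptions of mine: weights are stored in 16-bit precision (2 bytes per parameter), and a forward pass costs roughly 2 floating-point operations per parameter per token, a common rule of thumb. Real deployments vary with quantization, batching, and caching.

```python
PARAMS = 175e9  # size of the largest GPT-3 model
BYTES_PER_PARAM = 2  # assumption: 16-bit weights
FLOPS_PER_PARAM_PER_TOKEN = 2  # assumption: one multiply and one add per weight


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return params * bytes_per_param / 1e9


def forward_flops_per_token(params: float, flops_per_param: float) -> float:
    """Approximate floating-point operations to predict a single token."""
    return params * flops_per_param


memory = weight_memory_gb(PARAMS, BYTES_PER_PARAM)
flops = forward_flops_per_token(PARAMS, FLOPS_PER_PARAM_PER_TOKEN)
print(f"Weights: about {memory:.0f} GB")
print(f"Compute: about {flops / 1e9:.0f} billion operations per token")
```

Under these assumptions, the weights alone take about 350 GB and every predicted token costs about 350 billion operations. A pocket calculator solves the same arithmetic problem with a handful of operations.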
Augmented language models address these problems by combining LLMs with an external source of knowledge. Although I usually don’t make analogies with human intelligence, I’ll make an exception here. Just as humans can recognize the limits of our knowledge and use external tools (calculators, search engines, encyclopedias, etc.) to retrieve information and verify facts, augmented language models use external sources to improve their output. Here are a few examples of augmented language models.
Retrieval-augmented language modeling
One of the early efforts at augmenting language models is Retrieval-Augmented Language Model pre-training (REALM) by Google. REALM combines language models with a “neural retriever” that obtains documents from an external knowledge source. Whenever the REALM model receives a prompt, the neural retriever pulls relevant data from the Wikipedia text corpus. It adds the knowledge as context to the prompt, which enables the LLM to provide more accurate information.
A more recent approach to retrieval augmentation is in-context RALM by AI21 Labs. In-context RALM can be added to many kinds of LLMs, including those accessible through an API. The basic idea behind RALM is to retrieve knowledge documents and insert them in the context of the conversation. For example, if you’re having a conversation with ChatGPT, the retriever will insert the knowledge into your conversation history for context.
Basically, retrieval augmentation decouples language generation from knowledge. The role of the LLM is to generate linguistically correct output. The retriever obtains and inserts the right knowledge. Experiments from AI21 Labs show that retrieval-augmented LLMs can produce high-quality output with smaller models and training datasets. They also don’t need to be constantly retrained with new knowledge.
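The retrieve-then-insert idea can be sketched in a few lines. This is a minimal illustration, not the REALM or in-context RALM implementation: I am assuming a naive word-overlap score in place of a real retriever, the function names are hypothetical, and the call to the LLM itself is left out. The two knowledge snippets come from facts mentioned earlier in this article.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list, top_k: int = 1) -> list:
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, documents: list, top_k: int = 1) -> str:
    """Insert the retrieved documents into the prompt as context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical knowledge store; it can be edited without touching the model.
knowledge_base = [
    "Richard Garriott is the creator of the Ultima series.",
    "Richard Garriott was the 483rd person to go to space.",
]

question = "Was Richard Garriott the hundredth person to go to space?"
print(build_prompt(question, knowledge_base))
```

The string returned by `build_prompt` is what would be sent to the LLM. Updating the model’s knowledge then means appending to `knowledge_base`, with no retraining involved.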
Bing’s new chat feature uses retrieval augmentation by combining an OpenAI language model (GPT-4, according to Microsoft) with Microsoft’s search engine. Bing Chat generates a search query from your prompt, retrieves relevant documents, and uses them as context for its results. Bing Chat also provides links to sources of information for the sentences it generates.
I’ve used Bing Chat for all kinds of tasks from updating my Linux Python version to finding general information. It is not perfect and still gets things wrong, but at least you get sources where you can verify the answers. For example, when I asked who was the hundredth person to go to space, it was conservative, replying that it couldn’t find relevant information. But when I asked if Richard Garriott was the hundredth person to go to space, it wrongly said that he was.
API-augmented language models
Another interesting approach to augmenting language models is the “Toolformer” technique by researchers at Meta AI. The idea behind Toolformer is to teach the LLM to use external APIs to obtain relevant information for its output. These APIs provide a wide range of services, including math solvers, question-answering systems, and search engines.
During training, Toolformer is provided with a limited number of human-annotated examples that show how to use APIs. The model then uses self-supervised learning to determine where to use each API. For example, if a prompt includes mathematical calculations, then Toolformer will call the math API instead of trying to guess the answer through next-token prediction.
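Here is a rough sketch of what handling such an API call could look like. To be clear about the assumptions: the bracketed `[Calculator(...)]` marker is only loosely modeled on the notation in the Toolformer paper, the helper names are mine, and the code post-processes a finished string, whereas the real model pauses generation to insert the API result. The calculator walks a parsed expression tree instead of calling `eval`, so it only accepts basic arithmetic.

```python
import ast
import operator
import re

# Arithmetic operators the toy calculator is allowed to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression and round to two decimals."""

    def evaluate(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError(f"Unsupported expression: {expression!r}")

    tree = ast.parse(expression.strip(), mode="eval")
    return str(round(evaluate(tree.body), 2))


# Hypothetical marker syntax: [ToolName(arguments)]
TOOL_CALL = re.compile(r"\[(\w+)\((.*?)\)\]")


def run_tools(text: str, tools: dict) -> str:
    """Replace each tool-call marker with the call followed by its result."""

    def replace(match):
        name, arguments = match.group(1), match.group(2)
        tool = tools.get(name)
        if tool is None:
            return match.group(0)  # leave unknown tools untouched
        return f"[{name}({arguments}) -> {tool(arguments)}]"

    return TOOL_CALL.sub(replace, text)


generated = "Out of 1400 participants, 400 [Calculator(400 / 1400)] passed the test."
print(run_tools(generated, {"Calculator": calculator}))
```

The number in the output comes from the calculator, not from next-token prediction. The language model only has to learn when to ask for it.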
Like retrieval augmentation, Toolformer enables LLMs to do more with less. According to experiments by the Meta AI team, a Toolformer with 6.7 billion parameters outperformed the largest GPT-3 model (175 billion parameters) on several tasks.
APIs can be a big deal for LLM applications. Recently, OpenAI released plugins for ChatGPT. Like Toolformer, ChatGPT plugins provide the LLM with API endpoints. According to OpenAI’s documentation, “OpenAI will inject a compact description of your plugin in a message to ChatGPT, invisible to end users. This will include the plugin description, endpoints, and examples.” The model will call the relevant APIs and integrate the results into its response. ChatGPT supports several applications including Wolfram, Zapier, Expedia, and Instacart. This allows the LLM to retrieve up-to-date information and perform complicated math.
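The flow described in that quote can be sketched as follows. Everything here is hypothetical: the `Plugin` and `Endpoint` classes, the `table_booker` plugin, and the text layout of the description are my own illustration and not OpenAI’s actual plugin format. The sketch only shows the two steps: turning a plugin into a compact description the model can read, and routing the endpoint the model picks to real code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Endpoint:
    path: str
    summary: str
    handler: Callable[[dict], str]


@dataclass
class Plugin:
    name: str
    description: str
    endpoints: Dict[str, Endpoint] = field(default_factory=dict)


def describe(plugin: Plugin) -> str:
    """Build the compact description that would be injected into a hidden message."""
    lines = [
        f"Plugin: {plugin.name}",
        f"Description: {plugin.description}",
        "Endpoints:",
    ]
    for endpoint in plugin.endpoints.values():
        lines.append(f"  {endpoint.path}: {endpoint.summary}")
    return "\n".join(lines)


def dispatch(plugin: Plugin, path: str, arguments: dict) -> str:
    """Run the endpoint the model selected and return its result as text."""
    endpoint = plugin.endpoints.get(path)
    if endpoint is None:
        raise KeyError(f"Unknown endpoint: {path}")
    return endpoint.handler(arguments)


# Hypothetical plugin with a single endpoint.
table_booker = Plugin(
    name="table_booker",
    description="Reserves restaurant tables.",
    endpoints={
        "/reserve": Endpoint(
            path="/reserve",
            summary="Reserve a table. Arguments: party_size, time.",
            handler=lambda args: f"Reserved a table for {args['party_size']} at {args['time']}",
        )
    },
)

print(describe(table_booker))
# In a real system, the model would choose the path and arguments itself.
print(dispatch(table_booker, "/reserve", {"party_size": 2, "time": "19:00"}))
```

The result returned by `dispatch` would then be passed back to the model to be worked into its response.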
These APIs also open the way for ChatGPT to become a tool not only for retrieving information but also for performing actions. You can use ChatGPT plugins to reserve restaurant tables, order food, and perform other tasks. This can become the basis of a new application platform.
But using LLMs to carry out tasks still requires caution and further investigation. As OpenAI warns, “there’s a risk that plugins could increase safety challenges by taking harmful or unintended actions, increasing the capabilities of bad actors who would defraud, mislead, or abuse others. By increasing the range of possible applications, plugins may raise the risk of negative consequences from mistaken or misaligned actions taken by the model in new domains.”
Augmented language models are still in their early stages. As the competition around the LLM market gets hotter, we can expect newer and more robust augmentation techniques to emerge.