Large language models (LLMs) have seen significant advances in recent years, generating text with quality that was previously unimaginable. But LLMs also suffer from a serious problem: the text they generate, though human-like and fluent, can be factually wrong.
This challenge, sometimes called the “hallucination” problem, can be amusing when people tweet about LLMs making egregiously false statements. But it makes it very difficult to use LLMs in real-world applications.
AI21 Labs is among the organizations that are trying to address this problem by creating language models that are reliable for various applications. In an interview with TechTalks, Yoav Levine, the company’s chief scientist, explained why LLMs struggle with factuality and how his research team is working on creating language models that can ground their text in real facts.
Are LLMs reliable sources of knowledge?
The transformer architecture, used in today’s language models, has brought amazing achievements in tasks that require generating sequences of words or other kinds of data. We can see this in the excitement around GPT-3, ChatGPT, LaMDA, Codex, and other large language models. But excitement does not mean reliability. Just recently, a popular tech news website that was using ChatGPT to write articles had to review its AI-generated content due to major mistakes.
“When we want to rely on these traditional language models for creating our content and generating text for us, then the issue of provenance and reliability of the data becomes central,” Levine said.
LLMs are trained to predict the next token in a sequence, which they can do very well. But they are not designed to point to the source from which they have acquired their knowledge.
Reliability is not the only problem. The model might be outdated and its training data might be missing important knowledge that is relevant to your use case. Or you might want to know whether the model is giving you facts or someone’s opinion.
“There are many things that you don’t see in these opaque, mainstream language models,” Levine said. “And we want to change that.”
Levine says having more control over language models is important for a variety of reasons. Say you’re an author who is using an LLM-powered writing assistance tool. When the language model generates text, you want to know where it got the information from and how reliable the source is, especially if it will later be attributed to you.
“When someone creates content, they want to be comfortable to put their name on it. And in certain use cases, the ability to connect the content to sources really facilitates this,” Levine said.
And from a scientific perspective, a big question is whether a large neural network is the best structure for storing knowledge. An alternative would be to relegate knowledge extraction to an external mechanism and have the language model focus on generating linguistically accurate text.
“There are a lot of merits in this decoupling, which we believe in,” Levine said.
Retrieval augmented language modeling
Scientists are working on different techniques to address the problem of citing the source of information generated by language models. One such technique is retrieval augmented language modeling (RALM), which tries to train language models to fetch information from external sources.
During training, a classic language model tunes its parameters in a way that implicitly represents all the knowledge in its training corpus. For example, if you prompt a classic language model with “Ludwig van Beethoven was born in,” it will try to complete the sentence by guessing the next token in the sequence. If Beethoven’s birthplace was included in its training corpus, then the model will likely provide a reliable answer. If not, it will still provide an answer, though it will probably be wrong.
A RALM model, on the other hand, adds a “knowledge retriever” to find the document that is most likely to contain information relevant to the prompt. It will then use the content of that document as part of its prompt to generate more reliable output. Therefore, not only will the model output Beethoven’s birthplace, but it will also retrieve the document (e.g., Wikipedia page) that contains that information.
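The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration, not AI21's implementation: the corpus, the document identifiers, and the overlap-based scoring function (a stand-in for a learned or BM25-style retriever) are all assumptions for demonstration.

```python
def score(query: str, document: str) -> int:
    """Rank documents by word overlap with the prompt (a crude
    stand-in for a trained knowledge retriever)."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d)

def retrieve(query: str, corpus: dict) -> tuple:
    """Return the identifier and text of the highest-scoring document."""
    return max(corpus.items(), key=lambda item: score(query, item[1]))

# Hypothetical knowledge base keyed by source identifier.
corpus = {
    "wiki/Beethoven": "Ludwig van Beethoven was born in Bonn in 1770.",
    "wiki/Mozart": "Wolfgang Amadeus Mozart was born in Salzburg.",
}

prompt = "Ludwig van Beethoven was born in"
doc_id, doc = retrieve(prompt, corpus)
# The document text grounds the model's completion, and doc_id
# gives the end user a citable source.
print(doc_id)  # wiki/Beethoven
```

The key point is that the retriever returns not just the information but its provenance, which the generator can surface alongside its output.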
During training, the knowledge retriever is rewarded by finding documents in its training corpus that can improve the output of the main language model. During inference, in addition to generating text, the model can produce references to the knowledge documents. This allows the end user to verify the source and reliability of the text that the model generated.
“Retrieval augmented language modeling is useful in long text generation, where the AI is writing with us and there is an amount of machine-generated text,” Levine said. “So to speak, the model is trying to make a case while generating text. RALM came to solve the language modeling task, and to have the generated text being more reliant on sources.”
Although compelling, RALM comes with a few key challenges. Chief among them is integrating the document retriever’s output into the context of the main language model. Most previous techniques proposed solutions that require architectural changes to the language model, which makes them extremely difficult to reproduce.
“Today, we don’t see all these generative models coming automatically with this retrieval augmented technology,” Levine said. “It’s in the literature for a year—maybe a bit more—and with great solutions. But because all of them require this extra effort around the language model, only specific companies can do it.”
A lot of today’s access to LLMs is through application programming interfaces (APIs), which makes it impossible to apply RALM techniques. And even for open-source models, the technical challenges of modifying and retraining the models effectively make RALM off-limits for many organizations. The researchers at AI21 Labs want to take retrieval augmented modeling to the next level.
“Retrieval augmented language modeling is super-appealing and our statement is, how do we make it prevalent? How do we make the next off-the-shelf language model inherently have this grounding mechanism?” Levine said.
The researchers sought an integration mechanism that would work even without direct access to the language model. A very simple solution, they decided, was to take the document retrieved from the external knowledge source and simply prepend it to the input, so the model has it in context while generating. This is why they call their technique “in-context retrieval augmented language modeling.”
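Because the integration is pure prompt construction, the sketch below is nearly trivial, which is exactly the appeal: any black-box LLM API can be grounded this way without architectural changes. The prompt template is an illustrative assumption, not the paper's exact format.

```python
def build_grounded_prompt(retrieved_doc: str, user_prompt: str) -> str:
    """In-context RALM in one line: prepend the retrieved document
    to the prompt before sending it to an unmodified model."""
    return f"{retrieved_doc}\n\n{user_prompt}"

doc = "Ludwig van Beethoven was born in Bonn in 1770."
prompt = "Ludwig van Beethoven was born in"

grounded = build_grounded_prompt(doc, prompt)
# `grounded` is what gets sent to the model; the model itself is
# never modified or retrained.
print(grounded)
```

Since nothing about the model changes, the same wrapper works whether the model is served behind an API or run locally.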
This is a straightforward technique that has been tried by other technical groups. But the researchers at AI21 Labs showed in a recent paper that this is a viable approach for having LLMs respect the retrieved document.
“You can do the integration without putting emphasis on changing the model, which is important in the current atmosphere of these models not being accessible to everyone,” Levine said. “The message is, you don’t really need to change the architecture.”
In the paper (PDF), the researchers show that in-context RALM can achieve performance gains equivalent to doubling or tripling the size of an off-the-shelf language model. The work also highlights the benefits of separating the document-retrieval mechanism from language modeling. Different tasks might require different types of knowledge bases and retrieval mechanisms. Decoupling the text generator and the retriever will enable users to adapt both components to their specific applications. Levine thinks this structure will help make RALM more widespread.
“You get to think hard within your domain and on the sort of mechanism you want for selecting the knowledge and guiding the model or grounding the model’s generation,” Levine said. “But you don’t need to do a lot of work in terms of changing the model itself.”
Optimizing the language model architecture
Relegating the task of finding information to the knowledge retriever takes a big load off the main language model. In turn, this will enable scientists and engineers to create much smaller language models, perhaps a few billion parameters, that focus solely on linguistic accuracy. Such a model would not need to be retrained very frequently. Meanwhile, the knowledge retriever will be optimized for fetching information from the knowledge base, with its architecture and training frequency configured for its application and domain.
While their results are promising, Levine says that the vision is still not complete, and they have to make progress on important challenges.
“The next step is to make the two models aware of each other—specifically, the language model being aware of the fact that the retriever is going to be bringing in the facts,” Levine said.
They will also look for new language model architectures that are more attuned to RALM. Ultimately, generative systems can be composed of an LLM surrounded by a constellation of different modules that specialize in various tasks and cooperate in creating reliable and verifiable output. And not all these modules need to be machine learning models. For example, if the LLM is generating text about the weather next week, it can retrieve the information from a weather forecasting API. Or if it’s generating a revenue report, it can use a calculator module.
“You can think of several capabilities that might be needed during text generation that just doesn’t make sense for language models to do them. But today they do,” Levine said. “You have one function doing everything right now, and scientifically speaking, some functions are not even supposed to be learned or implemented by neural networks.”