quit your lying

Can a technology called RAG keep AI models from making stuff up?

The framework pulls in external sources to enhance accuracy. Does it live up to the hype?

Chris Stokel-Walker
Credit: Aurich Lawson | Getty Images

We’ve been living through the generative AI boom for nearly a year and a half now, following the late 2022 release of OpenAI’s ChatGPT. But despite transformative effects on companies’ share prices, generative AI tools powered by large language models (LLMs) still have major drawbacks that have kept them from being as useful as many would like them to be. Retrieval augmented generation, or RAG, aims to fix some of those drawbacks.

Perhaps the most prominent drawback of LLMs is their tendency toward confabulation (also called “hallucination”), a statistical gap-filling phenomenon that arises when AI language models are tasked with reproducing knowledge that wasn’t present in their training data. They generate plausible-sounding text that can veer toward accuracy when the training data is solid but otherwise may be completely made up.

Relying on confabulating AI models gets people and companies in trouble, as we’ve covered in the past. In 2023, we saw two instances of lawyers citing legal cases, confabulated by AI, that didn’t exist. We’ve covered claims against OpenAI in which ChatGPT confabulated and accused innocent people of doing terrible things. In February, we wrote about Air Canada’s customer service chatbot inventing a refund policy, and in March, a New York City chatbot was caught confabulating city regulations.

So if generative AI aims to be the technology that propels humanity into the future, someone needs to iron out the confabulation kinks along the way. That’s where RAG comes in. Its proponents hope the technique will help turn generative AI tools into reliable assistants that can supercharge productivity without requiring a human to double-check or second-guess the answers.

“RAG is a way of improving LLM performance, in essence by blending the LLM process with a web search or other document look-up process” to help LLMs stick to the facts, according to Noah Giansiracusa, associate professor of mathematics at Bentley University.

Let's take a closer look at how it works and what its limitations are.

A framework for enhancing AI accuracy

Although RAG is now seen as a technique to help fix issues with generative AI, it actually predates ChatGPT. The term was coined in a 2020 academic paper by researchers at Facebook AI Research (FAIR, now Meta AI Research), University College London, and New York University.

As we've mentioned, LLMs struggle with facts. Google’s entry into the generative AI race, Bard, made an embarrassing error about the James Webb Space Telescope in its first public demonstration back in February 2023. The error wiped around $100 billion off the value of parent company Alphabet. LLMs produce the most statistically likely response based on their training data and don’t understand anything they output, meaning they can present false information that seems accurate if you don't have expert knowledge of the subject.

LLMs also lack up-to-date knowledge and the ability to identify gaps in their knowledge. “When a human tries to answer a question, they can rely on their memory and come up with a response on the fly, or they could do something like Google it or peruse Wikipedia and then try to piece an answer together from what they find there—still filtering that info through their internal knowledge of the matter,” said Giansiracusa.

But LLMs aren’t humans, of course. Their training data can age quickly, particularly in more time-sensitive queries. In addition, the LLM often can’t distinguish specific sources of its knowledge, as all its training data is blended together into a kind of soup.

In theory, RAG should make keeping AI models up to date far cheaper and easier. “The beauty of RAG is that when new information becomes available, rather than having to retrain the model, all that’s needed is to augment the model’s external knowledge base with the updated information,” said Melanie Peterson, senior director of TrainAI at the RWS Group, a tech firm. “This reduces LLM development time and cost while enhancing the model’s scalability.”

How does RAG work?

By default, an LLM will pull statistically plausible-sounding output from its training data, with some randomness inserted along the way to have the outputs appear more human-like. RAG introduces a new information-retrieval component into the process to search through external data. The data could be from any number of sources and in multiple formats.

As Peter van der Putten, who leads the AI Lab at Pegasystems and is an assistant professor at Leiden University, puts it, “When a user has a question, the RAG first performs a search in all sources for text fragments relevant to the query. Then a prompt is sent to the generative AI model or service to request to answer the user question based on the search results.”

To find relevant information in the external data that could help answer the user's query, the system converts the query into a vector representation, a dense numerical encoding of the text's meaning, and cross-checks it against a vector database built from that external data. Asking an LLM about Apple's business performance, for instance, could trigger a search of the external data for passages that mention Apple, along with mentions of business performance more broadly; those passages are ranked by how useful they appear, and the most relevant ones inform the response presented to the user.
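To make the retrieval step concrete, here is a minimal Python sketch of the idea. The embed() function and the handful of passages are placeholders standing in for a real embedding model and a real vector database; the names and data are illustrative and don't reflect any particular vendor's API.

```python
import numpy as np

# Toy "external data"; a real deployment would store many chunked documents
# in a vector database.
documents = [
    "Apple reported quarterly results showing growth in services revenue.",
    "The James Webb Space Telescope launched in December 2021.",
    "Air Canada updated its customer refund policy this year.",
]

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model: maps text to a fixed-length vector.
    Because this placeholder is random, the ranking below is meaningless; a real
    embedding model places semantically similar texts close together."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Normally these vectors are computed once and stored alongside the text.
doc_vectors = [embed(doc) for doc in documents]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank the stored passages by similarity to the query and return the top k."""
    query_vector = embed(query)
    scores = [cosine_similarity(query_vector, vec) for vec in doc_vectors]
    ranked = sorted(zip(scores, documents), reverse=True)
    return [doc for _, doc in ranked[:k]]

print(retrieve("How is Apple's business performing?"))
```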

What makes RAG so powerful is that it can then augment the user prompt with the new information it finds from its external data. It will try to harness that information to produce a better prompt that is more likely to elicit a higher-quality response. And it can be set up to constantly update that external data, all while not altering the underlying model that sits behind the process.
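Continuing the sketch above, the augmentation step can be as simple as pasting the retrieved passages into the prompt before it goes to the model. The generate() call below is hypothetical; a real deployment would call whatever LLM API it actually uses.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Fold the retrieved passages into the prompt as grounding context."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

question = "How is Apple's business performing?"
prompt = build_prompt(question, retrieve(question))  # retrieve() from the sketch above
# answer = generate(prompt)  # hypothetical call to the underlying LLM
print(prompt)
```

Because the context can be refreshed at any time, the underlying model itself never needs to be retrained to pick up new information.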

Even better, each answer the LLM produces can be fed into the external data used during RAG, in theory helping to improve accuracy. An LLM using RAG can also potentially recall how it answered previous similar questions.

And crucially, AI models using RAG can often cite the source of their claims because their information is held within that vector database. If an LLM produces an incorrect answer and it’s identified, the source of that incorrect information can be pinpointed within the vector database and be removed or corrected.
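As a rough illustration of how that citation ability works, each entry in the external store can carry metadata about where it came from, so a retrieved passage can be traced back to its source and, if it turns out to be wrong, removed or corrected. The field names and URLs below are invented for the example; real vector databases have their own schemas.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """One passage in the external store; the metadata is what enables citation."""
    text: str
    source: str        # e.g. a URL or internal document ID
    last_updated: str  # lets stale or retracted entries be found later

store = [
    Chunk("Apple reported growth in services revenue ...", "https://example.com/apple-results", "2024-02-01"),
    Chunk("Refunds can be requested within 90 days ...", "https://example.com/refund-policy", "2023-11-15"),
]

def remove_source(chunks: list[Chunk], bad_source: str) -> list[Chunk]:
    """Drop every chunk that came from a source identified as incorrect."""
    return [chunk for chunk in chunks if chunk.source != bad_source]

store = remove_source(store, "https://example.com/refund-policy")
print([chunk.source for chunk in store])
```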

RAG’s potential applications

Beyond the general benefits RAG is thought to provide to generative AI outputs, it can also improve specialist knowledge of subjects such as medicine or history by augmenting the “knowledge” LLMs draw upon in those fields. “When you combine RAG with domain-specific fine-tuning, the result is a more robust, reliable, and refined LLM fit for business purpose,” said Peterson.

RAG is already making a difference in real-world applications, according to some AI experts. “In my business role, we and our clients are exploring RAGs for lots of purposes because of how it steers AI in the right direction,” said van der Putten. “These controls will enable wider use of generative AI in business and elsewhere.”

But van der Putten—who has one foot in business and one foot in academia—believes RAG has benefits beyond the business world. “In my academic research, we are looking into interesting societal applications as well,” he said. “For instance, we are developing a RAG-controlled voting assistant for electoral systems based on proportional representation like in the Netherlands.”

The system would work by letting voters explore points of view of political parties from across the aisle on topics the voter provides. “The goal is to decrease polarization and make election choices more based on stated policy and actual proposals, motions, and party voting behavior in parliament,” he said.

Currently, OpenAI's ChatGPT does a form of RAG when it performs a web search related to a user question, providing more up-to-date information and a link to a source that a user can verify. Google's Gemini AI models do the same. And custom GPTs from OpenAI can be configured to use information from external data sources, which is also a form of RAG.

Does it actually solve the hallucination problem?

To hear some talk—I was at the recent International Journalism Festival in Perugia, Italy, where plenty of panelists mentioned RAG knowingly as a solution to the confabulation problem for generative AI—RAG seems like it will solve all of AI’s issues. But will it really?

When it comes to tackling generative AI’s confabulation problem, “RAG is one part of the solution,” said David Foster, founding partner of Applied Data Science Partners and the author of Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play.

But Foster is clear that it’s not a catch-all solution to the issue of an LLM making things up. “It is not a direct solution because the LLM can still hallucinate around the source material in its response,” he said.

To explain why RAG isn’t a perfect solution, Foster drew an analogy. “Suppose a student takes an English literature exam but doesn’t have access to the source text in the exam itself,” he said. “They may be able to write a decent essay, but there is a high likelihood that they will misremember quotes or incorrectly recall the order of events.”

RAG is like providing easy access to the source material to jog the student's memory. (Anthropomorphizing AI tools is problematic, of course, but it can be difficult to avoid when speaking in analogies.)

“If you give the student the source text, they would be able to 'look up' the relevant information and therefore reduce errors in recall,” said Foster. “This is RAG. However, the student may still retrieve information from the wrong place in the book—so still draws incorrect hallucinatory conclusions—or they may hallucinate additional information that wasn’t present.”

Because of this, Foster calls RAG a “mitigation” rather than a “cure for hallucination.”

A step forward—but not a silver bullet

The million-dollar question is whether it’s worth expending time, effort, and money on integrating RAG into generative AI deployments. Bentley University’s Giansiracusa isn’t sure it is. “The LLM is still guessing answers, but with RAG, it's just that the guesses are often improved because it is told where to look for answers,” he said. The problem remains the same as with all LLMs: “There's still no deep understanding of words and the world,” he said.

Giansiracusa also pointed out that the rise of generative AI-aided search results—and the recent "enshittification" of the web through AI-generated content—means that what might at one point have been a halfway useful solution to a fundamental flaw in generative AI tools could become less useful if AI language models draw from AI-written junk found online.

We've seen that issue recently with Google's AI Overview, which has relied on gamed page rankings to pick the "accurate" sources that Google's AI model then draws answers from.

“We know web search is riddled with misinformation and we know LLMs are riddled with hallucinations, so you can do the math here on what happens when you combine the two,” Giansiracusa said.


Staff Picks
Harvesterify
A very recent research paper explored the hypothesis that RAG would reduce hallucinations and improve recall when applied to legal texts and legal tasks (summarizing case law, drafting documents, and so on). The conclusion is negative: specialized legal models still hallucinate between 17 and 33 percent of the time, a slight improvement over general-purpose models but not much of one, while only slightly improving recall.

The paper is "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools" by Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, and Daniel E. Ho.