Token 1.3: What is Retrieval-Augmented Generation (RAG)?

We discuss the origins of RAG, which LLM limitations it tries to fix, its architecture, and why it is so popular. Enjoy the collection of helpful links.

Oct 05, 2023


In previous articles of the “FM/LLM – the task-centric ML” series, we explored several systematic concepts that help in understanding and framing the discussion correctly. However, it's crucial to remain practical, so to enrich this series we will meld theory with practice.

One term that has become a buzzing topic recently is RAG.

What is it, and how can you use it to improve an LLM's performance? Let’s dive in!

We discuss the origins of RAG, which LLM limitations it tries to fix, RAG’s architecture, and why it is so popular. You will also get a curated collection of helpful links for your RAG experiments.

Introduction

Though the term has been circulating widely of late, it dates back to 2020, when researchers at Meta AI introduced it in their paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

The Retrieval-Augmented Generation (RAG) model is an architecture designed to harness the capabilities of large language models (LLMs) while leaving you free to incorporate and update custom data at will. Compared with the resource-intensive process of building bespoke language models, or repeatedly fine-tuning them whenever the data changes, RAG offers developers and businesses a more streamlined and efficient approach.

As you probably know, pre-trained language models undergo training using vast amounts of unlabeled text data in a self-supervised* manner. Consequently, these models acquire a significant depth of knowledge, leveraging the statistical relationships underlying the language data they have been trained on.

*Self-supervised learning uses unlabeled data to generate its own supervisory signal for training models.

This knowledge is encapsulated within the model's parameters, which can be harnessed to execute various language-related tasks without the need for external knowledge sources. This phenomenon is commonly referred to as a parameterized implicit knowledge base.

Although this parameterized implicit knowledge base is very impressive and lets the model perform surprisingly well on some queries and tasks, the approach is still prone to errors and so-called hallucinations*.

*Hallucination in language models occurs when false information is generated and presented as true.

Why do errors happen in LLMs?

It is essential to recognize that LLMs do not possess a genuine understanding of language in the human sense. They rely on statistical patterns within the language they were trained on. Recent research suggests that, however much implicit knowledge a model holds, it can still struggle with logical reasoning. While LLMs have achieved significant success in text generation, they still have problems using the data they already have, which often results in hallucinations.

How can we deal with this? Perhaps surprisingly, by introducing external data. It can be used to expand or revise the model’s memory and as a basis for assessing and interpreting its predictions. This is precisely what the Meta AI researchers implemented in a new type of model called RAG.

Other limitations of current LLMs

Apart from hallucinations, contemporary language models have a significant shortcoming for companies that want to adopt them: they lack the context of the company's internal data. Addressing this through fine-tuning means ML practitioners must re-tune the model every time the data changes. RAG addresses this limitation as well.

RAG Architecture

RAG combines two main components:

1. The Retriever: This component is a pre-trained neural retriever that accesses non-parametric data external to the language model, stored as a dense vector index. The original paper uses the Dense Passage Retriever to access a dense vector index of Wikipedia.

2. The Generator: This is a pre-trained language model. In the original paper, it's a pre-trained seq2seq transformer.
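To make the retriever's role more concrete, here is a minimal sketch of a lookup against a dense vector index, ranking passages by inner product with the query vector. The three-dimensional vectors are invented for illustration; in a real system they would come from a trained encoder such as the Dense Passage Retriever, and the names INDEX, inner_product, and retrieve are hypothetical.

```python
# Hypothetical 3-dimensional embeddings, made up for illustration.
# A real retriever would produce these with a trained neural encoder.
INDEX = {
    "Paris is the capital of France.": [0.9, 0.1, 0.0],
    "The Nile is a river in Africa.": [0.1, 0.9, 0.1],
    "Mount Everest is the highest mountain on Earth.": [0.0, 0.2, 0.9],
}


def inner_product(a: list[float], b: list[float]) -> float:
    """Similarity score between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k passages with the highest inner product, best first."""
    scored = [(passage, inner_product(query_vec, vec)) for passage, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embedding standing in for "What is the capital of France?"
query_vec = [0.8, 0.2, 0.1]
hits = retrieve(query_vec, INDEX, k=2)
for passage, score in hits:
    print(f"{score:.2f}  {passage}")
```

This brute-force scan is fine for a handful of passages; at the scale of a Wikipedia index, an approximate nearest-neighbor search library would typically take its place.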

The retriever and generator components are trained jointly without any direct supervision on which document should be retrieved.
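One way to see why no retrieval labels are needed is to treat the retrieved passage as a latent variable and sum over it. The following is a sketch in the spirit of the paper's RAG-Sequence formulation; here x is the input, y the target output, z a retrieved passage, q(x) and d(z) the query and passage embeddings, and η and θ the retriever and generator parameters.

```latex
% Retriever: score passages by inner product of their embeddings
p_\eta(z \mid x) \propto \exp\!\big(\mathbf{d}(z)^{\top}\mathbf{q}(x)\big)

% Model output: marginalize over the top-k retrieved passages
p(y \mid x) \approx \sum_{z \,\in\, \text{top-}k\,(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\; p_\theta(y \mid x, z)

% Training objective: negative marginal log-likelihood over (x_j, y_j) pairs
\mathcal{L} = -\sum_{j} \log p(y_j \mid x_j)
```

Because the loss depends only on input-output pairs, gradients can flow to both the generator and the retriever's query side without anyone annotating which passage was the right one.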

Here's how the architecture works when a user submits a query:

Retrieve Phase: First, the retriever searches for and fetches the most relevant text passages for the user's query.
Generative Phase: In this phase, the transformer generates text based on its parameterized implicit knowledge, the user's query, and the retrieved text. In other words, the final generated text results from not only the generator's built-in knowledge but also the external data fetched in the first step.
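The two phases can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's implementation: the retriever is a toy word-overlap ranker instead of a neural one, the retrieved text is simply prepended to the query as a prompt, and generate stands for whatever LLM call you would plug in. All names here (DOCS, retrieve, build_prompt, rag_answer) are hypothetical.

```python
from typing import Callable

DOCS = [
    "RAG was introduced by Meta AI researchers in 2020.",
    "The Dense Passage Retriever encodes passages as dense vectors.",
    "Seq2seq transformers map an input sequence to an output sequence.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Retrieve phase (toy version): rank passages by word overlap with the query."""
    q = tokenize(query)
    return sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Combine the retrieved passages and the user's query into one input."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def rag_answer(query: str, generate: Callable[[str], str], k: int = 1) -> str:
    """Generative phase: the model sees both the query and the retrieved text."""
    passages = retrieve(query, DOCS, k)
    return generate(build_prompt(query, passages))


def placeholder_generate(prompt: str) -> str:
    """Stand-in for a real LLM call; it only reports what it received."""
    return f"(model output for a prompt of {len(prompt)} characters)"


print(build_prompt("Who introduced RAG in 2020?", retrieve("Who introduced RAG in 2020?", DOCS)))
print(rag_answer("Who introduced RAG in 2020?", placeholder_generate))
```

Keeping generate as a parameter makes it easy to swap in any model without touching the retrieval step.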

This approach augments a generation model's parametric memory (the knowledge and capabilities embodied in the pre-trained seq2seq model) with non-parametric memory (an extensible, real-time knowledge source that isn’t confined to a fixed set of parameters), combining the two through fine-tuning.

Why is RAG so popular?

RAG's popularity arises from its unique combination of advantages and cost-effectiveness. It combines the generation flexibility of "closed-book" (parametric-only) approaches with the performance of "open-book" retrieval-based approaches.

RAG's benefits:

Dynamic Knowledge Control: RAG allows easy modification and supplementation of its internal knowledge without retraining the entire model, saving time and resources.
Current and Reliable Information: It gives the model access to up-to-date and trustworthy facts, provided the underlying knowledge source is kept current.
Transparent Source Verification: Users can verify the model's claims by accessing its sources, enhancing trust in its outputs.
Mitigation of Information Leakage: RAG's grounding in external, verifiable facts reduces the chances of the model leaking sensitive data or generating incorrect information.
Domain-Specific Knowledge: RAG can provide extra context and information about your internal data.
Cost-Efficient and Low Maintenance: RAG reduces the need for continuous training and parameter updates, lowering computational and financial costs for enterprise LLM-powered chatbots.
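Two of these benefits, dynamic knowledge control and source verification, can be illustrated with a small sketch. It assumes a hypothetical in-memory KnowledgeStore with a toy word-overlap search; the document IDs and policy texts are invented. The point is only that updating a document changes what gets retrieved, with no model retraining, and that each hit carries an ID a user could check.

```python
def _words(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


class KnowledgeStore:
    """Hypothetical external knowledge source that can be edited at any time."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        """Add a new document or replace an existing one under the same ID."""
        self._docs[doc_id] = text

    def search(self, query: str, k: int = 1) -> list[tuple[str, str]]:
        """Return (doc_id, text) pairs so every retrieved fact has a traceable source."""
        q = _words(query)
        ranked = sorted(
            self._docs.items(),
            key=lambda item: len(q & _words(item[1])),
            reverse=True,
        )
        return ranked[:k]


store = KnowledgeStore()
store.upsert("policy-v1", "The refund window is 14 days.")
store.upsert("shipping-v1", "Orders ship within 2 business days.")

before = store.search("How long is the refund window?")

# The policy changes: update the store, not the model.
store.upsert("policy-v1", "The refund window is 30 days.")
after = store.search("How long is the refund window?")

print(before)
print(after)
```

The same query returns the revised policy immediately after the update, which is the behavior that fine-tuning alone would need another training run to achieve.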

Conclusion

The Retrieval-Augmented Generation (RAG) model is a step toward addressing the limitations of LLMs. By blending pre-trained parametric and extensible non-parametric memory, it aims to tackle outdated information and hallucinations. This mix allows for more accurate and domain-specific responses, presenting a cost-effective, low-maintenance option for enterprises. However, the effectiveness of RAG is influenced by the precision of the embedding algorithm, database performance, and the context window size of the foundational model. Additionally, retrieval latency and reliance on initial training data highlight areas for further refinement. While RAG models present a notable advancement, they also underline the continued journey toward achieving more reliable and evolved language models in ML and NLP.

Tools for RAG implementation →
