Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an architecture that fetches relevant text from an external knowledge source at query time and inserts it into a language model's prompt, so the model generates answers grounded in that supplied evidence rather than relying solely on the parameters learned during training. The term was coined in a 2020 paper from Facebook AI Research that paired a neural retriever with a sequence-to-sequence generator. The pattern separates what the model knows from what it says: knowledge lives in a searchable corpus you control, while the model handles synthesis and phrasing. Because the corpus can be updated, swapped, or scoped per user without retraining, RAG answers can reflect current, private, or permission-gated data that a frozen model has never seen. A typical stack embeds documents into a vector database, retrieves the top matches for each query, and passes them to the model alongside the user's question. The generated response can then cite its sources, making claims traceable back to specific passages.

How it works

The pipeline is retrieve → augment → generate. First, the user's query is converted into an embedding and matched against a pre-indexed corpus to pull the most relevant chunks. The index is often a vector database, sometimes paired with keyword search in a hybrid setup. Those chunks are then concatenated into the prompt, usually with instructions telling the model to answer only from the provided context. Finally, the model generates a response conditioned on both the question and the retrieved passages. A reranking step frequently sits between retrieval and augmentation to reorder candidates by relevance before they consume context budget.
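
The three stages can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a reference implementation: the corpus, the function names (embed, retrieve, augment, generate, rag_answer), and the prompt wording are all hypothetical, the "embedding" is a bag-of-words count rather than a neural model, and generate is a placeholder where a real model call would go. The reranking step is left out to keep the sketch short.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; a real system would store chunks in a vector index.
CORPUS = [
    {"id": "doc-1", "text": "RAG retrieves passages at query time and adds them to the prompt."},
    {"id": "doc-2", "text": "Fine-tuning changes model weights by training on new examples."},
    {"id": "doc-3", "text": "A reranker reorders retrieved passages by relevance to the query."},
]


def embed(text):
    """Stand-in embedding: lowercase term counts (an assumption, not a neural embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Retrieve: score every chunk against the query and keep the top k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:k]


def augment(query, chunks):
    """Augment: place the retrieved chunks and an instruction ahead of the question."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer only from the context below and cite passage ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


def generate(prompt):
    """Generate: placeholder for a model call; swap in an actual LLM client here."""
    return f"(model output for a prompt of {len(prompt)} characters)"


def rag_answer(query, corpus=CORPUS, k=2):
    """Run retrieve, augment, generate and return the answer with its source ids."""
    chunks = retrieve(query, corpus, k)
    prompt = augment(query, chunks)
    return generate(prompt), [c["id"] for c in chunks]
```

Calling rag_answer("How does a reranker reorder passages?") returns the placeholder answer together with the ids of the chunks that were placed in the prompt, which is the hook a real system would use to attach citations.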

Why it matters for AI engineers

RAG is usually cheaper and faster to ship than fine-tuning because updating knowledge means re-indexing documents, not retraining weights. It can reduce hallucination by grounding outputs in inspectable sources, and it enables citations, which matter for auditability and user trust. It also gives you access control: retrieval can filter by user permissions, so the model never sees data the user is not allowed to see. The trade-offs are real, though: retrieval quality caps answer quality, every query pays latency and token cost for the injected context, and poisoned or stale documents flow straight into the output.
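
The access-control point can be made concrete with a small sketch in which chunks are filtered by permission before any ranking happens, so restricted text never reaches the prompt. Everything here is an assumption for illustration: the Chunk type, the allowed_groups field, the group names, and the keyword-overlap scoring are hypothetical, and production systems typically push this filter into the search index itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # hypothetical ACL: groups permitted to read this chunk


def visible_chunks(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]


def keyword_score(query, text):
    """Toy relevance score: number of query words that appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_for_user(query, chunks, user_groups, k=3):
    """Filter by permission first, then rank; restricted chunks are never candidates."""
    candidates = visible_chunks(chunks, user_groups)
    ranked = sorted(candidates, key=lambda c: keyword_score(query, c.text), reverse=True)
    return ranked[:k]


# Hypothetical data: one chunk readable by all staff, one restricted to HR.
CHUNKS = [
    Chunk("handbook", "vacation policy allows 25 days per year", frozenset({"staff", "hr"})),
    Chunk("salaries", "salary bands and vacation payout by level", frozenset({"hr"})),
]
```

With this data, retrieve_for_user("vacation payout", CHUNKS, {"staff"}) can only return the handbook chunk, while the same query with {"hr"} also sees the salaries chunk. Filtering before ranking, rather than after generation, is the design choice that keeps restricted text out of the model's context entirely.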

Retrieval-Augmented Generation (RAG) vs. alternatives

| Approach | Knowledge source | Update cost | Best for |
|---|---|---|---|
| RAG | External corpus, retrieved at query time | Re-index documents | Fresh, private, or citable facts |
| Fine-tuning | Baked into model weights | Retrain the model | Style, format, task behavior |
| Long context | Everything stuffed into the prompt | None, but token-bound | Small, self-contained document sets |