RAG: Retrieval Augmented Generation
RAG is the dominant architecture for production AI applications that need to answer questions about specific documents, recent events, or private data. Understanding it completely — including its failure modes — is essential for anyone building AI systems.
The Problem Statement
LLMs have two fundamental limitations that matter for production applications.
First, static knowledge: models are trained on a snapshot of the world up to a cutoff date. A model whose training data ends in October 2023 knows nothing about events from November 2023 onward. For applications involving current information (regulatory changes, company policies, product updates, recent research) this is a critical gap.
Second, hallucination: when an LLM does not know the answer, it often generates a plausible-sounding but incorrect one. Without grounding in real documents, its specific factual claims cannot be trusted.
Fine-tuning partially addresses the first problem, but it is costly and slow, must be repeated whenever the documents change, and does not eliminate hallucination.
RAG addresses both problems by providing the LLM with the specific, accurate, current documents it needs at query time.
Architecture in Detail
Offline Pipeline:
Document loading: ingest from your sources (file system, database, web crawl, API).
Preprocessing: clean text, extract from PDFs, remove boilerplate.
Chunking: split into retrieval units (covered in detail in the chunking video).
Embedding: convert each chunk to a vector.
Indexing: store in a vector database with metadata.
This pipeline runs once on initial setup, then incrementally as documents change.
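The offline steps above can be sketched end to end. This is a toy illustration: the embed function below is a hash-based placeholder standing in for a real embedding model, and the "index" is a plain Python list rather than a vector database.

```python
import hashlib

def embed(text, dim=8):
    # Placeholder embedding derived from a hash. A real pipeline would
    # call an embedding model here instead.
    digest = hashlib.sha256(text.lower().encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def chunk(text, size=100, overlap=20):
    # Fixed-size character chunking with overlap; the chunking video
    # covers more sophisticated strategies.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(documents):
    # Store (vector, text, metadata) records; a production system would
    # write these into a real vector store.
    index = []
    for doc_id, text in documents.items():
        for i, c in enumerate(chunk(text)):
            index.append({"vector": embed(c), "text": c,
                          "metadata": {"doc": doc_id, "chunk": i}})
    return index

docs = {"policy.txt": "Refunds are available within 30 days of purchase. " * 5}
index = build_index(docs)
print(len(index), "chunks indexed")
```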
Online Pipeline:
Query processing: receive the user question. Optionally rewrite the query to improve retrieval (rephrase it to better match document vocabulary).
Query embedding: convert the question to a vector.
Retrieval: find the top-k most similar chunks in the vector database.
Optional reranking: improve the ordering of the retrieved chunks.
Context assembly: format the retrieved chunks as a context block.
Prompt construction: combine system prompt, context, and user question.
LLM call: generate the answer.
Optional verification: check the answer or extract citations.
Response: return to the user.
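A minimal sketch of the core online steps: brute-force top-k retrieval plus prompt assembly. It assumes the index is a list of records with "vector" and "text" fields and that the question has already been embedded; the toy vectors and texts below are invented for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=3):
    # Brute-force top-k similarity search; a vector database replaces
    # this with an approximate nearest neighbor index at scale.
    scored = sorted(index, key=lambda r: cosine(query_vec, r["vector"]),
                    reverse=True)
    return scored[:k]

def build_prompt(question, chunks):
    # Number the chunks so the model can cite them.
    context = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    return ("Answer using only the context below. "
            "If the answer is not in the context, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

toy_index = [
    {"vector": [1.0, 0.0], "text": "refund policy chunk"},
    {"vector": [0.0, 1.0], "text": "shipping times chunk"},
    {"vector": [0.9, 0.1], "text": "refund exceptions chunk"},
]
top = retrieve([1.0, 0.0], toy_index, k=2)
print(build_prompt("What is the refund policy?", top))
```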
Naive RAG vs Advanced RAG
Naive RAG: The basic pipeline described above. Query -> embed -> retrieve -> generate. Fast to build. Quality limited by retrieval precision and chunk quality.
Advanced RAG introduces additional steps:
Pre-retrieval:
Query decomposition: split complex questions into sub-questions
HyDE: generate a hypothetical answer and embed that for retrieval
Query expansion: generate multiple phrasings of the same question
Retrieval:
Hybrid search (dense + sparse)
Multiple retrieval strategies with result fusion
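Result fusion is commonly done with Reciprocal Rank Fusion (RRF), which combines ranked lists from multiple retrievers without having to calibrate dense and sparse scores against each other. A minimal sketch, with invented document ids:

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: each ranking is a list of document ids,
    # best first. A document's fused score is the sum of 1/(k + rank + 1)
    # across every list it appears in; k=60 is the conventional default.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]    # e.g. from embedding similarity
sparse = ["d1", "d5", "d3"]   # e.g. from BM25
print(rrf_fuse([dense, sparse]))  # -> ['d1', 'd3', 'd5', 'd7']
```

d1 ranks highly in both lists, so it beats d3 even though d3 tops the dense ranking.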
Post-retrieval:
Reranking with a cross-encoder model
Contextual compression: remove irrelevant parts of retrieved chunks
Long context reorder: put most relevant chunks at the start and end
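The long-context reorder can be implemented by alternately placing chunks at the front and back of the sequence, so the least relevant material lands in the middle, where models attend to it least. A sketch, assuming the chunks arrive sorted most-relevant first:

```python
def reorder_for_long_context(chunks_by_relevance):
    # Input: chunks sorted most-relevant first.
    # Output: most relevant chunks at the start and end, least relevant
    # in the middle, countering the "lost in the middle" effect.
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_long_context(["c1", "c2", "c3", "c4", "c5"]))
# -> ['c1', 'c3', 'c5', 'c4', 'c2']
```

The two most relevant chunks (c1, c2) end up at the edges; the least relevant (c5) sits in the middle.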
Post-generation:
Faithfulness check: verify answer is grounded in retrieved context
Citation generation: identify which chunks supported which claims
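As an illustration of citation generation, a crude lexical-overlap heuristic can map each answer sentence back to the chunk that best supports it; production systems typically use an NLI model or a second LLM call instead. The 0.5 threshold below is an arbitrary choice for this sketch.

```python
def cite_sentences(answer_sentences, chunks, threshold=0.5):
    # For each answer sentence, find the chunk sharing the largest
    # fraction of its words; return its index, or None when no chunk
    # clears the threshold (a possible sign of an ungrounded claim).
    citations = []
    for sent in answer_sentences:
        words = set(sent.lower().split())
        best_idx, best_overlap = None, 0.0
        for i, c in enumerate(chunks):
            overlap = len(words & set(c.lower().split())) / max(len(words), 1)
            if overlap > best_overlap:
                best_idx, best_overlap = i, overlap
        citations.append(best_idx if best_overlap >= threshold else None)
    return citations

chunks = ["refunds are available within 30 days", "shipping takes 5 business days"]
answer = ["refunds are available within 30 days", "the moon is made of cheese"]
print(cite_sentences(answer, chunks))  # -> [0, None]
```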
Failure Mode Analysis
Each failure mode has a distinct diagnostic signature.
Bad retrieval: detect this by checking whether the ground-truth chunk for a test question appears in the retrieved results. If the hit rate on known questions falls below roughly 70-80%, retrieval is the problem.
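This diagnostic can be automated with a small evaluation harness. Here retrieve_fn is a placeholder for whatever retrieval function your pipeline exposes; the stub and chunk ids below are invented for the demo.

```python
def retrieval_hit_rate(test_cases, retrieve_fn, k=5):
    # test_cases: list of (question, ground_truth_chunk_id) pairs.
    # retrieve_fn(question, k) returns a list of retrieved chunk ids.
    # A rate below roughly 0.7-0.8 points at retrieval as the bottleneck.
    hits = sum(1 for q, truth in test_cases if truth in retrieve_fn(q, k))
    return hits / len(test_cases)

# Stub retriever for demonstration; replace with your real pipeline.
stub = lambda question, k: {"refund question": ["chunk-3", "chunk-9"]}.get(question, [])
cases = [("refund question", "chunk-3")]
print(retrieval_hit_rate(cases, stub, k=2))  # -> 1.0
```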
Context not utilized: Give the model a prompt that explicitly states "The following context contains the answer. Answer using only the information in the context." If the model still ignores the context, the model itself has poor instruction following.
Hallucination despite context: This is different from context not utilized. The model reads the context, misunderstands it, and generates a plausible-sounding but incorrect synthesis. The fix usually requires better prompting and sometimes a more capable model.
Stale or incorrect documents: The retrieval works correctly, but the retrieved documents contain outdated or incorrect information. This is a data quality problem, not a retrieval problem. Solution: implement document freshness tracking and re-indexing pipelines.
What Comes Next
The conceptual understanding is complete. You now know what vectors are, how semantic search uses them, how vector databases store them efficiently, how chunking prepares documents for retrieval, and how RAG assembles these components into a working system.
The next video implements this entire system in Python without any frameworks.

