Chunking: Why Document Splitting Determines RAG Quality


Developers often focus on which LLM to use or which embedding model is best. In practice, chunking strategy has a larger impact on RAG quality than either of those choices. A mediocre LLM with good chunking outperforms a state-of-the-art LLM with poor chunking.

The Core Constraint

 LLM context windows are measured in tokens. One token is approximately four characters or three-quarters of a word. A typical paragraph is 80 to 120 tokens. A page of text is 500 to 700 tokens.

Context window limits exist not only because of model architecture constraints but also because of cost and latency. At $0.03 per 1,000 input tokens for GPT-4, passing 10,000 tokens of document context per query costs $0.30 per query. At 10,000 queries per day, that is $3,000 per day in input tokens alone.

 Chunking makes retrieval selective. Instead of passing the entire document, you pass only the 3 to 5 chunks most relevant to the query. Typical context usage drops from 10,000 tokens to 1,000 to 2,000 tokens.

Chunking Strategy Comparison:

 1. Fixed Size Chunking:

 Implementation: split text every N characters, advancing by (N - overlap) characters per step.

Use case: Quickly set up a working system. Acceptable quality for homogeneous text.

Failure mode: Splits in the middle of sentences, tables, and code. No awareness of document structure.
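The sliding-window arithmetic above can be sketched in a few lines. The 500-character size and 50-character overlap below are illustrative defaults, not recommendations:

```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters, advancing
    by (size - overlap) characters per step."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Note that nothing here looks at the content: a window boundary is just as likely to land mid-sentence or mid-table as anywhere else, which is exactly the failure mode described above.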

 2. Recursive Character Text Splitting:

 Implementation: Attempt to split on double newline. If chunk is too large, split on single newline. If still too large, split on space. If still too large, split on character.

Use case: General prose documents. Good default choice.

Strengths: Preserves natural boundaries where they exist. Degrades gracefully.

Weakness: Still unaware of document-level structure (headings, sections).
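The separator cascade described above might look like the following minimal sketch. The separator list and the 500-character threshold are assumptions for illustration, and production libraries handle more edge cases:

```python
def recursive_split(text: str, max_len: int = 500,
                    separators=("\n\n", "\n", " ", "")) -> list[str]:
    """Try the coarsest separator first; recurse with finer
    separators on any piece that is still too large."""
    if len(text) <= max_len:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_len:
                # This piece alone exceeds the limit: descend a level.
                chunks.extend(recursive_split(piece, max_len, tuple(rest)))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```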

 3. Semantic Chunking:

 Implementation: Embed each sentence. Compute similarity between adjacent sentences. When similarity drops significantly below the local average, insert a chunk boundary.

Use case: Long-form documents with clear topic transitions.

Strengths: Chunks align with topic boundaries, producing coherent self-contained chunks.

Weakness: Requires embedding every sentence during indexing — expensive for large document collections.
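A runnable sketch of the boundary rule follows. The bag-of-words `embed` function is a toy stand-in so the example runs without a model; in practice you would call a real sentence-embedding model, and the 0.5 drop factor is an assumed tuning knob:

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    # Toy bag-of-words "embedding" -- replace with a real
    # sentence-embedding model in production.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], drop: float = 0.5) -> list[list[str]]:
    """Start a new chunk when adjacent-sentence similarity falls
    significantly (below `drop` times the running average)."""
    embs = [embed(s) for s in sentences]
    sims = [cosine(embs[i], embs[i + 1]) for i in range(len(embs) - 1)]
    chunks, current, seen = [], [sentences[0]], []
    for i, sim in enumerate(sims):
        seen.append(sim)
        avg = sum(seen) / len(seen)
        if sim < drop * avg:
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks
```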

Parent Document Retrieval:

 Implementation: Index small child chunks (100-200 tokens) for high-precision retrieval. When a child chunk is retrieved, return its larger parent chunk (500-1000 tokens) as the actual context.

Use case: When you want both retrieval precision and answer context.

Strengths: Best of both worlds. Retrieval finds specific content; generation has enough context.

Weakness: More complex implementation, larger storage requirements.
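The child-to-parent mapping can be sketched as below. Word overlap stands in for vector search purely to keep the example self-contained; the child size is an assumed parameter:

```python
def build_index(parents: list[str], child_size: int = 150):
    """Index small child chunks, each recording its parent's id."""
    children = []  # (child_text, parent_id)
    for pid, parent in enumerate(parents):
        for i in range(0, len(parent), child_size):
            children.append((parent[i:i + child_size], pid))
    return children

def retrieve_parent(children, parents, query: str) -> str:
    """Score children by naive word overlap (a stand-in for vector
    search), then return the best child's parent as the context."""
    q_words = set(query.lower().split())
    def overlap(text: str) -> int:
        return len(q_words & set(text.lower().split()))
    _, pid = max(children, key=lambda c: overlap(c[0]))
    return parents[pid]
```

The storage cost shows up directly here: every parent is stored once as context and again, in pieces, as indexed children.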

Document-Specific Considerations:

 PDFs: PDF text extraction is lossy. Tables are often extracted as garbled text. Page headers and footers appear in the middle of content. Use a PDF parser that understands structure (PDFMiner, PyMuPDF) rather than plain text extraction.

 HTML: Parse the DOM structure. Chunk by heading hierarchy. Preserve heading context by prepending it to each chunk: "Section: Installation > Subsection: Requirements > [chunk content]".

 Markdown: Use the heading structure to define chunk boundaries. This aligns with how the author intended the document to be navigated.
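A minimal heading-boundary chunker for Markdown, which also demonstrates the heading-context prepending described for HTML, might look like this. It is a sketch: lines beginning with `#` inside fenced code blocks would need extra handling:

```python
def markdown_chunks(md: str) -> list[str]:
    """Split a markdown document at headings and prepend the heading
    trail so each chunk stays interpretable on its own."""
    trail = {}           # heading level -> current heading text
    chunks, body = [], []

    def flush():
        if body and "".join(body).strip():
            context = " > ".join(trail[l] for l in sorted(trail))
            chunks.append(f"Section: {context}\n" + "\n".join(body).strip())
        body.clear()

    for line in md.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            trail[level] = line.lstrip("#").strip()
            # A new section at this level invalidates deeper headings.
            for l in [k for k in trail if k > level]:
                del trail[l]
        else:
            body.append(line)
    flush()
    return chunks
```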

 Code: Chunk at function or class boundaries using an AST parser, not a character splitter. Function-level chunks are semantically coherent.
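For Python source, the standard library's `ast` module is enough to sketch the idea; for other languages a parser such as tree-sitter is commonly used instead:

```python
import ast

def function_chunks(source: str) -> list[str]:
    """Split Python source at top-level function and class
    boundaries using the ast module, not a character splitter."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, (ast.FunctionDef,
                                 ast.AsyncFunctionDef,
                                 ast.ClassDef))]
```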

 Tabular data: Do not chunk tables mid-row. Preserve the header row in every chunk that contains table data. Consider converting tables to text descriptions before chunking.

Evaluating Your Chunking Strategy:

 Manual inspection: Read a random sample of 20 chunks. Assess: Is each chunk self-contained? Does it contain a complete idea? Would you expect a user query to be satisfied by this chunk?

 Retrieval hit rate: Create a test set of 50 to 100 questions where you know which chunk contains the answer. Measure what percentage of the time that chunk appears in the top-3 retrieved results. Iterate on chunk size and strategy until hit rate is acceptable.
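The hit-rate measurement reduces to a few lines. This sketch assumes a `retrieve(question)` function that returns a ranked list of chunk ids, which is not prescribed by any particular library:

```python
def hit_rate_at_k(test_set, retrieve, k: int = 3) -> float:
    """Fraction of (question, gold_chunk_id) pairs whose gold chunk
    appears in the top-k results returned by `retrieve`."""
    hits = sum(1 for question, gold_id in test_set
               if gold_id in retrieve(question)[:k])
    return hits / len(test_set)
```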

 Chunk size sensitivity: Run the same test set with chunk sizes of 200, 400, 600, and 800 tokens. Plot hit rate vs chunk size. The optimal chunk size varies by document type.

The relationship between chunk quality and answer quality is direct. Poor chunking means poor retrieval. Poor retrieval means the LLM answers from incomplete or wrong information. No amount of prompt engineering fixes bad retrieval.