Why Chunking Is the Most Underrated Step in Your RAG Pipeline
Most teams building RAG systems spend weeks choosing the right vector database, comparing embedding models, and fine-tuning their prompts. Then they split their documents into arbitrary 500-character blocks and wonder why retrieval quality is inconsistent.
The truth is, chunking (how you break documents into smaller pieces before embedding) is one of the biggest levers for improving retrieval accuracy. NVIDIA’s 2024 benchmarks showed up to a 9% gap in recall between the best and worst chunking approaches on the same dataset. That’s the difference between a system that finds the right answer and one that surfaces irrelevant noise.
Yet chunking remains the step most teams rush through. This article explains why it matters so much, what the main strategies are, and how to think about choosing the right one for your use case.
What chunking actually does
When a document gets embedded into a vector, the entire content is compressed into a single numerical representation. If that content is an entire 40-page financial report, the resulting vector becomes a vague average of everything — revenue figures, risk disclosures, forward-looking statements, and legal footnotes all blended together.
When a user asks “What was the Q3 operating margin?”, that diluted vector matches weakly because it represents too many topics at once.
Chunking solves this by splitting the document into focused segments, each representing a coherent piece of information. Each chunk gets its own vector, and retrieval becomes about finding the specific segment that answers the question — not the document that vaguely relates to it.
But the way you split changes everything.
The main chunking strategies
Fixed-size chunking
The simplest approach: split text into equal-sized pieces based on character or token count. Every chunk is, say, 500 tokens long, regardless of where sentence or paragraph boundaries fall.
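To make that concrete, here’s a minimal sketch of a fixed-size chunker in Python. The whitespace tokenization and the function name are simplifications for illustration; a real pipeline would count tokens with the embedding model’s own tokenizer:

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` whitespace tokens.

    Consecutive windows share `overlap` tokens so content near a
    boundary appears in both neighbors (more on overlap below).
    """
    tokens = text.split()  # stand-in for a real tokenizer
    if not tokens:
        return []
    step = chunk_size - overlap
    return [
        " ".join(tokens[i : i + chunk_size])
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]
```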
This works for homogeneous content like news articles or blog posts where the structure is relatively uniform. It’s fast, cheap, and easy to implement.
The downside is obvious — a chunk boundary can land in the middle of a sentence, a table, or a critical explanation. The meaning gets severed, and retrieval suffers because the embedding captures an incomplete thought.
Best for: Prototyping, or corpora where all documents have similar structure and length.
Recursive chunking
Instead of cutting blindly at a character count, recursive chunking tries to split on natural boundaries. It first attempts to break on paragraph boundaries (double newlines). If the resulting piece is still too large, it falls back to single newlines, then sentence endings, then individual words.
This hierarchy preserves the natural structure of the document far better than fixed-size approaches. A paragraph about revenue stays together. A list of requirements doesn’t get split across two chunks with different contexts.
This is the default approach in LangChain, and for good reason — it strikes a practical balance between simplicity and quality for most document types.
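Assuming a recent LangChain install (the splitter now lives in the langchain-text-splitters package; older versions import it from langchain.text_splitter), the setup looks roughly like this:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # measured in characters by default; pass a token counter via length_function if needed
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],  # paragraphs, then lines, then sentences, then words
)

document_text = open("report.md").read()  # any converted document
chunks = splitter.split_text(document_text)
```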
Best for: General-purpose RAG pipelines. A solid default when you’re not sure what strategy to use.
Semantic chunking
Rather than relying on formatting cues like newlines or periods, semantic chunking groups sentences by their meaning. It computes embeddings for individual sentences and measures how similar consecutive ones are. When the similarity drops — indicating a topic shift — it introduces a break.
The result: chunks that represent coherent topics rather than arbitrary text windows. A financial report where revenue discussion flows into risk assessment gets split at the natural boundary between those topics, even if there’s no clear heading or paragraph break.
The trade-off is compute cost. Every sentence needs to be embedded during the chunking phase itself, which adds overhead and latency to ingestion. For large-scale pipelines processing thousands of documents, this cost adds up.
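A bare-bones version of the idea, to show the mechanics rather than a production implementation (the `embed` argument stands in for any sentence-embedding model, and the fixed 0.7 threshold is illustrative; real implementations often use a percentile-based cutoff):

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    """Group consecutive sentences, starting a new chunk whenever the
    cosine similarity between neighbors drops below `threshold`."""
    if not sentences:
        return []
    vecs = np.asarray(embed(sentences), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(vecs[i - 1] @ vecs[i]) < threshold:  # topic shift detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```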
Best for: Documents with mixed topics, long-form content without clear structural markers, or use cases where retrieval precision is critical.
Structure-aware chunking
For documents with clear structural elements — headings, sections, numbered lists, tables — the most effective approach is to respect that existing structure during chunking.
This means splitting at heading boundaries, keeping table content together, and preserving the hierarchy of sections and subsections. Each resulting chunk carries metadata about where it sits in the document structure: which section it belongs to, what heading it falls under, what page it came from.
This is where the quality of the upstream document conversion step becomes critical. If a PDF gets converted to plain text, the headings, tables, and section boundaries are lost — and no chunking strategy can recover that information. But when conversion produces clean Markdown with preserved structure, those structural markers become natural and highly effective chunk boundaries.
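If your conversion step does produce Markdown, one readily available option is LangChain’s MarkdownHeaderTextSplitter, which splits on heading levels and attaches each heading to the resulting chunks as metadata (package name and API as of recent LangChain versions):

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "section"), ("##", "subsection")],
)

markdown_text = open("converted_report.md").read()  # output of the conversion step
docs = splitter.split_text(markdown_text)
for doc in docs:
    print(doc.metadata, doc.page_content[:60])  # heading path travels with each chunk
```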
This is core to what we do at Monkt. Our platform converts documents and web pages into structured Markdown that preserves semantic elements — headings, tables, lists, code blocks, and document hierarchy. These structural markers are exactly what structure-aware chunking needs to produce high-quality segments. As we extend our pipeline with configurable chunking capabilities, this connection between clean conversion and effective chunking is what we’re building on.
Best for: Technical documentation, research papers, reports with clear section hierarchy, and any content where structure carries meaning.
How to choose the right chunk size
NVIDIA tested chunk sizes of 128, 256, 512, 1,024, and 2,048 tokens across diverse datasets including financial reports, earnings presentations, and technical documentation. The findings align with what most practitioners observe in production:
Smaller chunks (256–512 tokens) work better for factual, lookup-style questions. “What was the Q3 revenue?” needs a focused vector that points to a specific number in a specific paragraph. A small chunk produces a precise embedding that matches tightly against such queries.
Larger chunks (512–1,024 tokens) work better for questions requiring broader context. “Summarize the company’s risk factors” benefits from chunks that capture enough surrounding information to form a complete picture.
The overlap between consecutive chunks also matters. When a chunk boundary falls near an important passage, some content from the end of one chunk should repeat at the beginning of the next to prevent information loss. Industry best practice is 10–20% overlap — for a 512-token chunk, that means 50–100 tokens of shared content at the boundaries.
For most teams starting out, 256–512 tokens with 10–15% overlap and recursive splitting is a strong default configuration.
Why metadata makes chunks more useful
A raw text chunk on its own is missing critical context. Where did it come from? What section of the document? What page? What type of document was it?
Attaching metadata to each chunk — source filename, page number, section heading, chunk position, token count — enables much more sophisticated retrieval. A query about “termination clauses” can first filter to contract documents, then rank by semantic similarity within that subset. This hybrid approach (metadata filtering plus vector search) consistently outperforms pure vector similarity alone.
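In a real system the metadata filter runs inside your vector database, but the shape of the operation is simple enough to show in memory. The field names (`doc_type`, `section`) and the 384-dimensional embeddings below are illustrative:

```python
import numpy as np

# One record per chunk: text, its embedding, and the metadata described above.
chunks = [
    {
        "text": "Either party may terminate this agreement...",
        "embedding": np.random.rand(384),  # stand-in for a real embedding
        "metadata": {"source": "msa.pdf", "doc_type": "contract",
                     "section": "Termination", "page": 12},
    },
    # ... more chunks ...
]

def hybrid_search(query_vec, chunks, doc_type, k=5):
    """Filter on metadata first, then rank survivors by cosine similarity."""
    candidates = [c for c in chunks if c["metadata"]["doc_type"] == doc_type]
    def score(c):
        v = c["embedding"]
        return float(query_vec @ v) / (np.linalg.norm(query_vec) * np.linalg.norm(v))
    return sorted(candidates, key=score, reverse=True)[:k]
```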
Metadata also helps with deduplication and freshness. When a document gets updated and re-ingested, metadata lets the system identify and replace only the affected chunks instead of reprocessing the entire corpus.
The conversion-chunking connection
One of the most common failure modes in RAG pipelines is not a bad chunking strategy — it’s bad input to an otherwise reasonable chunking strategy.
When a PDF gets converted through a basic text extraction tool, the output is often a wall of unstructured text. Headers look the same as body text. Tables become garbled rows of numbers. Page headers and footers get mixed into the content. Footnotes appear in random positions.
No chunking algorithm can produce coherent segments from incoherent input.
This is why document conversion and chunking should be thought of as a single, connected pipeline rather than two independent steps. Clean, structured input — with preserved headings, table formatting, and semantic hierarchy — gives any chunking strategy dramatically better raw material to work with.
At Monkt, we’ve spent the past year focused on making that conversion step as reliable as possible across PDFs, Word documents, PowerPoint presentations, Excel sheets, and web pages. What we’ve learned is that the teams getting the best results from their RAG systems are the ones who invest in the quality of their ingestion pipeline before they invest in prompt engineering or model selection.
What to do next
If you’re building or improving a RAG system, here’s a practical sequence:
Start with conversion quality. Make sure your documents are being converted with structure preserved — headings, tables, lists, and hierarchy intact. This is the foundation everything else depends on.
Default to recursive chunking at 256–512 tokens with 10–15% overlap. This works well for the majority of document types and gives you a solid baseline to measure against.
Add metadata to every chunk. Source document, section heading, page number, and position within the document. This enables hybrid retrieval and makes debugging much easier.
Measure retrieval quality. Create a small evaluation set of queries with known relevant passages and test your chunking configuration against it; a minimal version of such a check is sketched after this list. Without measurement, you’re guessing.
Iterate. Try semantic chunking on your most heterogeneous content. Experiment with different chunk sizes for different document types. The best production pipelines often use different strategies for different content.
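As promised above, here is a minimal retrieval check. `retrieve` stands in for whatever search function your pipeline exposes (it should return ranked chunk ids), and recall@k is just one reasonable starting metric; MRR or nDCG work too:

```python
def recall_at_k(eval_set, retrieve, k: int = 5) -> float:
    """Fraction of queries whose known-relevant chunk id shows up in
    the top-k results. `eval_set` is a list of
    (query_text, relevant_chunk_id) pairs."""
    hits = sum(
        1 for query, relevant_id in eval_set
        if relevant_id in retrieve(query)[:k]
    )
    return hits / len(eval_set)

# Compare two chunking configurations on the same evaluation set:
# eval_set = [("What was the Q3 operating margin?", "report-2024#chunk-17"), ...]
# print(recall_at_k(eval_set, retrieve_fixed), recall_at_k(eval_set, retrieve_recursive))
```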
Chunking is not a one-time configuration decision. It’s an ongoing optimization that directly shapes the quality of every answer your RAG system produces. Getting it right is worth the investment.