Document Parsing (Unstructured, Docling, PyPDF)

Garbage In, Garbage Out - Your RAG is Only as Good as Your Parsing

Learn how to extract clean, structured text from messy real-world documents - PDFs, Word docs, HTML, scanned images. Master the tools (Unstructured, Docling, PyPDF) that turn chaotic documents into RAG-ready content.

Why is Document Parsing So Hard?

The Messy Reality of Real-World Documents

The Food Processing Analogy

Building a RAG pipeline without good document parsing is like running a restaurant where you buy vegetables from the market but never wash, peel, or chop them. You just dump muddy, uncut vegetables into the cooking pot. The result? A terrible dish. Document parsing is the prep work - it takes raw, messy documents (PDFs with tables, scanned images, multi-column layouts) and converts them into clean, structured text that your embedding model and LLM can actually work with.

What Makes Documents Messy?

PDFs: Not actually text files! They are drawing instructions - "draw letter A at position (72, 450)." Extracting reading order is surprisingly hard.
Tables: Row and column structure gets lost in text extraction. "Price: 499" might end up next to the wrong product.
Scanned Documents: Images of text, not actual text. Need OCR (Optical Character Recognition) first.
Multi-Column Layouts: Newspapers, research papers. Text extraction might jump between columns randomly.
Headers/Footers/Page Numbers: "Page 47 of 128" is noise that pollutes your chunks.
Mixed Content: Text + images + charts + code blocks all in one document.
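
To make the reading-order problem concrete, here is a minimal sketch. The `Fragment` type, the coordinates, and the fixed column boundary are illustrative assumptions, not the output of any real parser. It shows how sorting positioned text top-to-bottom interleaves two columns, while a column-aware sort keeps each column together.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float   # left edge of the fragment, in points (assumed layout)
    y: float   # distance from the top of the page, in points
    text: str


def naive_order(fragments):
    """Sort top-to-bottom, then left-to-right, ignoring columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_aware_order(fragments, column_boundary):
    """Read the whole left column first, then the whole right column."""
    left = [f for f in fragments if f.x < column_boundary]
    right = [f for f in fragments if f.x >= column_boundary]
    ordered = sorted(left, key=lambda f: f.y) + sorted(right, key=lambda f: f.y)
    return [f.text for f in ordered]


# A hypothetical two-column page with two lines per column.
page = [
    Fragment(72, 100, "Left line 1"),
    Fragment(320, 100, "Right line 1"),
    Fragment(72, 115, "Left line 2"),
    Fragment(320, 115, "Right line 2"),
]

print(naive_order(page))                              # columns interleaved
print(column_aware_order(page, column_boundary=300))  # columns kept together
```

Real layout-aware parsers detect column boundaries from the page itself rather than taking them as a parameter; the hard-coded boundary here only keeps the sketch short.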

Impact on RAG Quality

If your parser misreads a table, your RAG might say "The price of Product X is Rs 999" when it is actually Rs 499 - because the parser jumbled table columns. In practice, a large share of RAG failures trace back to poor document parsing rather than the LLM or embedding model, so fix your parsing first.

Note: Document parsing is the most underrated component of RAG. Teams spend weeks tuning prompts and models but minutes on parsing - when parsing quality has the biggest impact on final answer quality.

The Three Major Tools

Unstructured vs Docling vs PyPDF - Choosing Your Parser

Unstructured (The Swiss Army Knife)

The most popular document parsing library in the AI ecosystem. Handles 20+ file types with a single API.

Formats: PDF, DOCX, PPTX, HTML, EPUB, Markdown, images, email, CSV, and more
Strategies: auto (picks a strategy per document), fast (basic extraction), hi_res (ML-based layout detection), ocr_only (for scanned docs)
Strengths: Widest format support, excellent table extraction with hi_res, active community, LangChain/LlamaIndex integration
Weaknesses: hi_res mode is slow and needs heavy dependencies (detectron2, tesseract), complex setup
Think of it as: The Reliance JioMart of parsing - has everything but the store is big and sometimes slow

Docling (The New IBM Challenger)

IBM's newer document parsing library focused on high-quality PDF understanding.

Formats: PDF, DOCX, PPTX, HTML, images
Strengths: Excellent table detection using TableFormer model, great layout understanding, structured JSON output, lighter than Unstructured
Weaknesses: Fewer format types than Unstructured, newer with smaller community
Think of it as: The DMart of parsing - focused selection but high quality and efficient

PyPDF / PyMuPDF / pdfplumber (The Lightweight Options)

Python libraries specifically for PDF parsing. No ML models, just algorithmic extraction.

PyPDF: Basic text extraction, fast, minimal dependencies. Good for simple text-only PDFs.
PyMuPDF (fitz): Fast, handles images and text, good for programmatic PDFs.
pdfplumber: Best at table extraction among lightweight tools. Uses geometric analysis.
Weaknesses: Struggle with complex layouts, no ML-based understanding, and no OCR in PyPDF or pdfplumber (PyMuPDF can hand pages to Tesseract if it is installed)
Think of it as: The kirana store - quick for daily needs but limited selection

Note: For most RAG projects: start with PyPDF for simple PDFs, upgrade to Unstructured hi_res or Docling when you hit tables, multi-column layouts, or scanned documents.
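
As a starting point, basic extraction with PyPDF can be sketched as below. `PdfReader` and `extract_text()` come from the `pypdf` package; the helper names `extract_pages` and `to_records`, the record layout, and the file name in the usage comment are assumptions made for illustration.

```python
def extract_pages(pdf_path):
    """Return the text layer of each page using pypdf."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    # Pages without a text layer (for example scans) may yield no text.
    return [page.extract_text() or "" for page in reader.pages]


def to_records(page_texts, source):
    """Attach 1-based page numbers and the source name to non-empty pages."""
    return [
        {"source": source, "page": number, "text": text.strip()}
        for number, text in enumerate(page_texts, start=1)
        if text.strip()
    ]


# Usage with a real file (hypothetical path):
# records = to_records(extract_pages("report.pdf"), source="report.pdf")
```

Keeping the page number with each record from the very first step makes it much easier to trace a bad answer back to the page that produced it.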

Parsing Strategies Deep Dive

Understanding When to Use Which Strategy

Strategy 1: Fast/Basic Extraction

Simply extracts the text layer from the PDF without any ML processing. Like reading the text off a page without understanding the layout.

Speed: Very fast (seconds per document)
Quality: Good for simple single-column text documents
Fails On: Tables, multi-column layouts, scanned docs, images
Tools: PyPDF, PyMuPDF, Unstructured fast mode
Use When: Your PDFs are simple text (books, articles, reports without tables)

Strategy 2: Layout-Aware Extraction (Hi-Res)

Uses ML models (like detectron2) to first understand the page layout - where are paragraphs, tables, headers, images? Then extracts text respecting this structure.

Speed: Slow (often in the range of 10-60 seconds per page on CPU, depending on hardware and model; a GPU helps considerably)
Quality: Excellent for complex documents with tables and multi-column layouts
Handles: Tables (converts to HTML/Markdown), columns, headers, footers, images
Tools: Unstructured hi_res, Docling, Document AI (Google), Textract (AWS)
Use When: Documents have tables, multiple columns, or complex layouts
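
A layout-aware run with Unstructured might look like the sketch below. `partition_pdf`, `strategy="hi_res"`, and `infer_table_structure=True` are part of the `unstructured` package, but parameter behavior should be checked against the installed version. The `elements_to_records` helper and the stand-in elements are hypothetical, included so the post-processing can be tried without installing the heavy dependencies.

```python
from types import SimpleNamespace


def partition_hi_res(pdf_path):
    """Run Unstructured's layout-aware PDF partitioning (slow, heavy install)."""
    from unstructured.partition.pdf import partition_pdf  # third-party

    return partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        infer_table_structure=True,  # asks for an HTML rendering of tables
    )


def elements_to_records(elements):
    """Flatten elements into dicts, preferring HTML for tables when present."""
    records = []
    for el in elements:
        metadata = getattr(el, "metadata", None)
        html = getattr(metadata, "text_as_html", None)
        use_html = el.category == "Table" and html
        records.append({
            "type": el.category,
            "page": getattr(metadata, "page_number", None),
            "content": html if use_html else el.text,
        })
    return records


# Stand-in elements (assumed shape) so the helper can run without the library.
fake_elements = [
    SimpleNamespace(
        category="Title", text="Price List",
        metadata=SimpleNamespace(page_number=1, text_as_html=None),
    ),
    SimpleNamespace(
        category="Table", text="Mumbai 499",
        metadata=SimpleNamespace(
            page_number=1,
            text_as_html="<table><tr><td>Mumbai</td><td>499</td></tr></table>",
        ),
    ),
]

for record in elements_to_records(fake_elements):
    print(record)
```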

Strategy 3: OCR-Based Extraction

For scanned documents or images of text. The image is first converted to text using OCR, then processed like any other extracted text.

Speed: Slow (depends on OCR engine)
Quality: Depends heavily on scan quality. Clean printed text works well; handwriting is much less reliable.
Tools: Tesseract (free), Google Cloud Vision, AWS Textract, Azure Document Intelligence
Use When: Scanned contracts, receipts, old government documents, handwritten forms
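
Because OCR quality varies so much with scan quality, it helps to record a confidence score next to the text. The sketch below assumes Tesseract via the `pytesseract` wrapper and Pillow; `image_to_data` returns per-word confidences. The `summarize_ocr` helper and the threshold of 60 are illustrative assumptions, not recommended values.

```python
def ocr_image(image_path, lang="eng"):
    """Run Tesseract on an image and return per-word text and confidence."""
    import pytesseract      # third-party; also needs the tesseract binary
    from PIL import Image   # third-party: Pillow

    return pytesseract.image_to_data(
        Image.open(image_path),
        lang=lang,
        output_type=pytesseract.Output.DICT,
    )


def summarize_ocr(data, min_conf=60.0):
    """Join recognized words and flag pages whose mean confidence is low."""
    words, confs = [], []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)
        # Tesseract reports -1 for boxes that are not words.
        if text.strip() and conf >= 0:
            words.append(text)
            confs.append(conf)
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    return {
        "text": " ".join(words),
        "mean_conf": mean_conf,
        "needs_review": mean_conf < min_conf,  # assumed threshold
    }


# Hand-written sample in the same shape as image_to_data output.
sample = {"text": ["", "Invoice", "No.", "4471"], "conf": ["-1", "96", "91", "41"]}
print(summarize_ocr(sample))
```

Pages flagged for review are good candidates for a better OCR engine or a manual check before they reach the index.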

Strategy 4: Vision LLM Extraction (Emerging)

Use multimodal LLMs (GPT-4V, Claude Vision) to directly "read" document images and extract structured information. Often the most accurate option on difficult layouts, but also the most expensive.

Best For: Complex forms, charts, diagrams that traditional parsers cannot handle
Cost: High (LLM API cost per page)
Use When: Other methods fail and accuracy is critical

Note: Use the simplest strategy that works for your documents. Fast mode for simple text, hi_res for tables and complex layouts, OCR for scanned docs, Vision LLMs as a last resort for extremely complex documents.

Building a Document Parsing Pipeline

Production-Ready Parsing Architecture

The Multi-Stage Pipeline

Stage 1: Document Classification
  Input: Raw document (PDF, DOCX, HTML, etc.)
  Action: Detect file type, check if scanned/digital, estimate complexity
  Output: Routing decision (fast vs hi_res vs OCR)

Stage 2: Extraction
  Simple PDF --> PyPDF (fast, cheap)
  Complex PDF --> Unstructured hi_res or Docling (slow, accurate)
  Scanned PDF --> Tesseract OCR + layout detection
  DOCX/HTML  --> Direct parsing (straightforward)

Stage 3: Cleaning
  - Remove headers, footers, page numbers
  - Fix encoding issues (Hindi Unicode problems)
  - Merge hyphenated words across line breaks
  - Convert tables to structured format (Markdown or HTML)
  - Remove duplicate content (repeated disclaimers)

Stage 4: Structuring
  - Identify sections by headings
  - Tag content types (text, table, list, code)
  - Preserve hierarchy (chapter > section > paragraph)
  - Add metadata (page number, section title, source file)

Stage 5: Output
  - Clean text chunks ready for embedding
  - Rich metadata for each chunk
  - Table content preserved in structured format
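
The classification and routing stages can be sketched with a few cheap heuristics. Everything here is an assumption for illustration: the function names, the 50-characters-per-page cutoff for detecting scans, and the whitespace-gap test for table-like text. A production classifier would need tuning on real documents.

```python
import re
from pathlib import Path

# Two or more whitespace characters between non-space characters (a column gap).
COLUMN_GAP = re.compile(r"\S\s{2,}(?=\S)")


def looks_tabular(text, min_ratio=0.3):
    """Crude guess: many lines with two or more wide gaps suggest a table."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    gappy = sum(1 for ln in lines if len(COLUMN_GAP.findall(ln)) >= 2)
    return gappy / len(lines) >= min_ratio


def choose_strategy(filename, page_texts=None, min_chars_per_page=50):
    """Return 'direct', 'ocr', 'hi_res', 'fast', or 'unsupported'."""
    suffix = Path(filename).suffix.lower()
    if suffix in {".docx", ".html", ".htm"}:
        return "direct"
    if suffix != ".pdf":
        return "unsupported"
    texts = page_texts or []
    avg_chars = sum(len(t.strip()) for t in texts) / len(texts) if texts else 0
    if avg_chars < min_chars_per_page:
        return "ocr"       # little or no text layer: probably scanned
    if any(looks_tabular(t) for t in texts):
        return "hi_res"    # table-like spacing: use layout-aware parsing
    return "fast"


table_page = "City   Price   Stock\nMumbai   499   12\nDelhi   599   7\nPune   549   20\n"
plain_page = "This is a plain paragraph of running text that is long enough to count as digital."

print(choose_strategy("scan.pdf", ["", " "]))
print(choose_strategy("prices.pdf", [table_page]))
print(choose_strategy("essay.pdf", [plain_page]))
```

The page texts passed in can come from a fast extraction pass, so the expensive parser only runs on documents that appear to need it.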

Handling Tables - The Hardest Part

Option 1: Convert tables to Markdown format. Embedding models can understand Markdown tables reasonably well.
Option 2: Convert each table row to a natural language sentence. "Product: Laptop, Price: Rs 49999, RAM: 16GB"
Option 3: Store tables separately with their own metadata. Retrieve table chunks specifically when queries seem table-related.
Best Practice: Always preserve the table caption/title with the table content. A table without context is useless.
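
Options 1 and 2 can be sketched in a few lines. The function names and the sample table are illustrative assumptions. Cell values are not escaped, so a cell containing a pipe character would break the Markdown output.

```python
def table_to_markdown(header, rows, caption=None):
    """Render a table as Markdown, with the caption kept above it."""
    lines = []
    if caption:
        lines += [f"Table: {caption}", ""]
    lines.append("| " + " | ".join(header) + " |")
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def rows_to_sentences(header, rows, caption=None):
    """Turn each row into a 'Column: value' sentence, prefixed by the caption."""
    prefix = f"{caption}. " if caption else ""
    return [
        prefix + ", ".join(f"{h}: {cell}" for h, cell in zip(header, row))
        for row in rows
    ]


header = ["Product", "Price", "RAM"]
rows = [["Laptop", "Rs 49999", "16GB"], ["Tablet", "Rs 19999", "8GB"]]

print(table_to_markdown(header, rows, caption="Price list"))
print(rows_to_sentences(header, rows))
```

Row sentences repeat the column names in every chunk, so each row stays meaningful even when it is retrieved on its own.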

Note: The cleaning step is where most value is added. Raw extraction output is messy - headers, footers, page numbers, broken words. Clean text dramatically improves embedding and retrieval quality.
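
A minimal cleaning pass, assuming the input is a list of per-page strings, might look like this. The 60% repetition threshold and the page-number pattern are assumptions. Note two limits of the sketch: the page-number pattern also drops any line that is only a number, and the hyphen merge will also join genuine compounds that happen to break across lines.

```python
import re
from collections import Counter

# Matches "Page 47 of 128", "Page 3", or a bare number on its own line.
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def find_repeated_lines(pages, min_share=0.6):
    """Lines that recur on most pages are treated as headers or footers."""
    counts = Counter()
    for page in pages:
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = max(2, int(len(pages) * min_share))
    return {line for line, n in counts.items() if n >= threshold}


def clean_pages(pages):
    """Drop repeated lines and page numbers, then merge hyphenated breaks."""
    repeated = find_repeated_lines(pages)
    kept = []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if not stripped or stripped in repeated or PAGE_NUMBER.match(stripped):
                continue
            kept.append(stripped)
    text = "\n".join(kept)
    # "pro-\ncessed" becomes "processed"
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


pages = [
    "ACME Policy Manual\nRefunds are pro-\ncessed within 7 days.\nPage 1 of 3",
    "ACME Policy Manual\nContact support for help.\nPage 2 of 3",
    "ACME Policy Manual\nTerms may change.\nPage 3 of 3",
]
print(clean_pages(pages))
```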

Common Parsing Pitfalls

Mistakes That Silently Destroy Your RAG Quality

Pitfall 1: Not Validating Extraction Quality

You parse 10,000 PDFs and feed them into your RAG, but never manually check whether the parsing was correct. Tables might be garbled, Hindi text might have encoding issues, multi-column text might be scrambled. Fix: Randomly sample 50 documents and manually compare the parsed output against the original. Calculate a "parsing accuracy" score.
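
One way to turn the sampling advice into a number is sketched below. It assumes you have hand-checked reference text for each sampled document. The character-level similarity from `difflib` is just one possible definition of "parsing accuracy", and the 0.9 pass threshold is an arbitrary assumption.

```python
import random
from difflib import SequenceMatcher


def sample_documents(doc_ids, k=50, seed=0):
    """Pick a reproducible random sample of document ids to review."""
    ids = list(doc_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))


def normalize(text):
    """Collapse whitespace and case so layout noise does not dominate."""
    return " ".join(text.split()).lower()


def parsing_accuracy(parsed, reference):
    """Similarity in [0, 1] between parsed output and the reference text."""
    return SequenceMatcher(None, normalize(parsed), normalize(reference)).ratio()


def accuracy_report(pairs, threshold=0.9):
    """pairs maps doc id -> (parsed_text, reference_text)."""
    scores = {doc: parsing_accuracy(p, r) for doc, (p, r) in pairs.items()}
    mean = sum(scores.values()) / len(scores) if scores else 0.0
    failing = sorted(doc for doc, s in scores.items() if s < threshold)
    return {"mean": mean, "failing": failing}


pairs = {
    "a.pdf": ("Mumbai 499  Delhi 599", "mumbai 499 delhi 599"),
    "b.pdf": ("xyz", "Mumbai 499 Delhi 599"),
}
print(accuracy_report(pairs))
```

`SequenceMatcher` gets slow on long texts, so for full documents it may be more practical to compare page by page.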

Pitfall 2: Losing Table Structure

Converting a table to plain text destroys the row-column relationships. "Mumbai 499 Delhi 599" - which city has which price? Without structure, the LLM might hallucinate the association. Fix: Convert tables to Markdown or HTML format that preserves structure.

Pitfall 3: Ignoring Document Metadata

Just extracting text and throwing away metadata (author, date, section title, page number) means you lose valuable context. A user asking "What changed in the 2025 policy?" needs date metadata to filter relevant documents. Fix: Extract and preserve all available metadata alongside the text.
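
A small sketch of why the metadata matters at query time. The `Chunk` type, the metadata keys, and the sample policy texts are hypothetical. The point is only that a date filter is impossible if the date was thrown away during parsing.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_year(chunks, year):
    """Keep only chunks whose 'date' metadata falls in the given year."""
    selected = []
    for chunk in chunks:
        doc_date = chunk.metadata.get("date")
        if doc_date is not None and doc_date.year == year:
            selected.append(chunk)
    return selected


chunks = [
    Chunk("Refund window is 7 days.",
          {"source": "policy_2024.pdf", "page": 3, "date": date(2024, 4, 1)}),
    Chunk("Refund window is 14 days.",
          {"source": "policy_2025.pdf", "page": 3, "date": date(2025, 4, 1)}),
    Chunk("Undated memo."),
]

for chunk in filter_by_year(chunks, 2025):
    print(chunk.metadata["source"], "-", chunk.text)
```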

Pitfall 4: Using One Strategy for All Documents

Using hi_res mode for simple text PDFs wastes time and money. Using fast mode for complex tables gives garbage output. Fix: Classify documents first, then route to the appropriate parsing strategy. One size does NOT fit all.

Note: Always manually validate parsing quality on a sample before ingesting thousands of documents. Many RAG answer errors trace back to parsing problems, not the LLM.

Interview Questions

Q: Why is document parsing critical for RAG quality?

Document parsing is the first step that determines everything downstream. If parsing garbles a table (wrong price next to wrong product) or scrambles multi-column text, the embeddings will encode incorrect information, retrieval will return wrong context, and the LLM will generate wrong answers. A large share of RAG failures trace back to parsing issues. It is the most underrated component - teams spend weeks on prompts but minutes on parsing.

Q: When would you use Unstructured hi_res mode vs PyPDF?

PyPDF is fast and lightweight - perfect for simple text-only PDFs like books, articles, or reports without complex layouts. Unstructured hi_res uses ML models (such as detectron2) for layout detection and is needed for documents with tables, multi-column layouts, mixed content, or complex formatting. Hi_res is typically far slower, often by one or two orders of magnitude, and needs heavy dependencies, so only use it when the document complexity demands it.

Q: How do you handle tables during document parsing for RAG?

Three approaches: (1) Convert tables to Markdown format - preserves structure and embedding models handle it reasonably. (2) Convert each row to a natural language sentence for better embedding quality. (3) Store tables as separate chunks with metadata linking to the source document. Always preserve the table caption/title with the content. The worst approach is converting to plain text, which destroys row-column relationships.

Q: How would you build a parsing pipeline for documents in multiple formats?

Five-stage pipeline: (1) Document classification - detect file type, scanned vs digital, complexity level. (2) Route to appropriate extractor - PyPDF for simple PDFs, Unstructured/Docling for complex, OCR for scanned, direct parse for DOCX/HTML. (3) Cleaning - remove headers/footers/page numbers, fix encoding, merge hyphenated words. (4) Structuring - identify sections, tag content types, preserve hierarchy, add metadata. (5) Output clean chunks ready for embedding with rich metadata.
