Deep Dive into RAGFlow’s PDF Parser

A technical walkthrough of the key design decisions inside deepdoc/parser/pdf_parser.py from the RAGFlow project.


Overview

RAGFlow’s PDF parser is responsible for transforming raw PDF files into structured, semantically meaningful text that can feed a retrieval-augmented generation (RAG) pipeline. Rather than treating a PDF as a flat stream of characters, the parser reconstructs the reading order, column layout, and visual structure of each page. This post walks through the key components and the reasoning behind each design choice.


__images__: The Pre-processing Entry Point

__images__ is the primary pre-processing function. It transforms raw PDF pages into everything downstream analysis needs:

  1. Page rendering — Each page is rasterized into an image (self.page_images) at a configurable zoom level (zoomin), enabling layout analysis and OCR.
  2. Character extraction — Raw character-level data is extracted into self.page_chars, along with page-level statistics: mean_height, mean_width, page_cum_height.
  3. Language detection — The function estimates whether the document is predominantly English, which affects the OCR strategy used later.
  4. OCR execution — OCR is run on each page (concurrently or sequentially) and results are stored in self.boxes as bounding-box + text pairs.
  5. Outline parsing — PDF bookmarks/outlines are extracted for structural context.
  6. Adaptive retry — If OCR returns empty results and the zoom level is still low, the function recursively increases zoomin and retries.

In short: __images__ converts a PDF into page images + OCR text boxes + page statistics — the foundation for every subsequent step.


_assign_column: Column Layout Detection

Before merging text fragments, the parser needs to know which column each fragment belongs to. _assign_column solves this problem using unsupervised clustering:

  1. Per-page grouping — Text boxes (boxes) are grouped by page.
  2. KMeans clustering on x0 — For each page, the left-edge coordinates (x0) of all boxes are clustered using KMeans with k = 1..4. The optimal k is selected by maximizing the silhouette_score.
  3. Indent normalization — To avoid first-line indentation skewing the result, x0 values close to the page’s left boundary are snapped to a common value before clustering.
  4. Column ID assignment — After the final clustering pass, cluster centers are sorted left-to-right and remapped to col_id = 0, 1, 2, .... Each box receives a col_id.

Why merge within a column instead of reading left-to-right across the whole page?

This is a deliberate design for multi-column documents (academic papers, magazines, newspapers). In a two-column layout, the correct reading order is:

Left column top → Left column bottom → Right column top → Right column bottom

A naive left-to-right scan would incorrectly interleave lines from both columns. By assigning column IDs first, downstream steps like _text_merge and _final_reading_order_merge can process each column independently before combining them in order.

For single-column documents, the clustering simply returns k = 1, and the behavior degrades gracefully to a standard top-to-bottom reading order.


Library Responsibilities: pdfplumber vs. pypdf

The parser uses two PDF libraries simultaneously, each playing a distinct role:

Library Role
pdfplumber Page content extraction — render page images, extract character-level data (dedupe_chars().chars), count pages
pypdf Document structure — read the table of contents / bookmarks (outline) for chapter/section hierarchy

They are complementary, not interchangeable. pdfplumber is optimized for spatial, character-level analysis; pypdf is lightweight and well-suited for metadata and structural traversal.


Why OCR Instead of Direct Text Extraction?

pypdf (and pdfplumber) can directly extract the embedded text layer from a PDF — and in well-formed documents this is more accurate than OCR. So why does RAGFlow still rely heavily on OCR?

The problem with text-layer extraction

  • Scanned PDFs have no text layer at all.
  • Image-only PDFs (e.g., faxes, photographed documents) are entirely pixel-based.
  • Corrupted font mappings — PDFs with subset fonts or missing ToUnicode tables produce garbled or empty extraction results even with pypdf.
  • Encrypted documents may block text extraction.

The value of OCR bounding boxes

Beyond just recovering text, the parser needs spatial coordinates for every text fragment to perform layout analysis, table detection, and image-text fusion. OCR naturally produces (bounding box, text) pairs, making it a unified representation regardless of the PDF origin.

A single OCR-based pipeline is operationally simpler than maintaining multiple extraction paths and reconciling their outputs.

The trade-off

OCR introduces recognition errors, particularly on mathematical notation, non-Latin scripts, and low-resolution scans. The current implementation mitigates this by combining pdfplumber character data with OCR output rather than relying on OCR alone.

A better approach

The ideal architecture — and a natural evolution of this codebase — is:

  1. Attempt direct text extraction (via pdfplumber/pypdf).
  2. Evaluate quality (character coverage, encoding validity).
  3. Fall back to OCR only when the text layer is absent or unreliable.
  4. Fuse and cross-validate both sources when both are available.

This “text-layer first, OCR as fallback” strategy preserves accuracy on clean PDFs while remaining robust on scanned documents.


Summary

Component Responsibility
__images__ Rasterize pages, run OCR, collect statistics
_assign_column Detect multi-column layout via KMeans clustering
_text_merge Merge adjacent fragments within the same column
_final_reading_order_merge Combine columns in correct reading order
pdfplumber Page rendering + character-level data
pypdf Document outline / bookmark extraction

The overarching design philosophy is robustness over precision: handle every PDF type uniformly, even at the cost of some OCR noise. For use cases demanding higher text fidelity, augmenting the pipeline with text-layer extraction as a primary path — with OCR as a fallback — is the recommended improvement direction.