Deep Dive into RAGFlow’s PDF Parser
A technical walkthrough of the key design decisions inside deepdoc/parser/pdf_parser.py from the RAGFlow project.
Overview
RAGFlow’s PDF parser is responsible for transforming raw PDF files into structured, semantically meaningful text that can feed a retrieval-augmented generation (RAG) pipeline. Rather than treating a PDF as a flat stream of characters, the parser reconstructs the reading order, column layout, and visual structure of each page. This post walks through the key components and the reasoning behind each design choice.
__images__: The Pre-processing Entry Point
__images__ is the primary pre-processing function. It transforms raw PDF pages into everything downstream analysis needs:
- Page rendering: Each page is rasterized into an image (self.page_images) at a configurable zoom level (zoomin), enabling layout analysis and OCR.
- Character extraction: Raw character-level data is extracted into self.page_chars, along with page-level statistics: mean_height, mean_width, page_cum_height.
- Language detection: The function estimates whether the document is predominantly English, which affects the OCR strategy used later.
- OCR execution: OCR is run on each page (concurrently or sequentially) and results are stored in self.boxes as bounding-box + text pairs.
- Outline parsing: PDF bookmarks/outlines are extracted for structural context.
- Adaptive retry: If OCR returns empty results and the zoom level is still low, the function recursively increases zoomin and retries.
In short: __images__ converts a PDF into page images + OCR text boxes + page statistics — the foundation for every subsequent step.
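The adaptive retry step can be illustrated with a minimal sketch. It is not the parser's actual code: `run_ocr` is a hypothetical callable standing in for the real OCR pass, and the starting zoom, the multiplier, and the upper bound are assumed values chosen only for illustration.

```python
def ocr_with_retry(run_ocr, zoomin=3, max_zoomin=9, factor=3):
    """Run OCR at `zoomin`; if nothing is found and zoom is still low, retry at a higher zoom.

    `run_ocr` is a hypothetical callable: zoom level -> list of (bbox, text) pairs.
    `max_zoomin` and `factor` are assumed values, not taken from the source.
    """
    boxes = run_ocr(zoomin)
    if not boxes and zoomin < max_zoomin:
        # Empty result at a low zoom: rasterize larger and try again (recursive retry).
        return ocr_with_retry(run_ocr, zoomin * factor, max_zoomin, factor)
    return boxes, zoomin


# Usage with a fake OCR engine that only "sees" text once the page is rendered large enough.
calls = []

def fake_ocr(zoom):
    calls.append(zoom)
    return [((0, 0, 100, 20), "hello")] if zoom >= 9 else []

boxes, used_zoom = ocr_with_retry(fake_ocr)
```

The upper bound matters: without it, a genuinely blank page would trigger unbounded recursion.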
_assign_column: Column Layout Detection
Before merging text fragments, the parser needs to know which column each fragment belongs to. _assign_column solves this problem using unsupervised clustering:
- Per-page grouping: Text boxes (boxes) are grouped by page.
- KMeans clustering on x0: For each page, the left-edge coordinates (x0) of all boxes are clustered using KMeans with k = 1..4. The optimal k is selected by maximizing the silhouette_score.
- Indent normalization: To avoid first-line indentation skewing the result, x0 values close to the page’s left boundary are snapped to a common value before clustering.
- Column ID assignment: After the final clustering pass, cluster centers are sorted left-to-right and remapped to col_id = 0, 1, 2, .... Each box receives a col_id.
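The steps above can be sketched without external dependencies. This is a simplified stand-in, not the parser's implementation: the names KMeans and silhouette_score in the text suggest scikit-learn, whereas the sketch below uses a small hand-written 1-D k-means with deterministic initialization. The function names, `indent_tol`, and `min_score` are assumptions. Because the silhouette score is only defined for two or more clusters, k = 1 is treated here as a baseline that a multi-column split must beat.

```python
from statistics import mean


def kmeans_1d(xs, k, iters=50):
    """Lloyd's algorithm on 1-D points, with evenly spaced (deterministic) initial centers."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        new_centers = []
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def silhouette(xs, labels):
    """Mean silhouette coefficient; returns 0.0 when fewer than two clusters are populated."""
    clusters = {}
    for x, lab in zip(xs, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for x, lab in zip(xs, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        b = min(mean([abs(x - y) for y in other])
                for m, other in clusters.items() if m != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


def assign_columns(x0s, max_k=4, indent_tol=12.0, min_score=0.5):
    """Return one col_id per box for a single page (ids ordered left to right)."""
    if not x0s:
        return []
    left = min(x0s)
    # Indent normalization: snap near-left-edge values to the left boundary.
    snapped = [left if x - left <= indent_tol else x for x in x0s]
    best_score = min_score  # assumed baseline that k >= 2 must beat
    best = ([0] * len(snapped), [mean(snapped)])
    for k in range(2, min(max_k, len(set(snapped))) + 1):
        labels, centers = kmeans_1d(snapped, k)
        score = silhouette(snapped, labels)
        if score > best_score:
            best_score, best = score, (labels, centers)
    labels, centers = best
    # Remap cluster labels so that col_id increases from left to right.
    order = sorted(set(labels), key=lambda c: centers[c])
    remap = {c: i for i, c in enumerate(order)}
    return [remap[lab] for lab in labels]


two_column = assign_columns([50, 50, 62, 50, 310, 310, 322, 310])
single_column = assign_columns([50, 50, 58, 50, 50])
```

Note that only indentation near the page's left boundary is normalized in this sketch, so a deeply indented line in a single-column page can still be mistaken for a second column; the threshold values would need tuning on real documents.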
Why merge within a column instead of reading left-to-right across the whole page?
This is a deliberate design for multi-column documents (academic papers, magazines, newspapers). In a two-column layout, the correct reading order is:
Left column top → Left column bottom → Right column top → Right column bottom
A naive left-to-right scan would incorrectly interleave lines from both columns. By assigning column IDs first, downstream steps like _text_merge and _final_reading_order_merge can process each column independently before combining them in order.
For single-column documents, the clustering simply selects k = 1, and the behavior reduces to a standard top-to-bottom reading order.
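Once every box carries a col_id, the column-first reading order amounts to a sort. The sketch below assumes each box is a dict with hypothetical `page`, `col_id`, `top`, and `text` keys; it illustrates the ordering idea only and is not the logic of _text_merge or _final_reading_order_merge.

```python
def reading_order(boxes):
    """Order boxes by page, then column (left to right), then vertical position."""
    return sorted(boxes, key=lambda b: (b["page"], b["col_id"], b["top"]))


# Two-column page: a naive top-to-bottom scan would interleave L1, R1, L2, R2.
page = [
    {"page": 0, "col_id": 1, "top": 100, "text": "R1"},
    {"page": 0, "col_id": 0, "top": 100, "text": "L1"},
    {"page": 0, "col_id": 1, "top": 130, "text": "R2"},
    {"page": 0, "col_id": 0, "top": 130, "text": "L2"},
]
ordered_text = [b["text"] for b in reading_order(page)]
```

Elements that span several columns, such as a full-width title, would need extra handling that this sketch leaves out.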
Library Responsibilities: pdfplumber vs. pypdf
The parser uses two PDF libraries simultaneously, each playing a distinct role:
| Library | Role |
|---|---|
| pdfplumber | Page content extraction: render page images, extract character-level data (dedupe_chars().chars), count pages |
| pypdf | Document structure: read the table of contents / bookmarks (outline) for chapter/section hierarchy |
They are complementary, not interchangeable. pdfplumber is optimized for spatial, character-level analysis; pypdf is lightweight and well-suited for metadata and structural traversal.
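The division of labour can be sketched as follows. The function names `load_pdf` and `flatten_outline` are hypothetical, and the mapping of zoomin to a rendering resolution of 72 * zoomin is an assumption. The third-party imports are placed inside `load_pdf` so that the outline helper can be tried on plain nested lists without either library installed.

```python
def flatten_outline(items, depth=0):
    """Flatten a pypdf-style outline (nested lists of dict-like entries) into (title, depth) pairs."""
    flat = []
    for item in items:
        if isinstance(item, list):
            # A nested list holds the children of the preceding entry.
            flat.extend(flatten_outline(item, depth + 1))
        else:
            flat.append((item["/Title"], depth))
    return flat


def load_pdf(path, zoomin=3):
    """Collect page images and characters via pdfplumber, and the outline via pypdf."""
    import pdfplumber              # third-party: spatial, character-level data
    from pypdf import PdfReader    # third-party: document structure

    with pdfplumber.open(path) as pdf:
        # Assumption: zoomin scales the default 72 DPI rendering resolution.
        images = [p.to_image(resolution=72 * zoomin).annotated for p in pdf.pages]
        chars = [p.dedupe_chars().chars for p in pdf.pages]
    outline = flatten_outline(PdfReader(path).outline)
    return images, chars, outline


# The outline helper on a hand-built structure shaped like pypdf's nested outline.
sample_outline = [
    {"/Title": "Intro"},
    [{"/Title": "Background"}, [{"/Title": "Details"}]],
    {"/Title": "Methods"},
]
flat = flatten_outline(sample_outline)
```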
Why OCR Instead of Direct Text Extraction?
pypdf (and pdfplumber) can directly extract the embedded text layer from a PDF — and in well-formed documents this is more accurate than OCR. So why does RAGFlow still rely heavily on OCR?
The problem with text-layer extraction
- Scanned PDFs have no text layer at all.
- Image-only PDFs (e.g., faxes, photographed documents) are entirely pixel-based.
- Corrupted font mappings: PDFs with subset fonts or missing ToUnicode tables produce garbled or empty extraction results even with pypdf.
- Encrypted documents may block text extraction.
The value of OCR bounding boxes
Beyond just recovering text, the parser needs spatial coordinates for every text fragment to perform layout analysis, table detection, and image-text fusion. OCR naturally produces (bounding box, text) pairs, making it a unified representation regardless of the PDF origin.
A single OCR-based pipeline is operationally simpler than maintaining multiple extraction paths and reconciling their outputs.
The trade-off
OCR introduces recognition errors, particularly on mathematical notation, non-Latin scripts, and low-resolution scans. The current implementation mitigates this by combining pdfplumber character data with OCR output rather than relying on OCR alone.
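One way to picture this combination is sketched below, under stated assumptions: OCR boxes are dicts with hypothetical `x0`, `x1`, `top`, `bottom`, and `text` keys, characters use the same key names as pdfplumber char dicts, and both are already in the same coordinate system (OCR coordinates from a zoomed image would first need dividing by the zoom factor). The fusion rule shown, preferring embedded characters whose centre falls inside a box, is an illustration rather than the parser's exact logic.

```python
def fill_boxes_with_chars(ocr_boxes, chars):
    """Prefer embedded characters inside each OCR box; keep the OCR text when none are found."""
    merged = []
    for box in ocr_boxes:
        inside = [
            c for c in chars
            if box["x0"] <= (c["x0"] + c["x1"]) / 2 <= box["x1"]
            and box["top"] <= (c["top"] + c["bottom"]) / 2 <= box["bottom"]
        ]
        inside.sort(key=lambda c: c["x0"])  # left-to-right within the box
        text = "".join(c["text"] for c in inside) if inside else box["text"]
        merged.append({**box, "text": text, "source": "chars" if inside else "ocr"})
    return merged


# One box covered by embedded characters, one box over a scanned region with none.
ocr_boxes = [
    {"x0": 0, "x1": 50, "top": 0, "bottom": 12, "text": "He1lo"},   # OCR misread "l" as "1"
    {"x0": 0, "x1": 50, "top": 20, "bottom": 32, "text": "World"},
]
chars = [
    {"text": ch, "x0": i * 10, "x1": i * 10 + 8, "top": 1, "bottom": 11}
    for i, ch in enumerate("Hello")
]
merged = fill_boxes_with_chars(ocr_boxes, chars)
```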
A better approach
The ideal architecture — and a natural evolution of this codebase — is:
- Attempt direct text extraction (via pdfplumber/pypdf).
- Evaluate quality (character coverage, encoding validity).
- Fall back to OCR only when the text layer is absent or unreliable.
- Fuse and cross-validate both sources when both are available.
This “text-layer first, OCR as fallback” strategy preserves accuracy on clean PDFs while remaining robust on scanned documents.
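The first three steps of that strategy can be sketched as a small decision function. Everything here is hypothetical: the function names, the quality heuristic (share of characters that are not replacement, control, private-use, or unassigned code points), and both thresholds are assumptions for illustration. The fourth step, fusing and cross-validating both sources, is not covered.

```python
import unicodedata

BAD_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}  # control, private use, unassigned, surrogate


def text_layer_quality(text):
    """Fraction of non-whitespace characters that look like valid, decodable text."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(
        1 for ch in visible
        if ch == "\ufffd" or unicodedata.category(ch) in BAD_CATEGORIES
    )
    return 1 - bad / len(visible)


def choose_text(page_text, run_ocr, min_chars=20, min_quality=0.9):
    """Use the embedded text layer when it is present and clean; otherwise call OCR.

    `run_ocr` is a hypothetical zero-argument callable returning the OCR text for the page.
    """
    visible_count = sum(1 for ch in page_text if not ch.isspace())
    if visible_count >= min_chars and text_layer_quality(page_text) >= min_quality:
        return page_text, "text_layer"
    return run_ocr(), "ocr"


clean = choose_text("The quick brown fox jumps over the lazy dog.", lambda: "ocr text")
garbled = choose_text("\ufffd" * 30, lambda: "ocr text")
scanned = choose_text("", lambda: "ocr text")
```

Because OCR is only invoked through the callable, clean pages never pay the recognition cost.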
Summary
| Component | Responsibility |
|---|---|
| __images__ | Rasterize pages, run OCR, collect statistics |
| _assign_column | Detect multi-column layout via KMeans clustering |
| _text_merge | Merge adjacent fragments within the same column |
| _final_reading_order_merge | Combine columns in correct reading order |
| pdfplumber | Page rendering + character-level data |
| pypdf | Document outline / bookmark extraction |
The overarching design philosophy is robustness over precision: handle every PDF type uniformly, even at the cost of some OCR noise. For use cases demanding higher text fidelity, augmenting the pipeline with text-layer extraction as a primary path — with OCR as a fallback — is the recommended improvement direction.