Handling Large Tables When Converting Web Pages to PDF with Playwright

Converting web pages to PDF is a common task in web scraping and automation pipelines. Tools like Playwright make it straightforward — until you encounter large tables. This post walks through the pain points, the solutions attempted, and the best practical approach.

The Setup: Playwright PDF Generation

A typical Playwright-based converter looks like this:

from playwright.sync_api import sync_playwright
from urllib.parse import urlparse
import os

STORAGE_FILE = "storage_state.json"

def webpage_to_pdf(url: str, output_path: str, timeout: int = 30000):
    with sync_playwright() as p:
        browser = p.chromium.launch()

        PREFIX = urlparse(url).netloc + "_"
        if os.path.exists(PREFIX + STORAGE_FILE):
            context = browser.new_context(storage_state=PREFIX + STORAGE_FILE)
        else:
            context = browser.new_context()

        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded", timeout=timeout)

        if "login" in page.url or "signin" in page.url:
            print("Authentication required. Please run save_login_state() first.")
            browser.close()
            return

        page.pdf(
            path=output_path,
            format="A4",
            landscape=True,
            print_background=True,
            margin={"top": "20mm", "bottom": "20mm", "left": "15mm", "right": "15mm"}
        )

        browser.close()
        print(f"PDF saved to {output_path}")

This works fine for most pages — until you hit a page with a wide table (10+ columns). Then two problems emerge.

Problem 1: Landscape Mode Is Slow

Setting landscape=True gives you roughly 40% more horizontal space, which helps wide tables fit. But it noticeably slows down PDF generation because Chromium internally:

Recalculates the entire page layout for the new aspect ratio
Re-evaluates print media queries (@media print + orientation: landscape)
Triggers component re-renders in SPA frameworks (React, Vue) that detect viewport size changes

Problem 2: Without Landscape, Content Gets Clipped

This is a known limitation of Chromium’s PDF renderer. When a table’s total column width exceeds the page width, everything beyond the right margin is simply cut off. There’s no overflow, no wrapping — just gone.

Attempted Solution 1: CSS Injection

The first instinct is to inject print-specific CSS to force tables to fit within the page:

page.add_style_tag(content="""
    @media print {
        table {
            width: 100% !important;
            table-layout: fixed !important;
            word-break: break-all !important;
            font-size: 8px !important;
        }
        th, td {
            overflow: hidden !important;
            white-space: normal !important;
        }
        * {
            max-width: 100% !important;
            overflow-x: hidden !important;
        }
    }
""")

Verdict: Helps for moderately wide tables, but table-layout: fixed + small font hits a readability floor. For tables with 20+ columns, the text becomes unreadable or columns still overflow.

Attempted Solution 2: Alternative Libraries

Library	Pros	Cons
WeasyPrint	Excellent `@page` CSS support, good table pagination	No JavaScript rendering (SPA pages won’t work)
pdfkit / wkhtmltopdf	Flexible config, decent table handling	Requires external binary, weak JS support
Puppeteer (Node.js)	Nearly identical to Playwright, larger community	Requires Node.js environment

Verdict: All PDF solutions share the same fundamental constraint — PDF pages have a fixed physical width. No library can escape this.

Attempted Solution 3: Dynamic Scale

Playwright’s page.pdf() accepts a scale parameter. You can dynamically calculate the ratio needed to fit the page content:

scroll_width = page.evaluate("document.body.scrollWidth")
a4_landscape_width = 1122  # A4 landscape width in pixels at 96 DPI

scale = min(1.0, a4_landscape_width / scroll_width)
scale = max(0.1, scale)  # floor at 10%

page.pdf(
    path=output_path,
    format="A4",
    landscape=True,
    print_background=True,
    scale=scale,
    margin={"top": "5mm", "bottom": "5mm", "left": "5mm", "right": "5mm"}
)

Verdict: This is the only way to guarantee no content loss in a PDF. But for very wide tables, the entire page shrinks to a point where nothing is legible.

The Real Solution: Don’t Put Large Tables in PDFs

The realization is straightforward once you step back:

HTML supports infinite horizontal scrolling
Excel supports infinite horizontal scrolling
PDF does not — it simulates a fixed-size piece of paper

PDF is fundamentally the wrong format for wide tabular data.

Extract Tables to Excel Instead

import pandas as pd
from playwright.sync_api import sync_playwright
from urllib.parse import urlparse
import os

def extract_tables(url: str, output_path: str, timeout: int = 30000):
    with sync_playwright() as p:
        browser = p.chromium.launch()

        PREFIX = urlparse(url).netloc + "_"
        if os.path.exists(PREFIX + STORAGE_FILE):
            context = browser.new_context(storage_state=PREFIX + STORAGE_FILE)
        else:
            context = browser.new_context()

        page = context.new_page()
        page.goto(url, wait_until="networkidle", timeout=timeout)

        if "login" in page.url or "signin" in page.url:
            print("Authentication required.")
            browser.close()
            return

        # Extract all table data via JavaScript
        tables_data = page.evaluate("""
            () => {
                const tables = document.querySelectorAll('table');
                return Array.from(tables).map(table => {
                    const rows = table.querySelectorAll('tr');
                    return Array.from(rows).map(row => {
                        const cells = row.querySelectorAll('th, td');
                        return Array.from(cells).map(cell => cell.innerText.trim());
                    });
                });
            }
        """)

        browser.close()

        # Write each table to a separate Excel sheet
        with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
            for i, table in enumerate(tables_data):
                if not table:
                    continue
                df = pd.DataFrame(table[1:], columns=table[0])
                sheet_name = f"Table_{i+1}"
                df.to_excel(writer, sheet_name=sheet_name, index=False)
                print(f"  Sheet '{sheet_name}': {len(df)} rows, {len(df.columns)} columns")

        print(f"Tables saved to {output_path}")

Side-by-Side Comparison

	PDF Conversion	Table Extraction to Excel
Wide tables	Truncated or unreadably small	Complete, scrollable
Data editable	No	Yes
Downstream analysis	Requires re-parsing	Direct pandas/Excel use
Speed	Slow (layout + render)	Fast (DOM query only)
File size	Large	Small

A Hybrid Strategy

In practice, the best approach for scraping pages that contain both narrative content and large tables is a hybrid:

Save the page as PDF (with scale or CSS injection) for the text/visual content
Extract tables separately into Excel/CSV for the data
Or simply save the raw HTML — a browser can render it with full horizontal scrolling, preserving both layout and data fidelity

# Save raw HTML — simplest approach for full fidelity
html_content = page.content()
with open("output.html", "w", encoding="utf-8") as f:
    f.write(html_content)

Key Takeaways

PDF is paper simulation — it has a fixed width and cannot scroll. This is a format-level limitation, not a tool limitation.
landscape=True gives ~40% more width but slows rendering due to full layout recalculation.
CSS injection and scale are workarounds, not solutions — they trade readability for completeness.
For tabular data, extract to Excel/CSV. The data stays intact, editable, and analyzable.
For visual fidelity, save as HTML. Browsers handle horizontal overflow natively.

Choose the output format that matches your actual use case. If you need the data, don’t flatten it into a fixed-width image of paper.