Handling Large Tables When Converting Web Pages to PDF with Playwright
Converting web pages to PDF is a common task in web scraping and automation pipelines. Tools like Playwright make it straightforward — until you encounter large tables. This post walks through the pain points, the solutions attempted, and the best practical approach.
The Setup: Playwright PDF Generation
A typical Playwright-based converter looks like this:
    import os
    from urllib.parse import urlparse

    from playwright.sync_api import sync_playwright

    STORAGE_FILE = "storage_state.json"

    def webpage_to_pdf(url: str, output_path: str, timeout: int = 30000):
        with sync_playwright() as p:
            browser = p.chromium.launch()
            # Reuse a saved login session for this host if one exists
            state_file = urlparse(url).netloc + "_" + STORAGE_FILE
            if os.path.exists(state_file):
                context = browser.new_context(storage_state=state_file)
            else:
                context = browser.new_context()
            page = context.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout)
            # Landing on a login page means the session is missing or expired
            if "login" in page.url or "signin" in page.url:
                print("Authentication required. Please run save_login_state() first.")
                browser.close()
                return
            page.pdf(
                path=output_path,
                format="A4",
                landscape=True,
                print_background=True,
                margin={"top": "20mm", "bottom": "20mm", "left": "15mm", "right": "15mm"},
            )
            browser.close()
        print(f"PDF saved to {output_path}")
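The error message above points to save_login_state(), which this post does not show. The following is only a guess at what such a helper could look like, assuming the same host-prefixed file name as the converter. Both function bodies are hypothetical, not the original implementation.

```python
from urllib.parse import urlparse

STORAGE_FILE = "storage_state.json"


def storage_state_path(url: str) -> str:
    """Per-host state file name, e.g. 'example.com_storage_state.json'."""
    return urlparse(url).netloc + "_" + STORAGE_FILE


def save_login_state(login_url: str) -> str:
    """Hypothetical helper: log in by hand once, then persist the session."""
    # Imported lazily so storage_state_path() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    path = storage_state_path(login_url)
    with sync_playwright() as p:
        # Headed browser so the user can type credentials and pass any 2FA step
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto(login_url)
        input("Log in in the browser window, then press Enter here... ")
        # Writes cookies and local storage to a JSON file
        context.storage_state(path=path)
        browser.close()
    return path
```

With a state file saved this way, webpage_to_pdf() would pick it up automatically on the next run for the same host.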
This works fine for most pages — until you hit a page with a wide table (10+ columns). Then two problems emerge.
Problem 1: Landscape Mode Is Slow
Setting landscape=True gives you roughly 40% more horizontal space (297 mm of paper width instead of 210 mm on A4), which helps wide tables fit. But it can noticeably slow down PDF generation, most likely because Chromium has to:
- Recalculate the entire page layout for the new aspect ratio
- Re-evaluate print media queries (@media print combined with orientation: landscape)
- Re-render components in SPA frameworks (React, Vue) that react to viewport size changes
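The "roughly 40%" figure is easy to check with a little arithmetic. This sketch uses the A4 dimensions and the 15mm side margins from the converter above; the function name is made up for illustration.

```python
A4_SHORT_MM, A4_LONG_MM = 210, 297  # ISO 216 A4 dimensions


def printable_width_mm(landscape: bool, left_mm: float = 15, right_mm: float = 15) -> float:
    """Paper width minus the side margins for A4 in the given orientation."""
    paper_width = A4_LONG_MM if landscape else A4_SHORT_MM
    return paper_width - left_mm - right_mm


portrait_mm = printable_width_mm(landscape=False)
landscape_mm = printable_width_mm(landscape=True)
paper_gain = A4_LONG_MM / A4_SHORT_MM - 1          # gain in raw paper width
printable_gain = landscape_mm / portrait_mm - 1    # gain once margins are subtracted
print(f"paper: +{paper_gain:.0%}, printable area: +{printable_gain:.0%}")
```

The raw paper width grows by about 41%. Because the margins stay the same in both orientations, the usable width grows a little more than that.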
Problem 2: Without Landscape, Content Gets Clipped
This is a known limitation of Chromium’s PDF renderer. When a table’s total column width exceeds the page width, everything beyond the right margin is simply cut off. There’s no overflow, no wrapping — just gone.
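Before generating the PDF, it can help to know which tables are going to be clipped. A minimal sketch, assuming `page` is an already-loaded Playwright page. The helper names are hypothetical, and the widths come from the screen layout, so they only approximate what the print layout will do.

```python
MM_TO_PX = 96 / 25.4  # CSS pixels per millimetre at 96 DPI


def printable_width_px(paper_width_mm: float, left_mm: float, right_mm: float) -> float:
    """Usable page width in CSS pixels once the side margins are subtracted."""
    return (paper_width_mm - left_mm - right_mm) * MM_TO_PX


def find_clipped_tables(page, limit_px: float) -> list[tuple[int, int]]:
    """Return (table index, width in px) for every table wider than limit_px."""
    widths = page.evaluate(
        "() => Array.from(document.querySelectorAll('table')).map(t => t.scrollWidth)"
    )
    return [(i, w) for i, w in enumerate(widths) if w > limit_px]


# Usage with the converter's A4 portrait settings (15mm side margins):
#   limit = printable_width_px(210, 15, 15)
#   for index, width in find_clipped_tables(page, limit):
#       print(f"table {index} is {width}px wide, limit is {limit:.0f}px")
```

Calling page.emulate_media(media="print") before measuring may bring the numbers closer to what the PDF renderer sees.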
Attempted Solution 1: CSS Injection
The first instinct is to inject print-specific CSS to force tables to fit within the page:
    page.add_style_tag(content="""
        @media print {
            table {
                width: 100% !important;
                table-layout: fixed !important;
                word-break: break-all !important;
                font-size: 8px !important;
            }
            th, td {
                overflow: hidden !important;
                white-space: normal !important;
            }
            * {
                max-width: 100% !important;
                overflow-x: hidden !important;
            }
        }
    """)
Verdict: Helps for moderately wide tables, but table-layout: fixed + small font hits a readability floor. For tables with 20+ columns, the text becomes unreadable or columns still overflow.
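To see where the readability floor sits, here is a rough estimate of how much room each column gets once the table is forced to the page width. It assumes equal-width columns (what table-layout: fixed gives you when no column widths are set), ignores cell padding, and uses an average glyph width of half the font size, which is only a rule of thumb and varies by font.

```python
MM_TO_PX = 96 / 25.4  # CSS pixels per millimetre at 96 DPI


def column_budget(
    n_columns: int,
    printable_mm: float = 267,   # A4 landscape minus 15mm side margins
    font_px: float = 8,          # the font size injected above
    avg_char_em: float = 0.5,    # assumed average glyph width, in em
) -> tuple[float, int]:
    """Return (column width in px, rough characters per line) for equal columns."""
    column_px = printable_mm * MM_TO_PX / n_columns
    chars_per_line = int(column_px / (font_px * avg_char_em))
    return column_px, chars_per_line


for n in (10, 20, 40):
    width, chars = column_budget(n)
    print(f"{n} columns: {width:.0f}px each, about {chars} characters per line")
```

At 20 columns, even an 8px font leaves only about a dozen characters per line in each cell, so longer values wrap into tall, hard-to-scan cells.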
Attempted Solution 2: Alternative Libraries
| Library | Pros | Cons |
|---|---|---|
| WeasyPrint | Excellent @page CSS support, good table pagination | No JavaScript rendering (SPA pages won’t work) |
| pdfkit / wkhtmltopdf | Flexible config, decent table handling | Requires external binary, weak JS support |
| Puppeteer (Node.js) | Nearly identical to Playwright, larger community | Requires Node.js environment |
Verdict: All PDF solutions share the same fundamental constraint — PDF pages have a fixed physical width. No library can escape this.
Attempted Solution 3: Dynamic Scale
Playwright’s page.pdf() accepts a scale parameter. You can dynamically calculate the ratio needed to fit the page content:
    MM_TO_PX = 96 / 25.4  # CSS pixels per millimetre at 96 DPI
    margin_mm = 5
    # A4 landscape is 297 mm wide; the content has to fit between the side margins
    printable_width = (297 - 2 * margin_mm) * MM_TO_PX  # about 1085 px
    scroll_width = page.evaluate("document.body.scrollWidth")
    scale = min(1.0, printable_width / scroll_width)
    scale = max(0.1, scale)  # page.pdf() accepts scale values from 0.1 to 2
    page.pdf(
        path=output_path,
        format="A4",
        landscape=True,
        print_background=True,
        scale=scale,
        margin={"top": "5mm", "bottom": "5mm", "left": "5mm", "right": "5mm"},
    )
Verdict: On a fixed paper size, scaling is the only one of these approaches that avoids clipping, and only while the required ratio stays above the 0.1 floor. For very wide tables, the entire page shrinks to the point where nothing is legible.
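A quick calculation shows how fast legibility drops. This sketch assumes a printable width of about 1085 px (A4 landscape with 5mm margins) and a 14px body font. Both numbers are illustrative, and the helper names are not part of Playwright.

```python
def fit_scale(content_px: float, printable_px: float, floor: float = 0.1) -> float:
    """Scale that fits content_px into printable_px, clamped to [floor, 1.0]."""
    return max(floor, min(1.0, printable_px / content_px))


def effective_font_px(base_font_px: float, scale: float) -> float:
    """Size at which text of base_font_px ends up on paper after scaling."""
    return base_font_px * scale


printable = 1085  # assumed: A4 landscape minus 5mm side margins, in CSS px
for content in (1500, 3000, 6000):
    s = fit_scale(content, printable)
    size = effective_font_px(14, s)
    print(f"{content}px wide: scale {s:.2f}, 14px text prints at about {size:.1f}px")
```

A 3000px-wide table already pushes 14px text down to roughly 5px on paper. Anything more than ten times wider than the page hits the 0.1 floor and gets clipped anyway.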
The Real Solution: Don’t Put Large Tables in PDFs
The realization is straightforward once you step back:
- HTML can scroll horizontally as far as the content extends
- Excel scrolls horizontally too, across up to 16,384 columns per sheet
- PDF does not: it simulates a fixed-size piece of paper
PDF is fundamentally the wrong format for wide tabular data.
Extract Tables to Excel Instead
    import os
    from urllib.parse import urlparse

    import pandas as pd
    from playwright.sync_api import sync_playwright

    STORAGE_FILE = "storage_state.json"

    def extract_tables(url: str, output_path: str, timeout: int = 30000):
        with sync_playwright() as p:
            browser = p.chromium.launch()
            state_file = urlparse(url).netloc + "_" + STORAGE_FILE
            if os.path.exists(state_file):
                context = browser.new_context(storage_state=state_file)
            else:
                context = browser.new_context()
            page = context.new_page()
            page.goto(url, wait_until="networkidle", timeout=timeout)
            if "login" in page.url or "signin" in page.url:
                print("Authentication required.")
                browser.close()
                return
            # Extract all table data via JavaScript
            tables_data = page.evaluate("""
                () => {
                    const tables = document.querySelectorAll('table');
                    return Array.from(tables).map(table => {
                        const rows = table.querySelectorAll('tr');
                        return Array.from(rows).map(row => {
                            const cells = row.querySelectorAll('th, td');
                            return Array.from(cells).map(cell => cell.innerText.trim());
                        });
                    });
                }
            """)
            browser.close()
        # An Excel workbook needs at least one sheet, so stop if nothing was found
        if not any(tables_data):
            print("No tables found.")
            return
        # Write each table to a separate Excel sheet
        with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
            for i, table in enumerate(tables_data):
                if not table:
                    continue
                # Pad short rows (e.g. from colspan) so they match the widest row
                width = max(len(row) for row in table)
                rows = [row + [""] * (width - len(row)) for row in table]
                df = pd.DataFrame(rows[1:], columns=rows[0])
                sheet_name = f"Table_{i+1}"
                df.to_excel(writer, sheet_name=sheet_name, index=False)
                print(f"  Sheet '{sheet_name}': {len(df)} rows, {len(df.columns)} columns")
        print(f"Tables saved to {output_path}")
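If pandas and openpyxl are more than you need, the same nested list can go straight to CSV using only the standard library. This is a sketch under the assumption that `tables_data` has the shape returned by the page.evaluate() call above (a list of tables, each a list of rows, each a list of cell strings). The function name and file naming are made up.

```python
import csv
from pathlib import Path


def tables_to_csv(tables_data: list[list[list[str]]], out_dir: str) -> list[Path]:
    """Write each non-empty table to out_dir/table_N.csv and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, table in enumerate(tables_data, start=1):
        if not table:
            continue
        width = max(len(row) for row in table)
        path = out / f"table_{i}.csv"
        # newline="" lets the csv module control line endings itself
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for row in table:
                # Pad short rows (e.g. from colspan) to a uniform field count
                writer.writerow(row + [""] * (width - len(row)))
        written.append(path)
    return written
```

One file per table keeps the numbering aligned with the table's position on the page, so table_3.csv is always the third table even when an earlier one was empty.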
Side-by-Side Comparison
| | PDF Conversion | Table Extraction to Excel |
|---|---|---|
| Wide tables | Truncated or unreadably small | Complete, scrollable |
| Data editable | No | Yes |
| Downstream analysis | Requires re-parsing | Direct pandas/Excel use |
| Speed | Slow (layout + render) | Fast (DOM query only) |
| File size | Large | Small |
A Hybrid Strategy
In practice, the best approach for scraping pages that contain both narrative content and large tables is a hybrid:
- Save the page as PDF (with scale or CSS injection) for the text/visual content
- Extract tables separately into Excel/CSV for the data
- Or simply save the raw HTML: a browser can render it with full horizontal scrolling, preserving both layout and data fidelity
    # Save raw HTML: simplest approach for full fidelity
    html_content = page.content()
    with open("output.html", "w", encoding="utf-8") as f:
        f.write(html_content)
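One way to wire the hybrid together is to decide per page which outputs are worth producing, based on the widest table. The sketch below is only an illustration: the 0.5 legibility threshold is an arbitrary assumption, and the function and key names are invented.

```python
def plan_outputs(table_widths_px: list[int], printable_px: float,
                 min_scale: float = 0.5) -> dict:
    """Pick a PDF scale and decide whether to also export tables and raw HTML."""
    widest = max(table_widths_px, default=0)
    # Scale needed for the widest table to fit; 1.0 when everything already fits
    needed = 1.0 if widest <= printable_px else printable_px / widest
    return {
        # Never shrink the PDF below the legibility threshold
        "pdf_scale": max(min_scale, needed),
        # Any table that does not fit at 100% is worth exporting as data
        "extract_tables": widest > printable_px,
        # If even min_scale clips the table, keep the HTML for full fidelity
        "save_html": needed < min_scale,
    }


print(plan_outputs([600, 800], printable_px=1085))
print(plan_outputs([1550], printable_px=1085))
print(plan_outputs([4000], printable_px=1085))
```

The table widths could come from a DOM query such as document.querySelectorAll('table') and each element's scrollWidth, measured before calling page.pdf().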
Key Takeaways
- PDF is paper simulation — it has a fixed width and cannot scroll. This is a format-level limitation, not a tool limitation.
- landscape=True gives ~40% more width but slows rendering due to full layout recalculation.
- CSS injection and scale are workarounds, not solutions: they trade readability for completeness.
- For tabular data, extract to Excel/CSV. The data stays intact, editable, and analyzable.
- For visual fidelity, save as HTML. Browsers handle horizontal overflow natively.
Choose the output format that matches your actual use case. If you need the data, don’t flatten it into a fixed-width image of paper.