Intelligent PDF parser with font-aware structure detection, table extraction, and multi-column support

pdfstruct

The PDF parser built for AI pipelines. Structured sections, tables, images, and metadata — not just raw text.

Overview

pdfstruct is a Python library that extracts structured content from PDF documents. Unlike basic text extraction tools, pdfstruct understands document layout — detecting headings, sections, tables, lists, headers/footers, and multi-column layouts using font analysis and geometric reasoning.

Features

Font-aware heading detection: Uses font size, weight, and frequency analysis to classify headings (H1–H6)
Table extraction: Detects tables from grid lines and whitespace-aligned columns
Image extraction: Extracts embedded images with metadata, DPI estimation, caption detection, and cross-page deduplication
Section hierarchy: Builds a document tree from headings and content
Multi-column support: Handles two-column and multi-column layouts
Header/footer removal: Identifies and filters repeating page content
List detection: Recognizes bulleted, numbered, lettered, and Roman numeral lists
Thumbnail generation: Creates thumbnails from extracted images
Multiple output formats: JSON, Markdown, and plain text
Rich metadata: Word count, language detection, reading time, font statistics, image stats
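As a rough illustration of font-aware heading detection, the sketch below treats the font size that covers the most characters as body text and maps larger sizes to heading levels. `Line` and `classify_headings` are hypothetical names used only for this example; they are not part of pdfstruct, and the library's actual heuristics may differ.

```python
from collections import Counter
from typing import NamedTuple


class Line(NamedTuple):
    """Hypothetical stand-in for an extracted text line."""
    text: str
    font_size: float
    bold: bool = False


def classify_headings(lines: list[Line], max_levels: int = 6) -> dict[str, int]:
    """Map heading text to a level (1 = H1) using font size frequency.

    Assumption: body text is the font size covering the most characters.
    """
    if not lines:
        return {}
    # Weight each size by character count so long paragraphs dominate.
    size_weight = Counter()
    for line in lines:
        size_weight[line.font_size] += len(line.text)
    body_size = size_weight.most_common(1)[0][0]

    # Distinct sizes larger than body text, biggest first: H1, H2, ...
    heading_sizes = sorted(
        {line.font_size for line in lines if line.font_size > body_size},
        reverse=True,
    )
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes[:max_levels])}

    headings = {}
    for line in lines:
        if line.font_size in level_of:
            headings[line.text] = level_of[line.font_size]
        elif line.bold and line.font_size == body_size and len(line.text) < 80:
            # Short, bold, body-sized lines get the next level down.
            headings[line.text] = min(len(level_of) + 1, max_levels)
    return headings


lines = [
    Line("Annual Report", 24.0, bold=True),
    Line("1. Introduction", 16.0, bold=True),
    Line("This report covers the fiscal year and its results in detail.", 11.0),
    Line("Scope", 11.0, bold=True),
    Line("The scope includes all subsidiaries and joint ventures.", 11.0),
]
print(classify_headings(lines))
# {'Annual Report': 1, '1. Introduction': 2, 'Scope': 3}
```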

Installation

Install from PyPI. The distribution is named pdfstructx, while the import name is pdfstruct:

pip install pdfstructx

Or install from source:

git clone https://github.com/Kyros-Groupe-Ltd/pdfstruct.git
cd pdfstruct
pip install -e .

Quickstart

import pdfstruct

# Parse a PDF
doc = pdfstruct.parse("contract.pdf")

# Access structured content
print(doc.title)
print(f"{doc.page_count} pages, {doc.metadata.word_count} words")

# Browse sections
for section in doc.sections:
    print(f"{section.heading} ({len(section.content)} chars)")
    for sub in section.subsections:
        print(f"  {sub.heading}")

# Get tables
for table in doc.tables:
    print(table.to_dicts())  # List of row dicts

# Extract images (opt-in)
doc = pdfstruct.parse("report.pdf", extract_images=True)
for page in doc.pages:
    for img in page.images:
        print(f"Page {img.page_number}: {img.format} {img.width_px}x{img.height_px} @ {img.dpi:.0f} DPI")
        if img.caption:
            print(f"  Caption: {img.caption}")
        if img.image_bytes:
            img.save(f"img_{img.page_number}_{img.image_index}.png")

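# Generate a thumbnail from the first image that carries data
images = [img for page in doc.pages for img in page.images if img.image_bytes]
if images:
    thumbnail = pdfstruct.generate_thumbnail(images[0].image_bytes, max_size=(150, 150))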

# Export to different formats
print(pdfstruct.to_markdown(doc))
print(pdfstruct.to_text(doc))
print(pdfstruct.to_json(doc))

# Full dict for programmatic use
data = pdfstruct.to_dict(doc)
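
For AI pipelines, the section tree is often flattened into chunks that carry their heading path. The sketch below relies only on the `heading`, `content`, and `subsections` attributes used in the quickstart. `flatten_sections` is a hypothetical helper, not a pdfstruct function, and the `SimpleNamespace` objects stand in for real `Section` objects so the example runs without a PDF.

```python
from types import SimpleNamespace


def flatten_sections(sections, parents=()):
    """Yield (heading_path, content) for every section, depth first."""
    for section in sections:
        # Assumption: a section without a heading is labelled "(untitled)".
        path = parents + (section.heading or "(untitled)",)
        yield " > ".join(path), section.content
        yield from flatten_sections(section.subsections, path)


# Stand-ins for pdfstruct Section objects (same attribute names as above).
background = SimpleNamespace(heading="Background", content="Prior work.", subsections=[])
root = SimpleNamespace(heading="Introduction", content="Overview.", subsections=[background])

for path, content in flatten_sections([root]):
    print(f"{path}: {content}")
```

With a parsed document, the equivalent call would be `flatten_sections(doc.sections)`.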

API Reference

`pdfstruct.parse(source, **options) -> Document`

Parse a PDF file, bytes, or file-like object.

Options:

detect_tables (bool, default True) — Enable table detection
detect_headers_footers (bool, default True) — Remove repeating headers/footers
detect_lists (bool, default True) — Detect list structures
detect_columns (bool, default True) — Handle multi-column layouts
extract_images (bool, default False) — Enable full image extraction (opt-in)
extract_image_data (bool, default True) — Include raw image bytes (only when extract_images=True)

Document

doc.title — Detected document title
doc.pages — List of Page objects
doc.sections — Hierarchical section tree
doc.tables — All detected tables
doc.metadata — DocumentMetadata with statistics
doc.text — Full document text (concatenated from pages)
doc.to_dict() — JSON-serializable dictionary

Section

section.heading — Section heading text
section.heading_level — HeadingLevel enum (H1–H6)
section.content — Section body text
section.paragraphs — List of Paragraph objects
section.subsections — Nested subsections

Table

table.rows — List of TableRow objects
table.to_list() — 2D list of cell text
table.to_dicts() — List of dicts (header row as keys)
table.num_rows, table.num_cols — Dimensions
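
The relationship between `to_list()` and `to_dicts()` can be pictured with plain Python. This is a minimal sketch of the documented output shapes, assuming the first row is the header. `rows_to_dicts` is a hypothetical function, and how the library itself treats ragged rows or duplicate header names is not covered here.

```python
def rows_to_dicts(rows: list[list[str]]) -> list[dict[str, str]]:
    """Treat the first row as the header and map each later row onto it."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Shaped like the 2D list that table.to_list() is described as returning.
grid = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.50"],
    ["Gadget", "1", "24.00"],
]
print(rows_to_dicts(grid))
```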

ImageInfo

img.bbox — BBox position on page
img.width_px, img.height_px — Pixel dimensions
img.format — Image format (jpeg, png, jbig2, ccitt, jpeg2000, raw)
img.colorspace — Color space (rgb, cmyk, grayscale, indexed)
img.dpi_x, img.dpi_y, img.dpi — DPI (estimated from bbox vs pixel size)
img.image_bytes — Raw image data (when extract_image_data=True)
img.file_size_bytes — Size of extracted image data
img.content_hash — SHA-256 hash for deduplication
img.caption — Auto-detected caption text (Figure 1, Fig. 2, etc.)
img.page_number, img.image_index — Location identifiers
img.is_duplicate, img.duplicate_of_index — Cross-page deduplication
img.save(path) — Save image to file
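
Two of the fields above follow from simple calculations: DPI estimated from the bounding box versus the pixel size, and deduplication by SHA-256 hash. The sketch below shows both under the assumption that the bounding box is measured in PDF points (1/72 inch). `estimate_dpi` and `find_duplicates` are hypothetical names for illustration, not the library's functions.

```python
import hashlib

POINTS_PER_INCH = 72  # PDF user space units default to 1/72 inch


def estimate_dpi(width_px: int, height_px: int,
                 bbox_width_pt: float, bbox_height_pt: float) -> tuple[float, float]:
    """Return (dpi_x, dpi_y) for an image drawn into a bbox measured in points."""
    dpi_x = width_px / (bbox_width_pt / POINTS_PER_INCH)
    dpi_y = height_px / (bbox_height_pt / POINTS_PER_INCH)
    return dpi_x, dpi_y


def find_duplicates(images: list[bytes]) -> dict[int, int]:
    """Map the index of each repeated image to the index of its first occurrence."""
    first_seen: dict[str, int] = {}
    duplicates: dict[int, int] = {}
    for index, data in enumerate(images):
        digest = hashlib.sha256(data).hexdigest()
        if digest in first_seen:
            duplicates[index] = first_seen[digest]
        else:
            first_seen[digest] = index
    return duplicates


# A 600x300 px image placed in a 144x72 pt box (2 in x 1 in) is 300 DPI.
print(estimate_dpi(600, 300, 144.0, 72.0))  # (300.0, 300.0)
# The same bytes appear twice, so index 2 is a duplicate of index 0.
print(find_duplicates([b"logo", b"chart", b"logo"]))  # {2: 0}
```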

`pdfstruct.generate_thumbnail(image_bytes, max_size=(150, 150), output_format="PNG")`

Generate a thumbnail from extracted image bytes. Returns thumbnail bytes or None.

Metadata

metadata.word_count, metadata.char_count — Text statistics
metadata.language — Detected language code
metadata.page_count — Number of pages
metadata.is_scanned — Whether PDF appears to be scanned
metadata.has_tables, metadata.has_images — Content flags
metadata.primary_font, metadata.primary_font_size — Font info
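
Text statistics such as word count, character count, and reading time can be derived from the extracted text alone. The sketch below is one plausible way to compute them. `text_stats` and the 200 words-per-minute default are assumptions made for this example, and the library may count words or characters differently.

```python
import math


def text_stats(text: str, words_per_minute: int = 200) -> dict[str, int]:
    """Compute simple statistics from whitespace-separated words."""
    words = text.split()
    return {
        "word_count": len(words),
        "char_count": len(text),  # assumption: whitespace is included
        "reading_time_minutes": math.ceil(len(words) / words_per_minute),
    }


sample = "pdfstruct extracts structured content from PDF documents."
print(text_stats(sample))
```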

Comparison

Feature	pdfstructx	PyMuPDF	pdfplumber	Unstructured
Text extraction	✅	✅	✅	✅
Section hierarchy (H1–H6 tree)	✅	❌	❌	Partial
Font-aware heading detection	✅	❌	❌	❌
Table extraction	✅	❌	✅	✅
Image extraction + metadata	✅	✅	❌	✅
Caption detection	✅	❌	❌	❌
Image deduplication	✅	❌	❌	❌
DPI estimation	✅	❌	❌	❌
Thumbnail generation	✅	❌	❌	❌
Multi-column layout	✅	❌	❌	✅
Header/footer removal	✅	❌	❌	✅
List detection	✅	❌	❌	✅
Language detection	✅	❌	❌	✅
Reading time / word count	✅	❌	❌	❌
Markdown export	✅	❌	❌	✅
JSON structured output	✅	❌	❌	✅
Pure Python (no Java/Docker)	✅	✅	✅	❌
License	Apache 2.0	AGPL	MIT	Apache 2.0

Real-World Benchmarks

Tested on actual documents — not toy examples:

Document	Pages	Words	Sections	Tables	Images (unique)	Time
3-page CV	3	863	1	3	0	164 ms
Bank statement (French)	5	1,880	23	2	2 (1)	379 ms
130-page gov't RFP	130	41,420	62	73	269 (8)	10.2 s
224-page procurement doc	224	53,979	107	118	408 (58)	23.6 s

Head-to-head on the 130-page RFP:

Library	Time	Words	Tables	Sections	Images	Dedup
PyMuPDF	277 ms	43,455	❌ N/A	❌ N/A	270	❌ No
pdfplumber	16.5 s	43,420	142	❌ N/A	❌ N/A	❌ No
pdfstructx	13.1 s	41,420	73	62	269 (8 unique)	✅ 261 dupes filtered

PyMuPDF is faster (C-based) but returns flat text: no sections, no structure, no deduplication. pdfplumber finds tables but builds no section hierarchy. Of the three, only pdfstructx reports sections, tables, and deduplicated images together.
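
The timings are easier to compare as throughput. The short calculation below uses only the page counts and times from the first benchmark table, plus the image counts for the RFP. `pages_per_second` is a helper defined here for illustration.

```python
# (document, pages, seconds) taken from the benchmark table above
runs = [
    ("3-page CV", 3, 0.164),
    ("Bank statement (French)", 5, 0.379),
    ("130-page gov't RFP", 130, 10.2),
    ("224-page procurement doc", 224, 23.6),
]


def pages_per_second(pages: int, seconds: float) -> float:
    return pages / seconds


for name, pages, seconds in runs:
    print(f"{name}: {pages_per_second(pages, seconds):.1f} pages/s")

# Deduplication on the RFP: 269 extracted images, 8 of them unique.
duplicates_filtered = 269 - 8
print(f"{duplicates_filtered} duplicates filtered")
```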

Architecture

pdfstruct/
├── parser.py           # Main PDFParser class and parse() entry point
├── models/
│   ├── document.py     # Core models: Document, Page, Section, TextLine, Table, ImageInfo, etc.
│   └── metadata.py     # DocumentMetadata with computed statistics
├── extractors/
│   ├── text.py         # PDF text extraction via pdfminer.six
│   └── images.py       # Image extraction, caption detection, dedup, thumbnails
├── layout/
│   └── analyzer.py     # Paragraph grouping, reading order, margins
├── structure/
│   ├── headings.py     # Font-aware heading detection
│   ├── headers_footers.py  # Repeating content detection
│   ├── lists.py        # List structure detection
│   └── sections.py     # Section hierarchy builder
├── tables/
│   └── detector.py     # Grid and whitespace table detection
├── output/
│   ├── json_output.py  # JSON/dict export
│   ├── markdown.py     # Markdown export
│   └── text_output.py  # Plain text export
└── utils/
    ├── fonts.py        # Font analysis and heading classification
    ├── geometry.py     # Bounding box utilities, column detection
    └── language.py     # Language detection heuristics
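
The tree describes headers_footers.py as repeating content detection. One common way to do this is to count lines that recur at the top or bottom of most pages, with digits masked so that page numbers match. The sketch below follows that idea. `normalize`, `find_repeating_lines`, and the thresholds are hypothetical and not taken from the pdfstruct source.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Mask digits so 'Page 3 of 10' and 'Page 4 of 10' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def find_repeating_lines(pages: list[list[str]], edge_lines: int = 2,
                         min_share: float = 0.6) -> set[str]:
    """Return normalized lines that recur at the top or bottom of most pages."""
    if len(pages) < 2:
        return set()  # repetition cannot be judged from a single page
    counts = Counter()
    for lines in pages:
        edges = lines[:edge_lines] + lines[-edge_lines:]
        # A set avoids counting the same line twice on short pages.
        counts.update({normalize(line) for line in edges if line.strip()})
    threshold = min_share * len(pages)
    return {text for text, n in counts.items() if n >= threshold}


pages = [
    ["ACME Corp Annual Report", "Introduction", "Body text one.", "More text.", "Page 1 of 3"],
    ["ACME Corp Annual Report", "Results", "Body text two.", "More text here.", "Page 2 of 3"],
    ["ACME Corp Annual Report", "Outlook", "Body text three.", "Closing text.", "Page 3 of 3"],
]
print(find_repeating_lines(pages))
```

Lines whose normalized form is in the returned set would then be filtered out before paragraphs and sections are built.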

Requirements

Python >= 3.10
pdfminer.six >= 20231228
Pillow >= 10.0.0

License

Apache License 2.0. See LICENSE for details.

Release history

Version	Released
0.2.4 (this version)	Feb 9, 2026
0.2.3	Feb 9, 2026
0.2.2	Feb 9, 2026
0.2.1	Feb 9, 2026
0.2.0	Feb 9, 2026

Download files

Source Distribution

pdfstructx-0.2.4.tar.gz (47.4 kB)

Uploaded Feb 9, 2026 Source

Built Distribution

pdfstructx-0.2.4-py3-none-any.whl (49.0 kB)

Uploaded Feb 9, 2026 Python 3

File details

Details for the file pdfstructx-0.2.4.tar.gz.

File metadata

Download URL: pdfstructx-0.2.4.tar.gz
Upload date: Feb 9, 2026
Size: 47.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for pdfstructx-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`5ec675fd416120c21b21a32f44cab216e26664b66a590089d1f7362925246017`
MD5	`109fdcad60bfcca6575edbbe703011cc`
BLAKE2b-256	`cea78ad1ba96f9edcfbf82a62d2b28a6e917dd65b768879a10144318c3b65336`

File details

Details for the file pdfstructx-0.2.4-py3-none-any.whl.

File metadata

Download URL: pdfstructx-0.2.4-py3-none-any.whl
Upload date: Feb 9, 2026
Size: 49.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for pdfstructx-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1a639d66a9b656aaeead765bdc002054276f5cbd856df438595d57136c569f06`
MD5	`700598a7e7a84aaefa6fdc5dae87fbc6`
BLAKE2b-256	`5af779cc513d39e19fc29390c7c73f42757d5a7d282deaa84745a18ef97f90fb`
