Intelligent PDF parser with font-aware structure detection, table extraction, and multi-column support
pdfstruct
The PDF parser built for AI pipelines. Structured sections, tables, images, and metadata — not just raw text.
Overview
pdfstruct is a Python library that extracts structured content from PDF documents. Unlike basic text extraction tools, pdfstruct understands document layout — detecting headings, sections, tables, lists, headers/footers, and multi-column layouts using font analysis and geometric reasoning.
Features
- Font-aware heading detection: Uses font size, weight, and frequency analysis to classify headings (H1–H6)
- Table extraction: Detects tables from grid lines and whitespace-aligned columns
- Image extraction: Extracts embedded images with metadata, DPI estimation, caption detection, and cross-page deduplication
- Section hierarchy: Builds a document tree from headings and content
- Multi-column support: Handles two-column and multi-column layouts
- Header/footer removal: Identifies and filters repeating page content
- List detection: Recognizes bulleted, numbered, lettered, and Roman numeral lists
- Thumbnail generation: Create thumbnails from extracted images
- Multiple output formats: JSON, Markdown, and plain text
- Rich metadata: Word count, language detection, reading time, font statistics, image stats
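As a rough illustration of the font-frequency idea behind heading detection, the sketch below treats the most common font size as body text and ranks each larger size as a heading level. It is a simplified, hypothetical stand-in: the function name `classify_heading_levels` and the `(text, font_size)` input shape are assumptions, font weight is ignored, and this is not pdfstruct's actual implementation.

```python
from collections import Counter


def classify_heading_levels(lines):
    """Map each (text, font_size) line to a heading level (1-6) or None.

    Assumption: the most frequent font size is body text, and every
    distinct larger size is a heading, with the largest size as level 1.
    """
    if not lines:
        return []
    sizes = Counter(size for _, size in lines)
    body_size = sizes.most_common(1)[0][0]
    # Distinct sizes larger than body text, largest first, capped at six levels.
    heading_sizes = sorted((s for s in sizes if s > body_size), reverse=True)[:6]
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}
    return [(text, level_of.get(size)) for text, size in lines]


sample = [
    ("Annual Report", 24.0),
    ("Introduction", 16.0),
    ("First body line.", 10.0),
    ("Second body line.", 10.0),
    ("Third body line.", 10.0),
    ("Footnote text", 8.0),
]
for text, level in classify_heading_levels(sample):
    print(level, text)
```

Text smaller than the body size (such as the footnote here) is left unclassified rather than being treated as a heading.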
Installation
pip install pdfstructx
Or install from source:
git clone https://github.com/Kyros-Groupe-Ltd/pdfstruct.git
cd pdfstruct
pip install -e .
Quickstart
import pdfstruct
# Parse a PDF
doc = pdfstruct.parse("contract.pdf")
# Access structured content
print(doc.title)
print(f"{doc.page_count} pages, {doc.metadata.word_count} words")
# Browse sections
for section in doc.sections:
    print(f"{section.heading} ({len(section.content)} chars)")
    for sub in section.subsections:
        print(f"  {sub.heading}")
# Get tables
for table in doc.tables:
    print(table.to_dicts())  # List of row dicts
# Extract images (opt-in)
doc = pdfstruct.parse("report.pdf", extract_images=True)
for page in doc.pages:
    for img in page.images:
        print(f"Page {img.page_number}: {img.format} {img.width_px}x{img.height_px} @ {img.dpi:.0f} DPI")
        if img.caption:
            print(f"  Caption: {img.caption}")
        if img.image_bytes:
            img.save(f"img_{img.page_number}_{img.image_index}.png")
            # Generate a thumbnail (returns thumbnail bytes or None)
            thumbnail = pdfstruct.generate_thumbnail(img.image_bytes, max_size=(150, 150))
# Export to different formats
print(pdfstruct.to_markdown(doc))
print(pdfstruct.to_text(doc))
print(pdfstruct.to_json(doc))
# Full dict for programmatic use
data = pdfstruct.to_dict(doc)
API Reference
pdfstruct.parse(source, **options) -> Document
Parse a PDF file, bytes, or file-like object.
Options:
- detect_tables (bool, default True): Enable table detection
- detect_headers_footers (bool, default True): Remove repeating headers/footers
- detect_lists (bool, default True): Detect list structures
- detect_columns (bool, default True): Handle multi-column layouts
- extract_images (bool, default False): Enable full image extraction (opt-in)
- extract_image_data (bool, default True): Include raw image bytes (only when extract_images=True)
Document
- doc.title: Detected document title
- doc.pages: List of Page objects
- doc.sections: Hierarchical section tree
- doc.tables: All detected tables
- doc.metadata: DocumentMetadata with statistics
- doc.text: Full document text (concatenated from pages)
- doc.to_dict(): JSON-serializable dictionary
Section
- section.heading: Section heading text
- section.heading_level: HeadingLevel enum (H1–H6)
- section.content: Section body text
- section.paragraphs: List of Paragraph objects
- section.subsections: Nested subsections
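Because sections nest through `subsections`, a short recursive walk is enough to print an outline. The sketch below relies only on the `heading` and `subsections` attributes listed above; the `outline` helper is hypothetical, and `SimpleNamespace` objects stand in for real sections so the example runs without a PDF.

```python
from types import SimpleNamespace


def outline(sections, depth=0):
    """Return indented heading lines for a tree of section-like objects."""
    lines = []
    for section in sections:
        lines.append("  " * depth + section.heading)
        # Recurse into nested subsections one indentation level deeper.
        lines.extend(outline(section.subsections, depth + 1))
    return lines


# Stand-in objects; real code would pass doc.sections instead.
demo = [
    SimpleNamespace(heading="1 Scope", subsections=[
        SimpleNamespace(heading="1.1 Definitions", subsections=[]),
    ]),
    SimpleNamespace(heading="2 Terms", subsections=[]),
]
print("\n".join(outline(demo)))
```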
Table
- table.rows: List of TableRow objects
- table.to_list(): 2D list of cell text
- table.to_dicts(): List of dicts (header row as keys)
- table.num_rows, table.num_cols: Dimensions
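The relationship between the 2D list form and the list-of-dicts form can be sketched in a few lines. The `rows_to_dicts` function below is a hypothetical illustration of "header row as keys", not the library's code; how pdfstruct handles duplicate header names or short rows is not stated here, so this sketch simply follows `zip` semantics.

```python
def rows_to_dicts(rows):
    """Turn a 2D list of cell text into dicts keyed by the first (header) row."""
    if not rows:
        return []
    header, *body = rows
    # zip() pairs each header cell with the cell in the same column.
    return [dict(zip(header, row)) for row in body]


rows = [["Item", "Qty"], ["Bolt", "12"], ["Nut", "30"]]
print(rows_to_dicts(rows))
```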
ImageInfo
- img.bbox: BBox position on page
- img.width_px, img.height_px: Pixel dimensions
- img.format: Image format (jpeg, png, jbig2, ccitt, jpeg2000, raw)
- img.colorspace: Color space (rgb, cmyk, grayscale, indexed)
- img.dpi_x, img.dpi_y, img.dpi: DPI (estimated from bbox vs pixel size)
- img.image_bytes: Raw image data (when extract_image_data=True)
- img.file_size_bytes: Size of extracted image data
- img.content_hash: SHA-256 hash for deduplication
- img.caption: Auto-detected caption text (Figure 1, Fig. 2, etc.)
- img.page_number, img.image_index: Location identifiers
- img.is_duplicate, img.duplicate_of_index: Cross-page deduplication
- img.save(path): Save image to file
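Estimating DPI from the bounding box versus the pixel size comes down to one division per axis: PDF user space defaults to 72 points per inch, so the displayed size in inches is the box size in points divided by 72. The sketch below shows that arithmetic; `estimate_dpi` and its parameters are hypothetical names, and how the library combines the two axes into `img.dpi` is not covered.

```python
def estimate_dpi(width_px, height_px, bbox_width_pt, bbox_height_pt):
    """Estimate horizontal and vertical DPI for an image placed on a PDF page.

    Assumes the default PDF user space of 72 points per inch.
    """
    if bbox_width_pt <= 0 or bbox_height_pt <= 0:
        raise ValueError("bounding box must have positive width and height")
    dpi_x = width_px / (bbox_width_pt / 72.0)
    dpi_y = height_px / (bbox_height_pt / 72.0)
    return dpi_x, dpi_y


# A 600x300 px image drawn in a 144x72 pt box (2 in x 1 in).
print(estimate_dpi(600, 300, 144, 72))
```

The same image drawn in a larger box yields a lower DPI, which is why the estimate depends on placement and not only on the embedded pixel data.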
pdfstruct.generate_thumbnail(image_bytes, max_size=(150, 150), output_format="PNG")
Generate a thumbnail from extracted image bytes. Returns thumbnail bytes or None.
Metadata
- metadata.word_count, metadata.char_count: Text statistics
- metadata.language: Detected language code
- metadata.page_count: Number of pages
- metadata.is_scanned: Whether PDF appears to be scanned
- metadata.has_tables, metadata.has_images: Content flags
- metadata.primary_font, metadata.primary_font_size: Font info
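For a sense of how text statistics such as word count and reading time can be derived, here is a minimal sketch. The `text_stats` helper, the whitespace tokenization, and the 200 words-per-minute reading speed are all assumptions for illustration; the library's own definitions may differ.

```python
import math


def text_stats(text, words_per_minute=200):
    """Compute simple statistics from plain text.

    Assumptions: words are whitespace-separated tokens, and reading
    time is rounded up to whole minutes at the given reading speed.
    """
    words = text.split()
    minutes = math.ceil(len(words) / words_per_minute) if words else 0
    return {
        "word_count": len(words),
        "char_count": len(text),
        "reading_time_min": minutes,
    }


print(text_stats("one two three"))
```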
Comparison
| Feature | pdfstructx | PyMuPDF | pdfplumber | Unstructured |
|---|---|---|---|---|
| Text extraction | ✅ | ✅ | ✅ | ✅ |
| Section hierarchy (H1–H6 tree) | ✅ | ❌ | ❌ | Partial |
| Font-aware heading detection | ✅ | ❌ | ❌ | ❌ |
| Table extraction | ✅ | ❌ | ✅ | ✅ |
| Image extraction + metadata | ✅ | ✅ | ❌ | ✅ |
| Caption detection | ✅ | ❌ | ❌ | ❌ |
| Image deduplication | ✅ | ❌ | ❌ | ❌ |
| DPI estimation | ✅ | ❌ | ❌ | ❌ |
| Thumbnail generation | ✅ | ❌ | ❌ | ❌ |
| Multi-column layout | ✅ | ❌ | ❌ | ✅ |
| Header/footer removal | ✅ | ❌ | ❌ | ✅ |
| List detection | ✅ | ❌ | ❌ | ✅ |
| Language detection | ✅ | ❌ | ❌ | ✅ |
| Reading time / word count | ✅ | ❌ | ❌ | ❌ |
| Markdown export | ✅ | ❌ | ❌ | ✅ |
| JSON structured output | ✅ | ❌ | ❌ | ✅ |
| Pure Python (no Java/Docker) | ✅ | ✅ | ✅ | ❌ |
| License | Apache 2.0 | AGPL | MIT | Apache 2.0 |
Real-World Benchmarks
Tested on actual documents — not toy examples:
| Document | Pages | Words | Sections | Tables | Images (unique) | Time |
|---|---|---|---|---|---|---|
| 3-page CV | 3 | 863 | 1 | 3 | 0 | 164 ms |
| Bank statement (French) | 5 | 1,880 | 23 | 2 | 2 (1) | 379 ms |
| 130-page gov't RFP | 130 | 41,420 | 62 | 73 | 269 (8) | 10.2 s |
| 224-page procurement doc | 224 | 53,979 | 107 | 118 | 408 (58) | 23.6 s |
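To compare documents of different lengths, the timings above can be converted to pages per second. The snippet only restates the table's page counts and times; `pages_per_second` is a hypothetical helper.

```python
# (pages, seconds) taken directly from the benchmark table above.
benchmarks = {
    "3-page CV": (3, 0.164),
    "Bank statement (French)": (5, 0.379),
    "130-page gov't RFP": (130, 10.2),
    "224-page procurement doc": (224, 23.6),
}


def pages_per_second(pages, seconds):
    """Throughput as pages parsed per second of wall-clock time."""
    return pages / seconds


for name, (pages, seconds) in benchmarks.items():
    print(f"{name}: {pages_per_second(pages, seconds):.1f} pages/s")
```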
Head-to-head on the 130-page RFP:
| Library | Time | Words | Tables | Sections | Images | Dedup |
|---|---|---|---|---|---|---|
| PyMuPDF | 277 ms | 43,455 | ❌ N/A | ❌ N/A | 270 | ❌ No |
| pdfplumber | 16.5 s | 43,420 | 142 | ❌ N/A | ❌ N/A | ❌ No |
| pdfstructx | 13.1 s | 41,420 | 73 | 62 | 269 (8 unique) | ✅ 261 dupes filtered |
PyMuPDF is faster (C-based) but returns flat text: no sections, no structure, no deduplication. pdfplumber finds tables but builds no section hierarchy. Of the three, only pdfstructx reports sections, tables, and deduplicated images together.
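The deduplication figure follows from the image counts: 269 images with 8 unique leaves 261 duplicates. A content-hash approach to this can be sketched as below, using SHA-256 as the ImageInfo `content_hash` field does. The `mark_duplicates` function is a hypothetical illustration that returns values shaped like `is_duplicate` and `duplicate_of_index`; it is not the library's implementation.

```python
import hashlib


def mark_duplicates(images):
    """Return (is_duplicate, duplicate_of_index) for each raw image byte string."""
    first_seen = {}  # SHA-256 hex digest -> index of first occurrence
    results = []
    for index, data in enumerate(images):
        digest = hashlib.sha256(data).hexdigest()
        if digest in first_seen:
            results.append((True, first_seen[digest]))
        else:
            first_seen[digest] = index
            results.append((False, None))
    return results


logo, chart = b"logo-bytes", b"chart-bytes"
flags = mark_duplicates([logo, chart, logo, logo])
print(flags)
print(sum(is_dup for is_dup, _ in flags), "duplicates filtered")
```

Hashing catches byte-identical images only, such as a logo repeated on every page; visually similar images with different encodings would not match.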
Architecture
pdfstruct/
├── parser.py # Main PDFParser class and parse() entry point
├── models/
│ ├── document.py # Core models: Document, Page, Section, TextLine, Table, ImageInfo, etc.
│ └── metadata.py # DocumentMetadata with computed statistics
├── extractors/
│ ├── text.py # PDF text extraction via pdfminer.six
│ └── images.py # Image extraction, caption detection, dedup, thumbnails
├── layout/
│ └── analyzer.py # Paragraph grouping, reading order, margins
├── structure/
│ ├── headings.py # Font-aware heading detection
│ ├── headers_footers.py # Repeating content detection
│ ├── lists.py # List structure detection
│ └── sections.py # Section hierarchy builder
├── tables/
│ └── detector.py # Grid and whitespace table detection
├── output/
│ ├── json_output.py # JSON/dict export
│ ├── markdown.py # Markdown export
│ └── text_output.py # Plain text export
└── utils/
    ├── fonts.py # Font analysis and heading classification
    ├── geometry.py # Bounding box utilities, column detection
    └── language.py # Language detection heuristics
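To make one of these stages concrete, the sketch below illustrates the repeating-content idea behind header/footer removal: a line that shows up on most pages is treated as a header or footer and filtered out. The function names and the 60% threshold are assumptions, and the sketch ignores line position and varying text such as page numbers, which a real detector would need to handle.

```python
from collections import Counter


def find_repeating_lines(pages, min_fraction=0.6):
    """Return lines that appear on at least min_fraction of pages.

    pages: list of pages, each a list of text lines.
    """
    if not pages:
        return set()
    counts = Counter()
    for lines in pages:
        counts.update(set(lines))  # count each distinct line once per page
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_repeating_lines(pages, min_fraction=0.6):
    """Drop repeating lines from every page, keeping page order."""
    repeating = find_repeating_lines(pages, min_fraction)
    return [[line for line in lines if line not in repeating] for lines in pages]


pages = [
    ["ACME Corp", "Intro text", "Confidential"],
    ["ACME Corp", "More text", "Confidential"],
    ["ACME Corp", "Final text"],
]
print(strip_repeating_lines(pages))
```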
Requirements
- Python >= 3.10
- pdfminer.six >= 20231228
- Pillow >= 10.0.0
License
Apache License 2.0. See LICENSE for details.
Download files
File details
Details for the file pdfstructx-0.2.4.tar.gz.
File metadata
- Download URL: pdfstructx-0.2.4.tar.gz
- Upload date:
- Size: 47.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ec675fd416120c21b21a32f44cab216e26664b66a590089d1f7362925246017 |
| MD5 | 109fdcad60bfcca6575edbbe703011cc |
| BLAKE2b-256 | cea78ad1ba96f9edcfbf82a62d2b28a6e917dd65b768879a10144318c3b65336 |
File details
Details for the file pdfstructx-0.2.4-py3-none-any.whl.
File metadata
- Download URL: pdfstructx-0.2.4-py3-none-any.whl
- Upload date:
- Size: 49.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1a639d66a9b656aaeead765bdc002054276f5cbd856df438595d57136c569f06 |
| MD5 | 700598a7e7a84aaefa6fdc5dae87fbc6 |
| BLAKE2b-256 | 5af779cc513d39e19fc29390c7c73f42757d5a7d282deaa84745a18ef97f90fb |