PDF text, table, image, and form extraction utilities

Text Peeler

One command to extract text, tables, images, and forms from any PDF.

Text Peeler analyzes your document, picks the right extraction strategy, and delivers clean output in the format you need. Digital PDFs, scanned pages, mixed documents, ebooks: one tool handles them all.

Install

pip install text-peeler

OCR support requires Tesseract:

brew install tesseract    # macOS
apt install tesseract-ocr # Debian/Ubuntu

See the full installation guide for all options.

Quick Start

# Auto-detect and extract
text-peeler-detect report.pdf

# Extract text from a digital PDF
text-peeler-native report.pdf

# Pull tables as JSON
text-peeler-tables report.pdf --format json

# Run every relevant extractor at once
text-peeler-ensemble report.pdf

Or use the shell router:

./extract.sh auto report.pdf
./extract.sh tables report.pdf output.csv --format csv

See the quickstart guide for more examples.
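For pipelines driven from Python, the commands above can be wrapped with `subprocess`. This is a minimal sketch, not part of Text Peeler: `build_command` and `extract_tables` are hypothetical helper names, and it assumes `text-peeler-tables` prints its JSON result to standard output when no output path is given, which this page does not confirm.

```python
import json
import subprocess

# Console scripts shown in the Quick Start. Other modes may follow the same
# naming pattern, but only these four are confirmed by this page.
KNOWN_COMMANDS = {"detect", "native", "tables", "ensemble"}


def build_command(mode, pdf_path, fmt=None):
    """Return the argv list for one text-peeler console script."""
    if mode not in KNOWN_COMMANDS:
        raise ValueError(f"unconfirmed mode: {mode!r}")
    argv = [f"text-peeler-{mode}", str(pdf_path)]
    if fmt is not None:
        argv += ["--format", fmt]
    return argv


def extract_tables(pdf_path):
    """Run the tables extractor and parse its output as JSON.

    Assumption: the JSON is written to stdout.
    """
    result = subprocess.run(
        build_command("tables", pdf_path, "json"),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

Keeping the argv construction separate from the `subprocess.run` call makes the command easy to log or test without Text Peeler installed.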

What It Does

Mode	Purpose
`native`	Digital PDFs with selectable text
`scanned`	Image-only PDFs via OCR
`mixed`	Per-page routing (native or OCR)
`tables`	Structured table extraction
`images`	Embedded images with surrounding context
`forms`	Fillable form field extraction
`epub`	EPUB ebook chapter extraction
`ebook`	Legacy ebook formats (.mobi, .lit, .prc)
`detect`	Analyze a PDF and recommend extractors
`ensemble`	Run all relevant extractors, merge into one JSON
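The per-page routing behind `mixed` can be pictured as a text-density check: pages with enough selectable text go to the native extractor, the rest go to OCR. The sketch below illustrates that idea only. The function names and the 50-character threshold are assumptions, not Text Peeler's actual logic.

```python
def route_page(page_text, min_chars=50):
    """Return "native" when a page has enough selectable text, else "ocr".

    min_chars is a hypothetical threshold chosen for illustration.
    """
    # Count non-whitespace characters so blank or near-blank pages go to OCR.
    visible = "".join(page_text.split())
    return "native" if len(visible) >= min_chars else "ocr"


def plan_routes(page_texts, min_chars=50):
    """Map each page's extracted text to the extractor that should handle it."""
    return [route_page(text, min_chars) for text in page_texts]
```

With pymupdf, the per-page input could come from `page.get_text()` for each page of an opened document. Scanned pages typically return an empty or near-empty string there, which is what the check relies on.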

Output Formats

Every extractor supports multiple output formats. Choose what fits your pipeline.

Extractor	Default	Supported
native	txt	txt, json, md
scanned	txt	txt, json, md
mixed	txt	txt, json, md
tables	json	json, csv, txt, md
images	json	json, md, txt
forms	json	json, txt, md, csv

See output format details for schema documentation.
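When calling the extractors from a script, it can help to validate the requested format before launching a process. The sketch below encodes the table above as data. `FORMATS` and `resolve_format` are hypothetical names for illustration and are not part of the package.

```python
# extractor -> (default format, supported formats), copied from the table above
FORMATS = {
    "native": ("txt", ("txt", "json", "md")),
    "scanned": ("txt", ("txt", "json", "md")),
    "mixed": ("txt", ("txt", "json", "md")),
    "tables": ("json", ("json", "csv", "txt", "md")),
    "images": ("json", ("json", "md", "txt")),
    "forms": ("json", ("json", "txt", "md", "csv")),
}


def resolve_format(extractor, requested=None):
    """Return the format to use, falling back to the extractor's default."""
    default, supported = FORMATS[extractor]
    if requested is None:
        return default
    if requested not in supported:
        raise ValueError(
            f"{extractor} does not support {requested!r}; "
            f"choose one of {', '.join(supported)}"
        )
    return requested
```

For example, `resolve_format("tables")` falls back to `json`, while `resolve_format("native", "csv")` raises because only `tables` and `forms` list CSV.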

Use Cases

Text Peeler is built for pipelines. Here are the most common workflows:

LLM Ingestion: Feed PDFs into language models as clean, structured text
Batch Processing: Process hundreds of mixed PDFs in a single pass
Scanned Documents: OCR pipeline with smart per-page text density checks
Table Extraction: Pull tabular data out of PDFs as CSV or JSON
Form Extraction: Extract fillable form fields with type-aware parsing
Image Extraction: Recover embedded images with page context
Ebook Conversion: Chapter-level text from EPUB and legacy ebook formats
Document Auditing: Characterize a PDF without extracting anything

Architecture

Each extractor is a standalone Python script. No shared base class, no deep inheritance. Shared formatting lives in output_utils.py.

text_peeler/
├── detect.py           # PDF analysis + routing
├── ensemble.py         # Multi-extractor runner
├── output_utils.py     # txt/json/md/csv formatting
└── extractors/
    ├── native.py       # pymupdf text
    ├── scanned.py      # pymupdf + tesseract OCR
    ├── mixed.py        # per-page routing
    ├── tables.py       # pdfplumber tables
    ├── images.py       # pymupdf images
    ├── forms.py        # pymupdf form widgets
    ├── epub.py         # ebooklib EPUB
    └── ebook.py        # Calibre ebook conversion

See the architecture guide for implementation details.
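To make the "standalone scripts plus shared formatting" layout concrete, here is a rough sketch of the kind of helper a module like output_utils.py could expose. The function name `format_records`, its signature, and the record shape (a list of flat dicts with identical keys) are assumptions for illustration. The real module may look quite different.

```python
import csv
import io
import json

SUPPORTED = ("txt", "json", "md", "csv")


def format_records(records, fmt):
    """Render a list of flat dicts as txt, json, md, or csv text."""
    if fmt not in SUPPORTED:
        raise ValueError(f"unsupported format: {fmt!r}")
    if fmt == "json":
        return json.dumps(records, indent=2)
    if not records:
        return ""
    fields = list(records[0].keys())
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    if fmt == "md":
        # Build a pipe table: header row, separator row, then one row per record.
        header = "| " + " | ".join(fields) + " |"
        rule = "| " + " | ".join("---" for _ in fields) + " |"
        rows = ["| " + " | ".join(str(r[f]) for f in fields) + " |" for r in records]
        return "\n".join([header, rule, *rows])
    # fmt == "txt": one "key: value" line per field, blank line between records
    return "\n\n".join("\n".join(f"{f}: {r[f]}" for f in fields) for r in records)
```

Each extractor script could then produce plain records and hand them to one function like this, which keeps the scripts independent without duplicating format logic.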

Documentation

Guide	Description
Installation	All install methods, system dependencies, troubleshooting
Quickstart	First extraction in under a minute
CLI Reference	Every flag and option for every mode
Output Formats	JSON schemas, CSV layouts, Markdown structure
Architecture	How the pieces fit together

License

AGPL-3.0

Release history

1.0.0 (this version): Feb 10, 2026
0.1.0: Feb 10, 2026

Download files

Source Distribution

text_peeler-1.0.0.tar.gz (24.5 kB)

Uploaded Feb 10, 2026 Source

Built Distribution

text_peeler-1.0.0-py3-none-any.whl (28.8 kB)

Uploaded Feb 10, 2026 Python 3

File details

Details for the file text_peeler-1.0.0.tar.gz.

File metadata

Download URL: text_peeler-1.0.0.tar.gz
Upload date: Feb 10, 2026
Size: 24.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.2 CPython/3.11.9 Darwin/24.6.0

File hashes

Hashes for text_peeler-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`f505bb295a70310fabda04b1e79ef72c42bb6ad6c179c73b906cc01dee3e6c1c`
MD5	`0df78771204d1214975ce2686f47dcd6`
BLAKE2b-256	`ee3a401f8242ef4d76501ca2b263c43e06dda8bb9a4417d590def153dfe3a3b3`
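A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch, and `sha256_of` and `matches_published_digest` are hypothetical helper names.

```python
import hashlib

# SHA256 digest published above for text_peeler-1.0.0.tar.gz
EXPECTED_SHA256 = "f505bb295a70310fabda04b1e79ef72c42bb6ad6c179c73b906cc01dee3e6c1c"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

Running `matches_published_digest("text_peeler-1.0.0.tar.gz")` on the downloaded source distribution should return `True` if the file is intact.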

File details

Details for the file text_peeler-1.0.0-py3-none-any.whl.

File metadata

Download URL: text_peeler-1.0.0-py3-none-any.whl
Upload date: Feb 10, 2026
Size: 28.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.2 CPython/3.11.9 Darwin/24.6.0

File hashes

Hashes for text_peeler-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`324c480b40706cde55692f84ab83f983653d38ff8e337649a22b2475c1c93c50`
MD5	`3d1f0019f6ee28f328fa4118fff65914`
BLAKE2b-256	`8b8fb54de49baebdc25479966fcb78f1ab7c3c88bcc2765401bd9cc30a817be3`
