PDF text, table, image, and form extraction utilities

Text Peeler

One command to extract text, tables, images, and forms from any PDF.

Text Peeler analyzes your document, picks the right extraction strategy, and delivers clean output in the format you need. Digital PDFs, scanned pages, mixed documents, ebooks: one tool handles them all.

Install

pip install text-peeler

OCR support requires Tesseract:

brew install tesseract    # macOS
apt install tesseract-ocr # Debian/Ubuntu

See the full installation guide for all options.

Quick Start

# Auto-detect and extract
text-peeler-detect report.pdf

# Extract text from a digital PDF
text-peeler-native report.pdf

# Pull tables as JSON
text-peeler-tables report.pdf --format json

# Run every relevant extractor at once
text-peeler-ensemble report.pdf

Or use the shell router:

./extract.sh auto report.pdf
./extract.sh tables report.pdf output.csv --format csv

See the quickstart guide for more examples.
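
For pipelines that drive the CLI from Python, the Quick Start commands can be wrapped with subprocess. The sketch below is an illustration only and is not part of Text Peeler: build_command and run_extractor are hypothetical helper names. It assumes every mode follows the text-peeler-<mode> naming shown above (only detect, native, tables, and ensemble appear in the examples) and that the extractor writes its result to standard output.

```python
import subprocess
from typing import List, Optional


def build_command(mode: str, pdf_path: str, fmt: Optional[str] = None) -> List[str]:
    """Build an argument list such as ['text-peeler-tables', 'report.pdf', '--format', 'json']."""
    # Assumption: each mode is exposed as a `text-peeler-<mode>` console script.
    cmd = [f"text-peeler-{mode}", pdf_path]
    if fmt is not None:
        cmd += ["--format", fmt]
    return cmd


def run_extractor(mode: str, pdf_path: str, fmt: Optional[str] = None) -> str:
    """Run one extractor and return whatever it prints to stdout."""
    # Assumption: the extractor writes its result to standard output.
    result = subprocess.run(
        build_command(mode, pdf_path, fmt),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Calling run_extractor("tables", "report.pdf", "json") would then run the same command as the tables example above, provided text-peeler is installed and on the PATH.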

What It Does

Mode      Purpose
native    Digital PDFs with selectable text
scanned   Image-only PDFs via OCR
mixed     Per-page routing (native or OCR)
tables    Structured table extraction
images    Embedded images with surrounding context
forms     Fillable form field extraction
epub      EPUB ebook chapter extraction
ebook     Legacy ebook formats (.mobi, .lit, .prc)
detect    Analyze a PDF and recommend extractors
ensemble  Run all relevant extractors, merge into one JSON
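
The per-page routing idea behind the mixed and detect modes can be pictured with a small sketch. This is not Text Peeler's actual heuristic: the function names and the 20-character threshold are assumptions made for illustration. The input is the text a native extractor returned for each page; pages with too little selectable text are routed to OCR.

```python
from typing import List


def route_page(native_text: str, min_chars: int = 20) -> str:
    """Return 'native' if the page has enough selectable text, else 'scanned'."""
    # Assumption: fewer than `min_chars` non-whitespace characters means the
    # page is effectively image-only. The threshold is illustrative.
    visible = "".join(native_text.split())
    return "native" if len(visible) >= min_chars else "scanned"


def classify_document(page_texts: List[str], min_chars: int = 20) -> str:
    """Recommend 'native', 'scanned', or 'mixed' for a whole document."""
    if not page_texts:
        raise ValueError("document has no pages")
    routes = {route_page(text, min_chars) for text in page_texts}
    if routes == {"native"}:
        return "native"
    if routes == {"scanned"}:
        return "scanned"
    return "mixed"  # some pages need OCR, others do not
```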

Output Formats

Each extractor listed below supports several output formats. Choose the one that fits your pipeline.

Extractor  Default  Supported
native     txt      txt, json, md
scanned    txt      txt, json, md
mixed      txt      txt, json, md
tables     json     json, csv, txt, md
images     json     json, md, txt
forms      json     json, txt, md, csv

See output format details for schema documentation.
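
When scripting around the CLI, it can help to validate a requested format before launching an extractor. The sketch below encodes the defaults and supported formats from the table above; the FORMATS mapping and resolve_format function are hypothetical names and are not part of Text Peeler.

```python
from typing import Dict, Optional, Tuple

# Default and supported formats, copied from the table above.
FORMATS: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "native": ("txt", ("txt", "json", "md")),
    "scanned": ("txt", ("txt", "json", "md")),
    "mixed": ("txt", ("txt", "json", "md")),
    "tables": ("json", ("json", "csv", "txt", "md")),
    "images": ("json", ("json", "md", "txt")),
    "forms": ("json", ("json", "txt", "md", "csv")),
}


def resolve_format(extractor: str, requested: Optional[str] = None) -> str:
    """Return the format to use, falling back to the extractor's default."""
    default, supported = FORMATS[extractor]  # KeyError for an unknown extractor
    if requested is None:
        return default
    if requested not in supported:
        raise ValueError(
            f"{extractor} does not support {requested!r}; choose from {', '.join(supported)}"
        )
    return requested
```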

Use Cases

Text Peeler is built for pipelines. Here are the most common workflows:

Architecture

Each extractor is a standalone Python script. No shared base class, no deep inheritance. Shared formatting lives in output_utils.py.

text_peeler/
├── detect.py           # PDF analysis + routing
├── ensemble.py         # Multi-extractor runner
├── output_utils.py     # txt/json/md/csv formatting
└── extractors/
    ├── native.py       # pymupdf text
    ├── scanned.py      # pymupdf + tesseract OCR
    ├── mixed.py        # per-page routing
    ├── tables.py       # pdfplumber tables
    ├── images.py       # pymupdf images
    ├── forms.py        # pymupdf form widgets
    ├── epub.py         # ebooklib EPUB
    └── ebook.py        # Calibre ebook conversion

See the architecture guide for implementation details.
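
To illustrate the idea of standalone extractors that share only a formatting helper, here is a minimal sketch of what such a helper could look like. It does not reproduce the real output_utils.py: the function name format_pages, the JSON layout, and the Markdown headings are all assumptions chosen for the example.

```python
import json
from typing import List


def format_pages(pages: List[str], fmt: str = "txt") -> str:
    """Render a list of per-page strings as txt, json, or md."""
    if fmt == "txt":
        # Plain text: pages separated by a blank line.
        return "\n\n".join(pages)
    if fmt == "json":
        # Hypothetical schema: one object per page, numbered from 1.
        records = [{"page": i, "text": text} for i, text in enumerate(pages, start=1)]
        return json.dumps({"pages": records}, indent=2)
    if fmt == "md":
        # Hypothetical layout: one second-level heading per page.
        return "\n\n".join(
            f"## Page {i}\n\n{text}" for i, text in enumerate(pages, start=1)
        )
    raise ValueError(f"unsupported format: {fmt!r}")
```

With a helper like this, each extractor script only has to produce a list of page strings and pass it along with the requested format, which keeps the scripts independent of one another.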

Documentation

Guide           Description
Installation    All install methods, system dependencies, troubleshooting
Quickstart      First extraction in under a minute
CLI Reference   Every flag and option for every mode
Output Formats  JSON schemas, CSV layouts, Markdown structure
Architecture    How the pieces fit together

See Also

Gutenfetchen (PyPI) - Bulk download and process public domain texts from Project Gutenberg. Pairs well with Text Peeler for building large text corpora from mixed sources.

License

AGPL-3.0

Download files

Source Distribution

text_peeler-1.0.0.tar.gz (24.5 kB view details)

Uploaded Source

Built Distribution

text_peeler-1.0.0-py3-none-any.whl (28.8 kB view details)

Uploaded Python 3

File details

Details for the file text_peeler-1.0.0.tar.gz.

File metadata

  • Download URL: text_peeler-1.0.0.tar.gz
  • Upload date:
  • Size: 24.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.2 CPython/3.11.9 Darwin/24.6.0

File hashes

Hashes for text_peeler-1.0.0.tar.gz
Algorithm    Hash digest
SHA256       f505bb295a70310fabda04b1e79ef72c42bb6ad6c179c73b906cc01dee3e6c1c
MD5          0df78771204d1214975ce2686f47dcd6
BLAKE2b-256  ee3a401f8242ef4d76501ca2b263c43e06dda8bb9a4417d590def153dfe3a3b3

See more details on using hashes here.

File details

Details for the file text_peeler-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: text_peeler-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 28.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.2 CPython/3.11.9 Darwin/24.6.0

File hashes

Hashes for text_peeler-1.0.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       324c480b40706cde55692f84ab83f983653d38ff8e337649a22b2475c1c93c50
MD5          3d1f0019f6ee28f328fa4118fff65914
BLAKE2b-256  8b8fb54de49baebdc25479966fcb78f1ab7c3c88bcc2765401bd9cc30a817be3
