PDF text, table, image, and form extraction utilities

Text Peeler

One command to extract text, tables, images, and forms from any PDF.

Text Peeler analyzes your document, picks the right extraction strategy, and delivers clean output in the format you need. Digital PDFs, scanned pages, mixed documents, ebooks: one tool handles them all.

Install

pip install text-peeler

OCR support requires Tesseract:

brew install tesseract    # macOS
apt install tesseract-ocr # Debian/Ubuntu

See the full installation guide for all options.
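
Before running an OCR mode from a script, it can help to confirm that Tesseract is actually on the PATH. The following is a minimal sketch using only the Python standard library; the function names `tesseract_available` and `ocr_hint` are illustrative and not part of Text Peeler.

```python
import shutil


def tesseract_available(which=shutil.which):
    """Return True if a `tesseract` executable is found on PATH.

    `which` is injectable so the check can be exercised without Tesseract.
    """
    return which("tesseract") is not None


def ocr_hint(available):
    """Build a short message saying whether OCR modes can be expected to run."""
    if available:
        return "Tesseract found: OCR modes should be usable."
    return "Tesseract not found: install it before using OCR modes."


if __name__ == "__main__":
    print(ocr_hint(tesseract_available()))
```

This only checks that the binary exists; it does not verify language data or version.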

Quick Start

# Auto-detect and extract
text-peeler-detect report.pdf

# Extract text from a digital PDF
text-peeler-native report.pdf

# Pull tables as JSON
text-peeler-tables report.pdf --format json

# Run every relevant extractor at once
text-peeler-ensemble report.pdf

Or use the shell router:

./extract.sh auto report.pdf
./extract.sh tables report.pdf output.csv --format csv

See the quickstart guide for more examples.
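
The same commands can be driven from Python. This sketch builds the argument list for the console scripts shown above and, as an assumption, parses JSON from standard output. Whether `text-peeler-tables` prints to stdout or writes a file is not stated here, so check the CLI reference first. `build_command` and `extract_tables_json` are hypothetical helper names.

```python
import json
import subprocess

# Console-script names taken from the Quick Start commands.
COMMANDS = {
    "detect": "text-peeler-detect",
    "native": "text-peeler-native",
    "tables": "text-peeler-tables",
    "ensemble": "text-peeler-ensemble",
}


def build_command(mode, pdf_path, fmt=None):
    """Return the argv list for one extractor call."""
    if mode not in COMMANDS:
        raise ValueError(f"unknown mode: {mode}")
    argv = [COMMANDS[mode], str(pdf_path)]
    if fmt is not None:
        argv += ["--format", fmt]
    return argv


def extract_tables_json(pdf_path):
    """Run the tables extractor and parse its output.

    Assumption: the command prints JSON to stdout. If it writes to a file
    instead, read that file rather than stdout.
    """
    result = subprocess.run(
        build_command("tables", pdf_path, "json"),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

Passing an argument list rather than a shell string avoids quoting problems with file names that contain spaces.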

What It Does

Mode      Purpose
native    Digital PDFs with selectable text
scanned   Image-only PDFs via OCR
mixed     Per-page routing (native or OCR)
tables    Structured table extraction
images    Embedded images with surrounding context
forms     Fillable form field extraction
epub      EPUB ebook chapter extraction
ebook     Legacy ebook formats (.mobi, .lit, .prc)
detect    Analyze a PDF and recommend extractors
ensemble  Run all relevant extractors, merge into one JSON
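
To illustrate what per-page routing in `mixed` mode could look like, here is a hedged sketch that is not the project's actual logic. It assumes a page counts as digital when its embedded text layer has at least `min_chars` characters; the threshold, the input shape, and the function names are all hypothetical.

```python
def route_pages(page_char_counts, min_chars=20):
    """Pick an extractor per page from the length of its embedded text layer.

    page_char_counts: one int per page (hypothetical input; in practice this
    would come from a PDF library's text extraction).
    min_chars: hypothetical threshold below which a page is sent to OCR.
    """
    return ["native" if n >= min_chars else "ocr" for n in page_char_counts]


def summarize(routes):
    """Collapse per-page routes into a document-level mode name."""
    if not routes:
        raise ValueError("document has no pages")
    kinds = set(routes)
    if kinds == {"native"}:
        return "native"
    if kinds == {"ocr"}:
        return "scanned"
    return "mixed"


if __name__ == "__main__":
    routes = route_pages([1200, 0, 5, 300])
    print(routes, summarize(routes))
```

A real detector would likely look at more than character counts, for example image coverage per page.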

Output Formats

Each extractor listed below supports multiple output formats. Choose the one that fits your pipeline.

Extractor  Default  Supported
native     txt      txt, json, md
scanned    txt      txt, json, md
mixed      txt      txt, json, md
tables     json     json, csv, txt, md
images     json     json, md, txt
forms      json     json, txt, md, csv

See output format details for schema documentation.
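
The table above can be encoded as a small lookup so a pipeline validates `--format` before invoking an extractor. The defaults and supported formats are copied from the table; `FORMATS` and `resolve_format` are illustrative names, not part of the package.

```python
# extractor -> (default format, supported formats), copied from the table.
FORMATS = {
    "native": ("txt", ("txt", "json", "md")),
    "scanned": ("txt", ("txt", "json", "md")),
    "mixed": ("txt", ("txt", "json", "md")),
    "tables": ("json", ("json", "csv", "txt", "md")),
    "images": ("json", ("json", "md", "txt")),
    "forms": ("json", ("json", "txt", "md", "csv")),
}


def resolve_format(extractor, requested=None):
    """Return the format to use, falling back to the extractor's default."""
    if extractor not in FORMATS:
        raise ValueError(f"no format information for extractor: {extractor}")
    default, supported = FORMATS[extractor]
    if requested is None:
        return default
    if requested not in supported:
        raise ValueError(
            f"{extractor} does not support {requested}; "
            f"choose one of {', '.join(supported)}"
        )
    return requested
```

For example, `resolve_format("tables", "csv")` returns `"csv"`, while `resolve_format("native", "csv")` raises `ValueError` because the table lists no CSV output for `native`.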

Use Cases

Text Peeler is built for pipelines.

Architecture

Each extractor is a standalone Python script. No shared base class, no deep inheritance. Shared formatting lives in output_utils.py.

text_peeler/
├── detect.py           # PDF analysis + routing
├── ensemble.py         # Multi-extractor runner
├── output_utils.py     # txt/json/md/csv formatting
└── extractors/
    ├── native.py       # pymupdf text
    ├── scanned.py      # pymupdf + tesseract OCR
    ├── mixed.py        # per-page routing
    ├── tables.py       # pdfplumber tables
    ├── images.py       # pymupdf images
    ├── forms.py        # pymupdf form widgets
    ├── epub.py         # ebooklib EPUB
    └── ebook.py        # Calibre ebook conversion

See the architecture guide for implementation details.
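
As a rough illustration of the "shared formatting" idea, the sketch below shows one function that a standalone extractor script could call to render pages as txt, json, or md. It is not the contents of output_utils.py; the function name and the JSON and Markdown layouts are assumptions made for the example.

```python
import json


def format_pages(pages, fmt="txt"):
    """Render a list of page strings as txt, json, or md.

    The JSON keys and Markdown headings here are hypothetical; the real
    schemas are described in the output format documentation.
    """
    if fmt == "txt":
        return "\n\n".join(pages)
    if fmt == "json":
        return json.dumps(
            [{"page": i, "text": t} for i, t in enumerate(pages, start=1)],
            indent=2,
        )
    if fmt == "md":
        return "\n\n".join(
            f"## Page {i}\n\n{t}" for i, t in enumerate(pages, start=1)
        )
    raise ValueError(f"unsupported format: {fmt}")


if __name__ == "__main__":
    print(format_pages(["First page text.", "Second page text."], "md"))
```

Keeping formatting in one plain function, rather than a base class, matches the flat layout described above: each extractor stays runnable on its own and imports only what it needs.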

Documentation

Guide           Description
Installation    All install methods, system dependencies, troubleshooting
Quickstart      First extraction in under a minute
CLI Reference   Every flag and option for every mode
Output Formats  JSON schemas, CSV layouts, Markdown structure
Architecture    How the pieces fit together

See Also

Gutenfetchen (PyPI) - Bulk download and process public domain texts from Project Gutenberg. Pairs well with Text Peeler for building large text corpora from mixed sources.

License

MIT
