PDF text, table, image, and form extraction utilities
Text Peeler
One command to extract text, tables, images, and forms from any PDF.
Text Peeler analyzes your document, picks the right extraction strategy, and delivers clean output in the format you need. Digital PDFs, scanned pages, mixed documents, ebooks: one tool handles them all.
Install
```bash
pip install text-peeler
```
OCR support requires Tesseract:
```bash
brew install tesseract       # macOS
apt install tesseract-ocr    # Debian/Ubuntu
```
See the full installation guide for all options.
Quick Start
```bash
# Auto-detect and extract
text-peeler-detect report.pdf

# Extract text from a digital PDF
text-peeler-native report.pdf

# Pull tables as JSON
text-peeler-tables report.pdf --format json

# Run every relevant extractor at once
text-peeler-ensemble report.pdf
```
Or use the shell router:
```bash
./extract.sh auto report.pdf
./extract.sh tables report.pdf output.csv --format csv
```
See the quickstart guide for more examples.
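For pipelines driven from Python, the console commands above can be wrapped with `subprocess`. The following is a minimal sketch, not part of Text Peeler: `build_command` and `run_extractor` are hypothetical helper names, and the sketch assumes that the `--format` flag shown for `text-peeler-tables` is accepted in the same way by the other modes.

```python
import subprocess
from typing import List, Optional


def build_command(mode: str, pdf_path: str, fmt: Optional[str] = None) -> List[str]:
    """Build the argv list for a `text-peeler-<mode>` console command."""
    cmd = [f"text-peeler-{mode}", pdf_path]
    if fmt is not None:
        # Assumption: --format works for this mode as it does for tables.
        cmd += ["--format", fmt]
    return cmd


def run_extractor(
    mode: str, pdf_path: str, fmt: Optional[str] = None
) -> subprocess.CompletedProcess:
    """Run the command and capture whatever it writes to stdout and stderr."""
    return subprocess.run(
        build_command(mode, pdf_path, fmt),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.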
What It Does
| Mode | Purpose |
|---|---|
| native | Digital PDFs with selectable text |
| scanned | Image-only PDFs via OCR |
| mixed | Per-page routing (native or OCR) |
| tables | Structured table extraction |
| images | Embedded images with surrounding context |
| forms | Fillable form field extraction |
| epub | EPUB ebook chapter extraction |
| ebook | Legacy ebook formats (.mobi, .lit, .prc) |
| detect | Analyze a PDF and recommend extractors |
| ensemble | Run all relevant extractors, merge into one JSON |
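The per-page routing in `mixed` mode can be pictured as a text-density check: pages with enough selectable text go to the native extractor, and the rest go to OCR. The sketch below only illustrates that idea. The function name, the character-count heuristic, and the threshold of 50 are assumptions, not Text Peeler's actual logic.

```python
from typing import List

MIN_CHARS = 50  # assumed threshold; the real cutoff is not documented here


def route_pages(char_counts: List[int], min_chars: int = MIN_CHARS) -> List[str]:
    """Return 'native' or 'scanned' for each page, given its selectable-character count."""
    return ["native" if count >= min_chars else "scanned" for count in char_counts]


# Page 2 has almost no selectable text, so it is routed to OCR.
print(route_pages([1840, 3, 920]))  # ['native', 'scanned', 'native']
```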
Output Formats
Each extractor listed below supports multiple output formats. Choose the one that fits your pipeline.
| Extractor | Default | Supported |
|---|---|---|
| native | txt | txt, json, md |
| scanned | txt | txt, json, md |
| mixed | txt | txt, json, md |
| tables | json | json, csv, txt, md |
| images | json | json, md, txt |
| forms | json | json, txt, md, csv |
See output format details for schema documentation.
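When calling extractors from a script, it can help to check a requested format against the table before launching anything. The sketch below transcribes the table into two dictionaries; `resolve_format` is a hypothetical helper, not a Text Peeler function.

```python
from typing import Dict, Optional, Tuple

# Defaults and supported formats, transcribed from the table above.
DEFAULTS: Dict[str, str] = {
    "native": "txt", "scanned": "txt", "mixed": "txt",
    "tables": "json", "images": "json", "forms": "json",
}
SUPPORTED: Dict[str, Tuple[str, ...]] = {
    "native": ("txt", "json", "md"),
    "scanned": ("txt", "json", "md"),
    "mixed": ("txt", "json", "md"),
    "tables": ("json", "csv", "txt", "md"),
    "images": ("json", "md", "txt"),
    "forms": ("json", "txt", "md", "csv"),
}


def resolve_format(extractor: str, requested: Optional[str] = None) -> str:
    """Return the format to use, falling back to the extractor's default."""
    if extractor not in SUPPORTED:
        raise KeyError(f"unknown extractor: {extractor}")
    if requested is None:
        return DEFAULTS[extractor]
    if requested not in SUPPORTED[extractor]:
        raise ValueError(f"{extractor} does not support {requested!r}")
    return requested
```

For example, `resolve_format("tables")` falls back to `json`, while `resolve_format("native", "csv")` raises `ValueError` because the table lists no CSV output for `native`.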
Use Cases
Text Peeler is built for pipelines. Here are the most common workflows:
- LLM Ingestion: Feed PDFs into language models as clean, structured text
- Batch Processing: Process hundreds of mixed PDFs in a single pass
- Scanned Documents: OCR pipeline with smart per-page text density checks
- Table Extraction: Pull tabular data out of PDFs as CSV or JSON
- Form Extraction: Extract fillable form fields with type-aware parsing
- Image Extraction: Recover embedded images with page context
- Ebook Conversion: Chapter-level text from EPUB and legacy ebook formats
- Document Auditing: Characterize a PDF without extracting anything
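As one illustration of the batch workflow, the sketch below walks a directory and runs one console command per PDF. `plan_batch` and `run_batch` are hypothetical helpers, and the sketch assumes that the commands signal failure with a non-zero exit status, which is not confirmed here.

```python
import subprocess
from pathlib import Path
from typing import List


def plan_batch(directory: str, mode: str = "ensemble") -> List[List[str]]:
    """Return one `text-peeler-<mode>` command per PDF found under `directory`."""
    pdfs = sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [[f"text-peeler-{mode}", str(pdf)] for pdf in pdfs]


def run_batch(directory: str, mode: str = "ensemble") -> List[str]:
    """Run every planned command and return the paths that failed."""
    failed = []
    for cmd in plan_batch(directory, mode):
        result = subprocess.run(cmd)
        # Assumption: a non-zero exit status means the extraction failed.
        if result.returncode != 0:
            failed.append(cmd[-1])
    return failed
```

Separating planning from execution makes it easy to inspect the commands before processing a large directory.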
Architecture
Each extractor is a standalone Python script. No shared base class, no deep inheritance. Shared formatting lives in output_utils.py.
```
text_peeler/
├── detect.py            # PDF analysis + routing
├── ensemble.py          # Multi-extractor runner
├── output_utils.py      # txt/json/md/csv formatting
└── extractors/
    ├── native.py        # pymupdf text
    ├── scanned.py       # pymupdf + tesseract OCR
    ├── mixed.py         # per-page routing
    ├── tables.py        # pdfplumber tables
    ├── images.py        # pymupdf images
    ├── forms.py         # pymupdf form widgets
    ├── epub.py          # ebooklib EPUB
    └── ebook.py         # Calibre ebook conversion
```
See the architecture guide for implementation details.
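Because the extractors share formatting code rather than a base class, each script can collect its pages and hand them to a single helper. The sketch below shows what such a helper might look like. `format_pages` and its output layout are hypothetical and do not reflect the real `output_utils.py` API or the documented JSON schema.

```python
import json
from typing import List


def format_pages(pages: List[str], fmt: str = "txt") -> str:
    """Render a list of per-page strings as txt, json, or md."""
    if fmt == "txt":
        return "\n\n".join(pages)
    if fmt == "json":
        records = [{"page": i, "text": text} for i, text in enumerate(pages, start=1)]
        return json.dumps({"pages": records}, indent=2)
    if fmt == "md":
        return "\n\n".join(
            f"## Page {i}\n\n{text}" for i, text in enumerate(pages, start=1)
        )
    raise ValueError(f"unsupported format: {fmt}")
```

Keeping the formatter a plain function means an extractor script needs only an import, with no class hierarchy to subclass.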
Documentation
| Guide | Description |
|---|---|
| Installation | All install methods, system dependencies, troubleshooting |
| Quickstart | First extraction in under a minute |
| CLI Reference | Every flag and option for every mode |
| Output Formats | JSON schemas, CSV layouts, Markdown structure |
| Architecture | How the pieces fit together |
See Also
Gutenfetchen (PyPI) - Bulk download and process public domain texts from Project Gutenberg. Pairs well with Text Peeler for building large text corpora from mixed sources.
License