A tool for parsing and extracting text from PDF files with OCR capabilities
atai-pdf-tool
A command-line tool for parsing and extracting text from PDF files with OCR capabilities and performance optimization options.
Installation
pip install atai-pdf-tool
Usage
Command Line Interface
Basic Usage with Default Settings
atai-pdf-tool path/to/your/document.pdf -o output.json
Parallel Processing (Faster for Multi-core Systems)
atai-pdf-tool path/to/your/document.pdf -o output.json --parallel --max-workers 4
Lower DPI for Faster Processing
atai-pdf-tool path/to/your/document.pdf -o output.json --dpi 150
Batch Processing for Large PDFs (Memory-Efficient)
atai-pdf-tool path/to/your/document.pdf -o output.json --batch --batch-size 10
OCR-Only Mode with Parallel Processing
atai-pdf-tool path/to/your/document.pdf -o output.json --ocr-only --parallel --gpu
Process Specific Page Range with Optimizations
atai-pdf-tool path/to/your/document.pdf -s 5 -e 15 -o output.json --parallel --dpi 180 --gpu
Options:
- -s, --start-page: Starting page number (0-indexed, default: 0)
- -e, --end-page: Ending page number (0-indexed, default: last page)
- -o, --output: Output JSON file path (if not provided, prints to stdout)
- --ocr-only: Use OCR for all pages regardless of extractable text
- -l, --lang: Language for OCR processing (default: en)
- --parallel: Enable parallel processing for faster performance (multi-core systems)
- --max-workers: Control the number of parallel workers for processing
- --dpi: Control image resolution for OCR (lower DPI improves speed)
- --batch: Use memory-efficient batch processing for large PDFs
- --batch-size: Control the batch size for batch processing
- --ocr-threshold: Set the threshold for when to fall back to OCR
- --gpu: Enable GPU acceleration for OCR processing
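When the tool is driven from a script, the options above can be assembled into an argument list programmatically. The following is a minimal sketch, assuming the executable is available on PATH as `atai-pdf-tool`. The helper names `build_command` and `run_tool` and their parameters are hypothetical and not part of the package; only flags documented above are used.

```python
import shlex
import subprocess
from typing import List, Optional


def build_command(
    pdf_path: str,
    output: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    lang: Optional[str] = None,
    parallel: bool = False,
    max_workers: Optional[int] = None,
    dpi: Optional[int] = None,
    gpu: bool = False,
) -> List[str]:
    """Assemble an atai-pdf-tool argument list (hypothetical helper)."""
    cmd = ["atai-pdf-tool", pdf_path]
    if start_page is not None:
        cmd += ["-s", str(start_page)]
    if end_page is not None:
        cmd += ["-e", str(end_page)]
    if output is not None:
        cmd += ["-o", output]
    if lang is not None:
        cmd += ["-l", lang]
    if parallel:
        cmd.append("--parallel")
    if max_workers is not None:
        cmd += ["--max-workers", str(max_workers)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if gpu:
        cmd.append("--gpu")
    return cmd


def run_tool(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the assembled command; requires atai-pdf-tool to be installed."""
    return subprocess.run(cmd, check=True)


# Show the shell form of a command without executing it.
print(shlex.join(build_command("document.pdf", output="output.json", dpi=150)))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.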
Supported Languages
The language option (-l, --lang) accepts language codes supported by EasyOCR. Some common ones include:
- en: English
- ch_sim: Simplified Chinese
- ch_tra: Traditional Chinese
- fr: French
- de: German
- ja: Japanese
- ko: Korean
- es: Spanish
For a complete list of language codes, see the EasyOCR documentation.
As a Python Module
from atai_pdf_tool.parser import extract_pdf_pages, ocr_pdf, save_as_json
# Extract text from specific pages with English OCR
text = extract_pdf_pages("document.pdf", start_page=0, end_page=5, lang="en")
# Extract text with different language
chinese_text = extract_pdf_pages("chinese_document.pdf", lang="ch_sim")
# Extract without progress bar
text = extract_pdf_pages("document.pdf", show_progress=False)
# Save to JSON
save_as_json(text, "output.json")
# OCR an entire PDF with a specific language
french_ocr_text = ocr_pdf("french_document.pdf", lang="fr")
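To apply the same extraction settings to several files, the calls above can be wrapped in a small loop. This is a sketch only: `extract_many` is a hypothetical helper, not part of the package, and it assumes the extractor accepts a path followed by keyword arguments, as `extract_pdf_pages` does in the examples above. The extractor is passed in as an argument so the helper makes no assumption about its return type.

```python
from typing import Any, Callable, Dict, Iterable


def extract_many(
    pdf_paths: Iterable[str],
    extract: Callable[..., Any],
    **kwargs: Any,
) -> Dict[str, Any]:
    """Call `extract` on each path and collect the results keyed by path.

    Hypothetical helper: `extract` is expected to be
    atai_pdf_tool.parser.extract_pdf_pages or a compatible callable.
    """
    results: Dict[str, Any] = {}
    for path in pdf_paths:
        # Forward the shared keyword arguments (e.g. lang, show_progress).
        results[str(path)] = extract(path, **kwargs)
    return results


# Intended usage (requires atai-pdf-tool to be installed):
#   from atai_pdf_tool.parser import extract_pdf_pages
#   results = extract_many(["a.pdf", "b.pdf"], extract_pdf_pages,
#                          lang="en", show_progress=False)
```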
Key Improvements and Performance Enhancements
- Parallel Processing: Use multiple CPU cores for faster processing of large PDFs.
- DPI Control: Adjust the resolution for OCR processing to balance speed and quality (--dpi).
- Batch Processing: Process large PDFs in memory-efficient batches (--batch, --batch-size).
- GPU Acceleration: Leverage GPU resources for OCR processing (--gpu).
- OCR Threshold: Set a configurable threshold for when to switch to OCR processing (--ocr-threshold).
- Reused OCR Reader: Optimized OCR integration for better speed, especially with multi-page documents.
These updates allow you to customize the extraction process based on hardware capabilities, whether you're looking for faster processing or better memory efficiency.
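The idea behind combining batch and parallel processing can be illustrated with a short sketch. This is not the tool's actual implementation: `page_batches` and `process_in_batches` are hypothetical names, and treating the end page as inclusive is an assumption. It splits a 0-indexed page range into fixed-size batches and hands each batch to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterator, List, Tuple


def page_batches(
    start_page: int, end_page: int, batch_size: int
) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first, last) page ranges of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(start_page, end_page + 1, batch_size):
        yield first, min(first + batch_size - 1, end_page)


def process_in_batches(
    start_page: int,
    end_page: int,
    batch_size: int,
    handle_batch: Callable[[int, int], Any],
    max_workers: int = 4,
) -> List[Any]:
    """Apply handle_batch to each page range using worker threads.

    Results come back in page order because Executor.map preserves
    the order of its inputs.
    """
    batches = list(page_batches(start_page, end_page, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda rng: handle_batch(*rng), batches))


# Placeholder batch handler: count the pages in each range.
sizes = process_in_batches(0, 24, 10, lambda first, last: last - first + 1)
print(sizes)
```

Smaller batches keep fewer pages in memory at once, while more workers trade memory for throughput; the right balance depends on the hardware.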
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file atai_pdf_tool-0.1.1.tar.gz.
File metadata
- Download URL: atai_pdf_tool-0.1.1.tar.gz
- Upload date:
- Size: 9.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 908036064ef5c23785fad1829780930530392632eef04152437491e06b39642f |
| MD5 | ac0977ebf8c13f34217330ed98ec19b9 |
| BLAKE2b-256 | 55a86a1db91ecca1acead7e099dd0c161289fab727c1188b03a8580d7baa037a |
File details
Details for the file atai_pdf_tool-0.1.1-py3-none-any.whl.
File metadata
- Download URL: atai_pdf_tool-0.1.1-py3-none-any.whl
- Upload date:
- Size: 8.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 69068d709f5abef8a7f276b8db72cfc5719bd6e8272e2ec3f7ed274071d11c65 |
| MD5 | 9fb92c1fe2eba64732bea9a99f0bc330 |
| BLAKE2b-256 | 9d2e3fa283360a8d9f05289cffd1c9068e0050d480df30781177b67c91a68023 |