A tool for parsing and extracting text from PDF files with OCR capabilities

atai-pdf-tool

A command-line tool for parsing and extracting text from PDF files with OCR capabilities and performance optimization options.

Installation

pip install atai-pdf-tool

Usage

Command Line Interface

Basic Usage with Default Settings

atai-pdf-tool path/to/your/document.pdf -o output.json

Parallel Processing (Faster for Multi-core Systems)

atai-pdf-tool path/to/your/document.pdf -o output.json --parallel --max-workers 4
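
The idea behind --parallel and --max-workers can be sketched with the standard library. This is not the tool's implementation: process_page is a hypothetical stand-in for per-page extraction, and the thread pool is only one possible choice of worker pool.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_number: int) -> dict:
    """Hypothetical stand-in for per-page text extraction or OCR."""
    return {"page": page_number, "text": f"text of page {page_number}"}


def process_pages_parallel(page_numbers, max_workers=4):
    """Run process_page over the pages with a bounded worker pool.

    executor.map returns results in input order, so the output stays
    sorted by page even when pages finish out of order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_page, page_numbers))


# Pages 5 through 15, processed by at most four workers at a time.
results = process_pages_parallel(range(5, 16), max_workers=4)
```

A thread pool keeps the sketch short. For CPU-bound work, concurrent.futures.ProcessPoolExecutor offers the same map interface and is often the better fit.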

Lower DPI for Faster Processing

atai-pdf-tool path/to/your/document.pdf -o output.json --dpi 150
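
To see why a lower DPI is faster, it helps to work out how many pixels a rendered page contains. The sketch below assumes a US Letter page and uses 300 DPI purely as a comparison value; it is not stated to be the tool's default.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rasterized at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points (8.5 x 11 inches).
full = rendered_size(612, 792, 300)  # (2550, 3300)
half = rendered_size(612, 792, 150)  # (1275, 1650)

# Halving the DPI quarters the number of pixels the OCR engine has to scan.
ratio = (full[0] * full[1]) / (half[0] * half[1])  # 4.0
```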

Batch Processing for Large PDFs (Memory-Efficient)

atai-pdf-tool path/to/your/document.pdf -o output.json --batch --batch-size 10
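
Batch processing amounts to splitting the page range into fixed-size chunks so that only one chunk is held in memory at a time. A minimal sketch of that splitting follows; page_batches is a hypothetical helper, and treating the end page as inclusive is an assumption, since the README does not say.

```python
def page_batches(start_page: int, end_page: int, batch_size: int):
    """Yield (first, last) 0-indexed page ranges of at most batch_size pages.

    Assumption: end_page is inclusive.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    first = start_page
    while first <= end_page:
        last = min(first + batch_size - 1, end_page)
        yield first, last
        first = last + 1


# A 25-page document (pages 0..24) in batches of 10.
batches = list(page_batches(0, 24, batch_size=10))
# [(0, 9), (10, 19), (20, 24)]
```

Each (first, last) pair could then be passed as start_page and end_page to the extraction function, with the results written out before the next batch starts.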

OCR-Only Mode with Parallel Processing and GPU Acceleration

atai-pdf-tool path/to/your/document.pdf -o output.json --ocr-only --parallel --gpu

Process Specific Page Range with Optimizations

atai-pdf-tool path/to/your/document.pdf -s 5 -e 15 -o output.json --parallel --dpi 180 --gpu

Options:

  • -s, --start-page: Starting page number (0-indexed, default: 0)
  • -e, --end-page: Ending page number (0-indexed, default: last page)
  • -o, --output: Output JSON file path (if not provided, prints to stdout)
  • --ocr-only: Use OCR for all pages regardless of extractable text
  • -l, --lang: Language for OCR processing (default: en)
  • --parallel: Enable parallel processing (faster on multi-core systems)
  • --max-workers: Number of parallel workers (used with --parallel)
  • --dpi: Image resolution for OCR (lower DPI is faster but can reduce OCR quality)
  • --batch: Use memory-efficient batch processing for large PDFs
  • --batch-size: Batch size for batch processing (used with --batch)
  • --ocr-threshold: Threshold for when to fall back to OCR
  • --gpu: Enable GPU acceleration for OCR processing
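
When the CLI is driven from a script, the options above can be assembled into an argument list and passed to subprocess. The following is a minimal sketch: build_command and run are hypothetical helpers, only flags listed above are used, and run assumes the tool reports failure through a non-zero exit status.

```python
import subprocess


def build_command(pdf_path, output=None, start_page=None, end_page=None,
                  lang=None, parallel=False, max_workers=None, dpi=None,
                  gpu=False):
    """Assemble an atai-pdf-tool argument list from keyword options."""
    cmd = ["atai-pdf-tool", pdf_path]
    if start_page is not None:
        cmd += ["-s", str(start_page)]
    if end_page is not None:
        cmd += ["-e", str(end_page)]
    if output is not None:
        cmd += ["-o", output]
    if lang is not None:
        cmd += ["-l", lang]
    if parallel:
        cmd.append("--parallel")
    if max_workers is not None:
        cmd += ["--max-workers", str(max_workers)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if gpu:
        cmd.append("--gpu")
    return cmd


def run(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


cmd = build_command("document.pdf", output="output.json",
                    start_page=5, end_page=15,
                    parallel=True, dpi=180, gpu=True)
```

Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.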

Supported Languages

The language option (-l, --lang) accepts language codes supported by EasyOCR. Some common ones include:

  • en: English
  • ch_sim: Simplified Chinese
  • ch_tra: Traditional Chinese
  • fr: French
  • de: German
  • ja: Japanese
  • ko: Korean
  • es: Spanish

For a complete list of language codes, see the EasyOCR documentation.

As a Python Module

from atai_pdf_tool.parser import extract_pdf_pages, ocr_pdf, save_as_json

# Extract text from specific pages with English OCR
text = extract_pdf_pages("document.pdf", start_page=0, end_page=5, lang="en")

# Extract text with different language
chinese_text = extract_pdf_pages("chinese_document.pdf", lang="ch_sim")

# Extract without progress bar
text = extract_pdf_pages("document.pdf", show_progress=False)

# Save to JSON
save_as_json(text, "output.json")

# OCR an entire PDF with a specific language
french_ocr_text = ocr_pdf("french_document.pdf", lang="fr")

Key Improvements and Performance Enhancements

  • Parallel Processing: Use multiple CPU cores for faster processing of large PDFs.
  • DPI Control: Adjust the resolution for OCR processing to balance speed and quality (--dpi).
  • Batch Processing: Process large PDFs in memory-efficient batches (--batch, --batch-size).
  • GPU Acceleration: Leverage GPU resources for OCR processing (--gpu).
  • OCR Threshold: Set a configurable threshold for when to switch to OCR processing (--ocr-threshold).
  • Reused OCR Reader: The OCR reader is reused rather than recreated, which improves speed, especially with multi-page documents.

These updates allow you to customize the extraction process based on hardware capabilities, whether you're looking for faster processing or better memory efficiency.
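
The OCR fallback can be pictured as a per-page decision: keep the embedded text if there is enough of it, otherwise run OCR. The sketch below is an illustration only. needs_ocr is a hypothetical helper, and treating the threshold as a minimum count of non-whitespace characters (with 50 as an example value) is an assumption, since the unit and default of --ocr-threshold are not documented here.

```python
def needs_ocr(extracted_text: str, threshold: int = 50,
              ocr_only: bool = False) -> bool:
    """Decide whether a page should go through OCR.

    Assumption: threshold is a minimum number of non-whitespace
    characters that the embedded text layer must contain.
    """
    if ocr_only:
        # Mirrors --ocr-only: OCR every page regardless of embedded text.
        return True
    visible_chars = sum(1 for ch in extracted_text if not ch.isspace())
    return visible_chars < threshold


scanned_page = "   \n"                    # no usable text layer
digital_page = "Quarterly report. " * 20  # plenty of embedded text

decisions = [needs_ocr(scanned_page), needs_ocr(digital_page)]
```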

License

This project is licensed under the MIT License - see the LICENSE file for details.

