Extract text from PDFs using pypdfium2 with OCR fallback via pytesseract

pydocai

Extract text from PDF documents using pypdfium2 with automatic OCR fallback via pytesseract.

Features

  • Fast native text extraction using pypdfium2 (Apache 2.0 licensed)
  • Automatic OCR fallback for scanned documents using pytesseract
  • Smart detection of sparse/scanned PDFs that need OCR
  • Memory-efficient processing with lazy imports and temporary file handling
  • Page limit to prevent processing extremely large documents (default: 15 pages)

Installation

pip install pydocai

System Dependencies

This package requires Tesseract OCR to be installed on your system for the OCR fallback feature:

macOS:

brew install tesseract

Ubuntu/Debian:

sudo apt-get install tesseract-ocr

Windows: download and run the installer from https://github.com/UB-Mannheim/tesseract/wiki
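
Because the OCR fallback depends on the Tesseract binary, it can help to confirm that it is on the PATH before processing scanned PDFs. The following is a minimal sketch using only the standard library; the helper name `tesseract_available` is hypothetical and not part of pydocai.

```python
import shutil


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on the PATH."""
    # shutil.which returns the executable's path, or None when it is missing.
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found; the OCR fallback should be usable.")
    else:
        print("Tesseract not found; install it before relying on OCR.")
```

If the binary is installed but not on the PATH (which can happen on Windows), pytesseract also allows pointing at it explicitly through `pytesseract.pytesseract.tesseract_cmd`.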

Usage

from pydocai import extract_pdf_text

# Extract text and save to file
success = extract_pdf_text("document.pdf", "output.txt")

# Or let it auto-generate the output filename
success = extract_pdf_text("document.pdf")  # Creates document_extracted.txt
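
The auto-generated name in the second call can be reproduced roughly as follows. This is a sketch of the naming pattern only: `default_output_path` is a hypothetical helper, and placing the output next to the input file is an assumption that the description above does not confirm.

```python
from pathlib import Path


def default_output_path(pdf_path: str) -> Path:
    """Build '<stem>_extracted.txt' alongside the input PDF (assumed location)."""
    source = Path(pdf_path)
    # Swap the final path component for '<stem>_extracted.txt'.
    return source.with_name(f"{source.stem}_extracted.txt")


print(default_output_path("document.pdf"))  # document_extracted.txt
```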

How it works

  1. Attempts native text extraction with pypdfium2.
  2. Checks whether the extracted text has sufficient content (at least MIN_LINES_PER_PAGE lines per page, 5 by default).
  3. If the content is sparse (likely a scanned document), falls back to OCR with pytesseract.
  4. Saves the extracted text to the specified output file, or to the auto-generated one.
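
The first three steps above can be sketched as below. This is an illustration under stated assumptions, not pydocai's actual implementation: `looks_sparse` and `extract_with_fallback` are hypothetical names, the two extractors are passed in as plain callables so the sketch runs without pypdfium2 or pytesseract installed, and averaging non-empty lines across pages is only one possible reading of the "lines per page" check. Writing the result to disk (step 4) is left out.

```python
from typing import Callable, List

MIN_LINES_PER_PAGE = 5  # mirrors the documented default


def looks_sparse(pages: List[str], min_lines: int = MIN_LINES_PER_PAGE) -> bool:
    """Return True when the average count of non-empty lines per page is below min_lines."""
    if not pages:
        return True  # nothing extracted at all: treat as scanned
    total = sum(1 for page in pages for line in page.splitlines() if line.strip())
    return total / len(pages) < min_lines


def extract_with_fallback(
    native: Callable[[], List[str]],
    ocr: Callable[[], List[str]],
) -> List[str]:
    """Try native extraction first; call the OCR extractor only if the result is sparse."""
    pages = native()
    if looks_sparse(pages):
        pages = ocr()
    return pages


# Stand-in extractors: a scanned PDF typically yields little or no native text.
scanned = extract_with_fallback(lambda: ["", ""], lambda: ["text recovered by OCR"])
print(scanned)
```

Passing the extractors as callables keeps the decision logic testable on its own; in a real pipeline they would wrap pypdfium2 and pytesseract calls.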

Configuration

You can import and check the default configuration values:

from pydocai import (
    OCR_DPI,           # DPI for OCR rendering (default: 100)
    MAX_PDF_PAGES,     # Maximum pages to process (default: 15)
    MIN_LINES_PER_PAGE # Minimum lines to consider page valid (default: 5)
)
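
To get a feel for what the `OCR_DPI` default means: PDF page sizes are expressed in points (72 per inch), so rendering at a given DPI scales each dimension by `dpi / 72`. The sketch below computes the resulting bitmap size. `rendered_size` is a hypothetical helper, and whether pydocai applies the DPI in exactly this way is an assumption.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF page dimensions are measured in points


def rendered_size(width_pt: float, height_pt: float, dpi: int = 100) -> Tuple[int, int]:
    """Return the pixel dimensions of a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points.
print(rendered_size(612, 792))       # (850, 1100)
print(rendered_size(612, 792, 300))  # (2550, 3300)
```

A lower DPI keeps the rendered bitmaps small, which fits the memory-efficiency goal, though it may reduce OCR accuracy on small print.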

Development

# Clone the repository
git clone https://github.com/catherine/pydocai.git
cd pydocai

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

License

Apache License 2.0 - see LICENSE for details.
