Extract text from PDFs using pypdfium2 with OCR fallback via pytesseract
pydocai
Extract text from PDF documents using pypdfium2 with automatic OCR fallback via pytesseract.
Features
- Fast native text extraction using pypdfium2 (Apache 2.0 licensed)
- Automatic OCR fallback for scanned documents using pytesseract
- Smart detection of sparse/scanned PDFs that need OCR
- Memory-efficient processing with lazy imports and temporary file handling
- Page limit to prevent processing extremely large documents (default: 15 pages)
Installation
pip install pydocai
System Dependencies
This package requires Tesseract OCR to be installed on your system for the OCR fallback feature:
macOS:
brew install tesseract
Ubuntu/Debian:
sudo apt-get install tesseract-ocr
Windows: download and run the installer from https://github.com/UB-Mannheim/tesseract/wiki
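Because the OCR fallback depends on an external binary, it can help to confirm that Tesseract is discoverable before processing scanned PDFs. The following is a minimal sketch using only the standard library; `tesseract_available` is a hypothetical helper, not part of pydocai, and it assumes the executable is named `tesseract` and lives on `PATH`.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which(binary) is not None


if tesseract_available():
    print("Tesseract found; the OCR fallback should be usable.")
else:
    print("Tesseract not found on PATH; scanned PDFs may not be extracted.")
```

If the binary is installed in a non-standard location, this check reports it as missing even though pytesseract can be pointed at it explicitly.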
Usage
from pydocai import extract_pdf_text
# Extract text and save to file
success = extract_pdf_text("document.pdf", "output.txt")
# Or let it auto-generate the output filename
success = extract_pdf_text("document.pdf") # Creates document_extracted.txt
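The auto-generated filename in the second call follows a `<name>_extracted.txt` pattern. As an illustration only, the naming rule could be sketched as below; `default_output_path` is a hypothetical helper, and placing the output next to the input PDF is an assumption rather than documented behaviour.

```python
from pathlib import Path


def default_output_path(pdf_path: str) -> Path:
    """Hypothetical helper: derive '<stem>_extracted.txt' alongside the input PDF."""
    source = Path(pdf_path)
    # with_name keeps the parent directory and swaps only the final component.
    return source.with_name(f"{source.stem}_extracted.txt")


print(default_output_path("document.pdf"))  # document_extracted.txt
```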
How it works
- First attempts native text extraction using pypdfium2
- Checks if extracted text has sufficient content (>= 5 lines per page)
- If content is sparse (likely a scanned document), falls back to OCR
- Saves extracted text to the specified output file
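The sparse-content check in the steps above can be sketched in plain Python. This is not pydocai's actual implementation: `needs_ocr` is a hypothetical name, and the sketch assumes the threshold is applied to the average number of non-empty lines across all pages rather than to each page individually.

```python
from typing import Sequence

MIN_LINES_PER_PAGE = 5  # mirrors the documented default


def needs_ocr(
    page_texts: Sequence[str],
    min_lines_per_page: int = MIN_LINES_PER_PAGE,
) -> bool:
    """Return True when the natively extracted text looks too sparse."""
    if not page_texts:
        # Assumption: no extracted pages at all is treated as a scanned document.
        return True
    # Count only lines that contain something other than whitespace.
    total_lines = sum(
        1 for text in page_texts for line in text.splitlines() if line.strip()
    )
    return total_lines < min_lines_per_page * len(page_texts)


scanned = ["", "  \n", "Scanned by a phone app"]
print(needs_ocr(scanned))  # True
```

Averaging across pages keeps a single blank page (for example a cover sheet) from forcing OCR on an otherwise text-rich document; a per-page check would be stricter.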
Configuration
You can import and check the default configuration values:
from pydocai import (
    OCR_DPI,             # DPI for OCR rendering (default: 100)
    MAX_PDF_PAGES,       # Maximum pages to process (default: 15)
    MIN_LINES_PER_PAGE,  # Minimum lines to consider page valid (default: 5)
)
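To give a feel for what `OCR_DPI` means in practice: PDF page sizes are measured in points (1/72 inch), and pypdfium2 renders pages with a scale factor relative to 72 DPI. The sketch below converts a DPI value to such a scale and estimates the resulting bitmap size. The helper names are hypothetical, and whether pydocai derives its render scale exactly this way is an assumption.

```python
POINTS_PER_INCH = 72  # PDF canvas units per inch
OCR_DPI = 100         # mirrors the documented default


def dpi_to_scale(dpi: float) -> float:
    """Convert a DPI value to a render scale factor (1.0 corresponds to 72 DPI)."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, dpi: float) -> tuple[int, int]:
    """Estimate the pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# A US Letter page is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792, OCR_DPI))  # (850, 1100)
```

A lower DPI keeps the rendered bitmaps small, at the cost of OCR accuracy on fine print.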
Development
# Clone the repository
git clone https://github.com/catherine/pydocai.git
cd pydocai
# Install in development mode with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
License
Apache License 2.0 - see LICENSE for details.
File details
Details for the file pydocai-0.1.0.tar.gz.
File metadata
- Download URL: pydocai-0.1.0.tar.gz
- Upload date:
- Size: 11.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6cb8467fba356c5ff66b2e51dd108b9b4179a4b7d968712ab9b1f37f5f008cb4 |
| MD5 | a82129694cecb834745f589330fa5fde |
| BLAKE2b-256 | f0e027c397ccbcd537aa2a72c42f1c2bc295f14f65535e36bd571cce3337d673 |
File details
Details for the file pydocai-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pydocai-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fbb3f212eae9a399d8e735500a8117a73c286c212e2ac9efd78421c939ce1e1d |
| MD5 | 90acf05c1237e2e6f7a95c3fcdcd9f3a |
| BLAKE2b-256 | 055da5f05d409cac95ae7a94f5a89a473eb919a252dea3b76f13ad732780213a |