Reliable PDF text extraction with PyMuPDF and configurable OCR engines (Tesseract/PaddleOCR).
pdfall
pdfall is a Python library for PDF text extraction with a reliability-first hybrid strategy:
- native text extraction with PyMuPDF
- OCR on embedded images
- full-page OCR fallback when needed
The goal is to provide useful text output for real-world PDFs (including scans), without requiring each project to build its own OCR pipeline.
Key Features
- Hybrid pipeline: native text + OCR.
- OCR engine selection via parameter (tesseract or paddle).
- Safe CPU default: tesseract.
- Smart full-page fallback for weak content or image OCR failures.
- Page-level reading order preservation (ordered_content).
- Ready-to-use CLI (pdfall-extract).
Installation
Base package installation:
pip install .
Installation without build isolation:
pip install --no-build-isolation .
OCR Extras
Install Tesseract support (recommended default):
pip install ".[tesseract]"
Install PaddleOCR support:
pip install ".[paddle]"
Install all engines:
pip install ".[all]"
System Dependencies (Tesseract)
pytesseract is only the Python wrapper. The tesseract binary must be available on your system.
- Ubuntu/Debian:
sudo apt install tesseract-ocr tesseract-ocr-por
- macOS (Homebrew):
brew install tesseract tesseract-lang
- Windows (Chocolatey):
choco install tesseract
Quick Start (Python)
from pdfall import extract_pdf_text
result = extract_pdf_text(
    pdf_path="file.pdf",
    ocr_engine="tesseract",  # default
    ocr_language="pt",
)
print(result.full_text)
Using PaddleOCR:
from pdfall import extract_pdf_text
result = extract_pdf_text(
    pdf_path="file.pdf",
    ocr_engine="paddle",  # requires the paddle extra
    ocr_language="pt",
)
print(result.full_text)
Main API
extract_pdf_text(...) returns a PDFTextResult with:
- pages: list of PageTextResult
- full_text: final document text
Each PageTextResult contains:
- page_number
- native_text
- image_texts
- full_page_ocr_text
- ordered_content
- combined_text
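As a rough sketch of how these fields might be consumed, the snippet below summarizes where each page's text came from. To stay runnable without a PDF or an OCR engine, it uses stand-in dataclasses that mirror only some of the field names listed above. The field types (for example, image_texts as a list of strings) and the summarize_pages helper are assumptions for illustration, not part of pdfall.

```python
from dataclasses import dataclass, field


@dataclass
class PageTextResult:  # stand-in; the real class is provided by pdfall
    page_number: int
    native_text: str = ""
    image_texts: list[str] = field(default_factory=list)  # assumed type
    full_page_ocr_text: str = ""
    combined_text: str = ""


@dataclass
class PDFTextResult:  # stand-in; the real class is provided by pdfall
    pages: list[PageTextResult]
    full_text: str = ""


def summarize_pages(result):
    """Return one (page_number, source, char_count) tuple per page."""
    summary = []
    for page in result.pages:
        if page.full_page_ocr_text.strip():
            source = "full-page OCR"
        elif any(text.strip() for text in page.image_texts):
            source = "native + image OCR"
        else:
            source = "native"
        summary.append((page.page_number, source, len(page.combined_text)))
    return summary


demo = PDFTextResult(pages=[
    PageTextResult(1, native_text="Hello", combined_text="Hello"),
    PageTextResult(2, full_page_ocr_text="Scanned", combined_text="Scanned"),
])
for row in summarize_pages(demo):
    print(row)
```

With a real extraction, the object returned by extract_pdf_text could be passed to summarize_pages in place of demo, provided the actual field types match the assumptions above.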
Most Important Parameters
- pdf_path: input PDF path.
- ocr_engine: tesseract (default) or paddle.
- ocr_language: OCR language (pt, en, es, etc.).
- ocr_on_images: enables OCR for embedded images.
- fallback_full_page_ocr: enables full-page OCR when needed.
- force_full_page_ocr: bypasses heuristics and always tries full-page OCR.
- image_ocr_workers: workers for image OCR (default 1 for stability).
- min_embedded_image_area: ignores tiny embedded images.
- max_image_side: default 4000.
- max_image_pixels: default 8_000_000.
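The README does not say what happens to an image that exceeds max_image_side or max_image_pixels (it might be downscaled or skipped). The sketch below shows one plausible interpretation, proportional downscaling, using the documented defaults. The fit_within_limits helper is hypothetical and is not a pdfall function.

```python
import math


def fit_within_limits(width, height, max_side=4000, max_pixels=8_000_000):
    """Return (width, height) scaled down so both limits are respected.

    Assumption: oversized images are shrunk proportionally. pdfall may
    handle them differently.
    """
    scale = 1.0
    longest = max(width, height)
    if longest > max_side:
        scale = min(scale, max_side / longest)
    if width * height > max_pixels:
        scale = min(scale, math.sqrt(max_pixels / (width * height)))
    if scale >= 1.0:
        return width, height
    # Truncate so the result never exceeds the limits.
    return max(1, int(width * scale)), max(1, int(height * scale))


print(fit_within_limits(1000, 1000))
print(fit_within_limits(8000, 1000))
print(fit_within_limits(4000, 4000))
```

Limits like these are usually there to bound memory use and OCR time per image, which matters most when several workers run in parallel.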
Fallback Behavior
pdfall is designed to reduce empty or near-empty output on scanned PDFs. For each page, it:
- detects when the native text is not meaningful
- treats low-information OCR output as insufficient
- detects real image OCR failures (decode, extraction, or execution errors)
- triggers full-page OCR when any of these conditions applies
In practice, this reduces outputs containing only footers, signatures, or a few noisy words.
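The decision described above can be sketched as a small heuristic. The README does not document the actual thresholds or function names, so everything here is assumed for illustration: looks_meaningful, needs_full_page_ocr, the 20-word minimum, and the 0.5 letter ratio.

```python
import re


def looks_meaningful(text, min_words=20, min_alpha_ratio=0.5):
    """Heuristic: enough words, and mostly letters rather than noise."""
    words = re.findall(r"\w+", text)
    if len(words) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    alpha = sum(c.isalpha() for c in non_space)
    return alpha / len(non_space) >= min_alpha_ratio


def needs_full_page_ocr(native_text, image_texts, image_ocr_failed=False):
    """Decide whether to fall back to OCR of the whole rendered page."""
    if image_ocr_failed:
        return True  # an image could not be decoded or processed
    combined = " ".join([native_text, *image_texts])
    return not looks_meaningful(combined)


print(needs_full_page_ocr("Page 1 of 3", []))
print(needs_full_page_ocr("word " * 30, []))
```

A page that contains only a footer such as "Page 1 of 3" falls below the word threshold, so this sketch would route it to full-page OCR.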
CLI
Extraction with default engine (tesseract):
pdfall-extract "file.pdf" --lang pt --ocr-engine tesseract -o output.txt
Extraction with PaddleOCR:
pdfall-extract "file.pdf" --lang pt --ocr-engine paddle -o output.txt
Useful CLI parameters:
- --workers (default 1)
- --min-image-area (default 4096)
- --max-image-side (default 4000)
- --max-image-pixels (default 8000000)
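For batch jobs, the CLI could also be driven from Python. The sketch below only assembles an argument list from the flags shown above. The build_pdfall_command and run_extract helpers are hypothetical, and the sketch assumes that --workers takes an integer value.

```python
import subprocess


def build_pdfall_command(pdf_path, output_path, lang="pt",
                         engine="tesseract", workers=1):
    """Build the argv list for pdfall-extract from documented flags."""
    return [
        "pdfall-extract", str(pdf_path),
        "--lang", lang,
        "--ocr-engine", engine,
        "--workers", str(workers),
        "-o", str(output_path),
    ]


def run_extract(pdf_path, output_path, **options):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    cmd = build_pdfall_command(pdf_path, output_path, **options)
    subprocess.run(cmd, check=True)


cmd = build_pdfall_command("file.pdf", "output.txt")
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.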
Language Notes
- With Tesseract, short language codes are mapped automatically: pt -> por, en -> eng, es -> spa.
- With PaddleOCR, language support depends on the installed version.
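The Tesseract mapping can be pictured as a simple lookup table. In the sketch below, only the three pairs listed above come from the README. The to_tesseract_lang helper is hypothetical, and passing unknown codes through unchanged is an assumption about reasonable behavior, not documented pdfall behavior.

```python
# Only these three pairs are documented; other codes are passed through.
SHORT_TO_TESSERACT = {"pt": "por", "en": "eng", "es": "spa"}


def to_tesseract_lang(code):
    """Map a short language code to a Tesseract language name."""
    normalized = code.strip().lower()
    return SHORT_TO_TESSERACT.get(normalized, normalized)


for short in ("pt", "en", "es", "deu"):
    print(short, "->", to_tesseract_lang(short))
```

The pass-through lets a caller supply a three-letter Tesseract code such as deu directly, as long as the matching language data is installed.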
Error Handling
- If the Tesseract Python wrapper is missing, a clear exception is raised.
- If the Tesseract binary is not in PATH, the error message explains what to install.
- If an invalid OCR engine is provided, extract_pdf_text raises ValueError.
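A caller may want to run similar checks up front, before starting a long batch. The sketch below is a hypothetical pre-flight helper, not pdfall code. The ValueError for an invalid engine matches the documented behavior, while the RuntimeError for a missing binary and the message wording are assumptions.

```python
import shutil

SUPPORTED_ENGINES = ("tesseract", "paddle")


def check_ocr_engine(engine):
    """Validate the engine name and, for Tesseract, look for the binary."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"Unknown OCR engine {engine!r}; expected one of {SUPPORTED_ENGINES}"
        )
    if engine == "tesseract" and shutil.which("tesseract") is None:
        # Assumed exception type; pdfall may raise something else.
        raise RuntimeError(
            "tesseract binary not found in PATH; install it with your "
            "system package manager"
        )
    return engine


try:
    check_ocr_engine("easyocr")
except ValueError as exc:
    print(exc)
```

shutil.which searches PATH the same way a shell would, so it catches the common case where pytesseract is installed but the system binary is not.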
Regression Test
The repository includes a script to validate image-heavy PDFs:
python testes.py
With custom material directory:
PDFALL_TEST_MATERIALS_DIR="/path/to/material" python testes.py
File details
Details for the file pdfall-0.1.0.tar.gz.
File metadata
- Download URL: pdfall-0.1.0.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 058fc4aca4392543129cddc85d56425e488d78e7d0e0e1bb4b9f38aed57aa7bd |
| MD5 | fc16f3c44a17e6623e6c736316909bb5 |
| BLAKE2b-256 | ae05dd688646870f7bfab092e941ce8c935afc16dcc56d8e5911e9b75e53d3d6 |
File details
Details for the file pdfall-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pdfall-0.1.0-py3-none-any.whl
- Upload date:
- Size: 14.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cd3c975ab1d9c1cdc1226404a6d2a8c377ff4129a15fcf8560a1674766f660ed |
| MD5 | b026377344d77c41862e80a81b9cffbd |
| BLAKE2b-256 | 0824ad420563af75ad981596ff0220c19d0081cfccc4f7a4881f1acadf16ffe8 |