Reliable PDF text extraction with PyMuPDF and configurable OCR engines (Tesseract/PaddleOCR).

Project description

pdfall

pdfall is a Python library for PDF text extraction with a reliability-first hybrid strategy:

- native text extraction with PyMuPDF
- OCR on embedded images
- full-page OCR fallback when needed

The goal is to provide useful text output for real-world PDFs (including scans), without requiring each project to build its own OCR pipeline.

Key Features

- Hybrid pipeline: native text + OCR.
- OCR engine selection via parameter (tesseract or paddle).
- Safe CPU default: tesseract.
- Smart full-page fallback for weak content or image OCR failures.
- Page-level reading order preservation (ordered_content).
- Ready-to-use CLI (pdfall-extract).

Installation

Base package installation:

pip install .

In an environment without build isolation:

pip install --no-build-isolation .

OCR Extras

Install Tesseract support (recommended default):

pip install ".[tesseract]"

Install PaddleOCR support:

pip install ".[paddle]"

Install all engines:

pip install ".[all]"

System Dependencies (Tesseract)

pytesseract is only the Python wrapper. The tesseract binary must be available on your system.

Ubuntu/Debian:

sudo apt install tesseract-ocr tesseract-ocr-por

macOS (Homebrew):

brew install tesseract tesseract-lang

Windows (Chocolatey):

choco install tesseract
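
Because the binary is installed separately from the Python wrapper, it can help to confirm that it is reachable before running an extraction. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` is hypothetical and not part of pdfall.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract binary not found on PATH; install it with your package manager")
    else:
        print(f"tesseract found at {path}")
```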

Quick Start (Python)

from pdfall import extract_pdf_text

result = extract_pdf_text(
    pdf_path="file.pdf",
    ocr_engine="tesseract",   # default
    ocr_language="pt",
)

print(result.full_text)

Using PaddleOCR:

from pdfall import extract_pdf_text

result = extract_pdf_text(
    pdf_path="file.pdf",
    ocr_engine="paddle",
    ocr_language="pt",
)

print(result.full_text)

Main API

extract_pdf_text(...) returns a PDFTextResult with:

- pages: list of PageTextResult
- full_text: final document text

Each PageTextResult contains:

- page_number
- native_text
- image_texts
- full_page_ocr_text
- ordered_content
- combined_text
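
As an illustration of how the per-page fields might be consumed, the sketch below summarizes which pages relied on full-page OCR. It uses only the attribute names listed above. The field types are assumptions (strings, with `full_page_ocr_text` empty or None when the fallback did not run), and `summarize_pages` is a hypothetical helper, not part of pdfall. A stand-in object replaces a real extraction result so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def summarize_pages(result):
    """Return one (page_number, used_full_page_ocr, char_count) tuple per page."""
    summary = []
    for page in result.pages:
        # Assumption: full_page_ocr_text is empty or None when the fallback did not run.
        used_fallback = bool(page.full_page_ocr_text)
        summary.append((page.page_number, used_fallback, len(page.combined_text or "")))
    return summary


# Stand-in with the documented attribute names, used here instead of a real PDF.
demo = SimpleNamespace(
    pages=[
        SimpleNamespace(page_number=1, full_page_ocr_text="", combined_text="Hello"),
        SimpleNamespace(page_number=2, full_page_ocr_text="Scanned text", combined_text="Scanned text"),
    ]
)
print(summarize_pages(demo))  # prints [(1, False, 5), (2, True, 12)]
```

With a real document, the value returned by extract_pdf_text(...) would be passed in place of `demo`.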

Most Important Parameters

- pdf_path: input PDF path.
- ocr_engine: tesseract (default) or paddle.
- ocr_language: OCR language (pt, en, es, etc.).
- ocr_on_images: enables OCR for embedded images.
- fallback_full_page_ocr: enables full-page OCR when needed.
- force_full_page_ocr: bypasses the heuristics and always tries full-page OCR.
- image_ocr_workers: number of workers for image OCR (default 1 for stability).
- min_embedded_image_area: ignores embedded images below this area.
- max_image_side: maximum image side length (default 4000).
- max_image_pixels: maximum pixel count per image (default 8_000_000).
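
For documents known to be scans, several of these parameters can be combined in one call. The sketch below is one possible configuration, not a recommended default. The keyword names come from the list above, but treating the three switches as booleans is an assumption, and `extract_scanned` is a hypothetical wrapper. The `extract` argument exists only so the function can be exercised without pdfall installed.

```python
def extract_scanned(pdf_path, extract=None, language="pt"):
    """Call extract_pdf_text with settings aimed at scanned PDFs."""
    if extract is None:
        # Imported lazily so this sketch can be loaded without pdfall installed.
        from pdfall import extract_pdf_text as extract
    return extract(
        pdf_path=pdf_path,
        ocr_engine="tesseract",
        ocr_language=language,
        ocr_on_images=True,           # assumed boolean switch
        fallback_full_page_ocr=True,  # assumed boolean switch
        force_full_page_ocr=True,     # skip the heuristic for known scans
        image_ocr_workers=1,
    )
```

Calling `extract_scanned("file.pdf")` would then return the same PDFTextResult as a direct call with these arguments.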

Fallback Behavior

pdfall is designed to reduce empty OCR output on scanned PDFs:

- detects when native text is not meaningful
- considers low-information OCR output
- detects real image OCR failures (decode/extraction/execution)
- triggers full-page OCR when needed

In practice, this reduces outputs containing only footers, signatures, or a few noisy words.
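
The decision described above can be pictured with a small heuristic. This is an illustrative sketch only: the function name, the word-count threshold, and the definition of "low-information" are all assumptions, and pdfall's actual rules may differ.

```python
def needs_full_page_ocr(native_text, image_texts, image_ocr_failed=False, min_words=5):
    """Hypothetical check: should a page be retried with full-page OCR?"""
    if image_ocr_failed:
        # A real failure in image OCR always triggers the fallback in this sketch.
        return True
    combined = " ".join([native_text, *image_texts])
    # Count only tokens that contain at least one letter or digit.
    words = [w for w in combined.split() if any(ch.isalnum() for ch in w)]
    return len(words) < min_words


print(needs_full_page_ocr("Page 3", ["--"]))  # prints True
print(needs_full_page_ocr("This page has plenty of native text to read", []))  # prints False
```

A page whose only text is a footer such as "Page 3" falls below the threshold here, which matches the kind of output the fallback is meant to avoid.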

CLI

Extraction with default engine (tesseract):

pdfall-extract "file.pdf" --lang pt --ocr-engine tesseract -o output.txt

Extraction with PaddleOCR:

pdfall-extract "file.pdf" --lang pt --ocr-engine paddle -o output.txt

Useful CLI parameters:

--workers (default 1)
--min-image-area (default 4096)
--max-image-side (default 4000)
--max-image-pixels (default 8000000)
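
When the CLI is driven from a script, the command can be assembled as an argument list. The sketch below uses only the flags shown above. The helper names are hypothetical, and it assumes pdfall-extract is on PATH and exits with a non-zero status on failure, which the text above does not confirm.

```python
import subprocess


def build_command(pdf_path, output_path, lang="pt", engine="tesseract", workers=1):
    """Assemble the pdfall-extract argument list from the documented flags."""
    return [
        "pdfall-extract", pdf_path,
        "--lang", lang,
        "--ocr-engine", engine,
        "--workers", str(workers),
        "-o", output_path,
    ]


def run_extract(pdf_path, output_path, **options):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(pdf_path, output_path, **options), check=True)
```

Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.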

Language Notes

With Tesseract, short language codes are mapped automatically:
- pt -> por
- en -> eng
- es -> spa
With PaddleOCR, language support depends on the installed version.
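
The Tesseract mapping listed above amounts to a small lookup table. The sketch below reproduces only the three documented pairs. Passing unknown codes through unchanged is an assumption about how unlisted languages are handled, and `to_tesseract_code` is a hypothetical name.

```python
# The three pairs documented above; other languages are not listed here.
TESSERACT_CODES = {"pt": "por", "en": "eng", "es": "spa"}


def to_tesseract_code(code):
    """Map a short language code to a Tesseract code, passing unknown codes through."""
    normalized = code.strip().lower()
    return TESSERACT_CODES.get(normalized, normalized)


print(to_tesseract_code("pt"))   # prints por
print(to_tesseract_code("deu"))  # prints deu
```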

Error Handling

If the Tesseract Python wrapper is missing, a clear exception is raised.
If the Tesseract binary is not in PATH, the error message explains what to install.
If an invalid OCR engine is provided, extract_pdf_text raises ValueError.
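
A caller that takes the engine name from user input may want to handle the ValueError explicitly. The following is a minimal sketch: `safe_extract` is a hypothetical wrapper, and the `extract` argument exists only so the function can be exercised without pdfall installed. The exception types raised for a missing wrapper or binary are not named above, so they are left uncaught here.

```python
def safe_extract(pdf_path, engine, extract=None):
    """Return the extraction result, or None if the arguments are rejected."""
    if extract is None:
        # Imported lazily so this sketch can be loaded without pdfall installed.
        from pdfall import extract_pdf_text as extract
    try:
        return extract(pdf_path=pdf_path, ocr_engine=engine)
    except ValueError as exc:
        # Documented above for an invalid OCR engine name.
        print(f"extract_pdf_text rejected ocr_engine={engine!r}: {exc}")
        return None
```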

Regression Test

The repository includes a script to validate image-heavy PDFs:

python testes.py

With a custom materials directory:

PDFALL_TEST_MATERIALS_DIR="/path/to/material" python testes.py

Project details

Version 0.1.0, released Mar 2, 2026.

Download files

Source Distribution: pdfall-0.1.0.tar.gz (16.2 kB), uploaded Mar 2, 2026.

Built Distribution: pdfall-0.1.0-py3-none-any.whl (14.2 kB), uploaded Mar 2, 2026, Python 3.

File details

Details for the file pdfall-0.1.0.tar.gz.

File metadata

Download URL: pdfall-0.1.0.tar.gz
Upload date: Mar 2, 2026
Size: 16.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for pdfall-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`058fc4aca4392543129cddc85d56425e488d78e7d0e0e1bb4b9f38aed57aa7bd`
MD5	`fc16f3c44a17e6623e6c736316909bb5`
BLAKE2b-256	`ae05dd688646870f7bfab092e941ce8c935afc16dcc56d8e5911e9b75e53d3d6`

File details

Details for the file pdfall-0.1.0-py3-none-any.whl.

File metadata

Download URL: pdfall-0.1.0-py3-none-any.whl
Upload date: Mar 2, 2026
Size: 14.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for pdfall-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cd3c975ab1d9c1cdc1226404a6d2a8c377ff4129a15fcf8560a1674766f660ed`
MD5	`b026377344d77c41862e80a81b9cffbd`
BLAKE2b-256	`0824ad420563af75ad981596ff0220c19d0081cfccc4f7a4881f1acadf16ffe8`
