Reliable PDF text extraction with PyMuPDF and configurable OCR engines (Tesseract/PaddleOCR).

pdfall


pdfall is a Python library for PDF text extraction with a reliability-first hybrid strategy:

  • native text extraction with PyMuPDF
  • OCR on embedded images
  • full-page OCR fallback when needed

The goal is to provide useful text output for real-world PDFs (including scans), without requiring each project to build its own OCR pipeline.

Key Features

  • Hybrid pipeline: native text + OCR.
  • OCR engine selection via parameter (tesseract or paddle).
  • Safe CPU default: tesseract.
  • Smart full-page fallback for weak content or image OCR failures.
  • Page-level reading order preservation (ordered_content).
  • Ready-to-use CLI (pdfall-extract).

Installation

Base package installation:

pip install .

Installation without build isolation (for environments where the build dependencies are already installed):

pip install --no-build-isolation .

OCR Extras

Install Tesseract support (recommended default):

pip install ".[tesseract]"

Install PaddleOCR support:

pip install ".[paddle]"

Install all engines:

pip install ".[all]"

System Dependencies (Tesseract)

pytesseract is only the Python wrapper. The tesseract binary must be available on your system.

  • Ubuntu/Debian:
    sudo apt install tesseract-ocr tesseract-ocr-por
  • macOS (Homebrew):
    brew install tesseract tesseract-lang
  • Windows (Chocolatey):
    choco install tesseract

Quick Start (Python)

from pdfall import extract_pdf_text

result = extract_pdf_text(
    pdf_path="file.pdf",
    ocr_engine="tesseract",   # default
    ocr_language="pt",
)

print(result.full_text)

Using PaddleOCR:

result = extract_pdf_text(
    pdf_path="file.pdf",
    ocr_engine="paddle",
    ocr_language="pt",
)

Main API

extract_pdf_text(...) returns a PDFTextResult with:

  • pages: list of PageTextResult
  • full_text: final document text

Each PageTextResult contains:

  • page_number
  • native_text
  • image_texts
  • full_page_ocr_text
  • ordered_content
  • combined_text
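
As an illustration of how these fields can be consumed, here is a minimal sketch that summarizes a result page by page. It relies only on the attribute names listed above; the assumption that full_page_ocr_text and combined_text are strings (or None), the "native+images" label, and the stand-in demo objects are hypothetical and not taken from pdfall itself.

```python
from types import SimpleNamespace


def summarize_pages(result):
    """Return one (page_number, source, char_count) tuple per page.

    `source` is "full_page_ocr" when full_page_ocr_text is non-empty,
    otherwise "native+images". Assumes both text attributes are str or None.
    """
    summary = []
    for page in result.pages:
        used_full_page = bool(page.full_page_ocr_text)
        source = "full_page_ocr" if used_full_page else "native+images"
        summary.append((page.page_number, source, len(page.combined_text or "")))
    return summary


# Stand-in objects with the documented attribute names (not real pdfall output).
demo = SimpleNamespace(
    pages=[
        SimpleNamespace(page_number=1, full_page_ocr_text="", combined_text="Hello"),
        SimpleNamespace(page_number=2, full_page_ocr_text="Scanned", combined_text="Scanned"),
    ]
)
print(summarize_pages(demo))
```

With a real result, the same function should accept the object returned by extract_pdf_text(...), provided the attributes behave as assumed here.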

Most Important Parameters

  • pdf_path: input PDF path.
  • ocr_engine: tesseract (default) or paddle.
  • ocr_language: OCR language (pt, en, es, etc.).
  • ocr_on_images: enables OCR for embedded images.
  • fallback_full_page_ocr: enables full-page OCR when needed.
  • force_full_page_ocr: bypasses heuristics and always tries full-page OCR.
  • image_ocr_workers: workers for image OCR (default 1 for stability).
  • min_embedded_image_area: ignores tiny embedded images.
  • max_image_side: limit on image side length, in pixels (default 4000).
  • max_image_pixels: limit on total image pixel count (default 8_000_000).
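
To make the two image limits concrete, the sketch below shows one way a side limit and a pixel-count limit can be combined into a single scale factor. This is an assumption for illustration only: the source does not say whether pdfall downscales oversized images or skips them, and fit_within_limits is a hypothetical helper, not part of the library.

```python
import math


def fit_within_limits(width, height, max_side=4000, max_pixels=8_000_000):
    """Return (width, height) scaled down to satisfy both limits.

    Hypothetical helper: it never upscales, and it keeps the aspect ratio
    by applying the most restrictive of the two scale factors.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(
        1.0,                                       # never upscale
        max_side / max(width, height),             # longest side limit
        math.sqrt(max_pixels / (width * height)),  # total pixel limit
    )
    # int() truncates, so the result never exceeds either limit.
    return max(1, int(width * scale)), max(1, int(height * scale))


print(fit_within_limits(8000, 4000))
print(fit_within_limits(1000, 1000))
```

Under these defaults an 8000 x 4000 image hits both limits at the same scale of 0.5, while a 4000 x 4000 image passes the side limit but is still reduced by the pixel limit.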

Fallback Behavior

pdfall is designed to reduce empty OCR output on scanned PDFs:

  • detects when native text is not meaningful
  • treats low-information OCR output as insufficient
  • detects real image OCR failures (decode/extraction/execution)
  • triggers full-page OCR when needed

In practice, this reduces outputs containing only footers, signatures, or a few noisy words.
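
The decision described above can be sketched as a small predicate. The word-count threshold, the parameter names, and the function itself are assumptions made for illustration; pdfall's actual heuristics are not documented here and may differ.

```python
def needs_full_page_ocr(native_text, image_texts, image_ocr_failed=False, min_words=5):
    """Decide whether a page should fall back to full-page OCR.

    Hypothetical heuristic: fall back when image OCR failed outright, or
    when native text plus image OCR text contain fewer than `min_words`
    tokens that include at least one letter or digit.
    """
    if image_ocr_failed:
        return True
    combined = " ".join([native_text or "", *image_texts])
    words = [w for w in combined.split() if any(ch.isalnum() for ch in w)]
    return len(words) < min_words


print(needs_full_page_ocr("", []))              # empty page
print(needs_full_page_ocr("Page 3 of 10", []))  # footer only
print(needs_full_page_ocr("This page has plenty of native text.", []))
```

A footer-only page such as "Page 3 of 10" falls below the threshold and triggers the fallback, which mirrors the footer and signature cases mentioned above.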

CLI

Extraction with default engine (tesseract):

pdfall-extract "file.pdf" --lang pt --ocr-engine tesseract -o output.txt

Extraction with PaddleOCR:

pdfall-extract "file.pdf" --lang pt --ocr-engine paddle -o output.txt

Useful CLI parameters:

  • --workers (default 1)
  • --min-image-area (default 4096)
  • --max-image-side (default 4000)
  • --max-image-pixels (default 8000000)
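
When the CLI is driven from a script, building the argument list in one place keeps the flags consistent. The sketch below uses only the flags shown above; build_extract_command is a hypothetical helper, and the defaults it applies are choices made for this example.

```python
def build_extract_command(pdf_path, output_path, lang="pt", engine="tesseract", workers=1):
    """Build the argument list for a pdfall-extract invocation.

    Hypothetical helper: it only assembles the flags documented above and
    does not run anything itself.
    """
    return [
        "pdfall-extract",
        str(pdf_path),
        "--lang", lang,
        "--ocr-engine", engine,
        "--workers", str(workers),
        "-o", str(output_path),
    ]


print(build_extract_command("file.pdf", "output.txt"))
```

The resulting list can be passed to subprocess.run(..., check=True), which avoids shell quoting problems with file names that contain spaces.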

Language Notes

  • With Tesseract, short language codes are mapped automatically:
    • pt -> por
    • en -> eng
    • es -> spa
  • With PaddleOCR, language support depends on the installed version.
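
The Tesseract mapping above amounts to a small lookup table. The sketch below reproduces the three documented pairs; passing unknown codes through unchanged and normalizing case and whitespace are assumptions, and to_tesseract_lang is a hypothetical name rather than a pdfall function.

```python
# The three pairs documented above; any other entries would be guesses.
TESSERACT_LANG_CODES = {"pt": "por", "en": "eng", "es": "spa"}


def to_tesseract_lang(code):
    """Map a short language code to a Tesseract language code.

    Assumption: codes without a known mapping (for example one that is
    already a Tesseract code) are returned unchanged after normalization.
    """
    normalized = code.strip().lower()
    return TESSERACT_LANG_CODES.get(normalized, normalized)


print(to_tesseract_lang("pt"))
print(to_tesseract_lang("deu"))
```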

Error Handling

  • If the Tesseract Python wrapper is missing, a clear exception is raised.
  • If the Tesseract binary is not in PATH, the error message explains what to install.
  • If an invalid OCR engine is provided, extract_pdf_text raises ValueError.
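
Both Tesseract failure modes can also be checked before any extraction runs. The following sketch uses only the standard library; check_tesseract_setup is a hypothetical helper and is independent of whatever checks pdfall performs internally.

```python
import importlib.util
import shutil


def check_tesseract_setup():
    """Report whether the Python wrapper and the tesseract binary are present.

    Returns a dict with two booleans:
      "wrapper": the pytesseract module can be found by the import system
      "binary":  an executable named tesseract is on PATH
    """
    return {
        "wrapper": importlib.util.find_spec("pytesseract") is not None,
        "binary": shutil.which("tesseract") is not None,
    }


status = check_tesseract_setup()
for name, available in status.items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

Running this at startup lets an application report a missing dependency up front instead of failing on the first scanned page.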

Regression Test

The repository includes a script to validate image-heavy PDFs:

python testes.py

With a custom materials directory:

PDFALL_TEST_MATERIALS_DIR="/path/to/material" python testes.py
