OCR extraction for court case files: PDF to Markdown

tecjustica-ocr

OCR extraction for court case files: turns PDFs into readable Markdown.

Uses PaddleOCR 3.x (PP-OCRv5 / PP-StructureV3) with automatic GPU/CPU detection.

System Requirements

Operating System

OS	Support	Notes
Linux (Ubuntu 22.04+, Debian 12+)	Native	Best experience; GPU works out of the box
Windows 10/11	WSL2 only	Does not run natively; see WSL2 Setup
macOS	CPU only	Not extensively tested

Why doesn't it work on native Windows? PaddlePaddle GPU requires a CUDA toolkit compiled for Linux. The OCR models and native dependencies are not compatible with Windows. WSL2 with GPU passthrough is the only supported path on Windows.

Hardware

Component	Minimum	Recommended
CPU	x86_64 (Intel/AMD)	Intel i5 12th gen+ / AMD Ryzen 5+
RAM	4 GB	8 GB+
Disk	2 GB free	5 GB+
GPU (optional)	NVIDIA with CUDA CC 6.0+ and 4 GB VRAM	RTX 3050 6 GB+

A GPU is not required. tecjustica-ocr runs on CPU, but about 5x slower: with a GPU it processes 312 pages in about 7 minutes, while the same job takes about 35 minutes on CPU.
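As a rough planning aid, the figures quoted above can be turned into a linear estimate. This is a minimal sketch, assuming processing time scales linearly with page count, which the project does not state. `estimate_minutes` and the constants are illustrative names, not part of tecjustica-ocr.

```python
# Reference run quoted in this README: 312 pages in ~7 min on GPU, ~35 min on CPU.
REFERENCE_PAGES = 312
REFERENCE_MINUTES = {"gpu": 7.0, "cpu": 35.0}


def estimate_minutes(pages: int, device: str = "gpu") -> float:
    """Scale the reference run linearly to another page count (an assumption)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if device not in REFERENCE_MINUTES:
        raise ValueError(f"unknown device: {device!r}")
    return pages * REFERENCE_MINUTES[device] / REFERENCE_PAGES


for device in ("gpu", "cpu"):
    print(f"1000 pages on {device}: ~{estimate_minutes(1000, device):.0f} min")
```

Actual times depend on scan quality, render scale, and hardware, so treat the result as an order-of-magnitude guide.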

NVIDIA GPU: acceleration requirements

To use GPU acceleration, you need an NVIDIA card with:

Compute Capability 6.0+ (Pascal or newer)
At least 4 GB VRAM (6 GB recommended for the server model)
NVIDIA driver 525+ (on Windows: install the Windows driver; WSL2 uses it automatically)
CUDA 11.8 (installed automatically by the init command)

Compatible GPUs (quick reference)

Family	Compute Capability	Status
GTX 1050, 1060, 1070, 1080	6.1	Works, slower
GTX 1650, 1660	7.5	Works well
RTX 2060, 2070, 2080	7.5	Works well
RTX 3050, 3060, 3070, 3080	8.6	Recommended, tested
RTX 4050, 4060, 4070, 4080, 4090	8.9	Excellent

AMD and Intel Arc GPUs are not supported. This package relies on PaddlePaddle's CUDA build, and CUDA is exclusive to NVIDIA.
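If your card is not in the table, the Compute Capability can usually be read from the driver. The sketch below is not part of tecjustica-ocr; it assumes a recent NVIDIA driver whose `nvidia-smi` accepts the `compute_cap` query field (older drivers may reject it). The helper names are hypothetical.

```python
import shutil
import subprocess

MIN_COMPUTE_CAPABILITY = (6, 0)  # threshold stated in the requirements above


def parse_gpus(csv_text: str) -> list[tuple[str, tuple[int, int]]]:
    """Parse lines such as 'NVIDIA GeForce RTX 3050, 8.6' into (name, (major, minor))."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        name, _, cap = line.rpartition(",")
        major, _, minor = cap.strip().partition(".")
        try:
            gpus.append((name.strip(), (int(major), int(minor or 0))))
        except ValueError:
            continue  # skip rows the driver could not report (e.g. '[N/A]')
    return gpus


def is_supported(capability: tuple[int, int]) -> bool:
    """Compare a (major, minor) capability against the documented minimum."""
    return capability >= MIN_COMPUTE_CAPABILITY


def query_gpus() -> list[tuple[str, tuple[int, int]]]:
    """Ask nvidia-smi for GPU names and capabilities; empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpus(result.stdout) if result.returncode == 0 else []
```

Calling `query_gpus()` and passing each capability to `is_supported` gives a quick yes/no per card. An empty list means either no NVIDIA driver or a driver that does not support the query.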

Software

Requirement	Version
Python	3.10+ (recommended: 3.12)
Package manager	`uv` (recommended) or `pip`

WSL2 Setup (Windows)

If you use Windows, follow these steps before installing tecjustica-ocr:

1. Install WSL2

# In PowerShell as Administrator
wsl --install

Restart the computer after installation. Ubuntu is installed by default.

2. Install the NVIDIA driver for Windows

Download and install the NVIDIA driver for Windows (not for Linux):

Go to nvidia.com/drivers and download the driver for your card
Install it normally on Windows, version 525 or later

Important: do NOT install the CUDA toolkit inside WSL. The Windows driver handles GPU passthrough to WSL2 automatically.

3. Verify the GPU in WSL

Open the WSL terminal (Ubuntu) and run:

nvidia-smi

If a table appears with your GPU name and driver version, everything is set. If you get an error, check that the Windows driver is up to date.

4. Install Python and uv in WSL

# Update packages
sudo apt update && sudo apt upgrade -y

# Install Python
sudo apt install python3 python3-venv -y

# Install uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh

5. Continue with the normal installation

From here, follow the Installation instructions below as usual, inside WSL.

Installation

Option A: `uv tool` (recommended)

The simplest way: it installs into an isolated environment with a single command:

uv tool install tecjustica-ocr

Then set up PaddlePaddle and download the models:

tecjustica-ocr init

init detects whether you have a GPU and installs the matching PaddlePaddle build automatically.

Option B: `pip` with a manual venv

For those who prefer full control or do not use uv:

# 1. Create a virtual environment
python3 -m venv ocr-env
source ocr-env/bin/activate     # Linux/Mac
# ocr-env\Scripts\activate      # Windows

# 2. Install PaddlePaddle (choose one option)
# With an NVIDIA GPU (CUDA 11.8):
pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
# Without a GPU (CPU only):
pip install paddlepaddle==3.2.0

# 3. Install tecjustica-ocr
pip install tecjustica-ocr

How do I know if I have a compatible GPU? Run nvidia-smi in the terminal. If a table appears with the name of your card, you have a GPU. If you get a "command not found" error, use the CPU version.
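The choice between the two PaddlePaddle builds can be scripted. The following is a minimal sketch rather than what `tecjustica-ocr init` actually does: it only checks whether `nvidia-smi` is on the PATH, which suggests a driver is installed but does not prove the GPU is compatible. `paddle_pip_args` is a hypothetical helper; the package names, version, and index URL are the ones listed in step 2.

```python
import shutil

PADDLE_VERSION = "3.2.0"
CUDA_INDEX = "https://www.paddlepaddle.org.cn/packages/stable/cu118/"


def paddle_pip_args(has_gpu: bool) -> list[str]:
    """Return the pip arguments from step 2 for the GPU or the CPU build."""
    if has_gpu:
        return ["install", f"paddlepaddle-gpu=={PADDLE_VERSION}", "-i", CUDA_INDEX]
    return ["install", f"paddlepaddle=={PADDLE_VERSION}"]


# Heuristic only: nvidia-smi on PATH suggests an NVIDIA driver is present.
has_gpu = shutil.which("nvidia-smi") is not None
print("pip " + " ".join(paddle_pip_args(has_gpu)))
```

The script only prints the command, so you can review it before running it inside the activated virtual environment.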

Verify the installation

tecjustica-ocr init

This command checks everything automatically:

Python version
RAM and disk space
Whether PaddlePaddle is installed
Whether a GPU is present, and which model
Whether the OCR models have been downloaded

On the first run, it downloads the models automatically (~500 MB). After that, they stay in the cache.

Example output:

╭──────────────────────────────────────────────────────────────────╮
│ tecjustica-ocr 0.1.3 — Diagnóstico do sistema                   │
╰──────────────────────────────────────────────────────────────────╯

1. Verificando sistema...

┌────────────────────────────┬────────┬──────────────────────────────────────┐
│ Verificação                │ Status │ Detalhe                              │
├────────────────────────────┼────────┼──────────────────────────────────────┤
│ Python                     │   OK   │ 3.12.3                               │
│ Sistema                    │   OK   │ Linux x86_64                         │
│ RAM                        │   OK   │ 7.6 GB total                         │
│ Espaço em disco            │   OK   │ 901.9 GB livres                      │
│ PaddlePaddle               │   OK   │ v3.2.0                               │
│ PaddleOCR                  │   OK   │ v3.4.0                               │
│ pypdfium2                  │   OK   │ v4.x                                 │
│ GPU CUDA                   │   OK   │ 1 GPU(s) — NVIDIA GeForce RTX 3050   │
│ Modelo PP-OCRv5_mobile_det │   OK   │ em cache                             │
│ Modelo PP-OCRv5_mobile_rec │   OK   │ em cache                             │
│ Modelo PP-OCRv5_server_det │   OK   │ em cache                             │
│ Modelo PP-OCRv5_server_rec │   OK   │ em cache                             │
└────────────────────────────┴────────┴──────────────────────────────────────┘

2. Baixando modelos OCR...

  OK Modelo mobile pronto
  OK Modelo server pronto

╭────────────────────────────── Pronto para usar ──────────────────────────────╮
│ Sistema compatível!                                                          │
│                                                                              │
│   Dispositivo: GPU                                                           │
│   Comando: tecjustica-ocr run <pdf>                                          │
╰──────────────────────────────────────────────────────────────────────────────╯

If any item shows as FALHA (failure), init explains what to fix.
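To get a feel for what a preflight check of this kind involves, here is a small sketch of two of the checks using only the standard library. It is not the implementation behind `tecjustica-ocr init`; the thresholds are taken from the requirements tables in this README, and the function names are hypothetical.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)      # from the Software table
MIN_FREE_DISK_GB = 2.0    # from the Hardware table (minimum)


def python_ok(version: tuple[int, ...]) -> bool:
    """True if the (major, minor, ...) version meets the documented minimum."""
    return tuple(version[:2]) >= MIN_PYTHON


def disk_ok(free_bytes: int) -> bool:
    """True if the free space, given in bytes, meets the documented minimum."""
    return free_bytes / 1024**3 >= MIN_FREE_DISK_GB


checks = {
    "Python": python_ok(sys.version_info),
    "Disk space": disk_ok(shutil.disk_usage(".").free),
}
for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'FAIL'}")
```

The real command also reports RAM, PaddlePaddle, GPU, and model cache status, which this sketch does not attempt.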

Usage

Process a single PDF

tecjustica-ocr run processo.pdf

Generates output/processo.md with all the extracted text.

Process an entire folder

tecjustica-ocr run pasta-com-pdfs/ -o resultado/

Generates one .md file for each PDF found in the folder.
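The same folder workflow can be approximated from Python. This sketch assumes that `extract_text` (shown in the Python API section) returns the extracted text as a string. The extractor is passed in as a parameter so the loop can be tried without the package installed; `convert_folder` is a hypothetical helper, not part of tecjustica-ocr.

```python
from pathlib import Path
from typing import Callable


def convert_folder(src: Path, out: Path, extract: Callable[[str], str]) -> list[Path]:
    """Run `extract` on every .pdf in `src` and write one .md file per PDF into `out`."""
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        target = out / f"{pdf.stem}.md"
        target.write_text(extract(str(pdf)), encoding="utf-8")
        written.append(target)
    return written


# Usage, assuming the package is installed and extract_text returns a string:
#   from tecjustica_ocr import extract_text
#   convert_folder(Path("pasta-com-pdfs"), Path("resultado"), extract_text)
```

Unlike the CLI, this loop is sequential and does not search subfolders.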

Choose a model

# Mobile (default): fast, good for most cases
tecjustica-ocr run processo.pdf -m mobile

# Server: higher quality, ~2.5x slower
tecjustica-ocr run processo.pdf -m server

Model	Speed (GPU)	Text items	When to use
`mobile`	~1.3 s/page	~86 per page	Long case files, quick reading
`server`	~3.2 s/page	~93 per page	Difficult documents, maximum quality

Structure mode (tables and layout)

tecjustica-ocr run processo.pdf --mode structure

Uses PP-StructureV3 to preserve the document's tables, headings, and layout.

Force CPU or GPU

tecjustica-ocr run processo.pdf -d cpu    # Force CPU
tecjustica-ocr run processo.pdf -d gpu    # Force GPU

By default, the device is detected automatically (-d auto).

All options

tecjustica-ocr run PATH [OPTIONS]

Arguments:
  PATH                 PDF file or folder of PDFs (required)

Options:
  -o, --output DIR     Output directory (default: ./output)
  -m, --model TEXT     mobile (default) or server
  -d, --device TEXT    auto (default), gpu, or cpu
  -s, --scale INT      Render scale: 1, 2, or 3 (default: 2)
  -w, --workers INT    Parallel workers (default: auto)
  --mode TEXT          text (default) or structure
  --min-score FLOAT    Minimum confidence score (default: 0.5)
  -v, --verbose        Verbose output

tecjustica-ocr init [OPTIONS]

Options:
  --download/--no-download   Download models (default: yes)
  -m, --model TEXT           Model to download: mobile, server, or all (default: all)
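When calling the CLI from another program, it can help to validate option values before spawning the process. The sketch below builds the argument list for `tecjustica-ocr run` from the options listed above; the allowed values come from that list, while `build_run_command` itself is a hypothetical helper.

```python
MODELS = ("mobile", "server")
DEVICES = ("auto", "gpu", "cpu")
MODES = ("text", "structure")
SCALES = (1, 2, 3)


def build_run_command(path, output="output", model="mobile", device="auto",
                      scale=2, mode="text", min_score=0.5):
    """Return the argv list for `tecjustica-ocr run`, rejecting undocumented values."""
    for name, value, allowed in (
        ("model", model, MODELS),
        ("device", device, DEVICES),
        ("mode", mode, MODES),
        ("scale", scale, SCALES),
    ):
        if value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")
    return [
        "tecjustica-ocr", "run", str(path),
        "-o", str(output), "-m", model, "-d", device,
        "-s", str(scale), "--mode", mode, "--min-score", str(min_score),
    ]


print(" ".join(build_run_command("processo.pdf", model="server", scale=3)))
```

The resulting list can be passed to `subprocess.run(..., check=True)`. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.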

Output format

The generated Markdown has the following structure (headings and labels are written in Portuguese):

# nome-do-arquivo.pdf

- **Data de extração**: 2025-03-17 15:44
- **Páginas**: 312
- **Textos extraídos**: 26706

## Página 1

Tribunal de Justiça do Estado do Ceará - 1° Grau
PJe - Processo Judicial Eletrônico
Número: 3000066-83.2025.8.06.0203
...

---

## Página 2

...
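Because each page starts with a `## Página N` heading, the output is easy to split back into pages for downstream processing. This is a minimal sketch based only on the sample above. It assumes the heading text and the `---` separator appear exactly as shown; `split_pages` is a hypothetical helper.

```python
import re

# Heading format taken from the sample output above.
PAGE_HEADING = re.compile(r"^## Página (\d+)[ \t]*$", re.MULTILINE)


def split_pages(markdown: str) -> dict[int, str]:
    """Map page number -> page text, using the '## Página N' headings as boundaries."""
    pages = {}
    matches = list(PAGE_HEADING.finditer(markdown))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        body = markdown[match.end():end].strip()
        # Drop the '---' separator that closes a page, if present.
        if body.endswith("---"):
            body = body[:-3].rstrip()
        pages[int(match.group(1))] = body
    return pages
```

For example, `split_pages(Path("output/processo.md").read_text(encoding="utf-8"))[1]` would return the text of the first page. The file header (extraction date, page count) sits before the first heading and is ignored.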

Python API

To use it from scripts or other programs:

from tecjustica_ocr import extract_text, extract_structure, OcrConfig

# Extract plain text
texto = extract_text("processo.pdf")
print(texto)

# Extract with custom settings
config = OcrConfig(model="server", device="gpu", scale=3)
texto = extract_text("processo.pdf", config)

# Extract structure (tables, layout) as Markdown
markdown = extract_structure("processo.pdf")

Troubleshooting

"command not found" after installing

If you installed with pip in a venv (Option B), the virtual environment must be activated:

source ocr-env/bin/activate
tecjustica-ocr --version

Or run it through Python:

python -m tecjustica_ocr --help

CUDA error / GPU not detected

Check that nvidia-smi works
Check that you installed paddlepaddle-gpu (not paddlepaddle)
Run tecjustica-ocr init for a full diagnosis
If you have no GPU, the system falls back to CPU automatically

Models fail to download

Run init with an explicit download:

tecjustica-ocr init --download -m all

The models are stored in ~/.paddlex/official_models/. If you need to clear them, delete that folder and run init again.
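Before deleting the cache, it may be useful to see what it contains. The sketch below lists the top-level entries under the cache path quoted above with their sizes. It assumes each model lives in its own subfolder, which this README does not confirm; `cache_summary` is a hypothetical helper.

```python
from pathlib import Path

MODEL_CACHE = Path.home() / ".paddlex" / "official_models"  # path quoted above


def cache_summary(root: Path) -> dict[str, int]:
    """Return {entry name: size in bytes} for each top-level entry under `root`."""
    summary = {}
    if not root.is_dir():
        return summary
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            size = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
        else:
            size = entry.stat().st_size
        summary[entry.name] = size
    return summary


for name, size in cache_summary(MODEL_CACHE).items():
    print(f"{name}: {size / 1024**2:.1f} MB")
```

The script is read-only. If the folder does not exist yet, it prints nothing.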

Processing is very slow

Use the mobile model (default): -m mobile
Scale 2 is sufficient: -s 2
If you have a GPU, confirm it is being used: look for "Dispositivo: gpu:0" at the start of the output

Benchmarks

Tested on an RTX 3050 6 GB with a 312-page court case file:

Model	Total time	Per page	Text items
mobile + GPU	6.8 min	1,310 ms	26,706
server + GPU	16.5 min	3,181 ms	29,005
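The per-page figures and the "~2.5x slower" estimate follow from the totals in the table. The short calculation below recomputes them. It uses only the numbers reported above; because the total times are rounded to one decimal place, the results differ slightly from the per-page column.

```python
PAGES = 312

# (total minutes, extracted text items) for each benchmark row above.
RUNS = {"mobile": (6.8, 26706), "server": (16.5, 29005)}


def per_page_seconds(total_minutes: float, pages: int = PAGES) -> float:
    """Average seconds per page for a run of `pages` pages."""
    return total_minutes * 60 / pages


def texts_per_page(texts: int, pages: int = PAGES) -> float:
    """Average number of extracted text items per page."""
    return texts / pages


for model, (minutes, texts) in RUNS.items():
    print(f"{model}: {per_page_seconds(minutes):.2f} s/page, "
          f"{texts_per_page(texts):.0f} text items/page")

slowdown = per_page_seconds(RUNS["server"][0]) / per_page_seconds(RUNS["mobile"][0])
print(f"server takes ~{slowdown:.1f}x as long as mobile")
```

The text-item averages round to the ~86 and ~93 per page quoted in the model comparison table.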

Release history

0.1.4 (this version): Mar 19, 2026
0.1.3: Mar 18, 2026
0.1.2: Mar 18, 2026
0.1.1: Mar 17, 2026
0.1.0: Mar 17, 2026
