OCR extraction for court case files: PDF to Markdown

tecjustica-ocr

OCR extraction for court case files: turns PDFs into readable Markdown.

Uses PaddleOCR 3.x (PP-OCRv5 / PP-StructureV3) with automatic GPU/CPU detection.

Requirements

Requirement	Minimum	Recommended
Python	3.10+	3.12
RAM	4 GB	8 GB+
Disk	2 GB free	5 GB+
GPU (optional)	NVIDIA with CUDA 11.8	RTX 3050+
OS	Linux, Windows	Linux (better GPU support)

A GPU is not required. tecjustica-ocr runs on CPU, but about 5x slower: with a GPU it processes ~312 pages in ~7 minutes, while the same job takes ~35 minutes on CPU.

Installation

Step 1: Create a virtual environment

tecjustica-ocr has heavy dependencies (PaddlePaddle, AI models). Always use a virtual environment.

# Option A: with uv (recommended, faster)
uv venv ocr-env --python 3.12
source ocr-env/bin/activate     # Linux/Mac
# ocr-env\Scripts\activate      # Windows

# Option B: with standard Python
python3 -m venv ocr-env
source ocr-env/bin/activate     # Linux/Mac
# ocr-env\Scripts\activate      # Windows

Step 2: Install PaddlePaddle

PaddlePaddle (the AI engine) must be installed separately because the GPU build comes from a dedicated package index.

If you have an NVIDIA GPU (CUDA 11.8):

pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/

If you do NOT have a GPU (CPU only):

pip install paddlepaddle==3.2.0

How do I know whether I have a compatible GPU? Run nvidia-smi in a terminal. If it prints a table with the name of your card, you have a GPU. If you get a "command not found" error, use the CPU version.
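The check above can also be scripted. The following is a minimal sketch and not part of tecjustica-ocr: the helper names `has_nvidia_gpu` and `paddle_install_command` are hypothetical, and the only signal used is whether `nvidia-smi` is on PATH and exits successfully. It does not confirm that the driver actually supports CUDA 11.8. The package names, version, and index URL are the ones given in this step.

```python
import shutil
import subprocess

# Version and index URL as given in Step 2 of this README.
GPU_INDEX = "https://www.paddlepaddle.org.cn/packages/stable/cu118/"
PADDLE_VERSION = "3.2.0"


def has_nvidia_gpu() -> bool:
    """Return True if nvidia-smi is on PATH and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # the "command not found" case
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def paddle_install_command(gpu: bool) -> list[str]:
    """Build the pip command for the GPU or CPU build of PaddlePaddle."""
    if gpu:
        return ["pip", "install", f"paddlepaddle-gpu=={PADDLE_VERSION}",
                "-i", GPU_INDEX]
    return ["pip", "install", f"paddlepaddle=={PADDLE_VERSION}"]


if __name__ == "__main__":
    # Prints the command to run; it does not install anything itself.
    print(" ".join(paddle_install_command(has_nvidia_gpu())))
```

The script only prints the suggested command, so it is safe to run before deciding which build to install.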

Step 3: Install tecjustica-ocr

pip install tecjustica-ocr

Step 4: Verify the installation

tecjustica-ocr init

This command checks everything automatically:

Python version
RAM and disk space
Whether PaddlePaddle is installed
Whether a GPU is present, and which model
Whether the OCR models have been downloaded

On the first run it downloads the models automatically (~500 MB). After that they stay cached.

Example output:

╭──────────────────────────────────────────────────────────────────╮
│ tecjustica-ocr 0.1.0 — Diagnóstico do sistema                   │
╰──────────────────────────────────────────────────────────────────╯

1. Verificando sistema...

┌────────────────────────────┬────────┬──────────────────────────────────────┐
│ Verificação                │ Status │ Detalhe                              │
├────────────────────────────┼────────┼──────────────────────────────────────┤
│ Python                     │   OK   │ 3.12.3                               │
│ Sistema                    │   OK   │ Linux x86_64                         │
│ RAM                        │   OK   │ 7.6 GB total                         │
│ Espaço em disco            │   OK   │ 901.9 GB livres                      │
│ PaddlePaddle               │   OK   │ v3.2.0                               │
│ PaddleOCR                  │   OK   │ v3.4.0                               │
│ pypdfium2                  │   OK   │ v4.x                                 │
│ GPU CUDA                   │   OK   │ 1 GPU(s) — NVIDIA GeForce RTX 3050   │
│ Modelo PP-OCRv5_mobile_det │   OK   │ em cache                             │
│ Modelo PP-OCRv5_mobile_rec │   OK   │ em cache                             │
│ Modelo PP-OCRv5_server_det │   OK   │ em cache                             │
│ Modelo PP-OCRv5_server_rec │   OK   │ em cache                             │
└────────────────────────────┴────────┴──────────────────────────────────────┘

2. Baixando modelos OCR...

  OK Modelo mobile pronto
  OK Modelo server pronto

╭────────────────────────────── Pronto para usar ──────────────────────────────╮
│ Sistema compatível!                                                          │
│                                                                              │
│   Dispositivo: GPU                                                           │
│   Comando: tecjustica-ocr run <pdf>                                          │
╰──────────────────────────────────────────────────────────────────────────────╯

If any item is reported as FALHA (failure), init explains what to fix.
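For a rough idea of what such a diagnostic involves, the first rows of the table can be approximated with the standard library alone. This is an illustrative sketch and not how `init` is implemented: `preflight` is a hypothetical function, and its thresholds are simply the minimums from the Requirements table.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)      # minimum from the Requirements table
MIN_FREE_DISK_GB = 2.0    # minimum from the Requirements table


def preflight(path: str = ".") -> list[tuple[str, bool, str]]:
    """Return (check name, passed, detail) rows for Python and free disk."""
    rows = []
    version = sys.version_info[:3]
    rows.append(("Python", version[:2] >= MIN_PYTHON,
                 ".".join(map(str, version))))
    free_gb = shutil.disk_usage(path).free / 1024**3
    rows.append(("Disk space", free_gb >= MIN_FREE_DISK_GB,
                 f"{free_gb:.1f} GB free"))
    return rows


if __name__ == "__main__":
    for name, ok, detail in preflight():
        # "FALHA" mirrors the failure label mentioned above.
        print(f"{name:<12} {'OK' if ok else 'FALHA':<6} {detail}")
```

RAM, GPU, and model-cache checks are left out because they need either third-party packages or PaddlePaddle itself.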

Usage

Process a single PDF

tecjustica-ocr run processo.pdf

Generates output/processo.md with all the extracted text.

Process an entire folder

tecjustica-ocr run pasta-com-pdfs/ -o resultado/

Generates one .md file for each PDF found in the folder.
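To preview what a folder run would produce, the mapping from PDFs to output files can be sketched in a few lines. This assumes each output file takes the base name of the PDF with a .md extension, as in the single-file example (processo.pdf becomes output/processo.md), and that only the top level of the folder is scanned. The README confirms neither detail for folder mode. `plan_outputs` is a hypothetical helper.

```python
from pathlib import Path


def plan_outputs(pdf_dir: str, out_dir: str = "output") -> dict[Path, Path]:
    """Map each top-level *.pdf in pdf_dir to its assumed Markdown path."""
    out = Path(out_dir)
    pdfs = sorted(
        p for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    # Assumption: the output keeps the PDF's base name and gets a .md suffix.
    return {pdf: out / f"{pdf.stem}.md" for pdf in pdfs}


# Example usage (the folder must exist):
# for pdf, md in plan_outputs("pasta-com-pdfs", "resultado").items():
#     print(f"{pdf} -> {md}")
```

A listing like this is also a cheap way to spot name collisions, such as two PDFs whose names differ only by the case of the extension.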

Choose a model

# Mobile (default): fast, good for most cases
tecjustica-ocr run processo.pdf -m mobile

# Server: higher quality, ~2.5x slower
tecjustica-ocr run processo.pdf -m server

Model	Speed	Texts extracted	When to use
`mobile`	~1.3 s/page	~86 per page	Long case files, quick reading
`server`	~3.2 s/page	~93 per page	Difficult documents, maximum quality
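The per-page figures in the table make it easy to estimate a run before starting it. The sketch below is plain arithmetic on those numbers. The function name `estimate_minutes` is hypothetical, and the 5x CPU factor is the rough ratio quoted under Requirements, applied here to both models as an assumption.

```python
# Seconds per page on GPU, taken from the model table.
SECONDS_PER_PAGE = {"mobile": 1.3, "server": 3.2}
# Rough CPU slowdown from the Requirements note (assumed for both models).
CPU_SLOWDOWN = 5.0


def estimate_minutes(pages: int, model: str = "mobile",
                     device: str = "gpu") -> float:
    """Estimate total OCR time in minutes for a document of `pages` pages."""
    if device not in ("gpu", "cpu"):
        raise ValueError(f"unknown device: {device!r}")
    seconds = pages * SECONDS_PER_PAGE[model]
    if device == "cpu":
        seconds *= CPU_SLOWDOWN
    return seconds / 60


print(f"{estimate_minutes(312, 'mobile'):.1f} min")         # 6.8 min
print(f"{estimate_minutes(312, 'server'):.1f} min")         # 16.6 min
print(f"{estimate_minutes(312, 'mobile', 'cpu'):.1f} min")  # 33.8 min
```

For the 312-page case the GPU estimates land close to the measured 6.8 and 16.5 minutes reported under Benchmarks. Actual times will vary with hardware and page content.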

Structure mode (tables and layout)

tecjustica-ocr run processo.pdf --mode structure

Uses PP-StructureV3 to preserve the document's tables, headings, and layout.

Force CPU or GPU

tecjustica-ocr run processo.pdf -d cpu    # Force CPU
tecjustica-ocr run processo.pdf -d gpu    # Force GPU

By default, the device is detected automatically (-d auto).

All options

tecjustica-ocr run [PATH] [OPTIONS]

Arguments:
  PATH                 PDF file or folder of PDFs (required)

Options:
  -o, --output DIR     Output directory (default: ./output)
  -m, --model TEXT     mobile (default) or server
  -d, --device TEXT    auto (default), gpu, or cpu
  -s, --scale INT      Render scale: 1, 2, or 3 (default: 2)
  -w, --workers INT    Parallel workers (default: auto)
  --mode TEXT          text (default) or structure
  --min-score FLOAT    Minimum confidence score (default: 0.5)
  -v, --verbose        Verbose output

tecjustica-ocr init [OPTIONS]

Options:
  --download/--no-download   Download models (default: yes)
  -m, --model TEXT           Model to download: mobile, server, or all (default: all)
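When calling the CLI from another program, it helps to assemble the argument list in one place. A minimal sketch follows, using only the flags listed above. `build_run_command` is a hypothetical helper and not part of the package. Its validation simply mirrors the documented choices.

```python
import subprocess


def build_run_command(path: str, output: str = "output",
                      model: str = "mobile", device: str = "auto",
                      scale: int = 2, mode: str = "text",
                      min_score: float = 0.5) -> list[str]:
    """Build the argv for `tecjustica-ocr run` from the documented options."""
    if model not in ("mobile", "server"):
        raise ValueError("model must be 'mobile' or 'server'")
    if device not in ("auto", "gpu", "cpu"):
        raise ValueError("device must be 'auto', 'gpu' or 'cpu'")
    if scale not in (1, 2, 3):
        raise ValueError("scale must be 1, 2 or 3")
    if mode not in ("text", "structure"):
        raise ValueError("mode must be 'text' or 'structure'")
    return ["tecjustica-ocr", "run", path,
            "-o", output, "-m", model, "-d", device,
            "-s", str(scale), "--mode", mode,
            "--min-score", str(min_score)]


cmd = build_run_command("processo.pdf", model="server", scale=3)
print(" ".join(cmd))
# To execute it (requires tecjustica-ocr on PATH):
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces, which are common in case-file exports.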

Output format

The generated Markdown has the following structure (field labels appear in Portuguese):

# nome-do-arquivo.pdf

- **Data de extração**: 2025-03-17 15:44
- **Páginas**: 312
- **Textos extraídos**: 26706

## Página 1

Tribunal de Justiça do Estado do Ceará - 1° Grau
PJe - Processo Judicial Eletrônico
Número: 3000066-83.2025.8.06.0203
...

---

## Página 2

...
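Because every page starts with a "## Página N" heading, the output is easy to split back into pages for downstream processing. The sketch below assumes exactly the layout shown above: page headings at the start of a line, and pages separated by a "---" rule. `split_pages` is a hypothetical helper, not part of the package.

```python
import re

# Matches the per-page heading, e.g. "## Página 12".
PAGE_HEADING = re.compile(r"^## Página (\d+)\s*$", re.MULTILINE)


def split_pages(markdown: str) -> dict[int, str]:
    """Return {page number: page text} from Markdown in the layout above."""
    pages = {}
    matches = list(PAGE_HEADING.finditer(markdown))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        body = markdown[m.end():end].strip()
        # Drop the trailing horizontal rule that separates pages.
        if body.endswith("---"):
            body = body[:-3].rstrip()
        pages[int(m.group(1))] = body
    return pages


sample = """# processo.pdf

- **Páginas**: 2

## Página 1

PJe - Processo Judicial Eletrônico

---

## Página 2

Segunda página
"""
print(split_pages(sample))
# {1: 'PJe - Processo Judicial Eletrônico', 2: 'Segunda página'}
```

The header block before the first page heading is ignored here. It could be parsed separately if the extraction date or page count is needed.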

Python API

To use it from scripts or other programs:

from tecjustica_ocr import extract_text, extract_structure, OcrConfig

# Extract plain text
texto = extract_text("processo.pdf")
print(texto)

# Extract with custom settings
config = OcrConfig(model="server", device="gpu", scale=3)
texto = extract_text("processo.pdf", config)

# Extract structure (tables, layout) as Markdown
markdown = extract_structure("processo.pdf")
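In a script, the returned text usually needs to be written somewhere. The sketch below wraps `extract_text` to save its result as a .md file. It assumes `extract_text` returns a string (the example above prints it), and it takes the extractor as a parameter so the function can be exercised without the package installed. `save_as_markdown` is a hypothetical helper.

```python
from pathlib import Path
from typing import Callable, Optional


def save_as_markdown(pdf_path: str, out_dir: str = "output",
                     extract: Optional[Callable[[str], str]] = None) -> Path:
    """Run OCR on pdf_path and write the text to out_dir/<name>.md."""
    if extract is None:
        # Imported lazily so this module loads without tecjustica-ocr.
        from tecjustica_ocr import extract_text as extract
    text = extract(pdf_path)  # assumed to return a str
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / f"{Path(pdf_path).stem}.md"
    target.write_text(text, encoding="utf-8")
    return target


# Example usage with the real extractor (requires tecjustica-ocr):
# print(save_as_markdown("processo.pdf"))
```

Injecting the extractor also makes it straightforward to swap in `extract_structure` for documents where tables matter.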

Troubleshooting

"command not found" after installing

The virtual environment must be activated:

source ocr-env/bin/activate
tecjustica-ocr --version

Or run it through Python:

python -m tecjustica_ocr --help

CUDA error / GPU not detected

Check that nvidia-smi works
Check that you installed paddlepaddle-gpu (not paddlepaddle)
Run tecjustica-ocr init for a full diagnostic
If you have no GPU, the tool falls back to CPU automatically

Models fail to download

Run init with an explicit download:

tecjustica-ocr init --download -m all

The models are stored in ~/.paddlex/official_models/. To clear them, delete that folder and run init again.
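Before deleting the cache, it can be useful to see how much space it occupies. A minimal sketch with the standard library follows. The default path is the one stated above, while `cache_size_mb` and `clear_cache` are hypothetical helpers, not commands provided by tecjustica-ocr.

```python
import shutil
from pathlib import Path

# Cache location as stated in this README.
DEFAULT_CACHE = Path.home() / ".paddlex" / "official_models"


def cache_size_mb(cache_dir: Path = DEFAULT_CACHE) -> float:
    """Total size of all files under cache_dir in MB (0.0 if missing)."""
    if not cache_dir.is_dir():
        return 0.0
    total = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file())
    return total / 1024**2


def clear_cache(cache_dir: Path = DEFAULT_CACHE) -> bool:
    """Delete cache_dir recursively; return True if something was removed."""
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


print(f"Model cache: {cache_size_mb():.0f} MB")
# clear_cache()  # uncomment to delete; then run `tecjustica-ocr init` again
```

Note that this folder may also hold models downloaded by other PaddleOCR-based tools, so clearing it can trigger downloads elsewhere.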

Processing is very slow

Use the mobile model (default): -m mobile
Scale 2 is enough: -s 2
If you have a GPU, confirm it is being used: look for "Dispositivo: gpu:0" at the start of the output

Benchmarks

Tested on an RTX 3050 6GB with a 312-page court case file:

Model	Total time	Per page	Texts extracted
mobile + GPU	6.8 min	1,310 ms	26,706
server + GPU	16.5 min	3,181 ms	29,005

Release history

0.1.4	Mar 19, 2026
0.1.3	Mar 18, 2026 (this version)
0.1.2	Mar 18, 2026
0.1.1	Mar 17, 2026
0.1.0	Mar 17, 2026

