CLI-first document normalization pipeline for PDFs and OCR-able files

scribai

CLI-first pipeline for turning PDFs (and other OCR-able inputs) into clean, accurate markdown.

The primary goal is reliability in real document workflows: predictable runs, resumable artifacts, and profile-driven backend control.

What it does

  • Runs a deterministic stage pipeline on local inputs:
    • extract -> clean -> sectionize -> normalize_map -> reduce -> validate -> export
  • Supports local, remote, and hybrid model backends through YAML profiles.
  • Writes complete run artifacts under ~/.scribai/artifacts/<run_id>/... by default for auditing and resume.
  • Exposes simple CLI commands for run/status/doctor checks.
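
As an illustration of how a fixed stage order and per-run artifacts make resume possible, here is a minimal Python sketch. It is not scribai's implementation: the `run_pipeline` function, the `<stage>.json` artifact layout, and the stage callables are all hypothetical. Only the stage names come from the list above.

```python
import json
from pathlib import Path

# Stage order as listed above. Everything else in this sketch is hypothetical.
STAGES = ["extract", "clean", "sectionize", "normalize_map", "reduce", "validate", "export"]


def run_pipeline(run_dir: Path, text: str, stage_fns: dict) -> list:
    """Run stages in order, skipping any stage whose artifact already exists.

    Returns the names of the stages that actually executed in this call.
    """
    run_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    current = text
    for name in STAGES:
        artifact = run_dir / f"{name}.json"  # hypothetical artifact layout
        if artifact.exists():
            # Resume path: reuse the stored output instead of recomputing it.
            current = json.loads(artifact.read_text())["output"]
            continue
        current = stage_fns[name](current)
        artifact.write_text(json.dumps({"stage": name, "output": current}))
        executed.append(name)
    return executed
```

Calling `run_pipeline` twice with the same `run_dir` executes nothing the second time, which is the property that makes a rerun with the same run ID cheap.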

Install

For source-tree development:

uv sync --group dev

Install from PyPI:

uv tool install scribai

Install the current checkout as a local CLI tool:

uv tool install .

Quick start

Run on a sample markdown fixture:

uv run scribai run \
  --input samples/docs/mini_api.md

For an installed tool, the equivalent command is just:

scribai run --input /path/to/file.pdf

By default, scribai uses the built-in auto preset when neither --profile nor --preset is provided. auto picks the first configured provider API key from OPENROUTER_API_KEY, CEREBRAS_API_KEY, or OPENAI_API_KEY.

If no provider key is set, use --preset passthrough (no model normalization) or pass an explicit --profile.
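
The selection rule described above can be sketched in a few lines of Python. This is an approximation for illustration, not scribai's code: the `pick_provider` name is hypothetical, and treating an empty or whitespace-only value as "not set" is an assumption.

```python
import os

# Environment variable order taken from the text above.
PROVIDER_KEYS = [
    ("openrouter", "OPENROUTER_API_KEY"),
    ("cerebras", "CEREBRAS_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]


def pick_provider(env=None):
    """Return the first provider whose API key is set, or None if none is.

    Assumption: an empty or whitespace-only value counts as unset.
    """
    env = os.environ if env is None else env
    for provider, var in PROVIDER_KEYS:
        if env.get(var, "").strip():
            return provider
    return None
```

A `None` result corresponds to the case where you would reach for --preset passthrough or an explicit --profile.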

The installed CLI does not depend on a source checkout: scribai reads optional user config from ~/.scribai/config.yaml and writes artifacts to ~/.scribai/artifacts/ by default. Set SCRIBAI_HOME to move both locations.

Run with a built-in preset (no profile path needed):

uv run scribai run \
  --preset openrouter \
  --input /path/to/file.pdf

Optional quick overrides:

  • --text-model <model_id>
  • --ocr-model <model_id>
  • --artifacts-root <path>
  • --output <dir> copies the final exported outputs to a user-facing directory

Validate profile + input before running:

uv run scribai doctor \
  --profile profiles/pipeline.profile.example.yaml \
  --input samples/docs/mini_api.md

Inspect a run by ID:

uv run scribai status \
  --profile profiles/pipeline.profile.example.yaml \
  --run-id <run_id>

CLI

  • scribai run --profile ... --input ... [--run-id ...] [--resume]
  • scribai run --input ... (defaults to --preset auto)
  • scribai run --preset <auto|openrouter|cerebras|openai|passthrough> --input ...
  • scribai run ... --output <dir> copies artifacts/<run_id>/final/ to <dir>
  • scribai status --profile ... --run-id ...
  • scribai status --run-id ... (defaults to --preset auto)
  • scribai status --preset <auto|openrouter|cerebras|openai|passthrough> --run-id ...
  • scribai doctor --profile ... --input ...
  • scribai doctor --input ... (defaults to --preset auto)
  • scribai doctor --preset <auto|openrouter|cerebras|openai|passthrough> --input ...

Installed Usage

  • Default home: ~/.scribai
  • Optional config: ~/.scribai/config.yaml
  • Default artifacts root: ~/.scribai/artifacts
  • Override home root with SCRIBAI_HOME=/custom/path
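
For scripts that need to locate these paths, the defaults above can be mirrored roughly as follows. The `scribai_paths` helper is hypothetical and only reflects the locations listed here. It does not account for an artifacts_root set in the config file or passed via --artifacts-root.

```python
import os
from pathlib import Path


def scribai_paths(env=None):
    """Resolve home, config, and artifacts paths from SCRIBAI_HOME or the default."""
    env = os.environ if env is None else env
    # Fall back to ~/.scribai when SCRIBAI_HOME is unset or empty (assumption).
    home = Path(env.get("SCRIBAI_HOME") or Path.home() / ".scribai").expanduser()
    return {
        "home": home,
        "config": home / "config.yaml",
        "artifacts": home / "artifacts",
    }
```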

Minimal optional config example:

version: 1
defaults:
  preset: auto
  artifacts_root: ~/.scribai/artifacts
  provider_priority:
    - openrouter
    - cerebras
    - openai
models:
  openrouter: qwen/qwen3.5-35b-a3b
  cerebras: gpt-oss-120b
  openai: gpt-4o-mini

Precedence is: CLI flags > explicit --profile > ~/.scribai/config.yaml > built-in defaults.
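
One way to picture this precedence is as layered dictionaries where the first layer that defines a key wins. The sketch below uses Python's `collections.ChainMap`. The `resolve_settings` function and the flat key names are illustrative assumptions, not scribai's internal data model.

```python
from collections import ChainMap


def resolve_settings(cli_flags, profile, user_config, builtin_defaults):
    """Merge settings so that earlier layers override later ones."""
    layers = [cli_flags, profile, user_config, builtin_defaults]
    # Drop None values so an unset CLI flag does not mask a lower layer.
    cleaned = [{k: v for k, v in layer.items() if v is not None} for layer in layers]
    # ChainMap looks keys up left to right, so the first layer has priority.
    return dict(ChainMap(*cleaned))
```

For example, a model passed on the command line overrides the one in ~/.scribai/config.yaml, while a preset left unset on the command line falls through to the config file.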

Profiles

Profile files live in profiles/ and are organized by topology:

  • profiles/pipeline.profile.example.yaml - minimal baseline
  • profiles/local_spawned/ - scribai launches backend process
  • profiles/local_attached/ - connect to an already-running local backend
  • profiles/remote/ - hosted provider profiles
  • profiles/hybrid/ - mixed local/remote profile patterns

See profiles/README.md for layout details.

Those example profiles are primarily for source-tree and advanced custom usage. The installed CLI does not depend on the repository profiles/ directory for default runs.

OCR behavior (explicit)

For PDF inputs, extraction follows this order:

  1. If a profile defines an ocr_vision role, scribai calls that vision model for OCR extraction.
  2. If no ocr_vision role exists (or OCR vision extraction fails), scribai falls back to local pymupdf4llm extraction.
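
The order above amounts to a try-then-fall-back pattern. A minimal sketch, assuming the profile is a plain dictionary and that the two extractors are passed in as callables (`vision_ocr` and `local_extract` are hypothetical names, not scribai functions):

```python
def extract_pdf(pdf_path, profile, vision_ocr, local_extract):
    """Return (markdown, source), where source is "ocr_vision" or "local".

    vision_ocr(pdf_path, role_config) and local_extract(pdf_path) are
    caller-supplied callables. Both are placeholders for illustration.
    """
    roles = profile.get("roles", {})
    if "ocr_vision" in roles:
        try:
            return vision_ocr(pdf_path, roles["ocr_vision"]), "ocr_vision"
        except Exception:
            # Step 2: any vision OCR failure falls through to local extraction.
            pass
    return local_extract(pdf_path), "local"
```

In a real setup, `local_extract` could wrap `pymupdf4llm.to_markdown`. Whether scribai calls it exactly that way is not stated here.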

Current hosted/hybrid examples are intentionally anchored to GLM-OCR as the default OCR model (provider: glm_ocr, model: glm-ocr).

You can switch OCR models/backends by editing profile config:

  • backends.ocr_backend (adapter/topology/base_url/auth)
  • roles.ocr_vision.model (OCR model id)

For this initial public release, GLM-OCR is the recommended default.

Environment

Copy .env.example to .env and set credentials as needed:

  • OPENROUTER_API_KEY
  • CEREBRAS_API_KEY
  • OPENAI_API_KEY

Useful runtime controls:

  • SCRIBAI_PROGRESS=0 disables the tqdm progress bar for the map stage
  • SCRIBAI_MAP_RATE_LIMIT_RETRIES=<int> sets the per-chunk retry budget for rate-limit events
  • SCRIBAI_CEREBRAS_TIER=paygo switches Cerebras metadata assumptions to the paygo tier
  • SCRIBAI_BACKEND_PASSTHROUGH_LOGS=1 shows spawned backend stdout/stderr
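
To make "per-chunk retry budget" concrete, here is a rough Python sketch of a bounded retry loop driven by SCRIBAI_MAP_RATE_LIMIT_RETRIES. It is an assumption-heavy illustration: the `RateLimited` exception, the fallback value of 3, and the exponential backoff schedule are all invented for this example and may not match scribai's behavior.

```python
import os
import time


class RateLimited(Exception):
    """Hypothetical marker for a provider rate-limit response."""


def call_with_retry_budget(fn, chunk, env=None, sleep=time.sleep):
    """Call fn(chunk), retrying on RateLimited up to the configured budget."""
    env = os.environ if env is None else env
    # Fallback of 3 is an assumption, not scribai's documented default.
    retries = max(0, int(env.get("SCRIBAI_MAP_RATE_LIMIT_RETRIES", "3")))
    for attempt in range(retries + 1):
        try:
            return fn(chunk)
        except RateLimited:
            if attempt == retries:
                raise  # budget exhausted for this chunk
            sleep(min(2 ** attempt, 30))  # assumed backoff: 1s, 2s, 4s, capped at 30s
```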

Reliability notes

  • Use explicit --run-id for long jobs so reruns can safely --resume.
  • Review run artifacts in ~/.scribai/artifacts/<run_id>/ by default (map telemetry, validation report, final markdown).
  • Keep profile config as the source of truth for backend behavior (timeouts, workers, output limits).
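
When driving long jobs from a script, it can help to build the command once with a fixed run ID and rerun it with --resume after a failure. The helper below is a hypothetical wrapper that only assembles flags listed in the CLI section. It assumes --run-id and --resume are accepted with or without --profile.

```python
import shlex


def build_run_command(input_path, run_id, profile=None, resume=False):
    """Assemble a `scribai run` argument list with an explicit run ID."""
    cmd = ["scribai", "run"]
    if profile:
        cmd += ["--profile", profile]
    cmd += ["--input", input_path, "--run-id", run_id]
    if resume:
        cmd.append("--resume")
    return cmd


if __name__ == "__main__":
    # Print the command instead of executing it, so this sketch has no side effects.
    print(shlex.join(build_run_command("/path/to/file.pdf", "job-001", resume=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once scribai is installed.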

Large-document preflight

Estimate token budget locally from cleaned markdown before expensive remote runs:

uv run scripts/estimate_token_budget.py \
  --markdown artifacts/<preflight_run_id>/raw/cleaned.md \
  --model gpt-oss-120b
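
If the script is not at hand, a very rough estimate is still possible. The sketch below uses a characters-per-token heuristic. This is not how scripts/estimate_token_budget.py works (its method is not described here), and the ratio of 4 characters per token is a common rule of thumb for English text, not a property of any particular model.

```python
import math


def rough_token_estimate(markdown_text, chars_per_token=4.0):
    """Approximate token count from character length (heuristic only)."""
    return math.ceil(len(markdown_text) / chars_per_token)


def fits_budget(markdown_text, context_tokens, reserve_for_output):
    """Check whether the input plus an output reserve fits a context window."""
    return rough_token_estimate(markdown_text) + reserve_for_output <= context_tokens
```

Treat the result as an order-of-magnitude check. A tokenizer-based count for the target model will be more accurate.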

Optional: model selection and benchmarking

Benchmarking exists to improve backend/model choices for the pipeline; it is not required for normal usage.

  • Sample benchmark command index: samples/README.md
  • Benchmark schema: docs/benchmark_schema.md
  • Quality framework: docs/quality_evaluation_framework.md
  • Synthetic benchmark spec: docs/benchmark_spec_v1.md

Contributing

See CONTRIBUTING.md for branch strategy, PR flow, and merge expectations. See RELEASING.md for the release checklist and PyPI publish flow. GitHub Actions CI runs the tests, checks packaging, and runs installed-wheel smoke checks.

License

MIT. See LICENSE.
