CLI-first document normalization pipeline for PDFs and OCR-able files

Maintainers

cewhite

Project description

scribai

CLI-first pipeline for turning PDFs (and other OCR-able inputs) into clean, accurate markdown.

The primary goal is reliability in real document workflows: predictable runs, resumable artifacts, and profile-driven backend control.

What it does

Runs a deterministic stage pipeline on local inputs:
- extract -> clean -> sectionize -> normalize_map -> reduce -> validate -> export
Supports local, remote, and hybrid model backends through YAML profiles.
Writes complete run artifacts under ~/.scribai/artifacts/<run_id>/... by default for auditing and resume.
Exposes simple CLI commands for run/status/doctor checks.
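The stage order and resumable artifacts described above can be pictured with a minimal sketch. This is not scribai's implementation: the `run_pipeline` helper, the per-stage callables, and the one-file-per-stage artifact naming are all hypothetical, and only the stage names come from the list above.

```python
from pathlib import Path
from typing import Callable

# Stage names in the order listed above.
STAGES = ["extract", "clean", "sectionize", "normalize_map", "reduce", "validate", "export"]


def run_pipeline(
    text: str,
    run_dir: Path,
    stage_fns: dict[str, Callable[[str], str]],
    resume: bool = False,
) -> str:
    """Run every stage in order, writing one artifact per stage.

    The "<stage>.txt" file naming is an assumption made for this sketch.
    """
    run_dir.mkdir(parents=True, exist_ok=True)
    for name in STAGES:
        artifact = run_dir / f"{name}.txt"
        if resume and artifact.exists():
            # Reuse the stored result instead of recomputing the stage.
            text = artifact.read_text(encoding="utf-8")
            continue
        text = stage_fns[name](text)
        artifact.write_text(text, encoding="utf-8")
    return text
```

Because each stage persists its output before the next one starts, a rerun with `resume=True` only recomputes the stages whose artifacts are missing.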

Install

For source-tree development:

uv sync --group dev

Install from PyPI:

uv tool install scribai

Install the current checkout as a local CLI tool:

uv tool install .

Quick start

Run on a sample markdown fixture:

uv run scribai run \
  --input samples/docs/mini_api.md

For an installed tool, the equivalent command is just:

scribai run --input /path/to/file.pdf

By default, scribai uses the built-in auto preset when neither --profile nor --preset is provided. auto picks the first configured provider API key from OPENROUTER_API_KEY, CEREBRAS_API_KEY, or OPENAI_API_KEY.

If no provider key is set, use --preset passthrough (no model normalization) or pass an explicit --profile.
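As a rough illustration of the auto preset's key lookup, the sketch below returns the first provider whose API key variable is set, in the order given above. The `pick_provider` function is hypothetical, and treating an empty or whitespace-only value as unset is an assumption, not documented behavior.

```python
import os
from typing import Mapping, Optional

# Provider names paired with their key variables, in the order listed above.
PROVIDER_KEYS = [
    ("openrouter", "OPENROUTER_API_KEY"),
    ("cerebras", "CEREBRAS_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]


def pick_provider(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first provider with a non-empty API key, or None."""
    env = os.environ if env is None else env
    for provider, variable in PROVIDER_KEYS:
        # Assumption: blank values count as "not configured".
        if env.get(variable, "").strip():
            return provider
    return None
```

A `None` result corresponds to the case above where `--preset passthrough` or an explicit `--profile` is needed.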

The installed CLI needs no repository checkout: scribai reads optional user config from ~/.scribai/config.yaml and writes artifacts to ~/.scribai/artifacts/ by default. Set SCRIBAI_HOME to move both locations.

Run with a built-in preset (no profile path needed):

uv run scribai run \
  --preset openrouter \
  --input /path/to/file.pdf

Optional quick overrides:

--text-model <model_id>
--ocr-model <model_id>
--artifacts-root <path>
--output <dir> copies the final exported outputs to a user-facing directory

Validate profile + input before running:

uv run scribai doctor \
  --profile profiles/pipeline.profile.example.yaml \
  --input samples/docs/mini_api.md

Inspect a run by ID:

uv run scribai status \
  --profile profiles/pipeline.profile.example.yaml \
  --run-id <run_id>

CLI

scribai run --profile ... --input ... [--run-id ...] [--resume]
scribai run --input ... (defaults to --preset auto)
scribai run --preset <auto|openrouter|cerebras|openai|passthrough> --input ...
scribai run ... --output <dir> copies artifacts/<run_id>/final/ to <dir>
scribai status --profile ... --run-id ...
scribai status --run-id ... (defaults to --preset auto)
scribai status --preset <auto|openrouter|cerebras|openai|passthrough> --run-id ...
scribai doctor --profile ... --input ...
scribai doctor --input ... (defaults to --preset auto)
scribai doctor --preset <auto|openrouter|cerebras|openai|passthrough> --input ...

Installed Usage

Default home: ~/.scribai
Optional config: ~/.scribai/config.yaml
Default artifacts root: ~/.scribai/artifacts
Override home root with SCRIBAI_HOME=/custom/path
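The locations above can be derived from a single home directory. The following sketch shows one way to do that; `resolve_paths` is a hypothetical helper, and the assumption that an empty SCRIBAI_HOME falls back to the default is mine.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_paths(env: Optional[Mapping[str, str]] = None) -> dict[str, Path]:
    """Compute home, config, and artifacts paths from SCRIBAI_HOME or the default."""
    env = os.environ if env is None else env
    override = env.get("SCRIBAI_HOME")
    # Assumption: an unset or empty override means the default home is used.
    home = Path(override).expanduser() if override else Path.home() / ".scribai"
    return {
        "home": home,
        "config": home / "config.yaml",
        "artifacts": home / "artifacts",
    }
```

With `SCRIBAI_HOME=/custom/path`, the config file and the artifacts root both move under `/custom/path`.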

Minimal optional config example:

version: 1
defaults:
  preset: auto
  artifacts_root: ~/.scribai/artifacts
  provider_priority:
    - openrouter
    - cerebras
    - openai
models:
  openrouter: qwen/qwen3.5-35b-a3b
  cerebras: gpt-oss-120b
  openai: gpt-4o-mini

Precedence is: CLI flags > explicit --profile > ~/.scribai/config.yaml > built-in defaults.
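One way to picture this precedence is a layered lookup in which the first layer that defines a setting wins. The sketch below is illustrative only: `effective_settings` and the setting names in the usage note are hypothetical, and treating `None` as "not provided" is an assumption.

```python
from collections import ChainMap
from typing import Any, Mapping


def effective_settings(
    cli_flags: Mapping[str, Any],
    profile: Mapping[str, Any],
    user_config: Mapping[str, Any],
    defaults: Mapping[str, Any],
) -> dict[str, Any]:
    """Merge settings so that earlier layers override later ones."""
    layers = [
        # Drop None values so an omitted flag does not mask a lower layer.
        {key: value for key, value in layer.items() if value is not None}
        for layer in (cli_flags, profile, user_config, defaults)
    ]
    # ChainMap searches its mappings left to right, so the first hit wins.
    return dict(ChainMap(*layers))
```

For example, a model id passed on the command line would override the same setting in a profile, while a value present only in the user config would still take effect over the built-in default.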

Profiles

Profile files live in profiles/ and are organized by topology:

profiles/pipeline.profile.example.yaml - minimal baseline
profiles/local_spawned/ - scribai launches backend process
profiles/local_attached/ - connect to an already-running local backend
profiles/remote/ - hosted provider profiles
profiles/hybrid/ - mixed local/remote profile patterns

See profiles/README.md for layout details.

Those example profiles are primarily for source-tree and advanced custom usage. The installed CLI does not depend on the repository profiles/ directory for default runs.

OCR behavior (explicit)

For PDF inputs, extraction follows this order:

If a profile defines an ocr_vision role, scribai calls that vision model for OCR extraction.
If no ocr_vision role exists (or OCR vision extraction fails), scribai falls back to local pymupdf4llm extraction.
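The extraction order above amounts to a try-then-fall-back pattern. The sketch below shows that pattern with injected callables; `extract_pdf` and both callables are hypothetical stand-ins, not scribai's internal API. In practice the local callable could wrap `pymupdf4llm.to_markdown`.

```python
from typing import Callable, Optional


def extract_pdf(
    pdf_path: str,
    vision_ocr: Optional[Callable[[str], str]],
    local_extract: Callable[[str], str],
) -> tuple[str, str]:
    """Return (markdown, source), preferring vision OCR when it is configured."""
    if vision_ocr is not None:
        try:
            return vision_ocr(pdf_path), "ocr_vision"
        except Exception:
            # Any OCR failure falls through to local extraction.
            pass
    return local_extract(pdf_path), "local"
```

Returning the source label alongside the text makes it easy to record which path produced a given run's extraction.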

Current hosted/hybrid examples are intentionally anchored to GLM-OCR as the default OCR model (provider: glm_ocr, model: glm-ocr).

You can switch OCR models/backends by editing profile config:

backends.ocr_backend (adapter/topology/base_url/auth)
roles.ocr_vision.model (OCR model id)

For this initial public release, GLM-OCR is the recommended default path.

Environment

Copy .env.example to .env and set credentials as needed:

OPENROUTER_API_KEY
CEREBRAS_API_KEY
OPENAI_API_KEY

Useful runtime controls:

SCRIBAI_PROGRESS=0 disables tqdm map-stage progress bar
SCRIBAI_MAP_RATE_LIMIT_RETRIES=<int> sets per-chunk retry budget for rate-limit events
SCRIBAI_CEREBRAS_TIER=paygo switches Cerebras metadata assumptions to paygo tier
SCRIBAI_BACKEND_PASSTHROUGH_LOGS=1 shows spawned backend stdout/stderr
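When scribai is launched from a script, these controls can be set on the child process environment rather than globally. The helper below is a hypothetical sketch that only uses the variable names and values listed above; it does not assume any default values.

```python
from typing import Mapping, Optional


def runtime_env(
    base: Mapping[str, str],
    *,
    progress: bool = True,
    rate_limit_retries: Optional[int] = None,
    passthrough_logs: bool = False,
) -> dict[str, str]:
    """Return a copy of base with the requested scribai controls applied."""
    env = dict(base)
    if not progress:
        env["SCRIBAI_PROGRESS"] = "0"  # disable the map-stage progress bar
    if rate_limit_retries is not None:
        env["SCRIBAI_MAP_RATE_LIMIT_RETRIES"] = str(int(rate_limit_retries))
    if passthrough_logs:
        env["SCRIBAI_BACKEND_PASSTHROUGH_LOGS"] = "1"  # show backend stdout/stderr
    return env
```

The result can be passed as the `env` argument of `subprocess.run`, typically starting from `os.environ` as the base.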

Reliability notes

Use explicit --run-id for long jobs so reruns can safely --resume.
Review run artifacts in ~/.scribai/artifacts/<run_id>/ by default (map telemetry, validation report, final markdown).
Keep profile config as the source of truth for backend behavior (timeouts, workers, output limits).
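The first note can be scripted: start a run with a fixed run ID, and if it fails, rerun the same ID with --resume. The wrapper below is a sketch under two assumptions: that a failed run exits with a nonzero status, and that --run-id and --resume combine with --input as shown in the CLI section. The function names are hypothetical.

```python
import subprocess
from typing import Callable, Optional


def build_run_command(
    input_path: str,
    run_id: str,
    resume: bool = False,
    profile: Optional[str] = None,
) -> list[str]:
    """Build a `scribai run` argument list using only flags from the CLI section."""
    cmd = ["scribai", "run", "--input", input_path, "--run-id", run_id]
    if profile:
        cmd += ["--profile", profile]
    if resume:
        cmd.append("--resume")
    return cmd


def run_with_resume(input_path: str, run_id: str, runner: Callable = subprocess.run) -> int:
    """Run once, then retry a single time with --resume if the first attempt fails."""
    first = runner(build_run_command(input_path, run_id))
    if first.returncode == 0:
        return 0
    # Assumption: a nonzero exit code means the run can be resumed by ID.
    second = runner(build_run_command(input_path, run_id, resume=True))
    return second.returncode
```

The `runner` parameter exists so the wrapper can be exercised without invoking the real CLI.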

Large-document preflight

Estimate token budget locally from cleaned markdown before expensive remote runs:

uv run scripts/estimate_token_budget.py \
  --markdown artifacts/<preflight_run_id>/raw/cleaned.md \
  --model gpt-oss-120b
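For a quick sanity check before running the script, a character-count heuristic gives a very rough token figure. This is not the method used by scripts/estimate_token_budget.py, which is not shown here; the ratio of about four characters per token is a common rule of thumb for English text and is an assumption, as are the function names.

```python
import math


def rough_token_estimate(markdown: str, chars_per_token: float = 4.0) -> int:
    """Estimate token count from character count (heuristic, not a tokenizer)."""
    return math.ceil(len(markdown) / chars_per_token)


def fits_budget(markdown: str, context_tokens: int, reserve_tokens: int = 0) -> bool:
    """Check whether the estimate plus a reserve fits within a token budget."""
    return rough_token_estimate(markdown) + reserve_tokens <= context_tokens
```

Real token counts depend on the model's tokenizer, so treat this only as an order-of-magnitude check.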

Optional: model selection and benchmarking

Benchmarking exists to improve backend/model choices for the pipeline; it is not required for normal usage.

Sample benchmark command index: samples/README.md
Benchmark schema: docs/benchmark_schema.md
Quality framework: docs/quality_evaluation_framework.md
Synthetic benchmark spec: docs/benchmark_spec_v1.md

Contributing

See CONTRIBUTING.md for branch strategy, PR flow, and merge expectations. See RELEASING.md for the release checklist and PyPI publish flow. Basic GitHub Actions CI runs the tests and checks packaging, including installed-wheel smoke checks.

License

MIT. See LICENSE.

Project details

Release history

This version

0.1.0

Mar 10, 2026

Download files

Source Distribution

scribai-0.1.0.tar.gz (257.6 kB)

Uploaded Mar 10, 2026 Source

Built Distribution

scribai-0.1.0-py3-none-any.whl (47.2 kB)

Uploaded Mar 10, 2026 Python 3

File details

Details for the file scribai-0.1.0.tar.gz.

File metadata

Download URL: scribai-0.1.0.tar.gz
Upload date: Mar 10, 2026
Size: 257.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scribai-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9f716dcddbb4e381ede9f715673edc8c189d54103cad1013198c113530dfdb10`
MD5	`16923a26117700f21d40b9e0f13b8b3c`
BLAKE2b-256	`001b2d8265d84d293cbf0aa0650a1675af5d0cd9d0a766a0bd91ce173f5ea109`
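A downloaded file can be checked against the SHA256 digest above with the standard library. The sketch below streams the file through `hashlib.sha256`; the helper names and the local file path in the usage note are assumptions.

```python
import hashlib

# SHA256 digest for scribai-0.1.0.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = "9f716dcddbb4e381ede9f715673edc8c189d54103cad1013198c113530dfdb10"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("scribai-0.1.0.tar.gz", EXPECTED_SDIST_SHA256)` should be true for an intact copy of the source distribution saved under that name.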

Provenance

The following attestation bundles were made for scribai-0.1.0.tar.gz:

Publisher: publish.yml on codyw912/scribai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scribai-0.1.0.tar.gz
- Subject digest: 9f716dcddbb4e381ede9f715673edc8c189d54103cad1013198c113530dfdb10
- Sigstore transparency entry: 1076570006
- Sigstore integration time: Mar 10, 2026
Source repository:
- Permalink: codyw912/scribai@841ae27c475ad7ad5b2ddd3f2e232a744f25d674
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/codyw912
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@841ae27c475ad7ad5b2ddd3f2e232a744f25d674
- Trigger Event: push

File details

Details for the file scribai-0.1.0-py3-none-any.whl.

File metadata

Download URL: scribai-0.1.0-py3-none-any.whl
Upload date: Mar 10, 2026
Size: 47.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scribai-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d9b13f2bed10db0b419d1e4ec5b5f22c209efb2e01ca86224d65fa60b4d00798`
MD5	`8242ffdac20586a22b90c4627cb6d4c5`
BLAKE2b-256	`98bab4275992bcaa973c2143116a7f4d6d2568ac42dd00c93450c29e6d8242cc`

Provenance

The following attestation bundles were made for scribai-0.1.0-py3-none-any.whl:

Publisher: publish.yml on codyw912/scribai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scribai-0.1.0-py3-none-any.whl
- Subject digest: d9b13f2bed10db0b419d1e4ec5b5f22c209efb2e01ca86224d65fa60b4d00798
- Sigstore transparency entry: 1076570014
- Sigstore integration time: Mar 10, 2026
Source repository:
- Permalink: codyw912/scribai@841ae27c475ad7ad5b2ddd3f2e232a744f25d674
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/codyw912
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@841ae27c475ad7ad5b2ddd3f2e232a744f25d674
- Trigger Event: push
