CLI-first document normalization pipeline for PDFs and OCR-able files
scribai
CLI-first pipeline for turning PDFs (and other OCR-able inputs) into clean, accurate markdown.
The primary goal is reliability in real document workflows: predictable runs, resumable artifacts, and profile-driven backend control.
What it does
- Runs a deterministic stage pipeline on local inputs:
extract -> clean -> sectionize -> normalize_map -> reduce -> validate -> export
- Supports local, remote, and hybrid model backends through YAML profiles.
- Writes complete run artifacts under ~/.scribai/artifacts/<run_id>/... by default for auditing and resume.
- Exposes simple CLI commands for run/status/doctor checks.
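The stage order above can be pictured as a fixed list of functions applied one after another. The following is a minimal sketch, not scribai's actual implementation: only the stage names come from the list above, while `make_stage`, `run_pipeline`, and the placeholder stage bodies are hypothetical.

```python
from typing import Callable

# Stage names in the documented order; the stage bodies below are placeholders.
STAGE_ORDER = [
    "extract", "clean", "sectionize", "normalize_map",
    "reduce", "validate", "export",
]


def make_stage(name: str) -> Callable[[dict], dict]:
    """Build a placeholder stage that only records that it ran."""
    def stage(state: dict) -> dict:
        return {**state, "completed": state["completed"] + [name]}
    return stage


def run_pipeline(source: str) -> dict:
    """Apply every stage once, always in the same order (hypothetical helper)."""
    state = {"input": source, "completed": []}
    for name in STAGE_ORDER:
        state = make_stage(name)(state)
    return state


if __name__ == "__main__":
    result = run_pipeline("samples/docs/mini_api.md")
    print(" -> ".join(result["completed"]))
```

Because the order is a plain list rather than something decided at run time, two runs on the same input visit the stages identically, which is the property the word "deterministic" refers to here.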
Install
For source-tree development:
uv sync --group dev
Install from PyPI:
uv tool install scribai
Install the current checkout as a local CLI tool:
uv tool install .
Quick start
Run on a sample markdown fixture:
uv run scribai run \
--input samples/docs/mini_api.md
For an installed tool, the equivalent command is just:
scribai run --input /path/to/file.pdf
By default, scribai uses the built-in auto preset when neither --profile
nor --preset is provided. auto picks the first configured provider API key
from OPENROUTER_API_KEY, CEREBRAS_API_KEY, or OPENAI_API_KEY.
If no provider key is set, use --preset passthrough (no model normalization)
or pass an explicit --profile.
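The key lookup described above amounts to a first-match scan over environment variables. This is a hedged sketch of that idea rather than scribai's code: `pick_provider` is a hypothetical helper, and treating an empty value as "not set" is an assumption.

```python
import os
from typing import Mapping, Optional

# Order taken from the text above; provider names mirror the preset names.
KEY_ORDER = [
    ("openrouter", "OPENROUTER_API_KEY"),
    ("cerebras", "CEREBRAS_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]


def pick_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first provider whose key is set and non-empty, else None."""
    for provider, variable in KEY_ORDER:
        # Assumption: a blank value counts as "no key configured".
        if env.get(variable, "").strip():
            return provider
    return None


if __name__ == "__main__":
    chosen = pick_provider()
    print(chosen or "no provider key found; use --preset passthrough or --profile")
```

A `None` result corresponds to the case where `--preset passthrough` or an explicit `--profile` is needed.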
The installed CLI does not need the source tree: scribai reads optional user config from
~/.scribai/config.yaml and writes artifacts to ~/.scribai/artifacts/ by
default. Set SCRIBAI_HOME to move both locations.
Run with a built-in preset (no profile path needed):
uv run scribai run \
--preset openrouter \
--input /path/to/file.pdf
Optional quick overrides:
- --text-model <model_id>
- --ocr-model <model_id>
- --artifacts-root <path>
- --output <dir> copies the final exported outputs to a user-facing directory
Validate profile + input before running:
uv run scribai doctor \
--profile profiles/pipeline.profile.example.yaml \
--input samples/docs/mini_api.md
Inspect a run by ID:
uv run scribai status \
--profile profiles/pipeline.profile.example.yaml \
--run-id <run_id>
CLI
- scribai run --profile ... --input ... [--run-id ...] [--resume]
- scribai run --input ... (defaults to --preset auto)
- scribai run --preset <auto|openrouter|cerebras|openai|passthrough> --input ...
- scribai run ... --output <dir> copies artifacts/<run_id>/final/ to <dir>
- scribai status --profile ... --run-id ...
- scribai status --run-id ... (defaults to --preset auto)
- scribai status --preset <auto|openrouter|cerebras|openai|passthrough> --run-id ...
- scribai doctor --profile ... --input ...
- scribai doctor --input ... (defaults to --preset auto)
- scribai doctor --preset <auto|openrouter|cerebras|openai|passthrough> --input ...
Installed Usage
- Default home: ~/.scribai
- Optional config: ~/.scribai/config.yaml
- Default artifacts root: ~/.scribai/artifacts
- Override home root with SCRIBAI_HOME=/custom/path
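The relationship between these locations can be sketched as follows. This is an illustration under stated assumptions, not scribai's internals: `resolve_paths` is a hypothetical helper, and it assumes the config file and artifacts root are always derived from the home directory, as the list above suggests.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_paths(env: Mapping[str, str] = os.environ) -> dict:
    """Derive config and artifact locations from SCRIBAI_HOME (hypothetical helper)."""
    override = env.get("SCRIBAI_HOME")
    # Fall back to ~/.scribai when the variable is unset or empty.
    home = Path(override) if override else Path.home() / ".scribai"
    return {
        "home": home,
        "config": home / "config.yaml",
        "artifacts_root": home / "artifacts",
    }


if __name__ == "__main__":
    for name, path in resolve_paths().items():
        print(f"{name}: {path}")
```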
Minimal optional config example:
version: 1
defaults:
  preset: auto
  artifacts_root: ~/.scribai/artifacts
provider_priority:
  - openrouter
  - cerebras
  - openai
models:
  openrouter: qwen/qwen3.5-35b-a3b
  cerebras: gpt-oss-120b
  openai: gpt-4o-mini
Precedence is: CLI flags > explicit --profile > ~/.scribai/config.yaml > built-in defaults.
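One common way to implement a precedence chain like this is a layered lookup where the first layer that defines a key wins. The sketch below uses `collections.ChainMap` for that purpose. It is an assumption about the approach, not a description of scribai's code: `resolve_settings`, `BUILTIN_DEFAULTS`, and the flat key names are hypothetical.

```python
from collections import ChainMap

# Hypothetical built-in defaults, using values mentioned in this README.
BUILTIN_DEFAULTS = {"preset": "auto", "artifacts_root": "~/.scribai/artifacts"}


def resolve_settings(cli_flags: dict, profile: dict, user_config: dict) -> dict:
    """Merge layers so that earlier layers win (hypothetical helper)."""
    layers = [cli_flags, profile, user_config, BUILTIN_DEFAULTS]
    # Drop unset (None) values so they do not mask lower-priority layers.
    cleaned = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in layers
    ]
    # ChainMap searches its maps left to right and returns the first hit.
    return dict(ChainMap(*cleaned))


if __name__ == "__main__":
    settings = resolve_settings(
        cli_flags={"artifacts_root": "/tmp/run", "preset": None},
        profile={},
        user_config={"preset": "openrouter"},
    )
    print(settings)
```

In the usage example, the artifacts root comes from the CLI layer and the preset comes from the user config, since the CLI left it unset.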
Profiles
Profile files live in profiles/ and are organized by topology:
- profiles/pipeline.profile.example.yaml - minimal baseline
- profiles/local_spawned/ - scribai launches backend process
- profiles/local_attached/ - connect to an already-running local backend
- profiles/remote/ - hosted provider profiles
- profiles/hybrid/ - mixed local/remote profile patterns
See profiles/README.md for layout details.
Those example profiles are primarily for source-tree and advanced custom usage.
The installed CLI does not depend on the repository profiles/ directory for
default runs.
OCR behavior (explicit)
For PDF inputs, extraction follows this order:
- If a profile defines an ocr_vision role, scribai calls that vision model for OCR extraction.
- If no ocr_vision role exists (or OCR vision extraction fails), scribai falls back to local pymupdf4llm extraction.
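The two rules above describe a try-then-fall-back pattern. Here is a minimal sketch of that control flow, with both extractors passed in as plain callables. `extract_pdf` and its parameters are hypothetical names, and catching every exception as an "OCR failure" is an assumption made for brevity.

```python
from typing import Callable, Optional


def extract_pdf(
    pdf_path: str,
    vision_ocr: Optional[Callable[[str], str]],
    local_extract: Callable[[str], str],
) -> str:
    """Try the vision OCR role first, then fall back to local extraction."""
    if vision_ocr is not None:
        try:
            return vision_ocr(pdf_path)
        except Exception:
            # Assumption: any OCR failure falls through to the local extractor.
            pass
    return local_extract(pdf_path)


if __name__ == "__main__":
    def failing_ocr(path: str) -> str:
        raise RuntimeError("vision backend unavailable")

    # The vision call fails, so the local extractor's output is used.
    print(extract_pdf("file.pdf", failing_ocr, lambda path: "# local markdown"))
```

In a real setup, `local_extract` could wrap `pymupdf4llm.to_markdown`, and `vision_ocr` would be `None` whenever the profile defines no ocr_vision role.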
Current hosted/hybrid examples are intentionally anchored to GLM-OCR as the
default OCR model (provider: glm_ocr, model: glm-ocr).
You can switch OCR models/backends by editing profile config:
- backends.ocr_backend (adapter/topology/base_url/auth)
- roles.ocr_vision.model (OCR model id)
For this initial public push, GLM-OCR is the recommended default path.
Environment
Copy .env.example to .env and set credentials as needed:
- OPENROUTER_API_KEY
- CEREBRAS_API_KEY
- OPENAI_API_KEY
Useful runtime controls:
- SCRIBAI_PROGRESS=0 disables tqdm map-stage progress bar
- SCRIBAI_MAP_RATE_LIMIT_RETRIES=<int> sets per-chunk retry budget for rate-limit events
- SCRIBAI_CEREBRAS_TIER=paygo switches Cerebras metadata assumptions to paygo tier
- SCRIBAI_BACKEND_PASSTHROUGH_LOGS=1 shows spawned backend stdout/stderr
Reliability notes
- Use explicit --run-id for long jobs so reruns can safely --resume.
- Review run artifacts in ~/.scribai/artifacts/<run_id>/ by default (map telemetry, validation report, final markdown).
- Keep profile config as the source of truth for backend behavior (timeouts, workers, output limits).
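The reason a stable run ID makes resuming safe is that each stage's output can be looked up under the same directory on the next attempt. The sketch below illustrates that idea only. The one-file-per-stage layout, the `.md` file name, and `run_stage` itself are hypothetical and do not describe scribai's real artifact format.

```python
from pathlib import Path
from typing import Callable


def run_stage(run_dir: Path, name: str, produce: Callable[[], str], resume: bool) -> str:
    """Run one stage, or reuse its saved output when resuming (hypothetical layout)."""
    artifact = run_dir / f"{name}.md"  # hypothetical file name, one per stage
    if resume and artifact.exists():
        # A previous attempt already finished this stage; skip the work.
        return artifact.read_text(encoding="utf-8")
    run_dir.mkdir(parents=True, exist_ok=True)
    output = produce()
    artifact.write_text(output, encoding="utf-8")
    return output
```

With a fixed `run_dir`, a second call with `resume=True` returns the stored text without calling `produce` again, which is what keeps long jobs from repeating expensive model calls.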
Large-document preflight
Estimate token budget locally from cleaned markdown before expensive remote runs:
uv run scripts/estimate_token_budget.py \
--markdown artifacts/<preflight_run_id>/raw/cleaned.md \
--model gpt-oss-120b
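For a sense of what a local estimate involves, the sketch below uses the common rule of thumb of roughly four characters per token. This is not how scripts/estimate_token_budget.py works (its method is not described here). Both functions are hypothetical, and real tokenizers will give different counts, so treat the result as an order-of-magnitude check.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


def fits_budget(text: str, budget_tokens: int, chars_per_token: float = 4.0) -> bool:
    """Check whether the estimate stays within a caller-chosen token budget."""
    return estimate_tokens(text, chars_per_token) <= budget_tokens


if __name__ == "__main__":
    sample = "# Title\n\nSome cleaned markdown text.\n"
    print(estimate_tokens(sample), fits_budget(sample, budget_tokens=100))
```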
Optional: model selection and benchmarking
Benchmarking exists to improve backend/model choices for the pipeline; it is not required for normal usage.
- Sample benchmark command index: samples/README.md
- Benchmark schema: docs/benchmark_schema.md
- Quality framework: docs/quality_evaluation_framework.md
- Synthetic benchmark spec: docs/benchmark_spec_v1.md
Contributing
See CONTRIBUTING.md for branch strategy, PR flow, and merge expectations.
See RELEASING.md for the release checklist and PyPI publish flow.
Basic GitHub Actions CI now validates tests, packaging, and installed-wheel smoke checks.
License
MIT. See LICENSE.