Agentic digitization of images, PDFs, and office documents. A router for all your OCR engines, VLMs, and text extractors.
OCRAgent
OCR-first, agent-guided.
OCRAgent is a command-line document parsing workflow. It uses an agent to select OCR, VLM, PDF, office-document, or user-defined tools, then reviews the extracted text before writing output.
The goal is practical routing: use inexpensive extraction when it is enough, and spend model/API cost only on files that need it.
Grade, Route, Parse, Review
OCRAgent works best for mixed folders: PDFs with text layers, scanned PDFs, images, office files, handwritten pages, tables, forms, and other files that should not all use the same parser.
| Step | What OCRAgent does | Main artifact |
|---|---|---|
| Grade | Inspects file names, metadata, preview signals, and sample pages to estimate parsing difficulty. | .ocragent_memory.txt |
| Route | Chooses a parser from builtin tools and user-defined tools according to cost, scope, and prior folder notes. | tool call |
| Parse | Runs the selected tool and writes UTF-8 text while preserving source-relative paths. | ocragent_results/ |
| Review | Checks whether extracted text is usable; retries with another tool or route when review fails. | accepted output or retry |
The four steps keep the system understandable:
- Grade before spending model/API cost.
- Route through one tool registry.
- Parse through deterministic command boundaries.
- Review before writing final output.
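The four steps above can be sketched as a small loop. This is an illustrative simplification, not OCRAgent's implementation: `Tool`, `grade`, `route`, `review`, and `process` are hypothetical names, and the real grading and review steps rely on model judgment rather than the fixed heuristics used here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

COST_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Tool:
    """Hypothetical tool record: a name, a cost tier, and a callable."""
    name: str
    cost: str
    run: Callable[[Path], str]


def grade(path: Path) -> str:
    """Grade: cheap difficulty guess from the file name alone (assumed heuristic)."""
    return "hard" if path.suffix.lower() in {".jpg", ".jpeg", ".png"} else "easy"


def route(difficulty: str, tools: list[Tool]) -> list[Tool]:
    """Route: order candidates cheapest first; skip low-cost tools for hard files."""
    ordered = sorted(tools, key=lambda t: COST_RANK[t.cost])
    if difficulty == "hard":
        ordered = [t for t in ordered if t.cost != "low"] or ordered
    return ordered


def review(text: str) -> bool:
    """Review: stand-in for the reviewer agent; rejects empty or near-empty text."""
    return len(text.strip()) >= 20


def process(path: Path, tools: list[Tool]) -> Optional[tuple[str, str]]:
    """Try each routed tool in turn and return the first output that passes review."""
    for tool in route(grade(path), tools):
        text = tool.run(path)  # Parse
        if review(text):       # Review
            return tool.name, text
    return None


# Stub tools: the cheap one returns nothing, as it would for a PDF without a text layer.
tools = [
    Tool("fake_vlm", "high", lambda p: "Invoice 2024-001, total due 120.00 EUR"),
    Tool("fake_pdf_text", "low", lambda p: ""),
]
accepted = process(Path("scans/page-01.pdf"), tools)
print(accepted)
```

The cheap tool runs first and fails review, so the loop escalates to the higher-cost tool. That is the cost-saving behavior the steps describe.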
Runtime Flow
documents
-> init docs
-> folder memory
-> parser agent
-> parser tool
-> reviewer agent
-> output text
Install
Install with common document backends:
python -m pip install "ocragent[full]"
ocragent --help
Or install it as a uv tool:
uv tool install "ocragent[full]"
ocragent --help
Configure a chat-completions API through environment variables:
export OCRAGENT_CHAT_BASE=http://localhost:8080/v1
export OCRAGENT_CHAT_MODEL=your-model
export OCRAGENT_CHAT_AUTHKEY=your-key
OPENAI_API_KEY is also accepted as the auth key. A vision-capable model is strongly recommended, because OCRAgent uses model judgment during grading and review. The same values can be configured in ~/.ocragent/ocragent.settings.toml, ./ocragent.settings.toml, or .env. Use src/ocragent/ocragent.settings.default.toml as the reference.
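As a rough illustration of how these variables could be resolved, the sketch below reads them from a mapping. The function name `resolve_chat_env` is hypothetical, and the assumption that `OCRAGENT_CHAT_AUTHKEY` takes precedence over `OPENAI_API_KEY` is not confirmed by the text above.

```python
import os
from typing import Mapping, Optional


def resolve_chat_env(env: Mapping[str, str] = os.environ) -> dict[str, Optional[str]]:
    """Collect chat-completions settings from environment-style variables.

    Assumption: OCRAGENT_CHAT_AUTHKEY wins when both keys are set;
    OPENAI_API_KEY is used only as a fallback.
    """
    return {
        "base": env.get("OCRAGENT_CHAT_BASE"),
        "model": env.get("OCRAGENT_CHAT_MODEL"),
        "authkey": env.get("OCRAGENT_CHAT_AUTHKEY") or env.get("OPENAI_API_KEY"),
    }


# Example with an explicit mapping instead of the real process environment.
settings = resolve_chat_env({
    "OCRAGENT_CHAT_BASE": "http://localhost:8080/v1",
    "OCRAGENT_CHAT_MODEL": "your-model",
    "OPENAI_API_KEY": "your-key",
})
print(settings)
```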
Text-only LLM vs multimodal VLM
| Stage | Text-only LLM | Multimodal VLM |
|---|---|---|
| Grade | Uses file names, metadata, text-layer probes, and OCR samples. It can estimate readability from extracted text, but cannot inspect page images directly. | Uses thumbnails or rendered pages to judge scan quality, handwriting, diagrams, tables, layout density, and whether OCR is likely to fail. |
| Review | Checks whether extracted text reads coherently, whether tables look damaged in text form, and whether obvious OCR artifacts appear. | Can compare extracted text against visual page evidence when available, which is better for missing regions, layout loss, handwriting, formulas, and image-heavy pages. |
Quick Start
List available tools:
ocragent tool --list
ocragent tool --list --scope=parser
Generate user tools if you want OCRAgent to call your own OCR, VLM, shell command, or API. Describe tools in plain text:
$HOME/ocragent.toolbox_user.txt
The format can follow src/ocragent/ocragent.toolbox_user.example.txt. Include tool name, scope, cost, flags, limits, call shape, and required environment variables.
Generate the runtime:
ocragent init tools
OCRAgent writes executable Python to $HOME/.ocragent/user_toolbox.py. Review this file before running it with credentials.
Initialize and parse a document folder:
cd /path/to/documents
ocragent init docs
ocragent run --out-dir ocragent_results
CLI Example
$ ocragent tool --list --scope=parser
pdf2txt scope: parser cost: low Extract PDF text with PyMuPDF.
--path /path/to/file.pdf
pdf_pages_to_images scope: parser cost: medium Render each PDF page to a PNG image with PyMuPDF.
--path /path/to/file.pdf
--out-dir /path/to/page-images
pandoc2txt scope: parser cost: low Convert office documents to plain text with Pandoc.
--path /path/to/file
$ cd ~/cases/mixed_docs
$ ocragent init tools --from ./ocragent.toolbox_user.txt
# writes /home/me/.ocragent/user_toolbox.py
# reports valid and failed user tools
$ ocragent init docs
# writes .ocragent_memory.txt
# reports detected groups, file_count, and unmatched_count
$ ocragent run invoice.pdf scans/ --out-dir ocragent_results
# writes parsed files under ocragent_results/
# reports parsed_count, failed_count, skipped_count, and output_stats
In normal use the commands return JSON. The example above replaces that output with short comments that name the important fields.
Output
OCRAgent preserves relative paths:
docs/report.pdf -> ocragent_results/docs/report.pdf.txt
scans/page-01.jpg -> ocragent_results/scans/page-01.jpg.md
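The mapping in the two examples above can be reproduced with a few lines of `pathlib`. This is a sketch of the naming rule only: `output_path` is a hypothetical helper, and how OCRAgent chooses between `.txt` and `.md` is not modeled here, so the extension is passed in explicitly.

```python
from pathlib import PurePosixPath


def output_path(source: str, out_dir: str = "ocragent_results", ext: str = ".txt") -> PurePosixPath:
    """Map a source-relative path to its output path by appending an extension."""
    rel = PurePosixPath(source)
    if rel.is_absolute() or ".." in rel.parts:
        # Refuse paths that would escape the output directory.
        raise ValueError(f"expected a source-relative path, got {source!r}")
    return PurePosixPath(out_dir) / rel.parent / (rel.name + ext)


print(output_path("docs/report.pdf"))               # ocragent_results/docs/report.pdf.txt
print(output_path("scans/page-01.jpg", ext=".md"))  # ocragent_results/scans/page-01.jpg.md
```

Appending the extension, rather than replacing it, keeps `report.pdf` and a sibling `report.docx` from colliding in the output folder.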
It also writes a folder memory file:
.ocragent_memory.txt
The memory file is prose. It records file groups, difficulty estimates, tool choices, and run summaries. Later parser runs use it as context.
Architecture
CLI (ocragent init / run / tool)
|
AI Agents (init_tools / parser / reviewer)
|
Tool chain (builtin tools + user_toolbox.py)
| Plane | Responsibility | Examples |
|---|---|---|
| CLI and commands | Stable command behavior | config, paths, logging, stdout, stderr |
| Tool registry | Parser capability boundary | PDF text, image thumbnails, Pandoc, user OCR, VLM APIs |
| Agent loops | Runtime decisions | file grouping, tool selection, review, retry |
The parser agent does not call vendor APIs directly. It reads the available parser tools, chooses one, runs it through the tool boundary, and sends extracted text to the reviewer. If review fails, the parser can retry with another tool or a higher-cost route.
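One way to picture the tool boundary is a registry that the agent can list and call by name, without ever touching a backend directly. The sketch below is an assumption-laden illustration: `ToolSpec`, `register`, `list_tools`, and `call_tool` are hypothetical names, the registered tools are stubs, and the `"other"` scope is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical registry entry: what the agent sees, plus the callable behind it."""
    name: str
    scope: str
    cost: str
    description: str
    func: Callable[..., str]


REGISTRY: dict[str, ToolSpec] = {}


def register(spec: ToolSpec) -> None:
    """Add a builtin or user-defined tool to the single registry."""
    REGISTRY[spec.name] = spec


def list_tools(scope: Optional[str] = None) -> list[str]:
    """Return tool names, optionally filtered by scope."""
    return sorted(n for n, s in REGISTRY.items() if scope is None or s.scope == scope)


def call_tool(name: str, **kwargs: str) -> str:
    """The only entry point the agent uses; unknown names fail loudly."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return REGISTRY[name].func(**kwargs)


# Stub tools standing in for real backends.
register(ToolSpec("stub_pdf_text", "parser", "low", "Stub PDF text extractor.",
                  lambda path: f"text from {path}"))
register(ToolSpec("stub_helper", "other", "low", "Stub non-parser tool.",
                  lambda path: f"helper saw {path}"))

print(list_tools(scope="parser"))
print(call_tool("stub_pdf_text", path="a.pdf"))
```

Because every call goes through `call_tool`, swapping a backend changes only the registry entry. The agent-side logic stays the same.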
Configuration
Configuration priority:
- Environment variables.
- ./ocragent.settings.toml.
- ~/.ocragent/ocragent.settings.toml.
- Package defaults.
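The priority list above can be read as a layered lookup. The sketch below assumes that earlier entries win and that empty strings count as unset. Both are assumptions, `merge_layers` is a hypothetical helper, and flat keys are used for brevity where the real settings are nested TOML tables.

```python
from typing import Any, Mapping


def merge_layers(*layers: Mapping[str, Any]) -> dict[str, Any]:
    """Merge settings layers given in priority order (first layer wins).

    Assumption: None and "" mean "not set", so they never override a value
    from a lower-priority layer.
    """
    merged: dict[str, Any] = {}
    for layer in reversed(layers):  # apply lowest priority first
        merged.update({k: v for k, v in layer.items() if v not in (None, "")})
    return merged


env = {"model": "env-model"}
project_file = {"model": "project-model", "base": "http://localhost:8080/v1"}
user_file = {"authkey": "user-key"}
defaults = {"model": "", "base": "http://localhost:8080/v1", "authkey": "", "dir": "ocragent_results"}

merged = merge_layers(env, project_file, user_file, defaults)
print(merged)
```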
Common settings:
[aigc.api.chatcomp]
base = "http://localhost:8080/v1"
authkey = ""
model = ""
model_hasVision = true
[output]
dir = "ocragent_results"
ext = "auto"
parser_summary_batch = 5
[reviewer]
max_length = 1000
The complete default file is src/ocragent/ocragent.settings.default.toml.
Contributing
OCRAgent is beta. Breaking changes are still possible.
Useful contributions:
- Add or improve builtin parser tools.
- Add demo assets for real document cases.
- Improve reviewer prompts and failure cases.
- Strengthen tests around CLI behavior, tool discovery, and generated user tools.
- Write adapters for common OCR, VLM, and document conversion backends.
- Improve documentation for tested workflows.
Run tests:
uv run python -m unittest discover -s tests
uv run --extra pdf python -m unittest discover -s tests
Important paths:
- src/ocragent/cli.py: command boundary.
- src/ocragent/cmd/: command implementations.
- src/ocragent/cmd/tool.py: builtin and user tool contract.
- src/ocragent/agent/: model-facing loops.
- src/ocragent/config.py: layered settings.
- tests/: test suite and CLI flow checks.