Agentic digitization of images, PDFs, and office documents. A router for your OCR engines, VLMs, and text extractors.

OCRAgent

License: MIT

OCR-first, agent-guided.

OCRAgent is a command-line document parsing workflow. It uses an agent to select OCR, VLM, PDF, office-document, or user-defined tools, then reviews the extracted text before writing output.

The goal is practical routing: use inexpensive extraction when it is enough, and spend model/API cost only on files that need it.

Grade, Route, Parse, Review

OCRAgent works best for mixed folders: PDFs with text layers, scanned PDFs, images, office files, handwritten pages, tables, forms, and other files that should not all use the same parser.

| Step | What OCRAgent does | Main artifact |
| --- | --- | --- |
| Grade | Inspects file names, metadata, preview signals, and sample pages to estimate parsing difficulty. | .ocragent_memory.txt |
| Route | Chooses a parser from builtin tools and user-defined tools according to cost, scope, and prior folder notes. | tool call |
| Parse | Runs the selected tool and writes UTF-8 text while preserving source-relative paths. | ocragent_results/ |
| Review | Checks whether extracted text is usable; retries with another tool or route when review fails. | accepted output or retry |

The four steps keep the system understandable:

  • Grade before spending model/API cost.
  • Route through one tool registry.
  • Parse through deterministic command boundaries.
  • Review before writing final output.
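
The four steps can be sketched as a small loop. This is an illustrative toy, not OCRAgent's implementation: the `Tool` class, the `grade`, `route`, `review`, and `process` functions, and the `high` cost level are all hypothetical names chosen for the example, and the real grader and reviewer are model-driven rather than rule-based.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical cost ranks, used only to order tools in this sketch.
COST_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Tool:
    name: str
    cost: str                      # "low", "medium", or "high"
    parse: Callable[[str], str]    # path -> extracted text


def grade(path: str) -> str:
    """Toy grader: guess difficulty from the file name alone."""
    return "hard" if "scan" in path.lower() else "easy"


def route(difficulty: str, tools: list[Tool]) -> list[Tool]:
    """Order tools cheapest first; skip low-cost tools for hard files."""
    ordered = sorted(tools, key=lambda t: COST_RANK[t.cost])
    if difficulty == "hard":
        ordered = [t for t in ordered if t.cost != "low"] or ordered
    return ordered


def review(text: str) -> bool:
    """Toy reviewer: accept any non-blank extraction."""
    return bool(text.strip())


def process(path: str, tools: list[Tool]) -> Optional[tuple[str, str]]:
    """Return (tool name, text) for the first accepted parse, else None."""
    for tool in route(grade(path), tools):
        text = tool.parse(path)
        if review(text):
            return tool.name, text
    return None


tools = [
    Tool("text_layer", "low", lambda p: ""),               # pretend no text layer exists
    Tool("ocr", "medium", lambda p: f"OCR text from {p}"),
]
print(process("report.pdf", tools))  # ('ocr', 'OCR text from report.pdf')
```

The cheap tool runs first and fails review, so the loop falls through to the more expensive one. That is the cost-saving behaviour the four steps describe.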

Runtime Flow

documents
  -> init docs
  -> folder memory
  -> parser agent
  -> parser tool
  -> reviewer agent
  -> output text

Install

Install with common document backends:

python -m pip install "ocragent[full]"
ocragent --help

Or install with uv:

uv tool install "ocragent[full]"
ocragent --help

Configure a chat-completions API through environment variables:

export OCRAGENT_CHAT_BASE=http://localhost:8080/v1
export OCRAGENT_CHAT_MODEL=your-model
export OCRAGENT_CHAT_AUTHKEY=your-key

OPENAI_API_KEY is also accepted as the auth key. A vision-capable model is strongly recommended, because OCRAgent uses model judgment during grading and review. The same values can be configured in ~/.ocragent/ocragent.settings.toml, ./ocragent.settings.toml, or .env. Use src/ocragent/ocragent.settings.default.toml as the reference.
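
Before a long run it can help to check that these variables are set. The sketch below is a standalone helper, not part of OCRAgent: `chat_settings` and `missing` are hypothetical names, and the rule that `OCRAGENT_CHAT_AUTHKEY` wins over `OPENAI_API_KEY` when both are set is an assumption, not documented behaviour.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional


def chat_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Collect chat-completions settings from environment-style variables."""
    env = os.environ if env is None else env
    return {
        "base": env.get("OCRAGENT_CHAT_BASE", ""),
        "model": env.get("OCRAGENT_CHAT_MODEL", ""),
        # Assumption: the OCRAgent-specific key takes precedence when both are set.
        "authkey": env.get("OCRAGENT_CHAT_AUTHKEY") or env.get("OPENAI_API_KEY", ""),
    }


def missing(settings: Mapping[str, str]) -> list[str]:
    """Names of settings that are still empty."""
    return [name for name, value in settings.items() if not value]


print(missing(chat_settings()))  # lists whatever is unset in the current shell
```

A local server may not need an auth key, so treat an empty `authkey` as a hint rather than an error.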

Text-only LLM vs multimodal VLM
| Stage | Text-only LLM | Multimodal VLM |
| --- | --- | --- |
| Grade | Uses file names, metadata, text-layer probes, and OCR samples. It can estimate readability from extracted text, but cannot inspect page images directly. | Uses thumbnails or rendered pages to judge scan quality, handwriting, diagrams, tables, layout density, and whether OCR is likely to fail. |
| Review | Checks whether extracted text reads coherently, whether tables look damaged in text form, and whether obvious OCR artifacts appear. | Can compare extracted text against visual page evidence when available, which is better for missing regions, layout loss, handwriting, formulas, and image-heavy pages. |
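
To make "obvious OCR artifacts" concrete, here is a cheap text-only check. It only illustrates the idea: OCRAgent's review is done by a model, and the `artifact_ratio` and `looks_usable` functions and the 5% threshold are hypothetical.

```python
def artifact_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # empty output counts as fully unusable
    # U+FFFD is the Unicode replacement character; non-printables cover control bytes.
    bad = sum(1 for c in chars if c == "\ufffd" or not c.isprintable())
    return bad / len(chars)


def looks_usable(text: str, max_ratio: float = 0.05) -> bool:
    """Accept text whose debris ratio stays at or below the threshold."""
    return artifact_ratio(text) <= max_ratio


print(looks_usable("Invoice total: 42.00"))  # True
print(looks_usable("ab\ufffd\ufffd"))        # False
```

A check like this catches encoding damage but not missing regions or layout loss, which is why the table favours a vision-capable reviewer.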

Quick Start

List available tools:

ocragent tool --list
ocragent tool --list --scope=parser

To let OCRAgent call your own OCR, VLM, shell command, or API, generate user tools. First describe the tools in a plain-text file:

$HOME/ocragent.toolbox_user.txt

The format can follow src/ocragent/ocragent.toolbox_user.example.txt. Include tool name, scope, cost, flags, limits, call shape, and required environment variables.

Generate the runtime:

ocragent init tools

OCRAgent writes executable Python to $HOME/.ocragent/user_toolbox.py. Review this file before running it with credentials.
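
One way to start that review is to list which modules the generated file imports. The helper below is a minimal sketch using only the standard library. `sensitive_imports` and the `SENSITIVE` set are hypothetical, and the check supplements reading the file rather than replacing it.

```python
from __future__ import annotations

import ast
from pathlib import Path

# Modules that deserve a closer look in generated code (an illustrative list).
SENSITIVE = {"subprocess", "os", "socket", "requests", "urllib", "shutil"}


def sensitive_imports(source: str) -> list[str]:
    """Return the sorted top-level module names from SENSITIVE that the source imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        found.update(name.split(".")[0] for name in names)
    return sorted(found & SENSITIVE)


toolbox = Path.home() / ".ocragent" / "user_toolbox.py"
if toolbox.exists():
    print(sensitive_imports(toolbox.read_text(encoding="utf-8")))
```

The code is parsed, never executed, so it is safe to run before any credentials are exported.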

Initialize and parse a document folder:

cd /path/to/documents
ocragent init docs
ocragent run --out-dir ocragent_results

CLI Example

$ ocragent tool --list --scope=parser
pdf2txt	scope: parser cost: low	Extract PDF text with PyMuPDF.
	--path /path/to/file.pdf

pdf_pages_to_images	scope: parser cost: medium	Render each PDF page to a PNG image with PyMuPDF.
	--path /path/to/file.pdf
	--out-dir /path/to/page-images

pandoc2txt	scope: parser cost: low	Convert office documents to plain text with Pandoc.
	--path /path/to/file

$ cd ~/cases/mixed_docs
$ ocragent init tools --from ./ocragent.toolbox_user.txt
# writes /home/me/.ocragent/user_toolbox.py
# reports valid and failed user tools

$ ocragent init docs
# writes .ocragent_memory.txt
# reports detected groups, file_count, and unmatched_count

$ ocragent run invoice.pdf scans/ --out-dir ocragent_results
# writes parsed files under ocragent_results/
# reports parsed_count, failed_count, skipped_count, and output_stats

In normal use each command prints JSON. The example above replaces that output with comments that name the important fields.
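
A script that wraps the CLI could read the counts from that JSON. The sketch below assumes the report is a single top-level JSON object containing the field names listed above. The exact shape is not documented here, `summarize_run` is a hypothetical helper, and the sample report is made up to exercise it.

```python
import json

COUNT_FIELDS = ("parsed_count", "failed_count", "skipped_count")


def summarize_run(stdout: str) -> dict:
    """Pull the count fields out of a JSON report, defaulting to 0 when absent."""
    report = json.loads(stdout)
    return {field: int(report.get(field, 0)) for field in COUNT_FIELDS}


# Made-up report, used only to exercise the function.
sample = '{"parsed_count": 3, "failed_count": 1, "skipped_count": 0, "output_stats": {}}'
summary = summarize_run(sample)
print(summary)  # {'parsed_count': 3, 'failed_count': 1, 'skipped_count': 0}
```

In practice the input string would come from the captured standard output of `ocragent run`, for example via `subprocess.run(..., capture_output=True, text=True)`.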

Output

OCRAgent preserves relative paths:

docs/report.pdf -> ocragent_results/docs/report.pdf.txt
scans/page-01.jpg -> ocragent_results/scans/page-01.jpg.md
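
The mapping above appends an extension to the full source file name and re-roots the path under the output directory. A minimal sketch of that rule follows. `output_path` is a hypothetical helper, and how OCRAgent chooses between `.txt` and `.md` is not covered here, so the extension is a plain parameter.

```python
from pathlib import PurePosixPath


def output_path(source: str, out_dir: str = "ocragent_results", ext: str = ".txt") -> str:
    """Append ext to the full source file name and re-root it under out_dir."""
    rel = PurePosixPath(source)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError("expected a path relative to the document folder")
    return str(PurePosixPath(out_dir) / rel.parent / (rel.name + ext))


print(output_path("docs/report.pdf"))             # ocragent_results/docs/report.pdf.txt
print(output_path("scans/page-01.jpg", ext=".md"))  # ocragent_results/scans/page-01.jpg.md
```

Keeping the original extension in the output name means `report.pdf` and `report.docx` in the same folder do not overwrite each other.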

It also writes a folder memory file:

.ocragent_memory.txt

The memory file is prose. It records file groups, difficulty estimates, tool choices, and run summaries. Later parser runs use it as context.

Architecture

CLI  (ocragent init / run / tool)
 |
AI Agents  (init_tools / parser / reviewer)
 |
Tool chain  (builtin tools + user_toolbox.py)

| Plane | Responsibility | Examples |
| --- | --- | --- |
| CLI and commands | Stable command behavior | config, paths, logging, stdout, stderr |
| Tool registry | Parser capability boundary | PDF text, image thumbnails, Pandoc, user OCR, VLM APIs |
| Agent loops | Runtime decisions | file grouping, tool selection, review, retry |

The parser agent does not call vendor APIs directly. It reads the available parser tools, chooses one, runs it through the tool boundary, and sends extracted text to the reviewer. If review fails, the parser can retry with another tool or a higher-cost route.
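
The tool boundary can be pictured as a validator that sits between the agent's request and the command that actually runs. The sketch below reuses the tool names and flags from the CLI listing, but `ALLOWED_FLAGS` and `build_args` are hypothetical and do not reflect how OCRAgent invokes tools internally.

```python
from __future__ import annotations

# Tool name -> parameters it accepts (taken from the sample `tool --list` output).
ALLOWED_FLAGS = {
    "pdf2txt": {"path"},
    "pdf_pages_to_images": {"path", "out_dir"},
    "pandoc2txt": {"path"},
}


def build_args(tool: str, **params: str) -> list[str]:
    """Validate an agent's request and turn it into command-line flags."""
    if tool not in ALLOWED_FLAGS:
        raise KeyError(f"unknown tool: {tool}")
    unknown = set(params) - ALLOWED_FLAGS[tool]
    if unknown:
        raise ValueError(f"unsupported parameters for {tool}: {sorted(unknown)}")
    args: list[str] = []
    for key in sorted(params):  # sorted for a deterministic flag order
        args += [f"--{key.replace('_', '-')}", params[key]]
    return args


print(build_args("pdf_pages_to_images", path="/tmp/a.pdf", out_dir="/tmp/pages"))
# ['--out-dir', '/tmp/pages', '--path', '/tmp/a.pdf']
```

Because the agent can only name a registered tool and its declared parameters, a bad model decision fails at validation instead of reaching a shell or a vendor API.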

Configuration

Configuration priority:

  1. Environment variables.
  2. ./ocragent.settings.toml.
  3. ~/.ocragent/ocragent.settings.toml.
  4. Package defaults.
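
Layered lookup of this kind can be modelled with `collections.ChainMap`. The sketch assumes the list above is ordered from highest to lowest priority. The flattened key names and the override values are made up for illustration, and how environment variables map onto setting names is not shown.

```python
from collections import ChainMap

# Made-up layers; each dict stands in for one configuration source.
defaults = {"output.dir": "ocragent_results", "output.ext": "auto", "reviewer.max_length": 1000}
user_file = {"output.dir": "~/parsed"}      # ~/.ocragent/ocragent.settings.toml
project_file = {"output.ext": ".md"}        # ./ocragent.settings.toml
environment = {"output.dir": "/tmp/out"}    # environment variables

# ChainMap searches its maps left to right, so the highest priority goes first.
settings = ChainMap(environment, project_file, user_file, defaults)

print(settings["output.dir"])            # /tmp/out
print(settings["output.ext"])            # .md
print(settings["reviewer.max_length"])   # 1000
```

Each key resolves to the first layer that defines it, so a project file can override one value without restating the rest.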

Common settings:

[aigc.api.chatcomp]
base = "http://localhost:8080/v1"
authkey = ""
model = ""
model_hasVision = true

[output]
dir = "ocragent_results"
ext = "auto"
parser_summary_batch = 5

[reviewer]
max_length = 1000

The complete default file is src/ocragent/ocragent.settings.default.toml.

Contributing

OCRAgent is beta. Breaking changes are still possible.

Useful contributions:

  • Add or improve builtin parser tools.
  • Add demo assets for real document cases.
  • Improve reviewer prompts and failure cases.
  • Strengthen tests around CLI behavior, tool discovery, and generated user tools.
  • Write adapters for common OCR, VLM, and document conversion backends.
  • Improve documentation for tested workflows.

Run tests:

uv run python -m unittest discover -s tests
uv run --extra pdf python -m unittest discover -s tests

Important paths:

  • src/ocragent/cli.py: command boundary.
  • src/ocragent/cmd/: command implementations.
  • src/ocragent/cmd/tool.py: builtin and user tool contract.
  • src/ocragent/agent/: model-facing loops.
  • src/ocragent/config.py: layered settings.
  • tests/: test suite and CLI flow checks.
