Decoupled, LLM-agnostic document OCR + structured extraction. Vision and LLM parsing in 3 lines of code.
ocrcontext
Decoupled, LLM-agnostic document OCR + structured extraction. Turn a PDF or image into clean text — or a typed Pydantic model — in three lines.
ocrcontext is the extraction core of a document-analysis platform, lifted out
of its web stack into a pure, pip-installable library. No FastAPI, no servers,
no hardcoded model providers.
from ocrcontext import Analyzer
result = Analyzer().analyze("invoice.pdf")
print(result.text)
Why
- 3-line DX: instantiate, pass a file, get a result.
- LLM-agnostic: inject any LangChain chat model (OpenAI, Anthropic, Ollama, local). Only langchain-core is required; you bring the provider.
- Resource-efficient: heavy OCR models (PaddleOCR, TrOCR) load lazily and are cached as process-wide singletons, so they never reload per call.
- Lightweight base install: engines are opt-in extras.
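The lazy, process-wide caching described above can be pictured with a small sketch. This is not ocrcontext's actual code: `FakeEngine`, `get_engine`, and `LOAD_COUNT` are hypothetical names, and the stand-in class only counts how often it is constructed instead of loading real model weights.

```python
from functools import lru_cache

# Counts constructions so the caching effect is observable (illustration only).
LOAD_COUNT = {"engine": 0}


class FakeEngine:
    """Hypothetical stand-in for a heavy OCR model such as PaddleOCR or TrOCR."""

    def __init__(self, lang):
        self.lang = lang
        LOAD_COUNT["engine"] += 1  # a real engine would load weights here

    def read(self, path):
        return f"[{self.lang}] text from {path}"


@lru_cache(maxsize=None)
def get_engine(lang):
    # Nothing is built at import time. The first call for a given language
    # constructs the engine; later calls return the same cached instance.
    return FakeEngine(lang)


first = get_engine("tr")
second = get_engine("tr")
print(first is second, LOAD_COUNT["engine"])
```

One design caveat of this sketch: `lru_cache` does not stop two threads from both running the loader if they make the first call at the same time, so a real implementation would likely guard construction with a lock.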
Install
pip install ocrcontext # core only (PDF text layer + the API surface)
pip install 'ocrcontext[paddle]' # printed text + scanned PDFs (PaddleOCR)
pip install 'ocrcontext[trocr]' # handwriting fallback (Microsoft TrOCR)
pip install 'ocrcontext[vision]' # handwriting primary (Google Cloud Vision)
pip install 'ocrcontext[all]' # everything
Pick an LLM provider for refinement / extraction:
pip install langchain-openai # or langchain-anthropic, langchain-ollama, ...
Usage
Raw OCR (no LLM, no API key)
from ocrcontext import Analyzer
result = Analyzer().analyze("scan.png")
print(result.text, result.confidence, result.pages, result.text_source)
LLM-refined OCR
Refinement fixes OCR errors without paraphrasing, translating, or inventing text. Emails/URLs/IBANs are frozen so the model can't "correct" them, and output that drifts too far from the source is rejected in favour of the raw text.
from langchain_openai import ChatOpenAI
from ocrcontext import Analyzer
analyzer = Analyzer(llm=ChatOpenAI(model="gpt-4o"), lang="tr")
result = analyzer.analyze("handwritten_note.jpg", handwriting=True)
print(result.text) # refined
print(result.raw_text) # original OCR, kept alongside
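To make the "frozen literals" and "drift guard" ideas concrete, here is a minimal sketch using only the standard library. It is an illustration under assumptions, not the library's implementation: the regexes, the `<<LITn>>` placeholder format, the `min_similarity` threshold of 0.8, and the function names `freeze`, `thaw`, and `refine` are all invented for this example, and `llm_fix` stands in for whatever chat model call does the correction.

```python
import re
from difflib import SequenceMatcher

# Rough patterns for literals that must survive untouched (assumed, not exhaustive).
LITERAL_RE = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"        # email address
    r"|https?://\S+"                       # URL
    r"|\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"   # IBAN in compact form
)


def freeze(text):
    """Swap each literal for a numbered placeholder the model cannot 'correct'."""
    literals = []

    def stash(match):
        literals.append(match.group(0))
        return f"<<LIT{len(literals) - 1}>>"

    return LITERAL_RE.sub(stash, text), literals


def thaw(text, literals):
    """Put the original literals back in place of their placeholders."""
    return re.sub(r"<<LIT(\d+)>>", lambda m: literals[int(m.group(1))], text)


def refine(raw, llm_fix, min_similarity=0.8):
    masked, literals = freeze(raw)
    candidate = llm_fix(masked)
    # If the model dropped a placeholder, the output cannot be trusted.
    if any(f"<<LIT{i}>>" not in candidate for i in range(len(literals))):
        return raw
    refined = thaw(candidate, literals)
    # Drift guard: fall back to the raw OCR text when the output strays too far.
    if SequenceMatcher(None, raw, refined).ratio() < min_similarity:
        return raw
    return refined


raw = "Contcat bilgi@ornek.com for detials"
fixed = refine(raw, lambda t: t.replace("Contcat", "Contact").replace("detials", "details"))
rejected = refine(raw, lambda t: "<<LIT0>> is the address you should write to")
print(fixed)
print(rejected)
```

The first call keeps the email byte-for-byte while the typos around it are repaired; the second call paraphrases, falls below the similarity threshold, and so returns the raw text unchanged.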
Structured extraction
from langchain_openai import ChatOpenAI
from ocrcontext import Analyzer
from ocrcontext.schemas import Invoice
analyzer = Analyzer(llm=ChatOpenAI(model="gpt-4o-mini", temperature=0))
invoice = analyzer.extract("invoice.pdf", schema=Invoice) # -> Invoice instance
print(invoice.total_amount, invoice.currency)
Define your own schema with plain Pydantic:
from pydantic import BaseModel, Field

class Receipt(BaseModel):
    merchant: str | None = Field(None, description="Store name")
    total: float | None = Field(None, description="Grand total")

receipt = analyzer.extract("receipt.jpg", schema=Receipt)
Same code, local model (no API key)
from langchain_ollama import ChatOllama
from ocrcontext import Analyzer
analyzer = Analyzer(llm=ChatOllama(model="llama3.1"))
print(analyzer.analyze("scan.png").text)
How it routes a document
- Digital PDF → embedded text-layer extraction (exact text; LLM refine is skipped so identifiers aren't altered).
- Image / scanned PDF → PaddleOCR with preprocessing (deskew, denoise, CLAHE), multi-language coverage-first selection, and a line-band recovery fallback.
- Handwriting (handwriting=True, or auto when printed OCR yields too little text) → Google Vision primary, TrOCR fallback.
- Optional LLM refine → fidelity-first, literal-preserved, drift-guarded.
- Optional extract(schema=...) → typed Pydantic model.
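The routing rules above can be read as a small decision function. The sketch below is one possible reading, not the library's code: `Route`, `choose_route`, the engine labels, and the `MIN_PRINTED_CHARS` threshold are hypothetical, and it assumes an explicit `handwriting=True` takes precedence over a PDF text layer, which the description does not state.

```python
from pathlib import Path
from typing import NamedTuple, Optional

# Hypothetical cut-off for "printed OCR yields too little text".
MIN_PRINTED_CHARS = 20


class Route(NamedTuple):
    engine: str           # label for the extraction path taken
    refine_allowed: bool  # whether the optional LLM refine step may run


def choose_route(path, *, has_text_layer=False, handwriting=False,
                 printed_chars: Optional[int] = None):
    is_pdf = Path(path).suffix.lower() == ".pdf"
    # Digital PDF: use the embedded text layer and skip refinement so
    # identifiers are not altered.
    if is_pdf and has_text_layer and not handwriting:
        return Route("pdf_text_layer", refine_allowed=False)
    # Handwriting: requested explicitly, or inferred when a printed-text
    # pass has already run and found almost nothing.
    too_little = printed_chars is not None and printed_chars < MIN_PRINTED_CHARS
    if handwriting or too_little:
        return Route("vision_then_trocr", refine_allowed=True)
    # Everything else: images and scanned PDFs go to printed-text OCR.
    return Route("paddleocr", refine_allowed=True)


print(choose_route("invoice.pdf", has_text_layer=True))
print(choose_route("note.jpg", handwriting=True))
```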
Refinement modes
RefinementMode: conservative (scans), layout (digital PDFs),
handwriting_prose, handwriting_layout. The handwriting mode is auto-selected
based on whether the text looks like a DIKW/pyramid diagram. Modes and prompts
are ported verbatim from the production pipeline.
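As a rough picture of how the four modes relate, the following sketch mirrors the mode names in a stand-in enum and picks one from the text source. The enum `Mode`, the function `pick_mode`, and the `text_source` labels are assumptions made for illustration; the diagram check is passed in as a boolean because the real heuristic is not described here.

```python
from enum import Enum


class Mode(str, Enum):
    """Stand-in enum mirroring the four mode names listed above."""
    CONSERVATIVE = "conservative"
    LAYOUT = "layout"
    HANDWRITING_PROSE = "handwriting_prose"
    HANDWRITING_LAYOUT = "handwriting_layout"


def pick_mode(text_source, looks_like_diagram=False):
    # text_source values are hypothetical labels, not the library's own.
    if text_source == "handwriting":
        # Diagram-like handwriting keeps its layout; running text is treated as prose.
        return Mode.HANDWRITING_LAYOUT if looks_like_diagram else Mode.HANDWRITING_PROSE
    if text_source == "pdf_text_layer":
        return Mode.LAYOUT
    # Scans and anything unrecognised get the most cautious mode.
    return Mode.CONSERVATIVE


print(pick_mode("handwriting", looks_like_diagram=True).value)
```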
Configuration
from ocrcontext import Analyzer, AnalyzerConfig
cfg = AnalyzerConfig(
    lang="tr",
    prefer_pdf_text_layer=True,
    auto_handwriting_fallback=True,
)
analyzer = Analyzer(llm=..., config=cfg)
Development
pip install -e '.[dev]'
pytest # runs without GPU/network — engines and LLM are faked
License
MIT
Download files
File details
Details for the file ocrcontext-0.1.0.tar.gz.
File metadata
- Download URL: ocrcontext-0.1.0.tar.gz
- Upload date:
- Size: 45.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 53335776ed59c5c9c86f327a90844b73a4224343567d84abb5ccee998c651159 |
| MD5 | 69ec72238e31f77369c8253dd2401f3c |
| BLAKE2b-256 | bfc370f86cb0dbdc0003e24043de5ce6d2ca4ddf9c0e1bcd47035640dbe90613 |
File details
Details for the file ocrcontext-0.1.0-py3-none-any.whl.
File metadata
- Download URL: ocrcontext-0.1.0-py3-none-any.whl
- Upload date:
- Size: 48.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc2acf5297fa1fc2f5f21af3e26f88db381308f0f7a1d0611a7e41a575c08aff |
| MD5 | 3b8ff9840cca2357727d95b011f02e39 |
| BLAKE2b-256 | a204c861759092b1590b07151620f1f0e61c9dd8e6289e622013b6a2bc380998 |