
Decoupled, LLM-agnostic document OCR + structured extraction. Vision and LLM parsing in 3 lines of code.


OCR Context

Turn any PDF or image into clean text — or a typed Pydantic model — in three lines.

Decoupled, LLM-agnostic document OCR + structured extraction. No web server, no vendor lock-in.


from ocrcontext import Analyzer

result = Analyzer().analyze("invoice.pdf")
print(result.text)

ocrcontext is the extraction core of a production document-analysis platform, lifted out of its FastAPI/Next.js stack into a pure, pip-installable library. It handles OCR engine routing, fidelity-first LLM cleanup, and schema-based structured extraction — and gets out of your way.

Install

Engines are opt-in so your base install stays small:

Command                             What you get
pip install ocrcontext              Digital PDFs only (PyMuPDF text layer; no OCR, no GPU, no API key)
pip install 'ocrcontext[paddle]'    + printed images and scanned PDFs (PaddleOCR, CPU/GPU)
pip install 'ocrcontext[trocr]'     + handwriting fallback (Microsoft TrOCR)
pip install 'ocrcontext[vision]'    + handwriting primary (Google Cloud Vision)
pip install 'ocrcontext[cli]'       + terminal CLI (ocrcontext extract)
pip install 'ocrcontext[all]'       everything above

Add an LLM provider for refinement and structured extraction:

pip install langchain-openai        # or langchain-anthropic, langchain-ollama, ...

Images and scanned PDFs require the [paddle] extra. With a bare pip install ocrcontext, passing an image file raises an EngineError with a clear install hint.
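Because engines are opt-in, it can help to check which ones are importable before processing a batch. The following is a minimal sketch using only the standard library; the import names are assumptions about what each extra pulls in (`fitz` for PyMuPDF, `paddleocr`, `transformers`, `google.cloud.vision`) and are not taken from the ocrcontext source.

```python
from importlib.util import find_spec

# Assumed import names for the packages behind each extra.
ENGINE_MODULES = {
    "pdf_text_layer": "fitz",         # PyMuPDF
    "paddle": "paddleocr",            # PaddleOCR
    "trocr": "transformers",          # Hugging Face Transformers (TrOCR)
    "vision": "google.cloud.vision",  # Google Cloud Vision client
}


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located on this interpreter."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (for example "google") is missing.
        return False


def available_engines() -> dict:
    """Map each engine label to whether its module was found."""
    return {name: is_installed(mod) for name, mod in ENGINE_MODULES.items()}


if __name__ == "__main__":
    for name, ok in available_engines().items():
        print(f"{name:16} {'installed' if ok else 'missing'}")
```

A check like this only reports what is importable; the library's own EngineError remains the authoritative signal.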

Google Cloud Vision ([vision])

  1. Enable the Cloud Vision API in Google Cloud Console
  2. Create a service account key (JSON) under IAM & Admin → Service Accounts → Keys
  3. Export the path:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"   # Linux/macOS
$env:GOOGLE_APPLICATION_CREDENTIALS = "C:\path\to\key.json" # PowerShell
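A quick pre-flight check can catch a missing or wrong key file before the first Vision call. This is a sketch, not part of ocrcontext: `check_vision_credentials` is a hypothetical helper, and it relies only on the fact that Google service account key files are JSON with `"type": "service_account"`.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def check_vision_credentials(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the key path, or raise RuntimeError with a readable message."""
    env = os.environ if env is None else env
    raw = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not raw:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")

    path = Path(raw).expanduser()
    if not path.is_file():
        raise RuntimeError(f"Key file not found: {path}")

    try:
        key = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"Key file is not valid JSON: {path}") from exc

    # Service account keys carry a "type" field with this exact value.
    if not isinstance(key, dict) or key.get("type") != "service_account":
        raise RuntimeError(f"Not a service account key: {path}")
    return path
```

Calling `check_vision_credentials()` with no arguments reads the real environment; passing a dict makes the check easy to test.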

Quick start

Digital PDF

from ocrcontext import Analyzer

result = Analyzer().analyze("document.pdf")
print(result.text)          # extracted text
print(result.pages)         # page count
print(result.text_source)   # "pdf_text_layer"

Image / scanned PDF

pip install 'ocrcontext[paddle]'

from ocrcontext import Analyzer

result = Analyzer().analyze("scan.png")
print(result.text, result.confidence)

LLM-refined OCR

Refinement fixes character-level OCR errors without paraphrasing, translating, or inventing. Emails, URLs, and IBANs are masked before the model sees them and restored verbatim after. Output that drifts too far from the source is rejected in favour of the raw OCR text.

pip install 'ocrcontext[paddle]' langchain-openai
export OPENAI_API_KEY="sk-..."

from langchain_openai import ChatOpenAI
from ocrcontext import Analyzer

analyzer = Analyzer(llm=ChatOpenAI(model="gpt-4o-mini"), lang="en")
result = analyzer.analyze("scan.jpg")

print(result.text)       # refined
print(result.raw_text)   # original OCR output
print(result.refined)    # True
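The masking and drift check described in this section can be pictured roughly as follows. This is an illustrative sketch, not the library's implementation: the regular expressions are simplified, the `[[LITn]]` placeholder format is made up, and the 0.8 similarity threshold is an assumption.

```python
import re
from difflib import SequenceMatcher

# Simplified patterns for illustration only.
LITERAL = re.compile(
    r"https?://\S+"                      # URLs
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"     # email addresses
    r"|\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"  # IBAN-like
)
PLACEHOLDER = re.compile(r"\[\[LIT(\d+)\]\]")


def mask(text: str) -> tuple:
    """Replace each literal with a numbered placeholder; return (masked, literals)."""
    found = []

    def _swap(match):
        found.append(match.group(0))
        return f"[[LIT{len(found) - 1}]]"

    return LITERAL.sub(_swap, text), found


def unmask(text: str, found: list) -> str:
    """Restore the original literals; unknown placeholders are left untouched."""
    def _restore(match):
        index = int(match.group(1))
        return found[index] if index < len(found) else match.group(0)

    return PLACEHOLDER.sub(_restore, text)


def accept_refinement(raw: str, refined: str, min_similarity: float = 0.8) -> str:
    """Keep the refined text only if it stays close to the raw OCR text."""
    ratio = SequenceMatcher(None, raw, refined).ratio()
    return refined if ratio >= min_similarity else raw
```

In this sketch the model would only ever see the output of `mask()`, and a rewrite that strays too far from the OCR text is discarded by `accept_refinement()`.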

Structured extraction

Hand the analyzer a Pydantic schema and get a populated instance back.

from langchain_openai import ChatOpenAI
from ocrcontext import Analyzer
from ocrcontext.schemas import Invoice

analyzer = Analyzer(llm=ChatOpenAI(model="gpt-4o-mini", temperature=0))
invoice = analyzer.extract("invoice.pdf", schema=Invoice)

print(invoice.supplier_name, invoice.total_amount, invoice.currency)
for item in invoice.line_items:
    print(item.description, item.quantity, item.unit_price)

Define your own schema — field descriptions are the prompt:

from pydantic import BaseModel, Field

class ShippingLabel(BaseModel):
    sender: str | None = Field(None, description="Sender full name and address")
    recipient: str | None = Field(None, description="Recipient full name and address")
    tracking_number: str | None = Field(None, description="Carrier tracking number")

label = analyzer.extract("label.jpg", schema=ShippingLabel)
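To see why descriptions matter, here is one way a prompt could be assembled from them. How ocrcontext actually builds its prompt is not shown in this README, so treat this as a sketch: `prompt_from_schema` is a hypothetical helper, and the dict is a hand-written stand-in for what Pydantic v2's `ShippingLabel.model_json_schema()` would return.

```python
def prompt_from_schema(schema: dict) -> str:
    """Render one instruction line per field from a JSON Schema dict."""
    lines = ["Extract the following fields. Use null when a field is absent."]
    for name, spec in schema.get("properties", {}).items():
        description = spec.get("description", "(no description)")
        lines.append(f"- {name}: {description}")
    return "\n".join(lines)


# Hand-written stand-in for ShippingLabel.model_json_schema() (Pydantic v2).
shipping_label_schema = {
    "title": "ShippingLabel",
    "type": "object",
    "properties": {
        "sender": {"description": "Sender full name and address"},
        "recipient": {"description": "Recipient full name and address"},
        "tracking_number": {"description": "Carrier tracking number"},
    },
}

print(prompt_from_schema(shipping_label_schema))
```

A field without a description gives the model nothing but its name to go on, which is why specific descriptions tend to pay off.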

No API key? Use a local model

from langchain_ollama import ChatOllama
from ocrcontext import Analyzer

analyzer = Analyzer(llm=ChatOllama(model="llama3.1"))
result = analyzer.analyze("scan.png")
print(result.text)

CLI

Install the [cli] extra to use ocrcontext straight from the terminal — no Python script needed.

pip install 'ocrcontext[cli]'

Extract plain text:

ocrcontext extract invoice.pdf
ocrcontext extract scan.png --output json

Extract structured data with a built-in schema:

ocrcontext extract invoice.pdf   --schema invoice
ocrcontext extract receipt.jpg   --schema receipt
ocrcontext extract contract.pdf  --schema contract
ocrcontext extract passport.jpg  --schema idcard
ocrcontext extract lab_report.pdf --schema medical

Choose your LLM provider:

ocrcontext extract invoice.pdf --schema invoice \
  --provider openai --model gpt-4o-mini

ocrcontext extract invoice.pdf --schema invoice \
  --provider anthropic --model claude-haiku-4-5-20251001

ocrcontext extract invoice.pdf --schema invoice \
  --provider ollama --model llama3.1

All options:

ocrcontext extract FILE [OPTIONS]

  --schema    -s   invoice | receipt | contract | idcard | medical
  --lang      -l   Language code (default: en)
  --handwriting    Force handwriting engine
  --refine         auto (default) | yes | no
  --output    -o   text (default) | json
  --provider  -p   openai | anthropic | ollama | google
  --model     -m   Model name (default: gpt-4o-mini)

LangChain integration

OCRContextLoader is a drop-in LangChain BaseLoader. It slots into any LangChain pipeline — RAG, document Q&A, chain-of-thought — without glue code.

from ocrcontext.loaders import OCRContextLoader

# Plain OCR
loader = OCRContextLoader("contract.pdf")
docs = loader.load()  # -> [Document(page_content="...", metadata={...})]

# With LLM refinement
from langchain_openai import ChatOpenAI

loader = OCRContextLoader(
    "scan.pdf",
    llm=ChatOpenAI(model="gpt-4o-mini"),
    lang="en",
    refine="yes",
)
docs = loader.load()
print(docs[0].page_content)
print(docs[0].metadata)
# {
#   "source": "scan.pdf",
#   "text_source": "ocr",
#   "pages": 3,
#   "confidence": 0.94,
#   "refined": True,
#   "raw_text": "..."
# }
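The metadata keys shown above make it straightforward to hold back low-confidence scans before indexing. A minimal sketch follows; `Doc` is a stand-in with the same two attributes as a LangChain Document so the example runs without LangChain installed, and the 0.85 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in exposing the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_by_confidence(docs, threshold=0.85):
    """Return (trusted, needs_review) based on metadata["confidence"]."""
    trusted, needs_review = [], []
    for doc in docs:
        confidence = doc.metadata.get("confidence")
        # Assumption: a missing score means no OCR was involved.
        if confidence is None or confidence >= threshold:
            trusted.append(doc)
        else:
            needs_review.append(doc)
    return trusted, needs_review


docs = [
    Doc("clean scan", {"source": "a.pdf", "confidence": 0.94}),
    Doc("blurry scan", {"source": "b.png", "confidence": 0.61}),
]
trusted, needs_review = split_by_confidence(docs)
print([d.metadata["source"] for d in needs_review])
```

The same function works unchanged on the list returned by `loader.load()`, since it only touches `.metadata`.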

In a RAG pipeline:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from ocrcontext.loaders import OCRContextLoader

docs = OCRContextLoader("annual_report.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

Built-in schemas

Five ready-to-use Pydantic schemas with system prompts, importable from ocrcontext.schemas. Pass the class directly to analyzer.extract(), or select one by name with the CLI --schema flag.

Invoice

from ocrcontext.schemas import Invoice

invoice = analyzer.extract("invoice.pdf", schema=Invoice)
# invoice.supplier_name, .invoice_number, .invoice_date, .total_amount,
# .currency, .tax_id, .tax_rate, .line_items (list[LineItem])

Receipt

from ocrcontext.schemas import Receipt

receipt = analyzer.extract("receipt.jpg", schema=Receipt)
# receipt.store_name, .date, .time, .total_amount, .tax_amount,
# .subtotal, .payment_method, .currency, .items (list[ReceiptItem])

Contract

from ocrcontext.schemas import Contract

contract = analyzer.extract("agreement.pdf", schema=Contract)
# contract.title, .effective_date, .expiration_date, .contract_value,
# .currency, .governing_law, .key_obligations,
# .parties (list[ContractParty] with .name, .role)

IdCard

Supports national_id, passport, driver_license, residence_permit.

from ocrcontext.schemas import IdCard

card = analyzer.extract("passport.jpg", schema=IdCard)
# card.document_type, .full_name, .date_of_birth, .gender,
# .nationality, .document_number, .issue_date, .expiry_date,
# .issuing_authority, .address

MedicalReport

from ocrcontext.schemas import MedicalReport

report = analyzer.extract("lab_report.pdf", schema=MedicalReport)
# report.patient_name, .patient_dob, .report_date, .doctor_name,
# .institution, .diagnosis, .icd_codes (list[str]),
# .medications (list[Medication]), .notes

How it routes a document

              ┌─────────────┐
 document ───▶│   Analyzer  │
              └──────┬──────┘
                     ▼
      ┌──────────────────────────────────────┐
      │ 1. Digital PDF?                       │
      │    └─▶ PyMuPDF text layer             │
      │        LLM refine auto-skipped        │
      │                                       │
      │ 2. Image / scanned PDF?               │
      │    └─▶ PaddleOCR                      │
      │        (preprocess → coverage-first   │
      │         → line-band fallback)         │
      │                                       │
      │ 3. Handwriting (explicit or auto)?    │
      │    └─▶ Google Cloud Vision            │
      │        → TrOCR fallback               │
      │                                       │
      │ 4. (optional) LLM refine              │
      │    fidelity-first · literal-safe      │
      │                                       │
      │ 5. (optional) extract(schema)         │
      │    └─▶ typed Pydantic model           │
      └──────────────────────────────────────┘
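The first three steps of the diagram amount to a small decision function. The sketch below restates that order in code; every name in it (`Engine`, `pick_engine`, `needs_handwriting_retry`), the list of image suffixes, and the 20-character retry threshold are illustrative assumptions, not the library's internals. It also assumes an explicit handwriting request takes priority over the other checks.

```python
from enum import Enum
from pathlib import Path


class Engine(Enum):
    PDF_TEXT_LAYER = "pdf_text_layer"
    PADDLE = "paddle"
    HANDWRITING = "handwriting"


# Assumed set of image types; the real list may differ.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def pick_engine(path: str, has_text_layer: bool = False,
                handwriting: bool = False) -> Engine:
    """Choose an engine from the file type and the caller's hints."""
    suffix = Path(path).suffix.lower()
    if handwriting:
        return Engine.HANDWRITING
    if suffix == ".pdf" and has_text_layer:
        return Engine.PDF_TEXT_LAYER
    if suffix == ".pdf" or suffix in IMAGE_SUFFIXES:
        return Engine.PADDLE
    raise ValueError(f"Unsupported file type: {suffix or path}")


def needs_handwriting_retry(ocr_text: str, min_chars: int = 20) -> bool:
    """Flag printed-OCR output that is too short to be trusted."""
    return len(ocr_text.strip()) < min_chars
```

For example, `pick_engine("invoice.pdf", has_text_layer=True)` returns `Engine.PDF_TEXT_LAYER`, while the same file without a text layer goes to `Engine.PADDLE`.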

Multi-page documents are joined with --- Page N --- separators. Handwriting kicks in automatically when printed OCR returns too little text.
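If you need per-page text back, the separators can be split out again. This sketch assumes each separator sits on its own line in exactly the form `--- Page N ---`; the README does not say whether the first page also gets a marker, so text before the first marker is kept as the preceding page.

```python
import re

# Assumed separator layout: "--- Page N ---" alone on a line.
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---\s*$", re.MULTILINE)


def split_pages(text: str) -> dict:
    """Map page number -> page text for separator-joined output."""
    parts = PAGE_MARKER.split(text)
    # With a capture group, split() yields:
    # [preamble, "1", page_1, "2", page_2, ...]
    if len(parts) == 1:
        return {1: text.strip()}

    pages = {}
    preamble = parts[0].strip()
    if preamble:
        # Text before the first marker belongs to the page before it.
        pages[int(parts[1]) - 1] = preamble
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages
```

With this in place, `split_pages(result.text)` would give a dict keyed by page number, which is handy for page-level citations.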


Refinement modes

Mode                 When it's used
conservative         Scanned images: minimal character-level correction only
layout               Digital PDFs: reconstruct clean structure
handwriting_layout   Handwritten notes, lists, diagrams
handwriting_prose    Handwritten poems, paragraphs, letters

Modes are auto-selected based on the document type and text content. The handwriting mode choice is driven by whether the text looks like a DIKW/pyramid diagram. All prompts are ported verbatim from the production pipeline.
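As a rough picture of auto-selection, the sketch below maps a document to one of the four modes. It is not the library's logic: the `Mode` enum mirrors the table but is defined here for illustration, and `looks_like_diagram` is a placeholder heuristic (mostly short lines) standing in for whatever diagram detection the production pipeline uses.

```python
from enum import Enum


class Mode(Enum):
    CONSERVATIVE = "conservative"
    LAYOUT = "layout"
    HANDWRITING_LAYOUT = "handwriting_layout"
    HANDWRITING_PROSE = "handwriting_prose"


def looks_like_diagram(text: str, max_words: int = 4, min_share: float = 0.6) -> bool:
    """Placeholder heuristic: most non-empty lines are very short."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    short = sum(1 for line in lines if len(line.split()) <= max_words)
    return short / len(lines) >= min_share


def select_mode(text: str, text_source: str, handwritten: bool = False) -> Mode:
    """Pick a refinement mode from the document type and its text."""
    if handwritten:
        if looks_like_diagram(text):
            return Mode.HANDWRITING_LAYOUT
        return Mode.HANDWRITING_PROSE
    if text_source == "pdf_text_layer":
        return Mode.LAYOUT
    return Mode.CONSERVATIVE
```

A four-line pyramid such as "Wisdom / Knowledge / Information / Data" lands in `HANDWRITING_LAYOUT` under this heuristic, while a handwritten letter with full sentences lands in `HANDWRITING_PROSE`.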

Override manually:

from ocrcontext import Analyzer, RefinementMode

# `analyzer` is an Analyzer configured with an LLM, as in the examples above.
result = analyzer.analyze("scan.png", mode=RefinementMode.CONSERVATIVE)

Configuration

from ocrcontext import Analyzer, AnalyzerConfig

cfg = AnalyzerConfig(
    lang="tr",                        # default document language
    prefer_pdf_text_layer=True,       # skip OCR when a text layer exists
    auto_handwriting_fallback=True,   # retry with handwriting if OCR returns too little
    refine_by_default=True,           # auto-refine whenever an LLM is configured
)
analyzer = Analyzer(llm=..., config=cfg)

Development

git clone https://github.com/BahadirKarsli/OCRContext
cd OCRContext
pip install -e '.[dev]'
pytest            # runs without GPU or network — engines and LLM are faked
ruff check .

See examples/ for runnable smoke tests (image OCR, structured extraction, PDF routing).


License

MIT © Bahadır Karslı
