
Docpick

Document in, Structured JSON out. Locally. With your schema.

docpick is a lightweight, schema-driven document extraction pipeline that combines local OCR engines with local LLMs to extract structured JSON from any document — invoices, receipts, bills of lading, tax forms, and more.

  • Zero cloud dependency — runs entirely on your machine (CPU or GPU)
  • Custom schemas — define your own Pydantic models or use 8 built-in document schemas
  • Validation built-in — check-digit verification, cross-field rules, cross-document consistency
  • Apache 2.0 — no GPL/AGPL dependencies

Install

pip install docpick            # core (LLM extraction only)
pip install docpick[paddle]    # + PaddleOCR (recommended)
pip install docpick[easyocr]   # + EasyOCR (Korean-optimized)
pip install docpick[got]       # + GOT-OCR2.0 (GPU, vision-language)
pip install docpick[all]       # all OCR backends

Requirements: Python 3.11+ and an LLM endpoint (vLLM, Ollama, or any OpenAI-compatible server)

Quick Start

Python API

from docpick import DocpickPipeline
from docpick.schemas import InvoiceSchema

pipeline = DocpickPipeline()
result = pipeline.extract("invoice.pdf", schema=InvoiceSchema)

print(result.data)           # Structured dict matching schema
print(result.validation)     # Validation errors/warnings
print(result.confidence)     # Per-field confidence scores

CLI

# Extract structured data
docpick extract invoice.pdf --schema invoice --output result.json

# OCR only (no LLM)
docpick ocr document.png --lang ko,en

# Validate extracted JSON
docpick validate result.json --schema invoice

# Batch process a directory
docpick batch ./documents/ --schema invoice --output ./results/ --concurrency 4

# List available schemas
docpick schemas list

# Show schema details
docpick schemas show invoice

Built-in Schemas

| Schema | Document Type | Key Validations |
|---|---|---|
| invoice | Commercial invoices | Line item sums, tax ID check digit, date order |
| receipt | Retail/restaurant receipts | Total = subtotal + tax + tip |
| bill_of_lading | Ocean/air B/L | Container weight sums, ISO 6346, HS code format |
| purchase_order | Purchase orders | PO total = line items, delivery date order |
| kr_tax_invoice | Korean e-tax invoice (세금계산서) | Business number check digit (x2), supply/tax/total sums |
| bank_statement | Bank statements | IBAN mod-97, period date order |
| id_document | Passport/ID (ICAO 9303) | MRZ, ISO 3166 country codes, date ranges |
| certificate_of_origin | Certificate of Origin | ISO 3166 alpha-2 country codes |

Custom Schemas

Define your own schema with Pydantic:

from pydantic import BaseModel
from docpick import DocpickPipeline
from docpick.validation.rules import SumEqualsRule, RequiredFieldRule

class MyDocument(BaseModel):
    """Custom document schema."""
    company_name: str | None = None
    total_amount: float | None = None
    tax_amount: float | None = None
    net_amount: float | None = None
    items: list[dict] | None = None

    class ValidationRules:
        rules = [
            RequiredFieldRule("company_name"),
            SumEqualsRule(["net_amount", "tax_amount"], "total_amount"),
        ]

pipeline = DocpickPipeline()
result = pipeline.extract("my_document.pdf", schema=MyDocument)

Or use a JSON Schema file:

docpick extract document.pdf --schema my_schema.json

Validation

Check Digit Algorithms

| Algorithm | Use Case |
|---|---|
| kr_business_number | Korean business registration number (10 digits) |
| luhn | Credit card numbers |
| iso_6346 | Shipping container numbers |
| iban_mod97 | International bank account numbers |
| awb_mod7 | Air waybill numbers |
| mrz | Machine Readable Zone (passport/ID) |
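
Two of these algorithms are compact enough to sketch in plain Python for illustration. These are generic reference implementations of the Luhn and IBAN mod-97 checks, not docpick's internal API:

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:   # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9   # equivalent to summing the two digits of d
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """IBAN mod-97: move the first 4 chars to the end, map A-Z to 10-35,
    and check that the resulting integer is congruent to 1 mod 97."""
    s = iban.replace(" ", "").upper()
    rearranged = s[4:] + s[:4]
    # int(c, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1
```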

Cross-Field Rules

| Rule | Description |
|---|---|
| SumEqualsRule | Sum of fields equals target (with tolerance) |
| DateBeforeRule | Date A must precede Date B |
| RequiredFieldRule | Field must be non-null and non-empty |
| FieldEqualsRule | Two fields must be equal |
| RangeRule | Numeric field within min/max bounds |
| RegexRule | Field matches regex pattern |
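
As a sketch of the rule semantics (not docpick's actual rule classes), a SumEqualsRule-style check reduces to a tolerance-aware comparison over extracted fields:

```python
import math

def check_sum_equals(data: dict, sources: list[str], target: str, tol: float = 0.01) -> bool:
    """Return True if sum(sources) equals target within an absolute tolerance.

    Missing (None) fields fail the check, reflecting the idea that
    validation should flag incomplete extractions rather than crash.
    """
    if any(data.get(f) is None for f in sources + [target]):
        return False
    return math.isclose(sum(data[f] for f in sources), data[target], abs_tol=tol)

# Example: net + tax should equal total
doc = {"net_amount": 100.0, "tax_amount": 10.0, "total_amount": 110.0}
```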

Cross-Document Validation

Validate consistency across related documents (e.g., Invoice + B/L + Packing List):

from docpick.validation.cross_document import create_trade_document_validator

validator = create_trade_document_validator()
result = validator.validate({
    "invoice": invoice_data,
    "bl": bl_data,
    "packing_list": packing_list_data,
    "certificate": certificate_data,
})
print(result.is_valid)

OCR Engines

| Engine | Type | GPU | Languages | Best For |
|---|---|---|---|---|
| PaddleOCR | Traditional OCR | Optional | 111 | General documents (default) |
| EasyOCR | Traditional OCR | Optional | 80+ | Korean text |
| GOT-OCR2.0 | Vision-Language | Required | Multi | Complex layouts |
| VLM | Vision-Language | Required | Multi | Direct image → JSON |

2-Tier Auto Engine

The default auto engine uses confidence-based fallback:

  1. Tier 1 (CPU): PaddleOCR → EasyOCR
  2. Tier 2 (GPU): GOT-OCR2.0 → VLM

If the Tier 1 average confidence falls below the threshold (default 0.7), the pipeline automatically escalates to Tier 2.
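
The escalation logic can be sketched as follows. This is a simplified illustration with stand-in engine callables, not docpick's internals:

```python
CONFIDENCE_THRESHOLD = 0.7

def run_with_fallback(tier1_engines, tier2_engines, image):
    """Try Tier 1 (CPU) engines in order; escalate to Tier 2 (GPU) only if
    no Tier 1 result reaches the confidence threshold."""
    best = None
    for engine in tier1_engines:
        try:
            result = engine(image)  # each engine returns (text, avg_confidence)
        except Exception:
            continue                # engine failure: fall through to the next one
        if best is None or result[1] > best[1]:
            best = result
        if result[1] >= CONFIDENCE_THRESHOLD:
            return result           # confident enough, no escalation needed
    # Tier 1 was not confident enough: escalate to GPU engines
    for engine in tier2_engines:
        try:
            return engine(image)
        except Exception:
            continue
    return best  # last resort: best low-confidence Tier 1 result (or None)
```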

LLM Providers

| Provider | Endpoint | Default Model |
|---|---|---|
| vLLM | http://localhost:30000/v1 | Qwen/Qwen3.5-32B-AWQ |
| Ollama | http://localhost:11434 | qwen3.5:7b |

Configure via CLI or YAML:

docpick config set llm.provider ollama
docpick config set llm.base_url http://localhost:11434
docpick config set llm.model qwen3.5:7b
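
The same settings as a YAML fragment — the nesting below simply mirrors the dotted CLI keys above, so the exact file layout is an assumption:

```yaml
llm:
  provider: ollama
  base_url: http://localhost:11434
  model: qwen3.5:7b
```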

Error Handling

The pipeline is designed to be resilient:

  • OCR failure → automatic fallback to next available engine
  • LLM JSON parse failure → automatic retry with correction prompt (up to 1 retry)
  • Partial results → returns whatever was extracted, with errors logged in result.errors
  • Document load failure → returns empty result with error message

result = pipeline.extract("damaged.pdf", schema=InvoiceSchema)
if result.errors:
    print("Pipeline warnings:", result.errors)
if result.data:
    print("Partial extraction:", result.data)

Batch Processing

Process entire directories with parallel workers:

from docpick.batch import BatchProcessor
from docpick.schemas import InvoiceSchema

processor = BatchProcessor(concurrency=4)
result = processor.process_directory(
    "./invoices/",
    schema=InvoiceSchema,
    recursive=True,
)

print(f"Processed {result.succeeded}/{result.total} files")
for path, extraction in result.results.items():
    print(f"{path}: {extraction.data.get('total_amount')}")

Architecture

Document (PDF/Image)
  → DocumentLoader (pypdfium2)
  → Tier 1: OCR (PaddleOCR/EasyOCR, CPU)
    → [confidence < threshold] → Tier 2: VLM (GOT/VLM, GPU)
  → LLM Extractor (vLLM/Ollama, schema prompt)
  → Pydantic Validation (checkdigit, cross-field, cross-document)
  → ExtractionResult (structured JSON + confidence + validation)

License

Apache 2.0 — all dependencies are Apache 2.0 or MIT licensed.
