
Docpick

Document in, Structured JSON out. Locally. With your schema.

docpick is a lightweight, schema-driven document extraction pipeline that combines local OCR engines with local LLMs to extract structured JSON from any document — invoices, receipts, bills of lading, tax forms, and more.

  • Zero cloud dependency — runs entirely on your machine (CPU or GPU)
  • Custom schemas — define your own Pydantic models or use 8 built-in document schemas
  • Validation built-in — check-digit verification, cross-field rules, cross-document consistency
  • Apache 2.0 — no GPL/AGPL dependencies

Install

```shell
pip install docpick              # core (LLM extraction only)
pip install "docpick[paddle]"    # + PaddleOCR (recommended)
pip install "docpick[easyocr]"   # + EasyOCR (Korean-optimized)
pip install "docpick[got]"       # + GOT-OCR2.0 (GPU, vision-language)
pip install "docpick[all]"       # all OCR backends
```

Requirements: Python 3.11+ and an LLM endpoint (vLLM, Ollama, or any OpenAI-compatible server)

Quick Start

Python API

```python
from docpick import DocpickPipeline
from docpick.schemas import InvoiceSchema

pipeline = DocpickPipeline()
result = pipeline.extract("invoice.pdf", schema=InvoiceSchema)

print(result.data)           # Structured dict matching schema
print(result.validation)     # Validation errors/warnings
print(result.confidence)     # Per-field confidence scores
```

CLI

```shell
# Extract structured data
docpick extract invoice.pdf --schema invoice --output result.json

# OCR only (no LLM)
docpick ocr document.png --lang ko,en

# Validate extracted JSON
docpick validate result.json --schema invoice

# Batch process a directory
docpick batch ./documents/ --schema invoice --output ./results/ --concurrency 4

# List available schemas
docpick schemas list

# Show schema details
docpick schemas show invoice
```

Built-in Schemas

| Schema | Document Type | Key Validations |
|---|---|---|
| `invoice` | Commercial invoices | Line item sums, tax ID check digit, date order |
| `receipt` | Retail/restaurant receipts | Total = subtotal + tax + tip |
| `bill_of_lading` | Ocean/air B/L | Container weight sums, ISO 6346, HS code format |
| `purchase_order` | Purchase orders | PO total = line items, delivery date order |
| `kr_tax_invoice` | Korean e-tax invoice (세금계산서) | Business number check digit (x2), supply/tax/total sums |
| `bank_statement` | Bank statements | IBAN mod-97, period date order |
| `id_document` | Passport/ID (ICAO 9303) | MRZ, ISO 3166 country codes, date ranges |
| `certificate_of_origin` | Certificate of Origin | ISO 3166 alpha-2 country codes |

Custom Schemas

Define your own schema with Pydantic:

```python
from pydantic import BaseModel
from docpick import DocpickPipeline
from docpick.validation.rules import SumEqualsRule, RequiredFieldRule

class MyDocument(BaseModel):
    """Custom document schema."""
    company_name: str | None = None
    total_amount: float | None = None
    tax_amount: float | None = None
    net_amount: float | None = None
    items: list[dict] | None = None

    class ValidationRules:
        rules = [
            RequiredFieldRule("company_name"),
            SumEqualsRule(["net_amount", "tax_amount"], "total_amount"),
        ]

pipeline = DocpickPipeline()
result = pipeline.extract("my_document.pdf", schema=MyDocument)
```

Or use a JSON Schema file:

```shell
docpick extract document.pdf --schema my_schema.json
```
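For reference, a minimal `my_schema.json` for the command above might look like the following sketch. The field names are illustrative (they mirror the Pydantic example), and docpick's exact JSON Schema conventions may differ:

```json
{
  "title": "MyDocument",
  "type": "object",
  "properties": {
    "company_name": {"type": ["string", "null"]},
    "total_amount": {"type": ["number", "null"]},
    "tax_amount": {"type": ["number", "null"]},
    "net_amount": {"type": ["number", "null"]}
  }
}
```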

Validation

Check Digit Algorithms

| Algorithm | Use Case |
|---|---|
| `kr_business_number` | Korean business registration number (10 digits) |
| `luhn` | Credit card numbers |
| `iso_6346` | Shipping container numbers |
| `iban_mod97` | International bank account numbers |
| `awb_mod7` | Air waybill numbers |
| `mrz` | Machine Readable Zone (passport/ID) |
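As an illustration of the kind of check these validators perform, here is a standalone Luhn implementation (a sketch for explanation, not docpick's internal code):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    digits = [int(d) for d in number if d.isdigit()]
    # Double every second digit from the right; subtract 9 when the result exceeds 9.
    for i in range(len(digits) - 2, -1, -2):
        doubled = digits[i] * 2
        digits[i] = doubled - 9 if doubled > 9 else doubled
    return sum(digits) % 10 == 0

print(luhn_valid("79927398713"))  # classic valid Luhn example -> True
```

The other algorithms follow the same pattern: a deterministic function of the digits that must reproduce the embedded check digit, so transposition and single-digit OCR errors are caught before the data leaves the pipeline.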

Cross-Field Rules

| Rule | Description |
|---|---|
| `SumEqualsRule` | Sum of fields equals target (with tolerance) |
| `DateBeforeRule` | Date A must precede Date B |
| `RequiredFieldRule` | Field must be non-null and non-empty |
| `FieldEqualsRule` | Two fields must be equal |
| `RangeRule` | Numeric field within min/max bounds |
| `RegexRule` | Field matches regex pattern |
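The semantics of `SumEqualsRule` can be sketched in plain Python (a simplified illustration of the check, not docpick's implementation; missing fields are treated as zero here, which is an assumption):

```python
def sum_equals(data: dict, source_fields: list[str], target_field: str,
               tolerance: float = 0.01) -> bool:
    """Check that the sum of source fields equals the target within a tolerance."""
    total = sum(data.get(f) or 0.0 for f in source_fields)
    target = data.get(target_field) or 0.0
    return abs(total - target) <= tolerance

doc = {"net_amount": 100.0, "tax_amount": 10.0, "total_amount": 110.0}
print(sum_equals(doc, ["net_amount", "tax_amount"], "total_amount"))  # True
```

The tolerance matters in practice: OCR often drops or rounds cents, so an exact float comparison would flag documents that are arithmetically fine.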

Cross-Document Validation

Validate consistency across related documents (e.g., Invoice + B/L + Packing List):

```python
from docpick.validation.cross_document import create_trade_document_validator

validator = create_trade_document_validator()
result = validator.validate({
    "invoice": invoice_data,
    "bl": bl_data,
    "packing_list": packing_list_data,
    "certificate": certificate_data,
})
print(result.is_valid)
```
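In spirit, a cross-document rule compares the same field across documents. A simplified standalone illustration of that idea (not the library's implementation):

```python
def fields_consistent(documents: dict[str, dict], field: str) -> bool:
    """True when every document that carries `field` agrees on its value."""
    values = {doc[field] for doc in documents.values()
              if field in doc and doc[field] is not None}
    return len(values) <= 1

docs = {
    "invoice": {"consignee": "ACME Co."},
    "bl": {"consignee": "ACME Co."},
    "packing_list": {"consignee": "ACME Corp."},
}
print(fields_consistent(docs, "consignee"))  # False: the packing list disagrees
```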

OCR Engines

| Engine | Type | GPU | Languages | Best For |
|---|---|---|---|---|
| PaddleOCR | Traditional OCR | Optional | 111 | General documents (default) |
| EasyOCR | Traditional OCR | Optional | 80+ | Korean text |
| GOT-OCR2.0 | Vision-Language | Required | Multi | Complex layouts |
| VLM | Vision-Language | Required | Multi | Direct image → JSON |

2-Tier Auto Engine

The default auto engine uses confidence-based fallback:

  1. Tier 1 (CPU): PaddleOCR → EasyOCR
  2. Tier 2 (GPU): GOT-OCR2.0 → VLM

If the Tier 1 average confidence falls below the threshold (default 0.7), the pipeline automatically escalates to Tier 2.
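The escalation logic above can be sketched roughly as follows; the engine interface here is hypothetical (the real auto engine's API is not documented in this README):

```python
from typing import Callable

# Each engine takes image bytes and returns (text, average_confidence).
OcrFn = Callable[[bytes], tuple[str, float]]

def auto_ocr(image: bytes, tier1: list[OcrFn], tier2: list[OcrFn],
             threshold: float = 0.7) -> tuple[str, float]:
    """Try CPU engines first; escalate to GPU engines below the confidence threshold."""
    best: tuple[str, float] = ("", 0.0)
    for engine in tier1 + tier2:
        text, conf = engine(image)
        if conf >= threshold:
            return text, conf      # good enough: stop early, no escalation needed
        if conf > best[1]:
            best = (text, conf)    # remember the best low-confidence attempt
    return best  # no engine met the threshold; return the best attempt
```

Listing Tier 1 before Tier 2 gives the cheap CPU engines first shot; the GPU engines only run when every cheaper result was below the threshold.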

LLM Providers

| Provider | Endpoint | Default Model |
|---|---|---|
| vLLM | `http://localhost:30000/v1` | Qwen/Qwen3.5-32B-AWQ |
| Ollama | `http://localhost:11434` | qwen3.5:7b |

Configure via CLI or YAML:

```shell
docpick config set llm.provider ollama
docpick config set llm.base_url http://localhost:11434
docpick config set llm.model qwen3.5:7b
```
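The equivalent YAML would presumably nest the same keys; this is a sketch that simply mirrors the `docpick config set` keys above, and the exact file structure and location are assumptions:

```yaml
llm:
  provider: ollama
  base_url: http://localhost:11434
  model: qwen3.5:7b
```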

Error Handling

The pipeline is designed to be resilient:

  • OCR failure → automatic fallback to next available engine
  • LLM JSON parse failure → automatic retry with correction prompt (up to 1 retry)
  • Partial results → returns whatever was extracted, with errors logged in result.errors
  • Document load failure → returns empty result with error message

For example:

```python
result = pipeline.extract("damaged.pdf", schema=InvoiceSchema)
if result.errors:
    print("Pipeline warnings:", result.errors)
if result.data:
    print("Partial extraction:", result.data)
```

Batch Processing

Process entire directories with parallel workers:

```python
from docpick.batch import BatchProcessor
from docpick.schemas import InvoiceSchema

processor = BatchProcessor(concurrency=4)
result = processor.process_directory(
    "./invoices/",
    schema=InvoiceSchema,
    recursive=True,
)

print(f"Processed {result.succeeded}/{result.total} files")
for path, extraction in result.results.items():
    print(f"{path}: {extraction.data.get('total_amount')}")
```

Architecture

```text
Document (PDF/Image)
  → DocumentLoader (pypdfium2)
  → Tier 1: OCR (PaddleOCR/EasyOCR, CPU)
    → [confidence < threshold] → Tier 2: VLM (GOT/VLM, GPU)
  → LLM Extractor (vLLM/Ollama, schema prompt)
  → Pydantic Validation (check-digit, cross-field, cross-document)
  → ExtractionResult (structured JSON + confidence + validation)
```

License

Apache 2.0 — all dependencies are Apache 2.0 or MIT licensed.
