LLM-powered document extraction SDK

docparse

LLM-powered document extraction SDK. Extract structured data from PDFs, invoices, and contracts using built-in or custom schemas, in two lines of Python.

from docparse import LLMExtractor, load, INVOICE_SCHEMA

layout = load("invoice.pdf")                        # or from_text("...")
result = LLMExtractor().extract(layout, INVOICE_SCHEMA)

print(result.get("total_amount"))        # 1500.0
print(result.confidence("vendor_name"))  # 0.97

Installation

pip install mawlaia-docparse                # core (text files only)
pip install "mawlaia-docparse[pdf]"         # + PDF support via pdfplumber
pip install "mawlaia-docparse[openai]"      # + OpenAI provider
pip install "mawlaia-docparse[anthropic]"   # + Anthropic provider
pip install "mawlaia-docparse[all]"         # everything

Quickstart

from docparse import LLMExtractor, from_text, INVOICE_SCHEMA

layout = from_text("""
INVOICE #INV-2024-042
Vendor: Acme Corp
Date: 2024-03-15
Total: $1,500.00
""")

extractor = LLMExtractor(model="gpt-4o-mini", provider="openai")
result = extractor.extract(layout, INVOICE_SCHEMA)

for field_name in INVOICE_SCHEMA.field_names():
    value = result.get(field_name)
    conf  = result.confidence(field_name)
    if value is not None:
        print(f"{field_name}: {value}  (confidence: {conf:.0%})")
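
The loop above prints every extracted field. A common next step is to accept only values whose confidence clears a threshold. The following is a minimal sketch, assuming only the `get` and `confidence` methods shown above. `confident_fields`, `FakeResult`, and the `due_date` field are hypothetical names introduced for illustration, so the example runs without an API key.

```python
from typing import Any, Iterable


def confident_fields(result: Any, field_names: Iterable[str], threshold: float = 0.8) -> dict[str, Any]:
    """Return {field: value} for fields that have a value and whose
    confidence is at least `threshold`."""
    accepted = {}
    for name in field_names:
        value = result.get(name)
        if value is not None and result.confidence(name) >= threshold:
            accepted[name] = value
    return accepted


class FakeResult:
    """Hypothetical stand-in for an extraction result (not part of docparse)."""

    def __init__(self, values, confidences):
        self._values = values
        self._confidences = confidences

    def get(self, name):
        return self._values.get(name)

    def confidence(self, name):
        return self._confidences.get(name, 0.0)


demo = FakeResult(
    values={"vendor_name": "Acme Corp", "total_amount": 1500.0, "due_date": None},
    confidences={"vendor_name": 0.97, "total_amount": 0.62},
)
accepted = confident_fields(demo, ["vendor_name", "total_amount", "due_date"], threshold=0.8)
print(accepted)  # {'vendor_name': 'Acme Corp'}
```

With a real extraction, the call would presumably be `confident_fields(result, INVOICE_SCHEMA.field_names())`. The right threshold depends on how costly a wrong value is downstream.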

Built-in schemas

Schema constant          Key               Fields
INVOICE_SCHEMA           invoice           12 fields: amounts, dates, vendor, line items
LOAN_APPLICATION_SCHEMA  loan_application  12 fields: borrower, amounts, property
W2_SCHEMA                w2                8 fields: employer, wages, withholdings
NDA_SCHEMA               nda               8 fields: parties, term, jurisdiction
CONTRACT_SCHEMA          contract          9 fields: parties, dates, obligations

Access any schema by its key: from docparse import REGISTRY; schema = REGISTRY["invoice"]
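
When the key comes from user input (a CLI flag or a config file, for example), a bare `KeyError` is not very helpful. The sketch below wraps the lookup and lists the valid keys. It assumes `REGISTRY` behaves like a read-only mapping, which the indexing syntax above suggests but does not confirm. `resolve_schema` and `demo_registry` are hypothetical names, and the demo uses placeholder values rather than real schemas.

```python
from typing import Any, Mapping


def resolve_schema(registry: Mapping[str, Any], key: str) -> Any:
    """Look up a schema by key, raising a readable error for unknown keys."""
    try:
        return registry[key]
    except KeyError:
        available = ", ".join(sorted(registry))
        raise ValueError(f"Unknown schema key {key!r}; available: {available}") from None


# Stand-in registry using the built-in keys; the values are placeholders,
# not real schema objects.
demo_registry = {
    key: f"<{key} schema>"
    for key in ("invoice", "loan_application", "w2", "nda", "contract")
}
print(resolve_schema(demo_registry, "nda"))  # <nda schema>
```

With the real package, the call would presumably be `resolve_schema(REGISTRY, "invoice")`.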

Custom schemas

from docparse import ExtractionSchema, FieldSpec, LLMExtractor, from_text

schema = ExtractionSchema(name="purchase_order", fields=[
    FieldSpec(name="po_number",     description="PO number",         required=True),
    FieldSpec(name="total",         description="Total amount",       type="number", required=True, example="4200.00"),
    FieldSpec(name="delivery_date", description="Expected delivery",  type="date"),
])

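# Raw purchase order text (sample value for illustration)
po_text = "PO Number: PO-1001\nTotal: 4200.00\nExpected delivery: 2024-04-01"

result = LLMExtractor().extract(from_text(po_text), schema)
missing = result.missing_required(schema)  # required fields that were not extracted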
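
One way to act on `missing_required` is to fail fast before the data reaches downstream code. This is a sketch only. It assumes `missing_required(schema)` returns an iterable of field names, which the README does not spell out. `ensure_complete`, `MissingFieldsError`, and `FakeResult` are hypothetical helpers, not part of docparse.

```python
from typing import Any


class MissingFieldsError(ValueError):
    """Raised when required fields were not extracted (hypothetical helper)."""


def ensure_complete(result: Any, schema: Any) -> None:
    # ASSUMPTION: missing_required(schema) yields the names of required
    # fields that have no extracted value.
    missing = list(result.missing_required(schema))
    if missing:
        raise MissingFieldsError("Missing required fields: " + ", ".join(missing))


class FakeResult:
    """Stand-in result so the example runs without calling an LLM."""

    def __init__(self, missing):
        self._missing = missing

    def missing_required(self, schema):
        return list(self._missing)


ensure_complete(FakeResult(missing=[]), schema=None)  # passes silently

try:
    ensure_complete(FakeResult(missing=["po_number", "total"]), schema=None)
except MissingFieldsError as exc:
    print(exc)  # Missing required fields: po_number, total
```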

CLI

docparse extract invoice.pdf --schema invoice
docparse extract contract.txt --schema nda --json
docparse schemas          # list available schemas
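
The CLI can also be driven from a script. The sketch below builds the same command lines shown above and, in `run_extract`, parses the output. It assumes that `--json` writes a single JSON document to stdout; the shape of that document is not documented here, so the function returns whatever `json.loads` produces. `build_extract_command` and `run_extract` are hypothetical helper names.

```python
import json
import subprocess


def build_extract_command(path: str, schema: str, as_json: bool = True) -> list[str]:
    """Build the argv list for `docparse extract`."""
    cmd = ["docparse", "extract", path, "--schema", schema]
    if as_json:
        cmd.append("--json")
    return cmd


def run_extract(path: str, schema: str):
    """Run the CLI and parse its stdout.

    ASSUMPTION: with --json the CLI prints one JSON document to stdout.
    Requires the docparse CLI to be installed and on PATH.
    """
    completed = subprocess.run(
        build_extract_command(path, schema),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


print(build_extract_command("contract.txt", "nda"))
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.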

Providers

# OpenAI (default)
LLMExtractor(model="gpt-4o-mini", provider="openai")

# Anthropic
LLMExtractor(model="claude-3-5-haiku-20241022", provider="anthropic")
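
A script that should work with either provider can pick one based on which API key is available. This is a sketch under assumptions: `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are the variables read by the official OpenAI and Anthropic Python SDKs, but whether docparse relies on them is not stated in this README. `extractor_kwargs` is a hypothetical helper; the model names are the ones shown above.

```python
import os
from typing import Mapping

# Model names taken from the provider examples above.
DEFAULT_MODELS = {"openai": "gpt-4o-mini", "anthropic": "claude-3-5-haiku-20241022"}
# ASSUMPTION: docparse picks up the same variables as the provider SDKs.
KEY_VARIABLES = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def extractor_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return keyword arguments for LLMExtractor based on available API keys.

    OpenAI is checked first because it is listed as the default provider.
    """
    for provider, variable in KEY_VARIABLES.items():
        if env.get(variable):
            return {"provider": provider, "model": DEFAULT_MODELS[provider]}
    raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY")


print(extractor_kwargs({"ANTHROPIC_API_KEY": "example-key"}))
```

The result would then be unpacked into the constructor, as in `LLMExtractor(**extractor_kwargs())`.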

License

MIT © Mawlaia

Download files

Source Distribution

mawlaia_docparse-0.1.0.tar.gz (7.5 kB)

Uploaded Source

Built Distribution

mawlaia_docparse-0.1.0-py3-none-any.whl (10.6 kB)

Uploaded Python 3

File details

Details for the file mawlaia_docparse-0.1.0.tar.gz.

File metadata

  • Download URL: mawlaia_docparse-0.1.0.tar.gz
  • Size: 7.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mawlaia_docparse-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       aeddfab48e46a867d3d3c5fef359329ea95df7aa9b0160ca62c0afc0862fad80
MD5          15e1863fb539f0e8c3f1a4ffcf5cc2ad
BLAKE2b-256  3c4e1af9ae249add82d7192de5f6ad1e6c0cd6fd74916a3cba6ed0bc216db19d

File details

Details for the file mawlaia_docparse-0.1.0-py3-none-any.whl.

File hashes

Hashes for mawlaia_docparse-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       8546b3880ac1a6c1c35a8c7df7a66d769c680ee1d07d153844610f74c877982a
MD5          3c3c1fea2c1b6245a13b8d5ca709ac4c
BLAKE2b-256  26575f5d24a197de1931110253fd198e8f42f8a9de5cb2aba121f29be4d79484
