Turn unstructured documents into clean JSON with auto-generated schemas

Maintainers

datafenix

doc2json

Beta - Actively developed and used in production, but APIs may change. Feedback welcome!

Your documents are unique. Your extraction tool should be too.

Every industry has documents that generic AI tools don't understand. Legal contracts with jurisdiction-specific clauses. Medical intake forms with diagnosis codes. Invoices with VAT breakdowns. Shipping manifests with customs classifications.

You know the structure of your documents. doc2json lets you encode that knowledge into Pydantic schemas - then extracts exactly what you need, validated and typed.

And because your documents often contain sensitive data, you choose where the AI runs: locally on your laptop, in your enterprise cloud, or via public APIs.

The Problem

You've probably tried this before:

"Just use an LLM" - You write a prompt, get back JSON... sometimes. No validation. Hallucinated fields. Different structure every time. You spend more time parsing the output than you saved.

"Use a document extraction API" - Generic fields that don't match your domain. "Amount" when you need "VAT-exclusive subtotal". No way to capture your industry's specific terminology.

"Build it with LangChain" - Three weeks later you have a fragile pipeline that breaks when documents vary. No schema versioning. No quality feedback. No idea which extractions need review.

"Send everything to the cloud" - Your compliance team wants to know why patient records are going to OpenAI's servers.

The Solution

doc2json is a Python CLI that turns unstructured documents into validated JSON using LLMs and Pydantic schemas.

Documents (PDF, Word, HTML, text)
        ↓
Your Pydantic Schema (you define the fields)
        ↓
LLM Extraction (provider of your choice)
        ↓
Validated JSON (type-checked, structured)
        ↓
Your Destination (files, databases, warehouses)

You define the schema. You choose the AI. You control your data.
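
The "Validated JSON" step in the flow above amounts to parsing the model's raw reply against your schema and rejecting anything that does not fit. A minimal sketch with Pydantic v2; the `Invoice` model, the `parse_reply` helper, and the sample replies are illustrative assumptions, not doc2json internals:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Invoice(BaseModel):
    # Hypothetical cut-down schema, just for illustration.
    invoice_number: str
    total: float
    currency: Optional[str] = None


def parse_reply(raw: str) -> Optional[Invoice]:
    """Return a typed Invoice, or None if the reply fails validation."""
    try:
        return Invoice.model_validate_json(raw)
    except ValidationError as err:
        # Each error names the offending field, so bad replies are easy to triage.
        for problem in err.errors():
            print("rejected:", problem["loc"], problem["msg"])
        return None


good = parse_reply('{"invoice_number": "INV-001", "total": 120.5}')
bad = parse_reply('{"invoice_number": "INV-002", "total": "about a hundred"}')
```

Here `good` is a typed object with `good.total` as a float, while `bad` is rejected instead of flowing downstream as a string.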

Industry Examples

Legal - Extract party names, obligations, termination clauses, governing law from contracts. Run locally with Ollama for client confidentiality.

Medical - Parse patient intake forms into structured records: demographics, symptoms, medications, allergies. Keep PHI off public clouds.

Finance - Pull line items, tax breakdowns, payment terms from invoices. Load directly to Snowflake for reconciliation.

Supply Chain - Extract shipment details, HS codes, weights, origins from customs documents. Connect to your existing data warehouse.

Insurance - Parse claims forms, policy documents, coverage details. Maintain audit trails with schema versioning.

Real Estate - Extract property details, terms, contingencies from purchase agreements and leases.

Quick Start

1. Install

pip install "doc2json[openai]"  # or [anthropic], [gemini], [all]; quotes keep shells like zsh from globbing the brackets

2. Initialize

doc2json init

This creates your project structure:

doc2json.yml       # Configuration
schemas/           # Your Pydantic schemas
sources/           # Input documents
outputs/           # Extracted JSON

3. Define Your Schema

Edit schemas/example.py - this is where your domain knowledge lives:

__version__ = "1"

from pydantic import BaseModel, Field
from typing import Optional
from datetime import date

class LineItem(BaseModel):
    description: str = Field(description="Item or service description")
    quantity: float = Field(description="Number of units")
    unit_price: float = Field(description="Price per unit excluding tax")
    vat_rate: Optional[float] = Field(default=None, description="VAT rate as decimal")

class Schema(BaseModel):
    """Invoice extraction schema."""
    invoice_number: str = Field(description="Invoice number or reference")
    invoice_date: date = Field(description="Date invoice was issued")
    vendor_name: str = Field(description="Name of the vendor/seller")
    vendor_vat_number: Optional[str] = Field(default=None, description="Vendor VAT registration")
    line_items: list[LineItem] = Field(description="All line items on the invoice")
    subtotal: float = Field(description="Total before tax")
    vat_amount: Optional[float] = Field(default=None, description="Total VAT/tax amount")
    total: float = Field(description="Final amount due")
    currency: str = Field(description="Currency code (GBP, USD, EUR, etc.)")

The field descriptions guide the LLM. Nested models like LineItem just work.
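
To see how descriptions and nested models can reach the LLM, you can inspect the JSON Schema that Pydantic generates. A sketch using Pydantic v2's `model_json_schema()` on a trimmed copy of the schema above; whether doc2json sends exactly this document is an assumption (`doc2json preview` shows the real one):

```python
from typing import Optional

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    description: str = Field(description="Item or service description")
    vat_rate: Optional[float] = Field(default=None, description="VAT rate as decimal")


class Schema(BaseModel):
    """Invoice extraction schema."""
    invoice_number: str = Field(description="Invoice number or reference")
    line_items: list[LineItem] = Field(description="All line items on the invoice")


json_schema = Schema.model_json_schema()

# Field descriptions are carried into the generated schema.
print(json_schema["properties"]["invoice_number"]["description"])

# Nested models are emitted once under "$defs" and referenced from the parent.
print(json_schema["properties"]["line_items"]["items"])
print(sorted(json_schema["$defs"]["LineItem"]["properties"]))

# Fields without a default are listed as required.
print(json_schema["required"])
```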

4. Add Documents & Run

# Put your documents in sources/example/
doc2json run

Output appears in outputs/example_<timestamp>.jsonl - validated, structured, ready to use.
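
Since the output is JSONL, downstream code can read it one record per line with the standard library. A minimal sketch; the `read_jsonl` helper is hypothetical, and the flat record layout with a `total` field is an assumption (real records may carry extra metadata):

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Demo with a throwaway file standing in for outputs/example_<timestamp>.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example_demo.jsonl"
    sample.write_text('{"total": 120.5}\n{"total": 80.0}\n', encoding="utf-8")
    records = list(read_jsonl(sample))

print(sum(record["total"] for record in records))
```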

Schema Evolution

Here's what makes doc2json different: the AI helps you improve your schema.

Enable assessment in your config:

schemas:
  - name: invoice
    assess: true

Now when you run extractions, the LLM evaluates each result and suggests missing fields it noticed in your documents:

doc2json run
# "Noticed 'payment_terms' in 8/10 documents - consider adding to schema"
# "Noticed 'purchase_order_number' in 6/10 documents - consider adding to schema"

doc2json suggest-schema
# Generates updated schema with new fields

doc2json accept-suggestion
# Backs up old schema (invoice_v1.py), promotes new version

Your schema evolves based on real data, not guesswork. Every extraction records which schema version was used for full traceability.
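
The "8/10 documents" style of feedback boils down to counting how often each candidate field shows up across a batch. A sketch of that aggregation, not doc2json's actual implementation; the `frequent_fields` helper, the 50% threshold, and the sample data are assumptions:

```python
from collections import Counter


def frequent_fields(
    suggestions_per_doc: list[set[str]], min_share: float = 0.5
) -> list[tuple[str, int]]:
    """Return (field, document_count) pairs seen in at least min_share of documents."""
    total = len(suggestions_per_doc)
    if total == 0:
        return []
    # Sets guarantee a field is counted at most once per document.
    counts = Counter(field for doc in suggestions_per_doc for field in doc)
    return [(name, n) for name, n in counts.most_common() if n / total >= min_share]


# Hypothetical per-document suggestions from an assessment pass over 10 documents.
docs = (
    [{"payment_terms", "purchase_order_number"}] * 6
    + [{"payment_terms"}] * 2
    + [set()] * 2
)
for name, n in frequent_fields(docs):
    print(f"Noticed '{name}' in {n}/{len(docs)} documents")
```

Raising `min_share` keeps only fields that appear in most documents, which is one way to avoid growing the schema for one-off details.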

Privacy Tiers

Your documents, your choice:

Tier	Provider	Your Data
Local	Ollama	Never leaves your machine
Enterprise	Azure OpenAI	Stays in your cloud tenant
Public Cloud	Anthropic, OpenAI, Gemini, Groq	Sent to provider's servers

Run Locally with Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.3

# doc2json.yml
llm:
  provider: ollama
  model: llama3.3

No API keys. No data leaving your machine. No per-token costs.
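
To make "never leaves your machine" concrete, here is roughly what a request to a local Ollama server looks like over its HTTP API. This is a standalone sketch, not how doc2json necessarily talks to Ollama; the `build_request` and `generate` helpers and the prompt are assumptions:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "llama3.3") -> urllib.request.Request:
    """Build (but do not send) a non-streaming, JSON-mode request to local Ollama."""
    body = {"model": model, "prompt": prompt, "stream": False, "format": "json"}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.3") -> dict:
    """Send the request and parse the model's JSON reply (needs Ollama running)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        envelope = json.load(resp)
    # With stream=False the generated text arrives in the "response" field.
    return json.loads(envelope["response"])


request = build_request("Return the invoice number as JSON: Invoice INV-001")
print(request.full_url)
```

The only network hop is to `localhost`, so the document text stays on the machine running Ollama.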

Enterprise Cloud (Azure OpenAI)

llm:
  provider: openai
  base_url: https://your-resource.openai.azure.com
  api_key: ${AZURE_OPENAI_API_KEY}
  api_version: 2024-12-01-preview
  model: gpt-4.1

Data stays in your Azure tenant, which many compliance frameworks require.
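
The `${AZURE_OPENAI_API_KEY}` placeholder keeps the secret out of the config file and reads it from the environment instead. How doc2json resolves it internally is not shown here; the following is a sketch of the general technique, with a hypothetical `expand_env` helper:

```python
import os
import re
from typing import Mapping, Optional

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${NAME} placeholders with environment values; fail loudly if unset."""
    source = os.environ if env is None else env

    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in source:
            raise KeyError(f"environment variable {name} is not set")
        return source[name]

    return _PLACEHOLDER.sub(lookup, text)


line = "api_key: ${AZURE_OPENAI_API_KEY}"
print(expand_env(line, env={"AZURE_OPENAI_API_KEY": "demo-key"}))
```

Failing on an unset variable, rather than substituting an empty string, surfaces a missing secret before any request is made.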

Public Cloud (Fastest, Most Accurate)

llm:
  provider: anthropic
  model: claude-sonnet-4-20250514

Best accuracy for complex extractions. See docs/models.md for model recommendations.

Production Connectors

doc2json isn't just for prototypes. Connect to real data infrastructure:

Sources: Local files, AWS S3, Google Drive, Azure Blob Storage

Destinations: JSONL files, PostgreSQL, MongoDB, Snowflake, BigQuery, SQLite, MySQL

# Example: S3 → Snowflake pipeline
source:
  type: s3
  bucket: legal-documents
  prefix: contracts/2024/

destination:
  type: snowflake
  account: xy12345.us-east-1
  user: ${SNOWFLAKE_USER}
  password: ${SNOWFLAKE_PASSWORD}
  database: ANALYTICS
  schema: RAW
  warehouse: COMPUTE_WH

See docs/reference.md for all connector options.
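
Conceptually, a destination takes validated records and writes them somewhere queryable. The sketch below does that by hand with Python's built-in `sqlite3`; it is not doc2json's SQLite connector, and the `extractions` table layout and `load_records` helper are assumptions:

```python
import json
import sqlite3


def load_records(conn: sqlite3.Connection, records: list[dict]) -> int:
    """Insert records, keeping two typed columns plus the full JSON payload."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extractions ("
        "id INTEGER PRIMARY KEY, invoice_number TEXT, total REAL, payload TEXT)"
    )
    rows = [(r.get("invoice_number"), r.get("total"), json.dumps(r)) for r in records]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO extractions (invoice_number, total, payload) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
written = load_records(
    conn,
    [
        {"invoice_number": "INV-001", "total": 120.5, "currency": "GBP"},
        {"invoice_number": "INV-002", "total": 80.0, "currency": "EUR"},
    ],
)
grand_total = conn.execute("SELECT SUM(total) FROM extractions").fetchone()[0]
print(written, grand_total)
```

Keeping the raw payload next to the typed columns means fields added to the schema later are not lost for rows loaded earlier.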

Commands

Command	What it does
`doc2json init`	Create project structure
`doc2json run`	Extract data from documents
`doc2json run --dry-run`	Preview without calling the LLM (cost estimate)
`doc2json test`	Validate configuration and schemas
`doc2json preview`	Show the JSON schema sent to the LLM
`doc2json suggest-schema`	Generate schema improvements from feedback
`doc2json accept-suggestion`	Apply suggested schema (with version backup)
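
The commands above chain naturally in a scheduled job: validate first, preview, then extract. A sketch that assumes doc2json exits with a non-zero status on failure (not confirmed here); the `preflight_then_run` helper and its injectable `runner` are hypothetical:

```python
import subprocess


def preflight_then_run(runner=subprocess.run) -> list[list[str]]:
    """Run `doc2json test`, a dry run, then the extraction, stopping on failure."""
    steps = [
        ["doc2json", "test"],
        ["doc2json", "run", "--dry-run"],
        ["doc2json", "run"],
    ]
    completed = []
    for argv in steps:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run after a failed one.
        runner(argv, check=True)
        completed.append(argv)
    return completed
```

Calling `preflight_then_run()` with the default runner requires doc2json on your PATH; passing a stub as `runner` lets you test the ordering without it.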

Why doc2json?

Schema-first - Pydantic models with type hints and validation. No more hoping the JSON looks right.

Domain-specific - Your schema encodes your domain knowledge. Extract exactly what matters to your business.

Privacy-conscious - Run locally, in your enterprise cloud, or via public APIs. You decide.

Self-improving - The assessment loop discovers fields you missed. Your schema evolves with your data.

Production-ready - Real connectors to real infrastructure. Metadata tracking. Schema versioning.

Open source - MIT licensed. No vendor lock-in. See exactly what it does.

Installation Options

# Core + LLM provider (quotes keep shells like zsh from globbing the brackets)
pip install "doc2json[anthropic]"    # Claude
pip install "doc2json[openai]"       # OpenAI, Azure, Groq, Together, Ollama
pip install "doc2json[gemini]"       # Google Gemini
pip install "doc2json[all]"          # All providers

# Add connectors as needed
pip install "doc2json[s3]"           # AWS S3 source
pip install "doc2json[snowflake]"    # Snowflake destination
pip install "doc2json[postgres]"     # PostgreSQL destination
pip install "doc2json[sql]"          # Generic SQL (MySQL, SQLite, etc.)

Documentation

Reference Guide - Full configuration options, all connectors, file format support
Model Selection - Choosing the right LLM provider for your use case

License

MIT - Use it however you want.

Release history

This version

0.1.0

Dec 17, 2025

Download files

Source Distribution

doc2json-0.1.0.tar.gz (74.4 kB)

Uploaded Dec 17, 2025 Source

Built Distribution

doc2json-0.1.0-py3-none-any.whl (70.9 kB)

Uploaded Dec 17, 2025 Python 3

File details

Details for the file doc2json-0.1.0.tar.gz.

File metadata

Download URL: doc2json-0.1.0.tar.gz
Upload date: Dec 17, 2025
Size: 74.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for doc2json-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`abc7cbbd8577e2e3ac24572c6031e6d8fbd7a673b9a7a54e982c664bb4085117`
MD5	`eeb22ee9d0ac7ccead4de4eea9c0f743`
BLAKE2b-256	`224c06eeed02887bcc28805a410a4ea44bfee553ec35e84606de766c9b49c392`
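
If you download the archive by hand, you can check it against the SHA256 digest listed above. A minimal sketch with `hashlib`; the `sha256_of` and `matches_published` helpers and the local file path are assumptions:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published(Path("doc2json-0.1.0.tar.gz"), "abc7cbbd8577e2e3ac24572c6031e6d8fbd7a673b9a7a54e982c664bb4085117")` should return `True` for an unmodified download of the source distribution.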

Provenance

The following attestation bundles were made for doc2json-0.1.0.tar.gz:

Publisher: publish.yml on DataFenix-Ltd/doc2json

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: doc2json-0.1.0.tar.gz
- Subject digest: abc7cbbd8577e2e3ac24572c6031e6d8fbd7a673b9a7a54e982c664bb4085117
- Sigstore transparency entry: 768790512
- Sigstore integration time: Dec 17, 2025
Source repository:
- Permalink: DataFenix-Ltd/doc2json@0919c25839014b0a7eb69c04b14fa595b9b4b417
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/DataFenix-Ltd
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0919c25839014b0a7eb69c04b14fa595b9b4b417
- Trigger Event: push

File details

Details for the file doc2json-0.1.0-py3-none-any.whl.

File metadata

Download URL: doc2json-0.1.0-py3-none-any.whl
Upload date: Dec 17, 2025
Size: 70.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for doc2json-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2c1dd9e086bea71cbc901153d38ae61a7fd5f44cd72471e72f4e2a003dbb7231`
MD5	`244ce504c99ad3ba57fed7b6e50f5a28`
BLAKE2b-256	`01ee7551e1361486d9653cdef9abdae917df8356dc23945678afc98f1bcccf55`

Provenance

The following attestation bundles were made for doc2json-0.1.0-py3-none-any.whl:

Publisher: publish.yml on DataFenix-Ltd/doc2json

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: doc2json-0.1.0-py3-none-any.whl
- Subject digest: 2c1dd9e086bea71cbc901153d38ae61a7fd5f44cd72471e72f4e2a003dbb7231
- Sigstore transparency entry: 768790515
- Sigstore integration time: Dec 17, 2025
Source repository:
- Permalink: DataFenix-Ltd/doc2json@0919c25839014b0a7eb69c04b14fa595b9b4b417
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/DataFenix-Ltd
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0919c25839014b0a7eb69c04b14fa595b9b4b417
- Trigger Event: push
