Persistent cache for LLM document extraction: extract once, query forever. CLI, Python API, and MCP server.

docstore

dbt for unstructured data. Extract once, query forever.

Most tools re-read your documents every time you ask a question. docstore extracts the fields you care about once, caches them locally, and answers subsequent queries from the cache - no LLM calls, no re-reading, no waiting.

Run 1 (cold, --validate): 200 invoices → 400 LLM calls → ~$0.40
Run 2 (warm):             200 invoices → 0 LLM calls   → $0.00
Query:                    "which invoices are unpaid?" → 0 LLM calls → <1s

Why this exists

LLMs made it easy to extract structured data from documents. What they did not provide is a layer that persists that extraction and invalidates it automatically when the source file changes. Every existing tool either:

  • Re-reads raw documents on every query (expensive, slow)
  • Requires a database or vector store (complex, overkill for most teams)
  • Stores embeddings for semantic search (wrong abstraction for structured extraction)

docstore treats structured extraction as a cache over your unstructured data. It is the same insight dbt applied to SQL: you define the transformation once, and the system manages the state.


Benchmark

docstore ships with a reproducible public cache benchmark. It generates a synthetic invoice corpus, writes ground_truth.jsonl, then measures:

  • cold_extract: empty cache, every document calls the LLM once
  • warm_extract: same corpus and schema, every document is served from cache
  • cached_query: query stored JSON locally, with no parser or LLM calls

uv run python scripts/benchmark.py /tmp/docstore-benchmark --count 30
uv run python scripts/benchmark.py /tmp/docstore-benchmark --count 30 --output json

Use --provider and --model to run it against a specific vendor. The benchmark is intended to show cache behavior, not provider quality.


Installation

pip install lumient-docstore

All four LLM providers (Anthropic, OpenAI, Groq, Gemini) work out of the box - pick one at runtime via --provider.

Or from source:

git clone https://github.com/LumientAI/docstore
cd docstore
pip install -e ".[dev]"

Quickstart

Python API

from pathlib import Path
from docstore import DocStore, ExtractionSchema, create_llm_client
from docstore.agents.orchestrator import run_directory

class InvoiceSchema(ExtractionSchema):
    vendor: str
    amount: float
    currency: str
    due_date: str
    paid: bool

invoices_dir = Path("./invoices")

# Co-locate the cache with the corpus so the CLI and Python API
# share state (the CLI's path-taking commands default to this).
store = DocStore(root=invoices_dir / ".docstore")
descriptor = InvoiceSchema.to_descriptor()
client = create_llm_client()  # defaults to Anthropic; pass provider="openai" etc. to override
results = run_directory(invoices_dir, descriptor, store, client)

# Query without any LLM calls
unpaid = store.query("InvoiceSchema", lambda r: r.data.get("paid") is False)
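
To see why a query like this needs no LLM, here is a minimal sketch of filtering cached JSON files with a predicate. The on-disk layout (one JSON object per file with `schema` and `data` keys) and the `query_cache` helper are assumptions made for illustration; they are not docstore's actual storage format or API.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterator


def query_cache(root: Path, schema: str, predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Yield cached records for `schema` whose extracted data satisfies `predicate`."""
    for path in sorted(root.glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        if record.get("schema") == schema and predicate(record["data"]):
            yield record


# Build a throwaway cache directory holding two hypothetical records.
root = Path(tempfile.mkdtemp())
samples = [
    {"schema": "InvoiceSchema", "data": {"vendor": "Acme", "amount": 120.0, "paid": False}},
    {"schema": "InvoiceSchema", "data": {"vendor": "Globex", "amount": 80.0, "paid": True}},
]
for i, record in enumerate(samples):
    (root / f"record_{i}.json").write_text(json.dumps(record), encoding="utf-8")

# Only local file reads and a Python predicate are involved.
unpaid_vendors = [
    r["data"]["vendor"]
    for r in query_cache(root, "InvoiceSchema", lambda d: d.get("paid") is False)
]
print(unpaid_vendors)  # ['Acme']
```

The predicate runs over data that was already extracted, so the cost of a query depends on the number of cached records rather than on document size.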

CLI

# Generate a synthetic invoice corpus for testing (30 .txt files)
python scripts/generate_txt_invoices.py ./sample_invoices --count 30

# Extract - describe fields interactively
docstore shell ./invoices/

# Extract with a named schema
docstore extract ./invoices/ --schema invoice_schema

# Use OpenAI, Groq, or Gemini instead of the default Anthropic provider
docstore extract ./invoices/ --schema invoice_schema --provider openai
docstore extract ./invoices/ --schema invoice_schema --provider groq
docstore extract ./invoices/ --schema invoice_schema --provider gemini

# Override the default model for any provider
docstore extract ./invoices/ --schema invoice_schema --provider gemini --model gemini-2.5-pro

# Query stored results (no LLM)
docstore query invoice_schema --filter "paid=false" --store ./invoices/.docstore

# Ask in natural language - one LLM call compiles to a filter,
# results come from cache with no per-document re-reads
docstore ask "which unpaid invoices are over $5000?" --schema invoice_schema --store ./invoices/.docstore

# Diff a changed file
docstore diff ./invoices/acme_april.pdf --schema invoice_schema

# Remove cache entries whose source file no longer exists
docstore sync --store ./invoices/.docstore        # dry run
docstore sync --store ./invoices/.docstore --yes  # delete stale entries

# Wipe the cache (optional --schema X to scope)
docstore clean --store ./invoices/.docstore --yes

# Stats
docstore stats --store ./invoices/.docstore
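
The `--filter` flag and the `ask` command both end in a filter applied to cached records. As a rough sketch of that idea, the snippet below compiles a `field=value` string into a predicate. The grammar is an assumption (a single equality test); `compile_filter` and `parse_value` are hypothetical helpers, and the field name `paid` is taken from the Quickstart schema. docstore's real filter syntax may support more.

```python
from typing import Any, Callable


def parse_value(raw: str) -> Any:
    """Interpret a filter literal as a bool, a number, or a plain string."""
    lowered = raw.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return float(raw)
    except ValueError:
        return raw.strip()


def compile_filter(expr: str) -> Callable[[dict], bool]:
    """Compile 'field=value' into a predicate over one record's extracted data."""
    field, sep, raw = expr.partition("=")
    name = field.strip()
    if not sep or not name:
        raise ValueError(f"expected 'field=value', got {expr!r}")
    expected = parse_value(raw)
    return lambda data: name in data and data[name] == expected


rows = [
    {"vendor": "Acme", "amount": 120.0, "paid": False},
    {"vendor": "Globex", "amount": 80.0, "paid": True},
]
is_unpaid = compile_filter("paid=false")
matches = [row["vendor"] for row in rows if is_unpaid(row)]
print(matches)  # ['Acme']
```

Compiling the filter once and applying it to every cached record is what keeps `ask` at a single LLM call: the model only has to produce the filter, not read the documents.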

MCP server (Claude Desktop)

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "docstore": {
      "command": "docstore-server",
      "env": {
        "DOCSTORE_DIR": "/path/to/your/.docstore",
        "DOCSTORE_PROVIDER": "anthropic",
        "ANTHROPIC_API_KEY": "your-key"
      }
    }
  }
}

Claude can then call extract, query, diff, and stats directly.

Supported providers are anthropic (default), openai, groq, and gemini. Set ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY, or GEMINI_API_KEY for the provider you choose. Each provider has a default model, and you can override it with --model on the CLI or DOCSTORE_MODEL for the MCP server.
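
Before wiring the server into Claude Desktop, it can help to check that the environment matches the provider you picked. The sketch below is a hypothetical preflight helper, not part of docstore; it only encodes the provider names and environment variables listed above, and it assumes the provider falls back to anthropic when DOCSTORE_PROVIDER is unset.

```python
import os
from typing import Mapping, Optional

# Provider name -> API key variable, as listed in this README.
API_KEY_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "groq": "GROQ_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def check_provider_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Report which provider is selected and whether its API key is set."""
    env = os.environ if env is None else env
    provider = env.get("DOCSTORE_PROVIDER", "anthropic")
    if provider not in API_KEY_ENV:
        raise ValueError(f"unsupported provider: {provider!r}")
    key_var = API_KEY_ENV[provider]
    return {
        "provider": provider,
        "key_var": key_var,
        "has_key": bool(env.get(key_var)),
        "model_override": env.get("DOCSTORE_MODEL"),  # None means the provider default
    }


# Mirrors the "env" block of a claude_desktop_config.json entry.
report = check_provider_env({"DOCSTORE_PROVIDER": "groq", "GROQ_API_KEY": "your-key"})
print(report)
```

Calling `check_provider_env()` with no argument inspects the current process environment instead.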


How it works

┌──────────────────────────────────────────────────────────────────┐
│                         docstore pipeline                        │
│                                                                  │
│  document.pdf                                                    │
│       │                                                          │
│       ▼                                                          │
│  ┌─────────┐     cache hit?   ──────────────────────────────┐    │
│  │  Parser │  ─────────────►  .docstore/{key}.json          │    │
│  └─────────┘     (no LLM)     ──────────────────────────────┘    │
│       │                                                          │
│       │ cache miss                                               │
│       ▼                                                          │
│  ┌───────────┐                                                   │
│  │ Extractor │  1 LLM call - extract fields against schema       │
│  └───────────┘                                                   │
│       │                                                          │
│       │   (opt-in via --validate)                                │
│       ▼                                                          │
│  ┌╌╌╌╌╌╌╌╌╌╌╌┐                                                   │
│  ╎ Validator ╎  +1 LLM call - sanity-check extracted values      │
│  └╌╌╌╌╌╌╌╌╌╌╌┘                                                   │
│       │                                                          │
│       ▼                                                          │
│  .docstore/{file_hash}__{schema}__{version}.json                 │
└──────────────────────────────────────────────────────────────────┘

The validator is off by default - cold extraction is one LLM call per file. Pass --validate to add a plausibility check (doubles cost; see the CLI reference for trade-offs).
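
The call counts implied by the diagram can be checked with a small simulation. This is a sketch under stated assumptions: `run_pipeline` and `fake_llm` are hypothetical stand-ins, the cache is a plain dict, and documents are keyed by name rather than by the hash-based cache key.

```python
from typing import Callable


def run_pipeline(docs: dict, cache: dict, llm: Callable[[str], dict], validate: bool = False) -> int:
    """Process every document and return how many LLM calls were made."""
    calls = 0
    for key, text in docs.items():
        if key in cache:
            continue  # cache hit: served locally, no LLM call
        fields = llm(f"extract: {text}")  # cache miss: one extraction call
        calls += 1
        if validate:
            llm(f"validate: {fields}")  # opt-in plausibility check: one more call
            calls += 1
        cache[key] = fields
    return calls


def fake_llm(prompt: str) -> dict:
    """Stand-in for a provider client; returns a dummy payload."""
    return {"echo": prompt}


docs = {f"invoice_{i}": f"text {i}" for i in range(200)}

cold = run_pipeline(docs, {}, fake_llm)
cache: dict = {}
cold_validated = run_pipeline(docs, cache, fake_llm, validate=True)
warm = run_pipeline(docs, cache, fake_llm)
print(cold, cold_validated, warm)  # 200 400 0
```

For 200 files, a cold run makes 200 calls by default and 400 with validation, while any warm run makes none.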

Cache key: sha256(file_bytes)[:16] + schema_name + sha256(json.dumps(fields, sort_keys=True))[:12]

The cache invalidates automatically when:

  • The file content changes (file hash changes)
  • The schema changes (schema version changes)
  • A different schema is applied to the same file (different key)
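
The key formula and the three invalidation cases can be illustrated as follows. This is a minimal sketch, assuming hex digests, the `__` separator shown in the diagram, and a field mapping of name to type name; `cache_key` is a hypothetical helper, not docstore's own function.

```python
import hashlib
import json


def cache_key(file_bytes: bytes, schema_name: str, fields: dict) -> str:
    """Build a key shaped like {file_hash}__{schema}__{version}."""
    file_hash = hashlib.sha256(file_bytes).hexdigest()[:16]
    # Sorting keys makes the version independent of field declaration order.
    canonical = json.dumps(fields, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(canonical).hexdigest()[:12]
    return f"{file_hash}__{schema_name}__{version}"


content = b"Invoice 001: ACME, 120.00 USD"
fields = {"vendor": "str", "amount": "float", "paid": "bool"}
key = cache_key(content, "InvoiceSchema", fields)

# Each trigger produces a different key, so the stale entry is never matched.
edited_file = cache_key(b"Invoice 001: ACME, 125.00 USD", "InvoiceSchema", fields)
edited_schema = cache_key(content, "InvoiceSchema", {**fields, "currency": "str"})
other_schema = cache_key(content, "ContractSchema", fields)

print(len({key, edited_file, edited_schema, other_schema}))  # 4
```

Because the key is derived from content rather than from a path or timestamp, renaming or touching a file without changing its bytes would still hit the cache under this scheme.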

Schema definition

Two ways to define a schema:

1. Python class (recommended for code)

from docstore import ExtractionSchema

class ContractSchema(ExtractionSchema):
    parties: list
    start_date: str
    end_date: str
    obligations: list
    auto_renews: bool

2. Natural language via CLI (recommended for ad-hoc use)

docstore shell ./contracts/
# > vendor name, contract start date, expiry date, whether it auto-renews

The orchestrator normalises your description into a canonical schema and shows it to you before running.
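
For the Python-class route, the sketch below shows one way a class's annotations could be turned into a canonical field mapping. It uses a plain class and `typing.get_type_hints` as a stand-in; `ContractFields` and `describe` are hypothetical, and the real `ExtractionSchema.to_descriptor()` may produce a different structure.

```python
import json
from typing import get_type_hints


class ContractFields:
    """Plain stand-in for an ExtractionSchema subclass."""
    parties: list
    start_date: str
    end_date: str
    obligations: list
    auto_renews: bool


def describe(schema_cls: type) -> dict:
    """Map each annotated field name to the name of its type."""
    return {name: tp.__name__ for name, tp in get_type_hints(schema_cls).items()}


fields = describe(ContractFields)
# A sorted JSON dump gives a stable, order-independent representation.
canonical = json.dumps(fields, sort_keys=True)
print(canonical)
# {"auto_renews": "bool", "end_date": "str", "obligations": "list", "parties": "list", "start_date": "str"}
```

A stable representation of this kind is what lets a schema edit, such as adding or retyping a field, be detected as a new schema version.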


Supported file types

PDF, DOCX, TXT, MD, CSV, HTML, JSON

PDF support covers documents with embedded/selectable text. Scanned or image-only PDFs need OCR, which docstore does not support yet.
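
When pointing docstore at a mixed directory, a quick pre-check shows which files fall outside the supported list. The helper below is a hypothetical sketch that matches on file extension only, using the types named above; it cannot tell a text PDF from a scanned one.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Extensions for the file types listed in this README.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md", ".csv", ".html", ".json"}


def partition_by_support(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split file names into (supported, skipped) by extension, case-insensitively."""
    supported: List[str] = []
    skipped: List[str] = []
    for path in map(Path, paths):
        bucket = supported if path.suffix.lower() in SUPPORTED_SUFFIXES else skipped
        bucket.append(path.name)
    return supported, skipped


supported, skipped = partition_by_support(
    ["invoices/acme_april.pdf", "invoices/notes.MD", "invoices/scan.png", "invoices/ledger.xlsx"]
)
print(supported)  # ['acme_april.pdf', 'notes.MD']
print(skipped)    # ['scan.png', 'ledger.xlsx']
```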


Limitations vs Lumient

docstore is a single-document extraction cache. It does not:

  • Compose records across documents (invoice + Stripe → reconciliation status)
  • Trigger automatically when files arrive
  • Maintain a queryable entity layer with lineage
  • Support multi-step workflow logic (validate, diff, generate)
  • Provide governance and audit trails for regulated industries

For cross-document composition and maintained operational records, see Lumient.


Development

# uv (recommended)
uv sync --all-extras
uv run pytest
uv run ruff check .

# Or with pip
pip install -e ".[dev]"
pytest tests/

See AGENTS.md for architectural invariants and CONTRIBUTING.md for the PR process.


License

MIT
