Persistent cache for LLM document extraction: extract once, query forever. CLI, Python API, and MCP server.
Project description
docstore
dbt for unstructured data. Extract once, query forever.
Most tools re-read your documents every time you ask a question. docstore extracts the fields you care about once, caches them locally, and answers subsequent queries from the cache - no LLM calls, no re-reading, no waiting.
Run 1 (cold): 200 invoices → 400 LLM calls → ~$0.40
Run 2 (warm): 200 invoices → 0 LLM calls → $0.00
Query: "which invoices are unpaid?" → 0 LLM calls → <1s
Why this exists
LLMs made it easy to extract structured data from documents. What they did not provide is a layer that persists that extraction and invalidates it automatically when the source file changes. Every existing tool either:
- Re-reads raw documents on every query (expensive, slow)
- Requires a database or vector store (complex, overkill for most teams)
- Stores embeddings for semantic search (wrong abstraction for structured extraction)
docstore treats structured extraction as a cache over your unstructured data. Same insight as dbt applied to SQL - you define the transformation once, the system manages the state.
Benchmark
docstore ships with a reproducible public cache benchmark. It generates a
synthetic invoice corpus, writes ground_truth.jsonl, then measures:
cold_extract: empty cache, every document calls the LLM oncewarm_extract: same corpus and schema, every document is served from cachecached_query: query stored JSON locally, with no parser or LLM calls
uv run python scripts/benchmark.py /tmp/docstore-benchmark --count 30
uv run python scripts/benchmark.py /tmp/docstore-benchmark --count 30 --output json
Use --provider and --model to run it against a specific vendor. The
benchmark is intended to show cache behavior, not provider quality.
Installation
pip install lumient-docstore
All four LLM providers (Anthropic, OpenAI, Groq, Gemini) work out of the box - pick one at runtime via --provider.
Or from source:
git clone https://github.com/LumientAI/docstore
cd docstore
pip install -e ".[dev]"
Quickstart
Python API
from pathlib import Path
from docstore import DocStore, ExtractionSchema, create_llm_client
from docstore.agents.orchestrator import run_directory
class InvoiceSchema(ExtractionSchema):
vendor: str
amount: float
currency: str
due_date: str
paid: bool
invoices_dir = Path("./invoices")
# Co-locate the cache with the corpus so the CLI and Python API
# share state (the CLI's path-taking commands default to this).
store = DocStore(root=invoices_dir / ".docstore")
descriptor = InvoiceSchema.to_descriptor()
client = create_llm_client() # defaults to Anthropic; pass provider="openai" etc. to override
results = run_directory(invoices_dir, descriptor, store, client)
# Query without any LLM calls
unpaid = store.query("InvoiceSchema", lambda r: r.data.get("paid") is False)
CLI
# Generate a synthetic invoice corpus for testing (30 .txt files)
python scripts/generate_txt_invoices.py ./sample_invoices --count 30
# Extract - describe fields interactively
docstore shell ./invoices/
# Extract with a named schema
docstore extract ./invoices/ --schema invoice_schema
# Use OpenAI, Groq, or Gemini instead of the default Anthropic provider
docstore extract ./invoices/ --schema invoice_schema --provider openai
docstore extract ./invoices/ --schema invoice_schema --provider groq
docstore extract ./invoices/ --schema invoice_schema --provider gemini
# Override the default model for any provider
docstore extract ./invoices/ --schema invoice_schema --provider gemini --model gemini-2.5-pro
# Query stored results (no LLM)
docstore query invoice_schema --filter "is_paid=false" --store ./invoices/.docstore
# Ask in natural language - one LLM call compiles to a filter,
# results come from cache with no per-document re-reads
docstore ask "which unpaid invoices are over $5000?" --schema invoice_schema --store ./invoices/.docstore
# Diff a changed file
docstore diff ./invoices/acme_april.pdf --schema invoice_schema
# Remove cache entries whose source file no longer exists
docstore sync --store ./invoices/.docstore # dry run
docstore sync --store ./invoices/.docstore --yes # delete stale entries
# Wipe the cache (optional --schema X to scope)
docstore clean --store ./invoices/.docstore --yes
# Stats
docstore stats --store ./invoices/.docstore
MCP server (Claude Desktop)
Add to your claude_desktop_config.json:
{
"mcpServers": {
"docstore": {
"command": "docstore-server",
"env": {
"DOCSTORE_DIR": "/path/to/your/.docstore",
"DOCSTORE_PROVIDER": "anthropic",
"ANTHROPIC_API_KEY": "your-key"
}
}
}
}
Claude can then call extract, query, diff, and stats directly.
Supported providers are anthropic (default), openai, groq, and gemini.
Set ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY, or GEMINI_API_KEY
for the provider you choose. Each provider has a default model, and you can
override it with --model on the CLI or DOCSTORE_MODEL for the MCP server.
How it works
┌──────────────────────────────────────────────────────────────────┐
│ docstore pipeline │
│ │
│ document.pdf │
│ │ │
│ ▼ │
│ ┌─────────┐ cache hit? ──────────────────────────────┐ │
│ │ Parser │ ─────────────► .docstore/{key}.json │ │
│ └─────────┘ (no LLM) ──────────────────────────────┘ │
│ │ │
│ │ cache miss │
│ ▼ │
│ ┌───────────┐ │
│ │ Extractor │ 1 LLM call - extract fields against schema │
│ └───────────┘ │
│ │ │
│ │ (opt-in via --validate) │
│ ▼ │
│ ┌╌╌╌╌╌╌╌╌╌╌╌┐ │
│ ╎ Validator ╎ +1 LLM call - sanity-check extracted values │
│ └╌╌╌╌╌╌╌╌╌╌╌┘ │
│ │ │
│ ▼ │
│ .docstore/{file_hash}__{schema}__{version}.json │
└──────────────────────────────────────────────────────────────────┘
The validator is off by default - cold extraction is one LLM call per file. Pass --validate to add a plausibility check (doubles cost; see the CLI reference for trade-offs).
Cache key: sha256(file_bytes)[:16] + schema_name + sha256(json.dumps(fields, sort_keys=True))[:12]
The cache invalidates automatically when:
- The file content changes (file hash changes)
- The schema changes (schema version changes)
- A different schema is applied to the same file (different key)
Schema definition
Two ways to define a schema:
1. Python class (recommended for code)
from docstore import ExtractionSchema
class ContractSchema(ExtractionSchema):
parties: list
start_date: str
end_date: str
obligations: list
auto_renews: bool
2. Natural language via CLI (recommended for ad-hoc use)
docstore shell ./contracts/
# > vendor name, contract start date, expiry date, whether it auto-renews
The orchestrator normalises your description into a canonical schema and shows it to you before running.
Supported file types
PDF, DOCX, TXT, MD, CSV, HTML, JSON
PDF support covers documents with embedded/selectable text. Scanned or image-only PDFs need OCR, which docstore does not support yet.
Limitations vs Lumient
docstore is a single-document extraction cache. It does not:
- Compose records across documents (invoice + Stripe → reconciliation status)
- Trigger automatically when files arrive
- Maintain a queryable entity layer with lineage
- Support multi-step workflow logic (validate, diff, generate)
- Provide governance and audit trails for regulated industries
For cross-document composition and maintained operational records, see Lumient.
Development
# uv (recommended)
uv sync --all-extras
uv run pytest
uv run ruff check .
# Or with pip
pip install -e ".[dev]"
pytest tests/
See AGENTS.md for architectural invariants and CONTRIBUTING.md for the PR process.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file lumient_docstore-0.1.0.tar.gz.
File metadata
- Download URL: lumient_docstore-0.1.0.tar.gz
- Upload date:
- Size: 141.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
769a1dfbb46a10e68234d7c9bdd95d70bab0153b3405baeb5c655105ca257a91
|
|
| MD5 |
6cd19eb955f8c68f792c017342c52689
|
|
| BLAKE2b-256 |
095146e79d97c8f7322b4ff32671f82af954ab25b3eca91acec7f6dcfcb81d4c
|
File details
Details for the file lumient_docstore-0.1.0-py3-none-any.whl.
File metadata
- Download URL: lumient_docstore-0.1.0-py3-none-any.whl
- Upload date:
- Size: 31.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
05cb9be0a91723b7074741879ba078ef8f3a59f65530708d62689fdc7ee88b43
|
|
| MD5 |
7fb2232e18e79866f8887c94ba92a88d
|
|
| BLAKE2b-256 |
38086abc5c05b7690894d32839b45174efa16da807738c57904cfc24b4b73ce9
|