# distillcore-agents

Pydantic-AI agent layer for intelligent document processing with distillcore.

Four specialized agents collaborate to process documents:

- **Triage** assesses the document and picks the optimal pipeline config
- **Processing** executes the distillcore pipeline
- **QA** validates coverage and chunk quality
- **Research** answers questions over stored documents using semantic search
## Installation

```bash
pip install distillcore-agents
```

With optional extras:

```bash
pip install distillcore-agents[serve]  # FastAPI WebSocket server
pip install distillcore-agents[pdf]    # PDF extraction support
pip install distillcore-agents[all]    # everything
```
## Quick Start

### Python API

```python
from distillcore_agents import Orchestrator

async with Orchestrator(openai_api_key="sk-...") as orc:
    # Process a single document
    result = await orc.process_one("/path/to/document.pdf")
    print(result.triage.preset)           # "legal"
    print(result.triage.chunk_strategy)   # "paragraph"
    print(result.processing.chunk_count)  # 42
    print(result.qa.verified)             # True

    # Process a batch
    batch = await orc.process_batch([
        "/path/to/a.pdf",
        "/path/to/b.pdf",
        "/path/to/c.pdf",
    ])
    print(f"{batch.succeeded}/{batch.total} succeeded")

    # Ad-hoc research over stored documents
    answer = await orc.research("What are the key custody arrangements?")
    print(answer.answer)
    for cite in answer.citations:
        print(f"  [{cite.source_filename}] chunk {cite.chunk_index}: {cite.text_snippet[:80]}")
```
### WebSocket Server

```bash
pip install distillcore-agents[serve]

# Optional: set API key for authentication
export DISTILLCORE_API_KEY="your-server-key"
export OPENAI_API_KEY="sk-..."

distillcore-agents
# Server runs on http://127.0.0.1:8000
```

Connect via WebSocket at `ws://127.0.0.1:8000/ws/agent`:

```jsonc
// Authenticate (optional, required if DISTILLCORE_API_KEY is set)
{"type": "auth", "api_key": "your-server-key"}

// Process a file
{"type": "process", "id": "run-1", "source": "/path/to/doc.pdf"}

// Process raw text
{"type": "process_text", "id": "run-2", "text": "The full text content..."}
```

The server streams `agent_event` messages as each agent works, then sends a final `result` message with the complete pipeline output.
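The message shapes above can be built with small helpers. A minimal sketch; the helper names are illustrative and not part of the distillcore-agents API, and the resulting strings can be sent with any WebSocket client (e.g. the third-party `websockets` package):

```python
import json

# Builders for the three documented WebSocket message types.
# Only the message shapes come from the protocol above; these
# function names are hypothetical conveniences.

def auth_message(api_key: str) -> str:
    """Authenticate (required only if DISTILLCORE_API_KEY is set on the server)."""
    return json.dumps({"type": "auth", "api_key": api_key})

def process_message(run_id: str, source: str) -> str:
    """Ask the server to process a file on disk."""
    return json.dumps({"type": "process", "id": run_id, "source": source})

def process_text_message(run_id: str, text: str) -> str:
    """Ask the server to process raw text instead of a file."""
    return json.dumps({"type": "process_text", "id": run_id, "text": text})
```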
## Architecture

```
              Orchestrator
                   |
    +----------+---+-------+---------+
    |          |           |         |
 Triage   Processing      QA     Research
    |          |           |         |
    +----------+---+-------+---------+
                   |
           DistillcoreClient
                   |
              distillcore
   (extract, classify, structure,
    chunk, enrich, embed, validate)
```
### Agents

| Agent | Input | Output | What it does |
|---|---|---|---|
| Triage | Document path | `TriageDecision` | Previews first page; picks preset, chunking strategy, OCR settings |
| Processing | Triage config | `ProcessingDecision` | Runs the full distillcore pipeline, saves result to store |
| QA | Processing metrics | `QADecision` | Validates coverage thresholds, recommends parameter adjustments |
| Research | Natural language query | `ResearchResult` | Embeds query, searches stored chunks, synthesizes answer with citations |
### DistillcoreClient

`DistillcoreClient` is the shared dependency object (`deps_type`) for all agents. It wraps distillcore's Python API and manages:

- Document store (`Store`) for persistence and semantic search
- Embedding function (`openai_embedder`) for vectorizing queries
- Extraction (sync and async) for document text extraction
- Presets for domain-specific pipeline configuration
- Coverage validation for quality checks
```python
from distillcore_agents import DistillcoreClient

async with DistillcoreClient(
    store_path="~/.distillcore/store.db",
    tenant_id="user_123",
    openai_api_key="sk-...",
    embedding_model="text-embedding-3-small",
) as client:
    # Use directly or pass to agents
    result = client.extract_document("/path/to/doc.pdf")
    presets = client.list_presets()  # ["generic", "legal"]
```
### Dual Storage

The project uses two SQLite databases:

- **Document store** (`~/.distillcore/store.db`) -- distillcore's `Store` for document chunks, embeddings, and metadata
- **Agent results store** (`~/.distillcore/agents.db`) -- pipeline run history with triage decisions, coverage metrics, QA recommendations
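Since both stores are ordinary SQLite files at the default locations above, they can be located on disk (and, if needed, inspected with the standard-library `sqlite3` module) like so; only the two paths come from the documentation, the variable names are illustrative:

```python
from pathlib import Path

# Default database locations from the section above.
# Both are plain SQLite files, so sqlite3.connect(DOC_STORE)
# would open them directly for ad-hoc inspection.
DOC_STORE = Path("~/.distillcore/store.db").expanduser()
AGENT_STORE = Path("~/.distillcore/agents.db").expanduser()
```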
## Pipeline Flow

1. **Triage** extracts the first page and examines the content:
   - Picks a domain preset (`"generic"` or `"legal"`)
   - Sets chunking parameters (`target_tokens`, `overlap_chars`, `strategy`, `min_tokens`)
   - Detects if OCR is needed (sparse text per page)
   - Adjusts for document size (larger windows for long docs, smaller chunks for short ones)

2. **Processing** builds a `DistillConfig` from the triage decision and runs the full pipeline:
   - Extract -> Classify -> Structure -> Chunk -> Enrich -> Embed -> Validate
   - Saves the processed document to the store

3. **QA** checks the validation metrics against thresholds:
   - Structuring coverage >= 0.95
   - Chunking coverage >= 0.98
   - End-to-end coverage >= 0.93
   - If any fail, recommends specific parameter adjustments (e.g., `min_tokens=50` for empty chunks)

4. **Research** (standalone or post-pipeline) answers questions over stored documents:
   - Embeds the query, searches the vector store
   - Synthesizes an answer with citations (filename, chunk index, text snippet)
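The QA gate in step 3 can be sketched as a plain function. The threshold values come from the list above; the metric dict keys are assumptions, and the actual `QADecision` model may name them differently:

```python
# Illustrative QA gate using the documented coverage thresholds.
# The metric key names here are hypothetical, not distillcore's.
THRESHOLDS = {
    "structuring_coverage": 0.95,
    "chunking_coverage": 0.98,
    "end_to_end_coverage": 0.93,
}

def qa_check(metrics: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (verified, failures) for a set of coverage metrics."""
    failures = [
        f"{name} {metrics.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if metrics.get(name, 0.0) < minimum
    ]
    return (not failures, failures)
```

A run passing every threshold verifies cleanly; a run with, say, structuring coverage of 0.90 fails with one named violation that an agent could turn into a parameter recommendation.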
## Streaming

The `Orchestrator` supports streaming via `process_one_stream()`, which yields `AgentEvent` objects as each agent works:

```python
async with Orchestrator(openai_api_key="sk-...") as orc:
    async for event, result in orc.process_one_stream("/path/to/doc.pdf"):
        print(f"[{event.data.get('agent', '?')}] {event.event_type}")
        if result is not None:
            print(f"Done: {result.qa.verified}")
```

Event types: `started`, `tool_call`, `tool_result`, `completed`, `error`.
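Streamed events can be summarized per agent. A minimal sketch using a stand-in dataclass with only the two attributes the example above reads (`event_type` and `data`); the real `AgentEvent` comes from distillcore_agents and carries more fields:

```python
from collections import Counter
from dataclasses import dataclass, field

# Stand-in for distillcore_agents' AgentEvent, reduced to the
# attributes used in the streaming example above.
@dataclass
class AgentEvent:
    event_type: str  # started, tool_call, tool_result, completed, error
    data: dict = field(default_factory=dict)

def tally_events(events: list[AgentEvent]) -> Counter:
    """Count (agent, event_type) pairs, e.g. to summarize a run."""
    return Counter(
        (event.data.get("agent", "?"), event.event_type) for event in events
    )
```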
## Configuration

### Environment Variables

| Variable | Purpose |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key for LLM and embeddings |
| `DISTILLCORE_API_KEY` | Server authentication key (optional) |
### Orchestrator Options

```python
Orchestrator(
    model="openai:gpt-4o-mini",                # Pydantic-AI model for agents
    store_path="~/.distillcore/agents.db",     # Agent results DB
    doc_store_path="~/.distillcore/store.db",  # Document store DB
    openai_api_key="sk-...",
    tenant_id="user_123",                      # Scope document access
    max_concurrency=3,                         # Batch parallelism
)
```
## Development

```bash
git clone https://github.com/mfbaig35r/distillcore-agents.git
cd distillcore-agents
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Lint
ruff check src/ tests/
```
## Requirements

- Python >= 3.11
- distillcore >= 0.7.0 (with `openai` extra)
- pydantic-ai >= 0.1
- pydantic >= 2.0

## License

MIT