distillcore-agents

Pydantic-AI agent layer for intelligent document processing with distillcore.

Four specialized agents collaborate to process documents: triage assesses the document and picks the optimal pipeline config, processing executes the distillcore pipeline, QA validates coverage and chunk quality, and research answers questions over stored documents using semantic search.

Installation

pip install distillcore-agents

With optional extras:

pip install "distillcore-agents[serve]"   # FastAPI WebSocket server
pip install "distillcore-agents[pdf]"     # PDF extraction support
pip install "distillcore-agents[all]"     # everything

Quick Start

Python API

import asyncio

from distillcore_agents import Orchestrator


async def main() -> None:
    async with Orchestrator(openai_api_key="sk-...") as orc:
        # Process a single document
        result = await orc.process_one("/path/to/document.pdf")

        print(result.triage.preset)              # e.g. "legal"
        print(result.triage.chunk_strategy)      # e.g. "paragraph"
        print(result.processing.chunk_count)     # e.g. 42
        print(result.qa.verified)                # e.g. True

        # Process a batch
        batch = await orc.process_batch([
            "/path/to/a.pdf",
            "/path/to/b.pdf",
            "/path/to/c.pdf",
        ])
        print(f"{batch.succeeded}/{batch.total} succeeded")

        # Ad-hoc research over stored documents
        answer = await orc.research("What are the key custody arrangements?")
        print(answer.answer)
        for cite in answer.citations:
            print(f"  [{cite.source_filename}] chunk {cite.chunk_index}: {cite.text_snippet[:80]}")


# The async context manager must run inside an event loop
asyncio.run(main())

WebSocket Server

pip install "distillcore-agents[serve]"

# Optional: require clients to authenticate with this key
export DISTILLCORE_API_KEY="your-server-key"
# OpenAI key used for LLM calls and embeddings
export OPENAI_API_KEY="sk-..."

distillcore-agents
# Server runs on http://127.0.0.1:8000

Connect via WebSocket at ws://127.0.0.1:8000/ws/agent:

// Authenticate (required only if DISTILLCORE_API_KEY is set on the server)
{"type": "auth", "api_key": "your-server-key"}

// Process a file
{"type": "process", "id": "run-1", "source": "/path/to/doc.pdf"}

// Process raw text
{"type": "process_text", "id": "run-2", "text": "The full text content..."}

The server streams agent_event messages as each agent works, then sends a final result message with the complete pipeline output.
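The message shapes above can be sketched with the standard library alone. This is only an illustration: the helper names (build_auth, build_process, build_process_text, route_message) are hypothetical, and the routing relies solely on the type field, since any other fields of agent_event and result messages are not documented here. Sending the strings over a socket is left to whichever WebSocket client you use.

```python
import json


def build_auth(api_key: str) -> str:
    """Serialize the auth message (only needed when the server has a key set)."""
    return json.dumps({"type": "auth", "api_key": api_key})


def build_process(run_id: str, source: str) -> str:
    """Serialize a request to process a file by path."""
    return json.dumps({"type": "process", "id": run_id, "source": source})


def build_process_text(run_id: str, text: str) -> str:
    """Serialize a request to process raw text."""
    return json.dumps({"type": "process_text", "id": run_id, "text": text})


def route_message(raw: str) -> str:
    """Classify a server message as 'progress', 'final', or 'unknown'."""
    kind = json.loads(raw).get("type")
    if kind == "agent_event":
        return "progress"  # streamed while an agent is working
    if kind == "result":
        return "final"     # complete pipeline output
    return "unknown"
```

A client loop would typically keep reading until route_message returns "final" for the run it started.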

Architecture

                    Orchestrator
                         |
         +-------+-------+-------+---------+
         |       |               |         |
      Triage  Processing        QA     Research
         |       |               |         |
         +-------+-------+-------+---------+
                         |
                  DistillcoreClient
                         |
                    distillcore
              (extract, classify, structure,
               chunk, enrich, embed, validate)

Agents

| Agent      | Input                  | Output             | What it does                                                            |
|------------|------------------------|--------------------|-------------------------------------------------------------------------|
| Triage     | Document path          | TriageDecision     | Previews first page, picks preset, chunking strategy, OCR settings      |
| Processing | Triage config          | ProcessingDecision | Runs the full distillcore pipeline, saves result to store               |
| QA         | Processing metrics     | QADecision         | Validates coverage thresholds, recommends parameter adjustments         |
| Research   | Natural language query | ResearchResult     | Embeds query, searches stored chunks, synthesizes answer with citations |
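To make the Research agent's retrieval step more concrete, here is a minimal sketch of a semantic search over stored chunks. It is not the library's implementation: it uses brute-force cosine similarity as a stand-in for whatever the store does internally, the StoredChunk class and top_k function are hypothetical, and the toy two-dimensional vectors stand in for real embeddings. Only the citation fields (source_filename, chunk_index, text_snippet) come from the Quick Start example.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredChunk:
    """Hypothetical record for a stored chunk and its embedding."""
    source_filename: str
    chunk_index: int
    text: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding: list[float], chunks: list[StoredChunk], k: int = 3) -> list[dict]:
    """Return citation-style dicts for the k chunks most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return [
        {
            "source_filename": c.source_filename,
            "chunk_index": c.chunk_index,
            "text_snippet": c.text[:80],
        }
        for c in ranked[:k]
    ]
```

In the real pipeline the ranked snippets would then be handed to the LLM to synthesize an answer; that step is omitted here.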

DistillcoreClient

DistillcoreClient is the shared dependency object (deps_type) for all agents. It wraps distillcore's Python API and manages:

  • Document store (Store) for persistence and semantic search
  • Embedding function (openai_embedder) for vectorizing queries
  • Extraction (sync and async) for document text extraction
  • Presets for domain-specific pipeline configuration
  • Coverage validation for quality checks

import asyncio

from distillcore_agents import DistillcoreClient


async def main() -> None:
    async with DistillcoreClient(
        store_path="~/.distillcore/store.db",
        tenant_id="user_123",
        openai_api_key="sk-...",
        embedding_model="text-embedding-3-small",
    ) as client:
        # Use directly or pass to agents
        result = client.extract_document("/path/to/doc.pdf")
        presets = client.list_presets()  # e.g. ["generic", "legal"]


asyncio.run(main())

Dual Storage

The project uses two SQLite databases:

  • Document store (~/.distillcore/store.db) -- distillcore's Store for document chunks, embeddings, and metadata
  • Agent results store (~/.distillcore/agents.db) -- pipeline run history with triage decisions, coverage metrics, QA recommendations

Pipeline Flow

  1. Triage extracts the first page and examines the content:

    • Picks a domain preset ("generic" or "legal")
    • Sets chunking parameters (target_tokens, overlap_chars, strategy, min_tokens)
    • Detects if OCR is needed (sparse text per page)
    • Adjusts for document size (larger windows for long docs, smaller chunks for short ones)
  2. Processing builds a DistillConfig from the triage decision and runs the full pipeline:

    • Extract -> Classify -> Structure -> Chunk -> Enrich -> Embed -> Validate
    • Saves the processed document to the store
  3. QA checks the validation metrics against thresholds:

    • Structuring coverage >= 0.95
    • Chunking coverage >= 0.98
    • End-to-end coverage >= 0.93
    • If any fail, recommends specific parameter adjustments (e.g., min_tokens=50 for empty chunks)
  4. Research (standalone or post-pipeline) answers questions over stored documents:

    • Embeds the query, searches the vector store
    • Synthesizes an answer with citations (filename, chunk index, text snippet)
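The QA step's threshold check can be sketched as follows. The three minimums are the ones listed above; everything else is an assumption for illustration, including the metric key names, the failed_checks and is_verified helpers, and the choice to treat a missing metric as a failure.

```python
# Minimum coverage per stage, as listed in the Pipeline Flow section.
# The key names are hypothetical.
THRESHOLDS = {
    "structuring": 0.95,
    "chunking": 0.98,
    "end_to_end": 0.93,
}


def failed_checks(coverage: dict[str, float]) -> list[str]:
    """Return the names of metrics below their threshold.

    A metric missing from `coverage` counts as 0.0 and therefore fails
    (an assumption of this sketch).
    """
    return [
        name
        for name, minimum in THRESHOLDS.items()
        if coverage.get(name, 0.0) < minimum
    ]


def is_verified(coverage: dict[str, float]) -> bool:
    """True only when every metric meets its threshold."""
    return not failed_checks(coverage)
```

A caller could use the list returned by failed_checks to decide which parameters to adjust before re-running the pipeline.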

Streaming

The Orchestrator supports streaming via process_one_stream(), which yields (event, result) pairs: an AgentEvent as each agent works, with result set to None until the final pipeline result is available:

async with Orchestrator(openai_api_key="sk-...") as orc:
    async for event, result in orc.process_one_stream("/path/to/doc.pdf"):
        print(f"[{event.data.get('agent', '?')}] {event.event_type}")
        if result is not None:
            print(f"Done: {result.qa.verified}")

Event types: started, tool_call, tool_result, completed, error.
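The consumption pattern can be tried without the library by substituting a stand-in stream. In this sketch, FakeEvent and fake_stream are hypothetical replacements for AgentEvent and process_one_stream(); only the event_type and data attributes and the (event, result) pairing are taken from the example above. Stopping on an error event and receiving the final result alongside a completed event are assumptions.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeEvent:
    """Hypothetical stand-in for AgentEvent."""
    event_type: str
    data: dict = field(default_factory=dict)


async def fake_stream():
    """Stand-in for process_one_stream(): yields (event, result) pairs."""
    yield FakeEvent("started", {"agent": "triage"}), None
    yield FakeEvent("completed", {"agent": "triage"}), None
    yield FakeEvent("completed", {"agent": "qa"}), {"verified": True}


async def consume(stream):
    """Log one line per event and return the log with the final result."""
    log, final = [], None
    async for event, result in stream:
        log.append(f"[{event.data.get('agent', '?')}] {event.event_type}")
        if event.event_type == "error":
            break  # assumption: stop reading after an error event
        if result is not None:
            final = result
    return log, final


if __name__ == "__main__":
    lines, outcome = asyncio.run(consume(fake_stream()))
    print("\n".join(lines))
    print(outcome)
```

Swapping fake_stream() for orc.process_one_stream(path) should leave consume unchanged, provided the real events expose the same two attributes.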

Configuration

Environment Variables

| Variable            | Purpose                               |
|---------------------|---------------------------------------|
| OPENAI_API_KEY      | OpenAI API key for LLM and embeddings |
| DISTILLCORE_API_KEY | Server authentication key (optional)  |

Orchestrator Options

Orchestrator(
    model="openai:gpt-4o-mini",       # Pydantic-AI model for agents
    store_path="~/.distillcore/agents.db",  # Agent results DB
    doc_store_path="~/.distillcore/store.db",  # Document store DB
    openai_api_key="sk-...",
    tenant_id="user_123",              # Scope document access
    max_concurrency=3,                 # Batch parallelism
)
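One common way to implement a limit like max_concurrency is an asyncio.Semaphore around each task. The sketch below shows that general technique; whether Orchestrator.process_batch works this way is not stated in this README, and run_bounded is a hypothetical helper.

```python
import asyncio


async def run_bounded(items, worker, max_concurrency: int = 3):
    """Run worker(item) for every item, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await worker(item)

    # gather preserves input order. With return_exceptions=True a failed
    # item yields its exception in the result list instead of aborting
    # the rest of the batch.
    return await asyncio.gather(
        *(guarded(item) for item in items),
        return_exceptions=True,
    )
```

Counting the results that are not exceptions gives a succeeded/total summary similar to the one printed in the Quick Start.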

Development

git clone https://github.com/mfbaig35r/distillcore-agents.git
cd distillcore-agents
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Lint
ruff check src/ tests/

Requirements

  • Python >= 3.11
  • distillcore >= 0.7.0 (with openai extra)
  • pydantic-ai >= 0.1
  • pydantic >= 2.0

License

MIT
