Skip to main content

Universal document processing: extract, chunk, enrich, embed, validate

Project description

distillcore

Universal document processing: extract, chunk, enrich, embed, validate.

distillcore takes any document (PDF, DOCX, HTML, text, markdown) and runs it through an intelligent 7-stage pipeline — extracting text, classifying the document, breaking it into structured sections, chunking for RAG, enriching chunks with LLM-generated metadata, generating embeddings, and validating coverage at every stage.

Works as a Python library or a standalone FastMCP server. Supports sync and async pipelines, batch processing, optional SQLite persistence with cosine search, and 4 embedding providers. Domain-neutral by default, with pluggable presets for specialized domains (legal built-in).

New in v0.7: openai is now optional — standalone chunking, extraction, validation, and storage work without it. Install distillcore[openai] for LLM features.

Install

# Core (chunking, extraction, validation, storage — no API key needed)
pip install distillcore

# With LLM features (classification, structuring, enrichment, OpenAI embeddings)
pip install distillcore[openai]

# With PDF support
pip install distillcore[pdf]

# With all document formats + OpenAI
pip install distillcore[all]

# With MCP server
pip install distillcore[mcp]

# Fully offline embeddings (sentence-transformers)
pip install distillcore[local]

# Everything
pip install distillcore[all,mcp,local]

Quickstart

Chunk text (no API key needed)

from distillcore import chunk, estimate_tokens

chunks = chunk("Your document text here...", strategy="paragraph", target_tokens=500)

for i, c in enumerate(chunks):
    print(f"[{i}] {estimate_tokens(c)} tokens: {c[:80]}...")

Four strategies: "paragraph" (default), "sentence", "fixed", "llm" (requires API key).

# Sentence boundaries
chunks = chunk(text, strategy="sentence", target_tokens=300)

# Fixed sliding window with overlap
chunks = chunk(text, strategy="fixed", target_tokens=500, overlap_tokens=50)

# LLM-driven semantic chunking
chunks = chunk(text, strategy="llm", api_key="sk-...", target_tokens=500)

# Async version
from distillcore import achunk
chunks = await achunk(text, strategy="paragraph")

Process a file (full pipeline)

from distillcore import process_document

result = process_document("report.pdf")
print(f"Type: {result.document.metadata.document_type}")
print(f"Chunks: {len(result.chunks)}")
print(f"Coverage: {result.validation.end_to_end_coverage:.1%}")

Process raw text

from distillcore import process_text, DistillConfig

result = process_text(
    "Introduction\n\nThis report covers Q4 results...\n\nConclusion\n\nWe recommend...",
    config=DistillConfig(openai_api_key="sk-..."),
)

for chunk in result.chunks:
    print(f"[{chunk.chunk_index}] {chunk.topic} ({chunk.relevance})")

Async pipeline

from distillcore import process_document_async

result = await process_document_async("report.pdf")

Batch processing

from distillcore import process_batch_sync

results = process_batch_sync(
    ["doc1.pdf", "doc2.docx", "doc3.html"],
    max_concurrent=5,
)

for r in results:
    print(f"{r.document.metadata.source_filename}: {len(r.chunks)} chunks")

Or async with callbacks:

from distillcore import process_batch

results = await process_batch(
    paths,
    max_concurrent=5,
    on_result=lambda src, res: print(f"Done: {src}"),
)

Failed files don't crash the batch — each gets a ProcessingResult with passed=False.

Embedding providers

from distillcore import DistillConfig, EmbeddingConfig

# OpenAI (requires distillcore[openai])
from distillcore.embedding import openai_embedder
config = DistillConfig(embedding=EmbeddingConfig(
    embed_fn=openai_embedder("text-embedding-3-large"),
))

# Ollama (local, no API key, no pip deps)
from distillcore.embedding import ollama_embedder
config = DistillConfig(embedding=EmbeddingConfig(
    embed_fn=ollama_embedder("nomic-embed-text"),
))

# Sentence-transformers (fully offline) — pip install distillcore[local]
from distillcore.embedding import local_embedder
config = DistillConfig(embedding=EmbeddingConfig(
    embed_fn=local_embedder("all-MiniLM-L6-v2"),
))

# Cohere — pip install distillcore[cohere]
from distillcore.embedding import cohere_embedder
config = DistillConfig(embedding=EmbeddingConfig(
    embed_fn=cohere_embedder("embed-english-v3.0"),
))

Persist and search

from distillcore import Store

store = Store()  # ~/.distillcore/store.db
doc_id = store.save(result)

# Cosine similarity search
results = store.search(query_embedding=[0.1, 0.2, ...], top_k=5)

Tenant isolation:

store.save(result, tenant_id="user_123")
store.search(query_embedding, tenant_id="user_123")  # only sees this tenant's docs

Domain presets

from distillcore import process_document, DistillConfig, load_preset

# Legal domain — extracts case numbers, attorneys, court orders
result = process_document(
    "motion.pdf",
    config=DistillConfig(domain=load_preset("legal")),
)
print(result.document.metadata.extra)
# {"case_number": "2024-CV-001", "court": "Superior Court", ...}

Without LLM (zero API calls)

from distillcore import process_text, DistillConfig, DomainConfig

result = process_text(
    "Your text here...",
    config=DistillConfig(domain=DomainConfig(), enrich_chunks=False),
    embed=False,
)
# Chunking and validation still work — no API key needed

MCP Server

Run as a standalone FastMCP server:

pip install distillcore[mcp,openai]
distillcore

Environment variables

Variable Description Default
OPENAI_API_KEY OpenAI API key for LLM + embeddings
DISTILLCORE_STORE SQLite store path ~/.distillcore/store.db
DISTILLCORE_TENANT_ID Tenant isolation for multi-user
DISTILLCORE_ALLOWED_DIRS Colon-separated allowed file paths unrestricted
DISTILLCORE_EMBEDDING_MODEL Embedding model for search text-embedding-3-small

Tools

Tool Description
distill_file Process a document file through the full pipeline
distill_text Process raw text (skips extraction)
distill_batch Process multiple files concurrently
distill_chunks_only Chunk text without LLM calls
distill_validate Validate coverage between text and chunks
distill_search Semantic search across stored documents
distill_list_documents List stored documents
distill_get_document Get a document and its chunks

Pipeline stages

extract -> classify -> structure -> chunk -> enrich -> embed -> validate
  1. Extract — pull text from PDF (with OCR fallback), DOCX, HTML, TXT, or MD files
  2. Classify — LLM identifies document type, title, and domain-specific metadata
  3. Structure — LLM breaks the document into hierarchical sections (boundary-based with page ranges)
  4. Chunk — section-aware splitting with 4 strategies: paragraph, sentence, fixed, or LLM-driven
  5. Enrich — LLM tags each chunk with topic, key concepts, and relevance
  6. Embed — generate vector embeddings (OpenAI, Ollama, local, or Cohere)
  7. Validate — coverage checks at each stage (structuring 95%, chunking 98%, end-to-end 93%)

Every LLM stage degrades gracefully — if the API key is missing or a call fails, the pipeline continues with fallback values.

Supported formats

Format Extension Extra
Plain text .txt, .text included
Markdown .md, .markdown included
PDF .pdf distillcore[pdf]
Word .docx distillcore[docx]
HTML .html, .htm distillcore[html]

Custom extractors can be registered for any format:

from distillcore import register_extractor

class MyExtractor:
    formats = ["xml"]
    def extract(self, source, config=None):
        ...

register_extractor(MyExtractor())

Configuration

from distillcore import DistillConfig, ChunkConfig, EmbeddingConfig, DomainConfig

config = DistillConfig(
    # LLM (requires distillcore[openai])
    openai_api_key="sk-...",       # or set OPENAI_API_KEY env var
    openai_model="gpt-4o",

    # Chunking
    chunk=ChunkConfig(
        target_tokens=500,
        overlap_chars=200,
        max_tokens=1000,
        min_tokens=50,             # merge small chunks
        strategy="auto",           # "auto", "paragraph", "sentence", "fixed", "llm"
    ),

    # Embedding
    embedding=EmbeddingConfig(
        model="text-embedding-3-small",
        embed_fn=None,             # custom callable overrides OpenAI
    ),

    # Domain
    domain=DomainConfig(),         # or load_preset("legal")

    # Feature flags
    enrich_chunks=True,
    enable_ocr=True,

    # Security
    allowed_dirs=None,             # restrict file access (list of paths)

    # Validation thresholds
    structuring_coverage_threshold=0.95,
    chunking_coverage_threshold=0.98,
    end_to_end_coverage_threshold=0.93,

    # Progress callback
    on_progress=lambda stage, data: print(f"{stage}: {data}"),
)

API reference

Standalone chunking

Function Description
chunk(text, strategy?, target_tokens?, ...) Split text into chunks
achunk(text, ...) Async version of chunk
estimate_tokens(text, tokenizer?) Estimate token count

Pipeline (sync)

Function Description
process_document(path, config?, format?, embed?) Full pipeline from file
process_text(text, config?, filename?, embed?) Full pipeline from text
extract(path, format?) Extract text only

Pipeline (async + batch)

Function Description
process_document_async(path, config?, format?, embed?) Async full pipeline from file
process_text_async(text, config?, filename?, embed?) Async full pipeline from text
process_batch(sources, config?, max_concurrent?, on_result?) Concurrent batch processing
process_batch_sync(sources, **kwargs) Sync wrapper for batch

Embedding providers

Factory Deps API key?
openai_embedder(model, api_key) distillcore[openai] yes
ollama_embedder(model, base_url) included no
local_embedder(model, device) distillcore[local] no
cohere_embedder(model, api_key, input_type) distillcore[cohere] yes

Storage

Method Description
Store(path?) Create/open SQLite store
store.save(result, tenant_id?) Persist a ProcessingResult, returns document_id
store.search(embedding, top_k?, tenant_id?) Cosine similarity search
store.get_document(id, tenant_id?) Retrieve document metadata
store.get_chunks(id, tenant_id?) Retrieve chunks for a document
store.list_documents(type?, limit?, tenant_id?) List stored documents
store.delete_document(id, tenant_id?) Delete document and chunks
store.stats() Aggregate store statistics

Utilities

Function Description
compute_coverage(original, derived) Word-level coverage metric (0-1)
find_missing_segments(original, derived) Find gaps in coverage
safe_parse(json_str) Parse JSON with truncation repair
load_preset(name) Load a domain preset ("generic", "legal")
register_extractor(extractor) Register a custom file extractor

Security

  • Path traversal protectionallowed_dirs config restricts file access to specified directories
  • Prompt injection hardening — untrusted document content is isolated with --- BEGIN/END UNTRUSTED --- sentinels, with explicit "ignore instructions" directives
  • Tenant isolation — optional tenant_id scoping on all Store operations
  • Config validationconfig.validate() warns early if API key is missing
  • Graceful degradation — no stage failure crashes the pipeline

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

distillcore-0.7.1.tar.gz (251.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

distillcore-0.7.1-py3-none-any.whl (60.3 kB view details)

Uploaded Python 3

File details

Details for the file distillcore-0.7.1.tar.gz.

File metadata

  • Download URL: distillcore-0.7.1.tar.gz
  • Upload date:
  • Size: 251.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for distillcore-0.7.1.tar.gz
Algorithm Hash digest
SHA256 62911f45bbbbfd332a9768cd4b43d04df0e24472d475680ae7a8ad3fff574ada
MD5 0bceb4e59b2bbc260f4485ed062da69a
BLAKE2b-256 78d4eb970015bc796e6aaf4b5f6b20d988b98a4f9e0c4c3122ce1463ee1f829b

See more details on using hashes here.

Provenance

The following attestation bundles were made for distillcore-0.7.1.tar.gz:

Publisher: publish.yml on mfbaig35r/distillcore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file distillcore-0.7.1-py3-none-any.whl.

File metadata

  • Download URL: distillcore-0.7.1-py3-none-any.whl
  • Upload date:
  • Size: 60.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for distillcore-0.7.1-py3-none-any.whl
Algorithm Hash digest
SHA256 11e19cca4f28e5740d1774572d9e419a8b84bce7e69d0870b1a3f7002fc235b4
MD5 8b9f8fe4474e456d8a0917deecb6b571
BLAKE2b-256 327950c921e02c32a06aca4e8eb0ed858ff43e94ffad1daf0e0e2d806913e8dd

See more details on using hashes here.

Provenance

The following attestation bundles were made for distillcore-0.7.1-py3-none-any.whl:

Publisher: publish.yml on mfbaig35r/distillcore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page