Universal document processing: extract, chunk, enrich, embed, validate

Maintainers

mfbaig

distillcore

Universal document processing: extract, chunk, enrich, embed, validate.

distillcore takes a document (PDF, DOCX, HTML, plain text, or Markdown) and runs it through a 7-stage pipeline: extracting text, classifying the document, breaking it into structured sections, chunking for RAG, enriching chunks with LLM-generated metadata, generating embeddings, and validating coverage at each stage.

Works as a Python library or a standalone FastMCP server. Supports sync and async pipelines, batch processing, optional SQLite persistence with cosine search, and 4 embedding providers. Domain-neutral by default, with pluggable presets for specialized domains (legal built-in).

New in v0.7: openai is now optional — standalone chunking, extraction, validation, and storage work without it. Install distillcore[openai] for LLM features.

Install

# Core (chunking, extraction, validation, storage — no API key needed)
pip install distillcore

# With LLM features (classification, structuring, enrichment, OpenAI embeddings)
pip install distillcore[openai]

# With PDF support
pip install distillcore[pdf]

# With all document formats + OpenAI
pip install distillcore[all]

# With MCP server
pip install distillcore[mcp]

# Fully offline embeddings (sentence-transformers)
pip install distillcore[local]

# Everything
pip install distillcore[all,mcp,local]

Quickstart

Chunk text (no API key needed)

from distillcore import chunk, estimate_tokens

chunks = chunk("Your document text here...", strategy="paragraph", target_tokens=500)

for i, c in enumerate(chunks):
    print(f"[{i}] {estimate_tokens(c)} tokens: {c[:80]}...")

Four strategies: "paragraph" (default), "sentence", "fixed", "llm" (requires API key).

# Sentence boundaries
chunks = chunk(text, strategy="sentence", target_tokens=300)

# Fixed sliding window with overlap
chunks = chunk(text, strategy="fixed", target_tokens=500, overlap_tokens=50)

# LLM-driven semantic chunking
chunks = chunk(text, strategy="llm", api_key="sk-...", target_tokens=500)

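# Async version (await needs a running event loop, so wrap the call)
import asyncio
from distillcore import achunk

async def main():
    return await achunk(text, strategy="paragraph")

chunks = asyncio.run(main())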
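
To make the "paragraph" strategy more concrete, here is a minimal, self-contained sketch of greedy paragraph packing against a token target. It is an illustration only, not distillcore's implementation: `rough_tokens` (about 4 characters per token) and `pack_paragraphs` are hypothetical names introduced for this example.

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def pack_paragraphs(text: str, target_tokens: int = 500) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        size = rough_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the target.
        if current and current_tokens + size > target_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a" * 40 + "\n\n" + "b" * 40 + "\n\n" + "c" * 40
packed = pack_paragraphs(sample, target_tokens=20)
```

Paragraphs are never split here, so a single paragraph larger than the target becomes its own oversized chunk; a real chunker would likely fall back to sentence or fixed-window splitting in that case.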

Process a file (full pipeline)

from distillcore import process_document

result = process_document("report.pdf")
print(f"Type: {result.document.metadata.document_type}")
print(f"Chunks: {len(result.chunks)}")
print(f"Coverage: {result.validation.end_to_end_coverage:.1%}")

Process raw text

from distillcore import process_text, DistillConfig

result = process_text(
    "Introduction\n\nThis report covers Q4 results...\n\nConclusion\n\nWe recommend...",
    config=DistillConfig(openai_api_key="sk-..."),
)

for chunk in result.chunks:
    print(f"[{chunk.chunk_index}] {chunk.topic} ({chunk.relevance})")

Async pipeline

import asyncio
from distillcore import process_document_async

# In a script, asyncio.run supplies the event loop that await requires.
result = asyncio.run(process_document_async("report.pdf"))

Batch processing

from distillcore import process_batch_sync

results = process_batch_sync(
    ["doc1.pdf", "doc2.docx", "doc3.html"],
    max_concurrent=5,
)

for r in results:
    print(f"{r.document.metadata.source_filename}: {len(r.chunks)} chunks")

Or async with callbacks:

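import asyncio
from distillcore import process_batch

paths = ["doc1.pdf", "doc2.docx", "doc3.html"]

async def main():
    return await process_batch(
        paths,
        max_concurrent=5,
        on_result=lambda src, res: print(f"Done: {src}"),
    )

results = asyncio.run(main())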

Failed files don't crash the batch — each gets a ProcessingResult with passed=False.
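
The pattern behind "failed files don't crash the batch" can be sketched with a semaphore for bounded concurrency and per-item error capture. This is a generic asyncio illustration, not distillcore's code: `run_batch`, `demo_worker`, and the result dictionaries are hypothetical stand-ins for the library's own types.

```python
import asyncio


async def run_batch(sources, worker, max_concurrent=5):
    """Run worker(source) for every source, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(source):
        async with semaphore:
            try:
                value = await worker(source)
                return {"source": source, "passed": True, "value": value}
            except Exception as exc:  # record the failure instead of raising
                return {"source": source, "passed": False, "error": str(exc)}

    # gather preserves input order, so results line up with sources.
    return await asyncio.gather(*(guarded(s) for s in sources))


async def demo_worker(source):
    """Hypothetical worker that fails for one input to show error capture."""
    await asyncio.sleep(0)
    if source == "bad.pdf":
        raise ValueError("cannot parse")
    return len(source)


batch_results = asyncio.run(
    run_batch(["a.pdf", "bad.pdf", "c.html"], demo_worker, max_concurrent=2)
)
```

Because every item yields a result object, callers can filter on `passed` afterwards rather than wrapping the whole batch in a try/except.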

Embedding providers

from distillcore import DistillConfig, EmbeddingConfig

# OpenAI (requires distillcore[openai])
from distillcore.embedding import openai_embedder
config = DistillConfig(embedding=EmbeddingConfig(
    embed_fn=openai_embedder("text-embedding-3-large"),
))

# Ollama (local, no API key, no pip deps)
from distillcore.embedding import ollama_embedder
config = DistillConfig(embedding=EmbeddingConfig(
    embed_fn=ollama_embedder("nomic-embed-text"),
))

# Sentence-transformers (fully offline) — pip install distillcore[local]
from distillcore.embedding import local_embedder
config = DistillConfig(embedding=EmbeddingConfig(
    embed_fn=local_embedder("all-MiniLM-L6-v2"),
))

# Cohere — pip install distillcore[cohere]
from distillcore.embedding import cohere_embedder
config = DistillConfig(embedding=EmbeddingConfig(
    embed_fn=cohere_embedder("embed-english-v3.0"),
))

Persist and search

from distillcore import Store

store = Store()  # ~/.distillcore/store.db
doc_id = store.save(result)

# Cosine similarity search
results = store.search(query_embedding=[0.1, 0.2, ...], top_k=5)

Tenant isolation:

store.save(result, tenant_id="user_123")
store.search(query_embedding, tenant_id="user_123")  # only sees this tenant's docs
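
For readers unfamiliar with cosine search, the sketch below shows the idea in plain Python: filter rows by tenant, score each by cosine similarity, and keep the top results. It does not reflect how `Store` is implemented; the `rows` layout and the `cosine` and `search` helpers are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(rows, query_embedding, top_k=5, tenant_id=None):
    """Rank rows by similarity, optionally restricted to one tenant."""
    candidates = [r for r in rows if tenant_id is None or r["tenant_id"] == tenant_id]
    scored = [(cosine(r["embedding"], query_embedding), r["text"]) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


rows = [
    {"tenant_id": "user_123", "text": "alpha", "embedding": [1.0, 0.0]},
    {"tenant_id": "user_123", "text": "beta", "embedding": [0.0, 1.0]},
    {"tenant_id": "user_456", "text": "gamma", "embedding": [1.0, 0.0]},
]
hits = search(rows, [1.0, 0.0], top_k=2, tenant_id="user_123")
```

Applying the tenant filter before scoring means another tenant's rows are never ranked at all, which is the property the `tenant_id` argument is meant to provide.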

Domain presets

from distillcore import process_document, DistillConfig, load_preset

# Legal domain — extracts case numbers, attorneys, court orders
result = process_document(
    "motion.pdf",
    config=DistillConfig(domain=load_preset("legal")),
)
print(result.document.metadata.extra)
# {"case_number": "2024-CV-001", "court": "Superior Court", ...}

Without LLM (zero API calls)

from distillcore import process_text, DistillConfig, DomainConfig

result = process_text(
    "Your text here...",
    config=DistillConfig(domain=DomainConfig(), enrich_chunks=False),
    embed=False,
)
# Chunking and validation still work — no API key needed

MCP Server

Run as a standalone FastMCP server:

pip install distillcore[mcp,openai]
distillcore

Environment variables

Variable	Description	Default
`OPENAI_API_KEY`	OpenAI API key for LLM + embeddings	(none)
`DISTILLCORE_STORE`	SQLite store path	`~/.distillcore/store.db`
`DISTILLCORE_TENANT_ID`	Tenant ID for multi-user isolation	(none)
`DISTILLCORE_ALLOWED_DIRS`	Colon-separated list of allowed directories	unrestricted
`DISTILLCORE_EMBEDDING_MODEL`	Embedding model for search	`text-embedding-3-small`
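
As a rough illustration of how these variables might be read, the sketch below collects them into a dictionary and applies the defaults from the table. It is not the server's actual startup code; `read_settings` and its return shape are hypothetical.

```python
import os


def read_settings(env=None):
    """Collect the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    raw_dirs = env.get("DISTILLCORE_ALLOWED_DIRS", "")
    return {
        "store": env.get("DISTILLCORE_STORE", "~/.distillcore/store.db"),
        "tenant_id": env.get("DISTILLCORE_TENANT_ID"),
        # Empty or unset means unrestricted, represented here as None.
        "allowed_dirs": [d for d in raw_dirs.split(":") if d] or None,
        "embedding_model": env.get(
            "DISTILLCORE_EMBEDDING_MODEL", "text-embedding-3-small"
        ),
    }


settings = read_settings({"DISTILLCORE_ALLOWED_DIRS": "/data/docs:/srv/uploads"})
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test without touching the real environment.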

Tools

Tool	Description
`distill_file`	Process a document file through the full pipeline
`distill_text`	Process raw text (skips extraction)
`distill_batch`	Process multiple files concurrently
`distill_chunks_only`	Chunk text without LLM calls
`distill_validate`	Validate coverage between text and chunks
`distill_search`	Semantic search across stored documents
`distill_list_documents`	List stored documents
`distill_get_document`	Get a document and its chunks

Pipeline stages

extract -> classify -> structure -> chunk -> enrich -> embed -> validate

Extract — pull text from PDF (with OCR fallback), DOCX, HTML, TXT, or MD files
Classify — LLM identifies document type, title, and domain-specific metadata
Structure — LLM breaks the document into hierarchical sections (boundary-based with page ranges)
Chunk — section-aware splitting with 4 strategies: paragraph, sentence, fixed, or LLM-driven
Enrich — LLM tags each chunk with topic, key concepts, and relevance
Embed — generate vector embeddings (OpenAI, Ollama, local, or Cohere)
Validate — coverage checks at each stage (structuring 95%, chunking 98%, end-to-end 93%)

Every LLM stage degrades gracefully — if the API key is missing or a call fails, the pipeline continues with fallback values.
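
One common way to get this kind of graceful degradation is to wrap each stage call and substitute a fallback when it raises. The sketch below is a generic illustration under that assumption; `run_with_fallback` and `failing_classifier` are hypothetical and do not come from distillcore.

```python
def run_with_fallback(stage_name, call, fallback, warnings):
    """Return call()'s result, or fallback (with a recorded warning) on failure."""
    try:
        return call()
    except Exception as exc:
        warnings.append(f"{stage_name}: {exc}")
        return fallback


def failing_classifier():
    """Stand-in for an LLM call that cannot run, e.g. no API key."""
    raise RuntimeError("API key missing")


stage_warnings = []
doc_type = run_with_fallback("classify", failing_classifier, "unknown", stage_warnings)
```

Recording the warning alongside the fallback value lets later stages keep running while still leaving a trace of what was skipped.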

Supported formats

Format	Extension	Extra
Plain text	`.txt`, `.text`	included
Markdown	`.md`, `.markdown`	included
PDF	`.pdf`	`distillcore[pdf]`
Word	`.docx`	`distillcore[docx]`
HTML	`.html`, `.htm`	`distillcore[html]`

Custom extractors can be registered for any format:

from distillcore import register_extractor

class MyExtractor:
    formats = ["xml"]
    def extract(self, source, config=None):
        ...

register_extractor(MyExtractor())
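
A possible body for such an extractor, using only the standard library, might look like the sketch below. It assumes that `extract` may return the document text as a plain string, which the description above does not confirm; check the return type distillcore expects before registering it. `XmlTextExtractor` is a hypothetical name.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


class XmlTextExtractor:
    """Hypothetical extractor that joins all text nodes of an XML file."""

    formats = ["xml"]

    def extract(self, source, config=None):
        root = ET.parse(str(source)).getroot()
        # itertext() walks the tree in document order, yielding text and tails.
        pieces = (t.strip() for t in root.itertext())
        return "\n".join(p for p in pieces if p)


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "note.xml"
    sample_path.write_text(
        "<note><title>Q4</title><body>Revenue rose.</body></note>",
        encoding="utf-8",
    )
    extracted = XmlTextExtractor().extract(sample_path)
```

For untrusted XML, a hardened parser is generally preferable to `xml.etree.ElementTree`, which the Python documentation notes is not secure against maliciously constructed data.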

Configuration

from distillcore import DistillConfig, ChunkConfig, EmbeddingConfig, DomainConfig

config = DistillConfig(
    # LLM (requires distillcore[openai])
    openai_api_key="sk-...",       # or set OPENAI_API_KEY env var
    openai_model="gpt-4o",

    # Chunking
    chunk=ChunkConfig(
        target_tokens=500,
        overlap_chars=200,
        max_tokens=1000,
        min_tokens=50,             # merge small chunks
        strategy="auto",           # "auto", "paragraph", "sentence", "fixed", "llm"
    ),

    # Embedding
    embedding=EmbeddingConfig(
        model="text-embedding-3-small",
        embed_fn=None,             # custom callable overrides OpenAI
    ),

    # Domain
    domain=DomainConfig(),         # or load_preset("legal")

    # Feature flags
    enrich_chunks=True,
    enable_ocr=True,

    # Security
    allowed_dirs=None,             # restrict file access (list of paths)

    # Validation thresholds
    structuring_coverage_threshold=0.95,
    chunking_coverage_threshold=0.98,
    end_to_end_coverage_threshold=0.93,

    # Progress callback
    on_progress=lambda stage, data: print(f"{stage}: {data}"),
)

API reference

Standalone chunking

Function	Description
`chunk(text, strategy?, target_tokens?, ...)`	Split text into chunks
`achunk(text, ...)`	Async version of chunk
`estimate_tokens(text, tokenizer?)`	Estimate token count

Pipeline (sync)

Function	Description
`process_document(path, config?, format?, embed?)`	Full pipeline from file
`process_text(text, config?, filename?, embed?)`	Full pipeline from text
`extract(path, format?)`	Extract text only

Pipeline (async + batch)

Function	Description
`process_document_async(path, config?, format?, embed?)`	Async full pipeline from file
`process_text_async(text, config?, filename?, embed?)`	Async full pipeline from text
`process_batch(sources, config?, max_concurrent?, on_result?)`	Concurrent batch processing
`process_batch_sync(sources, **kwargs)`	Sync wrapper for batch

Embedding providers

Factory	Deps	API key?
`openai_embedder(model, api_key)`	`distillcore[openai]`	yes
`ollama_embedder(model, base_url)`	included	no
`local_embedder(model, device)`	`distillcore[local]`	no
`cohere_embedder(model, api_key, input_type)`	`distillcore[cohere]`	yes

Storage

Method	Description
`Store(path?)`	Create/open SQLite store
`store.save(result, tenant_id?)`	Persist a ProcessingResult, returns document_id
`store.search(embedding, top_k?, tenant_id?)`	Cosine similarity search
`store.get_document(id, tenant_id?)`	Retrieve document metadata
`store.get_chunks(id, tenant_id?)`	Retrieve chunks for a document
`store.list_documents(type?, limit?, tenant_id?)`	List stored documents
`store.delete_document(id, tenant_id?)`	Delete document and chunks
`store.stats()`	Aggregate store statistics

Utilities

Function	Description
`compute_coverage(original, derived)`	Word-level coverage metric (0-1)
`find_missing_segments(original, derived)`	Find gaps in coverage
`safe_parse(json_str)`	Parse JSON with truncation repair
`load_preset(name)`	Load a domain preset ("generic", "legal")
`register_extractor(extractor)`	Register a custom file extractor
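
To give a feel for what a word-level coverage metric measures, here is one plausible definition: the fraction of the original's words, counted with multiplicity, that also appear in the derived text. This is a sketch of the general idea and may differ from what `compute_coverage` actually computes; `word_coverage` is a hypothetical name.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_coverage(original: str, derived: str) -> float:
    """Fraction of the original's words (with multiplicity) present in derived."""
    original_counts = Counter(tokenize(original))
    derived_counts = Counter(tokenize(derived))
    total = sum(original_counts.values())
    if total == 0:
        return 1.0  # nothing to cover
    # Counter & Counter keeps the minimum count per word.
    covered = sum((original_counts & derived_counts).values())
    return covered / total


score = word_coverage("the quick brown fox", "The quick fox")
```

Under this definition, a score below a threshold such as 0.98 would indicate that some of the original's words never made it into the chunks.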

Security

Path traversal protection — allowed_dirs config restricts file access to specified directories
Prompt injection hardening — untrusted document content is isolated with --- BEGIN/END UNTRUSTED --- sentinels, with explicit "ignore instructions" directives
Tenant isolation — optional tenant_id scoping on all Store operations
Config validation — config.validate() warns early if API key is missing
Graceful degradation — no stage failure crashes the pipeline
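
The path traversal check in the first item can be sketched as follows: resolve the requested path, then accept it only if it sits inside one of the allowed directories. This illustrates the general technique with `pathlib` and is not distillcore's own check; `is_allowed` is a hypothetical helper.

```python
from pathlib import Path


def is_allowed(path, allowed_dirs):
    """Return True if path resolves inside one of allowed_dirs (None = unrestricted)."""
    if allowed_dirs is None:
        return True
    resolved = Path(path).resolve()
    for directory in allowed_dirs:
        base = Path(directory).resolve()
        # Resolving first collapses ".." segments and symlinks before comparing.
        if resolved == base or base in resolved.parents:
            return True
    return False


inside = is_allowed("/data/docs/report.pdf", ["/data/docs"])
escaped = is_allowed("/data/docs/../secret.txt", ["/data/docs"])
```

Comparing resolved paths rather than raw strings is what defeats inputs like `docs/../secret.txt`, which would pass a naive prefix check.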

License

MIT

Release history

Version	Released
0.7.1	Apr 28, 2026
0.7.0	Apr 27, 2026
0.6.1	Apr 27, 2026
0.6.0	Apr 27, 2026
0.5.0	Apr 27, 2026
0.4.0	Apr 23, 2026
0.3.0	Apr 23, 2026
0.2.0	Apr 23, 2026
0.1.0	Apr 23, 2026

Download files

Source Distribution

distillcore-0.7.1.tar.gz (251.2 kB)

Uploaded Apr 28, 2026 Source

Built Distribution

distillcore-0.7.1-py3-none-any.whl (60.3 kB)

Uploaded Apr 28, 2026 Python 3

File details

Details for the file distillcore-0.7.1.tar.gz.

File metadata

Download URL: distillcore-0.7.1.tar.gz
Upload date: Apr 28, 2026
Size: 251.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for distillcore-0.7.1.tar.gz
Algorithm	Hash digest
SHA256	`62911f45bbbbfd332a9768cd4b43d04df0e24472d475680ae7a8ad3fff574ada`
MD5	`0bceb4e59b2bbc260f4485ed062da69a`
BLAKE2b-256	`78d4eb970015bc796e6aaf4b5f6b20d988b98a4f9e0c4c3122ce1463ee1f829b`
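
A downloaded file can be checked against the SHA256 digest above with the standard library. The sketch below streams the file through `hashlib` so large archives do not need to fit in memory; `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare against a published digest, ignoring case and surrounding whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("distillcore-0.7.1.tar.gz", "62911f45bbbbfd332a9768cd4b43d04df0e24472d475680ae7a8ad3fff574ada")` should return True for an intact copy of the source distribution.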

Provenance

The following attestation bundles were made for distillcore-0.7.1.tar.gz:

Publisher: publish.yml on mfbaig35r/distillcore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: distillcore-0.7.1.tar.gz
- Subject digest: 62911f45bbbbfd332a9768cd4b43d04df0e24472d475680ae7a8ad3fff574ada
- Sigstore transparency entry: 1398092943
- Sigstore integration time: Apr 28, 2026
Source repository:
- Permalink: mfbaig35r/distillcore@d485a4e0c8ec0ebcd926381b687b04d10ab273cb
- Branch / Tag: refs/tags/v0.7.1
- Owner: https://github.com/mfbaig35r
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d485a4e0c8ec0ebcd926381b687b04d10ab273cb
- Trigger Event: release

File details

Details for the file distillcore-0.7.1-py3-none-any.whl.

File metadata

Download URL: distillcore-0.7.1-py3-none-any.whl
Upload date: Apr 28, 2026
Size: 60.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for distillcore-0.7.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`11e19cca4f28e5740d1774572d9e419a8b84bce7e69d0870b1a3f7002fc235b4`
MD5	`8b9f8fe4474e456d8a0917deecb6b571`
BLAKE2b-256	`327950c921e02c32a06aca4e8eb0ed858ff43e94ffad1daf0e0e2d806913e8dd`

Provenance

The following attestation bundles were made for distillcore-0.7.1-py3-none-any.whl:

Publisher: publish.yml on mfbaig35r/distillcore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: distillcore-0.7.1-py3-none-any.whl
- Subject digest: 11e19cca4f28e5740d1774572d9e419a8b84bce7e69d0870b1a3f7002fc235b4
- Sigstore transparency entry: 1398092953
- Sigstore integration time: Apr 28, 2026
Source repository:
- Permalink: mfbaig35r/distillcore@d485a4e0c8ec0ebcd926381b687b04d10ab273cb
- Branch / Tag: refs/tags/v0.7.1
- Owner: https://github.com/mfbaig35r
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d485a4e0c8ec0ebcd926381b687b04d10ab273cb
- Trigger Event: release
