sieves

Plug-and-play, zero-shot document processing pipelines.

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
- Python :: 3.12
Topic
- Software Development :: Libraries

Project description

A Unified Interface for Document AI.

sieves provides a framework-agnostic abstraction for building document AI pipelines.

It decouples business logic from the underlying language model framework. By combining a ready-to-use task library with declarative design, sieves lets you focus on what data you need rather than how to extract it. Its consistent, type-safe API allows you to swap language model frameworks without having to rewrite your application logic.

This approach recognizes that different LM frameworks excel at different aspects of language model development:

outlines for high-performance, strictly constrained structured generation with local models.
dspy for sophisticated prompt optimization and few-shot example tuning.
langchain for broad compatibility with proprietary APIs and existing ecosystems.
gliner2 or transformers zero-shot pipelines for specialized, low-latency local inference.

sieves unifies the entire workflow:

Ingestion: Parsing PDFs, images, and Office docs (via docling).
Preprocessing: Intelligent text chunking and windowing (via chonkie).
Prediction: Zero-shot structured generation using a unified interface. Supports multiple backends: dspy, langchain, outlines, gliner2, and transformers zero-shot classification pipelines.
Distillation: Distill a specialized local model from zero-shot predictions (via setfit and model2vec).
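
To make the distillation step more concrete, here is a minimal sketch of how zero-shot predictions could be turned into training pairs for a smaller model. It uses plain Python only: `to_training_pairs`, the `Prediction` alias, the second example sentence, and the 0.8 threshold are hypothetical and not part of the sieves API.

```python
from typing import Iterable

# Hypothetical stand-in for zero-shot output: (text, [(label, score), ...]).
Prediction = tuple[str, list[tuple[str, float]]]


def to_training_pairs(
    predictions: Iterable[Prediction], min_score: float = 0.8
) -> list[tuple[str, str]]:
    """Keep the top label of each prediction if its score reaches min_score."""
    pairs: list[tuple[str, str]] = []
    for text, label_scores in predictions:
        if not label_scores:
            continue  # Nothing was predicted for this text.
        label, score = max(label_scores, key=lambda item: item[1])
        if score >= min_score:
            pairs.append((text, label))
    return pairs


predictions = [
    ("The new telescope captures images of distant galaxies.",
     [("science", 1.0), ("politics", 0.0)]),
    # Made-up low-confidence prediction, dropped by the default threshold.
    ("The committee met on Tuesday.",
     [("science", 0.4), ("politics", 0.55)]),
]
print(to_training_pairs(predictions))
```

Filtering by confidence is one possible design choice: it trades training set size for label quality, and the right threshold depends on the task.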

Define your task pipeline once, then swap execution engines without rewriting your pipeline logic. Use the ready-made task library instead of defining tasks from scratch.

Warning: sieves is in active development (Beta). The API is stable within minor versions, but we recommend pinning your version for production use.

Features

🎯 Zero Training Required: Immediate inference using zero-/few-shot models
🤖 Unified Generation Interface: Seamlessly use multiple libraries
▶️ Observable Pipelines: Easy debugging and monitoring with conditional task execution
🛠️ Integrated Tools:
- Document parsing (optional via ingestion extra): docling, marker
- Text chunking: chonkie
🏷️ Ready-to-Use Tasks:
- Multi-label classification
- Information extraction
- Relation extraction
- Summarization
- Translation
- Multi-question answering
- Aspect-based sentiment analysis
- PII (personally identifiable information) anonymization
- Named entity recognition
💾 Persistence: Save and load pipelines with configurations
📈 Evaluation: Measure pipeline and task performance against ground-truth data with deterministic metrics or LLM-based judging.
🚀 Optimization: Improve task performance by optimizing prompts and few-shot examples using DSPy's MIPROv2
🧑‍🏫 Distillation: Fine-tune smaller, specialized models using your zero-shot results with frameworks like SetFit and Model2Vec. Export results as HuggingFace Dataset for custom training.
♻️ Caching to avoid unnecessary model calls
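
The caching idea in the last item can be illustrated with a small, generic sketch. This is not how sieves implements its cache. `CachedPredictor` and `fake_model` are hypothetical names, and keying on a hash of the input text is an assumption made for illustration.

```python
import hashlib
from typing import Callable


class CachedPredictor:
    """Wraps a predict function and reuses results for identical input text."""

    def __init__(self, predict: Callable[[str], str]) -> None:
        self._predict = predict
        self._cache: dict[str, str] = {}
        self.calls = 0  # Number of times the wrapped function actually ran.

    def __call__(self, text: str) -> str:
        # Key the cache on a digest of the text rather than on the text itself.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._predict(text)
        return self._cache[key]


def fake_model(text: str) -> str:
    # Stand-in for an expensive model call.
    return "science" if "telescope" in text else "politics"


predictor = CachedPredictor(fake_model)
texts = ["The new telescope works.", "The new telescope works.", "Parliament voted."]
labels = [predictor(t) for t in texts]
print(labels, predictor.calls)
```

The second, identical text is served from the cache, so the wrapped function runs only twice for three inputs.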

Quick Start

1. Install

pip install sieves

Requires Python 3.12 (due to dependency constraints in docling and pyarrow).

2. Basic: text classification with a small local model

import outlines
import transformers
from sieves import Pipeline, tasks, Doc

# Set up model.
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = outlines.models.from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained(model_name),
    transformers.AutoTokenizer.from_pretrained(model_name)
)

# Define task.
task = tasks.Classification(labels=["science", "politics"], model=model)

# Define pipeline with the classification task.
pipeline = Pipeline(task)

# Define documents to analyze.
doc = Doc(text="The new telescope captures images of distant galaxies.")

# Run pipeline and print results.
docs = list(pipeline([doc]))
# The `results` field contains the structured task output as a unified Pydantic model.
print(docs[0].results["Classification"])
# ResultMultiLabel(label_scores=[('science', 1.0), ('politics', 0.0)])

# The `meta` field contains more information helpful for observability and debugging,
# such as raw model output and token count information.
print(docs[0].meta)
# {
#     'Classification': {
#         'raw': ['{ "science": 1.0, "politics": 0 }'],
#         'usage': {'input_tokens': 2, 'output_tokens': 2, 'chunks': [{'input_tokens': 2, 'output_tokens': 2}]}
#     },
#     'usage': {'input_tokens': 2, 'output_tokens': 2}
# }
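
As a follow-up, here is a small sketch of how such output might be consumed. To stay runnable without a model, it works on plain Python data copied from the example output above. In a real run the scores would presumably be read from the result object (the printed repr suggests a `label_scores` attribute, which is an assumption here).

```python
# Values copied from the example output above, as plain Python data.
label_scores = [("science", 1.0), ("politics", 0.0)]
meta = {
    "Classification": {
        "raw": ['{ "science": 1.0, "politics": 0 }'],
        "usage": {
            "input_tokens": 2,
            "output_tokens": 2,
            "chunks": [{"input_tokens": 2, "output_tokens": 2}],
        },
    },
    "usage": {"input_tokens": 2, "output_tokens": 2},
}

# Pick the highest-scoring label.
top_label, top_score = max(label_scores, key=lambda pair: pair[1])

# Sum the per-chunk token counts recorded for the classification task.
chunks = meta["Classification"]["usage"]["chunks"]
chunk_total = sum(c["input_tokens"] + c["output_tokens"] for c in chunks)

# Token total recorded at the top level of `meta`.
pipeline_total = meta["usage"]["input_tokens"] + meta["usage"]["output_tokens"]

print(top_label, top_score, chunk_total, pipeline_total)
```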

3. Advanced: End-to-end document AI with a hosted LLM

This example demonstrates the full power of sieves: parsing a PDF, chunking it, and extracting structured data (equations) using a remote LLM via DSPy.

Requires pip install "sieves[ingestion]"

import dspy
import os
import pydantic
import chonkie
import tokenizers
from sieves import tasks, Doc

# Define the schema of the entity to extract.
class Equation(pydantic.BaseModel, frozen=True):
    id: str = pydantic.Field(description="ID/index of equation in paper.")
    equation: str = pydantic.Field(description="Equation as shown in paper.")

# Set up DSPy model.
model = dspy.LM(
    "openrouter/google/gemini-3-flash-preview",
    api_base="https://openrouter.ai/api/v1/",
    api_key=os.environ["OPENROUTER_API_KEY"]
)

# Build pipeline: ingest -> chunk -> extract.
pipeline = (
    tasks.Ingestion() +
    tasks.Chunking(chonkie.TokenChunker(tokenizers.Tokenizer.from_pretrained("gpt2"))) +
    tasks.InformationExtraction(entity_type=Equation, model=model)
)

# Define docs to analyze.
doc = Doc(uri="https://arxiv.org/pdf/1204.0162")

# Run pipeline.
results = list(pipeline([doc]))

# Print results.
for equation in results[0].results["InformationExtraction"].entities:
    print(equation)

This gives us:

id='(1)' equation="the observer measures not the linear but angular ... both cars are near the stop sign."
id='(3)' equation='\\omega(t) = \\frac{r_0 v(t)}{r_0^2 + x(t)^2}'
id='(4)' equation='\\tan \\alpha(t) = \\frac{x(t)}{r_0}'
id='(5)' equation='x(t) = \\frac{a_0 t^2}{2}'
id='(6)' equation="\\frac{d}{dt} f(t) = f'(t)"
id='(7)' equation='\\omega(t) = \\frac{a_0 t}{r_0} \\left( 1 + \\frac{a_0^2 t^4}{4 r_0^2} \\right)^{-1}'
id='(8)' equation='x(t) = x_0 + v_0 t + \\frac{1}{2} a t^2'

Read the guides

Why `sieves`?

Building Document AI prototypes usually involves gluing together disparate tools: one library for PDF parsing, another for chunking, a third for LLM interaction, yet another for distillation, and so on. Switching from one model/framework stack (e.g., Outlines with a local model) to another (e.g., LangChain with a closed vendor LLM) often requires rewriting core logic and boilerplate.

sieves solves this by providing a vertical stack optimized for Document AI.

Best for:

✅ Document AI: End-to-end pipelines from raw file to structured data.
✅ Rapid Prototyping: Validate ideas quickly with zero-shot models; no training data needed.
✅ Backend Flexibility: Switch between Local (GLiNER, Outlines) and Remote (DSPy, LangChain) execution instantly.
✅ Observability: Built-in inspection of intermediate steps (chunks, prompts).

Not for:

❌ Chatbots or conversational agents.
❌ Simple, one-off LLM completion calls.

Feature Comparison

Feature	`sieves`	`langchain`	`dspy`	`outlines`	`transformers`	`gliner2`
Primary Focus	Document AI	General LLM apps	Declarative LM development	Structured generation	Modeling	Extraction
Backend Support	Universal	Own ecosystem	Own ecosystem	Own ecosystem	Own ecosystem	Specialized
Document Parsing	Built-in	Tool integrations	❌ No	❌ No	❌ No	❌ No
Structured Output	Unified Pydantic API	Framework-specific	Framework-specific	Core feature	⚠️ Limited	Core feature
Prompt Optimization	DSPy Integration	❌ No	✅ Core feature	❌ No	❌ No	❌ No
Model Distillation	`setfit`/`model2vec`	❌ No	✅ Yes	❌ No	⚠️ Manual	❌ No

Core Concepts

Doc: The atomic unit of data. Holds raw text, metadata, parsed content, and extraction results.
Task: A functional step in the pipeline (e.g., Ingestion, Chunking, NER, Classification).
Pipeline: A composable sequence of tasks that manages execution flow, caching, and state.
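
The relationship between these three concepts can be sketched with toy classes. This is a conceptual illustration only, not the sieves implementation: `ToyDoc`, `ToyTask`, `ToyPipeline`, `chunk`, and `count_words` are hypothetical names, and only the `+` composition style is borrowed from the Quick Start example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, Iterator


@dataclass
class ToyDoc:
    """Holds raw text plus whatever the tasks add to it."""
    text: str
    chunks: list[str] = field(default_factory=list)
    results: dict[str, Any] = field(default_factory=dict)


class ToyPipeline:
    """Runs tasks in order over each document."""

    def __init__(self, tasks: list["ToyTask"]) -> None:
        self.tasks = tasks

    def __add__(self, other: "ToyTask") -> "ToyPipeline":
        return ToyPipeline(self.tasks + [other])

    def __call__(self, docs: Iterable[ToyDoc]) -> Iterator[ToyDoc]:
        for doc in docs:
            for task in self.tasks:
                doc = task.fn(doc)
            yield doc


@dataclass
class ToyTask:
    """A named step that takes a document and returns it, enriched."""
    name: str
    fn: Callable[[ToyDoc], ToyDoc]

    def __add__(self, other: "ToyTask") -> ToyPipeline:
        return ToyPipeline([self, other])


def chunk(doc: ToyDoc, size: int = 4) -> ToyDoc:
    # Split the text into windows of `size` whitespace-separated words.
    words = doc.text.split()
    doc.chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    return doc


def count_words(doc: ToyDoc) -> ToyDoc:
    # Store one result per chunk under the task's name.
    doc.results["WordCount"] = [len(c.split()) for c in doc.chunks]
    return doc


pipeline = ToyTask("Chunking", chunk) + ToyTask("WordCount", count_words)
docs = list(pipeline([ToyDoc(text="The new telescope captures images of distant galaxies.")]))
print(docs[0].chunks)
print(docs[0].results)
```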

Supported Backends

sieves allows you to bring your own model backend. We support:

DSPy: For optimizing prompts and working with remote/local models via LiteLLM.
Outlines: For strictly constrained structured generation with local models.
LangChain: For broad compatibility with the LangChain ecosystem.
GLiNER2: For high-performance, small-model Named Entity Recognition.
Transformers: For standard Hugging Face zero-shot classification pipelines.

See the Model Setup Guide for configuration details.
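
The general pattern behind swappable backends can be sketched in plain Python. The code below is a generic illustration with toy backends and does not reflect how sieves wires up DSPy, Outlines, or the other libraries. `Backend`, `KeywordBackend`, `UniformBackend`, and `classify` are hypothetical names.

```python
from typing import Protocol


class Backend(Protocol):
    """Anything that can score a text against a list of labels."""

    def score(self, text: str, labels: list[str]) -> dict[str, float]: ...


class KeywordBackend:
    """Toy 'local' backend: a label scores 1.0 if one of its keywords occurs."""

    def __init__(self, keywords: dict[str, list[str]]) -> None:
        self.keywords = keywords

    def score(self, text: str, labels: list[str]) -> dict[str, float]:
        lowered = text.lower()
        return {
            label: float(any(k in lowered for k in self.keywords.get(label, [])))
            for label in labels
        }


class UniformBackend:
    """Toy 'remote' backend: returns the same score for every label."""

    def score(self, text: str, labels: list[str]) -> dict[str, float]:
        return {label: 1.0 / len(labels) for label in labels}


def classify(text: str, labels: list[str], backend: Backend) -> str:
    """Application logic: unchanged no matter which backend is passed in."""
    scores = backend.score(text, labels)
    return max(labels, key=lambda label: scores[label])


text = "The new telescope captures images of distant galaxies."
labels = ["science", "politics"]
keyword_backend = KeywordBackend({"science": ["telescope"], "politics": ["vote"]})
print(classify(text, labels, keyword_backend))
print(classify(text, labels, UniformBackend()))
```

Because `classify` depends only on the `Backend` protocol, replacing one backend with another is a one-argument change at the call site.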

Installation

pip install sieves

Optional extras:

pip install "sieves[ingestion]"  # PDF/DOCX parsing (docling, marker)
pip install "sieves[distill]"    # Model distillation (setfit, model2vec)

Community & Support

📖 Documentation • ❓ Chat with the sieves DeepWiki • 🤝 Discussions

Attribution

sieves is inspired by the design philosophy of spaCy and spacy-llm.

Sieve icons created by Freepik - Flaticon.

Release history

Version	Released	Notes
1.0.4	May 11, 2026
1.0.3	May 1, 2026
1.0.2	Feb 16, 2026
1.0.1	Feb 16, 2026
1.0.0	Feb 7, 2026
1.0.0rc2	Dec 28, 2025	pre-release (this version)
1.0.0rc1	Dec 28, 2025	pre-release
0.24.0	Dec 22, 2025
0.23.0	Dec 16, 2025
0.22.0	Nov 17, 2025
0.21.0	Nov 16, 2025
0.20.0	Nov 15, 2025
0.19.3	Nov 13, 2025
0.19.2	Nov 13, 2025
0.19.1	Nov 12, 2025
0.19.0	Nov 8, 2025
0.18.1	Nov 6, 2025
0.18.0	Nov 6, 2025
0.17.8	Nov 6, 2025
0.17.7	Oct 24, 2025
0.17.5	Oct 13, 2025
0.17.4	Oct 12, 2025
0.17.3	Oct 12, 2025
0.17.2	Oct 12, 2025
0.17.1	Oct 11, 2025
0.17.0	Oct 10, 2025
0.16.0	Oct 4, 2025
0.15.1	Oct 3, 2025
0.15.0	Oct 1, 2025
0.14.0	Sep 27, 2025
0.13.0	Sep 25, 2025
0.12.0	Sep 24, 2025
0.11.1	Jul 29, 2025
0.11.0	May 11, 2025
0.10.0	Apr 22, 2025
0.9.0	Apr 6, 2025
0.8.0	Mar 15, 2025
0.7.0	Feb 22, 2025
0.6.1	Feb 22, 2025
0.6.0	Feb 9, 2025
0.5.0	Feb 6, 2025
0.4.0	Jan 25, 2025
0.3.0	Jan 19, 2025
0.2.1	Jan 17, 2025
0.2.0	Jan 17, 2025
0.1.0	Jan 16, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sieves-1.0.0rc2.tar.gz (2.4 MB)

Uploaded Dec 28, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sieves-1.0.0rc2-py3-none-any.whl (640.3 kB)

Uploaded Dec 28, 2025 Python 3

File details

Details for the file sieves-1.0.0rc2.tar.gz.

File metadata

Download URL: sieves-1.0.0rc2.tar.gz
Upload date: Dec 28, 2025
Size: 2.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sieves-1.0.0rc2.tar.gz
Algorithm	Hash digest
SHA256	`0681c106e544aef7f46e2a7a96e953ea224580b4e5a5f41efdffa0299c958c58`
MD5	`9ee0b9485eff27c787f6a374465055e8`
BLAKE2b-256	`c134c320e75adf620f08f4d266ebb94d478073e59d45a7aacff61c191266ec1c`

See more details on using hashes here.

File details

Details for the file sieves-1.0.0rc2-py3-none-any.whl.

File metadata

Download URL: sieves-1.0.0rc2-py3-none-any.whl
Upload date: Dec 28, 2025
Size: 640.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sieves-1.0.0rc2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`43f180de4c6d26b43b76fec2693f1d494b41e523668fde191651a845d5671cd5`
MD5	`c8b42a3a3ca1ba16609768c7f2855198`
BLAKE2b-256	`187ec35f73844606f807b46b9835eca11af0d7f55e43b3fb71cf6f3a9c6c7149`

See more details on using hashes here.
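
A downloaded file can be checked against the digests listed above with the standard library. The following is a minimal sketch. The helper names `sha256_of` and `matches` are hypothetical, and the local file path is an assumption about where the wheel was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Expected digest of the wheel, copied from the table above.
EXPECTED_WHEEL_SHA256 = "43f180de4c6d26b43b76fec2693f1d494b41e523668fde191651a845d5671cd5"

wheel = Path("sieves-1.0.0rc2-py3-none-any.whl")  # Assumed local download path.
if wheel.exists():
    print("hash ok" if matches(wheel, EXPECTED_WHEEL_SHA256) else "hash mismatch")
```

Reading in chunks keeps memory use flat regardless of file size, which matters more for the 2.4 MB source archive than for small files but costs nothing either way.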
