Plug-and-play, zero-shot document processing pipelines.

Zero-shot document processing made easy.

sieves is a library for zero-shot document AI with structured generation. It helps you build document AI pipelines quickly, with validated output formats. No training required.

Read our documentation at sieves.ai. An automatically generated version (courtesy of Devin via DeepWiki) is available here.

[!WARNING] sieves is in active development and currently in beta. Be advised that the API might change between minor version updates.

Quick Start

1. Install

pip install sieves

2. Run this example - Text classification with local models:

import outlines
import transformers
from sieves import Pipeline, tasks, Doc

# Create model and pipeline
model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
model = outlines.models.from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained(model_name),
    transformers.AutoTokenizer.from_pretrained(model_name)
)

pipeline = Pipeline(
    tasks.Classification(
        labels=["technology", "sports", "politics"],
        model=model
    )
)

# Process text
doc = Doc(text="The new smartphone features advanced AI capabilities.")
results = list(pipeline([doc]))
print(results[0].results)  # "technology"

3. Explore the docs

Read the guides · Browse examples

Advanced Example: Information Extraction from PDFs

This example shows how to extract structured information from a scientific paper (PDF) using

  • a remote LLM with DSPy via OpenRouter
  • chunking
  • the sieves.tasks.InformationExtraction task

We're using this setup to extract mathematical equations from the paper.

    import dspy
    import os
    import tokenizers
    from sieves import tasks, Doc
    import pydantic
    import chonkie

    # Define schema for extraction.
    class Equation(pydantic.BaseModel, frozen=True):
        id: str = pydantic.Field(description="ID/index of equation in paper.")
        equation: str = pydantic.Field(description="Equation as shown in paper.")

    # Create model instance using OpenRouter.
    model = dspy.LM(
        "openrouter/google/gemini-2.5-flash-lite-preview-09-2025",
        api_base="https://openrouter.ai/api/v1/",
        api_key=os.environ["OPENROUTER_API_KEY"]
    )

    # Create pipeline with PDF ingestion, chunking, and extraction.
    pipeline = (
        tasks.Ingestion(export_format="markdown") +
        tasks.Chunking(chonkie.TokenChunker(tokenizers.Tokenizer.from_pretrained("gpt2"))) +
        tasks.InformationExtraction(entity_type=Equation, model=model)
    )

    # Process a paper with equations as PDF.
    pdf_path = "https://arxiv.org/pdf/1204.0162"
    doc = Doc(uri=pdf_path)
    results = list(pipeline([doc]))

    # Access extracted entities.
    if results[0].results.get("InformationExtraction"):
        for equation in results[0].results["InformationExtraction"]:
            print(equation)

The output will look similar to this:

id='(1)' equation="the observer measures not the linear but angular ... both cars are near the stop sign."
id='(3)' equation='\\omega(t) = \\frac{r_0 v(t)}{r_0^2 + x(t)^2}'
id='(4)' equation='\\tan \\alpha(t) = \\frac{x(t)}{r_0}'
id='(5)' equation='x(t) = \\frac{a_0 t^2}{2}'
id='(6)' equation="\\frac{d}{dt} f(t) = f'(t)"
id='(7)' equation='\\omega(t) = \\frac{a_0 t}{r_0} \\left( 1 + \\frac{a_0^2 t^4}{4 r_0^2} \\right)^{-1}'
id='(8)' equation='x(t) = x_0 + v_0 t + \\frac{1}{2} a t^2'

Requirements: Install PDF parsing support:

pip install "sieves[ingestion]"

See Ingestion Guide for more PDF parsing options.


Key Features

Zero-shot NLP, ready to use

  • 🎯 No training required - immediate inference with zero-shot models (LLMs and local models)
  • 📋 Built-in tasks: classification, extraction, NER, summarization, sentiment analysis, PII masking, QA
  • 🔄 Unified interface for DSPy, Outlines, LangChain, GLiNER2, Transformers

Production-ready pipelines

  • 🔍 Observable execution with conditional task logic
  • 💾 Caching to avoid redundant model calls (see the sketch after this list)
  • 📦 Pipeline serialization and configuration management
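
A minimal sketch of the caching behavior, assuming a model object created as in the Quick Start (exact cache configuration is covered in the docs):

from sieves import Doc, Pipeline, tasks

pipeline = Pipeline(
    tasks.Classification(labels=["technology", "sports"], model=model)
)

doc = Doc(text="The match ended 2-1 after extra time.")

# The first pass invokes the model; a second pass over the same doc
# should be answered from the cache instead of calling the model again.
first = list(pipeline([doc]))
second = list(pipeline([doc]))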

Full NLP workflow

  • 📄 Document parsing: Docling, Marker (optional)
  • ✂️ Text chunking: Chonkie integration
  • 🚀 Prompt optimization: DSPy MIPROv2
  • 👨‍🏫 Model distillation: SetFit, Model2Vec

Installation

Requirements: Python 3.12 (exact version)

pip install sieves

Optional extras:

pip install "sieves[ingestion]"  # PDF/DOCX parsing (docling, marker)
pip install "sieves[distill]"     # Model distillation (setfit, model2vec)

⚠️ Important: sieves requires Python 3.12.x. This is because some dependencies like pyarrow don't have prebuilt wheels for Python 3.13+ yet, which would require manual compilation. Support for newer Python libraries is on the roadmap.

Why sieves?

Building document AI prototypes means juggling multiple tools: one for structured output, another for parsing, one for chunking, another for optimization. There are many options for structured output, each with its own pros and cons - and very different APIs. This is arduous when what you actually want is to hit the ground running and build a prototype quickly.

To address this, sieves provides a unified pipeline for the entire workflow, from document ingestion to model distillation, with validated structured outputs across multiple backends.

Best for:

  • ✅ Use case: document AI/processing
  • ✅ Rapid prototyping with zero training
  • ✅ Switching between language model backends without rewriting code
  • ✅ Building document AI pipelines with observability

Not for:

  • ❌ Use case: chat bot, RAG
  • ❌ Applications deeply coupled to LangChain/DSPy ecosystems
  • ❌ Simple one-off LLM calls without pipeline needs

Inspired by spaCy and spacy-llm.

How does sieves compare?

Feature               | sieves          | LangChain    | DSPy            | Outlines        | Transformers
----------------------|-----------------|--------------|-----------------|-----------------|-------------
Multi-backend support | ✅ All          | ❌ Own only  | ❌ Own only     | ❌ Own only     | ❌ Own only
Document parsing      | ✅ Built-in     | ✅ Via tools | ❌ No           | ❌ No           | ❌ No
Structured output     | ✅ Unified      | ✅ Yes       | ✅ Yes          | ✅ Core feature | ⚠️ Limited
Prompt optimization   | ✅ DSPy wrapper | ❌ No        | ✅ Core feature | ❌ No           | ❌ No
Model distillation    | ✅ SetFit/M2V   | ❌ No        | ✅ Yes          | ❌ No           | ⚠️ Manual
Learning curve        | Low             | Medium       | High            | Low             | Low

When to choose sieves:

  • You want to implement a document processing/AI use case
  • You want to prototype quickly without committing to a specific backend
  • You need an end-to-end document AI workflow (parsing → processing → distillation)
  • You value a unified interface over framework-specific features

When to choose alternatives:

  • You're looking for a fully featured LLM framework
  • You want to implement a chat bot or RAG use case
  • LangChain: Already deeply integrated in your production stack
  • DSPy: Research projects requiring custom optimization algorithms
  • Outlines: Simple structured generation without pipeline needs
  • Transformers: Maximum flexibility and fine-grained control

Core Concepts

sieves is built on three key abstractions:

  • Doc: Represents a document with text, metadata, and processing results
  • Task: A processing step (classification, extraction, summarization, etc.)
  • Pipeline: Orchestrates tasks with caching, serialization, and observability

→ Read the architecture guide for details on bridges, engines, and internals.
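
Putting the three together, a minimal sketch (it assumes a model object created as in the Quick Start):

from sieves import Doc, Pipeline, tasks

# Doc: wraps text (or a URI) plus metadata and accumulated results.
doc = Doc(text="Quarterly revenue grew 12% year over year.")

# Task: a single processing step; Pipeline: orchestrates one or more tasks.
pipeline = Pipeline(
    tasks.Classification(labels=["finance", "sports"], model=model)
)

# Results accumulate on the Doc as tasks run.
processed = list(pipeline([doc]))[0]
print(processed.results)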

Supported Models

sieves works with multiple NLP frameworks. Here's how to create models for each:

DSPy

See docs.

import dspy
import os

model = dspy.LM(
    "anthropic/claude-4-5-haiku",
    api_key=os.environ["ANTHROPIC_API_KEY"]
)

Outlines

See docs.

import outlines
import transformers

model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
model = outlines.models.from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained(model_name),
    transformers.AutoTokenizer.from_pretrained(model_name)
)

Transformers (Zero-Shot Classification Pipelines)

See docs.

import transformers

model = transformers.pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/xtremedistil-l6-h256-zeroshot-v1.1-all-33",
    device=0
)

LangChain

See docs. E.g. with an OpenAI model:

from langchain_openai import ChatOpenAI
import os

model = ChatOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-5-mini",
    temperature=0
)

GLiNER2

See docs.

Basic usage:

import gliner2

model = gliner2.GLiNER2.from_pretrained("fastino/gliner2-base-v1")

See the Model Setup Guide for more details and troubleshooting.

Get Started

📖 Read the guides
Start with the 5-minute tutorial

🎯 Browse examples
Explore what you can do with sieves

🤝 Join discussions
Ask questions, share projects

Frequently Asked Questions

Why "sieves"?

Filtering an LLM's potentially endless stream of unstructured text into a structured format is like sieving water to capture gold nuggets. That's where the name comes from.

Why not just prompt an LLM directly?

  • Validated outputs: Structured results with type checking
  • Observable pipelines: Debug each stage
  • Backend flexibility: Switch models without rewriting (see the sketch below)
  • Built-in tooling: Caching, serialization, optimization
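
For instance, switching backends means swapping only the model object; the task definition stays the same. A minimal sketch, reusing the model constructor shown under Supported Models:

import os

import dspy
from sieves import Pipeline, tasks

# Remote backend: an Anthropic model via DSPy (as under Supported Models).
model = dspy.LM(
    "anthropic/claude-4-5-haiku",
    api_key=os.environ["ANTHROPIC_API_KEY"]
)

# The same task definition works with an Outlines, LangChain,
# GLiNER2, or Transformers model instead.
pipeline = Pipeline(
    tasks.Classification(labels=["technology", "sports"], model=model)
)
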
How do I set up models?

See the Model Setup Guide for framework-specific examples.

Quick example: model = dspy.LM("anthropic/claude-3-haiku-20240307", api_key=os.environ["ANTHROPIC_API_KEY"])

Can I use local models?

Yes! Via Ollama, vLLM, or Transformers directly. See guide.
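
For example, a local Ollama server can be plugged in through the DSPy backend. A sketch, assuming Ollama is running locally and serving llama3.2 (model name and port are assumptions):

import dspy
from sieves import Pipeline, tasks

# Assumes a local Ollama server on the default port with llama3.2 pulled.
model = dspy.LM(
    "ollama_chat/llama3.2",
    api_base="http://localhost:11434",
    api_key=""
)

pipeline = Pipeline(
    tasks.Classification(labels=["technology", "sports"], model=model)
)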

Is sieves production-ready?

Beta status: API stable within minor versions, well-tested, used in real projects. Pin your version: pip install "sieves==0.x.*"

Attribution

sieves is inspired by spaCy and spacy-llm.

Sieve icons created by Freepik - Flaticon.
