Plug-and-play, zero-shot document processing pipelines.

Zero-shot document processing made easy.

sieves is a library for zero-shot document AI with structured generation. It helps you build document AI pipelines quickly, with validated output formats. No training required.

Read our documentation at sieves.ai. An automatically generated version (courtesy of Devin via DeepWiki) is available here.

[!WARNING] sieves is in active development and currently in beta. Be advised that the API might change between minor version updates.

Quick Start

1. Install

pip install sieves

2. Run this example - Text classification with local models:

import outlines
import transformers
from sieves import Pipeline, tasks, Doc

# Create model and pipeline
model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
model = outlines.models.from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained(model_name),
    transformers.AutoTokenizer.from_pretrained(model_name)
)

pipeline = Pipeline(
    tasks.Classification(
        labels=["technology", "sports", "politics"],
        model=model
    )
)

# Process text
doc = Doc(text="The new smartphone features advanced AI capabilities.")
results = list(pipeline([doc]))
print(results[0].results)  # "technology"

3. Explore the docs

Read the guides · Browse examples

Advanced Example: Information Extraction from PDFs

This example shows how to extract structured information from a scientific paper (PDF) using

  • a remote LLM with DSPy via OpenRouter
  • chunking
  • the sieves.tasks.InformationExtraction task

We're using this setup to extract mathematical equations from the paper.

    import dspy
    import os
    import tokenizers
    from sieves import tasks, Doc
    import pydantic
    import chonkie

    # Define schema for extraction.
    class Equation(pydantic.BaseModel, frozen=True):
        id: str = pydantic.Field(description="ID/index of equation in paper.")
        equation: str = pydantic.Field(description="Equation as shown in paper.")

    # Create model instance using OpenRouter.
    model = dspy.LM(
        "openrouter/google/gemini-2.5-flash-lite-preview-09-2025",
        api_base="https://openrouter.ai/api/v1/",
        api_key=os.environ["OPENROUTER_API_KEY"]
    )

    # Create pipeline with PDF ingestion, chunking, and extraction.
    pipeline = (
        tasks.Ingestion(export_format="markdown") +
        tasks.Chunking(chonkie.TokenChunker(tokenizers.Tokenizer.from_pretrained("gpt2"))) +
        tasks.InformationExtraction(entity_type=Equation, model=model)
    )

    # Process a paper with equations as PDF.
    pdf_path = "https://arxiv.org/pdf/1204.0162"
    doc = Doc(uri=pdf_path)
    results = list(pipeline([doc]))

    # Access extracted entities.
    if results[0].results.get("InformationExtraction"):
        for equation in results[0].results["InformationExtraction"]:
            print(equation)

The output will look similar to this:

id='(1)' equation="the observer measures not the linear but angular ... both cars are near the stop sign."
id='(3)' equation='\\omega(t) = \\frac{r_0 v(t)}{r_0^2 + x(t)^2}'
id='(4)' equation='\\tan \\alpha(t) = \\frac{x(t)}{r_0}'
id='(5)' equation='x(t) = \\frac{a_0 t^2}{2}'
id='(6)' equation="\\frac{d}{dt} f(t) = f'(t)"
id='(7)' equation='\\omega(t) = \\frac{a_0 t}{r_0} \\left( 1 + \\frac{a_0^2 t^4}{4 r_0^2} \\right)^{-1}'
id='(8)' equation='x(t) = x_0 + v_0 t + \\frac{1}{2} a t^2'

Requirements: Install PDF parsing support:

pip install "sieves[ingestion]"

See Ingestion Guide for more PDF parsing options.


Key Features

Zero-shot NLP, ready to use

  • 🎯 No training required - immediate inference with zero-shot models (LLMs and local models)
  • 📋 Built-in tasks: classification, extraction, NER, summarization, sentiment analysis, PII masking, QA
  • 🔄 Unified interface for DSPy, Outlines, LangChain, GLiNER2, Transformers

Production-ready pipelines

  • 🔍 Observable execution with conditional task logic
  • 💾 Caching to avoid redundant model calls (see the sketch after this list)
  • 📦 Pipeline serialization and configuration management
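
A minimal sketch of the caching behavior, assuming a model object created as in the Quick Start (exact cache configuration is covered in the docs):

from sieves import Doc, Pipeline, tasks

pipeline = Pipeline(
    tasks.Classification(labels=["technology", "sports"], model=model)
)

doc = Doc(text="The match ended 2-1 after extra time.")

# The first pass invokes the model; a second pass over the same doc
# should be answered from the cache instead of calling the model again.
first = list(pipeline([doc]))
second = list(pipeline([doc]))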

Full NLP workflow

  • 📄 Document parsing: Docling, Marker (optional)
  • ✂️ Text chunking: Chonkie integration
  • 🚀 Prompt optimization: DSPy MIPROv2
  • 👨‍🏫 Model distillation: SetFit, Model2Vec

Installation

Requirements: Python 3.12 (exact version)

pip install sieves

Optional extras:

pip install "sieves[ingestion]"  # PDF/DOCX parsing (docling, marker)
pip install "sieves[distill]"     # Model distillation (setfit, model2vec)

⚠️ Important: sieves requires Python 3.12.x. This is because some dependencies like pyarrow don't have prebuilt wheels for Python 3.13+ yet, which would require manual compilation. Support for newer Python libraries is on the roadmap.

Why sieves?

Building document AI prototypes means juggling multiple tools: one for structured output, another for parsing, one for chunking, another for optimization. There are many options for structured output, each with its own pros and cons - and very different APIs. This is arduous when what you actually want is to hit the ground running and build a prototype quickly.

To address this, sieves provides a unified pipeline for the entire workflow, from document ingestion to model distillation, with validated structured outputs across multiple backends.

Best for:

  • ✅ Use case: document AI/processing
  • ✅ Rapid prototyping with zero training
  • ✅ Switching between language model backends without rewriting code
  • ✅ Building document AI pipelines with observability

Not for:

  • ❌ Use case: chat bot, RAG
  • ❌ Applications deeply coupled to LangChain/DSPy ecosystems
  • ❌ Simple one-off LLM calls without pipeline needs

Inspired by spaCy and spacy-llm.

How does sieves compare?

Feature               | sieves          | LangChain    | DSPy            | Outlines        | Transformers
----------------------|-----------------|--------------|-----------------|-----------------|-------------
Multi-backend support | ✅ All          | ❌ Own only  | ❌ Own only     | ❌ Own only     | ❌ Own only
Document parsing      | ✅ Built-in     | ✅ Via tools | ❌ No           | ❌ No           | ❌ No
Structured output     | ✅ Unified      | ✅ Yes       | ✅ Yes          | ✅ Core feature | ⚠️ Limited
Prompt optimization   | ✅ DSPy wrapper | ❌ No        | ✅ Core feature | ❌ No           | ❌ No
Model distillation    | ✅ SetFit/M2V   | ❌ No        | ✅ Yes          | ❌ No           | ⚠️ Manual
Learning curve        | Low             | Medium       | High            | Low             | Low

When to choose sieves:

  • You want to implement a document processing/AI use case
  • You want to prototype quickly without committing to a specific backend
  • You need an end-to-end document AI workflow (parsing → processing → distillation)
  • You value a unified interface over framework-specific features

When to choose alternatives:

  • You're looking for a fully featured LLM framework
  • You want to implement a chat bot or RAG use case
  • LangChain: Already deeply integrated in your production stack
  • DSPy: Research projects requiring custom optimization algorithms
  • Outlines: Simple structured generation without pipeline needs
  • Transformers: Maximum flexibility and fine-grained control

Core Concepts

sieves is built on three key abstractions:

  • Doc: Represents a document with text, metadata, and processing results
  • Task: A processing step (classification, extraction, summarization, etc.)
  • Pipeline: Orchestrates tasks with caching, serialization, and observability

→ Read the architecture guide for details on bridges, engines, and internals.
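
Putting the three together, a minimal sketch (it assumes a model object created as in the Quick Start):

from sieves import Doc, Pipeline, tasks

# Doc: wraps text (or a URI) plus metadata and accumulated results.
doc = Doc(text="Quarterly revenue grew 12% year over year.")

# Task: a single processing step; Pipeline: orchestrates one or more tasks.
pipeline = Pipeline(
    tasks.Classification(labels=["finance", "sports"], model=model)
)

# Results accumulate on the Doc as tasks run.
processed = list(pipeline([doc]))[0]
print(processed.results)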

Supported Models

sieves works with multiple NLP frameworks. Here's how to create models for each:

DSPy

See docs.

import dspy
import os

model = dspy.LM(
    "anthropic/claude-4-5-haiku",
    api_key=os.environ["ANTHROPIC_API_KEY"]
)

Outlines

See docs.

import outlines
import transformers

model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
model = outlines.models.from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained(model_name),
    transformers.AutoTokenizer.from_pretrained(model_name)
)

Transformers (Zero-Shot Classification Pipelines)

See docs.

import transformers

model = transformers.pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/xtremedistil-l6-h256-zeroshot-v1.1-all-33",
    device=0
)

LangChain

See docs. E.g. with an OpenAI model:

from langchain_openai import ChatOpenAI
import os

model = ChatOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-5-mini",
    temperature=0
)

GLiNER2

See docs.

Basic usage:

import gliner2

model = gliner2.GLiNER2.from_pretrained("fastino/gliner2-base-v1")

See the Model Setup Guide for more details and troubleshooting.

Get Started

📖 Read the guides
Start with the 5-minute tutorial

🎯 Browse examples
Explore what you can do with sieves

🤝 Join discussions
Ask questions, share projects

Frequently Asked Questions

Why "sieves"?

Filtering an LLM's potentially endless stream of unstructured text into a structured format is like sieving water to capture gold nuggets. That's where the name comes from.

Why not just prompt an LLM directly?

  • Validated outputs: Structured results with type checking
  • Observable pipelines: Debug each stage
  • Backend flexibility: Switch models without rewriting (see the sketch below)
  • Built-in tooling: Caching, serialization, optimization
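
For instance, switching backends means swapping only the model object; the task definition stays the same. A minimal sketch, reusing the model constructor shown under Supported Models:

import os

import dspy
from sieves import Pipeline, tasks

# Remote backend: an Anthropic model via DSPy (as under Supported Models).
model = dspy.LM(
    "anthropic/claude-4-5-haiku",
    api_key=os.environ["ANTHROPIC_API_KEY"]
)

# The same task definition works with an Outlines, LangChain,
# GLiNER2, or Transformers model instead.
pipeline = Pipeline(
    tasks.Classification(labels=["technology", "sports"], model=model)
)
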
How do I set up models?

See the Model Setup Guide for framework-specific examples.

Quick example: model = dspy.LM("anthropic/claude-3-haiku-20240307", api_key=os.environ["ANTHROPIC_API_KEY"])

Can I use local models?

Yes! Via Ollama, vLLM, or Transformers directly. See guide.
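
For example, a local Ollama server can be plugged in through the DSPy backend. A sketch, assuming Ollama is running locally and serving llama3.2 (model name and port are assumptions):

import dspy
from sieves import Pipeline, tasks

# Assumes a local Ollama server on the default port with llama3.2 pulled.
model = dspy.LM(
    "ollama_chat/llama3.2",
    api_base="http://localhost:11434",
    api_key=""
)

pipeline = Pipeline(
    tasks.Classification(labels=["technology", "sports"], model=model)
)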

Is sieves production-ready?

Beta status: API stable within minor versions, well-tested, used in real projects. Pin your version: pip install "sieves==0.x.*"

Attribution

sieves is inspired by spaCy and spacy-llm.

Sieve icons created by Freepik - Flaticon.
