Plug-and-play, zero-shot document processing pipelines.
Zero-shot document processing made easy.
sieves is a library for zero-shot document AI with structured generation.
It helps you build document AI pipelines quickly, with validated output formats. No training required.
Read our documentation at sieves.ai. An automatically generated version (courtesy of Devin via DeepWiki) is available here.
> [!WARNING]
> sieves is in active development and currently in beta. Be advised that the API might change between minor version updates.
Quick Start
1. Install
```shell
pip install sieves
```
2. Run this example: text classification with a local model:
```python
import outlines
import transformers

from sieves import Pipeline, tasks, Doc

# Create model and pipeline.
model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
model = outlines.models.from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained(model_name),
    transformers.AutoTokenizer.from_pretrained(model_name),
)
pipeline = Pipeline(
    tasks.Classification(
        labels=["technology", "sports", "politics"],
        model=model,
    )
)

# Process text.
doc = Doc(text="The new smartphone features advanced AI capabilities.")
results = list(pipeline([doc]))
print(results[0].results)  # "technology"
```
3. Explore the docs
Read the guides • Browse examples
Advanced Example: Information Extraction from PDFs
This example shows how to extract structured information from a scientific paper (PDF) using:
- a remote LLM with DSPy via OpenRouter
- chunking
- the `sieves.tasks.InformationExtraction` task

We use this setup to extract mathematical equations from the paper.
```python
import os

import chonkie
import dspy
import pydantic
import tokenizers

from sieves import Doc, tasks

# Define the schema for extraction.
class Equation(pydantic.BaseModel, frozen=True):
    id: str = pydantic.Field(description="ID/index of equation in paper.")
    equation: str = pydantic.Field(description="Equation as shown in paper.")

# Create a model instance using OpenRouter.
model = dspy.LM(
    "openrouter/google/gemini-2.5-flash-lite-preview-09-2025",
    api_base="https://openrouter.ai/api/v1/",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Create a pipeline with PDF ingestion, chunking, and extraction.
pipeline = (
    tasks.Ingestion(export_format="markdown") +
    tasks.Chunking(chonkie.TokenChunker(tokenizers.Tokenizer.from_pretrained("gpt2"))) +
    tasks.InformationExtraction(entity_type=Equation, model=model)
)

# Process a paper with equations, supplied as a PDF.
pdf_path = "https://arxiv.org/pdf/1204.0162"
doc = Doc(uri=pdf_path)
results = list(pipeline([doc]))

# Access extracted entities.
if results[0].results.get("InformationExtraction"):
    for equation in results[0].results["InformationExtraction"]:
        print(equation)
```
The output will look similar to this:

```
id='(1)' equation="the observer measures not the linear but angular ... both cars are near the stop sign."
id='(3)' equation='\\omega(t) = \\frac{r_0 v(t)}{r_0^2 + x(t)^2}'
id='(4)' equation='\\tan \\alpha(t) = \\frac{x(t)}{r_0}'
id='(5)' equation='x(t) = \\frac{a_0 t^2}{2}'
id='(6)' equation="\\frac{d}{dt} f(t) = f'(t)"
id='(7)' equation='\\omega(t) = \\frac{a_0 t}{r_0} \\left( 1 + \\frac{a_0^2 t^4}{4 r_0^2} \\right)^{-1}'
id='(8)' equation='x(t) = x_0 + v_0 t + \\frac{1}{2} a t^2'
```
Requirements: install PDF parsing support:

```shell
pip install "sieves[ingestion]"
```
See Ingestion Guide for more PDF parsing options.
Key Features
Zero-shot NLP, ready to use
- 🎯 No training required - immediate inference with zero-shot models (LLMs and local models)
- 📋 Built-in tasks: classification, extraction, NER, summarization, sentiment analysis, PII masking, QA
- 🔄 Unified interface for DSPy, Outlines, LangChain, GLiNER2, Transformers
Production-ready pipelines
- 🔍 Observable execution with conditional task logic
- 💾 Caching to avoid redundant model calls
- 📦 Pipeline serialization and configuration management
Full NLP workflow
- 📄 Document parsing: Docling, Marker (optional)
- ✂️ Text chunking: Chonkie integration
- 🚀 Prompt optimization: DSPy MIPROv2
- 👨‍🏫 Model distillation: SetFit, Model2Vec
Installation
Requirements: Python 3.12 (exact version)
```shell
pip install sieves
```
Optional extras:

```shell
pip install "sieves[ingestion]"  # PDF/DOCX parsing (docling, marker)
pip install "sieves[distill]"    # Model distillation (setfit, model2vec)
```
⚠️ Important: sieves requires Python 3.12.x. This is because some dependencies like `pyarrow` don't yet ship prebuilt wheels for Python 3.13+, which would require manual compilation. Support for newer Python versions is on the roadmap.
Why sieves?
Building document AI prototypes means juggling multiple tools: one for structured output, another for parsing, one for chunking, another for optimization. There are many options for structured output, each with its own pros and cons, and each with a very different API. This is arduous when what you actually want is to hit the ground running and build a prototype quickly.
To address this, sieves provides a unified pipeline for the entire workflow, from document
ingestion to model distillation, with validated structured outputs across multiple backends.
Best for:
- ✅ Use case: document AI/processing
- ✅ Rapid prototyping with zero training
- ✅ Switching between language model backends without rewriting code
- ✅ Building document AI pipelines with observability
Not for:
- ❌ Use case: chat bot, RAG
- ❌ Applications deeply coupled to LangChain/DSPy ecosystems
- ❌ Simple one-off LLM calls without pipeline needs
How does sieves compare?
| Feature | sieves | LangChain | DSPy | Outlines | Transformers |
|---|---|---|---|---|---|
| Multi-backend support | ✅ All | ❌ Own only | ❌ Own only | ❌ Own only | ❌ Own only |
| Document parsing | ✅ Built-in | ✅ Via tools | ❌ No | ❌ No | ❌ No |
| Structured output | ✅ Unified | ✅ Yes | ✅ Yes | ✅ Core feature | ⚠️ Limited |
| Prompt optimization | ✅ DSPy wrapper | ❌ No | ✅ Core feature | ❌ No | ❌ No |
| Model distillation | ✅ SetFit/M2V | ❌ No | ✅ Yes | ❌ No | ⚠️ Manual |
| Learning curve | Low | Medium | High | Low | Low |
When to choose sieves:
- You want to implement a document processing/AI use case
- You want to prototype quickly without committing to a specific backend
- You need end-to-end document AI workflow (parsing → processing → distillation)
- You value a unified interface over framework-specific features
When to choose alternatives:
- You're looking for a fully featured LLM framework
- You want to implement a chat bot or RAG use case
- LangChain: Already deeply integrated in your production stack
- DSPy: Research projects requiring custom optimization algorithms
- Outlines: Simple structured generation without pipeline needs
- Transformers: Maximum flexibility and fine-grained control
Core Concepts
sieves is built on three key abstractions:
- Doc: represents a document with text, metadata, and processing results
- Task: a processing step (classification, extraction, summarization, etc.)
- Pipeline: orchestrates tasks with caching, serialization, and observability
→ Read the architecture guide for details on bridges, engines, and internals.
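To make the three abstractions concrete, here is a deliberately simplified toy version in plain Python. This is a conceptual sketch only, not sieves' actual implementation; the task names and classes are invented for illustration:

```python
from dataclasses import dataclass, field

# Toy versions of the three abstractions (illustrative only).

@dataclass
class Doc:
    """A document: input text plus accumulated task results."""
    text: str
    results: dict = field(default_factory=dict)

class UppercaseTask:
    """A processing step that writes its output under its own name."""
    name = "Uppercase"

    def process(self, doc: Doc) -> Doc:
        doc.results[self.name] = doc.text.upper()
        return doc

class WordCountTask:
    name = "WordCount"

    def process(self, doc: Doc) -> Doc:
        doc.results[self.name] = len(doc.text.split())
        return doc

class Pipeline:
    """Runs each task over each document, in order."""
    def __init__(self, *tasks):
        self.tasks = tasks

    def __call__(self, docs):
        for doc in docs:
            for task in self.tasks:
                doc = task.process(doc)
            yield doc

docs = list(Pipeline(UppercaseTask(), WordCountTask())([Doc("hello world")]))
print(docs[0].results)  # {'Uppercase': 'HELLO WORLD', 'WordCount': 2}
```

This mirrors the pattern used throughout the examples above: each task deposits its output in the document's results under the task's name, which is why `results[0].results["InformationExtraction"]` works in the PDF example.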
Supported Models
sieves works with multiple NLP frameworks. Here's how to create models for each:
DSPy
See docs.
```python
import os

import dspy

model = dspy.LM(
    "anthropic/claude-4-5-haiku",
    api_key=os.environ["ANTHROPIC_API_KEY"],
)
```
Outlines
See docs.
```python
import outlines
import transformers

model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
model = outlines.models.from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained(model_name),
    transformers.AutoTokenizer.from_pretrained(model_name),
)
```
Transformers (Zero-Shot Classification Pipelines)
See docs.
```python
import transformers

model = transformers.pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/xtremedistil-l6-h256-zeroshot-v1.1-all-33",
    device=0,
)
```
LangChain
See docs. E.g. with an OpenAI model:
```python
import os

from langchain_openai import ChatOpenAI

model = ChatOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-5-mini",
    temperature=0,
)
```
GLiNER2
See docs.
Basic usage:
```python
import gliner2

model = gliner2.GLiNER2.from_pretrained("fastino/gliner2-base-v1")
```
See the Model Setup Guide for more details and troubleshooting.
Get Started
📖 Read the guides
Start with the 5-minute tutorial
🎯 Browse examples
Explore what you can do with sieves
🤝 Join discussions
Ask questions, share projects
Frequently Asked Questions
Why "sieves"?
Filtering an LLM's potentially endless stream of unstructured text into a structured format is like sieving water to capture gold nuggets. That's where the name comes from.
Why not just prompt an LLM directly?
- Validated outputs: Structured results with type checking
- Observable pipelines: Debug each stage
- Backend flexibility: Switch models without rewriting
- Built-in tooling: Caching, serialization, optimization
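To see what the first point buys you, here is a minimal stdlib-only sketch of output validation, independent of sieves: a raw model response is parsed and checked against an allowed label set, so malformed or out-of-vocabulary answers fail loudly instead of propagating. The function name and responses are invented for illustration:

```python
import json

def parse_classification(raw: str, labels: list[str]) -> str:
    """Validate a raw model response against an allowed label set."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    label = data.get("label")
    if label not in labels:
        raise ValueError(f"unexpected label: {label!r}")
    return label

labels = ["technology", "sports", "politics"]

# A well-formed response passes through.
print(parse_classification('{"label": "technology"}', labels))  # technology

# An out-of-vocabulary response is rejected instead of silently accepted.
try:
    parse_classification('{"label": "cooking"}', labels)
except ValueError as e:
    print("rejected:", e)
```

Structured generation backends like those sieves wraps enforce such constraints at generation time rather than after the fact, which is strictly stronger than post-hoc parsing.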
How do I set up models?
See the Model Setup Guide for framework-specific examples.
Quick example: `model = dspy.LM("claude-3-haiku-20240307", api_key=os.environ["ANTHROPIC_API_KEY"])`
Can I use local models?
Yes! Via Ollama, vLLM, or Transformers directly. See guide.
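As one concrete configuration, DSPy's documented Ollama pattern constructs a model pointing at a locally running server; the model name is illustrative and assumes you have already pulled it with Ollama:

```python
import dspy

# Connect to a local Ollama server (default port 11434; requires
# e.g. `ollama run llama3.2` to be serving the model first).
model = dspy.LM(
    "ollama_chat/llama3.2",
    api_base="http://localhost:11434",
    api_key="",
)
```

The resulting `model` can then be passed to sieves tasks like any other DSPy model.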
Is sieves production-ready?
Beta status: API stable within minor versions, well-tested, used in real projects.
Pin your version: pip install "sieves==0.x.*"
Attribution
sieves is inspired by spaCy and spacy-llm.