
Rapid prototyping and robust baselines for information extraction with zero- and few-shot models.


Zero-shot document processing made easy.

sieves is a library for zero- and few-shot NLP tasks with structured generation. Build production-ready NLP prototypes quickly, with guaranteed output formats and no training required.

Read our documentation here. An automatically generated version (courtesy of Devin via DeepWiki) is available here.

Install sieves with pip install sieves. If you want to install all optional dependencies right away, install with pip install sieves[engines,distill]. You can also choose to install individual dependencies as you see fit.

[!WARNING] sieves is in active development and currently in beta. Be advised that the API might change between minor version updates.

Why sieves?

Even in the era of generative AI, structured outputs and observability remain crucial.

Many real-world scenarios require rapid prototyping with minimal data. Generative language models excel here, but producing clean, structured output can be challenging. Various tools address this need for structured/guided language model output, including outlines, dspy, ollama, and others. Each has its own design patterns, pros, and cons. sieves wraps these tools and provides a unified interface for input, processing, and output.

Developing NLP prototypes often involves repetitive steps: parsing and chunking documents, exporting results for model fine-tuning, and experimenting with different prompting techniques. Existing libraries in the NLP ecosystem address these needs (e.g. docling for file parsing, or datasets for transforming data into a unified format for model training).

sieves simplifies NLP prototyping by bundling these capabilities into a single library, allowing you to quickly build modern NLP applications. It provides:

  • Zero- and few-shot model support for immediate inference
  • A bundle of utilities addressing common requirements in NLP applications
  • A unified interface for structured generation across multiple libraries
  • Built-in tasks for common NLP operations
  • Easy extendability
  • A document-based pipeline architecture for easy observability and debugging
  • Caching - pipelines cache processed documents to prevent costly redundant model calls
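
For illustration, a minimal caching sketch that reuses the model setup from the Getting Started example below:

import outlines
from sieves import Pipeline, tasks

model = outlines.models.transformers("HuggingFaceTB/SmolLM-135M-Instruct")

# use_cache defaults to True; disabling it forces fresh model calls on every run.
pipe = Pipeline(
    [tasks.Classification(labels=["science", "politics"], model=model)],
    use_cache=False,
)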

sieves draws a lot of inspiration from spaCy and particularly spacy-llm.


Features

  • :dart: Zero Training Required: Immediate inference using zero-/few-shot models
  • :robot: Unified Generation Interface: Seamlessly use multiple libraries
  • :arrow_forward: Observable Pipelines: Easy debugging and monitoring
  • :hammer_and_wrench: Integrated Tools (see the preprocessing sketch after this list):
    • Document ingestion and parsing (e.g. via docling)
    • Chunking (e.g. via chonkie)
  • :label: Ready-to-Use Tasks:
    • Multi-label classification
    • Information extraction
    • Summarization
    • Translation
    • Multi-question answering
    • Aspect-based sentiment analysis
    • PII (personally identifiable information) anonymization
    • Named entity recognition
    • Coming soon: entity linking, knowledge graph creation, ...
  • :floppy_disk: Persistence: Save and load pipelines with configurations
  • :teacher: Distillation: Automatically distill local, specialized models from your zero-shot model's results. Export results as a Hugging Face Dataset if you want to run your own training routine.
  • :recycle: Caching to avoid unnecessary model calls
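
The integrated tools compose like any other tasks. A minimal preprocessing sketch, mirroring the Advanced Example below (chaining just these two tasks is an assumption built on the documented + operator):

import chonkie
import tokenizers
from sieves import Doc, tasks

# Chain PDF ingestion (to markdown) with token-based chunking.
chunker = chonkie.TokenChunker(tokenizers.Tokenizer.from_pretrained("knowledgator/gliner-multitask-v1.0"))
pipe = tasks.Ingestion(export_format="markdown") + tasks.Chunking(chunker)

docs = list(pipe([Doc(uri="https://arxiv.org/pdf/2408.09869")]))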

Getting Started

Here's a simple classification example using outlines:

from sieves import Pipeline, tasks, Doc
import outlines

# 1. Define documents by text or URI.
docs = [Doc(text="Special relativity applies to all physical phenomena in the absence of gravity.")]

# 2. Choose a model (Outlines in this example).
model = outlines.models.transformers("HuggingFaceTB/SmolLM-135M-Instruct")

# 3. Create pipeline with tasks (verbose init).
pipe = Pipeline(
  # Add classification task to pipeline.
  tasks.Classification(labels=["science", "politics"], model=model)
)

# 4. Run pipe and output results.
for doc in pipe(docs):
  print(doc.results)

# Tip: Pipelines can also be composed succinctly via chaining (+).
# For multi-step pipelines, you can write:
#   pipe = tasks.Ingestion(export_format="markdown") + tasks.Chunking(chunker) + tasks.Classification(labels=[...], model=model)
# Note: additional Pipeline parameters (e.g., use_cache=False) are only available via the verbose init,
# e.g., Pipeline([t1, t2], use_cache=False).

Advanced Example

This example demonstrates PDF parsing, text chunking, and classification:

import pickle

import gliner
import chonkie
import tokenizers
import docling.document_converter

from sieves import Pipeline, tasks, Doc

# 1. Define documents by text or URI.
docs = [Doc(uri="https://arxiv.org/pdf/2408.09869")]

# 2. Choose a model for structured generation.
model_name = "knowledgator/gliner-multitask-v1.0"
model = gliner.GLiNER.from_pretrained(model_name)

# 3. Create chunker object.
chunker = chonkie.TokenChunker(tokenizers.Tokenizer.from_pretrained(model_name))

# 4. Create pipeline with tasks.
pipe = Pipeline(
  [
    # 5. Add document parsing task.
    tasks.Ingestion(export_format="markdown"),
    # 6. Add chunking task to ensure we don't exceed our model's context window.
    tasks.Chunking(chunker),
    # 7. Add classification task to pipeline.
    tasks.Classification(
        task_id="classifier",
        labels=["science", "politics"],
        model=model,
    ),
  ]
)
# Alternatively, you can construct the pipeline with the + operator:
# pipe = tasks.Ingestion(export_format="markdown") + tasks.Chunking(chunker) + tasks.Classification(
#     task_id="classifier", labels=["science", "politics"], model=model
# )

# 8. Run pipe and output results.
docs = list(pipe(docs))
for doc in docs:
  print(doc.results["classifier"])

# 9. Serialize pipeline and docs.
pipe.dump("pipeline.yml")
with open("docs.pkl", "wb") as f:
  pickle.dump(docs, f)

# 10. Load pipeline and docs from disk. Note: we don't serialize complex third-party objects, so you'll have
#    to pass those in at load time.
loaded_pipe = Pipeline.load(
  "pipeline.yml",
  (
    {"converter": docling.document_converter.DocumentConverter(), "export_format": "markdown"},
    {"chunker": chunker},
    {"model": model},
  ),
)
with open("docs.pkl", "rb") as f:
  loaded_docs = pickle.load(f)

Core Concepts

sieves is built on five key abstractions, plus an optional GenerationSettings object.

Pipeline

Orchestrates task execution, with features for:

  • Task configuration and sequencing
  • Pipeline execution
  • Configuration management and serialization
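
Configuration management and serialization, for instance, follow the pattern from the Advanced Example above:

import outlines
from sieves import Pipeline, tasks

model = outlines.models.transformers("HuggingFaceTB/SmolLM-135M-Instruct")
pipe = Pipeline(tasks.Classification(labels=["science", "politics"], model=model))

# Persist the pipeline configuration to YAML...
pipe.dump("pipeline.yml")
# ...and restore it, re-injecting non-serializable objects (one dict per task).
loaded_pipe = Pipeline.load("pipeline.yml", ({"model": model},))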

Doc

Represents a document in the pipeline.

  • Contains text content and metadata
  • Tracks document URI and processing results
  • Passes information between pipeline tasks
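
Both construction styles appear in the examples above:

from sieves import Doc

# A Doc can wrap raw text directly...
doc_a = Doc(text="Special relativity applies to all physical phenomena in the absence of gravity.")
# ...or reference a URI whose content an ingestion task fetches and parses.
doc_b = Doc(uri="https://arxiv.org/pdf/2408.09869")

# After a pipeline run, per-task results live under the task's ID, e.g.:
# doc_a.results["classifier"]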

Task

Encapsulates a single processing step in a pipeline.

  • Defines input arguments
  • Wraps and initializes Bridge instances handling task-engine-specific logic
  • Implements task-specific dataset export
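
As a rough sketch of the dataset export path (the method name to_hf_dataset is hypothetical, for illustration only; check the API reference for the exact call):

import outlines
from sieves import Doc, Pipeline, tasks

model = outlines.models.transformers("HuggingFaceTB/SmolLM-135M-Instruct")
task = tasks.Classification(labels=["science", "politics"], model=model)
docs = [Doc(text="Special relativity applies to all physical phenomena in the absence of gravity.")]
docs = list(Pipeline(task)(docs))

# Hypothetical export call (method name assumed): turn zero-shot results into a
# Hugging Face Dataset for fine-tuning a smaller, specialized model.
dataset = task.to_hf_dataset(docs)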

Engine

Provides a unified interface to structured generation libraries (internal). You pass a backend model into tasks; the Engine is used under the hood.

  • Manages model interactions
  • Handles prompt execution
  • Standardizes output formats

GenerationSettings (optional)

Controls execution behavior across engines. You usually don't need to pass this — defaults are sensible:

  • batch_size: -1 (batch all inputs together)
  • strict_mode: False (on parse issues, return None instead of raising)
  • init_kwargs/inference_kwargs: None (use engine defaults)
  • config_kwargs: None (used by some backends, such as DSPy)

Pass GenerationSettings only when you need non-defaults, e.g. small batches (GenerationSettings(batch_size=8)) or strict parsing (strict_mode=True).
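
A minimal sketch of overriding the defaults; the import path and the generation_settings keyword are assumptions for illustration (check the API reference for your version):

import outlines
from sieves import GenerationSettings, tasks

model = outlines.models.transformers("HuggingFaceTB/SmolLM-135M-Instruct")

# Process inputs in small batches and raise on parse failures instead of returning None.
settings = GenerationSettings(batch_size=8, strict_mode=True)
# generation_settings is an assumed keyword; see the task API for the actual parameter.
task = tasks.Classification(labels=["science", "politics"], model=model, generation_settings=settings)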

Bridge

Connects Task with Engine.

  • Implements engine-specific prompt templates
  • Manages output type specifications
  • Ensures compatibility between tasks and engines

Frequently Asked Questions


Why "sieves"?

sieves was originally motivated by the desire to use generative models for structured information extraction. Coming from that angle, there are two ways to explain why we settled on this name (pick the one you like better):

  • An analogy to gold panning: run your raw data through a sieve to obtain structured, refined "gold."
  • An acronym - "sieves" can be read as "Structured Information Extraction and VErification System" (but that's a mouthful).

Why not just prompt an LLM directly?

Asked differently: what are the benefits of using sieves over directly interacting with an LLM?

  • Validated, structured data output - also for LLMs that don't offer structured outputs natively. Zero-/few-shot language models can be finicky without guardrails or parsing.
  • A step-by-step pipeline, making it easier to debug and track each stage.
  • The flexibility to switch between different models and ways to ensure structured and validated output.
  • A bunch of useful utilities for pre- and post-processing you might need.
  • An array of useful tasks you can use right off the bat, without having to roll your own.

Why use sieves and not a structured generation library, like outlines, directly?

Which library makes the most sense for you depends strongly on your use case. outlines provides structured generation abilities, but not the pipeline system, utilities, and pre-built tasks that sieves has to offer (and, of course, not the flexibility to switch between different structured generation libraries). Then again, maybe you don't need all that - in which case we recommend using outlines (or any other structured generation library) directly.

Similarly, maybe your project already has a tech stack that relies exclusively on ollama, langchain, or dspy? All of these libraries (and more) are supported by sieves - but they are more than structured generation libraries, and come with a plethora of features that are out of scope for sieves. If your application deeply integrates with a framework like LangChain or DSPy, it may be reasonable to stick with those libraries directly.

As with many things in engineering, this is a trade-off. The way we see it: the less tightly coupled your existing application is with a particular language model framework, the more mileage you'll get out of sieves. This makes it ideal for prototyping (though there's no reason you can't use it in production too, of course).


Source for sieves icon: Sieve icons created by Freepik - Flaticon.
