ZooPipe is a data processing framework that allows you to process data in a declarative way.


ZooPipe is a lean, ultra-high-performance data processing engine for Python. It leverages a 100% Rust core to handle I/O and orchestration, while keeping the flexibility of Python for schema validation (via Pydantic) and custom data enrichment (via Hooks).


Read the docs for more information.

## ✨ Key Features

  • 🚀 100% Native Rust Engine: The core execution loop, including CSV and JSON parsing/writing, is implemented in Rust for maximum throughput.
  • 🔍 Declarative Validation: Use Pydantic models to define and validate your data structures naturally.
  • 🪝 Python Hooks: Transform and enrich data at any stage using standard Python functions or classes.
  • 🚨 Automated Error Routing: Native support for routing failed records to a dedicated error output.
  • 📊 Multiple Format Support: Optimized readers/writers for CSV, JSONL, Parquet, and Iceberg.
  • 🔧 Two-Tier Parallelism: Orchestrate across processes or clusters with Engines (Local, Ray, Dask), and scale throughput at the node level with Rust Executors.
  • ☁️ Cloud Native: Native S3, GCS, and Azure support, plus native Iceberg Data Lake integration.

## ⚡ Performance & Benchmarks

Why ZooPipe? Because vectorization isn't always the answer.

Tools like Pandas and Polars are incredible for analytical workloads (groupby, sum, joins) where operations can be vectorized in C/Rust. However, real-world data engineering often involves "chaotic ETL": messy custom rules, per-row API calls, hashing, conditional cleanup, and complex normalization that force a drop down to Python loops.

In these "Heavy ETL" scenarios, ZooPipe outperforms vectorized DataFrames by 3x-8x.

Benchmark Chart

Key Takeaway: ZooPipe's "Python-First Architecture" with parallel streaming (PipeManager) avoids the serialization overhead that cripples Polars/Pandas when using Python UDFs (map_elements/apply), and uses 97% less RAM.

### ⚖️ Is this unfair to Pandas/Polars?

Yes and No.

  • Unfair: If your workload is purely analytical (e.g., GROUP BY, SUM, JOIN), Polars and Pandas will likely destroy ZooPipe because they can use vectorized C/Rust operations on whole columns at once.
  • Fair: In real-world Data Engineering, many pipelines are "chaotic". They require custom hashing, API calls per row, conditional normalization, or complex Pydantic validation. In these "Python-UDF heavy" scenarios, vectorization breaks down, and ZooPipe shines by orchestrating parallel Python execution efficiently without the DataFrame overhead.
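As a concrete illustration, here is the kind of per-row logic meant by "chaotic ETL". This is a stand-alone, hedged Python sketch (the function name and fields are invented for illustration, not part of ZooPipe's API):

```python
import hashlib

def enrich_row(row: dict) -> dict:
    """Per-row hashing plus conditional cleanup: typical "chaotic ETL"
    logic with no natural vectorized column expression."""
    # Derive a short, stable pseudonymous ID from the email address.
    row["user_hash"] = hashlib.sha256(row["email"].encode()).hexdigest()[:12]
    # Conditional normalization: backfill a missing country code.
    if not row.get("country"):
        row["country"] = "unknown"
    return row

row = enrich_row({"email": "a@example.com", "country": ""})
print(row["country"])  # unknown
```

Expressing this in Polars or Pandas requires a Python UDF (`map_elements`/`apply`), which forces per-row serialization back and forth across the DataFrame boundary.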

### ❓ When to use what?

| Use ZooPipe when... | Use Pandas / Polars when... |
|---|---|
| 🏗️ You have complex, custom Python logic per row (hash, clean, validate). | 🧮 You are doing aggregations (SUM, AVG) or relational algebra (JOIN, GROUP BY). |
| 🔄 You are processing streaming data or files larger than RAM. | 💾 Your dataset fits comfortably in RAM (or you use LazyFrames). |
| 🛡️ You need strict schema validation (Pydantic) and error handling. | 🔬 You are doing data exploration or statistical analysis. |
| 🚀 You want to mix Rust I/O performance with Python flexibility. | ⚡ Your entire pipeline can be expressed in vectorized expressions. |

## 🚀 Quick Start

### Installation

Using uv (recommended):

```shell
uv add zoopipe
```

Or using pip:

```shell
pip install zoopipe
```

From source:

```shell
uv sync
uv run maturin develop --release
```

### Simple Example

```python
from pydantic import BaseModel, ConfigDict
from zoopipe import CSVInputAdapter, CSVOutputAdapter, Pipe


class UserSchema(BaseModel):
    model_config = ConfigDict(extra="ignore")
    user_id: str
    username: str
    email: str


pipe = Pipe(
    input_adapter=CSVInputAdapter("users.csv"),
    output_adapter=CSVOutputAdapter("processed_users.csv"),
    error_output_adapter=CSVOutputAdapter("errors.csv"),
    schema_model=UserSchema,
)

# Run the pipe (streaming processing)
pipe.run()

print(f"Finished! Processed {pipe.report.total_processed} items.")
```

Automatically split large files or manage multiple independent workflows:

```python
from zoopipe import MultiProcessEngine, Pipe, PipeManager

# Create your pipe as usual (Pipe is purely declarative)
pipe = Pipe(...)

# Automatically parallelize across 4 workers:
# MultiProcessEngine() for local, RayEngine() or DaskEngine() for clusters
manager = PipeManager.parallelize_pipe(
    pipe,
    workers=4,
    engine=MultiProcessEngine(),
)

# Start, wait, and coordinate (e.g. merge output files) automatically
manager.run()
```

---

## 📚 Documentation

### Core Concepts


#### Hooks

Hooks are Python classes that allow you to intercept, transform, and enrich data at different stages of the pipeline.

**[📘 Read the full Hooks Guide](https://github.com/albertobadia/zoopipe/blob/main/docs/hooks.md)** to learn about lifecycle methods (`setup`, `execute`, `teardown`), state management, and advanced patterns like cursor pagination.

### Quick Example

```python
from zoopipe import BaseHook

class MyHook(BaseHook):
    def execute(self, entries, store):
        for entry in entries:
            entry["raw_data"]["checked"] = True
        return entries
```

> [!IMPORTANT]
> If you are using a `schema_model`, the pipeline will output the contents of `validated_data` for successful records.
>
> - To modify data **before** validation, use `pre_validation_hooks` and modify `entry["raw_data"]`.
> - To modify data **after** validation (and ensure it reaches the output), use `post_validation_hooks` and modify `entry["validated_data"]`.
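For instance, a pre-validation hook might normalize fields in `entry["raw_data"]` before Pydantic sees them. The sketch below is stand-alone pure Python; the class, field names, and sample entries are illustrative (in a real pipeline the class would subclass `BaseHook` and be passed via `pre_validation_hooks`):

```python
class NormalizeEmailHook:
    """Illustrative pre-validation hook: strip and lowercase emails
    so schema validation sees clean values."""

    def execute(self, entries, store):
        for entry in entries:
            raw = entry["raw_data"]
            if "email" in raw:
                raw["email"] = raw["email"].strip().lower()
        return entries

# Demonstrate with plain dicts shaped like pipeline entries:
entries = [{"raw_data": {"email": "  Alice@Example.COM "}}]
out = NormalizeEmailHook().execute(entries, store={})
print(out[0]["raw_data"]["email"])  # alice@example.com
```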

#### Executors

Executors control how ZooPipe scales up within a single node using Rust-managed threads. They are the engine under the hood that drives high throughput.

📘 Read the full Executors Guide to understand the difference between `SingleThreadExecutor` (debug/ordered) and `MultiThreadExecutor` (high-throughput).

#### Input/Output Adapters

##### File Formats

##### Databases

- **SQL Adapters** - Read from and write to SQL databases with batch optimization
- **SQL Pagination** - High-performance cursor-style pagination for large tables
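The cursor-pagination idea can be sketched in plain Python. This is the generic keyset pattern, not the adapter's actual API (`fetch_batch` and the `id` key are illustrative): instead of `OFFSET`, each query filters on the last key seen, which stays fast on large tables.

```python
def paginate(fetch_batch, batch_size=3):
    """Yield every row by repeatedly fetching batches after the last seen key."""
    cursor = None
    while True:
        rows = fetch_batch(after=cursor, limit=batch_size)
        if not rows:
            break
        yield from rows
        cursor = rows[-1]["id"]  # keyset cursor, not an OFFSET

# In-memory stand-in for a SQL query ordered by id:
TABLE = [{"id": i} for i in range(1, 8)]

def fetch_batch(after, limit):
    start = 0 if after is None else next(
        i + 1 for i, r in enumerate(TABLE) if r["id"] == after
    )
    return TABLE[start:start + limit]

ids = [r["id"] for r in paginate(fetch_batch)]
print(ids)  # [1, 2, 3, 4, 5, 6, 7]
```

Against a real database, `fetch_batch` would run something like `SELECT ... WHERE id > :after ORDER BY id LIMIT :limit`.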

##### Messaging Systems

##### Advanced


## 🛠 Architecture

ZooPipe is designed as a thin Python wrapper around a powerful Rust core, featuring a two-tier parallel architecture:

1. **Orchestration Tier (Python Engines)**
   - Manages distribution across processes or nodes (e.g., `MultiProcessEngine`).
   - Handles data sharding, process lifecycle, and metrics aggregation.
2. **Execution Tier (Rust BatchExecutors)**
   - Internal throughput: high-speed processing within a single process.
   - Adapters: native CSV/JSON/SQL readers and writers.
   - `NativePipe`: orchestrates the loop, fetching chunks and routing result batches.
   - Executors: multi-threaded Rust strategies to bypass the GIL within a node.
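The two tiers can be pictured with a toy pure-Python analogy (illustration only, not ZooPipe's implementation; threads stand in for the Rust execution tier, and the round-robin sharding scheme is simplified):

```python
from concurrent.futures import ThreadPoolExecutor

def process_shard(rows):
    # Tier 2 stand-in: the hot per-row loop each worker runs.
    # (In ZooPipe this tier is Rust and bypasses the GIL.)
    return [row.upper() for row in rows]

def run(rows, workers=2):
    # Tier 1 stand-in: shard the input, orchestrate the workers,
    # then aggregate results. (In ZooPipe: Engines such as MultiProcessEngine.)
    shards = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_shard, shards)
    return [row for shard in results for row in shard]

print(run(["a", "b", "c", "d"]))  # ['A', 'C', 'B', 'D']
```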

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

**Source Distribution**

- `zoopipe-2026.2.2.tar.gz` (266.4 kB): Source

**Built Distributions**

If you're not sure about the file name format, learn more about wheel file names.

- `zoopipe-2026.2.2-pp311-pypy311_pp73-manylinux_2_28_x86_64.whl` (17.1 MB): PyPy, manylinux glibc 2.28+, x86-64
- `zoopipe-2026.2.2-cp310-abi3-win_amd64.whl` (15.7 MB): CPython 3.10+, Windows x86-64
- `zoopipe-2026.2.2-cp310-abi3-manylinux_2_28_x86_64.whl` (17.1 MB): CPython 3.10+, manylinux glibc 2.28+, x86-64
- `zoopipe-2026.2.2-cp310-abi3-manylinux_2_28_aarch64.whl` (15.8 MB): CPython 3.10+, manylinux glibc 2.28+, ARM64
- `zoopipe-2026.2.2-cp310-abi3-macosx_11_0_arm64.whl` (14.6 MB): CPython 3.10+, macOS 11.0+ ARM64
- `zoopipe-2026.2.2-cp310-abi3-macosx_10_12_x86_64.whl` (16.0 MB): CPython 3.10+, macOS 10.12+ x86-64

