ZooPipe is a data processing framework that allows you to process data in a declarative way.
ZooPipe is a lean, ultra-high-performance data processing engine for Python. It leverages a 100% Rust core to handle I/O and orchestration, while keeping the flexibility of Python for schema validation (via Pydantic) and custom data enrichment (via Hooks).
Read the docs for more information.
✨ Key Features
- 🚀 100% Native Rust Engine: The core execution loop, including CSV and JSON parsing/writing, is implemented in Rust for maximum throughput.
- 🔍 Declarative Validation: Use Pydantic models to define and validate your data structures naturally.
- 🪝 Python Hooks: Transform and enrich data at any stage using standard Python functions or classes.
- 🚨 Automated Error Routing: Native support for routing failed records to a dedicated error output.
- 📊 Multiple Format Support: Optimized readers/writers for CSV, JSONL, and SQL databases.
- 🔧 Two-Tier Parallelism: Orchestrate across processes or clusters with Engines (Local, Ray, Dask), and scale throughput at the node level with Rust Executors.
- ☁️ Cloud Native: Native S3, GCS, and Azure support, plus zero-config distributed execution on Ray or Dask clusters.
⚡ Performance & Benchmarks
Why ZooPipe? Because vectorization isn't always the answer.
Tools like Pandas and Polars are incredible for analytical workloads (groupby, sum, joins) where operations can be vectorized in C/Rust. However, real-world Data Engineering often involves "chaotic ETL": messy custom rules, API calls per row, hashing, conditional cleanup, and complex normalization that inevitably fall back to Python loops.
In these "Heavy ETL" scenarios, ZooPipe outperforms Vectorized DataFrames by 3x-8x.
Key Takeaway: ZooPipe's "Python-First Architecture" with parallel streaming (PipeManager) avoids the serialization overhead that cripples Polars/Pandas when using Python UDFs (map_elements/apply), and uses 97% less RAM.
⚖️ Is this unfair to Pandas/Polars?
Yes and No.
- Unfair: If your workload is purely analytical (e.g., GROUP BY, SUM, JOIN), Polars and Pandas will likely destroy ZooPipe because they can use vectorized C/Rust operations on whole columns at once.
- Fair: In real-world Data Engineering, many pipelines are "chaotic". They require custom hashing, API calls per row, conditional normalization, or complex Pydantic validation. In these "Python-UDF heavy" scenarios, vectorization breaks down, and ZooPipe shines by orchestrating parallel Python execution efficiently without the DataFrame overhead.
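To make the "Python-UDF heavy" case concrete, here is a minimal sketch of the kind of per-row logic described above, using only the standard library. The function `normalize_user` and the field names are hypothetical and are not part of ZooPipe; the point is simply that branching, hashing, and conditional cleanup per record do not map cleanly onto whole-column vectorized expressions.

```python
import hashlib


def normalize_user(row: dict) -> dict:
    """Apply per-row cleanup rules (hypothetical example, not a ZooPipe API)."""
    email = row.get("email", "").strip().lower()
    username = row.get("username", "").strip()

    # Conditional cleanup: fall back to the local part of the email.
    if not username and "@" in email:
        username = email.split("@", 1)[0]

    # Per-row hashing: derive a stable key from the normalized email.
    email_hash = hashlib.sha256(email.encode("utf-8")).hexdigest()

    return {"username": username, "email": email, "email_hash": email_hash}


rows = [
    {"username": " Ada ", "email": "ADA@Example.com "},
    {"username": "", "email": "grace@example.com"},
]
cleaned = [normalize_user(row) for row in rows]
```

In a DataFrame library, logic like this typically ends up inside a row-wise Python callback, which is the scenario the comparison above refers to.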
❓ When to use what?
| Use ZooPipe When... | Use Pandas / Polars When... |
|---|---|
| 🏗️ You have complex, custom Python logic per row (hash, clean, validate). | 🧮 You are doing aggregations (SUM, AVG) or Relational Algebra (JOIN, GROUP BY). |
| 🔄 You are processing streaming data or files larger than RAM. | 💾 Your dataset fits comfortably in RAM (or use LazyFrames). |
| 🛡️ You need strict schema validation (Pydantic) and error handling. | 🔬 You are doing data exploration or statistical analysis. |
| 🚀 You want to mix Rust I/O performance with Python flexibility. | ⚡ Your entire pipeline can be expressed in vectorized expressions. |
🚀 Quick Start
Installation
```bash
pip install zoopipe
```
Or using uv:
```bash
uv add zoopipe
```
Or from source (uv recommended):
```bash
uv build
uv run maturin develop --release
```
Simple Example
```python
from pydantic import BaseModel, ConfigDict
from zoopipe import CSVInputAdapter, CSVOutputAdapter, Pipe


class UserSchema(BaseModel):
    # Ignore input fields that are not declared on the model.
    model_config = ConfigDict(extra="ignore")
    user_id: str
    username: str
    email: str


pipe = Pipe(
    input_adapter=CSVInputAdapter("users.csv"),
    output_adapter=CSVOutputAdapter("processed_users.csv"),
    # Records that fail validation are routed to this output.
    error_output_adapter=CSVOutputAdapter("errors.csv"),
    schema_model=UserSchema,
)

pipe.start()
pipe.wait()
print(f"Finished! Processed {pipe.report.total_processed} items.")
```
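The example above expects a `users.csv` file in the working directory. As a convenience, the sketch below writes a small sample file with the standard library. The helper `write_sample_users` is hypothetical, the row values are invented, and the extra `signup_source` column is included only to show a field that `UserSchema` does not declare (with `extra="ignore"`, Pydantic ignores undeclared fields during validation).

```python
import csv
from pathlib import Path


def write_sample_users(path: str) -> int:
    """Write a small CSV whose columns match UserSchema; return the row count."""
    # Invented sample data, for illustration only.
    rows = [
        {"user_id": "1", "username": "ada", "email": "ada@example.com", "signup_source": "web"},
        {"user_id": "2", "username": "grace", "email": "grace@example.com", "signup_source": "api"},
    ]
    # newline="" lets the csv module control line endings itself.
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `write_sample_users("users.csv")` before `pipe.start()` should give the pipeline something to read.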
Automatically split large files or manage multiple independent workflows:
```python
from zoopipe import MultiProcessEngine, Pipe, PipeManager

# Create your pipe as usual (Pipe is purely declarative)
pipe = Pipe(...)

# Automatically parallelize across 4 workers
# MultiProcessEngine() for local, RayEngine() or DaskEngine() for clusters
manager = PipeManager.parallelize_pipe(
    pipe,
    workers=4,
    engine=MultiProcessEngine(),
)
manager.start()
manager.wait()
```
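How ZooPipe actually splits a file is not described here, so the following is only a conceptual sketch of one common way to divide a large line-oriented file among workers: compute roughly equal byte ranges, then move each boundary forward to the next newline so that no record is cut in half. The helper `line_aligned_shards` is hypothetical and is not part of the ZooPipe API.

```python
import os


def line_aligned_shards(path: str, workers: int) -> list[tuple[int, int]]:
    """Return (start, end) byte ranges whose starts fall on line boundaries."""
    size = os.path.getsize(path)
    if size == 0 or workers < 1:
        return []

    boundaries = [0]
    with open(path, "rb") as handle:
        for index in range(1, workers):
            # Jump just before the approximate boundary, then finish that line.
            target = size * index // workers
            handle.seek(max(target - 1, 0))
            handle.readline()
            position = handle.tell()
            # Skip boundaries that would create an empty or duplicate shard.
            if boundaries[-1] < position < size:
                boundaries.append(position)
    boundaries.append(size)

    return list(zip(boundaries[:-1], boundaries[1:]))
```

This simple version ignores two things a real CSV splitter has to handle: the header row lives only in the first shard, and a quoted field may itself contain a newline.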
📚 Documentation
Core Concepts
Hooks
Hooks are Python classes that allow you to intercept, transform, and enrich data at different stages of the pipeline.
📘 Read the full Hooks Guide to learn about lifecycle methods (setup, execute, teardown), state management, and advanced patterns like cursor pagination.
Quick Example
```python
from zoopipe import BaseHook


class MyHook(BaseHook):
    def execute(self, entries, store):
        # Mark every entry in the batch as checked.
        for entry in entries:
            entry["raw_data"]["checked"] = True
        return entries
```
[!IMPORTANT] If you are using a schema_model, the pipeline will output the contents of validated_data for successful records.
- To modify data before validation, use pre_validation_hooks and modify entry["raw_data"].
- To modify data after validation (and ensure it reaches the output), use post_validation_hooks and modify entry["validated_data"].
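The note above can be illustrated with a small stand-alone simulation. It does not use ZooPipe: `validate`, `run_entry`, and the two hook functions are hypothetical stand-ins, and only the `raw_data` and `validated_data` keys come from the text above. The sketch assumes that validation copies the accepted fields out of `raw_data`, which is why a change made to `raw_data` after validation would not reach the output.

```python
def validate(raw: dict) -> dict:
    """Stand-in for schema validation: keep only the declared fields."""
    declared = ("user_id", "username")
    return {name: str(raw[name]) for name in declared}


def strip_username(entry: dict) -> None:
    # Pre-validation step: edits raw_data, so validation sees the change.
    entry["raw_data"]["username"] = entry["raw_data"]["username"].strip()


def add_display_name(entry: dict) -> None:
    # Post-validation step: edits validated_data, which is what gets written.
    validated = entry["validated_data"]
    validated["display_name"] = validated["username"].title()


def run_entry(raw: dict) -> dict:
    """Run one record through pre-hook, validation, and post-hook in order."""
    entry = {"raw_data": dict(raw), "validated_data": None}
    strip_username(entry)
    entry["validated_data"] = validate(entry["raw_data"])
    add_display_name(entry)
    return entry["validated_data"]


output = run_entry({"user_id": 7, "username": "  ada  ", "ignored": "x"})
```

Here `output` contains the stripped username and the added `display_name`, while the undeclared `ignored` field is gone because it never entered `validated_data`.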
Executors
Executors control how ZooPipe scales up within a single node using Rust-managed threads. They drive high throughput inside each process and are distinct from Engines, which distribute work across processes or nodes.
📘 Read the full Executors Guide to understand the difference between SingleThreadExecutor (debug/ordered) and MultiThreadExecutor (high-throughput).
Input/Output Adapters
File Formats
- CSV Adapters - High-performance CSV reading and writing
- JSON Adapters - JSONL and JSON array format support
- Excel Adapters - Read and write Excel (.xlsx) files
- Parquet Adapters - Columnar storage for analytics and data lakes
- Arrow Adapters - Apache Arrow IPC format for high-throughput interoperability
Databases
- SQL Adapters - Read from and write to SQL databases with batch optimization
- SQL Pagination - High-performance cursor-style pagination for large tables
- DuckDB Adapters - Analytical database for OLAP workloads
Messaging Systems
- Kafka Adapters - High-throughput messaging
Advanced
- Python Generator Adapters - In-memory streaming and testing
- Cloud Storage (S3) - Read and write data from Amazon S3 and compatible services
- PipeManager - Run multiple pipes in parallel for distributed processing
- Ray Guide - Zero-config distributed execution on Ray clusters
- Dask Guide - Zero-config distributed execution on Dask clusters
🛠 Architecture
ZooPipe is designed as a thin Python wrapper around a powerful Rust core, featuring a two-tier parallel architecture:
- Orchestration Tier (Python Engines):
  - Manages distribution across processes or nodes (e.g., MultiProcessEngine).
  - Handles data sharding, process lifecycle, and metrics aggregation.
- Execution Tier (Rust BatchExecutors):
  - Internal Throughput: High-speed processing within a single process.
  - Adapters: Native CSV/JSON/SQL Readers and Writers.
  - NativePipe: Orchestrates the loop, fetching chunks and routing result batches.
  - Executors: Multi-threaded Rust strategies to bypass the GIL within a node.
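As a rough mental model of the execution loop described above (fetch a chunk, process it, route the resulting batches), here is a pure-Python sketch. The real loop lives in Rust and its internals are not shown in this document, so `chunked`, `run_loop`, and the `require_email` rule are illustrative assumptions only.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def chunked(records: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while chunk := list(islice(iterator, size)):
        yield chunk


def run_loop(
    records: Iterable[dict],
    validate: Callable[[dict], dict],
    chunk_size: int = 2,
) -> tuple[list[dict], list[dict]]:
    """Process records chunk by chunk, routing failures to a separate list."""
    output: list[dict] = []
    errors: list[dict] = []
    for chunk in chunked(records, chunk_size):
        for record in chunk:
            try:
                output.append(validate(record))
            except (KeyError, ValueError) as exc:
                # Failed records keep their raw payload plus the reason.
                errors.append({"raw_data": record, "error": repr(exc)})
    return output, errors


def require_email(record: dict) -> dict:
    """Hypothetical validation rule: the record must carry a plausible email."""
    if "@" not in record["email"]:
        raise ValueError("invalid email")
    return record


ok, failed = run_loop(
    [{"email": "a@example.com"}, {"email": "broken"}, {"name": "no email"}],
    require_email,
)
```

The sketch is single-threaded; in the architecture above, the Executors are the part that would spread this per-chunk work across Rust-managed threads.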
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Release 2026.1.27 consists of one source distribution and six built distributions. Every file was uploaded via twine/6.1.0 CPython/3.13.7 using Trusted Publishing.
| File | Size | Tags | Sigstore transparency entry |
|---|---|---|---|
| zoopipe-2026.1.27.tar.gz | 234.8 kB | Source | 866627148 |
| zoopipe-2026.1.27-pp311-pypy311_pp73-manylinux_2_28_x86_64.whl | 22.4 MB | PyPy, manylinux: glibc 2.28+ x86-64 | 866627232 |
| zoopipe-2026.1.27-cp310-abi3-win_amd64.whl | 17.8 MB | CPython 3.10+, Windows x86-64 | 866627315 |
| zoopipe-2026.1.27-cp310-abi3-manylinux_2_28_x86_64.whl | 22.4 MB | CPython 3.10+, manylinux: glibc 2.28+ x86-64 | 866627268 |
| zoopipe-2026.1.27-cp310-abi3-manylinux_2_28_aarch64.whl | 19.6 MB | CPython 3.10+, manylinux: glibc 2.28+ ARM64 | 866627361 |
| zoopipe-2026.1.27-cp310-abi3-macosx_11_0_arm64.whl | 17.6 MB | CPython 3.10+, macOS 11.0+ ARM64 | 866627401 |
| zoopipe-2026.1.27-cp310-abi3-macosx_10_12_x86_64.whl | 19.5 MB | CPython 3.10+, macOS 10.12+ x86-64 | 866627183 |
File hashes
| File | SHA256 | MD5 | BLAKE2b-256 |
|---|---|---|---|
| zoopipe-2026.1.27.tar.gz | 66f172b4ac43905790dc24bc74a3bf5386c49c57ed1241bbb0fb581df454c7ed | 06c378193ad156a1cbd5fcb8006a8fd3 | a3d888d1d45f22f9939b840eedebac2c9517af305dc14ad27ce18d219b4335b0 |
| zoopipe-2026.1.27-pp311-pypy311_pp73-manylinux_2_28_x86_64.whl | 9bec647d9d6c66c123c4c00f8a9f5b3f4eddeaf9e904ed1b66a69a01314d684f | d77aa2a7e0b5bd9bc42226c8f9ef990b | 43b9c3664ae7d671db85a21853f2deb2afb98a1d844fed4d998a413377e7c2e4 |
| zoopipe-2026.1.27-cp310-abi3-win_amd64.whl | 6f5e4ea2cf45078b8967e1c6598441a3ba0dc13d8c3533dbfc0d4f8de8fee4f5 | ead15dcbae02f8af5e06b46f9d1e770b | 92fdf642a49a13fc5bbff5cf163f0b805111477962de13ed39b4708e36835f09 |
| zoopipe-2026.1.27-cp310-abi3-manylinux_2_28_x86_64.whl | 9d49cf3b21b44a97c5ced554853d61c2ed4a48ada60440d054a29aa6aa5f2e05 | 64c1bb053a146bcb640ded8df79878a1 | 5047346b71224394926ba01221fd7098c5567d98f723c2e36d509da29b1ec0f5 |
| zoopipe-2026.1.27-cp310-abi3-manylinux_2_28_aarch64.whl | cf0cba496c352a94c62500ef0c517258cbbc5abe92c751172fc3d02e5f31ef40 | 1afb767608c354c7ab358d71883c1885 | 4831fd8ec155ef6ba209254fe2c5349538f185ca3ff29622341ef3e19286bed0 |
| zoopipe-2026.1.27-cp310-abi3-macosx_11_0_arm64.whl | 4bd4fd6d22984062666b5e56b0a3d9bf76be183b006545248919bf63d8152b40 | cbd3cb838d018d6113135251078564a3 | 776672b48ac8a8673466cd9a75ea2dc35fa26a1a4ee2712feae02666a3fea2d2 |
| zoopipe-2026.1.27-cp310-abi3-macosx_10_12_x86_64.whl | 9dc72e8a15beb4140dc676beb34a3337266312adc92d7fd0c680674d325ef42e | 9e5d8169796f10206b4536dc7e97b242 | 27c005a4ba7fb8fdcdbcc38e0d4a34d3ee539835e530f8833e598ed701433be3 |
Provenance
An attestation bundle was made for each file listed above. In every bundle the subject name is the file name and the subject digest is its SHA256 hash; the remaining fields are identical across all seven files:
- Publisher: release.yml on albertobadia/zoopipe
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Permalink: albertobadia/zoopipe@c4524bfb2ebfcc984655f815e5246d35aac97ca1
- Branch / Tag: refs/tags/2026.1.27
- Owner: https://github.com/albertobadia
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c4524bfb2ebfcc984655f815e5246d35aac97ca1
- Trigger Event: release