Lightweight dataset library for distributed data processing

Zephyr

Simple data processing library for Marin pipelines. Build lazy dataset pipelines that run on Iris jobs or a local backend.

Quick Start

from zephyr import Dataset, ZephyrContext, load_jsonl

# Read, transform, write
ctx = ZephyrContext(max_workers=100)
pipeline = (
    Dataset.from_files("gs://input/", "**/*.jsonl.gz")
    .flat_map(load_jsonl)
    .filter(lambda x: x["score"] > 0.5)
    .map(transform_record)  # transform_record: your own record -> record function
    .write_jsonl("gs://output/data-{shard:05d}-of-{total:05d}.jsonl.gz")
)
ctx.execute(pipeline)
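
To make the "lazy" part concrete, here is a minimal sketch in plain Python. It is not Zephyr's implementation; `LazyPipeline` and `double` are hypothetical names used only for illustration. Each chained call records a step, and nothing runs until `execute()` is called, which mirrors how the pipeline above does no work until `ctx.execute(pipeline)`.

```python
from itertools import chain


class LazyPipeline:
    """Toy stand-in for a lazy dataset: records steps, runs nothing until execute()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def _with(self, kind, fn):
        # Return a new pipeline so that chaining never mutates an existing one.
        return LazyPipeline(self._source, self._steps + ((kind, fn),))

    def map(self, fn):
        return self._with("map", fn)

    def filter(self, fn):
        return self._with("filter", fn)

    def flat_map(self, fn):
        return self._with("flat_map", fn)

    def execute(self):
        # Stack iterators so every item flows through all steps in a single pass.
        stream = iter(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                stream = map(fn, stream)
            elif kind == "filter":
                stream = filter(fn, stream)
            else:
                stream = chain.from_iterable(map(fn, stream))
        return list(stream)


calls = []


def double(x):
    calls.append(x)  # record when the function actually runs
    return x * 2


pipeline = (
    LazyPipeline([1, 2, 3])
    .flat_map(lambda x: [x, x + 10])
    .filter(lambda x: x % 2 == 1)
    .map(double)
)
assert calls == []         # building the pipeline ran nothing
print(pipeline.execute())  # [2, 22, 6, 26]
```

Because the steps are stacked iterators, no intermediate list is built between `flat_map`, `filter`, and `map`; that is the general idea behind fusing operations, although Zephyr's own planner may work differently.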

Key Patterns

Dataset Creation:

Dataset.from_files(path, pattern) - glob files
Dataset.from_list(items) - explicit list

Loading Files:

.load_{file,parquet,jsonl,vortex} - load rows from a file

Transformations:

.map(fn) - transform each item
.flat_map(fn) - expand items (e.g., load_jsonl)
.filter(fn) - filter items by function or expression
.select(column_a, column_b) - keep only the given columns
.window(n) - group into batches
.reshard(n) - redistribute across n shards
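
As a rough mental model, the less familiar transformations can be written as plain Python helpers. This is a sketch under assumptions, not Zephyr's code: whether `.window(n)` emits a final partial batch and how `.reshard(n)` assigns items to shards are not stated above, so the choices below (keep the partial batch, round-robin assignment) are illustrative only.

```python
from itertools import chain, islice


def window(items, n):
    """Group items into lists of up to n elements (the last batch may be smaller)."""
    it = iter(items)
    while batch := list(islice(it, n)):
        yield batch


def reshard(shards, n):
    """Redistribute all items across n shards, round-robin (one possible strategy)."""
    out = [[] for _ in range(n)]
    for i, item in enumerate(chain.from_iterable(shards)):
        out[i % n].append(item)
    return out


def select(row, *columns):
    """Keep only the named columns of a dict-like row."""
    return {c: row[c] for c in columns}


print(list(window(range(5), 2)))                # [[0, 1], [2, 3], [4]]
print(reshard([[1, 2, 3], [4]], 2))             # [[1, 3], [2, 4]]
print(select({"a": 1, "b": 2, "c": 3}, "a", "c"))  # {'a': 1, 'c': 3}
```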

Output:

.write_jsonl(pattern) - write JSONL (gzip if .gz)
.write_parquet(pattern, schema) - write to a Parquet file
.write_vortex(pattern) - write to a Vortex file
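
The "gzip if .gz" behavior of a JSONL writer can be sketched with the standard library alone. The helper below is hypothetical (`write_jsonl_shard` is not a Zephyr function) and assumes the output pattern is filled in with `str.format`-style `shard` and `total` fields, as the Quick Start pattern suggests.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_shard(pattern, shard, total, rows):
    """Write rows as JSON Lines to one shard file; compress when the name ends in .gz."""
    path = pattern.format(shard=shard, total=total)
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


out_dir = tempfile.mkdtemp()
pattern = os.path.join(out_dir, "data-{shard:05d}-of-{total:05d}.jsonl.gz")
written = write_jsonl_shard(pattern, shard=3, total=10, rows=[{"score": 0.9}])
print(os.path.basename(written))  # data-00003-of-00010.jsonl.gz
```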

Execution (ZephyrContext):

ZephyrContext(max_workers=N) - auto-detects the backend via fray.current_client(): Iris inside an Iris job, local otherwise
ZephyrContext(client=LocalClient()) - explicit local backend (useful for testing)
ctx.execute(pipeline) - runs the pipeline and returns a ZephyrExecutionResult(results, counters)

Real Usage

Wikipedia Processing:

from zephyr import Dataset, ZephyrContext

# files, config, output, and process_record are supplied by the caller
ctx = ZephyrContext(max_workers=100)
pipeline = (
    Dataset.from_list(files)
    .load_jsonl()
    .map(lambda row: process_record(row, config))
    .filter(lambda x: x is not None)
    .write_jsonl(f"{output}/data-{{shard:05d}}-of-{{total:05d}}.jsonl.gz")
)
ctx.execute(pipeline)
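
The doubled braces in the output pattern above are ordinary Python f-string escaping: `{output}` is substituted immediately, while `{{shard:05d}}` survives as a literal `{shard:05d}` placeholder to be filled per shard later. The snippet below shows this with a made-up `output` value; the final `.format` call assumes the placeholders are filled `str.format`-style, which is an illustration rather than a statement about Zephyr's internals.

```python
output = "gs://bucket/wiki"  # hypothetical output prefix

# The f-string substitutes {output} now and keeps the shard/total placeholders.
pattern = f"{output}/data-{{shard:05d}}-of-{{total:05d}}.jsonl.gz"
print(pattern)  # gs://bucket/wiki/data-{shard:05d}-of-{total:05d}.jsonl.gz

# Filling the remaining placeholders for the first of four shards.
first = pattern.format(shard=0, total=4)
print(first)    # gs://bucket/wiki/data-00000-of-00004.jsonl.gz
```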

Dataset Sampling:

from zephyr import Dataset, ZephyrContext

ctx = ZephyrContext(max_workers=1000)
pipeline = (
    Dataset.from_files(input_path, "**/*.jsonl.gz")
    .map(lambda path: sample_file(path, weights))
    .write_jsonl(f"{output}/sampled-{{shard:05d}}.jsonl.gz")
)
ctx.execute(pipeline)

Parallel Downloads:

from zephyr import Dataset, ZephyrContext

tasks = [(config, fs, src, dst) for src, dst in file_pairs]
ctx = ZephyrContext(max_workers=32)
pipeline = Dataset.from_list(tasks).map(lambda t: download(*t))
ctx.execute(pipeline)
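
For intuition about what `max_workers=32` bounds, the same fan-out can be approximated on a single machine with the standard library. This is only an analogy, not how Zephyr schedules work; `download` here is a hypothetical stand-in that just returns a string, and `run_bounded` is an illustrative helper.

```python
from concurrent.futures import ThreadPoolExecutor


def download(config, src, dst):
    """Hypothetical stand-in for a real download; returns a description of the copy."""
    return f"{src} -> {dst} ({config})"


def run_bounded(tasks, max_workers):
    """Run download(*task) for every task with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even if tasks finish out of order.
        return list(pool.map(lambda t: download(*t), tasks))


file_pairs = [("a.txt", "out/a.txt"), ("b.txt", "out/b.txt")]
tasks = [("cfg", src, dst) for src, dst in file_pairs]
print(run_bounded(tasks, max_workers=2))
```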

Installation

# From Marin monorepo
uv sync

# Standalone
cd lib/zephyr
uv pip install -e .

Running Tests

Zephyr tests run against multiple execution backends to ensure correctness across different environments.

All Tests on Both Backends (Default)

uv run pytest lib/zephyr/tests
# Runs all tests on both Local and Iris backends
# Local Iris cluster is started automatically via ClusterManager

Run Specific Backend Only

uv run pytest lib/zephyr/tests -k "local"
uv run pytest lib/zephyr/tests -k "iris"

The Iris cluster is started once per test session and reused across all tests for efficiency.

Design

Zephyr consolidates the ad-hoc distributed and Hugging Face dataset processing patterns used across Marin into a single, simple abstraction.

Key Features:

Lazy evaluation with operation fusion
Disk-based inter-stage data flow for low memory footprint
Chunk-by-chunk streaming to minimize memory pressure
Distributed execution with bounded parallelism (Iris/local backends)
Automatic chunking to prevent large object overhead
fsspec integration (GCS, S3, local)
Type-safe operation chaining
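
The combination of disk-based inter-stage data flow and chunk-by-chunk streaming can be illustrated with a small standard-library sketch. It is not Zephyr's implementation (see AGENTS.md for that); `write_chunks`, `read_chunks`, the chunk file naming, and the JSONL chunk format are all assumptions made for the example. The point is that one stage spills fixed-size chunks to disk and the next stage streams them back, so neither stage holds the full dataset in memory.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path


def write_chunks(items, directory, chunk_size):
    """Stage 1: consume items lazily and spill them to disk in fixed-size chunks."""
    it = iter(items)
    paths = []
    while chunk := list(islice(it, chunk_size)):
        path = Path(directory) / f"chunk-{len(paths):05d}.jsonl"
        path.write_text("".join(json.dumps(x) + "\n" for x in chunk), encoding="utf-8")
        paths.append(path)
    return paths


def read_chunks(paths):
    """Stage 2: stream records back one line at a time, one chunk file at a time."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    chunk_paths = write_chunks(({"n": i} for i in range(5)), tmp, chunk_size=2)
    num_chunks = len(chunk_paths)
    doubled = [row["n"] * 2 for row in read_chunks(chunk_paths)]

print(num_chunks, doubled)  # 3 [0, 2, 4, 6, 8]
```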

See AGENTS.md for execution internals and source layout.

marin-zephyr 0.2.4.dev202606011101
