Rainbear

Build lazy Zarr scans with Polars.

A Python + Rust experiment in lazily scanning Zarr stores into Polars, with an API inspired by xarray’s coordinate-based selection.

This repo currently contains:

  • A first-pass scan_zarr(...) that streams a Zarr store using the Rust zarrs crate and yields Polars LazyFrames.
  • A test suite that compares rainbear against xarray for various Zarr datasets and filter conditions.

Status / caveats

  • Zarr v3 only: the Rust backend uses the zarrs crate, which targets Zarr v3, so Zarr v2 stores are unlikely to work.
  • Tidy table output: scan_zarr currently emits a “tidy” DataFrame with one row per element (see the sketch after this list) and columns:
    • dimension/coord columns (e.g. time, lat)
    • variable columns (e.g. temp)
  • Predicate pushdown:
    • Rust attempts to compile a limited subset of predicates (simple comparisons on coord columns combined with &) for chunk pruning.
    • If Polars Expr deserialization fails (typically because the Python-side and Rust-side Polars ABI/serde versions don’t match), scan_zarr automatically falls back to Python-side filtering (correct but slower).
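
For concreteness, here is a minimal sketch of the tidy shape with hypothetical values (not output from a real store): a dataset with dims time and lat and one variable temp flattens to one row per (time, lat) element.

import polars as pl

# Hypothetical 2x2 dataset: dims time=[0, 1] and lat=[10.0, 20.0], variable temp.
# scan_zarr(...).collect() on such a store would produce an equivalent tidy frame:
tidy = pl.DataFrame(
    {
        "time": [0, 0, 1, 1],
        "lat": [10.0, 20.0, 10.0, 20.0],
        "temp": [280.1, 281.5, 279.8, 282.0],
    }
)
print(tidy.shape)  # (4, 3): one row per element, one column per dim/variable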

Quickstart (uv)

The project is configured as a maturin extension module.

  • Run a quick import check:
uv run --with polars python -c "import rainbear; print(rainbear.print_extension_info())"
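
  • Rebuild the extension after changing Rust code (a sketch, assuming a standard maturin setup with a uv-managed .venv in the project root; adjust to your workflow):
uvx maturin develop --release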

Using scan_zarr

import polars as pl
import rainbear

lf = rainbear.scan_zarr("/path/to/data.zarr")

# Filter the LazyFrame (predicates are pushed down and used for chunk pruning)
lf = lf.filter((pl.col("lat") >= 32.0) & (pl.col("lat") <= 52.0))

df = lf.collect()
print(df)
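
Per the pushdown caveats above, predicates outside the supported subset (for example, disjunctions with |, or comparisons on variable columns) are not compiled for chunk pruning; the filter is still applied to the streamed rows, so results stay correct, just slower. A sketch:

lf = rainbear.scan_zarr("/path/to/data.zarr")

# `|` and the variable-column comparison fall outside the pushdown subset,
# so no chunks are pruned; the filter still runs on the decoded rows
df = lf.filter((pl.col("lat") < 10.0) | (pl.col("temp") > 300.0)).collect()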

Caching Backends

Rainbear provides three backend classes that own the store connection and cache metadata and coordinate chunks across scans, which can dramatically improve performance for repeated queries on the same dataset.

ZarrBackend (Async)

The async caching backend for standard Zarr stores. Best for cloud storage (S3, GCS, Azure) where async I/O provides significant performance benefits.

Features:

  • Persistent caching of coordinate array chunks and metadata across scans
  • Async I/O with configurable concurrency for parallel chunk reads
  • Compatible with any ObjectStore (S3, GCS, Azure, HTTP, local filesystem)
  • Cache statistics and management (clear cache, view stats)

When to use:

  • Cloud-based Zarr stores where network latency dominates
  • Applications already using async/await patterns
  • High-concurrency workloads with many simultaneous chunk reads

import polars as pl
from datetime import datetime
import rainbear

# Create backend from URL
backend = rainbear.ZarrBackend.from_url("s3://bucket/dataset.zarr")

# First scan - reads and caches coordinates
df1 = await backend.scan_zarr_async(pl.col("time") > datetime(2024, 1, 1))

# Second scan - reuses cached coordinates (much faster!)
df2 = await backend.scan_zarr_async(pl.col("time") > datetime(2024, 6, 1))

# Check what's cached
stats = await backend.cache_stats()
print(f"Cached {stats['coord_entries']} coordinate chunks")

# Clear cache if needed
await backend.clear_coord_cache()
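
For high-concurrency workloads, one way to exploit the async API is to fan several scans out with asyncio.gather; a minimal sketch, assuming a single backend can safely serve concurrent scans:

import asyncio
from datetime import datetime

import polars as pl
import rainbear

async def main():
    backend = rainbear.ZarrBackend.from_url("s3://bucket/dataset.zarr")
    # Concurrent scans share the backend's coordinate/metadata cache, so only
    # the first reader of each coordinate chunk pays the I/O cost
    df1, df2, df3 = await asyncio.gather(
        backend.scan_zarr_async(pl.col("time") > datetime(2024, 1, 1)),
        backend.scan_zarr_async(pl.col("time") > datetime(2024, 6, 1)),
        backend.scan_zarr_async(pl.col("lat") < 45.0),
    )

asyncio.run(main())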

ZarrBackendSync (Sync)

The synchronous caching backend for standard Zarr stores. Best for local filesystem access or simpler synchronous codebases.

Features:

  • Same persistent caching as ZarrBackend (coordinates and metadata)
  • Synchronous API - no async/await required
  • Blocking I/O suitable for local or low-latency stores
  • Additional options: column selection, row limits, batch size control

When to use:

  • Local filesystem Zarr stores
  • Synchronous applications or scripts
  • Interactive data exploration (notebooks, REPL)
  • When you don't need async concurrency

import polars as pl
from datetime import datetime
import rainbear

# Create backend from a URL or local path
backend = rainbear.ZarrBackendSync.from_url("/path/to/local/dataset.zarr")

# Scan with column selection and row limit
df1 = backend.scan_zarr_sync(
    predicate=pl.col("time") > datetime(2024, 1, 1),
    with_columns=["temp", "pressure"],
    n_rows=1000
)

# Second scan reuses cached coordinates
df2 = backend.scan_zarr_sync(pl.col("time") > datetime(2024, 6, 1))

# No await needed for cache operations in sync backend
stats = backend.cache_stats()
backend.clear_coord_cache()

IcechunkBackend (Async, Version Control)

The async-only caching backend for Icechunk-backed Zarr stores. Icechunk adds Git-like version control to Zarr datasets, enabling branches, commits, and time-travel queries.

Features:

  • Same persistent caching as ZarrBackend (coordinates and metadata)
  • Access to versioned Zarr data with branch/snapshot support
  • Direct integration with icechunk-python Session objects
  • Async-only (Icechunk operations are inherently async)

When to use:

  • Working with version-controlled Zarr datasets
  • Need to query specific branches or historical snapshots
  • Collaborative workflows with multiple dataset versions
  • Reproducible analysis requiring exact dataset versions

import polars as pl
from datetime import datetime
import rainbear

# Create backend from Icechunk filesystem repository
backend = await rainbear.IcechunkBackend.from_filesystem(
    path="/path/to/icechunk/repo",
    branch="main"  # or specific branch name
)

# Scan like normal - caching works the same
df1 = await backend.scan_zarr_async(pl.col("time") > datetime(2024, 1, 1))
df2 = await backend.scan_zarr_async(pl.col("time") > datetime(2024, 6, 1))

# Or use existing Icechunk session directly
from icechunk import Repository, local_filesystem_storage

storage = local_filesystem_storage("/path/to/repo")
repo = Repository.open(storage)
session = repo.readonly_session("experimental-branch")

# No manual serialization needed!
backend = await rainbear.IcechunkBackend.from_session(session)
df = await backend.scan_zarr_async(pl.col("lat") < 45.0)
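
Historical snapshots work the same way as branches; below is a sketch of a time-travel query, assuming your icechunk release exposes Repository.ancestry(...) and readonly_session(snapshot_id=...) (check the icechunk docs for your version):

from icechunk import Repository, local_filesystem_storage
import polars as pl
import rainbear

storage = local_filesystem_storage("/path/to/repo")
repo = Repository.open(storage)

# Walk main's history and open a session at the parent of the latest commit
snapshots = list(repo.ancestry(branch="main"))
session = repo.readonly_session(snapshot_id=snapshots[1].id)

backend = await rainbear.IcechunkBackend.from_session(session)
df = await backend.scan_zarr_async(pl.col("lat") < 45.0)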

Backend Comparison

Feature            ZarrBackend                      ZarrBackendSync            IcechunkBackend
API Style          Async                            Sync                       Async
Caching            ✓ Coordinates & metadata         ✓ Coordinates & metadata   ✓ Coordinates & metadata
Best For           Cloud storage (S3, GCS, Azure)   Local filesystem           Version-controlled datasets
Concurrency        High (configurable)              Single-threaded            High (configurable)
Version Control    ✗                                ✗                          ✓ (branches, snapshots)
Column Selection   ✗                                ✓                          ✗
Row Limits         ✗                                ✓                          ✗

Running the smoke tests

The Python tests create some local Zarr stores and then scan them.

From the workspace root:

cd rainbear-tests
uv run pytest

Development

To run the Rust tests:

cargo test

To run the Python tests:

uv run pytest

Profiling:

samply record -- uv run python -m pytest tests/test_benchmark_novel_queries.py -m 'benchmark' --no-header -rN

Roadmap

Near Term

  • Geospatial support via ewkb and polars-st
  • Interpolation support
  • Tests against cloud storage backends
  • Benchmarks
  • Documentation

Longer Term

  • Deeper integration with Polars' lazy engine to take fuller advantage of its query optimizer
  • Smarter caching (basic coordinate/metadata caching has landed; see Caching Backends above)
  • Writing to zarr?
  • Capability to work with datatrees
  • Allow output to arrow/pandas/etc.
  • Deeper Icechunk support (an initial IcechunkBackend has landed; see above)
  • Zarr V2 support (backwards compatibility)

Code map

  • Rust extension module: rainbear/src/lib.rs exports _core
  • Zarr store opener (multi-backend URLs): rainbear/src/zarr_store.rs
  • Metadata loader (dims/coords/vars + schema): rainbear/src/zarr_meta.rs
  • Streaming IO source: rainbear/src/zarr_source.rs (exposed to Python as ZarrBackendSync)
  • Python API: rainbear/src/rainbear/__init__.py (scan_random, scan_zarr, ZarrBackendSync)
  • Tests: rainbear-tests/tests/ (separate workspace package)
