Rainbear

Build lazy Zarr scans with Polars.

A Python + Rust experiment in lazily scanning Zarr stores into Polars, with an API inspired by xarray’s coordinate-based selection.

This repo currently contains:

  • A first-pass scan_zarr(...) that streams a Zarr store using the Rust zarrs crate and yields Polars LazyFrames.
  • A test suite that compares rainbear against xarray for various Zarr datasets and filter conditions.

Status / caveats

  • Zarr v3 only: the Rust backend uses the zarrs crate, which targets Zarr v3, so Zarr v2 stores are unlikely to work.
  • Tidy table output: scan_zarr currently emits a “tidy” DataFrame with one row per element (see the sketch after this list) and columns:
    • dimension/coord columns (e.g. time, lat)
    • variable columns (e.g. temp)
  • Predicate pushdown:
    • Rust attempts to compile a limited subset of predicates (simple comparisons on coord columns combined with &) for chunk pruning.
    • If Polars Expr deserialization fails (typically because the Python-side and Rust-side Polars ABI/serde versions don’t match), scan_zarr automatically falls back to Python-side filtering (correct but slower).
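
For concreteness, here is a minimal sketch of the tidy shape with hypothetical values (not output from a real store): a dataset with dims time and lat and one variable temp flattens to one row per (time, lat) element.

import polars as pl

# Hypothetical 2x2 dataset: dims time=[0, 1] and lat=[10.0, 20.0], variable temp.
# scan_zarr(...).collect() on such a store would produce an equivalent tidy frame:
tidy = pl.DataFrame(
    {
        "time": [0, 0, 1, 1],
        "lat": [10.0, 20.0, 10.0, 20.0],
        "temp": [280.1, 281.5, 279.8, 282.0],
    }
)
print(tidy.shape)  # (4, 3): one row per element, one column per dim/variable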

Quickstart (uv)

The project is configured as a maturin extension module.

  • Run a quick import check:
uv run --with polars python -c "import rainbear; print(rainbear.print_extension_info())"
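
  • Rebuild the extension after changing Rust code (a sketch, assuming a standard maturin setup with a uv-managed .venv in the project root; adjust to your workflow):
uvx maturin develop --release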

Using scan_zarr

import polars as pl
import rainbear

lf = rainbear.scan_zarr("/path/to/data.zarr")

# Filter the LazyFrame (predicates are pushed down and used for chunk pruning)
lf = lf.filter((pl.col("lat") >= 32.0) & (pl.col("lat") <= 52.0))

df = lf.collect()
print(df)
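
Per the pushdown caveats above, predicates outside the supported subset (for example, disjunctions with |, or comparisons on variable columns) are not compiled for chunk pruning; the filter is still applied to the streamed rows, so results stay correct, just slower. A sketch:

lf = rainbear.scan_zarr("/path/to/data.zarr")

# `|` and the variable-column comparison fall outside the pushdown subset,
# so no chunks are pruned; the filter still runs on the decoded rows
df = lf.filter((pl.col("lat") < 10.0) | (pl.col("temp") > 300.0)).collect()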

Caching Backends

Rainbear provides three backend classes that own the store connection and cache metadata and coordinate chunks across scans, which can dramatically improve performance for repeated queries on the same dataset.

ZarrBackend (Async)

The async caching backend for standard Zarr stores. Best for cloud storage (S3, GCS, Azure) where async I/O provides significant performance benefits.

Features:

  • Persistent caching of coordinate array chunks and metadata across scans
  • Async I/O with configurable concurrency for parallel chunk reads
  • Compatible with any ObjectStore (S3, GCS, Azure, HTTP, local filesystem)
  • Cache statistics and management (clear cache, view stats)

When to use:

  • Cloud-based Zarr stores where network latency dominates
  • Applications already using async/await patterns
  • High-concurrency workloads with many simultaneous chunk reads

import polars as pl
from datetime import datetime
import rainbear

# Create backend from URL
backend = rainbear.ZarrBackend.from_url("s3://bucket/dataset.zarr")

# First scan - reads and caches coordinates
df1 = await backend.scan_zarr_async(pl.col("time") > datetime(2024, 1, 1))

# Second scan - reuses cached coordinates (much faster!)
df2 = await backend.scan_zarr_async(pl.col("time") > datetime(2024, 6, 1))

# Check what's cached
stats = await backend.cache_stats()
print(f"Cached {stats['coord_entries']} coordinate chunks")

# Clear cache if needed
await backend.clear_coord_cache()
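
For high-concurrency workloads, one way to exploit the async API is to fan several scans out with asyncio.gather; a minimal sketch, assuming a single backend can safely serve concurrent scans:

import asyncio
from datetime import datetime

import polars as pl
import rainbear

async def main():
    backend = rainbear.ZarrBackend.from_url("s3://bucket/dataset.zarr")
    # Concurrent scans share the backend's coordinate/metadata cache, so only
    # the first reader of each coordinate chunk pays the I/O cost
    df1, df2, df3 = await asyncio.gather(
        backend.scan_zarr_async(pl.col("time") > datetime(2024, 1, 1)),
        backend.scan_zarr_async(pl.col("time") > datetime(2024, 6, 1)),
        backend.scan_zarr_async(pl.col("lat") < 45.0),
    )

asyncio.run(main())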

ZarrBackendSync (Sync)

The synchronous caching backend for standard Zarr stores. Best for local filesystem access or simpler synchronous codebases.

Features:

  • Same persistent caching as ZarrBackend (coordinates and metadata)
  • Synchronous API - no async/await required
  • Blocking I/O suitable for local or low-latency stores
  • Additional options: column selection, row limits, batch size control

When to use:

  • Local filesystem Zarr stores
  • Synchronous applications or scripts
  • Interactive data exploration (notebooks, REPL)
  • When you don't need async concurrency

import polars as pl
from datetime import datetime
import rainbear

# Create backend from a URL or local path
backend = rainbear.ZarrBackendSync.from_url("/path/to/local/dataset.zarr")

# Scan with column selection and row limit
df1 = backend.scan_zarr_sync(
    predicate=pl.col("time") > datetime(2024, 1, 1),
    with_columns=["temp", "pressure"],
    n_rows=1000
)

# Second scan reuses cached coordinates
df2 = backend.scan_zarr_sync(pl.col("time") > datetime(2024, 6, 1))

# No await needed for cache operations in sync backend
stats = backend.cache_stats()
backend.clear_coord_cache()

IcechunkBackend (Async, Version Control)

The async-only caching backend for Icechunk-backed Zarr stores. Icechunk adds Git-like version control to Zarr datasets, enabling branches, commits, and time-travel queries.

Features:

  • Same persistent caching as ZarrBackend (coordinates and metadata)
  • Access to versioned Zarr data with branch/snapshot support
  • Direct integration with icechunk-python Session objects
  • Async-only (Icechunk operations are inherently async)

When to use:

  • Working with version-controlled Zarr datasets
  • Need to query specific branches or historical snapshots
  • Collaborative workflows with multiple dataset versions
  • Reproducible analysis requiring exact dataset versions

import polars as pl
from datetime import datetime
import rainbear

# Create backend from Icechunk filesystem repository
backend = await rainbear.IcechunkBackend.from_filesystem(
    path="/path/to/icechunk/repo",
    branch="main"  # or specific branch name
)

# Scan like normal - caching works the same
df1 = await backend.scan_zarr_async(pl.col("time") > datetime(2024, 1, 1))
df2 = await backend.scan_zarr_async(pl.col("time") > datetime(2024, 6, 1))

# Or use existing Icechunk session directly
from icechunk import Repository, local_filesystem_storage

storage = local_filesystem_storage("/path/to/repo")
repo = Repository.open(storage)
session = repo.readonly_session("experimental-branch")

# No manual serialization needed!
backend = await rainbear.IcechunkBackend.from_session(session)
df = await backend.scan_zarr_async(pl.col("lat") < 45.0)
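
Historical snapshots work the same way as branches; below is a sketch of a time-travel query, assuming your icechunk release exposes Repository.ancestry(...) and readonly_session(snapshot_id=...) (check the icechunk docs for your version):

from icechunk import Repository, local_filesystem_storage
import polars as pl
import rainbear

storage = local_filesystem_storage("/path/to/repo")
repo = Repository.open(storage)

# Walk main's history and open a session at the parent of the latest commit
snapshots = list(repo.ancestry(branch="main"))
session = repo.readonly_session(snapshot_id=snapshots[1].id)

backend = await rainbear.IcechunkBackend.from_session(session)
df = await backend.scan_zarr_async(pl.col("lat") < 45.0)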

Backend Comparison

Feature            ZarrBackend                      ZarrBackendSync            IcechunkBackend
API Style          Async                            Sync                       Async
Caching            ✓ Coordinates & metadata         ✓ Coordinates & metadata   ✓ Coordinates & metadata
Best For           Cloud storage (S3, GCS, Azure)   Local filesystem           Version-controlled datasets
Concurrency        High (configurable)              Single-threaded            High (configurable)
Version Control    ✗                                ✗                          ✓ (branches, snapshots)
Column Selection   ✗                                ✓                          ✗
Row Limits         ✗                                ✓                          ✗

Running the smoke tests

The Python tests create some local Zarr stores and then scan them.

From the workspace root:

cd rainbear-tests
uv run pytest

Development

To run the Rust tests:

cargo test

To run the Python tests:

uv run pytest

Profiling:

samply record -- uv run python -m pytest tests/test_benchmark_novel_queries.py -m 'benchmark' --no-header -rN

Roadmap

Near Term

  • Geospatial support via ewkb and polars-st
  • Interpolation support
  • Tests against cloud storage backends
  • Benchmarks
  • Documentation

Longer Term

  • Deeper integration with Polars' lazy engine to take fuller advantage of its query optimizer
  • Smarter caching (basic coordinate/metadata caching has landed; see Caching Backends above)
  • Writing to zarr?
  • Capability to work with datatrees
  • Allow output to arrow/pandas/etc.
  • Deeper Icechunk support (an initial IcechunkBackend has landed; see above)
  • Zarr V2 support (backwards compatibility)

Code map

  • Rust extension module: rainbear/src/lib.rs exports _core
  • Zarr store opener (multi-backend URLs): rainbear/src/zarr_store.rs
  • Metadata loader (dims/coords/vars + schema): rainbear/src/zarr_meta.rs
  • Streaming IO source: rainbear/src/zarr_source.rs (exposed to Python as ZarrBackendSync)
  • Python API: rainbear/src/rainbear/__init__.py (scan_random, scan_zarr, ZarrBackendSync)
  • Tests: rainbear-tests/tests/ (separate workspace package)
