
zarrio

A modern, clean library for converting scientific data formats to Zarr.

Overview

zarrio is a rewrite of the original onzarr library with a focus on simplicity, performance, and maintainability. It leverages modern xarray and zarr capabilities to convert NetCDF and other scientific data formats to Zarr efficiently.

Features

  • Simple API: Clean, intuitive interfaces for common operations
  • Efficient Conversion: Fast conversion of NetCDF to Zarr format
  • Data Packing: Compress data using fixed-scale offset encoding
  • Intelligent Chunking: Automatic chunking recommendations based on access patterns (temporal, spatial, balanced), applied to the full archive dimensions when creating parallel archives
  • Compression: Support for various compression algorithms
  • Time Series Handling: Efficient handling of time-series data
  • Data Appending: Append new data to existing Zarr archives
  • Parallel Writing: Create template archives and write regions in parallel with intelligent chunking
  • Metadata Preservation: Maintain dataset metadata during conversion

Installation

pip install zarrio

Usage

Command Line Interface

# Convert NetCDF to Zarr
zarrio convert input.nc output.zarr

# Convert with chunking
zarrio convert input.nc output.zarr --chunking "time:100,lat:50,lon:100"

# Convert with compression
zarrio convert input.nc output.zarr --compression "blosc:zstd:3"

# Convert with data packing
zarrio convert input.nc output.zarr --packing --packing-bits 16

# Convert with manual packing ranges
zarrio convert input.nc output.zarr --packing \
    --packing-manual-ranges '{"temperature": {"min": -50, "max": 50}}'

# Analyze NetCDF file for optimization recommendations
zarrio analyze input.nc

# Analyze with theoretical performance testing
zarrio analyze input.nc --test-performance

# Analyze with actual performance testing
zarrio analyze input.nc --run-tests

# Analyze with interactive configuration setup
zarrio analyze input.nc --interactive

# Create template for parallel writing
zarrio create-template template.nc archive.zarr --global-start 2023-01-01 --global-end 2023-12-31

# Create template with intelligent chunking
zarrio create-template template.nc archive.zarr --global-start 2023-01-01 --global-end 2023-12-31 --intelligent-chunking --access-pattern temporal

# Write region to existing archive
zarrio write-region data.nc archive.zarr

# Append to existing Zarr store
zarrio append new_data.nc existing.zarr

Python API

from zarrio import convert_to_zarr, append_to_zarr, ZarrConverter

# Simple conversion
convert_to_zarr("input.nc", "output.zarr")

# Conversion with options
convert_to_zarr(
    "input.nc",
    "output.zarr",
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression="blosc:zstd:3",
    packing=True,
    packing_bits=16,
    packing_manual_ranges={
        "temperature": {"min": -50, "max": 50}
    },
    packing_auto_buffer_factor=0.05
)

# Using the class-based interface
converter = ZarrConverter(
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression="blosc:zstd:3",
    packing=True,
    packing_manual_ranges={
        "temperature": {"min": -50, "max": 50}
    }
)
converter.convert("input.nc", "output.zarr")

# Parallel writing workflow
# 1. Create template archive (template_ds is an xarray.Dataset,
#    e.g. template_ds = xr.open_dataset("template.nc"))
converter.create_template(
    template_dataset=template_ds,
    output_path="archive.zarr",
    global_start="2023-01-01",
    global_end="2023-12-31",
    compute=False  # Metadata only
)

# 2. Write regions in parallel (in separate processes)
converter.write_region("data1.nc", "archive.zarr")
converter.write_region("data2.nc", "archive.zarr")
converter.write_region("data3.nc", "archive.zarr")

# Append to existing Zarr store
append_to_zarr("new_data.nc", "existing.zarr")
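
The packing options above use fixed-scale offset encoding: floats are stored as small integers together with a per-variable scale and offset. The sketch below shows the standard arithmetic for a 16-bit packed variable; it is illustrative only, and zarrio's internal implementation may differ in detail.

import numpy as np

# Illustrative fixed scale/offset arithmetic (not zarrio's internal code).
def pack(values, vmin, vmax, bits=16):
    n = 2**bits - 2                          # reserve one code for the fill value
    scale = (vmax - vmin) / n
    packed = np.round((values - vmin) / scale).astype(np.uint16)
    return packed, scale, vmin               # vmin doubles as the add_offset

def unpack(packed, scale, offset):
    return packed * scale + offset

data = np.array([-50.0, 0.0, 49.99])
packed, scale, offset = pack(data, -50.0, 50.0)
restored = unpack(packed, scale, offset)     # accurate to about scale/2 ~ 0.0008

Each value is reproduced to within half a scale step, which is why narrower manual ranges give better precision for the same bit width.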

Parallel Writing

One of the key features of zarrio is support for parallel writing of large datasets:

# Step 1: Create template archive with intelligent chunking
converter = ZarrConverter(
    chunking={"time": 100, "lat": 50, "lon": 100},
    access_pattern="temporal"  # Optimize for time series analysis
)
converter.create_template(
    template_dataset=template_dataset,  # an xarray.Dataset defining the archive structure
    output_path="large_archive.zarr",
    global_start="2020-01-01",
    global_end="2023-12-31",
    compute=False,  # Metadata only, no data computation
    intelligent_chunking=True,  # Enable intelligent chunking based on full archive dimensions
    access_pattern="temporal"   # Optimize for time series analysis
)

# Step 2: Write regions in parallel processes
# Process 1: converter.write_region("file1.nc", "large_archive.zarr")
# Process 2: converter.write_region("file2.nc", "large_archive.zarr")
# Process 3: converter.write_region("file3.nc", "large_archive.zarr")

This approach is ideal for converting large numbers of NetCDF files into a single Zarr archive in parallel. With intelligent chunking enabled, chunk sizes are derived from the full archive dimensions rather than just the template dataset.
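
As a concrete driver, the sketch below fans the region writes out with Python's multiprocessing module. It is a minimal example assuming each input file covers a distinct slice of the template's time axis:

from multiprocessing import Pool

from zarrio import ZarrConverter

# Minimal parallel driver sketch: each worker process writes one file's
# region into the pre-created template archive.
def write_one(path):
    ZarrConverter().write_region(path, "large_archive.zarr")

if __name__ == "__main__":
    files = ["file1.nc", "file2.nc", "file3.nc"]
    with Pool(processes=len(files)) as pool:
        pool.map(write_one, files)

In practice the workers could equally be separate batch jobs on a cluster; because the template already defines the archive layout, writers only need to cover disjoint regions and never coordinate with each other.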

Rolling Archive

Manage forecast-cycle archives with automatic cleanup of old data according to a time-based retention window.

What is a Rolling Archive?

A rolling archive automatically removes old forecast cycles from your Zarr store based on a configurable retention window. This is useful when:

  • You run forecasts multiple times per day
  • You want to keep only the last N hours/days of data
  • You need to manage disk space or datamesh storage

Python API

from zarrio import ZarrConverter
from datetime import timedelta

# Configure rolling archive
converter = ZarrConverter(
    rolling_archive={
        "enabled": True,
        "retention_window": timedelta(hours=24),  # Keep last 24 hours
        "min_groups_to_keep": 4,  # Always keep at least 4 cycles
        "auto_cleanup": True,  # Cleanup after each write
    }
)

# Write data - cleanup happens automatically after each write
converter.convert("forecast.nc", "archive.zarr", group="cycle/20240101T000000")

Manual Cleanup

You can also trigger cleanup manually:

# Dry run - see what would be deleted without making changes
result = converter.cleanup_archive("archive.zarr", dry_run=True)
print(f"Would delete: {len(result['deleted'])} groups")
for g in result['deleted']:
    print(f"  - {g}")
print(f"Would keep: {len(result['kept'])} groups")

# Actual cleanup
result = converter.cleanup_archive("archive.zarr")
print(f"Deleted: {len(result['deleted'])} groups")
print(f"Kept: {len(result['kept'])} groups")
if result['skipped']:
    print(f"Skipped: {len(result['skipped'])} groups (unparseable timestamp)")

CLI Usage

Enable rolling archive via command line:

# Convert with 24-hour retention
zarrio convert forecast.nc archive.zarr --rolling-archive-hours 24

# This enables automatic cleanup after each write

Configuration Options

Option               Type       Default        Description
enabled              bool       False          Enable automatic rolling archive cleanup
retention_window     timedelta  None           How long to keep data (minimum 1 hour)
time_reference_attr  str        'cycle_time'   Name of the attribute containing the cycle timestamp
auto_cleanup         bool       True           Clean up automatically after each write
min_groups_to_keep   int        1              Minimum number of groups to always preserve
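
For example, a configuration that keeps at least 8 cycles over a 48-hour window and reads the cycle timestamp from a custom attribute might look like the sketch below (the attribute name "run_time" is hypothetical; substitute whatever attribute your groups actually carry):

from datetime import timedelta

from zarrio import ZarrConverter

# Sketch combining the options above; "run_time" is a hypothetical
# attribute name used only for illustration.
converter = ZarrConverter(
    rolling_archive={
        "enabled": True,
        "retention_window": timedelta(hours=48),
        "time_reference_attr": "run_time",
        "min_groups_to_keep": 8,
        "auto_cleanup": False,  # call cleanup_archive() manually instead
    }
)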

Best Practices

  1. Use dry_run first to verify what will be deleted before actual cleanup
  2. Set min_groups_to_keep to prevent accidental total deletion
  3. Use time-based retention (not cycle count) for predictable behavior
  4. Monitor cleanup logs to ensure it's working as expected

See examples/rolling_archive_demo.py for a complete working example.

Configuration

You can also use configuration files (YAML or JSON):

# config.yaml
chunking:
  time: 100
  lat: 50
  lon: 100
compression: "blosc:zstd:3"
packing:
  enabled: true
  bits: 16
  manual_ranges:
    temperature:
      min: -50
      max: 50
  auto_buffer_factor: 0.05
variables:
  - temperature
  - pressure
drop_variables:
  - unused_var

Then use it with the CLI:

zarrio convert input.nc output.zarr --config config.yaml
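
The same file can be reused from Python by mapping its keys onto the keyword arguments documented in the Python API section above. This is a manual-mapping sketch that assumes PyYAML is installed; whether zarrio also accepts a config file directly in the Python API is not covered here, and the variables and drop_variables keys are omitted because their Python-side argument names are not shown above:

import yaml  # requires PyYAML

from zarrio import convert_to_zarr

# Manual-mapping sketch: flatten the nested YAML keys onto the documented
# convert_to_zarr() keyword arguments.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

packing = cfg.get("packing", {})
convert_to_zarr(
    "input.nc",
    "output.zarr",
    chunking=cfg.get("chunking"),
    compression=cfg.get("compression"),
    packing=packing.get("enabled", False),
    packing_bits=packing.get("bits", 16),
    packing_manual_ranges=packing.get("manual_ranges"),
    packing_auto_buffer_factor=packing.get("auto_buffer_factor", 0.05),
)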

Development

Installation

git clone https://github.com/oceanum/zarrio.git
cd zarrio
pip install -e .

Running Tests

pip install -e ".[dev]"
pytest

Code Quality

# Format code
black .

# Check code style
flake8

# Type checking
mypy zarrio

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
