
zarrio

A modern, clean library for converting scientific data formats to Zarr format.

Overview

zarrio is a rewrite of the original onzarr library with a focus on simplicity, performance, and maintainability. It leverages modern xarray and zarr capabilities to convert NetCDF and other scientific data formats to Zarr efficiently.

Features

  • Simple API: Clean, intuitive interfaces for common operations
  • Efficient Conversion: Fast conversion of NetCDF to Zarr format
  • Data Packing: Compress data using fixed-scale offset encoding (see the sketch after this list)
  • Intelligent Chunking: Automatic chunking recommendations based on access patterns (temporal, spatial, balanced), with chunk sizes computed from the full archive dimensions when building parallel archives
  • Compression: Support for various compression algorithms
  • Time Series Handling: Efficient handling of time-series data
  • Data Appending: Append new data to existing Zarr archives
  • Parallel Writing: Create a template archive, then write regions to it from parallel processes
  • Metadata Preservation: Maintain dataset metadata during conversion
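
Fixed-scale offset encoding, used by the packing feature, maps each float onto an n-bit integer grid via a per-variable scale and offset, which is what lets the packed data compress so well. The following is a minimal sketch of the general technique, not zarrio's internal implementation; in particular, reserving one integer code for a fill value is an assumption:

import numpy as np

# Pack floats in [vmin, vmax] onto a 16-bit integer grid.
def pack(values, vmin, vmax, bits=16):
    # Reserve one code so a fill value can mark missing data
    # (a common convention; zarrio's actual handling may differ).
    scale = (vmax - vmin) / (2**bits - 2)
    return np.round((values - vmin) / scale).astype(np.uint16), scale, vmin

def unpack(packed, scale, offset):
    return packed.astype(np.float64) * scale + offset

temps = np.array([-49.8, 0.0, 23.5, 49.9])
packed, scale, offset = pack(temps, vmin=-50, vmax=50)
restored = unpack(packed, scale, offset)  # accurate to within ~scale/2

With 16 bits over a 100-unit range, the quantization step is about 0.0015, well below typical sensor precision, while halving the bytes per value relative to float32 before compression even starts.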

Installation

pip install zarrio

Usage

Command Line Interface

# Convert NetCDF to Zarr
zarrio convert input.nc output.zarr

# Convert with chunking
zarrio convert input.nc output.zarr --chunking "time:100,lat:50,lon:100"

# Convert with compression
zarrio convert input.nc output.zarr --compression "blosc:zstd:3"

# Convert with data packing
zarrio convert input.nc output.zarr --packing --packing-bits 16

# Convert with manual packing ranges
zarrio convert input.nc output.zarr --packing \
    --packing-manual-ranges '{"temperature": {"min": -50, "max": 50}}'

# Analyze NetCDF file for optimization recommendations
zarrio analyze input.nc

# Analyze with theoretical performance testing
zarrio analyze input.nc --test-performance

# Analyze with actual performance testing
zarrio analyze input.nc --run-tests

# Analyze with interactive configuration setup
zarrio analyze input.nc --interactive

# Create template for parallel writing
zarrio create-template template.nc archive.zarr --global-start 2023-01-01 --global-end 2023-12-31

# Create template with intelligent chunking
zarrio create-template template.nc archive.zarr --global-start 2023-01-01 --global-end 2023-12-31 --intelligent-chunking --access-pattern temporal

# Write region to existing archive
zarrio write-region data.nc archive.zarr

# Append to existing Zarr store
zarrio append new_data.nc existing.zarr

Python API

from zarrio import convert_to_zarr, append_to_zarr, ZarrConverter

# Simple conversion
convert_to_zarr("input.nc", "output.zarr")

# Conversion with options
convert_to_zarr(
    "input.nc",
    "output.zarr",
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression="blosc:zstd:3",
    packing=True,
    packing_bits=16,
    packing_manual_ranges={
        "temperature": {"min": -50, "max": 50}
    },
    packing_auto_buffer_factor=0.05
)

# Using the class-based interface
converter = ZarrConverter(
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression="blosc:zstd:3",
    packing=True,
    packing_manual_ranges={
        "temperature": {"min": -50, "max": 50}
    }
)
converter.convert("input.nc", "output.zarr")

# Parallel writing workflow
# 1. Create template archive
converter.create_template(
    template_dataset=template_ds,
    output_path="archive.zarr",
    global_start="2023-01-01",
    global_end="2023-12-31",
    compute=False  # Metadata only
)

# 2. Write regions in parallel (in separate processes)
converter.write_region("data1.nc", "archive.zarr")
converter.write_region("data2.nc", "archive.zarr")
converter.write_region("data3.nc", "archive.zarr")

# Append to existing Zarr store
append_to_zarr("new_data.nc", "existing.zarr")
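
A quick way to sanity-check any of these conversions is to open the result with xarray; this uses plain xarray rather than a zarrio API:

import xarray as xr

ds = xr.open_zarr("output.zarr")
print(ds)  # dimensions, coordinates and attributes should match the source NetCDF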

Parallel Writing

One of the key features of zarrio is support for parallel writing of large datasets:

# Step 1: Create template archive with intelligent chunking
converter = ZarrConverter(
    chunking={"time": 100, "lat": 50, "lon": 100},
    access_pattern="temporal"  # Optimize for time series analysis
)
converter.create_template(
    template_dataset=template_dataset,
    output_path="large_archive.zarr",
    global_start="2020-01-01",
    global_end="2023-12-31",
    compute=False,  # Metadata only, no data computation
    intelligent_chunking=True,  # Enable intelligent chunking based on full archive dimensions
    access_pattern="temporal"   # Optimize for time series analysis
)

# Step 2: Write regions in parallel processes
# Process 1: converter.write_region("file1.nc", "large_archive.zarr")
# Process 2: converter.write_region("file2.nc", "large_archive.zarr")
# Process 3: converter.write_region("file3.nc", "large_archive.zarr")

This approach is well suited to converting large numbers of NetCDF files into a single Zarr archive in parallel. With intelligent chunking enabled, chunk sizes are derived from the full archive dimensions rather than from the template dataset alone.
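
The commented-out processes above can also be driven from a single script with the standard library. A minimal sketch, assuming write_region is safe to call from independent worker processes (as the workflow implies) and that ZarrConverter can be constructed with defaults once the template has fixed the chunking:

from concurrent.futures import ProcessPoolExecutor

from zarrio import ZarrConverter

def write_one(path):
    # One converter per worker process; the template created in step 1
    # already fixed the archive layout, so defaults are assumed to suffice.
    ZarrConverter().write_region(path, "large_archive.zarr")

if __name__ == "__main__":
    files = ["file1.nc", "file2.nc", "file3.nc"]
    with ProcessPoolExecutor(max_workers=3) as pool:
        list(pool.map(write_one, files))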

Configuration

You can also use configuration files (YAML or JSON):

# config.yaml
chunking:
  time: 100
  lat: 50
  lon: 100
compression: "blosc:zstd:3"
packing:
  enabled: true
  bits: 16
  manual_ranges:
    temperature:
      min: -50
      max: 50
  auto_buffer_factor: 0.05
variables:
  - temperature
  - pressure
drop_variables:
  - unused_var

Then use it with the CLI:

zarrio convert input.nc output.zarr --config config.yaml
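
The same file can also be loaded manually and mapped onto the keyword arguments shown in the Python API section. A sketch assuming PyYAML and the keyword names from that section (zarrio may ship its own config loader, which is not shown here; the variables and drop_variables keys are omitted):

import yaml  # PyYAML

from zarrio import convert_to_zarr

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

packing = cfg.get("packing") or {}
convert_to_zarr(
    "input.nc",
    "output.zarr",
    chunking=cfg.get("chunking"),
    compression=cfg.get("compression"),
    packing=packing.get("enabled", False),
    packing_bits=packing.get("bits", 16),
    packing_manual_ranges=packing.get("manual_ranges"),
    packing_auto_buffer_factor=packing.get("auto_buffer_factor", 0.05),
)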

Development

Installation

git clone https://github.com/oceanum/zarrio.git
cd zarrio
pip install -e .

Running Tests

pip install -e ".[dev]"
pytest

Code Quality

# Format code
black .

# Check code style
flake8

# Type checking
mypy zarrio

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
