
zarrio

A modern, clean library for converting scientific data formats to Zarr.

Overview

zarrio is a rewrite of the original onzarr library with a focus on simplicity, performance, and maintainability. It leverages modern xarray and zarr capabilities to convert NetCDF and other scientific data formats to Zarr efficiently.

Features

  • Simple API: Clean, intuitive interfaces for common operations
  • Efficient Conversion: Fast conversion of NetCDF to Zarr format
  • Data Packing: Compress data using fixed-scale offset encoding
  • Intelligent Chunking: Automatic chunking recommendations based on access patterns (temporal, spatial, balanced), applied to the full archive dimensions when creating parallel archives
  • Compression: Support for various compression algorithms
  • Time Series Handling: Efficient handling of time-series data
  • Data Appending: Append new data to existing Zarr archives
  • Parallel Writing: Create template archives and write regions in parallel with intelligent chunking
  • Metadata Preservation: Maintain dataset metadata during conversion

Installation

pip install zarrio

Usage

Command Line Interface

# Convert NetCDF to Zarr
zarrio convert input.nc output.zarr

# Convert with chunking
zarrio convert input.nc output.zarr --chunking "time:100,lat:50,lon:100"

# Convert with compression
zarrio convert input.nc output.zarr --compression "blosc:zstd:3"

# Convert with data packing
zarrio convert input.nc output.zarr --packing --packing-bits 16

# Convert with manual packing ranges
zarrio convert input.nc output.zarr --packing \
    --packing-manual-ranges '{"temperature": {"min": -50, "max": 50}}'

# Analyze NetCDF file for optimization recommendations
zarrio analyze input.nc

# Analyze with theoretical performance testing
zarrio analyze input.nc --test-performance

# Analyze with actual performance testing
zarrio analyze input.nc --run-tests

# Analyze with interactive configuration setup
zarrio analyze input.nc --interactive

# Create template for parallel writing
zarrio create-template template.nc archive.zarr --global-start 2023-01-01 --global-end 2023-12-31

# Create template with intelligent chunking
zarrio create-template template.nc archive.zarr --global-start 2023-01-01 --global-end 2023-12-31 --intelligent-chunking --access-pattern temporal

# Write region to existing archive
zarrio write-region data.nc archive.zarr

# Append to existing Zarr store
zarrio append new_data.nc existing.zarr

Python API

from zarrio import convert_to_zarr, append_to_zarr, ZarrConverter

# Simple conversion
convert_to_zarr("input.nc", "output.zarr")

# Conversion with options
convert_to_zarr(
    "input.nc",
    "output.zarr",
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression="blosc:zstd:3",
    packing=True,
    packing_bits=16,
    packing_manual_ranges={
        "temperature": {"min": -50, "max": 50}
    },
    packing_auto_buffer_factor=0.05
)

# Using the class-based interface
converter = ZarrConverter(
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression="blosc:zstd:3",
    packing=True,
    packing_manual_ranges={
        "temperature": {"min": -50, "max": 50}
    }
)
converter.convert("input.nc", "output.zarr")

# Parallel writing workflow
# 1. Create template archive (template_ds is an xarray.Dataset,
#    e.g. template_ds = xr.open_dataset("template.nc"))
converter.create_template(
    template_dataset=template_ds,
    output_path="archive.zarr",
    global_start="2023-01-01",
    global_end="2023-12-31",
    compute=False  # Metadata only
)

# 2. Write regions in parallel (in separate processes)
converter.write_region("data1.nc", "archive.zarr")
converter.write_region("data2.nc", "archive.zarr")
converter.write_region("data3.nc", "archive.zarr")

# Append to existing Zarr store
append_to_zarr("new_data.nc", "existing.zarr")
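
The packing options above use fixed-scale offset encoding: floats are stored as small integers together with a per-variable scale and offset. The sketch below shows the standard arithmetic for a 16-bit packed variable; it is illustrative only, and zarrio's internal implementation may differ in detail.

import numpy as np

# Illustrative fixed scale/offset arithmetic (not zarrio's internal code).
def pack(values, vmin, vmax, bits=16):
    n = 2**bits - 2                          # reserve one code for the fill value
    scale = (vmax - vmin) / n
    packed = np.round((values - vmin) / scale).astype(np.uint16)
    return packed, scale, vmin               # vmin doubles as the add_offset

def unpack(packed, scale, offset):
    return packed * scale + offset

data = np.array([-50.0, 0.0, 49.99])
packed, scale, offset = pack(data, -50.0, 50.0)
restored = unpack(packed, scale, offset)     # accurate to about scale/2 ~ 0.0008

Each value is reproduced to within half a scale step, which is why narrower manual ranges give better precision for the same bit width.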

Parallel Writing

One of the key features of zarrio is support for parallel writing of large datasets:

# Step 1: Create template archive with intelligent chunking
converter = ZarrConverter(
    chunking={"time": 100, "lat": 50, "lon": 100},
    access_pattern="temporal"  # Optimize for time series analysis
)
converter.create_template(
    template_dataset=template_dataset,  # an xarray.Dataset defining the archive structure
    output_path="large_archive.zarr",
    global_start="2020-01-01",
    global_end="2023-12-31",
    compute=False,  # Metadata only, no data computation
    intelligent_chunking=True,  # Enable intelligent chunking based on full archive dimensions
    access_pattern="temporal"   # Optimize for time series analysis
)

# Step 2: Write regions in parallel processes
# Process 1: converter.write_region("file1.nc", "large_archive.zarr")
# Process 2: converter.write_region("file2.nc", "large_archive.zarr")
# Process 3: converter.write_region("file3.nc", "large_archive.zarr")

This approach is ideal for converting large numbers of NetCDF files into a single Zarr archive in parallel. With intelligent chunking enabled, chunk sizes are derived from the full archive dimensions rather than just the template dataset.
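
As a concrete driver, the sketch below fans the region writes out with Python's multiprocessing module. It is a minimal example assuming each input file covers a distinct slice of the template's time axis:

from multiprocessing import Pool

from zarrio import ZarrConverter

# Minimal parallel driver sketch: each worker process writes one file's
# region into the pre-created template archive.
def write_one(path):
    ZarrConverter().write_region(path, "large_archive.zarr")

if __name__ == "__main__":
    files = ["file1.nc", "file2.nc", "file3.nc"]
    with Pool(processes=len(files)) as pool:
        pool.map(write_one, files)

In practice the workers could equally be separate batch jobs on a cluster; because the template already defines the archive layout, writers only need to cover disjoint regions and never coordinate with each other.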

Rolling Archive

Manage forecast-cycle archives with automatic cleanup of old data according to a time-based retention window.

What is a Rolling Archive?

A rolling archive automatically removes old forecast cycles from your Zarr store based on a configurable retention window. This is useful when:

  • You run forecasts multiple times per day
  • You want to keep only the last N hours/days of data
  • You need to manage disk space or datamesh storage

Python API

from zarrio import ZarrConverter
from datetime import timedelta

# Configure rolling archive
converter = ZarrConverter(
    rolling_archive={
        "enabled": True,
        "retention_window": timedelta(hours=24),  # Keep last 24 hours
        "min_groups_to_keep": 4,  # Always keep at least 4 cycles
        "auto_cleanup": True,  # Cleanup after each write
    }
)

# Write data - cleanup happens automatically after each write
converter.convert("forecast.nc", "archive.zarr", group="cycle/20240101T000000")

Manual Cleanup

You can also trigger cleanup manually:

# Dry run - see what would be deleted without making changes
result = converter.cleanup_archive("archive.zarr", dry_run=True)
print(f"Would delete: {len(result['deleted'])} groups")
for g in result['deleted']:
    print(f"  - {g}")
print(f"Would keep: {len(result['kept'])} groups")

# Actual cleanup
result = converter.cleanup_archive("archive.zarr")
print(f"Deleted: {len(result['deleted'])} groups")
print(f"Kept: {len(result['kept'])} groups")
if result['skipped']:
    print(f"Skipped: {len(result['skipped'])} groups (unparseable timestamp)")

CLI Usage

Enable rolling archive via command line:

# Convert with 24-hour retention
zarrio convert forecast.nc archive.zarr --rolling-archive-hours 24

# This enables automatic cleanup after each write

Configuration Options

Option               Type       Default        Description
enabled              bool       False          Enable automatic rolling archive cleanup
retention_window     timedelta  None           How long to keep data (minimum 1 hour)
time_reference_attr  str        'cycle_time'   Name of the attribute containing the cycle timestamp
auto_cleanup         bool       True           Clean up automatically after each write
min_groups_to_keep   int        1              Minimum number of groups to always preserve
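
For example, a configuration that keeps at least 8 cycles over a 48-hour window and reads the cycle timestamp from a custom attribute might look like the sketch below (the attribute name "run_time" is hypothetical; substitute whatever attribute your groups actually carry):

from datetime import timedelta

from zarrio import ZarrConverter

# Sketch combining the options above; "run_time" is a hypothetical
# attribute name used only for illustration.
converter = ZarrConverter(
    rolling_archive={
        "enabled": True,
        "retention_window": timedelta(hours=48),
        "time_reference_attr": "run_time",
        "min_groups_to_keep": 8,
        "auto_cleanup": False,  # call cleanup_archive() manually instead
    }
)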

Best Practices

  1. Use dry_run first to verify what will be deleted before actual cleanup
  2. Set min_groups_to_keep to prevent accidental total deletion
  3. Use time-based retention (not cycle count) for predictable behavior
  4. Monitor cleanup logs to ensure it's working as expected

See examples/rolling_archive_demo.py for a complete working example.

Configuration

You can also use configuration files (YAML or JSON):

# config.yaml
chunking:
  time: 100
  lat: 50
  lon: 100
compression: "blosc:zstd:3"
packing:
  enabled: true
  bits: 16
  manual_ranges:
    temperature:
      min: -50
      max: 50
  auto_buffer_factor: 0.05
variables:
  - temperature
  - pressure
drop_variables:
  - unused_var

Then use it with the CLI:

zarrio convert input.nc output.zarr --config config.yaml
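
The same file can be reused from Python by mapping its keys onto the keyword arguments documented in the Python API section above. This is a manual-mapping sketch that assumes PyYAML is installed; whether zarrio also accepts a config file directly in the Python API is not covered here, and the variables and drop_variables keys are omitted because their Python-side argument names are not shown above:

import yaml  # requires PyYAML

from zarrio import convert_to_zarr

# Manual-mapping sketch: flatten the nested YAML keys onto the documented
# convert_to_zarr() keyword arguments.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

packing = cfg.get("packing", {})
convert_to_zarr(
    "input.nc",
    "output.zarr",
    chunking=cfg.get("chunking"),
    compression=cfg.get("compression"),
    packing=packing.get("enabled", False),
    packing_bits=packing.get("bits", 16),
    packing_manual_ranges=packing.get("manual_ranges"),
    packing_auto_buffer_factor=packing.get("auto_buffer_factor", 0.05),
)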

Development

Installation

git clone https://github.com/oceanum/zarrio.git
cd zarrio
pip install -e .

Running Tests

pip install -e ".[dev]"
pytest

Code Quality

# Format code
black .

# Check code style
flake8

# Type checking
mypy zarrio

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
