zarrio
A modern, clean library for converting scientific data formats to Zarr format.
Overview
zarrio is a rewrite of the original onzarr library with a focus on simplicity, performance, and maintainability. It leverages modern xarray and zarr capabilities to provide efficient conversion of NetCDF and other scientific data formats to Zarr format.
Features
- Simple API: Clean, intuitive interfaces for common operations
- Efficient Conversion: Fast conversion of NetCDF to Zarr format
- Data Packing: Compress data using fixed-scale offset encoding
- Intelligent Chunking: Automatic chunking recommendations based on access patterns (temporal, spatial, balanced), for both single conversions and parallel archives
- Compression: Support for various compression algorithms
- Time Series Handling: Efficient handling of time-series data
- Data Appending: Append new data to existing Zarr archives
- Parallel Writing: Create template archives and write regions in parallel with intelligent chunking
- Metadata Preservation: Maintain dataset metadata during conversion
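The fixed-scale offset packing mentioned above maps floats onto a small integer range via a scale factor and offset. A minimal sketch of the arithmetic (the helper names are illustrative, not zarrio's API; reserving one integer code for the fill value is a common convention, and zarrio's exact formula may differ):

```python
def pack_params(vmin, vmax, bits=16):
    """Compute scale factor and offset for fixed-scale offset packing."""
    # One code (the maximum unsigned integer) is reserved for the fill value.
    n_values = 2 ** bits - 1
    scale = (vmax - vmin) / (n_values - 1)
    return scale, vmin

def pack(x, scale, offset):
    """Encode a float as an integer code."""
    return round((x - offset) / scale)

def unpack(code, scale, offset):
    """Decode an integer code back to (an approximation of) the float."""
    return code * scale + offset

scale, offset = pack_params(-50.0, 50.0, bits=16)
code = pack(21.5, scale, offset)
approx = unpack(code, scale, offset)  # within one scale step of 21.5
```

The maximum round-trip error is half a scale step, which is why wider packing ranges (or fewer bits) cost precision.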
Installation
pip install zarrio
Usage
Command Line Interface
# Convert NetCDF to Zarr
zarrio convert input.nc output.zarr
# Convert with chunking
zarrio convert input.nc output.zarr --chunking "time:100,lat:50,lon:100"
# Convert with compression
zarrio convert input.nc output.zarr --compression "blosc:zstd:3"
# Convert with data packing
zarrio convert input.nc output.zarr --packing --packing-bits 16
# Convert with manual packing ranges
zarrio convert input.nc output.zarr --packing \
    --packing-manual-ranges '{"temperature": {"min": -50, "max": 50}}'
# Analyze NetCDF file for optimization recommendations
zarrio analyze input.nc
# Analyze with theoretical performance testing
zarrio analyze input.nc --test-performance
# Analyze with actual performance testing
zarrio analyze input.nc --run-tests
# Analyze with interactive configuration setup
zarrio analyze input.nc --interactive
# Create template for parallel writing
zarrio create-template template.nc archive.zarr --global-start 2023-01-01 --global-end 2023-12-31
# Create template with intelligent chunking
zarrio create-template template.nc archive.zarr --global-start 2023-01-01 --global-end 2023-12-31 --intelligent-chunking --access-pattern temporal
# Write region to existing archive
zarrio write-region data.nc archive.zarr
# Append to existing Zarr store
zarrio append new_data.nc existing.zarr
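The --compression strings above follow a colon-separated codec:algorithm:level pattern. A hypothetical parser for that format, useful for seeing how a spec like "blosc:zstd:3" decomposes (the exact grammar zarrio accepts, and its defaults, are assumptions here):

```python
def parse_compression(spec):
    """Split a 'codec:algorithm:level' spec like 'blosc:zstd:3' into parts.

    Algorithm and level are optional; the defaults used here
    (None and 1) are illustrative, not zarrio's.
    """
    parts = spec.split(":")
    if not parts[0]:
        raise ValueError(f"invalid compression spec: {spec!r}")
    return {
        "codec": parts[0],
        "algorithm": parts[1] if len(parts) > 1 else None,
        "level": int(parts[2]) if len(parts) > 2 else 1,
    }

print(parse_compression("blosc:zstd:3"))
# → {'codec': 'blosc', 'algorithm': 'zstd', 'level': 3}
```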
Python API
from zarrio import convert_to_zarr, append_to_zarr, ZarrConverter
# Simple conversion
convert_to_zarr("input.nc", "output.zarr")
# Conversion with options
convert_to_zarr(
    "input.nc",
    "output.zarr",
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression="blosc:zstd:3",
    packing=True,
    packing_bits=16,
    packing_manual_ranges={
        "temperature": {"min": -50, "max": 50}
    },
    packing_auto_buffer_factor=0.05
)
# Using the class-based interface
converter = ZarrConverter(
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression="blosc:zstd:3",
    packing=True,
    packing_manual_ranges={
        "temperature": {"min": -50, "max": 50}
    }
)
converter.convert("input.nc", "output.zarr")
# Parallel writing workflow
# 1. Create template archive
converter.create_template(
    template_dataset=template_ds,
    output_path="archive.zarr",
    global_start="2023-01-01",
    global_end="2023-12-31",
    compute=False  # Metadata only
)
# 2. Write regions in parallel (in separate processes)
converter.write_region("data1.nc", "archive.zarr")
converter.write_region("data2.nc", "archive.zarr")
converter.write_region("data3.nc", "archive.zarr")
# Append to existing Zarr store
append_to_zarr("new_data.nc", "existing.zarr")
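When picking chunk sizes like the {"time": 100, "lat": 50, "lon": 100} used in the examples above, the usual guide is the uncompressed bytes per chunk (a few MB to ~100 MB is a common Zarr rule of thumb, not a zarrio requirement). A quick sanity check:

```python
from math import prod

def chunk_nbytes(chunks, itemsize=4):
    """Uncompressed bytes per chunk for the given per-dimension chunk lengths."""
    return prod(chunks.values()) * itemsize

chunks = {"time": 100, "lat": 50, "lon": 100}
nbytes = chunk_nbytes(chunks, itemsize=4)  # float32
print(nbytes / 1e6)  # → 2.0 (MB), comfortably inside the usual guideline
```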
Parallel Writing
One of the key features of zarrio is support for parallel writing of large datasets:
# Step 1: Create template archive with intelligent chunking
converter = ZarrConverter(
    chunking={"time": 100, "lat": 50, "lon": 100},
    access_pattern="temporal"  # Optimize for time series analysis
)
converter.create_template(
    template_dataset=template_dataset,
    output_path="large_archive.zarr",
    global_start="2020-01-01",
    global_end="2023-12-31",
    compute=False,  # Metadata only, no data computation
    intelligent_chunking=True,  # Chunking based on full archive dimensions
    access_pattern="temporal"  # Optimize for time series analysis
)
# Step 2: Write regions in parallel processes
# Process 1: converter.write_region("file1.nc", "large_archive.zarr")
# Process 2: converter.write_region("file2.nc", "large_archive.zarr")
# Process 3: converter.write_region("file3.nc", "large_archive.zarr")
This approach is ideal for converting large numbers of NetCDF files to a single Zarr archive in parallel. The intelligent chunking feature ensures optimal chunking based on the full archive dimensions rather than just the template dataset.
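The per-file writes in step 2 can be fanned out with the standard library. A sketch (threads are used here so the example is self-contained; for CPU-heavy packing, separate processes as in the comments above sidestep the GIL, and write_one is a placeholder for the real write_region call):

```python
from concurrent.futures import ThreadPoolExecutor

def write_one(path, archive="large_archive.zarr"):
    # Placeholder for: converter.write_region(path, archive)
    return f"{path} -> {archive}"

def write_all(paths, max_workers=3):
    """Dispatch one region write per input file across a worker pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(write_one, paths))

print(write_all(["file1.nc", "file2.nc", "file3.nc"]))
```

Because each worker writes a disjoint region of a pre-created template, the writes need no coordination beyond the template itself.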
Rolling Archive
Manage forecast cycle archives with automatic cleanup of old data based on time-based retention windows.
What is a Rolling Archive?
A rolling archive automatically removes old forecast cycles from your Zarr store based on a configurable retention window. This is useful when:
- You run forecasts multiple times per day
- You want to keep only the last N hours/days of data
- You need to manage disk space or datamesh storage
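Conceptually, the cleanup reduces to sorting groups by their cycle timestamp and deleting those outside the retention window, while always protecting the newest min_groups_to_keep. A hypothetical sketch of that selection logic (zarrio's internals may differ; group names follow the cycle/YYYYMMDDTHHMMSS pattern used in the examples below):

```python
from datetime import datetime, timedelta

def select_for_deletion(groups, now, retention_window, min_groups_to_keep=1):
    """Return group names whose cycle time falls outside the window.

    Groups are parsed from names like 'cycle/20240101T000000'; the newest
    min_groups_to_keep groups are always preserved.
    """
    parsed = sorted(
        (datetime.strptime(g.split("/")[-1], "%Y%m%dT%H%M%S"), g) for g in groups
    )
    cutoff = now - retention_window
    deletable = [g for ts, g in parsed if ts < cutoff]
    # Never delete into the protected newest set.
    keep_floor = len(parsed) - min_groups_to_keep
    return deletable[: max(0, min(len(deletable), keep_floor))]

groups = ["cycle/20240101T000000", "cycle/20240101T120000", "cycle/20240102T000000"]
now = datetime(2024, 1, 2, 6)
old = select_for_deletion(groups, now, timedelta(hours=24), min_groups_to_keep=1)
print(old)  # → ['cycle/20240101T000000']
```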
Python API
from zarrio import ZarrConverter
from datetime import timedelta
# Configure rolling archive
converter = ZarrConverter(
    rolling_archive={
        "enabled": True,
        "retention_window": timedelta(hours=24),  # Keep last 24 hours
        "min_groups_to_keep": 4,  # Always keep at least 4 cycles
        "auto_cleanup": True,  # Cleanup after each write
    }
)
# Write data - cleanup happens automatically after each write
converter.convert("forecast.nc", "archive.zarr", group="cycle/20240101T000000")
Manual Cleanup
You can also trigger cleanup manually:
# Dry run - see what would be deleted without making changes
result = converter.cleanup_archive("archive.zarr", dry_run=True)
print(f"Would delete: {len(result['deleted'])} groups")
for g in result['deleted']:
    print(f"  - {g}")
print(f"Would keep: {len(result['kept'])} groups")
# Actual cleanup
result = converter.cleanup_archive("archive.zarr")
print(f"Deleted: {len(result['deleted'])} groups")
print(f"Kept: {len(result['kept'])} groups")
if result['skipped']:
    print(f"Skipped: {len(result['skipped'])} groups (unparseable timestamp)")
CLI Usage
Enable rolling archive via command line:
# Convert with 24-hour retention
zarrio convert forecast.nc archive.zarr --rolling-archive-hours 24
# This enables automatic cleanup after each write
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | False | Enable automatic rolling archive cleanup |
| retention_window | timedelta | None | How long to keep data (minimum 1 hour) |
| time_reference_attr | str | 'cycle_time' | Attribute name containing timestamp |
| auto_cleanup | bool | True | Cleanup automatically after each write |
| min_groups_to_keep | int | 1 | Minimum number of groups to always preserve |
Best Practices
- Use dry_run first to verify what will be deleted before actual cleanup
- Set min_groups_to_keep to prevent accidental total deletion
- Use time-based retention (not cycle count) for predictable behavior
- Monitor cleanup logs to ensure it's working as expected
See examples/rolling_archive_demo.py for a complete working example.
Configuration
You can also use configuration files (YAML or JSON):
# config.yaml
chunking:
  time: 100
  lat: 50
  lon: 100
compression: "blosc:zstd:3"
packing:
  enabled: true
  bits: 16
  manual_ranges:
    temperature:
      min: -50
      max: 50
  auto_buffer_factor: 0.05
variables:
  - temperature
  - pressure
drop_variables:
  - unused_var
Then use it with the CLI:
zarrio convert input.nc output.zarr --config config.yaml
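Since the README notes JSON is accepted as well, here is a JSON equivalent of part of the YAML config above, loaded with the standard library. How zarrio maps config fields onto ZarrConverter keyword arguments is an assumption, so the pass-through line is left as a comment:

```python
import json

# JSON equivalent of the YAML example (structure mirrors it exactly).
config_text = """
{
  "chunking": {"time": 100, "lat": 50, "lon": 100},
  "compression": "blosc:zstd:3",
  "packing": {"enabled": true, "bits": 16}
}
"""

config = json.loads(config_text)
# converter = ZarrConverter(**config)  # hypothetical: fields passed straight through
print(config["chunking"]["time"])  # → 100
```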
Development
Installation
git clone https://github.com/oceanum/zarrio.git
cd zarrio
pip install -e .
Running Tests
pip install -e ".[dev]"
pytest
Code Quality
# Format code
black .
# Check code style
flake8
# Type checking
mypy zarrio
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.