# EOPF GeoZarr

GeoZarr compliant data model for EOPF (Earth Observation Processing Framework) datasets.
## Overview
This library provides tools to convert EOPF datasets to GeoZarr-spec 0.4 compliant format while maintaining native projections and using /2 downsampling logic for multiscale support.
## Key Features
- GeoZarr Specification Compliance: Full compliance with GeoZarr spec 0.4
- Native CRS Preservation: No reprojection to TMS, maintains original coordinate reference systems
- Multiscale Support: COG-style /2 downsampling with overview levels as child groups
- CF Conventions: Proper CF standard names and grid_mapping attributes
- Robust Processing: Band-by-band writing with validation and retry logic
- S3 Support: Direct output to Amazon S3 buckets with automatic credential validation
- Parallel Processing: Optional dask cluster support for parallel chunk processing
- Chunk Alignment: Automatic chunk alignment to prevent data corruption with dask
## GeoZarr Compliance Features
- `_ARRAY_DIMENSIONS` attributes on all arrays
- CF standard names for all variables
- `grid_mapping` attributes referencing CF grid_mapping variables
- `GeoTransform` attributes in grid_mapping variables
- Proper multiscales metadata structure
- Native CRS tile matrix sets
## Installation
```bash
pip install eopf-geozarr
```
For development:
```bash
git clone <repository-url>
cd eopf-geozarr
pip install -e ".[dev]"
```
## Quick Start
### Command Line Interface
After installation, you can use the `eopf-geozarr` command:
```bash
# Convert EOPF dataset to GeoZarr format (local output)
eopf-geozarr convert input.zarr output.zarr

# Convert EOPF dataset to GeoZarr format (S3 output)
eopf-geozarr convert input.zarr s3://my-bucket/path/to/output.zarr

# Convert with parallel processing using dask cluster
eopf-geozarr convert input.zarr output.zarr --dask-cluster

# Convert with dask cluster and verbose output
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose

# Get information about a dataset
eopf-geozarr info input.zarr

# Validate GeoZarr compliance
eopf-geozarr validate output.zarr

# Get help
eopf-geozarr --help
```
## S3 Support
The library supports direct output to S3-compatible storage, including custom providers like OVH Cloud. Simply provide an S3 URL as the output path:
```bash
# Convert to S3
eopf-geozarr convert local_input.zarr s3://my-bucket/geozarr-data/output.zarr --verbose
```
### S3 Configuration
Before using S3 output, ensure your S3 credentials are configured:
For AWS S3:

```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
```

For OVH Cloud Object Storage:

```bash
export AWS_ACCESS_KEY_ID=your_ovh_access_key
export AWS_SECRET_ACCESS_KEY=your_ovh_secret_key
export AWS_DEFAULT_REGION=gra  # or another OVH region
export AWS_ENDPOINT_URL=https://s3.gra.cloud.ovh.net  # OVH endpoint
```

For other S3-compatible providers:

```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=your_region
export AWS_ENDPOINT_URL=https://your-s3-endpoint.com
```
### Alternative: AWS CLI Configuration
```bash
aws configure
# Note: for custom endpoints, you'll still need to set AWS_ENDPOINT_URL
```
### S3 Features
- Custom Endpoints: Support for any S3-compatible storage (AWS, OVH Cloud, MinIO, etc.)
- Automatic Validation: The tool validates S3 access before starting conversion
- Credential Detection: Automatically detects and validates S3 credentials
- Error Handling: Provides helpful error messages for S3 configuration issues
- Performance: Optimized for S3 with proper chunking and retry logic
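As a standalone sketch (not the library's internal code), the same environment variables can be assembled into an fsspec/s3fs-style storage-options mapping; the helper name here is illustrative:

```python
import os

def s3_storage_options() -> dict:
    """Collect S3 settings from the standard AWS environment variables.

    Illustrative helper only; eopf-geozarr reads these variables itself.
    """
    options = {
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    }
    endpoint = os.environ.get("AWS_ENDPOINT_URL")
    if endpoint:
        # Custom endpoints (OVH Cloud, MinIO, ...) go into client kwargs.
        options["client_kwargs"] = {"endpoint_url": endpoint}
    return options

os.environ["AWS_ENDPOINT_URL"] = "https://s3.gra.cloud.ovh.net"
opts = s3_storage_options()
print(opts["client_kwargs"]["endpoint_url"])
```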
## Parallel Processing with Dask
The library supports parallel processing using dask clusters for improved performance on large datasets:
```bash
# Enable dask cluster for parallel processing
eopf-geozarr convert input.zarr output.zarr --dask-cluster

# With verbose output to see cluster information
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose
```
### Dask Features
- Local Cluster: Automatically starts a local dask cluster with multiple workers
- Dashboard Access: Provides access to the dask dashboard for monitoring (shown in verbose mode)
- Automatic Cleanup: Properly closes the cluster even if errors occur during processing
- Chunk Alignment: Automatically aligns Zarr chunks with dask chunks to prevent data corruption
- Memory Efficiency: Better memory management through parallel chunk processing
- Error Handling: Graceful handling of dask import errors with helpful installation instructions
### Chunk Alignment
The library includes advanced chunk alignment logic to prevent the common issue of overlapping chunks when using dask:
- Smart Detection: Automatically detects if data is dask-backed and uses existing chunk structure
- Aligned Calculation: Uses `calculate_aligned_chunk_size()` to find optimal chunk sizes that divide evenly into data dimensions
- Proper Rechunking: Ensures datasets are rechunked to match encoding before writing
- Fallback Logic: For non-dask arrays, uses reasonable chunk sizes that don't exceed data dimensions
This prevents errors like:
```text
❌ Failed to write tci after 2 attempts: Specified Zarr chunks encoding['chunks']=(1, 3660, 3660)
for variable named 'tci' would overlap multiple Dask chunks
```
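The alignment idea can be sketched independently of the library: take the largest divisor of the dimension size that does not exceed the target chunk size. This mirrors the documented behavior of `calculate_aligned_chunk_size` (5490 with target 3660 yields 2745), but is a standalone illustration, not the library's implementation:

```python
def aligned_chunk_size(dimension_size: int, target_chunk_size: int) -> int:
    """Largest chunk size <= target that divides the dimension evenly."""
    for candidate in range(min(target_chunk_size, dimension_size), 0, -1):
        if dimension_size % candidate == 0:
            return candidate
    return 1  # 1 always divides dimension_size, so this is never reached

# 5490 is not divisible by 3660, so the chunk shrinks to 2745 (5490 / 2).
print(aligned_chunk_size(5490, 3660))  # 2745
```

Because every chunk boundary now falls on a multiple of the chunk size, no Zarr chunk can span two Dask chunks.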
## S3 Python API
```python
import os
import xarray as xr
from eopf_geozarr import create_geozarr_dataset

# Configure for OVH Cloud (example)
os.environ['AWS_ACCESS_KEY_ID'] = 'your_ovh_access_key'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'your_ovh_secret_key'
os.environ['AWS_DEFAULT_REGION'] = 'gra'
os.environ['AWS_ENDPOINT_URL'] = 'https://s3.gra.cloud.ovh.net'

# Load your EOPF DataTree
dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")

# Convert directly to S3
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"],
    output_path="s3://my-bucket/geozarr-data/output.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
    max_retries=3,
)
```
## Python API
```python
import xarray as xr
from eopf_geozarr import create_geozarr_dataset

# Load your EOPF DataTree
dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")

# Define groups to convert (e.g., resolution groups)
groups = ["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"]

# Convert to GeoZarr compliant format
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=groups,
    output_path="path/to/output/geozarr.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
    max_retries=3,
)
```
## API Reference
### Main Functions
#### `create_geozarr_dataset`
Create a GeoZarr-spec 0.4 compliant dataset from EOPF data.
Parameters:
- `dt_input` (xr.DataTree): Input EOPF DataTree
- `groups` (List[str]): List of group names to process as GeoZarr datasets
- `output_path` (str): Output path for the Zarr store
- `spatial_chunk` (int, default=4096): Spatial chunk size for encoding
- `min_dimension` (int, default=256): Minimum dimension for overview levels
- `tile_width` (int, default=256): Tile width for TMS compatibility
- `max_retries` (int, default=3): Maximum number of retries for network operations
Returns:
- `xr.DataTree`: DataTree containing the GeoZarr compliant data
#### `setup_datatree_metadata_geozarr_spec_compliant`
Set up GeoZarr-spec compliant CF standard names and CRS information.
Parameters:
- `dt` (xr.DataTree): The data tree containing the datasets to process
- `groups` (List[str]): List of group names to process as GeoZarr datasets
Returns:
- `Dict[str, xr.Dataset]`: Dictionary of datasets with GeoZarr compliance applied
### Utility Functions
#### `downsample_2d_array`
Downsample a 2D array using block averaging.
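Block averaging can be sketched with NumPy reshaping. This is a standalone illustration of the technique, not the library's implementation; edges that don't divide evenly by the factor are trimmed here:

```python
import numpy as np

def downsample_block_mean(arr: np.ndarray, factor: int = 2) -> np.ndarray:
    """Downsample a 2D array by averaging non-overlapping factor x factor blocks."""
    h, w = arr.shape
    # Trim edges so both dimensions divide evenly by the factor.
    h2, w2 = h - h % factor, w - w % factor
    arr = arr[:h2, :w2]
    # Reshape into (rows, factor, cols, factor) blocks, then average each block.
    return arr.reshape(h2 // factor, factor, w2 // factor, factor).mean(axis=(1, 3))

a = np.arange(16, dtype=float).reshape(4, 4)
print(downsample_block_mean(a))  # [[ 2.5  4.5] [10.5 12.5]]
```

Applied repeatedly, this produces the COG-style /2 overview pyramid described above.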
#### `calculate_aligned_chunk_size`
Calculate a chunk size that divides evenly into the dimension size. This ensures that Zarr chunks align properly with the data dimensions, preventing chunk overlap issues when writing with Dask.
Parameters:
- `dimension_size` (int): Size of the dimension to chunk
- `target_chunk_size` (int): Desired chunk size
Returns:
- `int`: Aligned chunk size that divides evenly into `dimension_size`
Example:

```python
from eopf_geozarr.conversion.utils import calculate_aligned_chunk_size

# For a dimension of size 5490 with target chunk size 3660
aligned_size = calculate_aligned_chunk_size(5490, 3660)  # Returns 2745
```
#### `is_grid_mapping_variable`
Check if a variable is a grid_mapping variable by looking for references to it.
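The detection logic can be sketched over a plain mapping of variable attributes (a hypothetical standalone helper, not the library's code): a variable is a grid_mapping variable when some other variable's `grid_mapping` attribute points at it.

```python
def is_grid_mapping_variable(name: str, variables: dict) -> bool:
    """True if any other variable's grid_mapping attribute references `name`."""
    return any(
        attrs.get("grid_mapping") == name
        for var, attrs in variables.items()
        if var != name
    )

variables = {
    "b04": {"standard_name": "toa_bidirectional_reflectance",
            "grid_mapping": "spatial_ref"},
    "spatial_ref": {"grid_mapping_name": "transverse_mercator"},
}
print(is_grid_mapping_variable("spatial_ref", variables))  # True
print(is_grid_mapping_variable("b04", variables))          # False
```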
#### `validate_existing_band_data`
Validate that a specific band exists and is complete in the dataset.
## Architecture
The library is organized into the following modules:
- `conversion`: Core conversion tools for EOPF to GeoZarr transformation
  - `geozarr.py`: Main conversion functions and GeoZarr spec compliance
  - `utils.py`: Utility functions for data processing and validation
- `data_api`: Data access API (future development with pydantic-zarr)
## GeoZarr Specification Compliance
This library implements the GeoZarr specification 0.4 with the following key requirements:
- Array Dimensions: All arrays must have `_ARRAY_DIMENSIONS` attributes
- CF Standard Names: All variables must have CF-compliant `standard_name` attributes
- Grid Mapping: Data variables must reference CF grid_mapping variables via `grid_mapping` attributes
- Multiscales Structure: Overview levels are stored as child groups with proper tile matrix metadata
- Native CRS: Coordinate reference systems are preserved without reprojection
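Concretely, a compliant data variable and its grid_mapping variable carry attributes along these lines. The values below are illustrative (a hypothetical Sentinel-2-style UTM tile); the exact CRS parameters and standard names depend on the dataset:

```python
# Illustrative attribute layout for a GeoZarr-compliant variable pair.
band_attrs = {
    "_ARRAY_DIMENSIONS": ["y", "x"],              # dimension names, in order
    "standard_name": "toa_bidirectional_reflectance",  # CF standard name
    "grid_mapping": "spatial_ref",                # reference to the CRS variable
}
grid_mapping_attrs = {
    "grid_mapping_name": "transverse_mercator",   # CF grid mapping identifier
    # GDAL-style GeoTransform: origin x, pixel width, 0, origin y, 0, -pixel height
    "GeoTransform": "600000.0 10.0 0.0 5100000.0 0.0 -10.0",
}

print(band_attrs["_ARRAY_DIMENSIONS"])  # ['y', 'x']
```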
## Contributing to GeoZarr Specification
Our implementation has contributed valuable feedback to the GeoZarr specification development process. Based on our real-world experience with Earth observation data, we have identified and reported several areas for improvement:
### Key Contributions
- Arbitrary Coordinate Systems Support: Advocating for native CRS preservation instead of web mapping bias
- Chunking Performance Optimization: Proposing flexible chunking strategies for optimal performance
- Multiscale Hierarchy Clarification: Providing clear structure definitions for multiscale implementations
Our implementation demonstrates that scientific accuracy and performance can be maintained while working with arbitrary coordinate systems, not just web mapping projections. This is particularly important for Earth observation data that often comes in UTM zones, polar stereographic, or other scientific projections.
For detailed information about our contributions, see our GeoZarr Specification Contribution documentation.
## Development
### Setting up the Development Environment
```bash
# Clone the repository
git clone <repository-url>
cd eopf-geozarr

# Install in development mode with all dependencies
pip install -e ".[dev,docs,all]"

# Install pre-commit hooks
pre-commit install
```
### Running Tests
```bash
pytest
```
### Code Quality
The project uses:
- Black for code formatting
- isort for import sorting
- flake8 for linting
- mypy for type checking
- pre-commit for automated checks
### Building Documentation
```bash
cd docs
make html
```
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests and ensure code quality checks pass
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
## License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## Acknowledgments
- Built on top of the excellent xarray and zarr libraries
- Follows the GeoZarr specification for geospatial data in Zarr
- Designed for compatibility with EOPF datasets
## Support
For questions, issues, or contributions, please visit the GitHub repository.