# mlcast-dataset-validator
Dataset validator for the MLCast Intake catalog (mlcast-datasets).
## What is this?
This repository contains the validation tool for datasets contributed to the MLCast community (currently only radar precipitation source datasets). The validator ensures that the datasets meet the technical requirements for inclusion in the MLCast data collection.
## Background
During the MLCast community meeting, multiple entities offered to contribute datasets. To streamline the contribution process and ensure data quality, we developed this validator to help data providers verify that their Zarr archives are compliant with MLCast requirements before submission.
This tool addresses two key needs identified in the community:
- **Specification compliance (#6):** validates datasets against the formal MLCast Zarr format specification v1.0 (RFC 2119 keywords)
- **Tool compatibility (#5):** tests that datasets work correctly with common geospatial tools (xarray, GDAL, cartopy)
## What does it validate?
The validator checks both specification compliance and practical tool compatibility. For radar precipitation datasets, for example, it checks:

**Minimum Requirements for Dataset Acceptance:**
- 2D radar composite at 1km resolution or finer
- At least 256×256 pixel valid sensing area
- Minimum 3 years of temporal coverage
- Consistent spatial domain across all timesteps
- Data variable in mm (depth), mm/h (rate), or dBZ (reflectivity)

**Technical Requirements:**
- GeoZarr format (Zarr v2/v3 with proper georeferencing)
- CF-compliant coordinate and variable names
- Correct dimension ordering (time × H × W)
- Proper chunking strategy (1 chunk per timestep)
- ZSTD compression (recommended)
- NaN values for missing/out-of-range data
- License metadata (CC-BY, CC-BY-SA, OGL, etc.)

**Tool Compatibility:**
- xarray can load and slice the data correctly
- GDAL can interpret the georeferencing (WKT parsing)
- cartopy can create CRS objects and transform coordinates
- Cross-tool CRS consistency checks
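A few of the requirements above can be sketched as simple predicate functions. The function names, signatures, and example metadata below are illustrative stand-ins for the validator's real checks, not its actual API:

```python
# Illustrative sketches of the kinds of checks the validator performs.
# Names, signatures, and thresholds are hypothetical, not the real API.

MIN_PIXELS = 256                        # minimum valid sensing area per side
ALLOWED_UNITS = {"mm", "mm/h", "dBZ"}   # depth, rate, reflectivity

def check_dimension_order(dims: tuple) -> bool:
    """Time must come first, followed by the two spatial dims (time x H x W)."""
    return len(dims) == 3 and dims[0] == "time"

def check_sensing_area(shape: tuple) -> bool:
    """The spatial extent must be at least 256x256 pixels."""
    _, height, width = shape
    return height >= MIN_PIXELS and width >= MIN_PIXELS

def check_units(units: str) -> bool:
    """The data variable must be in mm, mm/h, or dBZ."""
    return units in ALLOWED_UNITS

# Example metadata for a hypothetical compliant dataset
dims, shape, units = ("time", "y", "x"), (1000, 900, 900), "mm/h"
assert check_dimension_order(dims)
assert check_sensing_area(shape)
assert check_units(units)
```

In the real tool these checks run against an opened `xr.Dataset`; the sketch uses plain tuples and strings to keep the idea self-contained.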
## How is the tool implemented?
- **Spec modules are organized by data stage/product.** Each validator lives under `mlcast_dataset_validator/specs/<data_stage>/<product>.py`. For example, the source-data radar precipitation spec is found at `specs/source_data/radar_precipitation.py`, while `specs/training_data/` is prepared for future ML training datasets derived from source data.
- **Spec sections mirror an `xr.Dataset`.** Within each spec module, the validation flow follows the dataset layout (coordinates, variables, global attrs, tool compatibility). This makes it easy to place new checks in the appropriate section as the spec evolves.
- **Inline spec text drives each requirement.** Every section block contains the human-readable spec text (RFC 2119 wording) followed immediately by function calls that implement the corresponding checks (e.g., `check_coordinate_names`, `check_georeferencing`, `check_gdal_compatibility`). This keeps the specification and its enforcement side by side.
- **Checking functions live under `mlcast_dataset_validator/checks/`.** Reusable validators for coordinates, data variables, global attrs, and tool compatibility live at paths like `mlcast_dataset_validator/checks/<dataset_section>/<dataset_aspect>.py:check_<dataset_property>`. Specs import the relevant function(s) for each section.
```
mlcast_dataset_validator/
├── specs/
│   ├── source_data/
│   │   └── radar_precipitation.py
│   ├── training_data/
│   │   └── ... (no specs yet)
│   └── cli.py
└── checks/
    ├── coords/
    │   ├── names.py        (check_coordinate_names)
    │   ├── spatial.py
    │   ├── temporal.py
    │   └── variable_timestep.py
    ├── data_vars/
    ├── global_attributes/
    └── tool_compatibility/
```
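The "spec text next to check calls" pattern described above can be sketched as follows. The `validate` function, the dict standing in for an `xr.Dataset`, and the exact failure messages are all assumptions made for illustration:

```python
# Hypothetical sketch of a spec module: RFC 2119 spec text sits directly
# above the code that enforces it, section by section.

def validate(ds) -> list:
    """Run the checks for one dataset and collect failure messages."""
    failures = []

    # --- Coordinates --------------------------------------------------
    # Spec: "The dataset MUST provide coordinates named `time`, `y`, `x`."
    required = {"time", "y", "x"}
    missing = required - set(ds["coords"])
    if missing:
        failures.append(f"missing coordinates: {sorted(missing)}")

    # --- Global attributes --------------------------------------------
    # Spec: "The dataset MUST carry license metadata (CC-BY, CC-BY-SA, OGL, ...)."
    if "license" not in ds["attrs"]:
        failures.append("missing global attribute: license")

    return failures

# A plain dict stands in for an xr.Dataset to keep the sketch dependency-free.
ds = {"coords": ["time", "y", "x"], "attrs": {"license": "CC-BY-4.0"}}
assert validate(ds) == []
```

Keeping the spec wording inline like this means a failed check can point the data provider straight at the requirement it violates.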
## Example usage
Until `mlcast-dataset-validator` is published to PyPI, the easiest way to run it is with `uvx` directly from the GitHub repository:

```shell
uvx --with "git+https://github.com/mlcast-community/mlcast-dataset-validator" mlcast.validate_dataset <data_stage> <product> <dataset-path>
```
For example, you can validate a local Zarr dataset like this:

```shell
uvx --with "git+https://github.com/mlcast-community/mlcast-dataset-validator" mlcast.validate_dataset source_data radar_precipitation /path/to/radar_precip_source.zarr
```
The validator also supports remote Zarr stores hosted in S3 buckets at custom endpoints. For example, we can run it on the RADKLIM Zarr already available in the Intake catalog:

```shell
uvx --with "git+https://github.com/mlcast-community/mlcast-dataset-validator" mlcast.validate_dataset source_data radar_precipitation s3://mlcast-source-datasets/radklim/v0.1.0/5_minutes.zarr/ --s3-endpoint-url https://object-store.os-api.cci2.ecmwf.int --s3-anon
```
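Flags like `--s3-endpoint-url` and `--s3-anon` map naturally onto fsspec/s3fs-style storage options. The helper below is a guess at how such a translation could look, not the validator's actual internals:

```python
def build_storage_options(endpoint_url=None, anon=False) -> dict:
    """Translate S3 CLI flags into fsspec/s3fs-style storage options.

    Hypothetical helper: the validator's real implementation may differ.
    """
    options = {}
    if anon:
        options["anon"] = True  # anonymous access for public buckets
    if endpoint_url:
        # Custom endpoints (e.g. non-AWS object stores) go into client_kwargs.
        options["client_kwargs"] = {"endpoint_url": endpoint_url}
    return options

# Such options could then be passed to something like
# xr.open_zarr(path, storage_options=opts) when opening the remote store.
opts = build_storage_options(
    endpoint_url="https://object-store.os-api.cci2.ecmwf.int", anon=True
)
print(opts)
```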
Or you can of course clone the repository and run it directly:

```shell
git clone https://github.com/mlcast-community/mlcast-dataset-validator
cd mlcast-dataset-validator
pip install -e .
mlcast.validate_dataset source_data radar_precipitation /path/to/zarr/file.zarr
```