# mlcast-dataset-validator
Dataset validator for the MLCast Intake catalog (mlcast-datasets).
## What is this?
This repository contains the validation tool for datasets contributed to the MLCast community (currently only radar precipitation source datasets). The validator ensures that the datasets meet the technical requirements for inclusion in the MLCast data collection.
## Background
During the MLCast community meeting, multiple entities offered to contribute datasets. To streamline the contribution process and ensure data quality, we developed this validator to help data providers verify that their Zarr archives are compliant with MLCast requirements before submission.
This tool addresses two key needs identified in the community:
- **Specification compliance (#6):** validates datasets against the formal MLCast Zarr format specification v1.0 (RFC 2119 keywords)
- **Tool compatibility (#5):** tests that datasets work correctly with common geospatial tools (xarray, GDAL, cartopy)
## What does it validate?
The validator checks both specification compliance and practical tool compatibility. For radar precipitation datasets, for example, it checks:

**Minimum Requirements for Dataset Acceptance:**
- 2D radar composite at 1km resolution or finer
- At least 256×256 pixel valid sensing area
- Minimum 3 years of temporal coverage
- Consistent spatial domain across all timesteps
- Data variable in mm (depth), mm/h (rate), or dBZ (reflectivity)

**Technical Requirements:**
- GeoZarr format (Zarr v2/v3 with proper georeferencing)
- CF-compliant coordinate and variable names
- Correct dimension ordering (time × H × W)
- Proper chunking strategy (1 chunk per timestep)
- ZSTD compression (recommended)
- NaN values for missing/out-of-range data
- License metadata (CC-BY, CC-BY-SA, OGL, etc.)

**Tool Compatibility:**
- xarray can load and slice the data correctly
- GDAL can interpret the georeferencing (WKT parsing)
- cartopy can create CRS objects and transform coordinates
- Cross-tool CRS consistency checks
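A few of the requirements above can be sketched as simple predicate functions. The function names, signatures, and example metadata below are illustrative stand-ins for the validator's real checks, not its actual API:

```python
# Illustrative sketches of the kinds of checks the validator performs.
# Names, signatures, and thresholds are hypothetical, not the real API.

MIN_PIXELS = 256                        # minimum valid sensing area per side
ALLOWED_UNITS = {"mm", "mm/h", "dBZ"}   # depth, rate, reflectivity

def check_dimension_order(dims: tuple) -> bool:
    """Time must come first, followed by the two spatial dims (time x H x W)."""
    return len(dims) == 3 and dims[0] == "time"

def check_sensing_area(shape: tuple) -> bool:
    """The spatial extent must be at least 256x256 pixels."""
    _, height, width = shape
    return height >= MIN_PIXELS and width >= MIN_PIXELS

def check_units(units: str) -> bool:
    """The data variable must be in mm, mm/h, or dBZ."""
    return units in ALLOWED_UNITS

# Example metadata for a hypothetical compliant dataset
dims, shape, units = ("time", "y", "x"), (1000, 900, 900), "mm/h"
assert check_dimension_order(dims)
assert check_sensing_area(shape)
assert check_units(units)
```

In the real tool these checks run against an opened `xr.Dataset`; the sketch uses plain tuples and strings to keep the idea self-contained.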
## How is the tool implemented?
- **Spec modules are organized by data stage/product.** Each validator lives under `mlcast_dataset_validator/specs/<data_stage>/<product>.py`. For example, the source-data radar precipitation spec is found at `specs/source_data/radar_precipitation.py`, while `specs/training_data/` is prepared for future ML training datasets derived from source data.
- **Spec sections mirror an `xr.Dataset`.** Within each spec module, the validation flow follows the dataset layout (coordinates, variables, global attrs, tool compatibility). This makes it easy to place new checks in the appropriate section as the spec evolves.
- **Inline spec text drives each requirement.** Every section block contains the human-readable spec text (RFC 2119 wording) followed immediately by function calls that implement the corresponding checks (e.g., `check_coordinate_names`, `check_georeferencing`, `check_gdal_compatibility`). This keeps the specification and its enforcement side by side.
- **Checking functions live under `mlcast_dataset_validator/checks/`.** Reusable validators for coordinates, data variables, global attrs, and tool compatibility live at paths like `mlcast_dataset_validator/checks/<dataset_section>/<dataset_aspect>.py:check_<dataset_property>`. Specs import the relevant function(s) for each section.
```
mlcast_dataset_validator/
├── specs/
│   ├── source_data/
│   │   └── radar_precipitation.py
│   ├── training_data/
│   │   └── ... (no specs yet)
│   └── cli.py
└── checks/
    ├── coords/
    │   ├── names.py        (check_coordinate_names)
    │   ├── spatial.py
    │   ├── temporal.py
    │   └── variable_timestep.py
    ├── data_vars/
    ├── global_attributes/
    └── tool_compatibility/
```
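The "spec text next to check calls" pattern described above can be sketched as follows. The `validate` function, the dict standing in for an `xr.Dataset`, and the exact failure messages are all assumptions made for illustration:

```python
# Hypothetical sketch of a spec module: RFC 2119 spec text sits directly
# above the code that enforces it, section by section.

def validate(ds) -> list:
    """Run the checks for one dataset and collect failure messages."""
    failures = []

    # --- Coordinates --------------------------------------------------
    # Spec: "The dataset MUST provide coordinates named `time`, `y`, `x`."
    required = {"time", "y", "x"}
    missing = required - set(ds["coords"])
    if missing:
        failures.append(f"missing coordinates: {sorted(missing)}")

    # --- Global attributes --------------------------------------------
    # Spec: "The dataset MUST carry license metadata (CC-BY, CC-BY-SA, OGL, ...)."
    if "license" not in ds["attrs"]:
        failures.append("missing global attribute: license")

    return failures

# A plain dict stands in for an xr.Dataset to keep the sketch dependency-free.
ds = {"coords": ["time", "y", "x"], "attrs": {"license": "CC-BY-4.0"}}
assert validate(ds) == []
```

Keeping the spec wording inline like this means a failed check can point the data provider straight at the requirement it violates.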
## Example usage
Until `mlcast-dataset-validator` is published to PyPI, the easiest way to run it is with `uvx` directly from the GitHub repository:

```shell
uvx --with "git+https://github.com/mlcast-community/mlcast-dataset-validator" mlcast.validate_dataset <data_stage> <product> <dataset-path>
```
For example, you can validate a local Zarr dataset like this:

```shell
uvx --with "git+https://github.com/mlcast-community/mlcast-dataset-validator" mlcast.validate_dataset source_data radar_precipitation /path/to/radar_precip_source.zarr
```
The validator also supports remote Zarr stores hosted in S3 buckets at custom endpoints. For example, we can run it on the RADKLIM Zarr already available in the Intake catalog:

```shell
uvx --with "git+https://github.com/mlcast-community/mlcast-dataset-validator" mlcast.validate_dataset source_data radar_precipitation s3://mlcast-source-datasets/radklim/v0.1.0/5_minutes.zarr/ --s3-endpoint-url https://object-store.os-api.cci2.ecmwf.int --s3-anon
```
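Flags like `--s3-endpoint-url` and `--s3-anon` map naturally onto fsspec/s3fs-style storage options. The helper below is a guess at how such a translation could look, not the validator's actual internals:

```python
def build_storage_options(endpoint_url=None, anon=False) -> dict:
    """Translate S3 CLI flags into fsspec/s3fs-style storage options.

    Hypothetical helper: the validator's real implementation may differ.
    """
    options = {}
    if anon:
        options["anon"] = True  # anonymous access for public buckets
    if endpoint_url:
        # Custom endpoints (e.g. non-AWS object stores) go into client_kwargs.
        options["client_kwargs"] = {"endpoint_url": endpoint_url}
    return options

# Such options could then be passed to something like
# xr.open_zarr(path, storage_options=opts) when opening the remote store.
opts = build_storage_options(
    endpoint_url="https://object-store.os-api.cci2.ecmwf.int", anon=True
)
print(opts)
```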
Or you can of course clone the repository and run it directly:

```shell
git clone https://github.com/mlcast-community/mlcast-dataset-validator
cd mlcast-dataset-validator
pip install -e .
mlcast.validate_dataset source_data radar_precipitation /path/to/zarr/file.zarr
```