
mlcast-dataset-validator

Spec Docs

Dataset validator for the MLCast Intake catalog (mlcast-datasets).

What is this?

This repository contains the validation tool for datasets contributed to the MLCast community (currently only radar precipitation source datasets). The validator ensures that the datasets meet the technical requirements for inclusion in the MLCast data collection.

Background

During the MLCast community meeting, multiple entities offered to contribute datasets. To streamline the contribution process and ensure data quality, we developed this validator to help data providers verify that their Zarr archives are compliant with MLCast requirements before submission.

This tool addresses two key needs identified in the community:

  1. Specification compliance (#6): Validates datasets against the formal MLCast Zarr format specification v1.0 (RFC 2119 keywords)
  2. Tool compatibility (#5): Tests that datasets work correctly with common geospatial tools (xarray, GDAL, cartopy)

What does it validate?

The validator checks both specification compliance and practical tool compatibility. For radar precipitation source datasets, for example, it checks:

  • Minimum Requirements for Dataset Acceptance:

    • 2D radar composite at 1km resolution or finer
    • At least 256×256 pixel valid sensing area
    • Minimum 3 years of temporal coverage
    • Consistent spatial domain across all timesteps
    • Data variable in mm (depth), mm/h (rate), or dBZ (reflectivity)
  • Technical Requirements:

    • GeoZarr format (Zarr v2/v3 with proper georeferencing)
    • CF-compliant coordinate and variable names
    • Correct dimension ordering (time × H × W)
    • Proper chunking strategy (1 chunk per timestep)
    • ZSTD compression (recommended)
    • NaN values for missing/out-of-range data
    • License metadata (CC-BY, CC-BY-SA, OGL, etc.)
  • Tool Compatibility:

    • xarray can load and slice the data correctly
    • GDAL can interpret the georeferencing (WKT parsing)
    • cartopy can create CRS objects and transform coordinates
    • Cross-tool CRS consistency checks
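To make the style of these checks concrete, here is a minimal sketch of a dimension-ordering check like the one listed above. `check_dimension_order` is a hypothetical, simplified stand-in, not the validator's actual API:

```python
def check_dimension_order(dims: tuple) -> list:
    """Return a list of problems if dims are not ordered (time, y, x)."""
    errors = []
    expected = ("time", "y", "x")
    if tuple(dims) != expected:
        errors.append(f"dimensions {dims} are not ordered {expected}")
    return errors

# A compliant radar composite has time first, then the two spatial dims.
assert check_dimension_order(("time", "y", "x")) == []
# A dataset with spatial dims before time fails the check.
assert check_dimension_order(("y", "x", "time")) != []
```

The real checks operate on an `xr.Dataset` and report richer diagnostics, but the pattern is the same: each check inspects one property and returns a list of findings.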

How is the tool implemented?

  1. Spec modules are organized by data stage/product. Each validator lives under mlcast_dataset_validator/specs/<data_stage>/<product>.py. For example, the source-data radar precipitation spec is found at specs/source_data/radar_precipitation.py, while specs/training_data/ is reserved for future ML training datasets derived from source data.

  2. Spec sections mirror an xr.Dataset. Within each spec module, the validation flow follows the dataset layout (coordinates, variables, global attrs, tool compatibility). This makes it easy to place new checks in the appropriate section as the spec evolves.

  3. Inline spec text drives each requirement. Every section block contains the human-readable spec text (RFC 2119 wording) followed immediately by the function calls that implement the corresponding checks (e.g., check_coordinate_names, check_georeferencing, check_gdal_compatibility). This keeps the specification and its enforcement side by side.

  4. Checking functions are reusable and organized by dataset section. Validators for coordinates, data variables, global attrs, and tool compatibility live at paths like mlcast_dataset_validator/checks/<dataset_section>/<dataset_aspect>.py, each exposing functions named check_<dataset_property>. Specs import the relevant function(s) for each section.

mlcast_dataset_validator/
├── specs/
│   ├── source_data/
│   │   └── radar_precipitation.py
│   ├── training_data/
│   │   └── ... (no specs yet)
│   └── cli.py
└── checks/
    ├── coords/
    │   ├── names.py (check_coordinate_names)
    │   ├── spatial.py
    │   ├── temporal.py
    │   └── variable_timestep.py
    ├── data_vars/
    ├── global_attributes/
    └── tool_compatibility/

Usage

The validator can be run from the command line, or imported and called directly from Python, as shown below.

From the command-line

The easiest way to run the validator is to use uv and execute it directly from the PyPI release (mlcast-dataset-validator):

uvx --from mlcast-dataset-validator mlcast.validate_dataset <data_stage> <product> <dataset-path>

For example, you can validate a local Zarr dataset like this:

uvx --from mlcast-dataset-validator mlcast.validate_dataset source_data radar_precipitation /path/to/radar_precip_source.zarr

The validator also supports remote Zarr datasets hosted in S3 buckets with custom endpoints. For example, we can run it on the Radklim Zarr already available in the Intake catalog:

uvx --from mlcast-dataset-validator mlcast.validate_dataset source_data radar_precipitation s3://mlcast-source-datasets/radklim/v0.1.1/5_minutes.zarr/ --s3-endpoint-url https://object-store.os-api.cci2.ecmwf.int --s3-anon

Or you can of course clone the repository and run it directly:

git clone
cd mlcast-sourcedata-validator
pip install -e .
mlcast.validate_dataset source_data radar_precipitation /path/to/zarr/file.zarr

From Python

You can also integrate the validator into your Python workflow by importing the relevant spec and calling it directly with an xr.Dataset object. This is how the validator is used in the CI of the mlcast-datasets repository to validate datasets on every PR and main branch commit.

For example, to validate the same Radklim Zarr dataset from Python:

import xarray as xr

from mlcast_dataset_validator.specs.source_data import radar_precipitation

storage_options = {
    "endpoint_url": "https://object-store.os-api.cci2.ecmwf.int",
    "anon": True,
}

ds = xr.open_zarr(
    "s3://mlcast-source-datasets/radklim/v0.1.1/5_minutes.zarr/",
    storage_options=storage_options,
)
# Preserve storage options on the dataset so zarr_format checks can inspect remote-store metadata correctly.
ds.encoding.setdefault("storage_options", storage_options)

report, _ = radar_precipitation.validate_dataset(ds)
report.console_print()
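As a rough illustration of the CI usage mentioned above (the actual mlcast-datasets workflow may look different), a GitHub Actions job could invoke the validator via uvx on every pull request and main-branch commit:

```yaml
# Hypothetical workflow sketch; names and structure are illustrative.
name: validate-datasets
on:
  pull_request:
  push:
    branches: [main]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: astral-sh/setup-uv@v5
      - run: >
          uvx --from mlcast-dataset-validator
          mlcast.validate_dataset source_data radar_precipitation
          s3://mlcast-source-datasets/radklim/v0.1.1/5_minutes.zarr/
          --s3-endpoint-url https://object-store.os-api.cci2.ecmwf.int
          --s3-anon
```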
