
Intake catalog for datasets relevant for machine learning based nowcasting


MLCast Community intake catalog


Hi! 👋

You are looking at the source data intake catalog for the MLCast community. This is a collection of datasets we have curated with the aim of making them available for building machine learning training datasets.

The following diagram shows the intended data flow and how the intake catalog (this repository) fits into the overall architecture of the MLCast project.


How to use this catalog

To use the catalog, you can either a) install the necessary Python packages yourself and read the catalog directly from GitHub, or b) install the most recent tagged release of the mlcast_datasets Python package from PyPI and read the catalog included in that release. Reading the catalog from GitHub is useful if you want the most recent version of the catalog, while installing the mlcast_datasets package gives you a stable, versioned copy.

a) Reading the catalog directly from GitHub

To read and open datasets in the catalog you will need to have the following packages installed:

pip install intake intake-xarray zarr jinja2

Or, you can install the mlcast-datasets package directly from this repository, which will pull in all the necessary dependencies:

pip install git+https://github.com/mlcast-community/mlcast-datasets

The catalog (and underlying data) can then be accessed directly from Python:

import intake
cat = intake.open_catalog("https://raw.githubusercontent.com/mlcast-community/mlcast-datasets/main/src/mlcast_datasets/catalog/catalog.yml")
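If you need a reproducible read rather than the moving main branch, the raw URL can point at a git tag instead. A minimal sketch, assuming the repository has published release tags (the tag name v0.1.0 here is hypothetical; substitute a real one):

```python
# Build a catalog URL pinned to a specific git ref instead of "main".
# "v0.1.0" is a hypothetical tag name; pick an actual tag from the
# repository's releases page for reproducible reads.
ref = "v0.1.0"
catalog_url = (
    "https://raw.githubusercontent.com/mlcast-community/mlcast-datasets/"
    f"{ref}/src/mlcast_datasets/catalog/catalog.yml"
)
# cat = intake.open_catalog(catalog_url)  # requires intake to be installed
print(catalog_url)
```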

b) Installing the mlcast_datasets package

To install the most recent tagged release of the mlcast_datasets package, you can use pip:

pip install mlcast-datasets

and then read the catalog from the package:

import mlcast_datasets
cat = mlcast_datasets.open_catalog()

Using data within the catalog

Once you have opened the catalog, you can list the available sources with:

>>> list(cat)
['precipitation']

>>> list(cat.precipitation)
['radklim_hourly', 'radklim_5_minutes']

Then load up a dask-backed xarray.Dataset so that you have access to all the available variables and attributes in the dataset:

>>> ds = cat.precipitation.radklim_5_minutes.to_dask()
>>> ds
<xarray.Dataset> Size: 10TB
Dimensions:  (time: 2419200, y: 1100, x: 900)
Coordinates:
    lat      (y, x) float64 8MB dask.array<chunksize=(1100, 900), meta=np.ndarray>
    lon      (y, x) float64 8MB dask.array<chunksize=(1100, 900), meta=np.ndarray>
  * time     (time) datetime64[ns] 19MB 2001-01-01 ... 2023-12-31T23:55:00
  * x        (x) float64 7kB -443.0 -442.0 -441.0 -440.0 ... 454.0 455.0 456.0
  * y        (y) float64 9kB -4.758e+03 -4.757e+03 ... -3.66e+03 -3.659e+03
Data variables:
    RR       (time, y, x) float32 10TB dask.array<chunksize=(1, 1100, 900), meta=np.ndarray>
    crs      float64 8B ...
Attributes:
    Author:                Harald Rybka, Katharina Lengfeld
    Conventions:           CF-1.6
    history:               Created at 2021-07-09 09:10:06.385653
    institution:           Deutscher Wetterdienst (DWD)
    reference:             10.5676/DWD/RADKLIM_YW_V2017.002
    title:                 RADKLIM - radar-based precipitation climatology
    zarr_creation:         created with mlcast_dataset_radklim (https://githu...
    zarr_dataset_version:  0.1.0
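
Note that the RR variable above is chunked one time step at a time (chunks of (1, 1100, 900)), so reductions over the 10 TB array can stream chunk by chunk instead of loading everything into memory; this is what dask does under the hood when you call something like ds.RR.mean().compute(). A toy sketch of that streaming pattern, using a small fake grid in place of the real (1100, 900) frames (this is an illustration of the idea, not the catalog API):

```python
import numpy as np

def streaming_time_mean(chunks):
    """Accumulate a mean over an iterable of per-time-step 2D arrays,
    holding only one "chunk" (time step) in memory at a time."""
    total = None
    count = 0
    for chunk in chunks:
        total = chunk if total is None else total + chunk
        count += 1
    return total / count

# Three fake "time steps" on a tiny grid standing in for (1100, 900) frames.
fake_steps = (np.full((4, 5), v, dtype=np.float32) for v in (0.0, 1.0, 2.0))
mean_field = streaming_time_mean(fake_steps)  # every cell averages to 1.0
```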

Start using the dataset 🙂

Contributing

We are always looking for new datasets to add to the catalog. If you have a dataset you would like to contribute, please open an issue or a pull request.
