Africa-first library for data, forecasts, and benchmarking.
Sheerwater
A weather forecast and data benchmarking library. The Sheerwater project is working to benchmark ML- and physics-based weather and climate forecasts regionally and globally with a focus on model performance on the African continent.
Sheerwater contains a set of data accessors to fetch common forecasts and ground-truth data sources, a library of common evaluation metrics, and a metrics interface to validate forecasts against data products and station data.
Getting started
To run this code, you need read access to Sheerwater forecasts and ground-truth data stored in our cloud bucket. Some of this data, including CHIRPS, IMERG, ERA5, and ECMWF ER, is in a public bucket that requires no additional credentials, so all you have to do is:
- Install sheerwater in your environment:
pip install sheerwater
- Use sheerwater to access forecasts or data:
from sheerwater.reanalysis import era5
from sheerwater.data import ghcn, chirps_v3
from sheerwater.metrics import metric
# Get ERA5 as an xarray object
ds_era5 = era5("2020-01-01", "2022-01-01", agg_days=1, variable="precip", grid="global1_5")
# Get gridded GHCN weather station data
ds_ghcn = ghcn("2020-01-01", "2022-01-01", agg_days=7, variable="precip", grid="global0_25")
# Get CHIRPS data with default parameters
ds_chirps = chirps_v3()
- Run evaluation metrics on public forecasts or data
# Run an evaluation metric - this might take some time!
val = metric("2016-01-01", "2022-12-31", forecast="era5", truth="ghcn", variable="precip",
             metric_name="mae", region="country", grid="global1_5")
print(val)
Data access and storage philosophy
Sheerwater accesses and transforms terabyte- and sometimes petabyte-scale datasets. It uses Nuthatch
to store, recall, and slice these datasets for more efficient access. In Nuthatch, the results of functions
are stored in caches keyed on the function name and the arguments that are passed. When the same function is called with
the same key arguments, the stored data is returned rather than rerunning the function.
Datasets are primarily stored in cloud storage buckets (caches), some of which are public and some of which are access-restricted.
When a function is called, Nuthatch checks whether a global view of the data exists. If it does, Nuthatch slices the data down in time and space
as requested by the function arguments and returns that data, or a view of that data, to the user.
Nuthatch also enables a user to copy a slice of data to their local machine and then reuse that data for more efficient repeated analysis.
When you install Sheerwater, you are automatically configured to access all of Sheerwater's public data through Nuthatch. When you call Sheerwater functions, you will therefore mostly be hitting pre-computed results. The code still serves as a self-documenting API of how the data was transformed from its source to the result. Users with enough compute can rerun it, either to verify results or to compute functions with arguments that have not already been computed and stored.
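The caching idea described above (results stored by function name and arguments, and returned on a repeat call) can be illustrated with a minimal sketch. This is not Nuthatch's API: the `cached` decorator, the `cache_key` helper, the in-memory `_CACHE` dictionary, and the `toy_precip` accessor are all hypothetical names invented for illustration, and a real cache would live in a storage bucket rather than in memory.

```python
import functools
import hashlib
import json

_CACHE = {}  # hypothetical in-memory stand-in for a cloud-bucket cache


def cache_key(func_name, key_args):
    """Build a stable key from the function name and its key arguments."""
    payload = json.dumps({"func": func_name, "args": key_args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached(func):
    """Return a stored result when the same keyword arguments are seen again."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        key = cache_key(func.__name__, kwargs)
        if key not in _CACHE:
            _CACHE[key] = func(**kwargs)  # cache miss: run the function once
        return _CACHE[key]
    return wrapper


calls = []  # records each real execution of the toy accessor


@cached
def toy_precip(start_time, end_time, grid):
    # Hypothetical accessor standing in for an expensive data transformation.
    calls.append((start_time, end_time, grid))
    return {"grid": grid, "n_days": 2}


first = toy_precip(start_time="2020-01-01", end_time="2020-01-02", grid="global1_5")
second = toy_precip(start_time="2020-01-01", end_time="2020-01-02", grid="global1_5")
```

In this sketch the second call returns the stored object without running `toy_precip` again, while a call with a different `grid` would produce a new key and trigger a fresh computation.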
Available data
| Dataset | Variations | Grids | Aggregations (days) | Available date range | Notes |
|---|---|---|---|---|---|
| IMERG | imerg_late, imerg_final | imerg (native), global0_25, global1_5 | 1, 5, 7, 10 | 1998-01-01 to 2024-12-31 | |
| CHIRPS | chirps_v2, chirps_v3, chirp_v2, chirp_v3 | chirps, global0_25, global1_5 | 1, 5, 7, 10 | 2000-06-01 to 2024-12-31 | Some variations extend back to 1998 |
| ERA5 | era5 | global0_25, global1_5 | 1, 5, 7, 10 | 1998-01-01 to 2024-12-31 | From Google ARCO; only tmp2m and precip regridded |
| GHCN | ghcn, ghcn_avg | global0_25, global1_5 | 1, 5, 7, 10, 14, 30 | 1998-01-01 to 2024-12-31 | ghcn picks a random station in a grid cell; ghcn_avg averages all stations in a grid cell |
| TAHMO | tahmo, tahmo_avg | global0_25, global1_5 | 1, 5, 7, 10, 14, 30 | 2016-01-01 to 2025-06-01 | Requires TAHMO Data Agreement; not in public bucket. tahmo picks a random station in a grid cell; tahmo_avg averages all stations in a grid cell |
| ECMWF IFS ER | ecmwf_ifs_er | global1_5 | 1, 7, 14 | 2016-01-04 to 2023-02-12 | From the WeatherBench archive, known version |
| FuXi S2S | fuxi | global1_5 | 7 | 2016-01-03 to 2022-02-02 | Only precip and tmp2m |
Additional data accessors may be available. Please reach out if you see one in the code base that is not listed here.
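The table notes distinguish station products that pick one random station per grid cell (ghcn, tahmo) from those that average all stations in a cell (ghcn_avg, tahmo_avg). The sketch below illustrates that difference on a regular latitude/longitude grid. It is an assumption-laden illustration, not Sheerwater's implementation: the `cell_of` and `grid_stations` helpers, the cell-indexing scheme, and the station values are all hypothetical.

```python
import math
import random
from collections import defaultdict


def cell_of(lat, lon, res):
    """Return the (row, col) index of the res-degree cell containing a point."""
    return (math.floor(lat / res), math.floor(lon / res))


def grid_stations(stations, res, how="avg", seed=0):
    """Reduce station values to one value per grid cell.

    how="avg" averages every station in a cell; how="random" keeps one
    station chosen at random. Both modes are illustrative only.
    """
    by_cell = defaultdict(list)
    for lat, lon, value in stations:
        by_cell[cell_of(lat, lon, res)].append(value)

    rng = random.Random(seed)  # seeded so the random pick is repeatable
    gridded = {}
    for cell, values in by_cell.items():
        if how == "avg":
            gridded[cell] = sum(values) / len(values)
        elif how == "random":
            gridded[cell] = rng.choice(values)
        else:
            raise ValueError(f"unknown how: {how}")
    return gridded


# Hypothetical stations: (lat, lon, precipitation in mm)
stations = [(0.2, 36.1, 4.0), (0.9, 36.8, 8.0), (-1.3, 36.8, 2.0)]
averaged = grid_stations(stations, res=1.5, how="avg")
picked = grid_stations(stations, res=1.5, how="random")
```

Here the first two stations share a 1.5-degree cell, so the averaged product reports their mean while the random product reports one of the two values. A cell containing a single station gives the same value in both modes.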
Accessing sheerwater private data
Some data requires access to the Sheerwater private bucket. Please send us an email to request access so we can discuss use cases and collaboration. After we have added you to our bucket, you can run the following commands to access data.
curl https://sdk.cloud.google.com | bash
gcloud auth application-default login
Evaluating your own forecasts against your own data
If you have a forecast you would like to evaluate, you can tag it with the sheerwater forecast decorator so that sheerwater can find it for evaluation.
from sheerwater.forecasts import forecast
from sheerwater.data import data
from sheerwater.metrics import metric
# Forecasts must be xarrays with coordinates for lat, lon, init_time, and
# prediction_timedelta with a matching variable on the correct grid
@forecast
def my_forecast(start_time, end_time, agg_days, variable, grid, **kwargs):
    # fetch_forecast is a placeholder for your own loading code
    ds = fetch_forecast(start_time, end_time, agg_days, variable, grid)
    # Rename coordinates and variables to the names sheerwater expects
    ds = ds.rename({'start_time': 'init_time',
                    'timestep': 'prediction_timedelta',
                    'latitude': 'lat',
                    'longitude': 'lon'})
    ds = ds.rename_vars({'precipitation_mm': 'precip'})
    return ds
# Data must be xarrays with coordinates for lat, lon, and time with a
# matching variable on the correct grid
@data
def my_station_data(start_time, end_time, agg_days, variable, grid, **kwargs):
    # fetch_data is a placeholder for your own loading code
    ds = fetch_data(start_time, end_time, agg_days, variable, grid)
    return ds
# Evaluate the forecast
metric("2015-01-01", "2022-01-01", forecast="my_forecast", truth="my_station_data",
       agg_days=1, variable='precip', grid='global1_5', metric_name="bias",
       region="country", time_grouping="month_of_year")
To support data fetching, sheerwater depends on Nuthatch.
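To make the metric names and the month-of-year grouping used in the examples above more concrete, here is a minimal pure-Python sketch of bias (mean signed error) and MAE (mean absolute error) computed per calendar month. It is not how sheerwater computes its metrics: the `grouped_errors` helper and the sample values are hypothetical, and the real library works on gridded xarray data with regional grouping as well.

```python
from collections import defaultdict
from datetime import date


def grouped_errors(pairs):
    """Compute bias and MAE per calendar month.

    pairs: iterable of (date, forecast_value, truth_value).
    Returns {month: {"bias": ..., "mae": ...}}.
    """
    errors = defaultdict(list)
    for day, forecast, truth in pairs:
        # Group by month number so the same month in different years pools together
        errors[day.month].append(forecast - truth)

    result = {}
    for month, errs in sorted(errors.items()):
        n = len(errs)
        result[month] = {
            "bias": sum(errs) / n,                 # mean signed error
            "mae": sum(abs(e) for e in errs) / n,  # mean absolute error
        }
    return result


# Hypothetical daily precipitation pairs in mm: (date, forecast, truth)
pairs = [
    (date(2020, 1, 1), 5.0, 3.0),
    (date(2021, 1, 1), 1.0, 3.0),
    (date(2020, 2, 1), 4.0, 2.0),
]
scores = grouped_errors(pairs)
```

In this toy data the two January errors cancel, so January has zero bias but a non-zero MAE, which is why looking at both metrics can be informative.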
Developing on sheerwater
- Install UV
curl -Ls https://astral.sh/uv/install.sh | sh
- Install Google Cloud CLI and log in:
curl https://sdk.cloud.google.com | bash
gcloud auth application-default login
- Install non-Python dependencies:
brew install hdf5 netcdf
- Install Python dependencies:
uv sync
- Run commands with UV:
uv run python ...
or
uv run jupyter lab
Deployment and Infrastructure
This repository is integrated with the Rhiza infrastructure for deployment of metrics to databases and integration of those databases into Grafana dashboards for visualization. If you are deploying this code on backend infrastructure with Grafana and Terraform, please read the Infrastructure README.