A geostatistical extraction and alignment engine for East African weather and satellite data.
MasharikiWeather
Overview
MasharikiWeather is an open experimental initiative to create an ML-ready, framework-agnostic weather dataset for East Africa.
It draws inspiration from the PeakWeather project — an integrated, harmonized, and machine-learning–ready global climate dataset.
For now, it remains an in-house tool for DSAIL.
The goal is to study and reproduce PeakWeather’s design philosophy, adapting its core principles to African data realities such as sparse station coverage, multimodal data sources, and irregular spatiotemporal grids.
Ultimately, MasharikiWeather aims to be a multi-variable benchmark dataset that supports physics-based, AI-based, and hybrid forecasting pipelines across frameworks like PyTorch, TensorFlow, JAX, and NumPy.
Usage
Create a Python virtual environment and activate it.
python -m venv .venv
source .venv/bin/activate
Install the package.
pip install masharikiweather
Authentication
This pipeline streams data directly from the DeKUT-DSAIL/weather-data Hugging Face repository, so you need a Hugging Face access token that can read that repository.
- If running locally, authenticate via the CLI:
  huggingface-cli login
- If running in Colab, store your token in the Colab Secrets manager.
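One way to avoid hard-coding the token is to resolve it at runtime. The sketch below is an illustration only: the helper name `resolve_hf_token` and the secret name `HF_TOKEN` are assumptions, not part of the package, and `google.colab.userdata` is only importable inside Colab.

```python
import os


def resolve_hf_token(secret_name="HF_TOKEN"):
    """Return a Hugging Face token from the environment or Colab Secrets.

    Hypothetical helper for illustration; not part of masharikiweather.
    """
    # 1. Prefer an environment variable (works locally and in CI).
    token = os.environ.get(secret_name)
    if token:
        return token

    # 2. Fall back to the Colab Secrets manager when running in Colab.
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None
    if userdata is not None:
        try:
            token = userdata.get(secret_name)
        except Exception:
            token = None  # secret missing or notebook access not granted
        if token:
            return token

    raise RuntimeError(
        f"No Hugging Face token found under the name {secret_name!r}."
    )
```

The returned string can then be passed as the `token` argument shown in the Quickstart instead of a literal value.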
Quickstart
from masharikiweather import MasharikiWeatherDataset

# 1. Initialize the Pipeline (Handles caching and network fusion)
ds = MasharikiWeatherDataset(
    repo_id="DeKUT-DSAIL/weather-data",
    token="YOUR_HF_TOKEN",
    source_obs=["tahmo", "ghcnd"],  # Fusing hourly and daily networks
    freq="h",
    years=[2023, 2024],
)

# 2. Extract Gridded Satellite/Reanalysis Context
gridded_data = ds.get_gridded_for_stations(
    groups=["era5"],
    stations=["TA00001", "TA00283"],
    variables=["total_precipitation"],
    method="linear",  # Bilinear interpolation
)

# 3. Generate ML Tensors (Aligned and Windowed)
ml_tensors = ds.get_windows(
    window_size=24,  # 24 hours of historical context
    horizon_size=6,  # 6 hours of prediction
    stations=["TA00001", "TA00283"],
    gridded_url=["era5"],
    as_xarray=True,
)

print(ml_tensors.x)  # Your aligned features
print(ml_tensors.y)  # Your targets
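To make the `window_size` / `horizon_size` idea concrete, here is a minimal NumPy sketch of sliding-window splitting on a single synthetic series. It is not the package's implementation of `get_windows`; the function `make_windows` and the stride of one step are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(series, window_size, horizon_size):
    """Split a 1-D series into (history, target) pairs with stride 1."""
    series = np.asarray(series, dtype=float)
    total = window_size + horizon_size
    if series.ndim != 1 or series.size < total:
        raise ValueError(
            "need a 1-D series with at least window_size + horizon_size steps"
        )
    # Each row holds `total` consecutive time steps.
    rows = sliding_window_view(series, total)
    x = rows[:, :window_size]  # historical context
    y = rows[:, window_size:]  # values to predict
    return x, y


hourly = np.arange(48.0)  # two days of synthetic hourly values
x, y = make_windows(hourly, window_size=24, horizon_size=6)
print(x.shape, y.shape)  # (19, 24) (19, 6)
```

With 48 hourly steps and 30 steps per sample, 19 overlapping samples are produced; the first target value follows directly after the last value of its history window.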
Objectives
- Reproduce and understand the PeakWeather pipeline
  - Explore its dataset schema, preprocessing philosophy, and data fusion principles.
- Develop an East Africa-centered multi-source fusion framework
  - Harmonize station, reanalysis, satellite, and static prior datasets in a unified structure.
- Build a benchmark-ready, multi-variable dataset
  - Include precipitation, temperature, humidity, solar radiation, wind, and other key atmospheric variables.
- Enable framework-agnostic ML integration
  - Support easy export and loading across ML frameworks using formats like Zarr, NetCDF, and HDF5.
- Advance East African climate AI infrastructure
  - Provide standardized, transparent, and reproducible weather datasets tailored to African needs.
Core Concept
East Africa’s meteorological landscape is characterized by:
- Sparse ground observations (TAHMO, GHCNd).
- Diverse gridded data products (ERA5, CHIRPS, TAMSAT, IMERG).
- Static surface properties that influence local weather (elevation, slope, aspect, land cover).
- Spatial and temporal inconsistencies across sources.
MasharikiWeather seeks to bridge these gaps through:
- Spatiotemporal Graph Learning of station, satellite, and reanalysis data.
- Integration of static priors to capture topographic and land–surface context.
- Unified variable alignment for consistent modeling inputs.
- Multi-scale representation, enabling both local and continental model evaluation.
- ML-ready exports, inspired by PeakWeather’s compatibility-first design.
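Graph learning over stations needs an adjacency structure first. The following is a rough sketch of one common choice, a k-nearest-neighbour graph built from great-circle distances; the station IDs and coordinates are synthetic, and `knn_edges` is a hypothetical helper rather than anything the package is known to provide.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def knn_edges(stations, k):
    """Return directed edges (src, dst, km) to each station's k nearest peers.

    `stations` maps a station id to a (lat, lon) tuple.
    """
    edges = []
    for src, (lat, lon) in stations.items():
        dists = sorted(
            (haversine_km(lat, lon, *coords), dst)
            for dst, coords in stations.items()
            if dst != src
        )
        edges.extend((src, dst, d) for d, dst in dists[:k])
    return edges


# Synthetic coordinates, for illustration only.
stations = {"A": (-1.29, 36.82), "B": (-0.42, 36.95), "C": (-4.04, 39.67)}
edges = knn_edges(stations, k=1)
for src, dst, km in edges:
    print(f"{src} -> {dst}: {km:.0f} km")
```

In sparse networks a distance cutoff or distance-based edge weights may be preferable to a fixed k, since the k nearest stations can be hundreds of kilometres away.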
Data Sources
| Source | Type | Coverage | Variables | Role |
|---|---|---|---|---|
| TAHMO | In-situ (stations) | Sub-Saharan Africa | Precipitation, Temperature | Ground truth |
| ERA5 | Reanalysis | Global | Full atmospheric suite | Physics-based baseline |
| CHIRPS | Satellite + Gauge | Quasi-global (50°S–50°N), 1981–present | Precipitation | Long-term rainfall |
| TAMSAT | Satellite | Africa | Precipitation | Bias-corrected rainfall |
| IMERG | Satellite | Global | Precipitation | Half-hourly rainfall |
| Static Priors (EE) | Earth Engine Layers | Africa | Elevation, Slope, Aspect, Land Cover, Distance to Water | Geophysical context |
| (Future) ECMWF ML, FuXi, GraphCast, FourCastNet | ML forecast models | Global | Precip, Temp, Wind, Radiation | ML & hybrid forecasts |
Alignment with PeakWeather Roadmap
| PeakWeather Focus | MasharikiWeather Adaptation |
|---|---|
| Global ML-ready weather dataset | East African-focused ML-ready dataset |
| Harmonized across ERA5, GFS, and observations | Fusion of TAHMO, ERA5, CHIRPS, TAMSAT, static priors |
| Precipitation-focused benchmarking | Multi-variable (precip, temp, humidity, radiation, topography) |
| Cloud-scale Zarr exports | Cloud and local exports via Zarr / NetCDF |
| Open and reproducible ML access | Reproducible African weather research |
Phased Roadmap
Phase 1 — PeakWeather Exploration
- Study PeakWeather’s documentation, schema, and data loaders.
- Analyze its variable harmonization and metadata organization.
- Run sample ML-ready preprocessing on a small African region.
Phase 2 — MasharikiWeather Schema Design
- Define temporal resolution (e.g., 6-hourly or daily).
- Define spatial structure (station points vs gridded data).
- Standardize variable names and CF-compliant metadata.
- Establish coordinate references (lat/lon/time).
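As an illustration of what standardized names and CF-style metadata could look like, the sketch below maps raw column names to CF standard names with canonical units. The raw names (`temp_c`, `precip_mm`, and so on) and the helper `standardize_record` are hypothetical; the actual schema is still to be defined in this phase.

```python
# Hypothetical raw column names -> (CF standard name, canonical units, converter).
CF_VARIABLES = {
    # 1 mm of liquid water over 1 m^2 weighs 1 kg, so the value is unchanged.
    "precip_mm": ("precipitation_amount", "kg m-2", lambda v: v),
    "temp_c": ("air_temperature", "K", lambda v: v + 273.15),
    # CF expresses relative humidity as a fraction, not a percentage.
    "rh_percent": ("relative_humidity", "1", lambda v: v / 100.0),
    "wind_ms": ("wind_speed", "m s-1", lambda v: v),
}


def standardize_record(record):
    """Rename raw variables to CF standard names and convert their units."""
    values, attrs = {}, {}
    for raw_name, value in record.items():
        if raw_name not in CF_VARIABLES:
            raise KeyError(f"no CF mapping defined for {raw_name!r}")
        standard_name, units, convert = CF_VARIABLES[raw_name]
        values[standard_name] = convert(value)
        attrs[standard_name] = {"standard_name": standard_name, "units": units}
    return values, attrs


values, attrs = standardize_record({"temp_c": 25.0, "precip_mm": 3.2})
print(values)
print(attrs["air_temperature"])
```

Keeping the converter next to the target units makes each transformation explicit, which fits the transparency goal stated under Guiding Principles.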
Phase 3 — TAHMO + ERA5 Integration
- Align station-based and gridded data through nearest-grid-cell matching or interpolation.
- Handle irregular sampling and missing timestamps.
- Store as a unified xarray.Dataset with metadata and attributes.
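The two alignment options named above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the package's code: the grid is regular, both axes are sorted in ascending order, and the function names are hypothetical.

```python
import numpy as np


def nearest_at_point(lats, lons, field, lat, lon):
    """Value of the grid cell whose centre is closest to (lat, lon)."""
    i = int(np.argmin(np.abs(np.asarray(lats, dtype=float) - lat)))
    j = int(np.argmin(np.abs(np.asarray(lons, dtype=float) - lon)))
    return float(np.asarray(field, dtype=float)[i, j])


def bilinear_at_point(lats, lons, field, lat, lon):
    """Bilinear interpolation of `field` (shape: lat x lon) at one point.

    Assumes `lats` and `lons` are 1-D and ascending.
    """
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    field = np.asarray(field, dtype=float)
    if not (lats[0] <= lat <= lats[-1] and lons[0] <= lon <= lons[-1]):
        raise ValueError("station lies outside the grid")
    # Lower-left corner of the enclosing cell, clipped at the upper edge.
    i = min(int(np.searchsorted(lats, lat, side="right")) - 1, len(lats) - 2)
    j = min(int(np.searchsorted(lons, lon, side="right")) - 1, len(lons) - 2)
    # Fractional position of the point inside that cell.
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    south = (1 - tx) * field[i, j] + tx * field[i, j + 1]
    north = (1 - tx) * field[i + 1, j] + tx * field[i + 1, j + 1]
    return float((1 - ty) * south + ty * north)


lats, lons = [0.0, 1.0], [0.0, 1.0]
field = [[0.0, 1.0], [2.0, 3.0]]  # synthetic values equal to 2*lat + lon
print(bilinear_at_point(lats, lons, field, 0.5, 0.5))  # 1.5
print(nearest_at_point(lats, lons, field, 0.9, 0.2))   # 2.0
```

If a product stores latitude from north to south, sort the axis (and flip the field accordingly) before calling these helpers.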
Phase 4 — Multi-source Expansion
- Add CHIRPS, IMERG and TAMSAT for multi-sensor rainfall comparison.
- Incorporate temperature, humidity, radiation, and wind from ERA5.
- Evaluate inter-product correlations, bias, and consistency.
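A sketch of how bias, RMSE, and correlation between two rainfall products might be computed on co-located series. The function `compare_products` is a hypothetical helper, and the pairwise dropping of missing values is an assumed convention rather than a documented one.

```python
import numpy as np


def compare_products(reference, candidate):
    """Bias, RMSE and Pearson r of `candidate` against `reference`.

    Time steps where either series is NaN are dropped pairwise.
    """
    ref = np.asarray(reference, dtype=float)
    cand = np.asarray(candidate, dtype=float)
    if ref.shape != cand.shape:
        raise ValueError("series must have the same shape")
    valid = ~(np.isnan(ref) | np.isnan(cand))
    ref, cand = ref[valid], cand[valid]
    if ref.size < 2:
        raise ValueError("need at least two valid pairs")
    diff = cand - ref
    return {
        "n": int(ref.size),
        "bias": float(diff.mean()),  # positive: candidate overestimates
        "rmse": float(np.sqrt((diff ** 2).mean())),
        "pearson_r": float(np.corrcoef(ref, cand)[0, 1]),
    }


# Synthetic daily totals (mm); NaN marks a missing value.
gauge = [0.0, 1.0, 2.0, np.nan, 4.0]
satellite = [1.0, 2.0, 3.0, 5.0, np.nan]
scores = compare_products(gauge, satellite)
print(scores)
```

Rainfall is heavily skewed, so these summary scores are usually complemented with detection metrics on wet/dry days; that is outside the scope of this sketch.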
Phase 5 — Integrate Static Priors
- Merge Earth Engine static features (elevation, slope, aspect, land cover, distance to water).
- Harmonize to match ERA5 and CHIRPS grids.
- Enable topography-aware model development.
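Harmonizing a fine static layer to a coarser grid can be approximated by block averaging when the coarse cell size is an integer multiple of the fine one. The sketch below assumes exactly that, with aligned cell edges; `block_mean` is a hypothetical helper, and real layers will generally need proper reprojection and resampling.

```python
import numpy as np


def block_mean(fine, factor):
    """Average non-overlapping factor x factor blocks of a 2-D array."""
    fine = np.asarray(fine, dtype=float)
    rows, cols = fine.shape
    if rows % factor or cols % factor:
        raise ValueError("grid shape must be divisible by factor")
    # Split each axis into (coarse index, position inside the block).
    blocks = fine.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))


elevation = np.arange(16.0).reshape(4, 4)  # synthetic 4x4 elevation grid
coarse = block_mean(elevation, factor=2)
print(coarse)
# [[ 2.5  4.5]
#  [10.5 12.5]]
```

Averaging suits continuous layers such as elevation or distance to water. Categorical layers such as land cover need a majority (mode) rule or per-class fractions instead.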
Phase 6 — ML-Ready Export
- Export standardized, chunked datasets to Zarr and NetCDF.
- Develop lightweight data loaders for PyTorch, TensorFlow, and JAX.
- Preserve metadata and normalization info for each variable.
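Preserving normalization info could be as simple as storing per-variable statistics next to the exported data. The following sketch uses z-score scaling and JSON serialization; the function names and the choice of mean/std statistics are assumptions for illustration.

```python
import json

import numpy as np


def fit_normalization(train):
    """Per-variable mean and std computed on training data, ignoring NaNs."""
    stats = {}
    for name, values in train.items():
        values = np.asarray(values, dtype=float)
        std = float(np.nanstd(values))
        stats[name] = {
            "mean": float(np.nanmean(values)),
            "std": std if std > 0 else 1.0,  # avoid division by zero
        }
    return stats


def normalize(values, stat):
    return (np.asarray(values, dtype=float) - stat["mean"]) / stat["std"]


def denormalize(values, stat):
    return np.asarray(values, dtype=float) * stat["std"] + stat["mean"]


train = {"air_temperature": [290.0, 295.0, 300.0]}  # synthetic values in K
stats = fit_normalization(train)
stats_json = json.dumps(stats)  # plain JSON, readable from any framework

z = normalize(train["air_temperature"], stats["air_temperature"])
restored = denormalize(z, json.loads(stats_json)["air_temperature"])
print(restored)
```

Fitting the statistics on the training split only, and shipping them with the dataset, lets PyTorch, TensorFlow, and JAX loaders apply the same scaling without recomputing it.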
Phase 7 — Benchmark & Evaluation
- Implement baseline models using PeakWeather-style workflows.
- Compare model performance across variables and regions.
- Publish visual and quantitative evaluations.
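A persistence forecast is a common reference point for any learned model, so a benchmark might start from something like the sketch below. It is not taken from PeakWeather's workflows; the function names and the per-lead-time MAE are assumptions chosen for illustration.

```python
import numpy as np


def persistence_forecast(x, horizon_size):
    """Repeat the last observed value of each window over the horizon.

    `x` has shape (samples, window_size).
    """
    x = np.asarray(x, dtype=float)
    return np.repeat(x[:, -1:], horizon_size, axis=1)


def mae_per_lead(y_true, y_pred):
    """Mean absolute error for each lead time, averaged over samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean(axis=0)


# Synthetic windows: 2 samples, 3 history steps, 2 forecast steps.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y = [[4.0, 5.0], [7.0, 8.0]]
baseline = persistence_forecast(x, horizon_size=2)
print(baseline)               # [[3. 3.] [6. 6.]]
print(mae_per_lead(y, baseline))  # [1. 2.]
```

Reporting error per lead time shows how quickly skill decays over the horizon, which a single averaged score would hide.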
Guiding Principles
- Reproducibility — Version-controlled, scriptable data processing.
- Transparency — Clear documentation for every transformation step.
- Scalability — Built for cloud-scale workflows (DVC, Prefect, Zarr).
- Inclusivity — Designed around African data sources and use cases.
- Framework-agnosticism — ML-ready for PyTorch, TensorFlow, and beyond.
Contributing
We welcome active experimentation and stress-testing from the DSAIL team! Whether you are testing a new spatial masking technique, adding a new satellite data source, or optimizing the data loaders, we want your contributions.
To ensure the core engine remains stable while we experiment, please review our Contribution Guidelines before pushing code. All new features and experiments should be developed on a separate branch and submitted via a Pull Request (PR) for peer review.
Credits
Developed as part of an effort to advance localized, data-driven weather prediction for East Africa, inspired by PeakWeather and WeatherBench2.
MasharikiWeather is a step toward open, harmonized, and equitable climate AI infrastructure for East Africa.