A geostatistical extraction and alignment engine for East African weather and satellite data.

MasharikiWeather

Overview

MasharikiWeather is an open experimental initiative to create an ML-ready, framework-agnostic weather dataset for East Africa.
It draws inspiration from the PeakWeather project — an integrated, harmonized, and machine-learning–ready global climate dataset.

For now, MasharikiWeather remains an in-house tool for DSAIL.

The goal is to study and reproduce PeakWeather’s design philosophy, adapting its core principles to African data realities such as sparse station coverage, multimodal data sources, and irregular spatiotemporal grids.

Ultimately, MasharikiWeather aims to be a multi-variable benchmark dataset that supports physics-based, AI-based, and hybrid forecasting pipelines across frameworks like PyTorch, TensorFlow, JAX, and NumPy.


Usage

Create a Python virtual environment and activate it.

python -m venv .venv
source .venv/bin/activate

Install the package.

pip install masharikiweather

Authentication

This pipeline streams data directly from the DeKUT-DSAIL/weather-data Hugging Face repository, so you need a Hugging Face access token that is authorized to read that repository.

  • If running locally, you can authenticate via the CLI: huggingface-cli login

  • If running in Colab, securely store your token in the Colab Secrets manager.
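
Rather than pasting the token into source code, it can be resolved at run time. The helper below is a minimal sketch and is not part of masharikiweather: `resolve_hf_token` is a hypothetical name, and it assumes the token is stored under the name `HF_TOKEN`, both as an environment variable and as the Colab secret.

```python
import os


def resolve_hf_token(name: str = "HF_TOKEN") -> str:
    """Return a Hugging Face token from the environment or Colab Secrets."""
    token = os.environ.get(name)
    if token:
        return token
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None
    if userdata is not None:
        # Raises if the secret is missing or notebook access is not granted.
        return userdata.get(name)
    raise RuntimeError(f"No token found: set the {name} environment variable.")
```

The returned string can then be passed as the `token` argument when constructing the dataset.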

Quickstart

from masharikiweather import MasharikiWeatherDataset

# 1. Initialize the Pipeline (Handles caching and network fusion)
ds = MasharikiWeatherDataset(
    repo_id="DeKUT-DSAIL/weather-data",
    token="YOUR_HF_TOKEN",  # placeholder; avoid committing a real token
    source_obs=["tahmo", "ghcnd"], # Fusing hourly and daily networks
    freq="h", 
    years=[2023, 2024]
)

# 2. Extract Gridded Satellite/Reanalysis Context
gridded_data = ds.get_gridded_for_stations(
    groups=["era5"], 
    stations=['TA00001', 'TA00283'], 
    variables=['total_precipitation'],
    method="linear" # Bilinear interpolation
)

# 3. Generate ML Tensors (Aligned and Windowed)
ml_tensors = ds.get_windows(
    window_size=24,  # 24 hours of historical context
    horizon_size=6,  # 6 hours of prediction
    stations=['TA00001', 'TA00283'],
    gridded_url=["era5"],
    as_xarray=True
)

print(ml_tensors.x) # Your aligned features
print(ml_tensors.y) # Your targets
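
For intuition about `window_size` and `horizon_size`, the NumPy sketch below builds context/target pairs from a regularly sampled series. It is an illustration only and makes no claim about how `get_windows` is implemented; `make_windows` is a hypothetical helper, and it assumes the input has no missing time steps.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(series, window_size, horizon_size):
    """Split a (time, ...) array into context windows x and targets y."""
    series = np.asarray(series)
    total = window_size + horizon_size
    if series.shape[0] < total:
        raise ValueError("series is shorter than window_size + horizon_size")
    # sliding_window_view appends the window axis last: (n, ..., total).
    spans = sliding_window_view(series, total, axis=0)
    # Move the window axis next to the sample axis: (n, total, ...).
    spans = np.moveaxis(spans, -1, 1)
    return spans[:, :window_size], spans[:, window_size:]


hourly = np.arange(80.0).reshape(40, 2)  # 40 time steps, 2 stations
x, y = make_windows(hourly, window_size=24, horizon_size=6)
print(x.shape, y.shape)  # (11, 24, 2) (11, 6, 2)
```

Each sample pairs 24 consecutive steps of context with the 6 steps that immediately follow, so 40 time steps yield 40 - 30 + 1 = 11 samples.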

Objectives

  1. Reproduce and understand the PeakWeather pipeline
    • Explore its dataset schema, preprocessing philosophy, and data fusion principles.
  2. Develop an East Africa-centered multi-source fusion framework
    • Harmonize station, reanalysis, satellite, and static prior datasets in a unified structure.
  3. Build a benchmark-ready, multi-variable dataset
    • Include precipitation, temperature, humidity, solar radiation, wind, and other key atmospheric variables.
  4. Enable framework-agnostic ML integration
    • Support easy export and loading across ML frameworks using formats like Zarr, NetCDF, and HDF5.
  5. Advance East African climate AI infrastructure
    • Provide standardized, transparent, and reproducible weather datasets tailored to African needs.

Core Concept

East Africa’s meteorological landscape is characterized by:

  • Sparse ground observations (TAHMO, GHCNd).
  • Diverse gridded data products (ERA5, CHIRPS, TAMSAT, IMERG).
  • Static surface properties that influence local weather (elevation, slope, aspect, land cover).
  • Spatial and temporal inconsistencies across sources.

MasharikiWeather seeks to bridge these gaps through:

  • Spatiotemporal Graph Learning of station, satellite, and reanalysis data.
  • Integration of static priors to capture topographic and land–surface context.
  • Unified variable alignment for consistent modeling inputs.
  • Multi-scale representation, enabling both local and continental model evaluation.
  • ML-ready exports, inspired by PeakWeather’s compatibility-first design.

Data Sources

| Source | Type | Coverage | Variables | Role |
| --- | --- | --- | --- | --- |
| TAHMO | In-situ (stations) | Sub-Saharan Africa | Precipitation, Temperature | Ground truth |
| ERA5 | Reanalysis | Global | Full atmospheric suite | Physics-based baseline |
| CHIRPS | Satellite + Gauge | 1981–Present | Precipitation | Long-term rainfall |
| TAMSAT | Satellite | Africa | Precipitation | Bias-corrected rainfall |
| IMERG | Satellite | Global | Precipitation | Half-hourly rainfall |
| Static Priors (EE) | Earth Engine Layers | Africa | Elevation, Slope, Aspect, Land Cover, Distance to Water | Geophysical context |
| (Future) | ECMWF ML, FuXi, GraphCast, FourCastNet | Global | Precip, Temp, Wind, Radiation | ML & hybrid forecasts |

Alignment with PeakWeather Roadmap

| PeakWeather Focus | MasharikiWeather Adaptation |
| --- | --- |
| Global ML-ready weather dataset | East African-focused ML-ready dataset |
| Harmonized across ERA5, GFS, and observations | Fusion of TAHMO, ERA5, CHIRPS, TAMSAT, static priors |
| Precipitation-focused benchmarking | Multi-variable (precip, temp, humidity, radiation, topography) |
| Cloud-scale Zarr exports | Cloud and local exports via Zarr / NetCDF |
| Open and reproducible ML access | Reproducible African weather research |

Phased Roadmap

Phase 1 — PeakWeather Exploration

  • Study PeakWeather’s documentation, schema, and data loaders.
  • Analyze its variable harmonization and metadata organization.
  • Run sample ML-ready preprocessing on a small African region.

Phase 2 — MasharikiWeather Schema Design

  • Define temporal resolution (e.g., 6-hourly or daily).
  • Define spatial structure (station points vs gridded data).
  • Standardize variable names and CF-compliant metadata.
  • Establish coordinate references (lat/lon/time).
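
As a rough illustration of the variable-name standardization step, the sketch below maps source-specific columns to CF standard names and canonical units. The source column names (`temp_c`, `precip_mm`, `rh_percent`, `wind_ms`) and the `harmonize` helper are hypothetical; only the CF standard names and units are taken from the CF conventions.

```python
# Hypothetical source columns -> (CF standard_name, CF units, converter).
CF_MAP = {
    "temp_c": ("air_temperature", "K", lambda v: v + 273.15),
    "precip_mm": ("precipitation_amount", "kg m-2", lambda v: v),  # 1 mm = 1 kg m-2
    "rh_percent": ("relative_humidity", "1", lambda v: v / 100.0),
    "wind_ms": ("wind_speed", "m s-1", lambda v: v),
}


def harmonize(record):
    """Rename known variables to CF standard names and convert units."""
    values, attrs = {}, {}
    for name, value in record.items():
        if name not in CF_MAP:
            continue  # unknown variables are dropped in this sketch
        standard_name, units, convert = CF_MAP[name]
        values[standard_name] = convert(value)
        attrs[standard_name] = {"standard_name": standard_name, "units": units}
    return values, attrs


values, attrs = harmonize({"temp_c": 20.0, "rh_percent": 50.0, "station": "X"})
```

Keeping the attributes alongside the values means they can later be attached to each variable when the data are written out.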

Phase 3 — TAHMO + ERA5 Integration

  • Align station-based and gridded data through nearest-grid-cell matching or interpolation.
  • Handle irregular sampling and missing timestamps.
  • Store as a unified xarray.Dataset with metadata and attributes.
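
The interpolation option for aligning gridded fields with stations can be sketched in plain NumPy as bilinear interpolation. This is an assumption-laden illustration, not the package's implementation: `bilinear_at_points` is a hypothetical helper, it expects a single 2-D (lat, lon) field with strictly ascending coordinates (ERA5 latitudes are typically stored descending and would need flipping first), and the coordinates in the example are arbitrary numbers rather than real station locations.

```python
import numpy as np


def bilinear_at_points(grid, lats, lons, pt_lats, pt_lons):
    """Bilinearly interpolate a (lat, lon) grid at station coordinates.

    Assumes lats and lons are 1-D and strictly ascending.
    """
    grid = np.asarray(grid, dtype=float)
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    pt_lats = np.asarray(pt_lats, dtype=float)
    pt_lons = np.asarray(pt_lons, dtype=float)
    # Index of the south-west node of the cell containing each point.
    i = np.clip(np.searchsorted(lats, pt_lats) - 1, 0, lats.size - 2)
    j = np.clip(np.searchsorted(lons, pt_lons) - 1, 0, lons.size - 2)
    # Fractional position inside the cell, clipped so points outside the
    # grid take the nearest edge value instead of being extrapolated.
    ty = np.clip((pt_lats - lats[i]) / (lats[i + 1] - lats[i]), 0.0, 1.0)
    tx = np.clip((pt_lons - lons[j]) / (lons[j + 1] - lons[j]), 0.0, 1.0)
    south = grid[i, j] * (1 - tx) + grid[i, j + 1] * tx
    north = grid[i + 1, j] * (1 - tx) + grid[i + 1, j + 1] * tx
    return south * (1 - ty) + north * ty


lats = np.array([-1.0, 0.0, 1.0])
lons = np.array([36.0, 37.0, 38.0])
grid = 10.0 * lats[:, None] + lons[None, :]  # a field that is linear in lat/lon
values = bilinear_at_points(grid, lats, lons, [-0.5, 0.0], [36.25, 37.0])
print(values)  # [31.25 37.  ]
```

Because the example field is linear in latitude and longitude, bilinear interpolation reproduces it exactly, which makes the result easy to check by hand.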

Phase 4 — Multi-source Expansion

  • Add CHIRPS, IMERG and TAMSAT for multi-sensor rainfall comparison.
  • Incorporate temperature, humidity, radiation, and wind from ERA5.
  • Evaluate inter-product correlations, bias, and consistency.
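
A minimal sketch of the inter-product evaluation, assuming two rainfall series have already been aligned to the same time steps and that missing values are stored as NaN. `compare_products` is a hypothetical helper, and the numbers in the example are made up for illustration.

```python
import numpy as np


def compare_products(reference, candidate):
    """Return bias, RMSE, and Pearson r over steps valid in both series."""
    ref = np.asarray(reference, dtype=float)
    cand = np.asarray(candidate, dtype=float)
    valid = ~(np.isnan(ref) | np.isnan(cand))
    ref, cand = ref[valid], cand[valid]
    if ref.size < 2:
        raise ValueError("need at least two overlapping valid samples")
    diff = cand - ref
    return {
        "bias": float(diff.mean()),
        "rmse": float(np.sqrt((diff ** 2).mean())),
        "pearson_r": float(np.corrcoef(ref, cand)[0, 1]),
    }


gauge = [0.0, 1.0, 2.0, np.nan, 4.0]
satellite = [1.0, 2.0, 3.0, 5.0, np.nan]
scores = compare_products(gauge, satellite)
```

Only the first three time steps are valid in both series here, so the scores describe a constant 1 mm overestimate with perfect correlation.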

Phase 5 — Integrate Static Priors

  • Merge Earth Engine static features (elevation, slope, aspect, land cover, distance to water).
  • Harmonize to match ERA5 and CHIRPS grids.
  • Enable topography-aware model development.
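
One simple way to bring a fine static layer onto a coarser grid is block averaging. The sketch below assumes the fine grid nests exactly inside the coarse one by an integer factor and that masked cells are stored as NaN; `block_mean` is a hypothetical helper. Averaging suits continuous fields such as elevation, while categorical layers (land cover) or circular ones (aspect) would need a different aggregation.

```python
import numpy as np


def block_mean(fine, factor):
    """Coarsen a 2-D grid by averaging non-overlapping factor x factor blocks."""
    fine = np.asarray(fine, dtype=float)
    rows, cols = fine.shape
    if rows % factor or cols % factor:
        raise ValueError("grid shape must be divisible by factor")
    blocks = fine.reshape(rows // factor, factor, cols // factor, factor)
    # nanmean skips masked cells stored as NaN within each block.
    return np.nanmean(blocks, axis=(1, 3))


elevation = np.arange(16.0).reshape(4, 4)
coarse = block_mean(elevation, factor=2)
print(coarse)  # [[ 2.5  4.5] [10.5 12.5]]
```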

Phase 6 — ML-Ready Export

  • Export standardized, chunked datasets to Zarr and NetCDF.
  • Develop lightweight data loaders for PyTorch, TensorFlow, and JAX.
  • Preserve metadata and normalization info for each variable.
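
Preserving normalization info could look like the following sketch, which computes per-variable statistics, serializes them to JSON, and uses the restored values to normalize and denormalize. The function names and the JSON layout are assumptions, not the package's export format, and a variable with zero variance would need special handling.

```python
import json

import numpy as np


def fit_stats(data):
    """Compute per-variable mean and standard deviation, ignoring NaNs."""
    return {
        name: {"mean": float(np.nanmean(v)), "std": float(np.nanstd(v))}
        for name, v in data.items()
    }


def normalize(values, stats):
    """Standardize values with stored statistics."""
    return (np.asarray(values, dtype=float) - stats["mean"]) / stats["std"]


def denormalize(values, stats):
    """Invert normalize() using the same statistics."""
    return np.asarray(values, dtype=float) * stats["std"] + stats["mean"]


data = {"air_temperature": np.array([290.0, 292.0, 294.0, np.nan])}
stats = fit_stats(data)
payload = json.dumps(stats)      # could be stored next to the exported data
restored = json.loads(payload)
z = normalize(data["air_temperature"], restored["air_temperature"])
back = denormalize(z, restored["air_temperature"])
```

Storing the statistics with the dataset lets model outputs be mapped back to physical units without recomputing them from the training split.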

Phase 7 — Benchmark & Evaluation

  • Implement baseline models using PeakWeather-style workflows.
  • Compare model performance across variables and regions.
  • Publish visual and quantitative evaluations.
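
As one possible reference point for the benchmark, a persistence forecast repeats the last observed value over the whole horizon. The sketch below is not taken from PeakWeather; the helper names are hypothetical, and it assumes windows shaped (samples, time, ...) with missing targets stored as NaN.

```python
import numpy as np


def persistence_forecast(x, horizon_size):
    """Repeat the last observed step of each window over the horizon."""
    last = np.asarray(x, dtype=float)[:, -1:]       # (n, 1, ...)
    return np.repeat(last, horizon_size, axis=1)    # (n, horizon_size, ...)


def mae(y_true, y_pred):
    """Mean absolute error over entries where both values are valid."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(np.nanmean(err))


x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # 2 windows, 3 steps each
y = np.array([[3.0, 4.0], [6.0, 8.0]])            # 2-step targets
pred = persistence_forecast(x, horizon_size=2)
print(mae(y, pred))  # 0.75
```

Any learned model can then be reported relative to this score for each variable and region.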

Guiding Principles

  • Reproducibility — Version-controlled, scriptable data processing.
  • Transparency — Clear documentation for every transformation step.
  • Scalability — Built for cloud-scale workflows (DVC, Prefect, Zarr).
  • Inclusivity — Designed around African data sources and use cases.
  • Framework-agnosticism — ML-ready for PyTorch, TensorFlow, and beyond.

Contributing

We welcome active experimentation and stress-testing from the DSAIL team! Whether you are testing a new spatial masking technique, adding a new satellite data source, or optimizing the data loaders, we want your contributions.

To ensure the core engine remains stable while we experiment, please review our Contribution Guidelines before pushing code. All new features and experiments should be developed on a separate branch and submitted via a Pull Request (PR) for peer review.

Credits

Developed as part of an effort to advance localized, data-driven weather prediction for East Africa,
inspired by PeakWeather and WeatherBench2.

MasharikiWeather is a step toward open, harmonized, and equitable climate AI infrastructure for East Africa.
