Skip to main content

Library for getting dataset from noaa site

Project description

noawclg · GFS Dataset Manager

Download, cache and analyse NOAA GFS forecast data in one line of Python.

PyPI Python License: MIT

noawclg wraps the NOAA NOMADS grib-filter endpoint and exposes a clean Python API that lets you:

  • Download GFS 0.25° GRIB2 files with a single method call — one HTTP request per forecast hour regardless of how many variables you need.
  • Cache raw GRIB2 files to disk so repeated runs cost nothing.
  • Extract any combination of surface and upper-air variables into analysis-ready xarray.Dataset objects.
  • Save output as compressed NetCDF4 or chunked Zarr for downstream processing.

Table of Contents

  1. Installation
  2. Quick Start
  3. How It Works
  4. API Reference
  5. Variable Catalogue
  6. Pre-defined Hour Sequences
  7. Region Subsetting
  8. Logging
  9. Examples
  10. Contributing
  11. License

Installation

pip install noawclg

System dependency — eccodes

cfgrib requires the eccodes C library to decode GRIB2 files.

Platform Command
Ubuntu / Debian sudo apt install libeccodes-dev
macOS (Homebrew) brew install eccodes
Conda (any OS) conda install -c conda-forge eccodes

Quick Start

from noawclg import GFSDatasetManager

# Create a manager for the 06 Z run of 2026-04-03
mgr = GFSDatasetManager(date="20260403", cycle="06")

# Download t2m + precipitation for the next 48 h (6-hourly)
# → only 9 HTTP requests (one per hour), not 18
ds = mgr.build_multi_dataset(
    var_keys=["t2m", "prate"],
    hours=list(range(0, 49, 6)),
)

print(ds)
# <xarray.Dataset>
# Dimensions:  (time: 9, latitude: 721, longitude: 1440)
# Data variables:
#     t2m      (time, latitude, longitude) float64 ...
#     prate    (time, latitude, longitude) float64 ...

mgr.save_netcdf(ds, "/tmp/gfs_48h.nc")

How It Works

Single-download architecture

Previous approaches sent one HTTP request per variable per forecast hour.
For 5 variables × 48 hours that means 240 requests.

noawclg exploits the NOMADS grib-filter's multi-variable syntax to bundle every requested variable into a single URL per hour:

https://nomads.ncep.noaa.gov/cgi-bin/filter_gfs_0p25_1hr.pl
  ?dir=/gfs.20260403/06/atmos
  &file=gfs.t06z.pgrb2.0p25.f024
  &var_TMP=on&lev_2_m_above_ground=on       ← t2m
  &var_PRATE=on&lev_surface=on              ← prate
  &var_PRMSL=on&lev_mean_sea_level=on       ← prmsl
  &subregion=&toplat=5&bottomlat=-35&...    ← optional region

Result: 5 variables × 48 hours = 9 requests (one per hour).

Disk cache

Every downloaded GRIB2 file is saved under output_dir with a deterministic filename that encodes the date, cycle, variable set, region tag and forecast hour:

gfs_20260403_06z_prate_t2m_5N35S75W34E_f024.grib2

On subsequent runs the file is reused without any network I/O.

cfgrib extraction

After downloading, each variable is extracted from the cached GRIB2 using cfgrib with a cascade of filter strategies (shortName → typeOfLevel → full scan) to handle the GRIB table inconsistencies that appear across GFS versions and sub-region files.


API Reference

GFSDatasetManager

GFSDatasetManager(
    date: str,
    cycle: str = "00",
    output_dir: str = "./gfs_output",
    region: dict | None = None,
    pause: float = 1.5,
)

Main entry point. All other methods are called on an instance of this class.

Parameter Type Description
date str Model run date in YYYYMMDD format. Required.
cycle str Model run cycle: "00", "06", "12" or "18". Default "00".
output_dir str Directory where GRIB2 files are cached. Created automatically. Default "./gfs_output".
region dict | None Bounding box for spatial subsetting (see Region Subsetting). None downloads the global grid.
pause float Seconds to sleep between consecutive HTTP requests. Helps avoid rate-limiting on NOMADS. Default 1.5.

Raises: ValueError if cycle is not one of the four valid values.
Raises: ValueError if date does not match YYYYMMDD.

from noawclg import GFSDatasetManager

mgr = GFSDatasetManager(
    date="20260403",
    cycle="06",
    output_dir="./cache",
    region={"toplat": 5, "bottomlat": -35, "leftlon": -75, "rightlon": -34},
    pause=2.0,
)

get_noaa_data

get_noaa_data(
    date: str | None = None,
    cycle: str = "00",
    keys: list[str] = ["t2m"],
    hours: list[int] | None = None,
    *,
    lat_dim: str | None = None,
    lon_dim: str | None = None,
    time_dim: str | None = None,
)

High-level convenience wrapper for geocoded and point-based queries over GFS data.

This class loads one or more variables and lets you query the nearest grid point by:

  • Direct coordinates (get_data_from_point)
  • Place name via geocoding (get_data_from_place)
  • Full time-series extraction (get_time_series)
Parameter Type Description
date str | None Date in DD/MM/YYYY format. If omitted, current date is used.
cycle str Model cycle: "00", "06", "12", "18". Default "00".
keys list[str] Variable keys from Variable Catalogue. Default ['t2m'].
hours list[int] | None Forecast hours to load. Default is 0..384 every 3 h.
lat_dim str | None Optional latitude coordinate name override.
lon_dim str | None Optional longitude coordinate name override.
time_dim str | None Optional time coordinate name override.

Note: get_noaa_data is defined in noawclg.main.

from noawclg.main import get_noaa_data

noaa = get_noaa_data(
    date="03/04/2026",
    cycle="06",
    keys=["t2m", "prate"],
    hours=list(range(0, 49, 6)),
)

# by coordinates (lat, lon)
point_data = noaa.get_data_from_point((-3.73, -38.52))
print(point_data["t2m"])

# by place name
city_data = noaa.get_data_from_place("Fortaleza, Brazil")
print(city_data.to_dataframe().head())

load

load(
    date: str | None = None,
    cycle: str = "00",
    keys: list[str] = ["t2m"],
    hours: list[int] | None = None,
    *,
    lat_dim: str | None = None,
    lon_dim: str | None = None,
    time_dim: str | None = None,
) -> xr.Dataset

Functional convenience wrapper that returns the underlying xarray.Dataset directly.

Internally, load(...) calls get_noaa_data(...) with the same arguments and returns ._ds.

Parameter Type Description
date str | None Date in DD/MM/YYYY format. If omitted, current date is used.
cycle str Model cycle: "00", "06", "12", "18". Default "00".
keys list[str] Variable keys from Variable Catalogue. Default ['t2m'].
hours list[int] | None Forecast hours to load. Default is 0..384 every 3 h.
lat_dim str | None Optional latitude coordinate name override.
lon_dim str | None Optional longitude coordinate name override.
time_dim str | None Optional time coordinate name override.
from noawclg import load

ds = load(
    date="03/04/2026",
    cycle="06",
    keys=["t2m", "prate"],
    hours=list(range(0, 25, 6)),
)

print(ds.data_vars)

get_data_from_point

noaa.get_data_from_point(
    point: tuple[float, float],
    *,
    time: str | slice | list | None = None,
    tolerance: float | None = None,
) -> _DatasetView

Returns data from the nearest grid point to (lat, lon). Longitude is automatically normalized to the dataset convention.

get_data_from_place

noaa.get_data_from_place(
    place: str,
    *,
    time: str | slice | list | None = None,
    tolerance: float | None = None,
) -> _DatasetView

Geocodes a place name and forwards to get_data_from_point.

get_time_series

noaa.get_time_series(
    point: tuple[float, float],
    variable: str | None = None,
) -> xr.Dataset | xr.DataArray

Returns full time-series at the nearest grid point. If variable is provided, returns only that variable.

get_keys

noaa.get_keys() -> dict[str, str]

Returns {variable: long_name} for every variable in the loaded dataset.


build_dataset

mgr.build_dataset(
    var_key: str,
    hours: list[int],
    force_download: bool = False,
) -> xr.Dataset

Download and assemble a Dataset for a single variable.

Parameter Type Description
var_key str Variable key from the Variable Catalogue.
hours list[int] Forecast hours to include (e.g. [0, 6, 12, 24]).
force_download bool If True, re-download even if cached files exist. Default False.

Returns: xr.Dataset with dimensions:

  • Surface/single-level variables → (time, latitude, longitude)
  • Multi-level variables → (time, level, latitude, longitude)

Both datasets include a forecast_hour coordinate aligned to the time dimension.

Raises: RuntimeError if no files could be downloaded or read.

ds = mgr.build_dataset("t2m", hours=[0, 6, 12, 24, 48])
print(ds["t2m"].dims)   # ('time', 'latitude', 'longitude')
print(ds["t2m"].attrs)  # {'long_name': '2 metre temperature', 'units': 'C', ...}

build_multi_dataset

mgr.build_multi_dataset(
    var_keys: list[str],
    hours: list[int],
    force_download: bool = False,
) -> xr.Dataset

Download one file per hour containing all requested variables, then extract and merge them into a single Dataset.

This is the recommended method when you need more than one variable — it uses N_hours requests instead of N_vars × N_hours.

Parameter Type Description
var_keys list[str] List of variable keys from the Variable Catalogue.
hours list[int] Forecast hours to include.
force_download bool Re-download even if cached. Default False.

Returns: xr.Dataset with all requested variables merged via xr.merge(..., join="inner").

Variables that fail to extract are logged and skipped; a RuntimeError is raised only if all variables fail.

ds = mgr.build_multi_dataset(
    var_keys=["t2m", "prmsl", "prate", "u10", "v10"],
    hours=list(range(0, 25, 6)),
)
# ds contains t2m, prmsl, prate, u10, v10 all on the same time axis

download_hours

mgr.download_hours(
    var_keys: list[str],
    hours: list[int],
    force: bool = False,
) -> dict[int, Path]

Low-level method that performs the actual HTTP downloads.
Called internally by build_dataset and build_multi_dataset, but exposed for advanced use cases (e.g. downloading files without immediately building a Dataset).

Parameter Type Description
var_keys list[str] Variables to bundle into each download URL.
hours list[int] Forecast hours to download.
force bool Re-download cached files. Default False.

Returns: dict[int, Path] — mapping of {hour: path_to_grib2_file} for every successfully downloaded hour.

Files already on disk are returned immediately without any network I/O (cache hit is logged at INFO level).

files = mgr.download_hours(["t2m", "prate"], hours=[0, 6, 12])
# {0: PosixPath('.../gfs_..._f000.grib2'),
#  6: PosixPath('.../gfs_..._f006.grib2'),
#  12: PosixPath('.../gfs_..._f012.grib2')}

save_netcdf

mgr.save_netcdf(
    ds: xr.Dataset,
    filename: str,
    complevel: int = 4,
) -> Path

Save a Dataset to a zlib-compressed NetCDF4 file.

Parameter Type Description
ds xr.Dataset Dataset to save.
filename str Output file path. Absolute paths are used as-is; relative paths are resolved against output_dir.
complevel int zlib compression level 1–9 (higher = smaller file, slower write). Default 4.

Returns: Path — absolute path of the saved file.

path = mgr.save_netcdf(ds, "/data/gfs_t2m_48h.nc")
# or relative (saved inside output_dir):
path = mgr.save_netcdf(ds, "gfs_t2m_48h.nc")

save_zarr

mgr.save_zarr(
    ds: xr.Dataset,
    store: str,
) -> Path

Save a Dataset as a chunked Zarr store (directory).

Zarr is preferred over NetCDF for large time-series because it supports:

  • Lazy chunked reads without loading the whole file into memory.
  • Appending new timesteps without rewriting existing data.
Parameter Type Description
ds xr.Dataset Dataset to save.
store str Output directory path. Relative paths are resolved against output_dir.

Returns: Path — absolute path of the Zarr store directory.

path = mgr.save_zarr(ds, "gfs_surface_16days.zarr")

load_netcdf

GFSDatasetManager.load_netcdf(path: str | Path) -> xr.Dataset

Static method. Lazily open a previously saved NetCDF file using Dask-backed chunking.

ds = GFSDatasetManager.load_netcdf("/data/gfs_t2m_48h.nc")
print(dict(ds.dims))  # {'time': 9, 'latitude': 721, 'longitude': 1440}

load_zarr

GFSDatasetManager.load_zarr(store: str | Path) -> xr.Dataset

Static method. Lazily open a previously saved Zarr store.

ds = GFSDatasetManager.load_zarr("gfs_surface_16days.zarr")

Variable Catalogue

Access the full catalogue at runtime:

from noawclg import VARIABLES, SURFACE_VARS, MULTILEVEL_VARS

print(SURFACE_VARS)    # all 2-D (no level dimension) variable keys
print(MULTILEVEL_VARS) # all variables with a vertical level dimension

Surface / single-level variables

Key Long name Units
t2m 2 metre temperature °C
d2m 2 metre dewpoint temperature °C
r2 2 metre relative humidity %
sh2 2 metre specific humidity kg kg⁻¹
aptmp Apparent temperature °C
u10 10 metre U wind component m s⁻¹
v10 10 metre V wind component m s⁻¹
gust Wind speed (gust) m s⁻¹
prmsl Pressure reduced to MSL hPa
mslet MSLP (Eta model reduction) hPa
sp Surface pressure hPa
orog Orography m
lsm Land-sea mask 0–1
vis Visibility m
prate Precipitation rate kg m⁻² s⁻¹
cpofp Percent frozen precipitation %
crain Categorical rain
csnow Categorical snow
cfrzr Categorical freezing rain
cicep Categorical ice pellets
sde Snow depth m
sdwe Water equivalent of snow depth kg m⁻²
pwat Precipitable water kg m⁻²
cwat Cloud water kg m⁻²
tcc Total cloud cover %
lcc Low cloud cover %
mcc Medium cloud cover %
hcc High cloud cover %
lftx Surface lifted index K
lftx4 Best (4-layer) lifted index K
hlcy Storm relative helicity m² s⁻²
refc Composite radar reflectivity dB
siconc Sea ice area fraction 0–1
veg Vegetation %
tozne Total ozone DU

Multi-level variables

These variables include a level dimension in the output Dataset.

Key Long name Units Levels
t Temperature °C 80–1000 hPa (13 levels)
r Relative humidity % 80–1000 hPa (13 levels)
q Specific humidity kg kg⁻¹ 80, 1000 hPa
gh Geopotential height gpm 500–1000 hPa (5 levels)
u U component of wind m s⁻¹ 200–1000 hPa (9 levels)
v V component of wind m s⁻¹ 200–1000 hPa (9 levels)
w Vertical velocity Pa s⁻¹ 100–850 hPa (8 levels)
absv Absolute vorticity s⁻¹ 100–1000 hPa (8 levels)
cape CAPE J kg⁻¹ surface layers
cin Convective inhibition J kg⁻¹ surface layers
st Soil temperature °C 0–100 cm (4 layers)
soilw Volumetric soil moisture Proportion 0–100 cm (4 layers)

Pre-defined Hour Sequences

from noawclg import (
    HOURS_16DAYS,     # 0–120 h (6-hourly) + 123–384 h (3-hourly) — full 16-day run
    HOURS_5DAYS_1H,   # 0–120 h (1-hourly)
    HOURS_10DAYS_3H,  # 0–240 h (3-hourly)
    HOURS_16DAYS_3H,  # 0–120 h (3-hourly) + 123–384 h (3-hourly)
)

Use them directly with build_dataset or build_multi_dataset:

ds = mgr.build_dataset("t2m", hours=HOURS_16DAYS)

Region Subsetting

Pass a region dict to download only the data inside a bounding box.
This dramatically reduces file size and download time for regional studies.

# South America
REGION_SA = {
    "toplat":    12,
    "bottomlat": -56,
    "leftlon":   -82,
    "rightlon":  -34,
}

# Brazil
REGION_BR = {
    "toplat":    5,
    "bottomlat": -35,
    "leftlon":   -75,
    "rightlon":  -34,
}

mgr = GFSDatasetManager(
    date="20260403",
    cycle="06",
    region=REGION_BR,
)

Pass region=None (the default) for a global download.

Note: The region tag is embedded in the cache filename, so global and regional downloads never collide even when sharing the same output_dir.


Logging

noawclg uses Python's standard logging module under the logger name gfs_dataset.
Enable it in your application to see download progress, cache hits and extraction warnings:

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s  %(levelname)-8s  %(message)s",
    datefmt="%H:%M:%S",
)

Sample output:

10:02:15  INFO      Download: 9 hour(s) × 1 file each = 9 request(s)  (vars: ['t2m', 'prate'])
10:02:17  INFO      [multi] → f000  https://nomads.ncep.noaa.gov/cgi-bin/filter_gfs_0p25_1hr.pl?...
10:02:19  INFO        [ok] f000  284 KB  |  11.1%  (1/9)  elapsed=2.1s  remaining≈15.2s
10:02:21  INFO      [cache] f006  gfs_20260403_06z_prate_t2m_global_f006.grib2
10:02:21  INFO      Extracting 't2m' …
10:02:21  INFO      Extracting 'prate' …

Examples

1 — Surface forecast for Brazil, 48 h

from noawclg import GFSDatasetManager

mgr = GFSDatasetManager(
    date="20260403",
    cycle="06",
    region={"toplat": 5, "bottomlat": -35, "leftlon": -75, "rightlon": -34},
)

ds = mgr.build_multi_dataset(
    var_keys=["t2m", "prate", "prmsl", "u10", "v10"],
    hours=list(range(0, 49, 6)),
)
mgr.save_netcdf(ds, "gfs_brazil_48h.nc")

2 — Upper-air wind profile, global, 24 h

ds = mgr.build_multi_dataset(
    var_keys=["u", "v", "gh"],   # multi-level isobaric
    hours=list(range(0, 25, 6)),
)
# ds["u"] has dims (time, level, latitude, longitude)
u_500 = ds["u"].sel(level=500)   # wind at 500 hPa

3 — 16-day t2m time-series, saved as Zarr

from noawclg import GFSDatasetManager, HOURS_16DAYS

mgr = GFSDatasetManager(date="20260403", cycle="00")
ds  = mgr.build_dataset("t2m", hours=HOURS_16DAYS)
mgr.save_zarr(ds, "gfs_t2m_16days.zarr")

4 — Reload and compute a daily mean

import xarray as xr
from noawclg import GFSDatasetManager

ds   = GFSDatasetManager.load_netcdf("gfs_brazil_48h.nc")
t2m  = ds["t2m"]
daily_mean = t2m.resample(time="1D").mean()
print(daily_mean)

5 — Download only (no Dataset construction)

files = mgr.download_hours(
    var_keys=["t2m", "prate"],
    hours=[0, 6, 12, 24],
)
# {0: PosixPath('./gfs_output/gfs_20260403_06z_prate_t2m_global_f000.grib2'), ...}

Contributing

Pull requests are welcome. For major changes please open an issue first to discuss what you would like to change.

git clone https://github.com/reinanbr/noawclg
cd noawclg
pip install -e ".[dev]"

License

MIT © Reinan BR

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

noawclg-2.2.7.tar.gz (57.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

noawclg-2.2.7-py3-none-any.whl (43.6 kB view details)

Uploaded Python 3

File details

Details for the file noawclg-2.2.7.tar.gz.

File metadata

  • Download URL: noawclg-2.2.7.tar.gz
  • Upload date:
  • Size: 57.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for noawclg-2.2.7.tar.gz
Algorithm Hash digest
SHA256 9cc0509e494ecbbcbe839634b87a9a938bc0f0bc23768f29481bed3adb82f713
MD5 84fcb96cf0b533d36c1c0d440725fefd
BLAKE2b-256 95489afedc3dc1f8efb5c1f8370401153aef0a2cc694f77040b286544acad559

See more details on using hashes here.

Provenance

The following attestation bundles were made for noawclg-2.2.7.tar.gz:

Publisher: ci.yml on reinanbr/noawclg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file noawclg-2.2.7-py3-none-any.whl.

File metadata

  • Download URL: noawclg-2.2.7-py3-none-any.whl
  • Upload date:
  • Size: 43.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for noawclg-2.2.7-py3-none-any.whl
Algorithm Hash digest
SHA256 1b470673fb745b9f52260281c24c371bb1a5fc30b17fd48f28237c1110edb887
MD5 e547567fdad49b7a8bc6b1e89ce0ac39
BLAKE2b-256 71be96cdbfc42563351c673175b91a6ffd77faa34b7466fab51098ae7bc5d72c

See more details on using hashes here.

Provenance

The following attestation bundles were made for noawclg-2.2.7-py3-none-any.whl:

Publisher: ci.yml on reinanbr/noawclg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page