Library for downloading datasets from the NOAA site
noawclg · GFS Dataset Manager
Download, cache and analyse NOAA GFS forecast data in one line of Python.
noawclg wraps the NOAA NOMADS grib-filter endpoint and exposes a clean Python API that lets you:
- Download GFS 0.25° GRIB2 files with a single method call — one HTTP request per forecast hour regardless of how many variables you need.
- Cache raw GRIB2 files to disk so repeated runs cost nothing.
- Extract any combination of surface and upper-air variables into analysis-ready xarray.Dataset objects.
- Save output as compressed NetCDF4 or chunked Zarr for downstream processing.
Table of Contents
- Installation
- Quick Start
- How It Works
- API Reference
- Variable Catalogue
- Pre-defined Hour Sequences
- Region Subsetting
- Logging
- Examples
- Contributing
- License
Installation
pip install noawclg
System dependency — eccodes
cfgrib requires the eccodes C library to decode GRIB2 files.
| Platform | Command |
|---|---|
| Ubuntu / Debian | sudo apt install libeccodes-dev |
| macOS (Homebrew) | brew install eccodes |
| Conda (any OS) | conda install -c conda-forge eccodes |
Quick Start
from noawclg import GFSDatasetManager
# Create a manager for the 06 Z run of 2026-04-03
mgr = GFSDatasetManager(date="20260403", cycle="06")
# Download t2m + precipitation for the next 48 h (6-hourly)
# → only 9 HTTP requests (one per hour), not 18
ds = mgr.build_multi_dataset(
var_keys=["t2m", "prate"],
hours=list(range(0, 49, 6)),
)
print(ds)
# <xarray.Dataset>
# Dimensions: (time: 9, latitude: 721, longitude: 1440)
# Data variables:
# t2m (time, latitude, longitude) float64 ...
# prate (time, latitude, longitude) float64 ...
mgr.save_netcdf(ds, "/tmp/gfs_48h.nc")
How It Works
Single-download architecture
Previous approaches sent one HTTP request per variable per forecast hour.
For 5 variables × 48 hours that means 240 requests.
noawclg exploits the NOMADS grib-filter's multi-variable syntax to bundle every requested variable into a single URL per hour:
https://nomads.ncep.noaa.gov/cgi-bin/filter_gfs_0p25_1hr.pl
?dir=/gfs.20260403/06/atmos
&file=gfs.t06z.pgrb2.0p25.f024
&var_TMP=on&lev_2_m_above_ground=on ← t2m
&var_PRATE=on&lev_surface=on ← prate
&var_PRMSL=on&lev_mean_sea_level=on ← prmsl
&subregion=&toplat=5&bottomlat=-35&... ← optional region
Result: 5 variables × 48 forecast hours = 48 requests (one per hour) instead of 240.
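The bundling described above can be sketched as follows. This is a minimal illustration, not the library's code: `build_filter_url` and the `FILTERS` mapping are hypothetical names, and only the three variable/level pairs shown in the URL example are included. The region parameter names are taken from the region dict used elsewhere in this README.

```python
from urllib.parse import urlencode

BASE_URL = "https://nomads.ncep.noaa.gov/cgi-bin/filter_gfs_0p25_1hr.pl"

# (GRIB variable, level) pairs copied from the URL example above.
FILTERS = {
    "t2m": ("TMP", "2_m_above_ground"),
    "prate": ("PRATE", "surface"),
    "prmsl": ("PRMSL", "mean_sea_level"),
}


def build_filter_url(date, cycle, hour, var_keys, region=None):
    """Bundle every requested variable into one URL (hypothetical helper)."""
    params = [
        ("dir", f"/gfs.{date}/{cycle}/atmos"),
        ("file", f"gfs.t{cycle}z.pgrb2.0p25.f{hour:03d}"),
    ]
    for key in var_keys:
        variable, level = FILTERS[key]
        params.append((f"var_{variable}", "on"))
        params.append((f"lev_{level}", "on"))
    if region is not None:
        params.append(("subregion", ""))
        for name in ("toplat", "bottomlat", "leftlon", "rightlon"):
            params.append((name, region[name]))
    return f"{BASE_URL}?{urlencode(params, safe='/')}"


# One URL per forecast hour, however many variables are requested.
hours = list(range(0, 49, 6))
urls = [build_filter_url("20260403", "06", h, ["t2m", "prate", "prmsl"]) for h in hours]
print(len(urls))  # 9
print(urls[4])
```

Adding a fourth or fifth variable only lengthens each URL; the number of requests stays equal to `len(hours)`.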
Disk cache
Every downloaded GRIB2 file is saved under output_dir with a deterministic filename that encodes the date, cycle, variable set, region tag and forecast hour:
gfs_20260403_06z_prate_t2m_5N35S75W34E_f024.grib2
On subsequent runs the file is reused without any network I/O.
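A naming scheme like this one could be sketched as below. The helper `cache_filename` is hypothetical, and two details are assumptions inferred from the example filename: variable keys are joined in alphabetical order, and the region tag arrives as a ready-made string (with `global` when no region is set, as seen in the log sample in the Logging section).

```python
def cache_filename(date, cycle, var_keys, hour, region_tag="global"):
    """Build a deterministic cache name (hypothetical helper, format inferred)."""
    # Sorting means ["t2m", "prate"] and ["prate", "t2m"] map to the same file.
    variables = "_".join(sorted(var_keys))
    return f"gfs_{date}_{cycle}z_{variables}_{region_tag}_f{hour:03d}.grib2"


name = cache_filename("20260403", "06", ["t2m", "prate"], 24, region_tag="5N35S75W34E")
print(name)  # gfs_20260403_06z_prate_t2m_5N35S75W34E_f024.grib2
```

Because every input that changes the file contents also changes the name, a simple "does this file exist?" check is enough to decide whether a download is needed.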
cfgrib extraction
After downloading, each variable is extracted from the cached GRIB2 using cfgrib with a cascade of filter strategies (shortName → typeOfLevel → full scan) to handle the GRIB table inconsistencies that appear across GFS versions and sub-region files.
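The cascade idea can be illustrated with a small, library-independent sketch. `open_with_cascade` and `T2M_FILTERS` are hypothetical names, and the filter values are only plausible examples for 2 m temperature, not the ones noawclg uses. The opener is passed in as a function so the control flow can be read (and run) without eccodes installed.

```python
def open_with_cascade(path, filters, opener):
    """Return the first non-empty result of opener(path, keys), in order."""
    failures = []
    for keys in filters:
        try:
            dataset = opener(path, keys)
        except Exception as exc:  # a failed strategy moves on to the next one
            failures.append(f"{keys}: {exc}")
            continue
        if len(dataset) > 0:  # at least one variable was decoded
            return dataset
        failures.append(f"{keys}: no variables found")
    raise RuntimeError("all filter strategies failed: " + "; ".join(failures))


# Narrowest filter first, full scan (no filter) last.
T2M_FILTERS = [
    {"shortName": "2t"},
    {"typeOfLevel": "heightAboveGround", "level": 2},
    {},
]

# With cfgrib installed, an opener could be written as (not executed here):
#   import xarray as xr
#   def cfgrib_opener(path, keys):
#       return xr.open_dataset(path, engine="cfgrib",
#                              backend_kwargs={"filter_by_keys": keys})
```

Calling `open_with_cascade(path, T2M_FILTERS, cfgrib_opener)` would then try each strategy until one yields data.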
API Reference
GFSDatasetManager
GFSDatasetManager(
date: str,
cycle: str = "00",
output_dir: str = "./gfs_output",
region: dict | None = None,
pause: float = 1.5,
)
Main entry point. All other methods are called on an instance of this class.
| Parameter | Type | Description |
|---|---|---|
| date | str | Model run date in YYYYMMDD format. Required. |
| cycle | str | Model run cycle: "00", "06", "12" or "18". Default "00". |
| output_dir | str | Directory where GRIB2 files are cached. Created automatically. Default "./gfs_output". |
| region | dict \| None | Bounding box for spatial subsetting (see Region Subsetting). None downloads the global grid. |
| pause | float | Seconds to sleep between consecutive HTTP requests. Helps avoid rate-limiting on NOMADS. Default 1.5. |
Raises: ValueError if cycle is not one of the four valid values.
Raises: ValueError if date does not match YYYYMMDD.
from noawclg import GFSDatasetManager
mgr = GFSDatasetManager(
date="20260403",
cycle="06",
output_dir="./cache",
region={"toplat": 5, "bottomlat": -35, "leftlon": -75, "rightlon": -34},
pause=2.0,
)
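The two ValueError conditions listed above could be checked along these lines. This is a sketch under assumptions, not the constructor's actual code; `validate_run` is a hypothetical helper.

```python
import re
from datetime import datetime

VALID_CYCLES = ("00", "06", "12", "18")


def validate_run(date, cycle):
    """Raise ValueError for a malformed date or an unknown cycle (hypothetical)."""
    if cycle not in VALID_CYCLES:
        raise ValueError(f"cycle must be one of {VALID_CYCLES}, got {cycle!r}")
    # The regex rejects short strings that strptime alone would accept.
    if not re.fullmatch(r"\d{8}", date):
        raise ValueError(f"date must be YYYYMMDD, got {date!r}")
    datetime.strptime(date, "%Y%m%d")  # raises ValueError for e.g. month 13
    return date, cycle


print(validate_run("20260403", "06"))  # ('20260403', '06')
```

Validating before any request is sent keeps a typo such as `cycle="6"` from turning into a confusing HTTP 404.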
get_noaa_data
get_noaa_data(
date: str | None = None,
cycle: str = "00",
keys: list[str] = ["t2m"],
hours: list[int] | None = None,
*,
lat_dim: str | None = None,
lon_dim: str | None = None,
time_dim: str | None = None,
)
High-level convenience wrapper for geocoded and point-based queries over GFS data.
This class loads one or more variables and lets you query the nearest grid point by:
- Direct coordinates (get_data_from_point)
- Place name via geocoding (get_data_from_place)
- Full time-series extraction (get_time_series)
| Parameter | Type | Description |
|---|---|---|
| date | str \| None | Date in DD/MM/YYYY format. If omitted, current date is used. |
| cycle | str | Model cycle: "00", "06", "12", "18". Default "00". |
| keys | list[str] | Variable keys from Variable Catalogue. Default ['t2m']. |
| hours | list[int] \| None | Forecast hours to load. Default is 0..384 every 3 h. |
| lat_dim | str \| None | Optional latitude coordinate name override. |
| lon_dim | str \| None | Optional longitude coordinate name override. |
| time_dim | str \| None | Optional time coordinate name override. |
Note: get_noaa_data is defined in noawclg.main.
from noawclg.main import get_noaa_data
noaa = get_noaa_data(
date="03/04/2026",
cycle="06",
keys=["t2m", "prate"],
hours=list(range(0, 49, 6)),
)
# by coordinates (lat, lon)
point_data = noaa.get_data_from_point((-3.73, -38.52))
print(point_data["t2m"])
# by place name
city_data = noaa.get_data_from_place("Fortaleza, Brazil")
print(city_data.to_dataframe().head())
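Note that get_noaa_data takes its date as DD/MM/YYYY while GFSDatasetManager expects YYYYMMDD. If the same run date has to be passed to both, a small conversion helper such as the hypothetical `to_manager_date` below avoids mixing the two formats up.

```python
from datetime import datetime


def to_manager_date(date_ddmmyyyy):
    """Convert 'DD/MM/YYYY' to 'YYYYMMDD' (hypothetical helper)."""
    return datetime.strptime(date_ddmmyyyy, "%d/%m/%Y").strftime("%Y%m%d")


print(to_manager_date("03/04/2026"))  # 20260403
```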
get_data_from_point
noaa.get_data_from_point(
point: tuple[float, float],
*,
time: str | slice | list | None = None,
tolerance: float | None = None,
) -> _DatasetView
Returns data from the nearest grid point to (lat, lon). Longitude is automatically normalized to the dataset convention.
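To make "nearest grid point" and "longitude normalization" concrete, here is a pure-Python sketch. It assumes the dataset uses the global 0.25° grid with longitudes in 0..360; `nearest_grid_point` is a hypothetical helper, not part of noawclg. On an xarray Dataset the equivalent lookup is `ds.sel(latitude=lat, longitude=lon % 360, method="nearest")`.

```python
GRID_STEP = 0.25  # degrees between neighbouring grid nodes


def nearest_grid_point(lat, lon):
    """Return (lat, lon) of the nearest 0.25 degree node, longitude in 0..360."""
    lon_360 = lon % 360.0  # e.g. -38.52 becomes 321.48
    grid_lat = 90.0 - round((90.0 - lat) / GRID_STEP) * GRID_STEP
    grid_lon = (round(lon_360 / GRID_STEP) * GRID_STEP) % 360.0
    return grid_lat, grid_lon


print(nearest_grid_point(-3.73, -38.52))  # (-3.75, 321.5)
```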
get_data_from_place
noaa.get_data_from_place(
place: str,
*,
time: str | slice | list | None = None,
tolerance: float | None = None,
) -> _DatasetView
Geocodes a place name and forwards to get_data_from_point.
get_time_series
noaa.get_time_series(
point: tuple[float, float],
variable: str | None = None,
) -> xr.Dataset | xr.DataArray
Returns full time-series at the nearest grid point. If variable is provided, returns only that variable.
get_keys
noaa.get_keys() -> dict[str, str]
Returns {variable: long_name} for every variable in the loaded dataset.
build_dataset
mgr.build_dataset(
var_key: str,
hours: list[int],
force_download: bool = False,
) -> xr.Dataset
Download and assemble a Dataset for a single variable.
| Parameter | Type | Description |
|---|---|---|
| var_key | str | Variable key from the Variable Catalogue. |
| hours | list[int] | Forecast hours to include (e.g. [0, 6, 12, 24]). |
| force_download | bool | If True, re-download even if cached files exist. Default False. |
Returns: xr.Dataset with dimensions:
- Surface/single-level variables → (time, latitude, longitude)
- Multi-level variables → (time, level, latitude, longitude)
Both datasets include a forecast_hour coordinate aligned to the time dimension.
Raises: RuntimeError if no files could be downloaded or read.
ds = mgr.build_dataset("t2m", hours=[0, 6, 12, 24, 48])
print(ds["t2m"].dims) # ('time', 'latitude', 'longitude')
print(ds["t2m"].attrs) # {'long_name': '2 metre temperature', 'units': 'C', ...}
build_multi_dataset
mgr.build_multi_dataset(
var_keys: list[str],
hours: list[int],
force_download: bool = False,
) -> xr.Dataset
Download one file per hour containing all requested variables, then extract and merge them into a single Dataset.
This is the recommended method when you need more than one variable — it uses N_hours requests instead of N_vars × N_hours.
| Parameter | Type | Description |
|---|---|---|
| var_keys | list[str] | List of variable keys from the Variable Catalogue. |
| hours | list[int] | Forecast hours to include. |
| force_download | bool | Re-download even if cached. Default False. |
Returns: xr.Dataset with all requested variables merged via xr.merge(..., join="inner").
Variables that fail to extract are logged and skipped; a RuntimeError is raised only if all variables fail.
ds = mgr.build_multi_dataset(
var_keys=["t2m", "prmsl", "prate", "u10", "v10"],
hours=list(range(0, 25, 6)),
)
# ds contains t2m, prmsl, prate, u10, v10 all on the same time axis
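The "log and skip, fail only if everything fails" behaviour described above can be sketched as follows. `extract_all` is a hypothetical helper and the per-variable extractor is passed in as a function; in the real method the surviving results would then be combined with `xr.merge(..., join="inner")`.

```python
import logging

log = logging.getLogger("gfs_dataset")


def extract_all(var_keys, extract):
    """Collect extract(key) per variable, skipping failures (hypothetical)."""
    extracted = {}
    for key in var_keys:
        try:
            extracted[key] = extract(key)
        except Exception as exc:
            log.warning("Skipping %r: %s", key, exc)  # partial results are fine
    if not extracted:
        raise RuntimeError(f"none of {list(var_keys)} could be extracted")
    return extracted
```

A caller that needs every variable should therefore compare the keys of the result against the keys it requested.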
download_hours
mgr.download_hours(
var_keys: list[str],
hours: list[int],
force: bool = False,
) -> dict[int, Path]
Low-level method that performs the actual HTTP downloads.
Called internally by build_dataset and build_multi_dataset, but exposed for advanced use cases (e.g. downloading files without immediately building a Dataset).
| Parameter | Type | Description |
|---|---|---|
| var_keys | list[str] | Variables to bundle into each download URL. |
| hours | list[int] | Forecast hours to download. |
| force | bool | Re-download cached files. Default False. |
Returns: dict[int, Path] — mapping of {hour: path_to_grib2_file} for every successfully downloaded hour.
Files already on disk are returned immediately without any network I/O (cache hit is logged at INFO level).
files = mgr.download_hours(["t2m", "prate"], hours=[0, 6, 12])
# {0: PosixPath('.../gfs_..._f000.grib2'),
# 6: PosixPath('.../gfs_..._f006.grib2'),
# 12: PosixPath('.../gfs_..._f012.grib2')}
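The interaction between the cache, the `force` flag and the `pause` setting can be sketched like this. It is an illustration under assumptions, not the library's implementation: `download_hours_sketch`, `fetch` and `name_for` are hypothetical, and the network call is injected so the loop can be run offline.

```python
import time
from pathlib import Path


def download_hours_sketch(hours, output_dir, fetch, name_for, pause=1.5, force=False):
    """Return {hour: path}; fetch(hour) is called only for missing files."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    files = {}
    requested = False
    for hour in hours:
        path = output / name_for(hour)
        if path.exists() and not force:
            files[hour] = path  # cache hit: no network I/O, no pause
            continue
        if requested:
            time.sleep(pause)  # wait between consecutive real requests
        requested = True
        try:
            data = fetch(hour)
        except OSError:  # includes urllib.error.URLError
            continue  # failed hours are left out of the mapping
        path.write_bytes(data)
        files[hour] = path
    return files
```

Only hours that end up on disk appear in the returned mapping, which matches the "every successfully downloaded hour" wording above.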
save_netcdf
mgr.save_netcdf(
ds: xr.Dataset,
filename: str,
complevel: int = 4,
) -> Path
Save a Dataset to a zlib-compressed NetCDF4 file.
| Parameter | Type | Description |
|---|---|---|
| ds | xr.Dataset | Dataset to save. |
| filename | str | Output file path. Absolute paths are used as-is; relative paths are resolved against output_dir. |
| complevel | int | zlib compression level 1–9 (higher = smaller file, slower write). Default 4. |
Returns: Path — absolute path of the saved file.
path = mgr.save_netcdf(ds, "/data/gfs_t2m_48h.nc")
# or relative (saved inside output_dir):
path = mgr.save_netcdf(ds, "gfs_t2m_48h.nc")
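The path rule and the compression setting can be pictured with two small helpers. Both names are hypothetical; the encoding dict uses the per-variable `zlib` / `complevel` keys that xarray's `Dataset.to_netcdf(encoding=...)` accepts for NetCDF4 output, which is presumably what the method relies on.

```python
from pathlib import Path


def resolve_output(output_dir, filename):
    """Keep absolute paths; place relative ones inside output_dir."""
    path = Path(filename)
    if not path.is_absolute():
        path = Path(output_dir) / path
    return path.resolve()


def netcdf_encoding(var_names, complevel=4):
    """Per-variable zlib settings for to_netcdf(encoding=...)."""
    return {name: {"zlib": True, "complevel": complevel} for name in var_names}


print(resolve_output("./gfs_output", "gfs_t2m_48h.nc"))
print(netcdf_encoding(["t2m", "prate"]))
```

With a Dataset `ds` in hand, these would be combined as `ds.to_netcdf(resolve_output(...), encoding=netcdf_encoding(ds.data_vars))`.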
save_zarr
mgr.save_zarr(
ds: xr.Dataset,
store: str,
) -> Path
Save a Dataset as a chunked Zarr store (directory).
Zarr is preferred over NetCDF for large time-series because it supports:
- Lazy chunked reads without loading the whole file into memory.
- Appending new timesteps without rewriting existing data.
| Parameter | Type | Description |
|---|---|---|
| ds | xr.Dataset | Dataset to save. |
| store | str | Output directory path. Relative paths are resolved against output_dir. |
Returns: Path — absolute path of the Zarr store directory.
path = mgr.save_zarr(ds, "gfs_surface_16days.zarr")
load_netcdf
GFSDatasetManager.load_netcdf(path: str | Path) -> xr.Dataset
Static method. Lazily open a previously saved NetCDF file using Dask-backed chunking.
ds = GFSDatasetManager.load_netcdf("/data/gfs_t2m_48h.nc")
print(dict(ds.dims)) # {'time': 9, 'latitude': 721, 'longitude': 1440}
load_zarr
GFSDatasetManager.load_zarr(store: str | Path) -> xr.Dataset
Static method. Lazily open a previously saved Zarr store.
ds = GFSDatasetManager.load_zarr("gfs_surface_16days.zarr")
Variable Catalogue
Access the full catalogue at runtime:
from noawclg import VARIABLES, SURFACE_VARS, MULTILEVEL_VARS
print(SURFACE_VARS) # all 2-D (no level dimension) variable keys
print(MULTILEVEL_VARS) # all variables with a vertical level dimension
Surface / single-level variables
| Key | Long name | Units |
|---|---|---|
| t2m | 2 metre temperature | °C |
| d2m | 2 metre dewpoint temperature | °C |
| r2 | 2 metre relative humidity | % |
| sh2 | 2 metre specific humidity | kg kg⁻¹ |
| aptmp | Apparent temperature | °C |
| u10 | 10 metre U wind component | m s⁻¹ |
| v10 | 10 metre V wind component | m s⁻¹ |
| gust | Wind speed (gust) | m s⁻¹ |
| prmsl | Pressure reduced to MSL | hPa |
| mslet | MSLP (Eta model reduction) | hPa |
| sp | Surface pressure | hPa |
| orog | Orography | m |
| lsm | Land-sea mask | 0–1 |
| vis | Visibility | m |
| prate | Precipitation rate | kg m⁻² s⁻¹ |
| cpofp | Percent frozen precipitation | % |
| crain | Categorical rain | - |
| csnow | Categorical snow | - |
| cfrzr | Categorical freezing rain | - |
| cicep | Categorical ice pellets | - |
| sde | Snow depth | m |
| sdwe | Water equivalent of snow depth | kg m⁻² |
| pwat | Precipitable water | kg m⁻² |
| cwat | Cloud water | kg m⁻² |
| tcc | Total cloud cover | % |
| lcc | Low cloud cover | % |
| mcc | Medium cloud cover | % |
| hcc | High cloud cover | % |
| lftx | Surface lifted index | K |
| lftx4 | Best (4-layer) lifted index | K |
| hlcy | Storm relative helicity | m² s⁻² |
| refc | Composite radar reflectivity | dB |
| siconc | Sea ice area fraction | 0–1 |
| veg | Vegetation | % |
| tozne | Total ozone | DU |
Multi-level variables
These variables include a level dimension in the output Dataset.
| Key | Long name | Units | Levels |
|---|---|---|---|
| t | Temperature | °C | 80–1000 hPa (13 levels) |
| r | Relative humidity | % | 80–1000 hPa (13 levels) |
| q | Specific humidity | kg kg⁻¹ | 80, 1000 hPa |
| gh | Geopotential height | gpm | 500–1000 hPa (5 levels) |
| u | U component of wind | m s⁻¹ | 200–1000 hPa (9 levels) |
| v | V component of wind | m s⁻¹ | 200–1000 hPa (9 levels) |
| w | Vertical velocity | Pa s⁻¹ | 100–850 hPa (8 levels) |
| absv | Absolute vorticity | s⁻¹ | 100–1000 hPa (8 levels) |
| cape | CAPE | J kg⁻¹ | surface layers |
| cin | Convective inhibition | J kg⁻¹ | surface layers |
| st | Soil temperature | °C | 0–100 cm (4 layers) |
| soilw | Volumetric soil moisture | Proportion | 0–100 cm (4 layers) |
Pre-defined Hour Sequences
from noawclg import (
HOURS_16DAYS, # 0–120 h (6-hourly) + 123–384 h (3-hourly) — full 16-day run
HOURS_5DAYS_1H, # 0–120 h (1-hourly)
HOURS_10DAYS_3H, # 0–240 h (3-hourly)
HOURS_16DAYS_3H, # 0–120 h (3-hourly) + 123–384 h (3-hourly)
)
Use them directly with build_dataset or build_multi_dataset:
ds = mgr.build_dataset("t2m", hours=HOURS_16DAYS)
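Since each forecast hour costs one HTTP request, the length of a sequence is also its request count. The lists below are hypothetical reconstructions from the comments above, written with plain `range` calls; they are not the library's own definitions and may differ from them.

```python
# Reconstructed from the comments above (assumption, not the library's code).
hours_16days = list(range(0, 121, 6)) + list(range(123, 385, 3))
hours_5days_1h = list(range(0, 121))
hours_10days_3h = list(range(0, 241, 3))
hours_16days_3h = list(range(0, 385, 3))

for name, seq in [
    ("16 days", hours_16days),
    ("5 days, 1-hourly", hours_5days_1h),
    ("10 days, 3-hourly", hours_10days_3h),
    ("16 days, 3-hourly", hours_16days_3h),
]:
    # length == number of requests for one build_multi_dataset call
    print(f"{name}: {len(seq)} hours, f{seq[0]:03d} to f{seq[-1]:03d}")
```

The same pattern works for any custom schedule, for example `list(range(0, 73, 3))` for three days at 3-hourly steps.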
Region Subsetting
Pass a region dict to download only the data inside a bounding box.
This dramatically reduces file size and download time for regional studies.
# South America
REGION_SA = {
"toplat": 12,
"bottomlat": -56,
"leftlon": -82,
"rightlon": -34,
}
# Brazil
REGION_BR = {
"toplat": 5,
"bottomlat": -35,
"leftlon": -75,
"rightlon": -34,
}
mgr = GFSDatasetManager(
date="20260403",
cycle="06",
region=REGION_BR,
)
Pass region=None (the default) for a global download.
Note: The region tag is embedded in the cache filename, so global and regional downloads never collide even when sharing the same output_dir.
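A rough estimate of the saving for the Brazil box: the sketch below counts 0.25° grid nodes inside the bounding box and compares them with the 721 × 1440 global grid shown in Quick Start. It assumes the box edges fall on grid nodes and are inclusive, so the actual file may differ by a row or column; `grid_shape` is a hypothetical helper.

```python
GRID_STEP = 0.25
GLOBAL_POINTS = 721 * 1440  # global grid size from the Quick Start output


def grid_shape(region):
    """Approximate (rows, cols) of a 0.25 degree subregion, edges inclusive."""
    rows = round((region["toplat"] - region["bottomlat"]) / GRID_STEP) + 1
    cols = round((region["rightlon"] - region["leftlon"]) / GRID_STEP) + 1
    return rows, cols


REGION_BR = {"toplat": 5, "bottomlat": -35, "leftlon": -75, "rightlon": -34}
rows, cols = grid_shape(REGION_BR)
print(rows, cols, f"{rows * cols / GLOBAL_POINTS:.1%}")  # 161 165 2.6%
```

Under these assumptions the Brazil subset holds roughly one fortieth of the global grid points per field.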
Logging
noawclg uses Python's standard logging module under the logger name gfs_dataset.
Enable it in your application to see download progress, cache hits and extraction warnings:
import logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)-8s %(message)s",
datefmt="%H:%M:%S",
)
Sample output:
10:02:15 INFO Download: 9 hour(s) × 1 file each = 9 request(s) (vars: ['t2m', 'prate'])
10:02:17 INFO [multi] → f000 https://nomads.ncep.noaa.gov/cgi-bin/filter_gfs_0p25_1hr.pl?...
10:02:19 INFO [ok] f000 284 KB | 11.1% (1/9) elapsed=2.1s remaining≈15.2s
10:02:21 INFO [cache] f006 gfs_20260403_06z_prate_t2m_global_f006.grib2
10:02:21 INFO Extracting 't2m' …
10:02:21 INFO Extracting 'prate' …
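Because the logger has a fixed name, its verbosity can be set independently of the rest of the application. The snippet below uses only the standard logging module and the logger name stated above.

```python
import logging

# Keep third-party libraries quiet by default...
logging.basicConfig(level=logging.WARNING, format="%(asctime)s %(levelname)-8s %(message)s")

# ...but show noawclg download progress and cache hits.
gfs_logger = logging.getLogger("gfs_dataset")
gfs_logger.setLevel(logging.INFO)
```

Setting the level to `logging.WARNING` instead would hide progress lines while keeping extraction warnings visible.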
Examples
1 — Surface forecast for Brazil, 48 h
from noawclg import GFSDatasetManager
mgr = GFSDatasetManager(
date="20260403",
cycle="06",
region={"toplat": 5, "bottomlat": -35, "leftlon": -75, "rightlon": -34},
)
ds = mgr.build_multi_dataset(
var_keys=["t2m", "prate", "prmsl", "u10", "v10"],
hours=list(range(0, 49, 6)),
)
mgr.save_netcdf(ds, "gfs_brazil_48h.nc")
2 — Upper-air wind profile, global, 24 h
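from noawclg import GFSDatasetManager
# No region argument: region=None (the default) requests the global grid
mgr = GFSDatasetManager(date="20260403", cycle="06")
ds = mgr.build_multi_dataset(
    var_keys=["u", "v", "gh"],  # multi-level isobaric
    hours=list(range(0, 25, 6)),
)
# ds["u"] has dims (time, level, latitude, longitude)
u_500 = ds["u"].sel(level=500)  # U wind component at 500 hPa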
3 — 16-day t2m time-series, saved as Zarr
from noawclg import GFSDatasetManager, HOURS_16DAYS
mgr = GFSDatasetManager(date="20260403", cycle="00")
ds = mgr.build_dataset("t2m", hours=HOURS_16DAYS)
mgr.save_zarr(ds, "gfs_t2m_16days.zarr")
4 — Reload and compute a daily mean
import xarray as xr
from noawclg import GFSDatasetManager
# Example 1 saved with a relative name, so the file lives inside output_dir
ds = GFSDatasetManager.load_netcdf("./gfs_output/gfs_brazil_48h.nc")
t2m = ds["t2m"]
daily_mean = t2m.resample(time="1D").mean()
print(daily_mean)
5 — Download only (no Dataset construction)
files = mgr.download_hours(
var_keys=["t2m", "prate"],
hours=[0, 6, 12, 24],
)
# {0: PosixPath('./gfs_output/gfs_20260403_06z_prate_t2m_global_f000.grib2'), ...}
Contributing
Pull requests are welcome. For major changes please open an issue first to discuss what you would like to change.
git clone https://github.com/reinanbr/noawclg
cd noawclg
pip install -e ".[dev]"
License
MIT © Reinan BR