xarray extension for GDAL

gdalxarray

An xarray backend powered directly by GDAL.

import xarray as xr
ds = xr.open_dataset(path_or_uri, engine="gdalxarray")

gdalxarray is a thin bridge between GDAL's reading capabilities (classic raster, multidimensional, and any of the virtualized stores GDAL knows about) and xarray's labelled-array model. It is lazy by default, optionally Dask-chunked, and handles CRS and CF time natively.

Installation

GDAL has no usable PyPI wheels, so pip install gdalxarray alone is not enough. You need a working osgeo.gdal Python binding first. The recommended paths, in order of increasing friction:

  • conda-forge (cross-platform, just works): mamba install -c conda-forge gdal
  • Docker image with GDAL preinstalled (e.g. ghcr.io/hypertidy/gdal-r-python:latest)
  • System package manager (apt, brew) plus matching system Python bindings

Then pip install gdalxarray for the engine itself. See INSTALL.md for the full guide, including troubleshooting for NumPy ABI mismatches and Python version pinning.
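
Before installing the engine, it can help to confirm that the GDAL bindings actually import. The following is a minimal sketch using only calls from osgeo itself; the function name gdal_binding_status is hypothetical and not part of gdalxarray. It also imports gdal_array, GDAL's NumPy bridge, on the assumption that NumPy-related import problems tend to surface there.

```python
import sys


def gdal_binding_status() -> str:
    """Return a one-line report on whether the osgeo bindings can be imported."""
    try:
        # gdal is the core binding; gdal_array is the NumPy bridge.
        from osgeo import gdal, gdal_array  # noqa: F401
    except ImportError as exc:
        return f"osgeo bindings not importable ({exc}); install GDAL before gdalxarray"
    # VersionInfo("RELEASE_NAME") returns the GDAL release as a string.
    py = f"{sys.version_info.major}.{sys.version_info.minor}"
    return f"GDAL {gdal.VersionInfo('RELEASE_NAME')} on Python {py}"


if __name__ == "__main__":
    print(gdal_binding_status())
```

If this reports an import failure, fix the GDAL install first; pip install gdalxarray will not repair it.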

Why this exists (vs rioxarray)

rioxarray is an xarray accessor and backend built on rasterio, which wraps GDAL with its own Python conventions. For straightforward 2D/3D raster work it's the mature, widely-used choice: the da.rio.reproject(...) accessor pattern is well-known and well-tested.

gdalxarray goes directly to osgeo.gdal, with no rasterio layer in between. That choice matters in a few specific cases:

  • GDAL's multidimensional API is exposed natively: N-D arrays with named dimensions, not just (y, x) rasters with optional bands
  • Any GDAL virtualization composes: /vsicurl/, /vsis3/, vrt://, ZARR:, NETCDF:, classic VRT, multidim VRT
  • Codec and driver support tracks GDAL rather than whatever rasterio re-exposes: Zarr v3, Icechunk, kerchunk-Parquet stores, GRIB, and HDF4/5 multidim are all readable via the GDAL drivers

For a single GeoTIFF or a STAC item, rioxarray is usually a better fit. For multidim cloud-native datasets, virtualized Zarr/Icechunk stores, or anything where you want GDAL itself to be the source of truth, gdalxarray puts you closer to the metal.

Three core usage modes

The package has three ways to open a dataset, and almost everything else is a composition of these with GDAL virtual paths.

1. Classic raster, bands as a dimension (the default classic layout)

For multispectral imagery, image stacks, and anything where bands are interchangeable axes. Produces a single band_data DataArray with dims (band, y, x), the rioxarray-compatible layout.

import xarray as xr

ds = xr.open_dataset("image.tif", engine="gdalxarray", multidim=False)
ds["band_data"]
# <xarray.DataArray 'band_data' (band: 4, y: 1024, x: 1024)>
#   ...

# xarray idioms work as expected:
mean_image = ds["band_data"].mean(dim="band")
just_nir = ds["band_data"].sel(band=4)

2. Classic raster, bands as separate variables

For multiband rasters where each band carries a semantically distinct quantity (e.g. a NetCDF translated to multiband GeoTIFF where bands are different physical variables). Each band becomes a separate data variable named after its description.

ds = xr.open_dataset(
    "multivariable.tif",
    engine="gdalxarray",
    multidim=False,
    band_as_dim=False,
)
ds
# <xarray.Dataset>
# Data variables:
#     temperature  (y, x) float32
#     salinity     (y, x) float32
#     density      (y, x) float32

3. Multidim: N-D arrays with named dimensions

For datasets with their own dimension/coordinate structure: HDF5, NetCDF, multidim VRT, GRIB, Zarr (v2 and v3). Produces a Dataset whose dims and coords come from the source.

ds = xr.open_dataset("dataset.nc", engine="gdalxarray", multidim=True)
ds["temperature"].sel(time="2024-06", level=500).isel(latitude=slice(100, 200))

multidim=True is the default for engine="gdalxarray", which is why the classic-raster examples above pass multidim=False explicitly.
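
The three modes differ only in the keyword arguments passed to xr.open_dataset. As a rough sketch, they can be written down as a small lookup; the mode labels and the open_kwargs helper are invented here for illustration, while multidim and band_as_dim are the options shown in the examples above.

```python
from typing import Any, Dict

# Keyword arguments for each mode. The dict keys are labels made up for
# this sketch; only the option names come from the examples above.
MODES: Dict[str, Dict[str, Any]] = {
    "bands_as_dim": {"multidim": False},
    "bands_as_variables": {"multidim": False, "band_as_dim": False},
    "multidim": {"multidim": True},
}


def open_kwargs(mode: str) -> Dict[str, Any]:
    """Return keyword arguments for xr.open_dataset for one of the three modes."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODES)}")
    return {"engine": "gdalxarray", **MODES[mode]}
```

Usage would then look like xr.open_dataset("image.tif", **open_kwargs("bands_as_dim")).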

4. Warp recipes: lazy reprojection

For warping any GDAL-readable source into a target CRS, grid, or projection, gdalxarray.warp returns a VRT recipe string rather than materialising pixels:

import gdalxarray
import xarray as xr

vrt = gdalxarray.warp(source, crs="+proj=laea")
ds = xr.open_dataset(vrt, engine="gdalxarray", multidim=False)

The full warp configuration (target CRS, GCPs/RPCs/geolocation arrays, cutlines, resampling) is encoded in ~2 KB of VRT XML. Only the bytes your code actually reads flow over the network or off disk.
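
To make the "recipe, not pixels" idea concrete, here is a sketch of how a comparable VRT string can be produced with plain osgeo.gdal. This is not gdalxarray.warp's implementation, which is not shown in this README; the helper name warp_vrt_xml and its parameters are hypothetical. It relies on gdal.Warp with format="VRT" and on the "xml:VRT" metadata domain of VRT datasets.

```python
def warp_vrt_xml(source: str, crs: str, resampling: str = "bilinear") -> str:
    """Describe a warp of `source` into `crs` as VRT XML, without reading pixels."""
    from osgeo import gdal  # imported lazily so this module loads without GDAL

    gdal.UseExceptions()
    # An empty destination name with format="VRT" keeps the result in memory.
    vrt_ds = gdal.Warp("", source, format="VRT", dstSRS=crs, resampleAlg=resampling)
    # VRT datasets expose their own XML through the "xml:VRT" metadata domain.
    xml = vrt_ds.GetMetadata("xml:VRT")[0]
    vrt_ds = None  # release the dataset handle
    return xml
```

The returned string is only a description of the warp, so checking len(xml) for a given source is a quick way to see how small the recipe is.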

Composing with GDAL virtual paths

The three modes above combine with GDAL's virtualization layers to cover nearly every cloud-native and remote-data scenario. None of these require any code changes in gdalxarray; they're just different paths:

Prefix or syntax          What it does
/vsicurl/<url>            HTTP/HTTPS-served files
/vsis3/<bucket>/<key>     S3 (anonymous via AWS_NO_SIGN_REQUEST=YES)
/vsigs/...                Google Cloud Storage
vrt://<path>?<options>    Inline classic-raster VRT: subdataset selection, resampling, ...
NETCDF:<path>:<var>       Pick a subdataset from a NetCDF
ZARR:"<path>":/<array>    Open one array of a Zarr store as a classic raster
Classic VRT (.vrt)        XML file referencing other sources
Multidim VRT (.vrt)       N-D version, layered over NetCDF/HDF/Zarr sources

A few illustrative compositions:

# Public COG over HTTPS:
xr.open_dataset(
    "/vsicurl/https://example.com/data.tif",
    engine="gdalxarray", multidim=False,
)

# All variables of a CMEMS NetCDF on S3:
xr.open_dataset(
    "NETCDF:/vsis3/bucket/path/file.nc",
    engine="gdalxarray", multidim=True,
)

# A multidim VRT as a labelled coordinate-aware view over a raw NetCDF:
xr.open_dataset("study_area.vrt", engine="gdalxarray", multidim=True)
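
Because these are just strings, composing them is plain string formatting. The helpers below are a hypothetical sketch (none of these function names exist in gdalxarray or GDAL) that builds a few of the path forms from the table. The NetCDF helper quotes the file path, a form GDAL also accepts, so that colons inside the path (for example in a /vsicurl/https://... URL) stay unambiguous.

```python
def vsicurl(url: str) -> str:
    """Prefix an HTTP(S) URL so GDAL reads it over the network."""
    return f"/vsicurl/{url}"


def vsis3(bucket: str, key: str) -> str:
    """Build a /vsis3/ path from a bucket name and an object key."""
    clean_key = key.lstrip("/")
    return f"/vsis3/{bucket}/{clean_key}"


def netcdf_subdataset(path: str, variable: str) -> str:
    """Select one variable of a NetCDF file as a classic raster."""
    # Quoting the path keeps any colons inside it from being read as separators.
    return f'NETCDF:"{path}":{variable}'


def zarr_array(path: str, array: str) -> str:
    """Select one array of a Zarr store as a classic raster."""
    clean_array = array.lstrip("/")
    return f'ZARR:"{path}":/{clean_array}'
```

For example, netcdf_subdataset(vsis3("bucket", "path/file.nc"), "sst") yields a path that could be passed to xr.open_dataset with multidim=False; the variable name sst is only a placeholder.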

Which mode for which format?

As a rough guide, multidim=True is the natural fit for formats whose own data model is N-dimensional with named axes:

  • NetCDF (3 and 4)
  • HDF5 / HDF4
  • Multidim VRT
  • GRIB / GRIB2
  • Zarr (v2 and v3)
  • Icechunk (where supported by your GDAL build)

multidim=False is the natural fit for image-like formats:

  • GeoTIFF (including COG)
  • JPEG, PNG, JPEG2000
  • ERDAS Imagine (.img)
  • Classic VRT files
  • Anything GDAL identifies as a 2D-with-bands raster
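
The rough guide above can be sketched as a small extension-based heuristic. The helper name guess_multidim and the suffix lists are assumptions made for illustration, not behaviour of gdalxarray. It deliberately returns None for .vrt, since a VRT can be either classic or multidim, and it does not handle URLs with query strings.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed, non-exhaustive suffix lists based on the formats named above.
MULTIDIM_SUFFIXES = {".nc", ".nc4", ".h5", ".hdf5", ".hdf",
                     ".grib", ".grib2", ".grb", ".grb2", ".zarr"}
CLASSIC_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg", ".png", ".jp2", ".img"}


def guess_multidim(path: str) -> Optional[bool]:
    """Suggest a multidim= value from the file suffix, or None if unsure."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in MULTIDIM_SUFFIXES:
        return True
    if suffix in CLASSIC_SUFFIXES:
        return False
    return None  # e.g. .vrt, which may be classic or multidim
```

When the result is None, the safest choice is to pass multidim explicitly rather than rely on a guess.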

Status

Active development. The API has settled but small changes are possible before 1.0. See CHANGELOG.md for what's landed and the issue tracker for what's next.

For worked examples against real cloud-native data (BRAN2023 ocean reanalysis, ECMWF AIFS forecasts, CMEMS sea level, NOAA OISST), see docs/cookbook.md.

License

Apache-2.0.
