Query and access Microsoft Planetary Computer Data Catalogs using geopandas and xarray.

pcxarray

License: MIT

Planetary Computer + Xarray: A Python package for seamless querying, downloading, and processing of Microsoft Planetary Computer raster data using GeoPandas and Xarray.

Overview

pcxarray bridges the gap between Microsoft's Planetary Computer STAC API and modern Python geospatial workflows. It enables querying satellite imagery using simple geometries and automatically loads the results as analysis-ready xarray DataArrays with proper spatial reference handling, mosaicking, and preprocessing. This package is designed to work seamlessly with Dask for lazy execution and distributed processing, making it ideal for large-scale geospatial data analysis.

Key Concepts

  • Geometry-based queries: Use any shapely geometry to define areas of interest
  • Automatic spatial processing: Handle reprojection, resampling, and mosaicking transparently
  • Dask integration: Lazy loading and parallel processing for large datasets
  • Analysis-ready data: Get properly georeferenced xarray DataArrays ready for analysis

Features

  • Query Microsoft Planetary Computer STAC API using shapely geometries
  • Retrieve results as GeoDataFrames for inspection, filtering, and spatial analysis
  • Download and mosaic raster data into xarray DataArrays with reprojection and resampling
  • Create timeseries datasets from multiple satellite acquisitions
  • Utilities for spatial analysis: grid creation and US Census TIGER shapefiles
  • Simple caching of expensive or repeated downloads
  • Designed for integration with Dask, Jupyter, and modern geospatial Python workflows

Installation

Install from PyPI:

python -m pip install pcxarray

Or install the development version from GitHub:

git clone https://github.com/gcermsu/pcxarray
cd pcxarray
python -m pip install -e ".[dev]"

Core Workflow

The typical pcxarray workflow follows three main steps:

1. Define Area of Interest

from shapely.geometry import Polygon
import geopandas as gpd

# Create a geometry (CRS is important - results will match this CRS)
geom = Polygon([...])  # Area of interest; replace [...] with your (lon, lat) coordinates
gdf = gpd.GeoDataFrame({"geometry": [geom]}, crs="EPSG:4326")
gdf = gdf.to_crs("EPSG:32616")  # Project to appropriate UTM zone
roi_geom = gdf.geometry.values[0]
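
The workflow above projects the area of interest to EPSG:32616 (UTM zone 16N). As a rough sketch of how such a code can be derived from a longitude/latitude pair, the helper below applies the standard 6-degree zone formula. Note that `utm_epsg` is a hypothetical name, not part of pcxarray, and the sketch ignores the special zones around Norway and Svalbard.

```python
import math


def utm_epsg(lon: float, lat: float) -> int:
    """Return the WGS 84 / UTM EPSG code for a lon/lat point (hypothetical helper)."""
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("lon/lat out of range")
    # UTM zones are 6 degrees wide, numbered 1..60 starting at 180 degrees west
    zone = min(int(math.floor((lon + 180.0) / 6.0)) + 1, 60)
    # EPSG 326xx = northern hemisphere, 327xx = southern hemisphere
    return (32600 if lat >= 0 else 32700) + zone


# A point in eastern Mississippi falls in UTM zone 16N
print(utm_epsg(-88.8, 33.45))  # 32616
```

GeoPandas can also estimate a suitable UTM CRS directly with `GeoDataFrame.estimate_utm_crs()`, which avoids hand-rolling the formula.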

2. Query Planetary Computer

from pcxarray import pc_query

# Query for satellite data
items_gdf = pc_query(
    collections='sentinel-2-l2a',  # Collection ID
    geometry=roi_geom,
    datetime='2024-01-01/2024-12-31',  # RFC 3339 datetime
    crs=gdf.crs
)
print(f"Found {len(items_gdf)} items")
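
The datetime argument above is a start/end interval written as a single string. When the dates are computed in code, a small helper can build that string. This is a minimal sketch that assumes plain calendar dates; `stac_interval` is a hypothetical name, not a pcxarray function.

```python
from datetime import date


def stac_interval(start: date, end: date) -> str:
    """Format two dates as a 'start/end' interval string (hypothetical helper)."""
    if end < start:
        raise ValueError("end must not precede start")
    return f"{start.isoformat()}/{end.isoformat()}"


# Same interval as the query above
print(stac_interval(date(2024, 1, 1), date(2024, 12, 31)))  # 2024-01-01/2024-12-31
```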

3. Load and Process Data

from pcxarray import prepare_data

# Load as xarray DataArray
imagery = prepare_data(
    items_gdf=items_gdf,
    geometry=roi_geom,
    crs=gdf.crs,
    bands=['B04', 'B03', 'B02'],  # Red, Green, Blue
    target_resolution=10.0,  # meters
    merge_method='mean'
)

# Visualize
(imagery / 3000).plot.imshow()
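
Assuming target_resolution is expressed in the units of the target CRS (as the meters comment above suggests for a UTM zone), the size of the output raster follows from the extent of the geometry. The sketch below estimates the pixel grid for a bounding box. `grid_shape` is a hypothetical helper, and pcxarray's own rounding and pixel alignment may differ.

```python
import math


def grid_shape(bounds, resolution):
    """Estimate (rows, cols) of a raster covering bounds at a given pixel size.

    bounds is (minx, miny, maxx, maxy) in CRS units; hypothetical helper.
    """
    minx, miny, maxx, maxy = bounds
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    cols = math.ceil((maxx - minx) / resolution)
    rows = math.ceil((maxy - miny) / resolution)
    return rows, cols


# A 1 km x 0.5 km extent at 10 m resolution
print(grid_shape((0.0, 0.0, 1000.0, 500.0), 10.0))  # (50, 100)
```

With shapely, the `bounds` attribute of a geometry returns a tuple in this order, so `grid_shape(roi_geom.bounds, 10.0)` gives a quick size check before loading data.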

Quick Examples

NAIP Imagery

from pcxarray import query_and_prepare
from pcxarray.utils import create_grid, load_census_shapefile

# Load state boundaries and create processing grid
states_gdf = load_census_shapefile(level="state")
ms_gdf = states_gdf[states_gdf['STUSPS'] == 'MS'].to_crs(3814)

# Create 1km grid and select a cell
grid_gdf = create_grid(ms_gdf.iloc[0].geometry, crs=ms_gdf.crs, cell_size=1000)
selected_geom = grid_gdf.iloc[10000].geometry

# Query and load NAIP imagery
imagery = query_and_prepare(
    collections='naip',
    geometry=selected_geom,
    crs=ms_gdf.crs,
    datetime='2023',
    target_resolution=1.0,
    bands=[4, 1, 2]  # NIR, Red, Green
)
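
The example above tiles a state boundary with 1 km cells via create_grid. To illustrate the idea only, the sketch below enumerates square cell bounds over a bounding box. This is not pcxarray's implementation, and `grid_cells` is a hypothetical name.

```python
import math


def grid_cells(minx, miny, maxx, maxy, cell_size):
    """Yield (minx, miny, maxx, maxy) for square cells covering a bounding box."""
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    n_cols = math.ceil((maxx - minx) / cell_size)
    n_rows = math.ceil((maxy - miny) / cell_size)
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = minx + col * cell_size
            y0 = miny + row * cell_size
            yield (x0, y0, x0 + cell_size, y0 + cell_size)


cells = list(grid_cells(0, 0, 2500, 1000, 1000))
print(len(cells))  # 3
print(cells[-1])   # (2000, 0, 3000, 1000)
```

Each tuple can be turned into a polygon with `shapely.geometry.box(*cell)`. Cells along the upper and right edges may extend past the bounding box, as the last cell here does.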

Satellite Timeseries Analysis

from pcxarray import pc_query, prepare_timeseries

# roi_geom is a projected area-of-interest geometry and utm_crs is its CRS
# (for example, roi_geom and gdf.crs from the Core Workflow section)

# Query multiple years of Landsat data
items_gdf = pc_query(
    collections="landsat-c2-l2",
    geometry=roi_geom,
    datetime="2020/2024",
    crs=utm_crs,
    # query={"eo:cloud_cover": {"lt": 5}}  # Optional cloud cover filter
)

# Create timeseries DataArray
timeseries = prepare_timeseries(
    items_gdf=items_gdf,
    geometry=roi_geom,
    crs=utm_crs,
    bands=["green", "nir08"],
    chunks={"time": 16, "x": 2048, "y": 2048}
)

# Convert from DN to surface reflectance (Landsat Collection 2 Level-2 scale and offset)
timeseries = (timeseries * 0.0000275) - 0.2

# Calculate a green NDVI (GNDVI) timeseries from the NIR and green bands
# (standard NDVI uses the red band in place of green)
nir = timeseries.sel(band="nir08")
green = timeseries.sel(band="green")
gndvi = (nir - green) / (nir + green)

# Compute monthly means ("MS" = month start; the older "M" alias is deprecated in recent pandas)
# persist() starts the computation and keeps the result in memory
monthly_gndvi = gndvi.resample(time="1MS").mean().persist()
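
To make the arithmetic above concrete, the sketch below applies the same scale and offset to two example digital numbers and then forms the normalized difference. The DN values are made up for illustration, and the function names are hypothetical.

```python
SCALE = 0.0000275  # scale factor used in the example above
OFFSET = -0.2      # offset used in the example above


def to_reflectance(dn):
    """Convert a digital number to reflectance with the scale/offset above."""
    return dn * SCALE + OFFSET


def normalized_difference(a, b):
    """Return (a - b) / (a + b); works on floats and on array-like objects."""
    return (a - b) / (a + b)


nir = to_reflectance(20000)    # illustrative NIR digital number
green = to_reflectance(12000)  # illustrative green digital number
index = normalized_difference(nir, green)
print(round(nir, 4), round(green, 4), round(index, 4))  # 0.35 0.13 0.4583
```

Because both functions use only arithmetic operators, they apply unchanged to xarray DataArrays, where the operations are evaluated lazily when the data is Dask-backed.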

For more complete examples, see the examples/ directory.

Working with Large Datasets

pcxarray is designed for Dask's lazy execution model, making it efficient for large datasets:

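from dask.distributed import Client
from pcxarray import prepare_timeseries

# Start Dask client for parallel processing
client = Client(processes=True)

# Prepare timeseries (creates computation graph, doesn't load data)
# large_items_gdf, roi_geom, and target_crs are assumed to come from an earlier pc_query step
da = prepare_timeseries(
    items_gdf=large_items_gdf,
    geometry=roi_geom,
    crs=target_crs,
    bands=["B04", "B08"],
    chunks={"time": 32, "x": 2048, "y": 2048}
)

# Process data (computation happens here)
# "MS" = month start; the older "M" alias is deprecated in recent pandas
result = da.resample(time="1MS").mean().compute()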
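
Chunk sizes such as those above determine how much memory each Dask task needs. The sketch below is a back-of-the-envelope check that assumes 4-byte float32 values; the actual dtype depends on the collection and on any scaling applied. `chunk_nbytes` is a hypothetical helper.

```python
import math


def chunk_nbytes(chunks: dict, itemsize: int) -> int:
    """Bytes held by one chunk with the given per-dimension sizes."""
    return math.prod(chunks.values()) * itemsize


chunks = {"time": 32, "x": 2048, "y": 2048}
nbytes = chunk_nbytes(chunks, itemsize=4)  # assumption: float32 values
print(nbytes / 2**20)  # 512.0 (MiB per chunk)
```

With 16-bit integers the same chunk is half that size. Smaller time chunks reduce per-task memory at the cost of a larger task graph.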

Supported Collections

pcxarray works with Microsoft Planetary Computer collections that provide data in Cloud Optimized GeoTIFF (COG) format and are accessible via the STAC API. The collections are identified by their unique IDs, which can be used in queries to retrieve data. Popular examples include:

  • Landsat: landsat-c2-l2 (Landsat Collection 2 Level-2)
  • Sentinel-2: sentinel-2-l2a (Sentinel-2 Level-2A)
  • NAIP: naip (National Agriculture Imagery Program)
  • HLS: hls2-l30, hls2-s30 (Harmonized Landsat Sentinel-2)
  • Soil Data: gnatsgo-rasters (Gridded National Soil Survey)

Use get_pc_collections() to discover available collections. Note that not all collections are compatible with pcxarray: examples include collections that do not provide COGs (such as the Sentinel-3 and Sentinel-5P collections) and datasets that are not cataloged in the Planetary Computer STAC API (such as NLCD). Consult the Planetary Computer Data Catalog for a complete list of available datasets.

Complete Examples

Comprehensive examples are available in the examples/ directory of the GitHub repository.

API Reference

Hosted documentation is currently unavailable, but each function should have a descriptive docstring and full type hints. Use Python's built-in help() function or IDE tooltips to explore available functions and parameters.

import pcxarray as pcx
help(pcx.prepare_data)  # View function signature and docstring
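
The same information is available programmatically through the standard inspect module. The sketch below uses json.dumps as a stand-in so that it runs without pcxarray installed. Substituting pcx.prepare_data should work the same way, assuming it is an ordinary Python function. `describe` is a hypothetical helper name.

```python
import inspect
import json


def describe(func):
    """Return (parameter names, first docstring line) for a callable."""
    params = list(inspect.signature(func).parameters)
    doc = inspect.getdoc(func) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return params, first_line


params, summary = describe(json.dumps)
print(params)   # parameter names in declaration order
print(summary)  # first line of the docstring
```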

Known Issues

  • Chunking along the band or time dimension when preparing timeseries datasets can trigger rechunking, which may be undesirable.
  • Some collections use metadata schemas that differ from what pcxarray expects, which can cause errors. If you encounter one, please open an issue on GitHub.
  • When using the Dask distributed scheduler, open_rasterio tasks may get stuck and prevent the computation graph from fully executing. Initializing the Dask client with processes=True seems to resolve this.
