Query and access Microsoft Planetary Computer Data Catalogs using geopandas and xarray.

pcxarray

A Python package for seamless querying, downloading, and processing of Microsoft Planetary Computer raster data using GeoPandas and Xarray.

Overview

pcxarray (Planetary Computer + xarray) bridges the gap between Microsoft's Planetary Computer Data Catalog and modern Python geospatial workflows. It enables querying satellite imagery using simple geometries and automatically loads the results as analysis-ready xarray DataArrays with proper spatial reference handling, mosaicking, and preprocessing. This package is designed to work seamlessly with Dask for lazy execution and distributed processing, making it ideal for large-scale geospatial data analysis.

Key Concepts

  • Geometry-based queries: Use any shapely geometry to define areas of interest
  • Automatic spatial processing: Handle reprojection, resampling, and mosaicking transparently
  • Dask integration: Lazy loading and parallel processing for large datasets
  • Analysis-ready data: Get properly georeferenced xarray DataArrays ready for analysis

Features

  • Query Microsoft Planetary Computer STAC API using shapely geometries
  • Retrieve results as GeoDataFrames for inspection, filtering, and spatial analysis
  • Download and mosaic raster data into xarray DataArrays with reprojection and resampling
  • Create timeseries datasets from multiple satellite acquisitions
  • Utilities for spatial analysis: grid creation and US Census TIGER shapefiles
  • Simple caching of expensive or repeated downloads
  • Designed for integration with Dask, Jupyter, and modern geospatial Python workflows

Installation

Install from PyPI:

python -m pip install pcxarray

Or install the development version from GitHub:

git clone https://github.com/gcermsu/pcxarray
cd pcxarray
python -m pip install -e ".[dev]"

Core Workflow

For a comprehensive quickstart guide, see the HLS time series example notebook.

The typical pcxarray workflow follows three main steps:

1. Define Area of Interest

from shapely.geometry import Polygon
import geopandas as gpd

# Create a geometry (CRS is important - results will match this CRS)
geom = Polygon([...])  # Area of interest
gdf = gpd.GeoDataFrame({"geometry": [geom]}, crs="EPSG:4326")
gdf = gdf.to_crs("EPSG:32616")  # Project to appropriate UTM zone
roi_geom = gdf.geometry.values[0]
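
The EPSG:32616 code above is the WGS 84 / UTM zone 16N projection. As a rough sketch of how such a code can be derived from a longitude/latitude pair, the helper below (utm_epsg is a hypothetical name, not part of pcxarray) applies the standard 6-degree zone rule. It ignores the special zone exceptions around Norway and Svalbard, and the sample coordinates are only illustrative.

```python
import math


def utm_epsg(lon: float, lat: float) -> int:
    """Return the EPSG code of the WGS 84 / UTM zone containing (lon, lat)."""
    if not -180.0 <= lon <= 180.0 or not -90.0 <= lat <= 90.0:
        raise ValueError("lon/lat must be given in degrees (EPSG:4326)")
    # UTM zones are 6 degrees wide and numbered 1..60 starting at 180 degrees west.
    zone = min(int(math.floor((lon + 180.0) / 6.0)) + 1, 60)
    # EPSG 326xx covers the northern hemisphere, 327xx the southern hemisphere.
    return (32600 if lat >= 0 else 32700) + zone


# Illustrative point in Mississippi, USA
print(utm_epsg(-88.8, 33.45))  # 32616
```

If GeoPandas is already in use, GeoDataFrame.estimate_utm_crs() is likely the more convenient route, since it returns a CRS object that can be passed straight to to_crs().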

2. Query Planetary Computer

from pcxarray import pc_query

# Query for satellite data
items_gdf = pc_query(
    collections='sentinel-2-l2a',  # Collection ID
    geometry=roi_geom,
    datetime='2024-01-01/2024-12-31',  # RFC 3339 datetime
    crs=gdf.crs
)
print(f"Found {len(items_gdf)} items")
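
The datetime argument above is a STAC-style interval string of the form start/end. A small sketch of building such strings from Python date objects follows. stac_interval is a hypothetical helper, and the assumption that pc_query forwards open-ended intervals (written with "..") unchanged to the STAC API has not been verified here.

```python
from datetime import date
from typing import Optional


def stac_interval(start: Optional[date], end: Optional[date]) -> str:
    """Format a closed or half-open date range as a STAC datetime interval."""
    if start is None and end is None:
        raise ValueError("at least one of start or end is required")
    if start is not None and end is not None and start > end:
        raise ValueError("start must not be after end")
    # ".." marks an open end in the STAC API datetime syntax.
    left = start.isoformat() if start is not None else ".."
    right = end.isoformat() if end is not None else ".."
    return f"{left}/{right}"


print(stac_interval(date(2024, 1, 1), date(2024, 12, 31)))  # 2024-01-01/2024-12-31
print(stac_interval(date(2020, 1, 1), None))                # 2020-01-01/..
```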

3. Load and Process Data

from pcxarray import prepare_data

# Load as xarray DataArray
imagery = prepare_data(
    items_gdf=items_gdf,
    geometry=roi_geom,
    crs=gdf.crs,
    bands=['B04', 'B03', 'B02'],  # Red, Green, Blue
    target_resolution=10.0,  # meters
    merge_method='mean'
)

# Visualize (scale to roughly 0-1 and clip bright pixels for RGB display)
(imagery / 3000).clip(0, 1).plot.imshow()

Quick Examples

NAIP Imagery

from pcxarray import query_and_prepare
from pcxarray.utils import create_grid, load_census_shapefile

# Load state boundaries and create processing grid
states_gdf = load_census_shapefile(level="state")
ms_gdf = states_gdf[states_gdf['STUSPS'] == 'MS'].to_crs(3814)

# Create 1km grid and select a cell
grid_gdf = create_grid(ms_gdf.iloc[0].geometry, crs=ms_gdf.crs, cell_size=1000)
selected_geom = grid_gdf.iloc[10000].geometry

# Query and load NAIP imagery
imagery = query_and_prepare(
    collections='naip',
    geometry=selected_geom,
    crs=ms_gdf.crs,
    datetime='2023',
    target_resolution=1.0,
    bands=[4, 1, 2]  # NIR, Red, Green
)
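
To make the idea of a processing grid concrete, here is a minimal pure-Python sketch that tiles a bounding box with square cells. It is not the implementation behind create_grid, which may, for example, clip cells to the input geometry. grid_cells and the sample bounds are hypothetical.

```python
import math
from typing import List, Tuple

Bounds = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def grid_cells(bounds: Bounds, cell_size: float) -> List[Bounds]:
    """Tile a bounding box with square cells, row by row from the lower-left corner."""
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    minx, miny, maxx, maxy = bounds
    # Round up so that the cells fully cover the box, even if they overshoot its edge.
    ncols = math.ceil((maxx - minx) / cell_size)
    nrows = math.ceil((maxy - miny) / cell_size)
    cells = []
    for row in range(nrows):
        for col in range(ncols):
            x0 = minx + col * cell_size
            y0 = miny + row * cell_size
            cells.append((x0, y0, x0 + cell_size, y0 + cell_size))
    return cells


# A 2.5 km x 1.2 km box in projected metres, tiled with 1 km cells
cells = grid_cells((0.0, 0.0, 2500.0, 1200.0), cell_size=1000.0)
print(len(cells))  # 6 (3 columns x 2 rows)
```

Because cell size is expressed in CRS units, a grid like this only has a meaningful metric size in a projected CRS, which is why the example above reprojects the state boundary before gridding.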

Satellite Timeseries Analysis

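from pcxarray import pc_query, prepare_timeseries

# roi_geom and utm_crs are assumed to be defined as in the Core Workflow above
# (a shapely geometry and its projected CRS, e.g. gdf.crs)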

# Query multiple years of Landsat data
items_gdf = pc_query(
    collections="landsat-c2-l2",
    geometry=roi_geom,
    datetime="2020/2024", 
    crs=utm_crs,
    # query={"eo:cloud_cover": {"lt": 5}}  # Optional cloud cover filter
)

# Create timeseries DataArray
timeseries = prepare_timeseries(
    items_gdf=items_gdf,
    geometry=roi_geom,
    crs=utm_crs,
    bands=["red", "nir08"],  # NDVI needs the red and near-infrared bands
    chunks={"time": 16, "x": 2048, "y": 2048}
)

# Convert from DN to surface reflectance (Landsat Collection 2 Level-2 scale and offset)
timeseries = (timeseries * 0.0000275) - 0.2

# Calculate NDVI timeseries: (NIR - Red) / (NIR + Red)
nir = timeseries.sel(band="nir08")
red = timeseries.sel(band="red")
ndvi = (nir - red) / (nir + red)

# Compute monthly means; persist() starts the computation and keeps the result in memory
# (recent pandas/xarray versions prefer the "1ME" month-end alias over "1M")
monthly_ndvi = ndvi.resample(time="1M").mean().persist()
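
As a worked check of the arithmetic in this example, the sketch below applies the same scale and offset to two made-up digital numbers and forms a normalized difference from them. NDVI is conventionally computed from the red and near-infrared bands. The function names and sample values are illustrative only.

```python
SCALE = 0.0000275  # multiplicative scale factor used in the example above
OFFSET = -0.2      # additive offset used in the example above


def to_reflectance(dn: int) -> float:
    """Convert a scaled integer digital number to surface reflectance."""
    return dn * SCALE + OFFSET


def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or NaN when the denominator is zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else float("nan")


nir = to_reflectance(20000)  # approximately 0.35
red = to_reflectance(10000)  # approximately 0.075
ndvi = normalized_difference(nir, red)
print(round(nir, 3), round(red, 3), round(ndvi, 3))  # 0.35 0.075 0.647
```

On xarray DataArrays the same expressions apply element-wise, so no explicit loop is needed there.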

For more complete examples, see the examples/ directory.

Working with Large Datasets

pcxarray is designed for Dask's lazy execution model, making it efficient for large datasets:

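from dask.distributed import Client
from pcxarray import prepare_timeseries

# Start Dask client for parallel processing
client = Client(processes=True)

# Prepare timeseries (creates computation graph, doesn't load data).
# large_items_gdf, roi_geom, and target_crs are assumed to be defined beforehand.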
da = prepare_timeseries(
    items_gdf=large_items_gdf,
    geometry=roi_geom, 
    crs=target_crs,
    bands=["B04", "B08"],
    chunks={"time": 32, "x": 2048, "y": 2048}
)

# Process data (computation happens here);
# recent pandas/xarray versions prefer the "1ME" month-end alias over "1M"
result = da.resample(time="1M").mean().compute()
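
When choosing chunks, it can help to estimate how much memory a single chunk occupies. The sketch below does that arithmetic for the chunk shape used above, assuming one band per chunk and a float32 or float64 dtype. The actual dtype and band chunking depend on the collection and on how pcxarray builds the array, so treat the numbers as a rough guide.

```python
from math import prod
from typing import Dict


def chunk_nbytes(chunks: Dict[str, int], itemsize: int) -> int:
    """Bytes occupied by one chunk: number of elements times bytes per element."""
    return prod(chunks.values()) * itemsize


chunks = {"time": 32, "x": 2048, "y": 2048}
for name, itemsize in (("float32", 4), ("float64", 8)):
    mib = chunk_nbytes(chunks, itemsize) / 2**20
    print(f"{name}: {mib:.0f} MiB per chunk")
# float32: 512 MiB per chunk
# float64: 1024 MiB per chunk
```

Each worker thread may hold several chunks at once, so if workers run out of memory, reducing the time chunk size is usually the first thing to try.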

Supported Collections

pcxarray works with Microsoft Planetary Computer collections that provide data in Cloud Optimized GeoTIFF (COG) format and are accessible via the STAC API. The collections are identified by their unique IDs, which can be used in queries to retrieve data. Popular examples include:

  • Landsat: landsat-c2-l2 (Landsat Collection 2 Level-2)
  • Sentinel-2: sentinel-2-l2a (Sentinel-2 Level-2A)
  • NAIP: naip (National Agriculture Imagery Program)
  • HLS: hls2-l30, hls2-s30 (Harmonized Landsat Sentinel-2)
  • Soil Data: gnatsgo-rasters (gridded National Soil Survey Geographic Database)

Use get_pc_collections() to discover available collections. Note that not all collections are compatible with pcxarray: collections that do not provide COGs (such as the Sentinel-3 and Sentinel-5P collections) and datasets that are not cataloged in the Planetary Computer STAC API (such as NLCD) cannot be loaded. Consult the Planetary Computer Data Catalog for a complete list of available datasets.

Complete Examples

Comprehensive example notebooks are available in the examples/ directory of the repository.

API Reference

Documentation can be found at pcxarray.readthedocs.io.

Known Issues

  • Chunking along the band or time dimension when preparing timeseries datasets can trigger rechunking, which may be undesirable.
  • Some collections use different metadata schemas, which can cause errors. If you encounter one, please open an issue on GitHub.
  • When using the Dask distributed scheduler, open_rasterio tasks may get stuck and prevent the computation graph from fully executing. Initializing the Dask client with processes=True seems to resolve this.

Acknowledgements

This package is developed and maintained by the GCERLab group at Mississippi State University. We welcome contributions and feedback from the community. If you find any issues or have feature requests, please open an issue on GitHub. Pull requests are also welcome!
