earthcatalog is a scalable STAC ingestion library for partitioned GeoParquet catalogs

Maintainers

betolink

Project description

EarthCatalog

A library for processing STAC items into spatially partitioned GeoParquet catalogs.

Why EarthCatalog?

The Problem: Working with massive collections of geospatial data (satellite imagery, drone surveys, IoT sensors) is challenging because:

Traditional databases struggle with spatial queries at scale
Files become too large to process efficiently
Spatial overlap makes data organization complex
Updates may require full rebuilds

EarthCatalog transforms STAC items into fast, spatially-partitioned GeoParquet catalogs that:

Eliminate full table scans - Query only the relevant spatial partitions by pruning Hive-style partitions before any data is read
Scale to terabytes - Each partition is independently manageable
Support incremental updates - Add new data without rebuilding the whole catalog
Handle complex geometries - Smart global partitioning for multi-region items

What about Apache Sedona, GeoMesa, Iceberg or PostGIS?

We intentionally avoided introducing a heavier data management layer such as Apache Sedona, Apache Iceberg, or PostGIS. Our use case does not require a catalog or metadata service beyond the files themselves, and keeping the system file-centric significantly reduces complexity. Direct reads from Parquet provide faster access by eliminating additional metadata lookups, while also allowing the system to remain truly serverless, with no long-running services to deploy or maintain.

The dataset already benefits from spatial partitioning, embedded statistics, and automatic schema generation, which cover the primary performance and discovery needs. STAC items are designed for immutable, versioned assets and do not require capabilities like time travel or complex schema evolution that motivate more sophisticated table formats. In this context, a simpler Parquet-based approach is both sufficient and operationally preferable.

Key Features

Smart Spatial Partitioning: Multiple grid systems (H3, S2, UTM, MGRS, LatLon, custom GeoJSON)
Global Partition Schema: Auto-routes large/complex geometries to global partitions
Temporal Binning: Year, month, or day-based time partitioning
Distributed Processing: Local multi-threading or Dask distributed
Incremental Updates: Merge new data with existing partitions
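
As a rough illustration of the incremental-update idea, the sketch below merges a batch of new items into the items already stored in a partition and lets a re-ingested item replace its older copy. The helper name `merge_items` and the use of `id` as the deduplication key are assumptions made for this example; EarthCatalog's actual merge logic may differ, and in practice the rows would come from a partition's Parquet file rather than in-memory dicts.

```python
def merge_items(existing, new, key="id"):
    """Merge two lists of item dicts; newer items replace older ones (hypothetical helper)."""
    by_key = {item[key]: item for item in existing}
    for item in new:
        # A re-ingested item overwrites the older copy with the same key
        by_key[item[key]] = item
    return list(by_key.values())


existing = [{"id": "a", "cloud_cover": 10}, {"id": "b", "cloud_cover": 20}]
new = [{"id": "b", "cloud_cover": 25}, {"id": "c", "cloud_cover": 30}]
merged = merge_items(existing, new)
print([item["id"] for item in merged])
```

Only the partitions touched by the new batch would need to be rewritten this way, which is what makes the update incremental.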

Quick Start

Installation

pip install earthcatalog

# With distributed processing support
pip install "earthcatalog[dask]"

Basic Usage

# Process STAC URLs into a spatial catalog.
# Schema metadata for efficient querying is generated by default.
stac-ingest \
  --input stac_urls.parquet \
  --output ./catalog \
  --scratch ./scratch \
  --workers 4

Example: Create Input Data

import pandas as pd

# Sample STAC item URLs
urls = [
    "https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/items/S2A_20240101_123456",
    "https://earth-search.aws.element84.com/v1/collections/landsat-8-c2-l2/items/LC08_20240103_345678",
]

df = pd.DataFrame({"url": urls})
df.to_parquet("stac_urls.parquet", index=False)

Configuration Examples

# Use S2 grid with daily partitioning
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
  --grid s2 --grid-resolution 13 --temporal-bin day

# Enable global partitioning with custom thresholds
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
  --global-thresholds-file custom-thresholds.json

# Distributed processing with Dask
stac-ingest --input s3://bucket/urls.parquet --output s3://bucket/catalog \
  --scratch s3://bucket/scratch --processor dask --workers 16

Example: Efficient Spatial Queries

# Traditional approach (slow - scans entire catalog)
import geopandas as gpd
from shapely.geometry import box

roi = box(-122.5, 37.7, -122.0, 38.0)  # San Francisco area
df = gpd.read_parquet("catalog/**/*.parquet")  # Reads EVERYTHING
results = df[df.intersects(roi)]
print(f"Found {len(results)} items (but scanned entire catalog)")

# EarthCatalog approach (fast - scans only relevant partitions)
from earthcatalog.spatial_resolver import spatial_resolver
import duckdb

resolver = spatial_resolver("catalog/catalog_schema.json")
partitions = resolver.resolve_partitions(roi)
paths = resolver.generate_query_paths(partitions)

result = duckdb.sql(f"SELECT * FROM read_parquet({paths})").df()
print(f"Found {len(result)} items (scanned only {len(partitions)} partitions)")

# Remote schema files (S3, GCS, Azure, HTTP) - requires fsspec
resolver = spatial_resolver("s3://my-bucket/catalog_schema.json", "s3://my-bucket/catalog/")
resolver = spatial_resolver("https://example.com/schema.json", "./local-catalog/")
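
The query step above interpolates the resolved paths directly into the SQL string. A slightly more defensive variant is sketched below, assuming that `generate_query_paths()` returns a list of path strings (the return type is not stated here). The helper name `build_partition_query` and the example path are hypothetical.

```python
def build_partition_query(paths, columns="*"):
    """Return a DuckDB SQL string that reads only the given Parquet paths."""
    if not paths:
        raise ValueError("no partitions matched the query geometry")
    # Escape single quotes so each path is a valid SQL string literal
    quoted = ", ".join("'" + p.replace("'", "''") + "'" for p in paths)
    return f"SELECT {columns} FROM read_parquet([{quoted}])"


# Hypothetical path following the layout described under Output Structure
example_paths = [
    "catalog/sentinel-2/partition=h3/level=2/8928308280fffff/year=2024/month=01/items.parquet",
]
print(build_partition_query(example_paths))
```

The returned string can be passed to `duckdb.sql(...)` in place of the f-string shown above. Failing early on an empty path list avoids sending DuckDB a query with nothing to read.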

Output Structure

The catalog uses Hive-style temporal partitioning (year=.../month=...) so that DuckDB, Athena, and Spark can prune partitions at query time:

catalog/
├── {mission}/
│   └── partition=h3/
│       └── level=2/
│           ├── 8928308280fffff/
│           │   └── year=2024/
│           │       ├── month=01/
│           │       │   └── items.parquet  # January 2024 items
│           │       └── month=02/
│           │           └── items.parquet
│           └── global/
│               └── year=2024/
│                   └── month=01/
│                       └── items.parquet  # Large geometries spanning multiple cells
└── catalog_schema.json  # Generated metadata for efficient querying (enabled by default)
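
To make the layout concrete, here is a minimal sketch that builds the relative path of a partition file from its components. The function name `partition_path` is hypothetical, the mission name in the usage line is a placeholder, and the `day=` segment for daily binning is an assumption, since the tree above only shows `year=` and `month=`.

```python
from datetime import datetime, timezone


def partition_path(mission, grid, level, cell, when, temporal_bin="month"):
    """Build a relative partition path that mirrors the tree shown above."""
    if temporal_bin not in ("year", "month", "day"):
        raise ValueError(f"unsupported temporal bin: {temporal_bin}")
    parts = [mission, f"partition={grid}", f"level={level}", cell, f"year={when.year:04d}"]
    if temporal_bin in ("month", "day"):
        parts.append(f"month={when.month:02d}")
    if temporal_bin == "day":
        # Assumed naming for daily bins; not shown in the tree above
        parts.append(f"day={when.day:02d}")
    return "/".join(parts) + "/items.parquet"


acquired = datetime(2024, 1, 15, tzinfo=timezone.utc)
print(partition_path("sentinel-2", "h3", 2, "8928308280fffff", acquired))
print(partition_path("sentinel-2", "h3", 2, "global", acquired))
```

Passing `"global"` as the cell gives the path of the global partition for the same month.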

Schema Metadata and Efficient Querying

EarthCatalog generates comprehensive metadata about your catalog's partitioning scheme by default:

# Schema is generated by default
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch

# Use custom schema filename
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
  --schema-filename my_catalog_schema.json

# Disable schema generation
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
  --no-generate-schema

The generated schema includes:

Grid system details: Type, resolution, cell sizes, coordinate system
Partition structure: All spatial and temporal partitions created
Usage examples: DuckDB queries for efficient partition pruning
Statistics: Item counts, partition counts, processing info
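
The exact key names inside the generated file are not listed here, so the sketch below makes no assumption about them. It loads the schema with the standard library and reports whatever top-level keys it finds, assuming only that the file is a JSON object. The helper name `summarize_schema` is hypothetical.

```python
import json
from pathlib import Path


def summarize_schema(schema_path):
    """Return the top-level keys of a schema file and the type of each value."""
    with Path(schema_path).open(encoding="utf-8") as f:
        schema = json.load(f)
    return {key: type(value).__name__ for key, value in schema.items()}


schema_file = Path("catalog/catalog_schema.json")
if schema_file.exists():
    for key, kind in summarize_schema(schema_file).items():
        print(f"{key}: {kind}")
```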

Automatic Global Partition Detection

The resolver adds the global partition to the result automatically when the query geometry is large enough to need it:

# Threshold-based inclusion (queries spanning many cells include global)
large_area = box(-130, 30, -110, 50)  # Multi-state region
partitions = resolver.resolve_partitions(large_area)
# Includes 'global' because query spans > threshold cells

# Geography-based inclusion (continental-scale areas include global)
continental = box(-180, -60, 180, 80)  # Nearly global extent
partitions = resolver.resolve_partitions(continental)
# Includes 'global' because geometry area > large geometry threshold

# Manual control when needed
small_area = box(-122.5, 37.7, -122.0, 38.0)  # City-scale region
partitions_no_global = resolver.resolve_partitions(large_area, include_global=False)
partitions_force_global = resolver.resolve_partitions(small_area, include_global=True)
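
The two automatic rules can be pictured as a simple predicate. The sketch below is only an illustration of that idea: the function names and both default thresholds are invented for the example and are not EarthCatalog's real values, and area is measured crudely in square degrees of the bounding box.

```python
def bbox_area_deg2(bounds):
    """Area of a (min_lon, min_lat, max_lon, max_lat) box in square degrees."""
    min_lon, min_lat, max_lon, max_lat = bounds
    return (max_lon - min_lon) * (max_lat - min_lat)


def should_include_global(n_cells, query_area_deg2, max_cells=100, max_area_deg2=10_000.0):
    """Return True when a query is broad enough that global-partition items may match.

    max_cells and max_area_deg2 are placeholder thresholds for illustration.
    """
    return n_cells > max_cells or query_area_deg2 > max_area_deg2


continental = (-180, -60, 180, 80)
# Included because the area exceeds the (assumed) large-geometry threshold
print(should_include_global(n_cells=40, query_area_deg2=bbox_area_deg2(continental)))
```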

Remote Schema Files

The spatial_resolver() function supports schema files stored in cloud storage or remote locations:

from earthcatalog.spatial_resolver import spatial_resolver

# S3 (requires fsspec[s3])
resolver = spatial_resolver("s3://my-bucket/catalog_schema.json", "s3://my-bucket/catalog/")

# Google Cloud Storage (requires fsspec[gcs])
resolver = spatial_resolver("gs://my-bucket/catalog_schema.json", "gs://my-bucket/catalog/")

# Azure Blob Storage (requires fsspec[abfs], which installs adlfs)
resolver = spatial_resolver("abfs://container/catalog_schema.json", "abfs://container/catalog/")

# HTTP/HTTPS
resolver = spatial_resolver("https://example.com/catalog_schema.json", "./local-catalog/")

# Mixed: Remote schema with local catalog
resolver = spatial_resolver("s3://bucket/schema.json", "/local/catalog/")

Requirements:

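Install fsspec with the extra for your storage backend, quoting the requirement so the shell does not expand the brackets: pip install "fsspec[s3]", "fsspec[gcs]", or "fsspec[abfs]" (the fsspec extra that pulls in adlfs for Azure)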
The catalog_path parameter is required for remote schema files
Authentication follows fsspec conventions (AWS credentials, service accounts, etc.)

Grid-Specific Resolution

Key Benefits:

Automatic Resolution: No need to manually calculate grid intersections
All Grid Systems: Works with H3, S2, MGRS, UTM, LatLon, and custom GeoJSON
Configurable Overlap: Control boundary handling and buffer zones
Performance: Query only relevant partitions instead of full catalog scan
DuckDB Integration: Generates ready-to-use file path patterns
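
For intuition about what resolving grid intersections involves, the sketch below enumerates the cells of a regular latitude/longitude grid that touch a bounding box, with an optional buffer to mimic configurable overlap. It is a simplified stand-in written for this example, not the library's resolver: the function name, the cell indexing scheme, and the parameters are assumptions, and boxes that cross the antimeridian are not handled.

```python
import math


def latlon_cells(bounds, cell_deg=1.0, buffer_deg=0.0):
    """List (lon_index, lat_index) cells of a regular lat/lon grid touching a bbox."""
    min_lon, min_lat, max_lon, max_lat = bounds
    # Grow the box on every side so items near a cell boundary are not missed
    min_lon, min_lat = min_lon - buffer_deg, min_lat - buffer_deg
    max_lon, max_lat = max_lon + buffer_deg, max_lat + buffer_deg
    lon_range = range(math.floor(min_lon / cell_deg), math.floor(max_lon / cell_deg) + 1)
    lat_range = range(math.floor(min_lat / cell_deg), math.floor(max_lat / cell_deg) + 1)
    return [(i, j) for j in lat_range for i in lon_range]


san_francisco = (-122.5, 37.7, -122.0, 38.0)
print(latlon_cells(san_francisco))
print(len(latlon_cells(san_francisco, buffer_deg=0.6)))
```

A larger buffer returns more cells, which trades some extra scanning for fewer missed items along partition edges.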

⚡ Performance Benchmarks

Query Performance Comparison (San Francisco Bay Area query on global dataset):

Metric	Without Pruning	With Spatial Resolution	Improvement
Data Scanned	50GB+	6GB	88.5% reduction
Query Time	45 seconds	5.2 seconds	8.7x faster
Memory Usage	12GB	2.1GB	82% reduction
Files Read	15,000+	1,200	92% fewer files

Grid System Performance (typical regional query):

H3 Resolution 6: 8-12 cells → ~85-90% data reduction
MGRS 100km: 1-4 zones → ~95-98% data reduction
Custom GeoJSON: Variable based on tile design
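
The Improvement column in the query comparison can be recomputed from the other two columns. The sketch below does that arithmetic using the table's stated figures. Because "50GB+" and "15,000+" are lower bounds, the recomputed values are approximate and may differ slightly from the table.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


def speedup(before, after):
    """How many times faster `after` is than `before`."""
    return before / after


print(f"data scanned: {percent_reduction(50, 6):.1f}% less")
print(f"query time:   {speedup(45, 5.2):.1f}x faster")
print(f"memory usage: {percent_reduction(12, 2.1):.1f}% less")
print(f"files read:   {percent_reduction(15_000, 1_200):.1f}% fewer")
```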

Documentation

📖 Full Documentation - Complete guides and API reference
🏁 Quick Start Guide - Get up and running in minutes
⚙️ Configuration Guide - All configuration options
🌍 Global Partitioning - Handle large/complex geometries
🔧 API Reference - Python and CLI documentation

Contributing

# Development setup
git clone https://github.com/betolink/earthcatalog.git
cd earthcatalog
pip install -e ".[dev]"

# Run tests
python -m pytest

# Format and lint
black earthcatalog/ && ruff check earthcatalog/

License

MIT License - see LICENSE file for details.

Acknowledgements

This project was inspired by the need for efficient geospatial data management and builds upon the work of the open-source geospatial community. Special thanks to the developers of STAC, GeoParquet, H3, S2, and other foundational libraries that made this project possible.

Thanks to NSIDC DAAC, NASA ITS_LIVE and NASA Openscapes for supporting open data initiatives that drive innovation in geospatial data processing.

Project details

Release history

0.4.2: Feb 3, 2026
0.4.1: Feb 3, 2026 (this version)
0.4.0: Feb 3, 2026
0.3.0: Feb 3, 2026
0.2.0: Feb 3, 2026

Download files

Source Distribution

earthcatalog-0.4.1.tar.gz (220.1 kB)

Uploaded Feb 3, 2026 Source

Built Distribution

earthcatalog-0.4.1-py3-none-any.whl (240.0 kB)

Uploaded Feb 3, 2026 Python 3

File details

Details for the file earthcatalog-0.4.1.tar.gz.

File metadata

Download URL: earthcatalog-0.4.1.tar.gz
Upload date: Feb 3, 2026
Size: 220.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for earthcatalog-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`99b707cc16964c6236093e2b11bf6d97a498e9ef0e33acca64f976d7e35a24ee`
MD5	`fa4e8bbeabc14edaa1997705c51d3a42`
BLAKE2b-256	`d55068e91703509f2723a98a3395e3f8e5ca8e3637d4757ce8dda9a85fc72aa8`
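
One way to use these digests is to hash the downloaded file locally and compare the result with the published SHA256 value. The sketch below uses only the standard library. The helper names are hypothetical, and the file path in the comment is a placeholder for wherever the archive was saved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True when the file's SHA256 equals the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call (placeholder path):
# matches_published_hash(
#     "earthcatalog-0.4.1.tar.gz",
#     "99b707cc16964c6236093e2b11bf6d97a498e9ef0e33acca64f976d7e35a24ee",
# )
```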

Provenance

The following attestation bundles were made for earthcatalog-0.4.1.tar.gz:

Publisher: publish.yml on betolink/earthcatalog

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: earthcatalog-0.4.1.tar.gz
- Subject digest: 99b707cc16964c6236093e2b11bf6d97a498e9ef0e33acca64f976d7e35a24ee
- Sigstore transparency entry: 909286809
- Sigstore integration time: Feb 3, 2026
Source repository:
- Permalink: betolink/earthcatalog@2eba3e7ae1da4c35b43cadebcb9ff502178acf98
- Branch / Tag: refs/tags/v0.4.1
- Owner: https://github.com/betolink
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2eba3e7ae1da4c35b43cadebcb9ff502178acf98
- Trigger Event: push

File details

Details for the file earthcatalog-0.4.1-py3-none-any.whl.

File metadata

Download URL: earthcatalog-0.4.1-py3-none-any.whl
Upload date: Feb 3, 2026
Size: 240.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for earthcatalog-0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6861a56122496ba037c6ccd68863e35cba3183b7ae99e3f4d5615adbf2415c62`
MD5	`13186c72d2084c967300e645ce439cd9`
BLAKE2b-256	`56e6ccf0d730b132d0828bf4c646fa7f235d6100ac5f81ae95d59e40b20a3db2`

Provenance

The following attestation bundles were made for earthcatalog-0.4.1-py3-none-any.whl:

Publisher: publish.yml on betolink/earthcatalog

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: earthcatalog-0.4.1-py3-none-any.whl
- Subject digest: 6861a56122496ba037c6ccd68863e35cba3183b7ae99e3f4d5615adbf2415c62
- Sigstore transparency entry: 909286817
- Sigstore integration time: Feb 3, 2026
Source repository:
- Permalink: betolink/earthcatalog@2eba3e7ae1da4c35b43cadebcb9ff502178acf98
- Branch / Tag: refs/tags/v0.4.1
- Owner: https://github.com/betolink
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2eba3e7ae1da4c35b43cadebcb9ff502178acf98
- Trigger Event: push

earthcatalog 0.4.1
