Index-first GeoTIFF access layer for ML and analysis, powered by queryable Parquet indexes.

🛰️ Rasteret

The AI practitioner's multiplier for cloud-native satellite data.
A high-performance rasterio/GDAL alternative for scalable ML workflows.

Rasteret helps you manage and read massive satellite imagery collections with zero friction.
It provides a high-performance "drop-in" backend for **TorchGeo**, **xarray**, and **NumPy** that is up to 20x faster than traditional GDAL-based workflows.


Why Rasteret?

Geospatial data science is often 80% "plumbing." You spend hours writing pystac-client loops, manual ThreadPoolExecutor code, and fragile CRS-alignment logic just to get a batch of pixels for your model.

Rasteret reduces that plumbing to a few lines of code.

It separates the Control Plane (managing your scenes, labels, and splits in a local Parquet index) from the Data Plane (streaming pixels directly from cloud COGs).
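
A minimal sketch of this split, using only the standard library. The record fields (`scene_id`, `split`, `cloud_cover`, `url`) and the `select_scenes` helper are hypothetical and do not reflect Rasteret's actual schema or API. The point is only that filtering touches local metadata, and network reads happen afterwards for the rows that survive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneRecord:
    """One row of a hypothetical local index (control plane)."""
    scene_id: str
    split: str          # "train", "val", or "test"
    cloud_cover: float  # percent
    url: str            # where the COG lives (data plane)


def select_scenes(index, split, max_cloud_cover):
    """Filter index rows locally; no network access is involved."""
    return [
        rec for rec in index
        if rec.split == split and rec.cloud_cover < max_cloud_cover
    ]


index = [
    SceneRecord("scene-a", "train", 4.2, "s3://example-bucket/a.tif"),
    SceneRecord("scene-b", "train", 37.0, "s3://example-bucket/b.tif"),
    SceneRecord("scene-c", "val", 1.5, "s3://example-bucket/c.tif"),
]

selected = select_scenes(index, split="train", max_cloud_cover=10)
urls_to_read = [rec.url for rec in selected]  # only these would be fetched
print(urls_to_read)
```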

The "Friction" vs. "Flow" Comparison

The Old Way (25+ lines of fragile plumbing):

  1. Search STAC catalog ✅
  2. Loop over items ✅
  3. Handle pagination ✅
  4. Filter by cloud cover ✅
  5. Wait 500ms per file to parse remote TIFF headers (GDAL cold start) ❌
  6. Manage ThreadPoolExecutor manually ❌
  7. Manually stack results and align CRS ❌
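
To give a feel for step 6, here is a rough sketch of the hand-rolled thread pool. `read_scene` is a hypothetical stub: in a real workflow it would open the file with rasterio and pay the header-parsing cost from step 5, which the short sleep only imitates.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_scene(url, header_latency_s=0.01):
    """Stand-in for a remote read; the sleep mimics per-file header parsing."""
    time.sleep(header_latency_s)
    return {"url": url, "pixels": [[0, 0], [0, 0]]}


def read_all(urls, max_workers=4):
    """Fan reads out over a thread pool and keep results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_scene, urls))


scene_urls = [f"https://example.com/scene_{i}.tif" for i in range(8)]
results = read_all(scene_urls)
print(len(results), results[0]["url"])
```

Stacking the results and aligning CRS (step 7) would still have to be written on top of this.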

The Rasteret Way (3 lines of robust code):

import rasteret

# 1. Load or Build your collection (Index is local, metadata is relational)
collection = rasteret.load("my_s2_experiment")

# 2. Query like a Table: "Give me the training scenes with <10% clouds"
filtered = collection.subset(split="train", cloud_cover_lt=10)

# 3. Batch Read: "Fetch aligned pixels for these 1000 polygons"
data = filtered.get_numpy(geometries=my_polygons, bands=["B04", "B08"])

Key Features

  • 🚀 20x Faster Cold Starts: By caching tile-layout metadata locally, Rasteret jumps straight to the pixels and skips the expensive remote header parsing that otherwise repeats in every new environment.
  • 📦 Seamless "Drop-in" Backends: Boost TorchGeo or xarray performance by simply swapping the reader. No need to rewrite your training code.
  • 🧬 Relational Imagery: Store your labels, train/val/test splits, and custom metadata directly in the imagery index. No more separate CSVs.
  • 🛠️ Zero-Config Throughput: Automatic cloud-storage presigning via Obstore and custom async I/O handle the networking so you don't have to.
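
As an illustration of the async I/O idea in the last bullet, the sketch below runs many range reads concurrently with a cap on in-flight requests. It is not Rasteret's I/O code: `fetch_range`, `fetch_all`, and the `max_in_flight` parameter are hypothetical, and the network call is simulated with a short sleep.

```python
import asyncio


async def fetch_range(url, start, length):
    """Stand-in for an HTTP range request; returns dummy bytes."""
    await asyncio.sleep(0.001)  # simulated network latency
    return bytes(length)


async def fetch_all(requests, max_in_flight=16):
    """Run many range reads concurrently, capped by a semaphore."""
    limit = asyncio.Semaphore(max_in_flight)

    async def bounded(req):
        async with limit:
            return await fetch_range(*req)

    # gather preserves the order of the input requests
    return await asyncio.gather(*(bounded(r) for r in requests))


requests = [("https://example.com/a.tif", i * 1024, 1024) for i in range(100)]
chunks = asyncio.run(fetch_all(requests))
print(len(chunks), len(chunks[0]))
```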

Performance

Rasteret's claims are backed by reproducible benchmarks. We measure three things: cold-start latency, cloud-native scale, and throughput against "images-inside-Parquet" patterns.

1. Cold-start comparison with TorchGeo

Same AOIs, same scenes, same sampler, same DataLoader. Rasteret eliminates the "cold start tax" by caching IFD headers in the local Parquet index.

| Scenario | rasterio/GDAL (Standard) | Rasteret (Index-First) | Speedup |
|---|---|---|---|
| Single AOI, 15 scenes | 9.08 s | 1.14 s | 8x |
| Multi-AOI, 30 scenes | 42.05 s | 2.25 s | 19x |
| Cross-CRS boundary | 12.47 s | 0.59 s | 21x |
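
The speedup column is the baseline time divided by the Rasteret time, rounded to the nearest integer. A small check using the timings from the table:

```python
# (scenario, rasterio/GDAL seconds, Rasteret seconds), copied from the table above
timings = [
    ("Single AOI, 15 scenes", 9.08, 1.14),
    ("Multi-AOI, 30 scenes", 42.05, 2.25),
    ("Cross-CRS boundary", 12.47, 0.59),
]

speedups = {name: baseline / rasteret for name, baseline, rasteret in timings}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```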

[Figures: processing time comparison; speedup breakdown]

2. The Cloud vs. Edge Comparison

How does Rasteret stack up against Google Earth Engine (GEE) or a highly parallelized Rasterio setup for time-series extraction?

| Library | First Run (Cold) | Subsequent Runs (Hot) |
|---|---|---|
| Rasterio + ThreadPool | 32 s | 24 s |
| Google Earth Engine | 10–30 s | 3–5 s |
| Rasteret | 3 s | 3 s |

[Figure: single request performance]

3. HuggingFace MajorTOM vs. Rasteret

Recent "images-inside-Parquet" approaches (like MajorTOM) store image bytes in Parquet files. Rasteret keeps imagery in cloud COGs and uses Parquet only as a high-performance index, delivering better throughput without the data-movement overhead.

| Patches | HF datasets (streaming) | Rasteret index+COGs | Speedup |
|---|---|---|---|
| 120 | 46.83 s | 12.09 s | 3.88x |
| 1000 | 771.59 s | 118.69 s | 6.50x |
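
Dividing patch count by wall time turns these rows into throughput, which helps explain why the speedup grows with the larger run. This is arithmetic on the reported timings only, not a new measurement:

```python
# (patches, HF datasets streaming seconds, Rasteret seconds), from the table above
runs = [
    (120, 46.83, 12.09),
    (1000, 771.59, 118.69),
]

rates = []
for patches, hf_s, rasteret_s in runs:
    hf_rate = patches / hf_s              # patches per second, HF streaming
    rasteret_rate = patches / rasteret_s  # patches per second, Rasteret
    rates.append((patches, hf_rate, rasteret_rate))
    print(f"{patches} patches: HF {hf_rate:.2f}/s, Rasteret {rasteret_rate:.2f}/s")
```

By this measure the streaming baseline drops from about 2.6 to about 1.3 patches per second between the two runs, while Rasteret goes from about 9.9 to about 8.4.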

[Figure: HF vs Rasteret speedup]

All numbers were measured on a 4-CPU AWS machine in us-west-2 (the same region as the data), with GDAL baselines measured from a cold start.


Technical Deep Dives

For the full architectural rationale, methodology, and reproducibility scripts, see the documentation. In outline, the pipeline is:

STAC API / GeoParquet  -->  Parquet Collection  -->  Tile-level byte reads
       (once)                  (queryable)             (no GDAL hot path)
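
To make "tile-level byte reads" concrete: once a TIFF's tile offsets and byte counts are cached, the byte range for the tile covering a pixel can be computed without opening the file. The sketch below assumes a single-band tiled TIFF. The function name and the sample offsets are hypothetical, and this is not Rasteret's implementation.

```python
import math


def tile_byte_range(col, row, image_width, tile_width, tile_height,
                    tile_offsets, tile_byte_counts):
    """Return (start, end) byte positions (inclusive) of the tile covering a pixel.

    TIFF stores tiles in row-major order, so the tile index is
    tile_row * tiles_across + tile_col.
    """
    tiles_across = math.ceil(image_width / tile_width)
    tile_index = (row // tile_height) * tiles_across + (col // tile_width)
    start = tile_offsets[tile_index]
    end = start + tile_byte_counts[tile_index] - 1
    return start, end


# Hypothetical cached metadata for a 1024x1024 image with 512x512 tiles
offsets = [1000, 5000, 9000, 13000]
byte_counts = [4000, 4000, 4000, 3500]

start, end = tile_byte_range(col=700, row=600, image_width=1024,
                             tile_width=512, tile_height=512,
                             tile_offsets=offsets, tile_byte_counts=byte_counts)
range_header = {"Range": f"bytes={start}-{end}"}  # header for an HTTP range request
print(range_header)
```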

Quick Start

1. Build a Collection

import rasteret

# Build from any STAC API or Parquet Metadata table
collection = rasteret.build(
    "earthsearch/sentinel-2-l2a",
    name="s2_training",
    bbox=(77.5, 12.9, 77.7, 13.1),
    date_range=("2024-01-01", "2024-06-30")
)

2. Turbocharge your ML (TorchGeo)

Rasteret provides a high-performance backend that honors the GeoDataset contract.

from torch.utils.data import DataLoader
from torchgeo.samplers import RandomGeoSampler

# Same API as TorchGeo, much faster pixel pipe
dataset = collection.to_torchgeo_dataset(bands=["B04", "B08"], chip_size=256)

sampler = RandomGeoSampler(dataset, size=256, length=100)
loader  = DataLoader(dataset, sampler=sampler, batch_size=4)

3. Fast Xarray creation

ds = collection.get_xarray(geometries=my_aoi, bands=["B04", "B08"])
ndvi = (ds.B08 - ds.B04) / (ds.B08 + ds.B04)
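
The second line above is the NDVI formula, (NIR - Red) / (NIR + Red), with B08 as near-infrared and B04 as red. A scalar sketch that makes the undefined case explicit; the sample reflectance values are made up:

```python
def ndvi(nir, red):
    """Normalized difference of two reflectance values; None when undefined."""
    total = nir + red
    if total == 0:
        return None  # both bands are zero (e.g. nodata), so the ratio is undefined
    return (nir - red) / total


# Hypothetical reflectance pairs (B08 = NIR, B04 = red)
samples = [(0.45, 0.05), (0.20, 0.20), (0.0, 0.0)]
values = [ndvi(nir, red) for nir, red in samples]
print(values)
```

If the bands arrive as unsigned integers (Sentinel-2 L2A is distributed as uint16), it may be worth casting to float before subtracting, since unsigned subtraction can wrap around.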

Key Entry Points

Rasteret is built for flexibility. Choose the output format that fits your existing workflow:

| Method | Output | Purpose |
|---|---|---|
| to_torchgeo_dataset() | RasteretGeoDataset | Drop-in high-performance backend for TorchGeo training. |
| get_xarray() | xarray.Dataset | Quickly build an xarray Dataset for analysis. |
| get_numpy() | numpy.ndarray | Raw pixel arrays ([N, C, H, W]) directly. |
| get_gdf() | GeoDataFrame | Metadata and pixel arrays as a standard geopandas dataframe. |
| sample_points() | DataFrame | Exact pixel values at point geometries, with a configurable fallback for nodata pixels. |
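
Point sampling of this kind rests on mapping a coordinate to a pixel row and column. The sketch below shows that mapping for a north-up raster with no rotation, assuming the point is already in the raster's CRS. `point_to_pixel` and the grid values are hypothetical and say nothing about how `sample_points()` is implemented.

```python
import math


def point_to_pixel(x, y, origin_x, origin_y, pixel_width, pixel_height):
    """Map a map coordinate to (row, col) for a north-up raster.

    origin_x/origin_y is the top-left corner of the top-left pixel, and
    pixel_height is positive (rows increase as y decreases).
    """
    col = math.floor((x - origin_x) / pixel_width)
    row = math.floor((origin_y - y) / pixel_height)
    return row, col


# Hypothetical 10 m grid whose top-left corner is at (600000, 1500000)
row, col = point_to_pixel(x=600025.0, y=1499965.0,
                          origin_x=600000.0, origin_y=1500000.0,
                          pixel_width=10.0, pixel_height=10.0)
print(row, col)
```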

Full documentation: terrafloww.github.io/rasteret

License

Code: Apache-2.0
