
Index-first GeoTIFF access layer for ML and analysis, powered by queryable Parquet indexes.


🛰️ Rasteret

The AI practitioner's multiplier for cloud-native satellite data.
A high-performance rasterio/GDAL alternative for scalable ML workflows.

Rasteret helps you manage and read massive satellite imagery collections with zero friction.
It provides an up to 20x faster "drop-in" backend for TorchGeo and interoperates with Arrow-based tools such as DuckDB and Polars, while maintaining xarray and NumPy support.



Why Rasteret?

Geospatial data science is often 80% "plumbing." You spend hours writing STAC reading loops, manual ThreadPoolExecutor code, and fragile CRS-alignment logic just to get a batch of pixels for your model.

Rasteret reduces that 80% to a few lines of code.

It separates the Control Plane (managing your scenes, labels, and splits in a local Parquet index) from the Data Plane (streaming pixels directly from cloud COGs).
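The split can be pictured with a minimal sketch in plain Python. It is an illustration only, not Rasteret's implementation: `SceneRecord`, `select_scenes`, and the bucket URLs are hypothetical names made up for this example. The point is that scene selection touches only a local table, and only the surviving rows lead to remote pixel reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneRecord:
    """One row of a hypothetical scene index (the control plane)."""
    scene_id: str
    split: str          # "train", "val", or "test"
    cloud_cover: float  # percent
    cog_url: str        # where the pixels live (the data plane)


def select_scenes(index, split, max_cloud_cover):
    """Filter the index locally; this step needs no network access."""
    return [r for r in index if r.split == split and r.cloud_cover < max_cloud_cover]


# Made-up records standing in for a local index.
index = [
    SceneRecord("S2A_001", "train", 4.2, "s3://example-bucket/S2A_001_B04.tif"),
    SceneRecord("S2A_002", "train", 37.5, "s3://example-bucket/S2A_002_B04.tif"),
    SceneRecord("S2B_003", "val", 1.0, "s3://example-bucket/S2B_003_B04.tif"),
]

selected = select_scenes(index, split="train", max_cloud_cover=10)
urls_to_read = [r.cog_url for r in selected]  # only these would be streamed
print(urls_to_read)
```

In this toy version the index is a Python list; in the design described above that role is played by a Parquet table.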

The "Friction" vs. "Flow" Comparison

The Old Way (25+ lines of fragile plumbing):

  1. Search STAC catalog
  2. Loop over items
  3. Handle pagination
  4. Filter by cloud cover
  5. Wait 500ms per file to parse remote TIFF headers (GDAL cold start)
  6. Manage ThreadPoolExecutor manually
  7. Manually stack results and align CRS
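Steps 5 and 6 are where much of the hand-written plumbing tends to live. The sketch below uses only the standard library; `parse_remote_header` is a hypothetical stand-in for opening a remote GeoTIFF, and the 10 ms sleep merely simulates the per-file latency mentioned in step 5.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def parse_remote_header(url):
    """Hypothetical stand-in for opening a remote GeoTIFF and parsing its header."""
    time.sleep(0.01)  # simulated latency; step 5 above cites about 500 ms per file
    return {"url": url, "tile_size": 512}


# Placeholder URLs, not real data.
urls = [f"https://example.com/scene_{i}.tif" for i in range(8)]

# Fan the header reads out across threads; executor.map preserves input order.
with ThreadPoolExecutor(max_workers=4) as executor:
    headers = list(executor.map(parse_remote_header, urls))

print(len(headers), headers[0]["url"])
```

Threads hide some of the latency, but every new environment still pays for each header at least once.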

The Rasteret Way (3 lines of robust code):

import duckdb
import rasteret

# 1. Load or Build your collection
collection = rasteret.load("my_s2_experiment")

# 2. Query like a Table: "Give me the training scenes with <10% clouds"
filtered = collection.subset(split="train", cloud_cover_lt=10)
# OR send the Collection to DuckDB or Polars, enrich it with your own data, and bring it back to Rasteret.
# (ST_Intersects needs DuckDB's spatial extension; `plots` is your own table of labelled geometries.)
filtered = duckdb.sql("""
SELECT
    c.*,
    p.plot_id,
    p.is_target
FROM collection c
JOIN plots p ON ST_Intersects(c.geometry, p.geometry)
WHERE p.is_target = true AND c.cloud_cover_lt < 10
""")
filtered = rasteret.as_collection(filtered)

# 3. Batch Read: "Fetch aligned pixels for all geometries in the filtered collection"
data = filtered.get_numpy(geometries=filtered.geometry, bands=["B04", "B08"])

Key Features

  • 🚀 20x Faster Cold Starts: By caching tile-layout metadata locally, Rasteret jumps straight to the pixels and skips the expensive remote header parsing that otherwise recurs in every new environment.
  • 📦 Seamless "Drop-in" Backends: Boost TorchGeo or xarray performance by simply swapping the reader. No need to rewrite your analysis code.
  • 🧬 Relational Imagery: Store your labels, train/val/test splits, and custom metadata directly in the imagery index. No more separate CSVs.
  • 🛠️ Zero-Config Throughput: Automatic cloud-storage presigning with Obstore and custom async I/O handle the networking so you don't have to.
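To illustrate why cached tile-layout metadata helps, here is a minimal sketch, not Rasteret's actual code. It assumes a tiled GeoTIFF whose tile offsets and byte counts (the TIFF TileOffsets and TileByteCounts tags, stored row-major) are already available locally. `TileLayout` and `byte_ranges_for_window` are hypothetical names. Given a pixel window, the byte ranges to request follow from arithmetic alone, with no remote header read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileLayout:
    """Hypothetical cached layout for one tiled GeoTIFF band."""
    image_width: int
    image_height: int
    tile_width: int
    tile_height: int
    tile_offsets: tuple      # byte offset of each tile, row-major
    tile_byte_counts: tuple  # compressed size of each tile, row-major


def byte_ranges_for_window(layout, col_off, row_off, width, height):
    """Return (start, end_exclusive) byte ranges of the tiles overlapping a pixel window."""
    tiles_across = -(-layout.image_width // layout.tile_width)  # ceiling division
    first_col = col_off // layout.tile_width
    last_col = (col_off + width - 1) // layout.tile_width
    first_row = row_off // layout.tile_height
    last_row = (row_off + height - 1) // layout.tile_height
    ranges = []
    for tile_row in range(first_row, last_row + 1):
        for tile_col in range(first_col, last_col + 1):
            idx = tile_row * tiles_across + tile_col
            start = layout.tile_offsets[idx]
            ranges.append((start, start + layout.tile_byte_counts[idx]))
    return ranges


# A 1024x1024 image in 512x512 tiles: four tiles of 1000 bytes after a 100-byte header.
layout = TileLayout(1024, 1024, 512, 512,
                    tile_offsets=(100, 1100, 2100, 3100),
                    tile_byte_counts=(1000, 1000, 1000, 1000))

ranges = byte_ranges_for_window(layout, col_off=600, row_off=100, width=200, height=200)
print(ranges)
```

Each returned range maps onto one HTTP range request. Decompressing and mosaicking the tiles is left out of the sketch.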

Performance

Rasteret's claims are backed by rigorous, reproducible benchmarks. We measure across three dimensions: cold-start latency, cloud-native scale, and comparison against legacy "data-inside-parquet" patterns.

1. Cold-start comparison with TorchGeo

Same AOIs, same scenes, same sampler, same DataLoader. Rasteret eliminates the "cold start tax" by caching IFD headers in the local Parquet index.

Scenario                 rasterio/GDAL (Standard)   Rasteret (Index-First)   Speedup
Single AOI, 15 scenes    9.08 s                     1.14 s                   8x
Multi-AOI, 30 scenes     42.05 s                    2.25 s                   19x
Cross-CRS boundary       12.47 s                    0.59 s                   21x
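The speedup column is the baseline time divided by the Rasteret time, rounded to the nearest integer. A short check using only the numbers in the table above:

```python
# (rasterio/GDAL seconds, Rasteret seconds), copied from the table above.
timings = {
    "Single AOI, 15 scenes": (9.08, 1.14),
    "Multi-AOI, 30 scenes": (42.05, 2.25),
    "Cross-CRS boundary": (12.47, 0.59),
}

speedups = {
    scenario: round(baseline / indexed)
    for scenario, (baseline, indexed) in timings.items()
}

for scenario, factor in speedups.items():
    print(f"{scenario}: {factor}x")
```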

[Figures: processing time comparison; speedup breakdown]

2. The Cloud vs. Edge Comparison

How does Rasteret stack up against Google Earth Engine (GEE) or a highly parallelized Rasterio setup for time-series extraction?

Library                  First Run (Cold)   Subsequent Runs (Hot)
Rasterio + ThreadPool    32 s               24 s
Google Earth Engine      10–30 s            3–5 s
Rasteret                 3 s                3 s

[Figure: single request performance]

3. HuggingFace MajorTOM vs. Rasteret

Recent "images-inside-Parquet" approaches (like MajorTOM) store image bytes in Parquet files. Rasteret keeps imagery in cloud COGs and uses Parquet as a high-performance index, delivering better throughput without the data-movement overhead.

Patches   HF datasets (streaming)   Rasteret index+COGs   Speedup
120       46.83 s                   12.09 s               3.88x
1000      771.59 s                  118.69 s              6.50x

[Figure: HF vs Rasteret speedup]

All numbers were measured on a 4-CPU AWS machine in us-west-2 (the same region as the data), against cold-start GDAL.


Technical Deep Dives

For the full architectural rationale, methodology, and reproducibility scripts, see:

STAC API / GeoParquet  -->  Parquet Collection  -->  Tile-level byte reads
       (once)                  (queryable)             (no GDAL hot path)
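The "(once)" stage of this pipeline can be sketched as a build-or-load pattern. This is a simplified illustration that uses a JSON file as a stand-in for the Parquet collection. `build_index` and `load_or_build` are hypothetical helpers, not Rasteret functions, and the records are made up.

```python
import json
import tempfile
from pathlib import Path

build_log = []  # records every expensive build, so reuse is visible


def build_index():
    """Hypothetical expensive step: in practice a catalog search plus header parsing."""
    build_log.append("built")
    return [
        {"scene_id": "S2A_001", "cloud_cover": 4.2},
        {"scene_id": "S2A_002", "cloud_cover": 37.5},
    ]


def load_or_build(index_path):
    """Reuse the local index if it exists; otherwise build it once and persist it."""
    if index_path.exists():
        return json.loads(index_path.read_text())
    records = build_index()
    index_path.write_text(json.dumps(records))
    return records


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "collection_index.json"
    first = load_or_build(path)   # builds and writes the index
    second = load_or_build(path)  # served from the local file
    clear = [r["scene_id"] for r in second if r["cloud_cover"] < 10]  # the "queryable" stage

print(len(build_log), clear)
```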

Quick Start

1. Build a Collection

import rasteret

# Build from any STAC API or Parquet Metadata table
collection = rasteret.build(
    "earthsearch/sentinel-2-l2a",
    name="s2_training",
    bbox=(77.5, 12.9, 77.7, 13.1),
    date_range=("2024-01-01", "2024-06-30")
)

2. Turbocharge your ML (TorchGeo)

Rasteret provides a high-performance backend that honors the GeoDataset contract.

from torch.utils.data import DataLoader
from torchgeo.samplers import RandomGeoSampler

# Same API as TorchGeo, much faster pixel pipe
dataset = collection.to_torchgeo_dataset(bands=["B04", "B08"], chip_size=256)

sampler = RandomGeoSampler(dataset, size=256, length=100)
loader  = DataLoader(dataset, sampler=sampler, batch_size=4)

3. Fast Xarray creation

ds = collection.get_xarray(geometries=my_aoi, bands=["B04", "B08"])
ndvi = (ds.B08 - ds.B04) / (ds.B08 + ds.B04)
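For reference, NDVI is (NIR - Red) / (NIR + Red); for Sentinel-2, B08 is the near-infrared band and B04 is the red band. The per-pixel sketch below is plain Python and independent of Rasteret. The helper name is chosen to avoid clashing with the `ndvi` variable above, and returning `None` when the denominator is zero is an illustrative choice, not library behavior. If the bands arrive as unsigned integers (an assumption, not checked against what `get_xarray` returns), casting to float before subtracting avoids wraparound.

```python
def ndvi_pixel(red, nir):
    """Normalized difference vegetation index for one pixel; None when undefined."""
    red, nir = float(red), float(nir)  # floats avoid unsigned-integer wraparound
    total = nir + red
    if total == 0:
        return None  # both bands are zero, so the ratio is undefined
    return (nir - red) / total


print(ndvi_pixel(red=1000, nir=3000))  # strong NIR response, typical of vegetation
print(ndvi_pixel(red=0, nir=0))        # undefined
```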

Key Entry Points

Rasteret is built for flexibility. Choose the output format that fits your existing workflow:

Method                  Output               Purpose
to_torchgeo_dataset()   RasteretGeoDataset   Drop-in high-performance backend for TorchGeo training.
get_xarray()            xarray.Dataset       Quickly create an xarray Dataset for analysis.
get_numpy()             numpy.ndarray        Raw pixel arrays ([N, C, H, W]) directly.
get_gdf()               GeoDataFrame         Metadata and pixel arrays as a standard GeoPandas GeoDataFrame.
sample_points()         DataFrame            Exact pixel values at point geometries, with a configurable fallback for nodata pixels.
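The table does not say how the nodata fallback in sample_points() works. One plausible reading, shown here purely as an assumption, is a nearest-valid-neighbour search within a configurable radius. `sample_with_fallback`, `max_radius`, and the tiny grid are hypothetical and do not reflect Rasteret's actual behavior or parameters.

```python
def sample_with_fallback(grid, row, col, nodata, max_radius=1):
    """Return grid[row][col], or the nearest valid value within max_radius, else None."""
    n_rows, n_cols = len(grid), len(grid[0])
    if grid[row][col] != nodata:
        return grid[row][col]
    for radius in range(1, max_radius + 1):
        candidates = []
        for r in range(max(0, row - radius), min(n_rows, row + radius + 1)):
            for c in range(max(0, col - radius), min(n_cols, col + radius + 1)):
                if grid[r][c] != nodata:
                    dist2 = (r - row) ** 2 + (c - col) ** 2
                    candidates.append((dist2, r, c))
        if candidates:
            # Closest valid pixel wins; ties go to the smaller row, then column.
            _, r, c = min(candidates)
            return grid[r][c]
    return None


grid = [
    [0, 0, 7],
    [0, 0, 0],
    [5, 0, 0],
]
center = sample_with_fallback(grid, row=1, col=1, nodata=0, max_radius=1)
print(center)
```

Whether a fallback like this is appropriate depends on the analysis: it trades exactness at the point for fewer missing samples.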

Full documentation is at terrafloww.github.io/rasteret.

License

Code: Apache-2.0
