
Index-first GeoTIFF access layer for ML and analysis, powered by queryable Parquet indexes.

🛰️ Rasteret

The AI practitioner's multiplier for cloud-native satellite data.
A high-performance rasterio/GDAL alternative for scalable ML workflows.

Rasteret helps you manage and read massive satellite imagery collections with zero friction.
It provides an up to 20x faster "drop-in" backend for TorchGeo and interoperates with Arrow-based tools such as DuckDB and Polars, while keeping xarray and NumPy support.


Why Rasteret?

Geospatial data science is often 80% "plumbing." You spend hours writing STAC reading loops, hand-rolled ThreadPoolExecutor code, and fragile CRS-alignment logic just to get a batch of pixels for your model.

Rasteret turns that 80% into a few lines of code.

It separates the Control Plane (managing your scenes, labels, and splits in a local Parquet index) from the Data Plane (streaming pixels directly from cloud COGs).
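
As a rough illustration of the control-plane idea, the sketch below uses Python's built-in sqlite3 as a stand-in for a Parquet index. The table layout, the column names (scene_id, split, cloud_cover, href), and the URLs are made up for the example and are not Rasteret's schema. The only point is that scene selection is a table query, and pixel reads happen later against the selected URLs.

```python
import sqlite3

# Hypothetical scene index: one row per scene, pixels stay in remote COGs.
scenes = [
    ("S2A_001", "train", 4.2, "s3://example-bucket/S2A_001/B04.tif"),
    ("S2A_002", "train", 37.5, "s3://example-bucket/S2A_002/B04.tif"),
    ("S2B_003", "val", 1.0, "s3://example-bucket/S2B_003/B04.tif"),
    ("S2B_004", "train", 8.9, "s3://example-bucket/S2B_004/B04.tif"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE scenes (scene_id TEXT, split TEXT, cloud_cover REAL, href TEXT)"
)
con.executemany("INSERT INTO scenes VALUES (?, ?, ?, ?)", scenes)

# Control plane: choose scenes with an ordinary query, no pixel I/O involved.
selected = con.execute(
    "SELECT scene_id, href FROM scenes "
    "WHERE split = 'train' AND cloud_cover < 10 ORDER BY scene_id"
).fetchall()

# Data plane (not shown): only these hrefs would be opened for pixel reads.
selected_ids = [scene_id for scene_id, _ in selected]
print(selected_ids)
```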

The "Friction" vs. "Flow" Comparison

The Old Way (25+ lines of fragile plumbing):

  1. Search STAC catalog
  2. Loop over items
  3. Handle pagination
  4. Filter by cloud cover
  5. Wait 500ms per file to parse remote TIFF headers (GDAL cold start)
  6. Manage ThreadPoolExecutor manually
  7. Manually stack results and align CRS
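
To give a feel for steps 5 and 6, here is a minimal sketch of the hand-rolled thread-pool pattern. The reader function is a stand-in that sleeps instead of fetching anything, and the latency constant and URLs are invented for the example; a real version would also need the STAC search, pagination, and CRS handling from the other steps.

```python
import time
from concurrent.futures import ThreadPoolExecutor

HEADER_LATENCY_S = 0.01  # stand-in for the per-file remote header parse


def read_scene(url: str) -> dict:
    """Hypothetical reader: pay the header cost, then return fake pixels."""
    time.sleep(HEADER_LATENCY_S)  # simulated cold header fetch
    return {"url": url, "pixels": [[0, 0], [0, 0]]}


def read_all(urls: list[str], max_workers: int = 4) -> list[dict]:
    # Executor.map preserves input order, so results line up with urls.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_scene, urls))


urls = [f"https://example.com/scene_{i}.tif" for i in range(8)]
results = read_all(urls)
print(len(results), results[0]["url"])
```

Even in this toy form, every file pays the header latency before any pixel arrives, which is the cost the index-first approach is meant to remove.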

The Rasteret Way (3 lines of robust code):

import duckdb
import rasteret

# 1. Load or build your collection
collection = rasteret.load("my_s2_experiment")

# 2. Query like a table: "Give me the training scenes with <10% clouds"
filtered = collection.subset(split="train", cloud_cover_lt=10)

# OR hand the collection to DuckDB or Polars, enrich it with your own data,
# and bring the result back to Rasteret.
# `plots` is your own table of plot geometries.
# ST_Intersects comes from DuckDB's spatial extension, so load it first.
duckdb.sql("INSTALL spatial")
duckdb.sql("LOAD spatial")
filtered = duckdb.sql("""
SELECT
    c.*,
    p.plot_id,
    p.is_target
FROM collection c
JOIN plots p ON ST_Intersects(c.geometry, p.geometry)
WHERE p.is_target = true AND c.cloud_cover_lt < 10
""")
filtered = rasteret.as_collection(filtered)

# 3. Batch read: "Fetch aligned pixels for all geometries in the filtered collection"
data = filtered.get_numpy(geometries=filtered.geometry, bands=["B04", "B08"])

Key Features

  • 🚀 Up to 20x Faster Cold Starts: By caching tile-layout metadata locally, Rasteret jumps straight to the pixels and skips the expensive remote header parsing that otherwise recurs in every new environment.
  • 📦 Seamless "Drop-in" Backends: Boost TorchGeo or xarray performance by simply swapping the reader. No need to rewrite your analysis code.
  • 🧬 Relational Imagery: Store your labels, train/val/test splits, and custom metadata directly in the imagery index. No more separate CSVs.
  • 🛠️ Zero-Config Throughput: Automatic cloud-storage presigning with Obstore and custom async I/O handle the networking so you don't have to.
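
To make the cold-start bullet concrete: a tiled GeoTIFF records per-tile byte offsets and byte counts in its header (the TIFF TileOffsets and TileByteCounts tags). Once those are cached, a pixel window maps to byte ranges with arithmetic alone. The sketch below assumes a single-band image with square tiles stored row-major; the function name and the toy offsets are illustrative and are not Rasteret's internals.

```python
import math


def tiles_for_window(col_off, row_off, width, height,
                     tile_size, image_width, offsets, byte_counts):
    """Return (tile_index, start_byte, end_byte) for tiles touching a window.

    offsets/byte_counts mimic cached TIFF TileOffsets/TileByteCounts,
    stored row-major. end_byte is inclusive, as in an HTTP Range header.
    """
    tiles_across = math.ceil(image_width / tile_size)
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size

    ranges = []
    for tile_row in range(first_row, last_row + 1):
        for tile_col in range(first_col, last_col + 1):
            idx = tile_row * tiles_across + tile_col
            start = offsets[idx]
            ranges.append((idx, start, start + byte_counts[idx] - 1))
    return ranges


# Toy 1024x1024 image with 512-pixel tiles: 4 tiles of 1000 bytes each.
offsets = [8, 1008, 2008, 3008]
byte_counts = [1000, 1000, 1000, 1000]

# A 100x100 window straddling the vertical tile boundary in the top row.
ranges = tiles_for_window(480, 0, 100, 100, 512, 1024, offsets, byte_counts)
print(ranges)
```

With the layout already on disk, no header request is needed before issuing the two range reads.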

Performance

Rasteret's claims are backed by rigorous, reproducible benchmarks. We measure across three dimensions: cold-start latency, cloud-native scale, and comparison against legacy "data-inside-parquet" patterns.

1. Cold-start comparison with TorchGeo

Same AOIs, same scenes, same sampler, same DataLoader. Rasteret eliminates the "cold start tax" by caching IFD headers in the local Parquet index.

| Scenario | rasterio/GDAL (Standard) | Rasteret (Index-First) | Speedup |
|---|---|---|---|
| Single AOI, 15 scenes | 9.08 s | 1.14 s | 8x |
| Multi-AOI, 30 scenes | 42.05 s | 2.25 s | 19x |
| Cross-CRS boundary | 12.47 s | 0.59 s | 21x |

[Figures: processing time comparison; speedup breakdown]
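
The speedup column is simply the ratio of the two timings, rounded to the nearest integer. A quick check using the numbers from the table above:

```python
# (scenario, rasterio/GDAL seconds, Rasteret seconds) from the table above
timings = [
    ("Single AOI, 15 scenes", 9.08, 1.14),
    ("Multi-AOI, 30 scenes", 42.05, 2.25),
    ("Cross-CRS boundary", 12.47, 0.59),
]

# Speedup = baseline time / Rasteret time, rounded to the nearest integer.
speedups = {name: round(base / fast) for name, base, fast in timings}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```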

2. The Cloud vs. Edge Comparison

How does Rasteret stack up against Google Earth Engine (GEE) or a highly parallelized Rasterio setup for time-series extraction?

| Library | First Run (Cold) | Subsequent Runs (Hot) |
|---|---|---|
| Rasterio + ThreadPool | 32 s | 24 s |
| Google Earth Engine | 10–30 s | 3–5 s |
| Rasteret | 3 s | 3 s |

[Figure: single request performance]

3. HuggingFace MajorTOM vs. Rasteret

Recent "images-inside-Parquet" approaches (like MajorTOM) store image bytes directly in Parquet files. Rasteret keeps imagery in cloud COGs and uses Parquet only as a high-performance index, which delivers better throughput without the data-movement overhead.

| Patches | HF datasets (streaming) | Rasteret index+COGs | Speedup |
|---|---|---|---|
| 120 | 46.83 s | 12.09 s | 3.88x |
| 1000 | 771.59 s | 118.69 s | 6.50x |

[Figure: HF vs Rasteret speedup]
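
Dividing the reported totals by the patch count gives a per-patch view of the same comparison. This is only arithmetic on the numbers above, not an additional measurement:

```python
# (patches, HF datasets streaming seconds, Rasteret seconds) from the table above
runs = [(120, 46.83, 12.09), (1000, 771.59, 118.69)]

per_patch = {
    n: {"hf": hf / n, "rasteret": rs / n, "speedup": hf / rs}
    for n, hf, rs in runs
}

for n, row in per_patch.items():
    print(
        f"{n:>5} patches: HF {row['hf']:.3f} s/patch, "
        f"Rasteret {row['rasteret']:.3f} s/patch, {row['speedup']:.2f}x"
    )
```

Ratios recomputed from the rounded totals can differ from the table in the last digit.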

All numbers were measured on a 4-CPU AWS machine in us-west-2 (the same region as the data), against cold-start GDAL.


Technical Deep Dives

For the full architectural rationale, methodology, and reproducibility scripts, see:

STAC API / GeoParquet  -->  Parquet Collection  -->  Tile-level byte reads
       (once)                  (queryable)             (no GDAL hot path)

Quick Start

1. Build a Collection

import rasteret

# Build from any STAC API or Parquet Metadata table
collection = rasteret.build(
    "earthsearch/sentinel-2-l2a",
    name="s2_training",
    bbox=(77.5, 12.9, 77.7, 13.1),
    date_range=("2024-01-01", "2024-06-30")
)
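
Before kicking off a build, it can help to sanity-check the query inputs. The helper below is a hypothetical pre-flight check, not part of Rasteret. It assumes the bbox follows the STAC convention of (west, south, east, north) in degrees and that dates are ISO strings, which matches the values above but is not stated by the source.

```python
from datetime import date


def check_query(bbox, date_range):
    """Validate a (west, south, east, north) bbox and an ISO date range.

    Returns the length of the date range in days.
    """
    west, south, east, north = bbox
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be within [-180, 180]")
    if not (-90 <= south < north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south < north <= 90")
    if west >= east:
        # Antimeridian-crossing boxes are deliberately not handled here.
        raise ValueError("west must be less than east")
    start, end = (date.fromisoformat(d) for d in date_range)
    if start > end:
        raise ValueError("date_range start must not be after end")
    return (end - start).days


span_days = check_query((77.5, 12.9, 77.7, 13.1), ("2024-01-01", "2024-06-30"))
print(span_days)
```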

2. Turbocharge your ML (TorchGeo)

Rasteret provides a high-performance backend that honors the GeoDataset contract.

from torch.utils.data import DataLoader
from torchgeo.samplers import RandomGeoSampler

# Same API as TorchGeo, much faster pixel pipe
dataset = collection.to_torchgeo_dataset(bands=["B04", "B08"], chip_size=256)

sampler = RandomGeoSampler(dataset, size=256, length=100)
loader  = DataLoader(dataset, sampler=sampler, batch_size=4)
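
For readers new to geo-samplers, the idea behind random chip sampling can be sketched in plain Python: draw random square windows that fit inside the dataset bounds, and let the dataset return pixels for each window. This is a conceptual stand-in with invented names, not TorchGeo's implementation, and it treats chip size and bounds as being in the same units.

```python
import random


def random_chips(bounds, chip_size, length, seed=0):
    """Yield `length` random square windows of side `chip_size` inside bounds.

    bounds is (minx, miny, maxx, maxy) in the same units as chip_size.
    """
    minx, miny, maxx, maxy = bounds
    if maxx - minx < chip_size or maxy - miny < chip_size:
        raise ValueError("bounds are smaller than one chip")
    rng = random.Random(seed)  # seeded so the draw is repeatable
    for _ in range(length):
        x = rng.uniform(minx, maxx - chip_size)
        y = rng.uniform(miny, maxy - chip_size)
        yield (x, y, x + chip_size, y + chip_size)


chips = list(random_chips((0, 0, 10_000, 10_000), chip_size=256, length=100))
print(len(chips), chips[0])
```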

3. Fast Xarray creation

ds = collection.get_xarray(geometries=my_aoi, bands=["B04", "B08"])
ndvi = (ds.B08 - ds.B04) / (ds.B08 + ds.B04)
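
On Sentinel-2, B08 is the near-infrared band and B04 is the red band, so the expression above is the standard NDVI formula. A per-pixel version in plain Python, with illustrative reflectance values and an explicit guard for the zero-denominator case, looks like this:

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    if total == 0:
        return float("nan")  # undefined when both bands are zero
    return (nir - red) / total


# Illustrative reflectance values (not real measurements).
vegetation = ndvi(nir=0.5, red=0.1)
bare_soil = ndvi(nir=0.3, red=0.25)
print(round(vegetation, 3), round(bare_soil, 3))
```

Values near 1 indicate dense vegetation, while values near 0 indicate bare or sparsely vegetated surfaces.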

Key Entry Points

Rasteret is built for flexibility. Choose the output format that fits your existing workflow:

| Method | Output | Purpose |
|---|---|---|
| to_torchgeo_dataset() | RasteretGeoDataset | Drop-in high-performance backend for TorchGeo training. |
| get_xarray() | xarray.Dataset | Quickly create an xarray Dataset for analysis. |
| get_numpy() | numpy.ndarray | Raw pixel arrays ([N, C, H, W]) directly. |
| get_gdf() | GeoDataFrame | Metadata and pixel arrays as a standard geopandas dataframe. |
| sample_points() | DataFrame | Exact pixel values at point geometries, with a configurable fallback for nodata pixels. |

Full documentation at terrafloww.github.io/rasteret.

License

Code: Apache-2.0
