Index-first GeoTIFF access layer for ML and analysis, powered by queryable Parquet indexes.
🛰️ Rasteret
The AI practitioner's multiplier for cloud-native satellite data.
A high-performance rasterio/GDAL alternative for scalable ML workflows.
Rasteret helps you manage and read massive satellite imagery collections with zero friction.
It provides a high-performance "drop-in" backend for **TorchGeo**, **xarray**, and **NumPy** that is up to 20x faster than traditional GDAL-based workflows.
Why Rasteret?
Geospatial data science is often 80% "plumbing." You spend hours writing pystac-client loops, manual ThreadPoolExecutor code, and fragile CRS-alignment logic just to get a batch of pixels for your model.
Rasteret turns that 80% into a few lines of code.
It separates the Control Plane (managing your scenes, labels, and splits in a local Parquet index) from the Data Plane (streaming pixels directly from cloud COGs).
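The split can be pictured with a minimal sketch, assuming plain Python records as a stand-in for rows of the Parquet index. `SceneRecord`, `select_scenes`, `plan_reads`, and the example URLs are hypothetical names for illustration, not part of Rasteret's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneRecord:
    """One row of a hypothetical scene index (stand-in for a Parquet row)."""
    scene_id: str
    split: str
    cloud_cover: float
    href: str  # where the pixels live; the control plane never opens it


# Control plane: a small local table of scene metadata (illustrative values).
INDEX = [
    SceneRecord("scene_a", "train", 4.2, "s3://example-bucket/scene_a.tif"),
    SceneRecord("scene_b", "train", 37.0, "s3://example-bucket/scene_b.tif"),
    SceneRecord("scene_c", "val", 1.5, "s3://example-bucket/scene_c.tif"),
]


def select_scenes(index, split, max_cloud_cover):
    """Filter the index on metadata only; no imagery is touched."""
    return [
        record
        for record in index
        if record.split == split and record.cloud_cover < max_cloud_cover
    ]


def plan_reads(records):
    """Hand the data plane the list of remote files that need pixel reads."""
    return [record.href for record in records]


selected = select_scenes(INDEX, split="train", max_cloud_cover=10)
print(plan_reads(selected))
```

In this sketch, all filtering happens on the local table, and only the surviving rows turn into remote reads.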
The "Friction" vs. "Flow" Comparison
The Old Way (25+ lines of fragile plumbing):
- Search STAC catalog ✅
- Loop over items ✅
- Handle pagination ✅
- Filter by cloud cover ✅
- Wait 500ms per file to parse remote TIFF headers (GDAL cold start) ❌
- Manage `ThreadPoolExecutor` manually ❌
- Manually stack results and align CRS ❌
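As a rough illustration of the manual threading step in this list, here is a minimal standard-library sketch. `read_chip` is a hypothetical stand-in for a remote read and returns dummy pixels. A real pipeline would call its raster reader there and would still need the stacking and CRS alignment steps.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def read_chip(url):
    """Hypothetical stand-in for a remote read (header fetch plus pixel fetch)."""
    return {"url": url, "pixels": [[0, 0], [0, 0]]}


def read_all(urls, max_workers=8):
    """Fan out reads across threads and return results in input order."""
    results = [None] * len(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember each future's position so completion order does not matter.
        futures = {pool.submit(read_chip, url): i for i, url in enumerate(urls)}
        for future in as_completed(futures):
            # result() re-raises any worker exception instead of dropping it.
            results[futures[future]] = future.result()
    return results


chips = read_all(["scene_a.tif", "scene_b.tif", "scene_c.tif"])
print([chip["url"] for chip in chips])
```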
The Rasteret Way (3 lines of robust code):
import rasteret
# 1. Load or Build your collection (Index is local, metadata is relational)
collection = rasteret.load("my_s2_experiment")
# 2. Query like a Table: "Give me the training scenes with <10% clouds"
filtered = collection.subset(split="train", cloud_cover_lt=10)
# 3. Batch Read: "Fetch aligned pixels for these 1000 polygons"
data = filtered.get_numpy(geometries=my_polygons, bands=["B04", "B08"])
Key Features
- 🚀 Up to 20x Faster Cold Starts: By caching tile-layout metadata locally, Rasteret jumps straight to the pixels and skips the expensive remote header parsing that otherwise repeats in every new environment.
- 📦 Seamless "Drop-in" Backends: Boost TorchGeo or xarray performance by simply swapping the reader. No need to rewrite your training code.
- 🧬 Relational Imagery: Store your labels, `train/val/test` splits, and custom metadata directly in the imagery index. No more separate CSVs.
- 🛠️ Zero-Config Throughput: Automatic cloud storage presigning with `Obstore` and custom async I/O handle the networking so you don't have to.
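To make the cached tile-layout idea concrete, here is a minimal sketch of how cached header fields can be turned into byte ranges without re-reading the remote header. It assumes a tiled TIFF whose `TileOffsets` and `TileByteCounts` arrays were saved earlier. `TileLayout` and `byte_ranges_for_window` are hypothetical names, and this is not Rasteret's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileLayout:
    """Cached header fields for one tiled GeoTIFF band (hypothetical record)."""
    image_width: int
    image_height: int
    tile_width: int
    tile_height: int
    tile_offsets: tuple      # byte offset of each tile, row-major order
    tile_byte_counts: tuple  # stored size of each tile, row-major order


def byte_ranges_for_window(layout, col_off, row_off, width, height):
    """Return inclusive (start, end) byte ranges for tiles covering a pixel window."""
    if width <= 0 or height <= 0:
        raise ValueError("window must have positive width and height")
    tiles_across = -(-layout.image_width // layout.tile_width)  # ceiling division
    first_col = col_off // layout.tile_width
    last_col = (col_off + width - 1) // layout.tile_width
    first_row = row_off // layout.tile_height
    last_row = (row_off + height - 1) // layout.tile_height
    ranges = []
    for tile_row in range(first_row, last_row + 1):
        for tile_col in range(first_col, last_col + 1):
            index = tile_row * tiles_across + tile_col
            start = layout.tile_offsets[index]
            ranges.append((start, start + layout.tile_byte_counts[index] - 1))
    return ranges


# A 1024x1024 image with 512x512 tiles has a 2x2 tile grid (illustrative numbers).
layout = TileLayout(1024, 1024, 512, 512,
                    tile_offsets=(1000, 5000, 9000, 13000),
                    tile_byte_counts=(4000, 4000, 4000, 3000))
print(byte_ranges_for_window(layout, col_off=400, row_off=0, width=200, height=100))
```

Each returned pair maps directly onto an HTTP `Range` request, so with the layout cached the only remote traffic left is the pixel data itself.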
Performance
Rasteret's claims are backed by rigorous, reproducible benchmarks. We measure across three dimensions: cold-start latency, cloud-native scale, and comparison against legacy "data-inside-parquet" patterns.
1. Cold-start comparison with TorchGeo
Same AOIs, same scenes, same sampler, same DataLoader. Rasteret eliminates the "cold start tax" by caching IFD headers in the local Parquet index.
| Scenario | rasterio/GDAL (Standard) | Rasteret (Index-First) | Speedup |
|---|---|---|---|
| Single AOI, 15 scenes | 9.08 s | 1.14 s | 8x |
| Multi-AOI, 30 scenes | 42.05 s | 2.25 s | 19x |
| Cross-CRS boundary | 12.47 s | 0.59 s | 21x |
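The Speedup column is the baseline time divided by the Rasteret time. A short sketch of that arithmetic, using the timings copied from the table above (the `speedup` helper is only illustrative):

```python
# Timings in seconds, copied from the table above: (rasterio/GDAL, Rasteret).
TIMINGS = {
    "Single AOI, 15 scenes": (9.08, 1.14),
    "Multi-AOI, 30 scenes": (42.05, 2.25),
    "Cross-CRS boundary": (12.47, 0.59),
}


def speedup(baseline_s, candidate_s):
    """Ratio of baseline time to candidate time; values above 1 mean faster."""
    return baseline_s / candidate_s


for scenario, (baseline_s, candidate_s) in TIMINGS.items():
    print(f"{scenario}: {speedup(baseline_s, candidate_s):.1f}x")
```

Rounded to whole numbers, these ratios match the 8x, 19x, and 21x shown in the table.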
2. The Cloud vs. Edge Comparison
How does Rasteret stack up against Google Earth Engine (GEE) or a highly parallelized Rasterio setup for time-series extraction?
| Library | First Run (Cold) | Subsequent Runs (Hot) |
|---|---|---|
| Rasterio + ThreadPool | 32 s | 24 s |
| Google Earth Engine | 10–30 s | 3–5 s |
| Rasteret | 3 s | 3 s |
3. HuggingFace MajorTOM vs. Rasteret
Recent "images-inside-Parquet" approaches (like MajorTOM) store image bytes directly in Parquet files. Rasteret keeps imagery in cloud COGs and uses Parquet as a high-performance index, which delivers better throughput without the data movement overhead.
| Patches | HF datasets (streaming) | Rasteret index+COGs | Speedup |
|---|---|---|---|
| 120 | 46.83 s | 12.09 s | 3.88x |
| 1000 | 771.59 s | 118.69 s | 6.50x |
All numbers were measured on a 4-CPU AWS machine in us-west-2 (the same region as the data); GDAL baselines are cold-start runs.
Technical Deep Dives
For the full architectural rationale, methodology, and reproducibility scripts, see:
- Full Benchmarks Guide: Methodology and results.
- Design Decisions: Why we chose Parquet + COGs.
- Schema Contract: The internal anatomy of a Collection.
STAC API / GeoParquet (once) --> Parquet Collection (queryable) --> Tile-level byte reads (no GDAL hot path)
Quick Start
1. Build a Collection
import rasteret
# Build from any STAC API or Parquet Metadata table
collection = rasteret.build(
"earthsearch/sentinel-2-l2a",
name="s2_training",
bbox=(77.5, 12.9, 77.7, 13.1),
date_range=("2024-01-01", "2024-06-30")
)
2. Turbocharge your ML (TorchGeo)
Rasteret provides a high-performance backend that honors the GeoDataset contract.
from torch.utils.data import DataLoader
from torchgeo.samplers import RandomGeoSampler
# Same API as TorchGeo, much faster pixel pipe
dataset = collection.to_torchgeo_dataset(bands=["B04", "B08"], chip_size=256)
sampler = RandomGeoSampler(dataset, size=256, length=100)
loader = DataLoader(dataset, sampler=sampler, batch_size=4)
3. Fast Xarray creation
ds = collection.get_xarray(geometries=my_aoi, bands=["B04", "B08"])
ndvi = (ds.B08 - ds.B04) / (ds.B08 + ds.B04)
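For Sentinel-2, B08 is the near-infrared band and B04 is the red band, so the expression above is the standard NDVI formula, (NIR - Red) / (NIR + Red). A minimal per-pixel sketch in plain Python shows the arithmetic and one way to guard against a zero denominator. The `ndvi` helper is illustrative and not part of Rasteret.

```python
import math


def ndvi(nir, red):
    """NDVI for a single pixel; returns NaN when NIR + Red is zero."""
    total = nir + red
    if total == 0:
        return math.nan
    return (nir - red) / total


print(ndvi(3000, 1000))  # vegetated pixel: NIR well above red
print(ndvi(1000, 1000))  # equal reflectance in both bands
print(ndvi(0, 0))        # empty pixel, guarded against division by zero
```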
Key Entry Points
Rasteret is built for flexibility. Choose the output format that fits your existing workflow:
| Method | Output | Purpose |
|---|---|---|
| `to_torchgeo_dataset()` | `RasteretGeoDataset` | Drop-in high-performance backend for TorchGeo training. |
| `get_xarray()` | `xarray.Dataset` | Quickly create an xarray Dataset for analysis. |
| `get_numpy()` | `numpy.ndarray` | Raw pixel arrays (`[N, C, H, W]`) directly. |
| `get_gdf()` | `GeoDataFrame` | Metadata and pixel arrays as a standard geopandas dataframe. |
| `sample_points()` | `DataFrame` | Exact pixel values at point geometries, with a configurable fallback for nodata pixels. |
Full documentation at terrafloww.github.io/rasteret:
- Conceptual Roadmap: Why Rasteret?
- Transitioning from Rasterio: Side-by-side patterns.
- Turbocharging TorchGeo: Scaling your DL loaders.
- Tutorials: Hands-on examples.
License
Code: Apache-2.0