Index-first GeoTIFF access layer for ML and analysis, powered by queryable Parquet indexes.
🛰️ Rasteret
Made to beat cold starts.
Index-first access to cloud-native GeoTIFF collections for ML and analysis.
Every cold start re-parses satellite image metadata over HTTP - per scene, per band. Sentinel-2, Landsat, NAIP, every time. Your colleague did it last Tuesday, CI did it overnight, PyTorch respawns DataLoader workers every epoch. A single project repeats millions of redundant requests before a pixel moves.
Rasteret parses those headers once, caches them in Parquet, and its own reader fetches pixels concurrently with no GDAL in the path. Up to 20x faster on cold starts.
Rasteret calls this pattern index-first geospatial retrieval:
- Control plane: a queryable Parquet index (scene metadata, COG header metadata, user columns like splits/labels)
- Data plane: on-demand tile reads from the original GeoTIFF/COG objects
This keeps metadata and experiment logic in tables while leaving imagery bytes in source COGs.
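The split between the two planes can be sketched in a few lines of plain Python. This is an illustration of the pattern only, not Rasteret's implementation: `TileRecord`, `query_index`, and `read_tile` are hypothetical names, the index is a plain list rather than Parquet, and a `bytes` object stands in for a remote COG that a real reader would fetch with HTTP range requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileRecord:
    """One row of a hypothetical index: where a tile's bytes live."""
    scene_id: str
    band: str
    offset: int         # byte offset of the tile inside the object
    length: int         # byte length of the tile
    cloud_cover: float


def query_index(index, band, max_cloud):
    """Control plane: filter rows in memory, with no I/O against imagery."""
    return [r for r in index if r.band == band and r.cloud_cover < max_cloud]


def read_tile(blob, record):
    """Data plane: fetch only the byte range the index points at."""
    return blob[record.offset:record.offset + record.length]


# Stand-in for a remote COG object (6 header bytes, then two 4-byte tiles).
blob = b"HEADER" + b"A" * 4 + b"B" * 4
index = [
    TileRecord("scene-1", "B04", offset=6, length=4, cloud_cover=5.0),
    TileRecord("scene-2", "B04", offset=10, length=4, cloud_cover=40.0),
]

matches = query_index(index, band="B04", max_cloud=15)
tiles = [read_tile(blob, r) for r in matches]
```

The filtering step never touches `blob`; only the rows that survive the query trigger a read.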
Key Features
- Easy - three lines from STAC search or Parquet file to a TorchGeo-compatible dataset
- Up to 20x faster, with fewer cloud LISTs and GETs - custom I/O reads tiles with zero STAC/header overhead once a Collection is built
- Zero data downloads - work with terabytes of imagery while storing only megabytes of metadata.
- No STAC at training time - query once at setup; zero API calls during training, using a Collection you can extend.
- Reproducible - same Parquet index = same records = same results
- Native dtypes - integer imagery stays integer; missing/edge coverage is represented via fill values (nodata or 0) instead of NaNs
- Shareable cache - enrich a Collection with your ML splits, patch geometries, and custom data points, then share it instead of writing folders of image chips.
Rasteret is an opt-in accelerator that integrates with TorchGeo by
returning a standard GeoDataset. Your samplers, DataLoader, xarray
workflows, and analysis tools stay the same - Rasteret handles the async
tile I/O underneath.
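One way to get the reproducible, shareable splits mentioned in the feature list is to derive the split from a stable record ID instead of random shuffling. The sketch below is an assumption about how you might compute such a column before adding it to a Collection; `assign_split` and the `scene-NNN` IDs are hypothetical and not part of Rasteret.

```python
import hashlib


def assign_split(record_id: str, val_percent: int = 20) -> str:
    """Map a record ID to 'train' or 'val' deterministically."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # First 8 bytes as an integer, reduced to a bucket in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "val" if bucket < val_percent else "train"


record_ids = [f"scene-{i:03d}" for i in range(10)]  # hypothetical IDs
splits = {rid: assign_split(rid) for rid in record_ids}
```

Because the split depends only on the ID, anyone who receives the same index computes the same assignment.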
Installation
Requires Python 3.12+.
uv pip install rasteret
Extras
uv pip install "rasteret[xarray]" # + xarray output
uv pip install "rasteret[torchgeo]" # + TorchGeo for ML pipelines
uv pip install "rasteret[aws]" # + requester-pays buckets (Landsat, NAIP)
uv pip install "rasteret[azure]" # + Planetary Computer signed URLs
Combine as needed: uv pip install "rasteret[xarray,aws]".
Available extras: xarray, torchgeo, aws, azure, earthdata.
See Getting Started for details.
[!NOTE] Requester-pays data (Landsat, etc.): Install the
aws extra and configure AWS credentials (aws configure or environment variables). Free public collections like Sentinel-2 on Element84 work without credentials.
Built-in datasets
Rasteret ships with a growing catalog of datasets.
Each entry includes license metadata and a commercial_use flag for quick
filtering.
Pick an ID, pass it to build() and go:
$ rasteret datasets list
| ID | Name | Coverage | License | Auth |
|---|---|---|---|---|
| aef/v1-annual | AlphaEarth Foundation Embeddings (Annual) | global | CC-BY-4.0 | none |
| earthsearch/sentinel-2-l2a | Sentinel-2 Level-2A | global | proprietary(free) | none |
| earthsearch/landsat-c2-l2 | Landsat Collection 2 Level-2 | global | proprietary(free) | required |
| earthsearch/naip | NAIP | north-america | proprietary(free) | required |
| earthsearch/cop-dem-glo-30 | Copernicus DEM 30m | global | proprietary(free) | none |
| earthsearch/cop-dem-glo-90 | Copernicus DEM 90m | global | proprietary(free) | none |
| pc/sentinel-2-l2a | Sentinel-2 Level-2A (Planetary Computer) | global | proprietary(free) | required |
| pc/io-lulc-annual-v02 | ESRI 10m Land Use/Land Cover | global | CC-BY-4.0 | required |
| pc/alos-dem | ALOS World 3D 30m DEM | global | proprietary(free) | required |
| pc/nasadem | NASADEM | global | proprietary(free) | required |
| pc/esa-worldcover | ESA WorldCover | global | CC-BY-4.0 | required |
| pc/usda-cdl | USDA Cropland Data Layer | conus | proprietary(free) | required |
Use your own datasets
- Use build_from_stac() for any STAC API
- Use build_from_table() for Parquet files that contain TIFF URLs (e.g., the SourceCoop AlphaEarth index Parquet)
You can also build collections from the CLI with rasteret collections build (read more details here).
Here's a guide to adding a dataset to Rasteret's catalog so everyone benefits. The catalog is community-driven and open for anyone to edit.
Each new dataset entry is around 20 lines of Python pointing to a STAC API or a GeoParquet file.
One PR adds a dataset, and every Rasteret user sees it in rasteret datasets list on the next release.
Quick start
Build a Collection
import rasteret
collection = rasteret.build(
    "earthsearch/sentinel-2-l2a",
    name="s2_training",
    bbox=(77.5, 12.9, 77.7, 13.1),
    date_range=("2024-01-01", "2024-06-30"),
)
build() picks the dataset from the catalog (backed by a STAC API or a
GeoParquet file, depending on the entry), parses COG headers, and caches
everything as Parquet. The next run loads in milliseconds.
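To give a sense of what "parsing COG headers" starts from, here is a minimal sketch that reads the 8-byte header of a classic TIFF: a byte-order mark, the magic number 42, and the offset of the first IFD. It is not Rasteret's parser; `parse_tiff_header` is a hypothetical helper, BigTIFF (which uses a different header layout) is not handled, and the input is a synthetic header rather than a real file.

```python
import struct


def parse_tiff_header(data: bytes) -> tuple[str, int]:
    """Return (byte_order, first_ifd_offset) from a classic TIFF header."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes")
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF: bad byte-order mark")
    # 16-bit magic number followed by the 32-bit offset of the first IFD.
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError(f"unsupported magic number {magic}")
    return ("little" if order == "<" else "big", ifd_offset)


# Synthetic little-endian header: 'II', magic 42, first IFD at byte 8.
header = b"II" + struct.pack("<HI", 42, 8)
byte_order, first_ifd = parse_tiff_header(header)
```

A reader that works against remote files has to fetch these bytes, then the IFD they point to, before it knows where any tile lives; caching the parsed result is what removes that work from later runs.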
Inspect and filter
collection # Collection('s2_training', source='sentinel-2-l2a', bands=13, records=42, crs=32643)
collection.bands # ['B01', 'B02', ..., 'B12', 'SCL']
len(collection) # 42
# Filter in memory, no network calls
filtered = collection.subset(cloud_cover_lt=15, date_range=("2024-03-01", "2024-06-01"))
subset() accepts cloud_cover_lt, date_range, bbox, geometries,
split, and split_column (when your split field uses a custom name).
For raw Arrow expressions, use collection.where(expr).
ML training (TorchGeo)
from torch.utils.data import DataLoader
from torchgeo.samplers import RandomGeoSampler
from torchgeo.datasets.utils import stack_samples
dataset = collection.to_torchgeo_dataset(
    bands=["B04", "B03", "B02", "B08"],
    chip_size=256,
)
sampler = RandomGeoSampler(dataset, size=256, length=100)
loader = DataLoader(dataset, sampler=sampler, batch_size=4, collate_fn=stack_samples)
Analysis (xarray)
ds = collection.get_xarray(
    geometries=(77.55, 13.01, 77.58, 13.08),  # bbox, Arrow array, Shapely, or WKB
    bands=["B04", "B08"],
)
ndvi = (ds.B08 - ds.B04) / (ds.B08 + ds.B04)
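A note on the arithmetic: if the bands arrive as unsigned integers (the native-dtypes behaviour described earlier suggests they may), NumPy-style subtraction wraps around wherever red exceeds NIR, so casting to float first (for example with `ds.astype("float32")`) is a common precaution. Whether that cast is needed depends on the dtype `get_xarray` actually returns, which this sketch does not assume. The pure-Python example below emulates 16-bit wraparound with a hypothetical `uint16_subtract` helper and shows NDVI for a single pixel computed on float copies.

```python
UINT16_MODULUS = 2**16


def uint16_subtract(a: int, b: int) -> int:
    """Emulate unsigned 16-bit subtraction, which wraps instead of going negative."""
    return (a - b) % UINT16_MODULUS


def ndvi_pixel(nir: int, red: int) -> float:
    """NDVI on float copies of the inputs; NaN when both bands are zero."""
    nir_f, red_f = float(nir), float(red)
    total = nir_f + red_f
    return (nir_f - red_f) / total if total else float("nan")


wrapped = uint16_subtract(800, 1200)   # 65136 rather than -400
value = ndvi_pixel(800, 1200)          # -0.2
```

The zero-denominator guard matters here because missing coverage may be filled with 0 rather than NaN.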
Fast arrays (NumPy)
arr = collection.get_numpy(
    geometries=(77.55, 13.01, 77.58, 13.08),
    bands=["B04", "B08"],
)
# shape: [N, C, H, W] for multi-band, [N, H, W] for single-band
Point sampling
from shapely.geometry import Point
samples = collection.sample_points(
    points=[Point(77.56, 13.03), Point(77.57, 13.04)],
    bands=["B04", "B08"],
    geometry_crs=4326,
)
# PyArrow Table — one row per (point, band, record)
Reads only the tiles containing your points. Works with Shapely points, or pass a PyArrow table with coordinate columns for millions of points. No extras needed — available in the base install.
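The idea of reading only the tiles that contain a point comes down to simple index arithmetic once the point is in the raster's CRS. The sketch below is illustrative, not Rasteret's code: `tile_for_point` is a hypothetical helper, the raster is assumed to be north-up with square pixels, the origin and pixel size are made-up values, and reprojection from the point CRS is left out.

```python
def tile_for_point(x, y, origin_x, origin_y, pixel_size, tile_size):
    """Return (tile_col, tile_row) of the tile containing map point (x, y).

    Assumes a north-up raster: origin at the top-left corner, square pixels,
    and y decreasing as the row index increases.
    """
    col = int((x - origin_x) // pixel_size)
    row = int((origin_y - y) // pixel_size)
    if col < 0 or row < 0:
        raise ValueError("point lies outside the raster")
    return col // tile_size, row // tile_size


# Hypothetical 10 m raster with 512-pixel tiles and origin (500000, 1500000).
tile = tile_for_point(507000.0, 1493000.0, 500000.0, 1500000.0, 10.0, 512)
```

With the tile position known, a cached table of tile offsets and byte counts is enough to request just that tile.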
Going further
| What | Where |
|---|---|
| Datasets not in the catalog | build_from_stac() |
| Parquet with COG URLs (Source Cooperative, STAC GeoParquet, custom) | build_from_table(path, name=...) |
| Sample values at many points (Arrow-native) | sample_points() |
| Multi-band COGs (AEF embeddings, etc.) | AEF Embeddings guide |
| Authenticated sources (PC, requester-pays, Earthdata, etc.) | Custom Cloud Provider |
| Share a Collection | collection.export("path/") then rasteret.load("path/") |
| Filter by cloud cover, date, bbox | collection.subset() |
Benchmarks
Single request performance (time series query)
Processing pipeline: Filter 450,000 scenes -> 22 matches -> Read 44 COG files
Single Farm NDVI Time Series (1 Year, Landsat 9)
Run on AWS t3.xlarge (4 CPU):
| Library | First Run | Subsequent Runs |
|---|---|---|
| Rasterio (Multiprocessing) | 32 s | 24 s |
| Rasteret | 3 s | 3 s |
| Google Earth Engine | 10–30 s | 3–5 s |
Cold-start comparison with TorchGeo
Same AOIs, same scenes, same sampler, same DataLoader. Both paths output
identical [batch, T, C, H, W] tensors. TorchGeo runs with its
recommended GDAL settings for best-case remote COG performance.
| Scenario | rasterio/GDAL path | Rasteret path | Ratio |
|---|---|---|---|
| Single AOI, 15 scenes | 9.08 s | 1.14 s | 8x |
| Multi-AOI, 30 scenes | 42.05 s | 2.25 s | 19x |
| Cross-CRS boundary, 12 scenes | 12.47 s | 0.59 s | 21x |
The difference comes from how headers are accessed: the rasterio/GDAL path re-parses IFDs over HTTP on each cold start, while Rasteret reads them from a local Parquet cache. See Benchmarks for full methodology.
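The Ratio column in the TorchGeo comparison is the rasterio/GDAL time divided by the Rasteret time, rounded to the nearest whole number. A short check using the timings from that table:

```python
# (rasterio/GDAL seconds, Rasteret seconds) copied from the table above.
timings = {
    "Single AOI, 15 scenes": (9.08, 1.14),
    "Multi-AOI, 30 scenes": (42.05, 2.25),
    "Cross-CRS boundary, 12 scenes": (12.47, 0.59),
}

ratios = {name: round(gdal / rasteret) for name, (gdal, rasteret) in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```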
HF baseline (payload-Parquet patches)
HF datasets baseline (Major TOM keyed patches)
Baseline method: datasets.load_dataset(..., streaming=True, filters=...) with
local GeoTIFF decode, compared against Rasteret prebuilt index reads.
Reproduce with examples/major_tom_benchmark/03_hf_vs_rasteret_benchmark.py.
| Patches | HF datasets (streaming) | Rasteret index+COG | Speedup |
|---|---|---|---|
| 120 | 46.83 s | 12.09 s | 3.88x |
| 1000 | 771.59 s | 118.69 s | 6.50x |
For exploration workflows, Major TOM notebooks often use HF streaming generators; Rasteret is optimized for reading the same patches directly from source COGs using an index-first cache.
Notebook: 05_torchgeo_comparison.ipynb
[!NOTE] Measured on an EC2 instance in the same region as the data (us-west-2). TorchGeo timings above use 12-30 scenes; HF timings above use 120/1000 patches. Results vary with network conditions. If you run Rasteret on your own workloads, share your numbers on GitHub Discussions or Discord.
Scope and stability
| Area | Status |
|---|---|
| STAC + COG scene workflows | Stable |
| Parquet-first workflows (build_from_table()) | Stable |
| Multi-band / planar-separate COGs (band_index) | Stable |
| Multi-cloud (S3, Azure Blob, GCS) | Stable |
| Dataset catalog | Stable |
| TorchGeo adapter | Stable |
Rasteret is optimized for remote, tiled GeoTIFFs (COGs). It also works with local tiled GeoTIFFs for indexing, filtering, and sharing collections. Non-tiled TIFFs and non-TIFF formats are best handled by TorchGeo or rasterio.
Documentation
Full docs at terrafloww.github.io/rasteret:
| Section | Contents |
|---|---|
| Getting Started | Installation and first steps |
| Tutorials | Hands-on notebooks |
| How-To Guides | Task-oriented recipes |
| API Reference | Auto-generated from source |
| Architecture | Design decisions |
| Ecosystem Comparison | Rasteret vs TACO, async-geotiff, virtual-tiff |
Contributing
The catalog grows with community help:
- Add a dataset: write a ~20 line descriptor in catalog.py, open a PR. See prerequisites and guide
- Improve docs: fix a typo, add an example, clarify a section
- Build something new: ingest drivers, cloud backends, readers. See Architecture
All contributions are welcome. See Contributing for dev setup; we are happy to discuss all aspects of the library. Share ideas on GitHub Discussions, or join our Discord to chat.
Technical notes
GeoParquet and Parquet Raster
Rasteret Collections are written as GeoParquet 1.1 (WKB footprint geometry
plus geo metadata; coordinates in CRS84). Parquet is adding native GEOMETRY/GEOGRAPHY logical types and GeoParquet 2.0 is evolving alongside that; Rasteret tracks this and plans to adopt them when ecosystem support stabilizes.
GeoParquet also has an alpha "Parquet Raster" draft for storing raster payloads in Parquet. Rasteret does not write Parquet Raster files: pixels stay in GeoTIFF/COGs, and Parquet stays the index.
TorchGeo interop
RasteretGeoDataset is a standard TorchGeo GeoDataset subclass. It honors
the full GeoDataset contract:
- __getitem__(GeoSlice) returns {"image": Tensor, "bounds": Tensor, "transform": Tensor}
- index is a GeoPandas GeoDataFrame with an IntervalIndex named "datetime"
- crs and res are set correctly for sampler compatibility
- Works with RandomGeoSampler, GridGeoSampler, and any custom sampler
- Works with IntersectionDataset and UnionDataset for dataset composition
Rasteret replaces the I/O backend (custom IO instead of rasterio/GDAL) but speaks the same interface. Your samplers, DataLoader, transforms, and training loop do not change.
Rasteret can also add extra keys to the sample dict (e.g. label from a
metadata column) without breaking interop - TorchGeo ignores unknown keys.
TorchGeo's rasterio/GDAL-backed RasterDataset remains the right choice for
non-tiled TIFFs and non-TIFF formats.
License
Code: Apache-2.0