Index-first GeoTIFF access layer for ML and analysis, powered by queryable Parquet indexes.

Maintainers

sid141


🛰️ Rasteret

Made to beat cold starts.
Index-first access to cloud-native GeoTIFF collections for ML and analysis.

Every cold start re-parses satellite image metadata over HTTP - per scene, per band. Sentinel-2, Landsat, NAIP, every time. Your colleague did it last Tuesday, CI did it overnight, PyTorch respawns DataLoader workers every epoch. A single project repeats millions of redundant requests before a pixel moves.

Rasteret parses those headers once, caches them in Parquet, and fetches pixels concurrently with its own reader, with no GDAL in the path. In the cold-start benchmarks below, this is 8x to 21x faster than the rasterio/GDAL path.

Rasteret calls this pattern index-first geospatial retrieval:

Control plane: a queryable Parquet index (scene metadata, COG header metadata, user columns like splits/labels)
Data plane: on-demand tile reads from the original GeoTIFF/COG objects

This keeps metadata and experiment logic in tables while leaving imagery bytes in source COGs.
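The split can be illustrated with a small sketch. The control plane is a table row holding header fields that were parsed once (image size, tile size, and the TIFF TileOffsets and TileByteCounts arrays); the data plane turns a pixel location into a single HTTP Range request. The row layout, the placeholder URL, and the function name are assumptions for illustration, not Rasteret's schema or API.

```python
# Hypothetical index row: header fields parsed once and kept in a table.
index_row = {
    "url": "https://example.com/scene_B04.tif",  # placeholder URL
    "image_width": 1024,
    "image_height": 1024,
    "tile_width": 512,
    "tile_height": 512,
    # One entry per tile, row-major, as in the TIFF TileOffsets/TileByteCounts tags.
    "tile_offsets": [1000, 51000, 99000, 150000],
    "tile_byte_counts": [50000, 48000, 51000, 47000],
}


def tile_range_for_pixel(row: dict, px: int, py: int) -> tuple[str, str]:
    """Return (url, Range header value) for the tile containing pixel (px, py)."""
    if not (0 <= px < row["image_width"] and 0 <= py < row["image_height"]):
        raise ValueError("pixel outside image")
    # Ceiling division: partial tiles at the right edge still count.
    tiles_across = -(-row["image_width"] // row["tile_width"])
    tile_index = (py // row["tile_height"]) * tiles_across + (px // row["tile_width"])
    start = row["tile_offsets"][tile_index]
    end = start + row["tile_byte_counts"][tile_index] - 1  # Range end is inclusive
    return row["url"], f"bytes={start}-{end}"


url, byte_range = tile_range_for_pixel(index_row, px=600, py=700)
print(url, byte_range)  # https://example.com/scene_B04.tif bytes=150000-196999
```

No request to the file header is needed at read time, because everything required to locate the tile already sits in the index row.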

Key features

Easy - three lines from STAC search or Parquet file to a TorchGeo-compatible dataset
Up to 20x faster, with fewer cloud LIST and GET requests - custom I/O reads tiles with zero STAC/header overhead once a Collection is built
Zero data downloads - work with terabytes of imagery while storing only megabytes of metadata.
No STAC at training time - query once at setup; training makes zero API calls and reads from a Collection you can extend
Reproducible - same Parquet index = same records = same results
Point sampling built in - sample_points() returns Arrow-native tables for feature pipelines and large point sets
Native dtypes - integer imagery stays integer; missing/edge coverage is represented via fill values (nodata or 0) instead of NaNs
Shareable cache - enrich a Collection with your ML splits, patch geometries, and custom data points, then share it instead of writing folders of image chips
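The native-dtype point can be made concrete. Integers have no NaN, so a window that only partly overlaps a scene has to be padded with a fill value if the dtype is to stay integer. The helper below is a pure-Python sketch of that padding; the function name and the list-of-lists layout are assumptions for illustration, not Rasteret's reader.

```python
def pad_window(pixels, out_height, out_width, fill=0):
    """Place a partial block of integer pixels in a full-size window, filling the rest."""
    window = [[fill] * out_width for _ in range(out_height)]
    for r, row in enumerate(pixels[:out_height]):
        for c, value in enumerate(row[:out_width]):
            window[r][c] = value
    return window


# A 2x3 block read at the scene edge, requested as a 3x4 window.
partial = [[1200, 1310, 1295], [1188, 1302, 1277]]
window = pad_window(partial, out_height=3, out_width=4, fill=0)
for row in window:
    print(row)

# Every value is still an int; a NaN-based mask would have forced floats.
assert all(isinstance(v, int) for row in window for v in row)
```

Downstream code then has to treat the fill value as missing data explicitly, since it is indistinguishable from a real pixel of the same value.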

Rasteret is an opt-in accelerator that integrates with TorchGeo by returning a standard GeoDataset. Your samplers, DataLoader, xarray workflows, and analysis tools stay the same - Rasteret handles the async tile I/O underneath.

Installation

Requires Python 3.12+.

uv pip install rasteret

Extras

uv pip install "rasteret[xarray]"       # + xarray output
uv pip install "rasteret[torchgeo]"     # + TorchGeo for ML pipelines
uv pip install "rasteret[aws]"          # + requester-pays buckets (Landsat, NAIP)
uv pip install "rasteret[azure]"        # + Planetary Computer signed URLs

Combine as needed: uv pip install "rasteret[xarray,aws]".

Available extras: xarray, torchgeo, aws, azure, earthdata. See Getting Started for details.

Note: For requester-pays data (Landsat, etc.), install the aws extra and configure AWS credentials (aws configure or environment variables). Free public collections like Sentinel-2 on Element84 work without credentials.

Built-in datasets

Rasteret ships with a growing catalog of datasets. Each entry includes license metadata and a commercial_use flag for quick filtering.

Pick an ID, pass it to build() and go:

$ rasteret datasets list
ID                          Name                                       Coverage       License              Auth
aef/v1-annual               AlphaEarth Foundation Embeddings (Annual)  global         CC-BY-4.0            none
earthsearch/sentinel-2-l2a  Sentinel-2 Level-2A                        global         proprietary(free)    none
earthsearch/landsat-c2-l2   Landsat Collection 2 Level-2               global         proprietary(free)    required
earthsearch/naip            NAIP                                       north-america  proprietary(free)    required
earthsearch/cop-dem-glo-30  Copernicus DEM 30m                         global         proprietary(free)    none
earthsearch/cop-dem-glo-90  Copernicus DEM 90m                         global         proprietary(free)    none
pc/sentinel-2-l2a           Sentinel-2 Level-2A (Planetary Computer)   global         proprietary(free)    required
pc/io-lulc-annual-v02       ESRI 10m Land Use/Land Cover               global         CC-BY-4.0            required
pc/alos-dem                 ALOS World 3D 30m DEM                      global         proprietary(free)    required
pc/nasadem                  NASADEM                                    global         proprietary(free)    required
pc/esa-worldcover           ESA WorldCover                             global         CC-BY-4.0            required
pc/usda-cdl                 USDA Cropland Data Layer                   conus          proprietary(free)    required

Use your own datasets

Use build_from_stac() for any STAC API
Use build_from_table() for Parquet files that contain TIFF URLs (e.g., the Source Cooperative AlphaEarth index Parquet)

You can also build collections from the CLI with rasteret collections build; see the documentation for details.

The documentation includes a guide to adding a dataset to Rasteret's catalog so everyone benefits. The catalog is open for anyone to edit and is meant to be community-driven.

Each new dataset entry is about 20 lines of Python pointing to a STAC API or a GeoParquet file. One PR adds a dataset, and every Rasteret user sees it in rasteret datasets list in the next release.

Quick start

Build a Collection

import rasteret

collection = rasteret.build(
    "earthsearch/sentinel-2-l2a",
    name="s2_training",
    bbox=(77.5, 12.9, 77.7, 13.1),
    date_range=("2024-01-01", "2024-06-30"),
)

build() picks the dataset from the catalog (backed by a STAC API or a GeoParquet file, depending on the entry), parses COG headers, and caches everything as Parquet. The next run loads in milliseconds.
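The build-once, load-afterwards behaviour follows a common caching pattern. The sketch below uses a JSON file and a stand-in `parse_headers` function purely for illustration; Rasteret stores its index as Parquet, and none of these names are its API.

```python
import json
import tempfile
from pathlib import Path

calls = {"parse": 0}


def parse_headers(scene_ids):
    """Stand-in for the expensive step (remote header parsing)."""
    calls["parse"] += 1
    return [{"id": s, "tile_width": 512, "tile_height": 512} for s in scene_ids]


def build_or_load(cache_path: Path, scene_ids):
    """Return the index from cache_path, building it only if the file is missing."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    index = parse_headers(scene_ids)
    cache_path.write_text(json.dumps(index))
    return index


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "s2_training.json"
    first = build_or_load(cache, ["scene_a", "scene_b"])   # parses and writes
    second = build_or_load(cache, ["scene_a", "scene_b"])  # reads the cache

print(calls["parse"], first == second)  # 1 True
```

The expensive step runs once; every later call pays only for a local file read.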

Inspect and filter

collection        # Collection('s2_training', source='sentinel-2-l2a', bands=13, records=42, crs=32643)
collection.bands  # ['B01', 'B02', ..., 'B12', 'SCL']
len(collection)   # 42


# Filter in memory, no network calls
filtered = collection.subset(cloud_cover_lt=15, date_range=("2024-03-01", "2024-06-01"))

subset() accepts cloud_cover_lt, date_range, bbox, geometries, split, and split_column (when your split field uses a custom name). For raw Arrow expressions, use collection.where(expr).
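The split filter presupposes a column that labels each record. One common way to fill such a column reproducibly is to hash a stable record id, so the assignment does not depend on row order or a random seed. The sketch below is plain Python with made-up record ids; `assign_split` is a hypothetical helper, and how the column is attached to a Collection is not shown.

```python
import hashlib


def assign_split(record_id: str, val_fraction: float = 0.2) -> str:
    """Map a record id to 'train' or 'val' using a stable hash."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # First 8 bytes as an integer, scaled to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"


records = [{"id": f"scene_{i:03d}"} for i in range(10)]
for rec in records:
    rec["split"] = assign_split(rec["id"])

train = [r["id"] for r in records if r["split"] == "train"]
print(len(train), "train records of", len(records))
```

Because the label depends only on the id, re-running the assignment on a rebuilt or extended index gives existing records the same split.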

ML training (TorchGeo)

from torch.utils.data import DataLoader
from torchgeo.samplers import RandomGeoSampler
from torchgeo.datasets.utils import stack_samples

dataset = collection.to_torchgeo_dataset(
    bands=["B04", "B03", "B02", "B08"],
    chip_size=256,
)

sampler = RandomGeoSampler(dataset, size=256, length=100)
loader = DataLoader(dataset, sampler=sampler, batch_size=4, collate_fn=stack_samples)

Analysis (xarray)

ds = collection.get_xarray(
    geometries=(77.55, 13.01, 77.58, 13.08),  # bbox, Arrow array, Shapely, or WKB
    bands=["B04", "B08"],
)
ndvi = (ds.B08 - ds.B04) / (ds.B08 + ds.B04)
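If the bands come back in their native unsigned integer dtype, as the feature list suggests, it may be worth casting to float before the subtraction and treating fill values as missing; whether get_xarray already handles this is not stated here. The per-pixel sketch below shows the arithmetic in plain Python. The `ndvi` helper and the sample values are illustrative assumptions.

```python
def ndvi(nir: int, red: int, nodata: int = 0):
    """NDVI for one pixel; returns None where either band is nodata or the sum is zero."""
    if nir == nodata or red == nodata or (nir + red) == 0:
        return None
    return (float(nir) - float(red)) / (float(nir) + float(red))


# Why the cast matters: unsigned 16-bit subtraction wraps when red > nir.
nir, red = 900, 1500
wrapped = (nir - red) % 2**16    # what uint16 arithmetic would store
print(wrapped)                   # 64936, not -600
print(round(ndvi(nir, red), 3))  # -0.25
print(ndvi(0, 1500))             # None (fill value treated as missing)
```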

Fast arrays (NumPy)

arr = collection.get_numpy(
    geometries=(77.55, 13.01, 77.58, 13.08),
    bands=["B04", "B08"],
    all_touched=False,  # rasterio default masking semantics
)
# shape: [N, C, H, W] for multi-band, [N, H, W] for single-band
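The all_touched flag follows rasterio's masking semantics: when False, a pixel is selected only if its centre falls inside the geometry; when True, any pixel the geometry touches is selected. The sketch below reproduces that distinction for an axis-aligned bbox on a unit grid. It is a simplified illustration with a hypothetical helper, not the rasterization code used by rasterio or Rasteret.

```python
def covered_pixels(bbox, n_rows, n_cols, all_touched=False):
    """Return (row, col) pairs selected for an axis-aligned bbox on a unit grid.

    Pixel (row, col) spans x in [col, col + 1) and y in [row, row + 1).
    """
    xmin, ymin, xmax, ymax = bbox
    selected = []
    for row in range(n_rows):
        for col in range(n_cols):
            if all_touched:
                # Any overlap between the pixel and the bbox counts.
                hit = col < xmax and col + 1 > xmin and row < ymax and row + 1 > ymin
            else:
                # Only pixels whose centre lies inside the bbox count.
                cx, cy = col + 0.5, row + 0.5
                hit = xmin <= cx <= xmax and ymin <= cy <= ymax
            if hit:
                selected.append((row, col))
    return selected


bbox = (0.6, 0.6, 2.4, 2.4)
print(len(covered_pixels(bbox, 4, 4, all_touched=False)))  # 1
print(len(covered_pixels(bbox, 4, 4, all_touched=True)))   # 9
```

For small geometries relative to the pixel size, the two settings can differ by a large factor, which matters for per-field statistics.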

Point values (Arrow-native)

import duckdb

# Keep your table in DuckDB and pass coordinate columns directly
points = duckdb.sql("""
    SELECT lon, lat
    FROM read_parquet('points.parquet')
""").arrow().read_all()

samples = collection.sample_points(
    points=points,     # query input table/array of point locations
    x_column="lon",
    y_column="lat",
    bands=["B04", "B08"],
    geometry_crs=4326,
    match="latest",        # or "all" for full time series matches
)
# pyarrow.Table with point_index, record_id, datetime, band, value, point_crs, raster_crs
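With one row per point and band, a feature pipeline usually wants one row per point. The sketch below pivots the long layout into a wide one in plain Python, using made-up rows that carry only a subset of the columns listed above. For a real pyarrow.Table you could start from `samples.to_pylist()` or do the pivot in DuckDB instead; the values here are invented for illustration.

```python
from collections import defaultdict

# Made-up rows in the long layout (subset of the columns).
rows = [
    {"point_index": 0, "band": "B04", "value": 1200},
    {"point_index": 0, "band": "B08", "value": 3100},
    {"point_index": 1, "band": "B04", "value": 1500},
    {"point_index": 1, "band": "B08", "value": 2500},
]

# Group band values by point.
wide = defaultdict(dict)
for row in rows:
    wide[row["point_index"]][row["band"]] = row["value"]

# One dict per point, ordered by point_index.
features = [
    {"point_index": idx, **bands} for idx, bands in sorted(wide.items())
]
print(features)
```

With match="all", the same point and band can appear for several datetimes, so the grouping key would need to include datetime or record_id as well.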

Collection-centric loop: build/load/as_collection -> subset/where -> get_xarray/get_numpy/sample_points/to_torchgeo_dataset.

Going further

What	Where
Datasets not in the catalog	`build_from_stac()`
Parquet with COG URLs (Source Cooperative, STAC GeoParquet, custom)	`build_from_table(path, name=...)`
Multi-band COGs (AEF embeddings, etc.)	AEF Embeddings guide
Authenticated sources (PC, requester-pays, Earthdata, etc.)	Custom Cloud Provider
Share a Collection	`collection.export("path/")` then `rasteret.load("path/")`
Filter by cloud cover, date, bbox	`collection.subset()`
Sample values for large point sets	`collection.sample_points()`

Benchmarks

Single request performance

Processing pipeline: Filter 450,000 scenes -> 22 matches -> Read 44 COG files


Single Farm NDVI Time Series (1 Year, Landsat 9)

Run on AWS t3.xlarge (4 CPU):

Library	First Run	Subsequent Runs
Rasterio (Multiprocessing)	32 s	24 s
Rasteret	3 s	3 s
Google Earth Engine	10–30 s	3–5 s

Cold-start comparison with TorchGeo

Same AOIs, same scenes, same sampler, same DataLoader. Both paths output identical [batch, T, C, H, W] tensors. TorchGeo runs with its recommended GDAL settings for best-case remote COG performance.

Scenario	rasterio/GDAL path	Rasteret path	Ratio
Single AOI, 15 scenes	9.08 s	1.14 s	8x
Multi-AOI, 30 scenes	42.05 s	2.25 s	19x
Cross-CRS boundary, 12 scenes	12.47 s	0.59 s	21x

The difference comes from how headers are accessed: the rasterio/GDAL path re-parses IFDs over HTTP on each cold start, while Rasteret reads them from a local Parquet cache. See Benchmarks for full methodology.
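The Ratio column is the rasterio/GDAL time divided by the Rasteret time, rounded to a whole number. The short check below recomputes it from the timings in the table; only the variable names are introduced here.

```python
# (rasterio/GDAL seconds, Rasteret seconds) copied from the table above.
timings = {
    "Single AOI, 15 scenes": (9.08, 1.14),
    "Multi-AOI, 30 scenes": (42.05, 2.25),
    "Cross-CRS boundary, 12 scenes": (12.47, 0.59),
}

ratios = {name: gdal_s / rasteret_s for name, (gdal_s, rasteret_s) in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
# Single AOI, 15 scenes: 8.0x
# Multi-AOI, 30 scenes: 18.7x
# Cross-CRS boundary, 12 scenes: 21.1x
```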

[Figures: processing time comparison; speedup breakdown]

HF `datasets` baseline (Major TOM keyed patches)

Baseline method: datasets.load_dataset(..., streaming=True, filters=...) with local GeoTIFF decode, compared against Rasteret prebuilt index reads. Reproduce with examples/major_tom_benchmark/03_hf_vs_rasteret_benchmark.py.

Patches	HF `datasets` (streaming)	Rasteret index+COG	Speedup
120	46.83 s	12.09 s	3.88x
1000	771.59 s	118.69 s	6.50x

[Figures: HF vs Rasteret processing time; HF vs Rasteret speedup]

For exploration workflows, Major TOM notebooks often use HF streaming generators; Rasteret is optimized for reading the same patches directly from source COGs using an index-first cache.

Notebook: 05_torchgeo_comparison.ipynb

Note: Measured on an EC2 instance in the same region as the data (us-west-2). TorchGeo timings above use 12-30 scenes; HF timings above use 120/1000 patches. Results vary with network conditions. If you run Rasteret on your own workloads, share your numbers on GitHub Discussions or Discord.

Scope and stability

Area	Status
STAC + COG scene workflows	Stable
Parquet-first workflows (`build_from_table()`)	Stable
Multi-band / planar-separate COGs (`band_index`)	Stable
Multi-cloud (S3, Azure Blob, GCS)	Stable
Dataset catalog	Stable
TorchGeo adapter	Stable

Rasteret is optimized for remote, tiled GeoTIFFs (COGs). It also works with local tiled GeoTIFFs for indexing, filtering, and sharing collections. Non-tiled TIFFs and non-TIFF formats are best handled by TorchGeo or rasterio.

Documentation

Full docs at terrafloww.github.io/rasteret:


Getting Started	Installation and first steps
Tutorials	Hands-on notebooks
How-To Guides	Task-oriented recipes
API Reference	Auto-generated from source
Architecture	Design decisions
Ecosystem Comparison	Rasteret vs TACO, async-geotiff, virtual-tiff

Contributing

The catalog grows with community help:

Add a dataset: write a ~20 line descriptor in catalog.py, open a PR. See prerequisites and guide
Improve docs: fix a typo, add an example, clarify a section
Build something new: ingest drivers, cloud backends, readers. See Architecture

All contributions are welcome. See Contributing for dev setup; we are happy to discuss any aspect of the library. Share ideas on GitHub Discussions, or join our Discord to chat.

Technical notes

GeoParquet and Parquet Raster

Rasteret Collections are written as GeoParquet 1.1 (WKB footprint geometry plus geo metadata; coordinates in CRS84). Parquet is adding native GEOMETRY/GEOGRAPHY logical types and GeoParquet 2.0 is evolving alongside that; Rasteret tracks this work and plans to adopt it when ecosystem support stabilizes.

GeoParquet also has an alpha "Parquet Raster" draft for storing raster payloads in Parquet. Rasteret does not write Parquet Raster files: pixels stay in GeoTIFF/COGs, and Parquet stays the index.
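To make the footprint encoding concrete, the sketch below builds the WKB bytes for a rectangular footprint and decodes them again using only the standard library. It follows the standard WKB Polygon layout (byte order, geometry type, ring count, point count, coordinate pairs). The helper names are hypothetical, and how Rasteret actually produces its footprints is not shown in this README.

```python
import struct


def bbox_to_wkb(xmin, ymin, xmax, ymax) -> bytes:
    """Encode a bbox as a little-endian WKB Polygon with one closed ring."""
    ring = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]
    # byte order (1 = little-endian), geometry type (3 = Polygon), ring count
    header = struct.pack("<BII", 1, 3, 1)
    body = struct.pack("<I", len(ring))
    for x, y in ring:
        body += struct.pack("<dd", x, y)
    return header + body


def wkb_polygon_bounds(wkb: bytes):
    """Decode the single-ring polygon written above and return its bounds."""
    byte_order, geom_type, n_rings = struct.unpack_from("<BII", wkb, 0)
    assert (byte_order, geom_type, n_rings) == (1, 3, 1)
    (n_points,) = struct.unpack_from("<I", wkb, 9)
    coords = struct.unpack_from(f"<{2 * n_points}d", wkb, 13)
    xs, ys = coords[0::2], coords[1::2]
    return min(xs), min(ys), max(xs), max(ys)


footprint = bbox_to_wkb(77.5, 12.9, 77.7, 13.1)  # lon/lat order, as in CRS84
print(len(footprint), wkb_polygon_bounds(footprint))  # 93 (77.5, 12.9, 77.7, 13.1)
```

A footprint like this costs under a hundred bytes per record, which is why the index stays in the megabyte range while the pixels remain in the source COGs.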

TorchGeo interop

RasteretGeoDataset is a standard TorchGeo GeoDataset subclass. It honors the full GeoDataset contract:

__getitem__(GeoSlice) returns {"image": Tensor, "bounds": Tensor, "transform": Tensor}
index is a GeoPandas GeoDataFrame with an IntervalIndex named "datetime"
crs and res are set correctly for sampler compatibility
Works with RandomGeoSampler, GridGeoSampler, and any custom sampler
Works with IntersectionDataset and UnionDataset for dataset composition

Rasteret replaces the I/O backend (custom IO instead of rasterio/GDAL) but speaks the same interface. Your samplers, DataLoader, transforms, and training loop do not change.

Rasteret can also add extra keys to the sample dict (e.g. label from a metadata column) without breaking interop - TorchGeo ignores unknown keys.

TorchGeo's rasterio/GDAL-backed RasterDataset remains the right choice for non-tiled TIFFs and non-TIFF formats.

License

Code: Apache-2.0


Release history

Version	Released
0.3.12	Apr 13, 2026
0.3.11	Apr 12, 2026
0.3.10	Apr 11, 2026
0.3.9	Apr 10, 2026
0.3.8	Apr 9, 2026
0.3.7	Apr 7, 2026
0.3.6	Apr 5, 2026
0.3.5	Mar 12, 2026
0.3.4	Mar 7, 2026
0.3.3 (this version)	Mar 6, 2026
0.3.2	Mar 6, 2026
0.3.1	Feb 27, 2026
0.3.0	Feb 27, 2026

Download files


Source Distribution

rasteret-0.3.3.tar.gz (178.3 kB)

Uploaded Mar 6, 2026 Source

Built Distribution


rasteret-0.3.3-py3-none-any.whl (143.7 kB)

Uploaded Mar 6, 2026 Python 3

File details

Details for the file rasteret-0.3.3.tar.gz.

File metadata

Download URL: rasteret-0.3.3.tar.gz
Upload date: Mar 6, 2026
Size: 178.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rasteret-0.3.3.tar.gz
Algorithm	Hash digest
SHA256	`f9d4961835dfc9fb0edae01cdcd237c24d4c596ccfe41df0064d72f003e3649d`
MD5	`f24142364022036ebb544bdc7a140768`
BLAKE2b-256	`4e63b1404e24bfdd428471bbe76d6897f62fd7a17c00e51e2deba53513b3001e`
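A downloaded archive can be checked against the published SHA256 digest with the standard library. The sketch below assumes the file was saved in the current directory under its original name; the helper names are illustrative.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for rasteret-0.3.3.tar.gz
EXPECTED_SHA256 = "f9d4961835dfc9fb0edae01cdcd237c24d4c596ccfe41df0064d72f003e3649d"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()


archive = Path("rasteret-0.3.3.tar.gz")  # assumed local download location
if archive.exists():
    print("OK" if verify(archive) else "MISMATCH")
```

pip can enforce the same check at install time through hash-pinned requirements files (the --require-hashes mode).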


Provenance

The following attestation bundles were made for rasteret-0.3.3.tar.gz:

Publisher: pypi.yaml on terrafloww/rasteret

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rasteret-0.3.3.tar.gz
- Subject digest: f9d4961835dfc9fb0edae01cdcd237c24d4c596ccfe41df0064d72f003e3649d
- Sigstore transparency entry: 1051773787
- Sigstore integration time: Mar 6, 2026
Source repository:
- Permalink: terrafloww/rasteret@0e6f879864ba28db17e1b303c92bb908be7fa826
- Branch / Tag: refs/tags/v0.3.3
- Owner: https://github.com/terrafloww
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@0e6f879864ba28db17e1b303c92bb908be7fa826
- Trigger Event: push

File details

Details for the file rasteret-0.3.3-py3-none-any.whl.

File metadata

Download URL: rasteret-0.3.3-py3-none-any.whl
Upload date: Mar 6, 2026
Size: 143.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rasteret-0.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1db95c009afb99a873baf549305e3f2e1d9b5cea469ec0c25e512723442935c0`
MD5	`465313172eca59b02464c2dfcffeb1fb`
BLAKE2b-256	`da7eef263eb35c1a5e8a3091251b965d5d65b403225c11f17a75a3ed063ba8a8`


Provenance

The following attestation bundles were made for rasteret-0.3.3-py3-none-any.whl:

Publisher: pypi.yaml on terrafloww/rasteret

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rasteret-0.3.3-py3-none-any.whl
- Subject digest: 1db95c009afb99a873baf549305e3f2e1d9b5cea469ec0c25e512723442935c0
- Sigstore transparency entry: 1051773796
- Sigstore integration time: Mar 6, 2026
Source repository:
- Permalink: terrafloww/rasteret@0e6f879864ba28db17e1b303c92bb908be7fa826
- Branch / Tag: refs/tags/v0.3.3
- Owner: https://github.com/terrafloww
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@0e6f879864ba28db17e1b303c92bb908be7fa826
- Trigger Event: push
