
BetaEarth: AlphaEarth Embedding Emulator



Embedding Sentinel-2 and Sentinel-1 with a Little Help from AlphaEarth


What is BetaEarth?

BetaEarth is an open-source model that produces dense 10m geospatial embedding fields from Sentinel-2 and Sentinel-1 imagery. It is trained to reproduce the outputs of AlphaEarth Foundations (AEF) — a closed-source embedding model released by Google and Google DeepMind — using only AEF's publicly available precomputed embeddings as supervision.

BetaEarth has no access to AEF's weights or architecture. It is an independent model, not a variant or extension of AEF.

Why does this matter?

  • Reproducibility: AEF embeddings cannot be generated for new data without Google Earth Engine access. BetaEarth can run locally on any Sentinel-2/S1 imagery.
  • Auditability: BetaEarth enables the community to probe a closed-source model's behaviour — identifying biases, modality sensitivities, and failure modes — without direct model access.
  • Security research: This work demonstrates that releasing embeddings may not be a risk-free alternative to releasing model weights.

Models

We release 8 model variants spanning different trade-offs between quality, parameter efficiency, and input requirements.

Main results (6,200-tile test set)

| Model | Test cos sim | Std | LULC acc | Params | Inputs |
|---|---|---|---|---|---|
| SF curriculum (robust) | 0.873 | 0.109 | 0.833 | 104.8M | Any subset of S2/S1/DEM + DOY |
| SF frozen+FiLM (reinit) | 0.886 | 0.098 | 0.873 | 104.8M | S2 L1C+L2A, S1, DEM, DOY |
| SF frozen+FiLM (hilr) | 0.886 | 0.099 | 0.866 | 104.8M | S2 L1C+L2A, S1, DEM, DOY |
| SF from scratch+FiLM | 0.883 | --- | 0.835 | 104.8M | S2 L1C+L2A, S1, DEM, DOY |
| SF no FiLM (ISPRS) | 0.880 | 0.101 | 0.869 | 104.8M | S2 L1C+L2A, S1, DEM |
| DINOv3 ViT-L/16 (sat) | 0.874 | 0.100 | 0.870 | 304M | 6 primitives + DOY |
| DINOv3 ViT-S/16 (nat) | 0.861 | 0.109 | 0.863 | 23.8M | 6 primitives + DOY |
| SF RGB-only+FiLM | 0.836 | --- | 0.823 | 26.3M | S2 RGB, DOY |
| Real AlphaEarth (ceiling) | --- | --- | 0.889 | --- | --- |

The curriculum (robust) model handles any modality subset gracefully:

| Input subset | Cosine sim |
|---|---|
| All modalities | 0.873 |
| L1C only | 0.806 |
| L2A only | 0.755 |
| S1 only | 0.712 |
| DEM only | 0.609 |

Which model should I use?

| Use case | Recommended model | Why |
|---|---|---|
| General use (default) | SF curriculum (robust) | Works with any input subset; best for real-world deployment |
| Maximum quality | SF frozen+FiLM (reinit) | Highest cos sim (0.886); requires all four modalities |
| No timestamp needed | SF no FiLM (ISPRS) | Does not require day-of-year input; still achieves 0.880 |
| Lightweight / edge | DINOv3 ViT-S/16 | 23.8M params, good quality (0.861) |
| Minimal data requirements | SF RGB-only+FiLM | Only needs 3-band RGB + day-of-year |
| Research / ablation | SF frozen+FiLM (hilr) | Alternative fusion strategy for comparison |

Architecture overview

DINOv3 models use a single shared frozen DINOv3 backbone applied to 3-band spectral primitives:

| Primitive | Bands | Captures |
|---|---|---|
| True-colour RGB | B04/B03/B02 | Visual texture, built environment |
| False-colour IR | B08/B04/B03 | Vegetation health (NIR) |
| SWIR composite | B12/B11/B04 | Moisture, bare soil, burn scars |
| Red-edge | B07/B06/B05 | Canopy structure, chlorophyll |
| SAR | VV/VH/ratio | Structure, moisture (from S1) |
| Topography | Elevation/Slope/Aspect | Terrain (from COP-DEM) |

Primitives are fused via permutation-invariant cross-attention (SetFusion).
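The key property of this kind of set fusion is that the fused output does not depend on the order in which the primitives are presented. A minimal numpy sketch of the idea (single head, single learned query; all names and shapes are illustrative, not the actual SetFusion implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def set_fusion(primitives, query, w_k, w_v):
    """Toy permutation-invariant fusion: one learned query vector
    cross-attends over a set of per-primitive feature vectors."""
    X = np.stack(primitives)              # (N, D), one row per primitive
    K = X @ w_k                           # keys    (N, D)
    V = X @ w_v                           # values  (N, D)
    attn = softmax(K @ query / np.sqrt(query.size))  # (N,) weights over the set
    return attn @ V                       # (D,) fused representation

rng = np.random.default_rng(0)
D = 16
feats = [rng.normal(size=D) for _ in range(6)]   # e.g. six primitives
query = rng.normal(size=D)
w_k, w_v = rng.normal(size=(D, D)), rng.normal(size=(D, D))

fused = set_fusion(feats, query, w_k, w_v)
fused_shuffled = set_fusion(feats[::-1], query, w_k, w_v)
assert np.allclose(fused, fused_shuffled)  # invariant to primitive order
```

Because each primitive contributes only through its attention weight and value, dropping a primitive from the set degrades the fusion gracefully rather than breaking a fixed input layout.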

SegFormer models use 4 separate MiT-B2 encoders processing each modality's raw bands natively (9ch S2-L1C, 9ch S2-L2A, 2ch S1, 1ch DEM), with channel concatenation fusion.

All models use FiLM temporal conditioning (day-of-year modulation) except the ISPRS baseline.
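The FiLM mechanism itself is small: the day-of-year is encoded as a point on the annual cycle, a linear map predicts a per-channel scale (gamma) and shift (beta), and these modulate the feature maps. A numpy sketch under those assumptions (layer sizes and names are hypothetical, not BetaEarth's internals):

```python
import numpy as np

def doy_encoding(doy):
    """Encode day-of-year (1-366) as sin/cos on the annual cycle,
    so day 366 sits next to day 1."""
    angle = 2 * np.pi * (doy - 1) / 365.25
    return np.array([np.sin(angle), np.cos(angle)])

def film(feat, doy, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation of a (C, H, W) feature map:
    per-channel scale and shift predicted from the DOY encoding."""
    t = doy_encoding(doy)                  # (2,)
    gamma = w_gamma @ t + b_gamma          # (C,)
    beta = w_beta @ t + b_beta             # (C,)
    return gamma[:, None, None] * feat + beta[:, None, None]

C = 4
feat = np.ones((C, 3, 3))
# With gamma == 1 and beta == 0, FiLM reduces to the identity
out = film(feat, doy=182,
           w_gamma=np.zeros((C, 2)), b_gamma=np.ones(C),
           w_beta=np.zeros((C, 2)), b_beta=np.zeros(C))
assert np.allclose(out, feat)
```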

Key findings

  • Temporal conditioning as spectral compensation: FiLM importance scales inversely with spectral access — RGB-only (22pp) > DINOv3 (18pp) > SegFormer scratch (14pp) > frozen SegFormer (5pp).
  • Multi-temporal averaging of 4+ observations improves emulation by up to +18pp over single timestamps.
  • Predicted embeddings retain 98% of downstream LULC classification accuracy and are robust to 32x compression.
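The compression robustness is easy to sanity-check yourself. The 32x figure presumably involves more aggressive techniques, but even a plain int8 round trip (the storage format of the original AEF archive) costs the unit-norm embeddings almost no cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a model output: an (H, W, 64) field of unit vectors
emb = rng.normal(size=(8, 8, 64)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=-1, keepdims=True)

# int8 round trip: components of a unit vector lie in [-1, 1]
q = np.clip(np.round(emb * 127), -127, 127).astype(np.int8)
deq = q.astype(np.float32) / 127.0
deq /= np.linalg.norm(deq, axis=-1, keepdims=True)

# Per-pixel cosine similarity between original and round-tripped vectors
cos = (emb * deq).sum(axis=-1)
print(cos.min())  # stays very close to 1.0
```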

Model Properties

| Property | Value |
|---|---|
| Output | Dense embedding field: (H, W, 64) per tile at 10 m resolution |
| Output normalisation | L2-normalised per pixel (unit vectors on S^63) |
| Quantisation | Original AEF: int8 on S^63; BetaEarth outputs float32 |
| Tile size | 10.68 x 10.68 km (1068 x 1068 px), Major TOM grid |
| Training data | 62,489 Major TOM grid cells (49,991 train / 6,248 val / 6,250 test) |
| Loss | Cosine similarity + 0.1 * MSE, masked to valid pixels |
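The loss can be written down directly from the table. A numpy sketch, assuming (H, W, 64) prediction and target fields and a boolean validity mask (the training code's exact weighting and masking details may differ):

```python
import numpy as np

def emulation_loss(pred, target, valid, mse_weight=0.1):
    """Cosine-similarity loss plus a small MSE term, averaged over
    valid pixels. pred, target: (H, W, 64); valid: (H, W) bool."""
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    cos = (p * t).sum(axis=-1)                  # (H, W) per-pixel similarity
    mse = ((pred - target) ** 2).mean(axis=-1)  # (H, W) per-pixel MSE
    per_pixel = (1.0 - cos) + mse_weight * mse
    return per_pixel[valid].mean()              # average over valid pixels only

# A perfect prediction on the valid pixels gives zero loss
target = np.random.default_rng(0).normal(size=(4, 4, 64))
target /= np.linalg.norm(target, axis=-1, keepdims=True)
valid = np.ones((4, 4), dtype=bool)
assert np.isclose(emulation_loss(target.copy(), target, valid), 0.0)
```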

Quickstart

```shell
pip install betaearth
```

```python
from betaearth import BetaEarth

model = BetaEarth.from_pretrained()  # default: robust variant
# BetaEarth(params=104.8M, device=cuda)

# All inputs are raw (unnormalised); preprocessing is handled internally
embedding = model.predict(
    s2_l2a=s2_l2a,  # (9, H, W) uint16 DN (~0-10000)
    s2_l1c=s2_l1c,  # (9, H, W) uint16 DN (~0-10000)
    s1=s1,          # (2, H, W) float32 linear power
    dem=dem,        # (1, H, W) float32 elevation in meters
    doy=182,        # day of year (1-366)
)
# embedding: (H, W, 64) float32 numpy array, L2-normalised per pixel
```

Any modality can be omitted — the model handles missing inputs via zeroed features:

```python
# S2-only (no S1, no DEM)
emb = model.predict(s2_l2a=s2_l2a, doy=182)

# S2 + DEM, no S1
emb = model.predict(s2_l2a=s2_l2a, dem=dem, doy=182)
```

Multi-temporal averaging

```python
import numpy as np

preds = []
for s2, s1, doy in zip(s2_timeseries, s1_timeseries, doys):
    pred = model.predict(s2_l2a=s2, s1=s1, dem=dem, doy=doy)
    preds.append(pred)

# Simple averaging; gains saturate at around 4 observations
annual = np.mean(preds, axis=0)
annual /= np.linalg.norm(annual, axis=-1, keepdims=True)  # re-normalise per pixel
```

Data Access

All training data is from the Major TOM community project and is freely available on HuggingFace:

| Dataset | Description |
|---|---|
| Major-TOM/Core-S2-L2A | Sentinel-2 L2A imagery |
| Major-TOM/Core-S2-L1C | Sentinel-2 L1C imagery |
| Major-TOM/Core-S1-RTC | Sentinel-1 RTC imagery |
| Major-TOM/Core-AlphaEarth-Embeddings | AEF target embeddings |

Data normalisation

All inputs should be passed as raw values; normalisation happens inside the model, with the DEM as the one exception:

  • S2 L1C/L2A: uint16 DN (0-10000+), divided by 10000 internally
  • S1 RTC: linear power (float32, ~0-200), log-transformed internally
  • COP-DEM: pre-normalised to [0, 1] before passing to the model

Important: S2 band order must follow Major TOM convention: [B02, B03, B04, B08, B05, B06, B07, B11, B12] (10m bands first, then 20m).
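If your bands arrive as individual (H, W) arrays keyed by name, a small helper (hypothetical, not part of the betaearth package) can enforce the Major TOM ordering before stacking:

```python
import numpy as np

# Major TOM S2 band order: 10 m bands first, then the 20 m bands
MAJOR_TOM_S2_ORDER = ["B02", "B03", "B04", "B08", "B05", "B06", "B07", "B11", "B12"]

def stack_s2_bands(bands):
    """Stack per-band (H, W) arrays into the (9, H, W) layout the model expects."""
    missing = [b for b in MAJOR_TOM_S2_ORDER if b not in bands]
    if missing:
        raise ValueError(f"missing S2 bands: {missing}")
    return np.stack([np.asarray(bands[b]) for b in MAJOR_TOM_S2_ORDER])

# Bands read in wavelength order still come out in Major TOM order
bands = {b: np.full((2, 2), i, dtype=np.uint16)
         for i, b in enumerate(["B02", "B03", "B04", "B05", "B06",
                                "B07", "B08", "B11", "B12"])}
s2 = stack_s2_bands(bands)
assert s2.shape == (9, 2, 2)
assert s2[3, 0, 0] == 6  # position 3 holds B08, the 10 m NIR band
```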


Reproduce

git clone https://github.com/asterisk-labs/betaearth
cd betaearth
conda env create -f environment.yml
conda activate betaearth

# Train (requires A100 GPU)
python train_multi.py --batch_size 8 --max_epochs 20

# SegFormer FiLM variants (frozen encoder)
python train_segformer_frozen_variants.py --ckpt checkpoints/isprs_v1_epoch19_0.875.ckpt

# Evaluate on test set
python run_evaluation.py --ckpt checkpoints/multi_final.ckpt

# Generate paper figures
python generate_figures.py
python plot_training_curves.py

# Multi-temporal experiments
python test_multitemporal.py --ckpt checkpoints/segformer_film_frozen_best.ckpt

See CHECKLIST.md for a full step-by-step guide to reproducing each experiment in the paper.


Citation

```bibtex
@inproceedings{czerkawski2026betaearth,
  title     = {BetaEarth: Emulating Closed-Source Earth Observation Foundation Models Through Their Public Embeddings},
  author    = {Czerkawski, Mikolaj},
  booktitle = {ISPRS Congress 2026},
  year      = {2026}
}
```

License and Attribution

BetaEarth model weights are released under CC-BY 4.0, matching the license of the AlphaEarth Foundations embedding archive used for training supervision.

Required attribution for AEF training data:

"The AlphaEarth Foundations Satellite Embedding dataset is produced by Google and Google DeepMind."

Training imagery is sourced from Major TOM (Apache 2.0) and Copernicus Sentinel (free and open access).
