Deep Learning for Earth Observation — automated training-dataset builder for EO segmentation tasks

dl4eo

dl4eo is a Python package for building multi-source Earth Observation training datasets and training segmentation models end-to-end. It automates the full pipeline from raw satellite data to model checkpoint:

  • Sentinel-2 (L2A, cloud-filtered, spectral indices)
  • Sentinel-1 RTC (VV + VH, batched by date)
  • Copernicus DEM (elevation + slope, per-scene mosaic)
  • Segmentation masks from any vector label file
  • Train-ready PyTorch dataset with global normalization
  • Model training with UNet, DeepLabV3+, SegFormer, ViT, and more
  • Evaluation with per-class IoU / F1 / Precision / Recall / Kappa + GeoTIFF prediction export

Installation

# Pipeline only (no PyTorch required)
pip install dl4eo

# Pipeline + training stack
# (quotes keep shells such as zsh from expanding the square brackets)
pip install "dl4eo[train]"

Requires Python ≥ 3.8.


Quick Start

1 — Build a dataset

import dl4eo

dl4eo.generate_dataset(
    base_dir="/data/glacial_lakes",
    aoi_shapefile_dir="/data/aoi/",           # folder with AOI.shp (study area polygon)
    feature_shapefile="/data/lake_boundaries.shp",  # label polygons
    date_range="2021-06-01/2021-08-31",
    cloud_cover=20,
    patch_size=256,           # pixels
    overlap=0.0,
    spectral_index="NDWI",    # NDWI | NDSI | NDVI | NDRE | EVI | None
    skip_sentinel1=False,
    skip_dem=False,
    normalize=False,          # recommended: normalize at load time via PatchDataset
    n_jobs=8,
)
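
How patch_size and overlap interact is not spelled out above. The sketch below is one plausible reading, assuming overlap is a fraction of the patch size (0.0 means edge-to-edge tiles) and that partial tiles at the scene edge are dropped. patch_origins is a hypothetical helper for illustration, not part of dl4eo.

```python
from typing import List


def patch_origins(length: int, patch_size: int = 256, overlap: float = 0.0) -> List[int]:
    """Return the start offsets (in pixels) of full patches along one image axis."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Assumption: overlap is a fraction of patch_size, so the stride shrinks as overlap grows.
    stride = max(1, int(round(patch_size * (1.0 - overlap))))
    return list(range(0, length - patch_size + 1, stride))


print(patch_origins(1024, patch_size=256, overlap=0.0))       # [0, 256, 512, 768]
print(len(patch_origins(1024, patch_size=256, overlap=0.5)))  # 7
```

Under this reading, raising overlap from 0.0 to 0.5 roughly doubles the patch count per axis, which is worth keeping in mind when sizing disk space and n_jobs.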

2 — Quality control, splits, statistics

# Filter bad patches (nodata, no foreground, constant bands)
valid = dl4eo.qc.validate("/data/glacial_lakes", min_positive_fraction=0.001)

# Create train / val / test splits
splits = dl4eo.splits.make_splits(
    "/data/glacial_lakes",
    ratios=(0.7, 0.15, 0.15),
    strategy="temporal",   # "random" | "temporal" | "spatial"
    valid_file="/data/glacial_lakes/valid_patches.txt",
)

# Global per-band statistics (training split only — no leakage)
stats = dl4eo.stats.compute("/data/glacial_lakes", split="train")
# → {"band_1": {"mean": 6032.7, "std": 3471.1, "p2": 540.0, "p98": 11752.0},
#    "band_2": {...}, ..., "_meta": {"n_files": 25, "split": "train"}}
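
For intuition, a temporal split orders patches by acquisition date and cuts the ordered list, so validation and test patches come from later dates than the training patches. The sketch below illustrates that idea only: temporal_split and the (name, date) pairs are hypothetical, and dl4eo may group scenes or break ties differently.

```python
from datetime import date
from typing import Dict, List, Sequence, Tuple


def temporal_split(
    patches: Sequence[Tuple[str, date]],
    ratios: Tuple[float, float, float] = (0.7, 0.15, 0.15),
) -> Dict[str, List[str]]:
    """Split patch names chronologically into train / val / test."""
    if abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("ratios must sum to 1")
    # Oldest patches first; the earliest dates end up in the training split.
    ordered = [name for name, _ in sorted(patches, key=lambda p: p[1])]
    n_train = round(len(ordered) * ratios[0])
    n_val = round(len(ordered) * ratios[1])
    return {
        "train": ordered[:n_train],
        "val": ordered[n_train:n_train + n_val],
        "test": ordered[n_train + n_val:],
    }


# Hypothetical patch list: one patch per day in June 2021.
patches = [(f"patch_{i:02d}", date(2021, 6, 1 + i)) for i in range(20)]
splits = temporal_split(patches)
print({k: len(v) for k, v in splits.items()})  # {'train': 14, 'val': 3, 'test': 3}
```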

3 — PyTorch dataset

from dl4eo.io import PatchDataset
from torch.utils.data import DataLoader

ds = PatchDataset(
    "/data/glacial_lakes",
    split="train",
    split_file="/data/glacial_lakes/splits.json",
    stats_file="/data/glacial_lakes/stats.json",
    norm="zscore",    # "zscore" | "minmax" | "percentile" | None
    bands=None,       # None = all bands; or e.g. [0, 1, 2, 6, 7]
)

sample = ds[0]
# sample["image"]  →  FloatTensor [C, H, W]
# sample["mask"]   →  LongTensor  [H, W]

loader = DataLoader(ds, batch_size=16, shuffle=True, num_workers=4)

PatchDataset inherits from torchgeo.datasets.NonGeoDataset when torchgeo is installed, and falls back to torch.utils.data.Dataset otherwise.

4 — Train a model (one-liner)

module = dl4eo.train(
    data_dir="/data/glacial_lakes",
    model="unet",            # see SUPPORTED_MODELS below
    backbone="resnet34",
    num_classes=2,
    split_strategy="temporal",
    norm="zscore",
    loss="dice_ce",          # "dice_ce" | "dice" | "ce" | "focal"
    batch_size=16,
    max_epochs=50,
    accelerator="gpu",
    devices=1,
)
# → auto-generates splits.json + stats.json if missing
# → saves best checkpoint (monitored on val/iou)
# → returns loaded SegmentationModule

5 — Evaluate and export predictions

# Option A — use the module returned directly from dl4eo.train()
report = dl4eo.eval.evaluate(
    module,
    data_dir        = "/data/glacial_lakes",
    splits          = ("val", "test"),
    class_names     = ["background", "lake"],
    output_dir      = "/data/glacial_lakes/eval",
    save_predictions= True,   # writes GeoTIFFs in original CRS
)

# Option B — reload a checkpoint later
module = dl4eo.eval.load_module(
    "checkpoints/unet/best-epoch=10.ckpt",
    model       = "unet",
    backbone    = "resnet34",
    in_channels = 10,
)
report = dl4eo.eval.evaluate(module, "/data/glacial_lakes")

evaluate() prints a formatted table and writes two report files to output_dir, plus prediction GeoTIFFs when save_predictions=True:

eval/
├── predictions/
│   ├── val/   *.tif   ← single-band uint8 GeoTIFF, original CRS + transform
│   └── test/  *.tif
├── eval_report.json   ← full metrics + confusion matrix
└── eval_report.txt    ← plain-text table for logging

Metrics: IoU · F1 · Precision · Recall (per class and as mean), plus Overall Accuracy and Cohen's Kappa (computed over all classes).
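
All of these metrics can be derived from a confusion matrix. The sketch below uses the standard textbook definitions. metrics_from_confusion is a hypothetical helper, and the layout of dl4eo's eval_report.json is not assumed here.

```python
from typing import Dict, List


def metrics_from_confusion(cm: List[List[int]]) -> Dict[str, object]:
    """Derive segmentation metrics from cm[true_class][predicted_class] pixel counts."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    row_sums = [sum(cm[c]) for c in range(n)]                       # pixels per true class
    col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]  # pixels per predicted class
    per_class = []
    for c in range(n):
        tp = cm[c][c]
        fp = col_sums[c] - tp
        fn = row_sums[c] - tp
        per_class.append({
            "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
            "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        })
    accuracy = sum(cm[c][c] for c in range(n)) / total
    # Agreement expected by chance, from the row and column marginals.
    chance = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (accuracy - chance) / (1 - chance) if chance != 1 else 0.0
    return {
        "per_class": per_class,
        "mean_iou": sum(m["iou"] for m in per_class) / n,
        "overall_accuracy": accuracy,
        "kappa": kappa,
    }


# Made-up counts: rows = true (background, lake), columns = predicted.
report = metrics_from_confusion([[90, 10], [5, 45]])
print(round(report["per_class"][1]["iou"], 3), round(report["kappa"], 3))  # 0.75 0.78
```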

6 — Build a model manually

from dl4eo.train import build_model, SegmentationModule, SegDataModule, SUPPORTED_MODELS
import lightning as L

print(SUPPORTED_MODELS)
# ['unet', 'unet++', 'deeplabv3+', 'fpn', 'pspnet', 'linknet', 'pan', 'manet',
#  'segformer', 'vit-tiny', 'vit-small', 'vit-base']

net    = build_model("segformer", in_channels=10, num_classes=2)
module = SegmentationModule(net, num_classes=2, lr=5e-4, loss="dice_ce")

dm = SegDataModule(
    data_dir   = "/data/glacial_lakes",
    split_file = "/data/glacial_lakes/splits.json",
    stats_file = "/data/glacial_lakes/stats.json",
    batch_size = 8,
)

trainer = L.Trainer(max_epochs=100, accelerator="gpu", devices=1)
trainer.fit(module, dm)

Pipeline stages

Stage  Description
1      Download Sentinel-2 L2A (STAC / Planetary Computer, cloud-filtered)
2      Preprocess S2: single-pass resample to 10 m + spectral index + stack
3      Generate patch AOIs: windowed reads, intersects user AOI polygon
4      Prepare DEM: one mosaic per scene, windowed reproject per patch
5      Prepare Sentinel-1 RTC: batched STAC search by date, VV+VH stack
6      Generate segmentation masks from label shapefile
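
Stage 2 appends a spectral index band. For reference, NDWI in its common (McFeeters) form is (Green - NIR) / (Green + NIR), which for Sentinel-2 means bands B03 and B08. Which variant and band pairing dl4eo uses is not stated here, so treat the sketch below as a generic normalized-difference example. normalized_difference is a hypothetical helper and the reflectance values are made up.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or 0.0 where the denominator is zero (e.g. nodata)."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


# Illustrative per-pixel values: water-like, vegetation-like, nodata.
green = [1200.0, 800.0, 0.0]
nir = [400.0, 2400.0, 0.0]

ndwi = [normalized_difference(g, n) for g, n in zip(green, nir)]
print(ndwi)  # [0.5, -0.5, 0.0]
```

The same function gives NDVI when called with (NIR, Red) instead of (Green, NIR).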

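Normalization is off by default in generate_dataset (normalize=False) and is not one of the pipeline stages. Instead, run dl4eo.stats.compute() on the training split and pass norm="zscore" to PatchDataset at load time. Using one set of training-split statistics for every patch avoids per-patch scale inconsistency and data leakage.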
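
To make the load-time normalization concrete, the sketch below applies global per-band statistics (in the stats.json shape shown in the Quick Start) to a handful of pixel values. normalize_band is a hypothetical helper, and the exact formula dl4eo uses for "percentile" is an assumption (clip to the 2nd and 98th percentiles, then scale to [0, 1]).

```python
from typing import Dict, List


def normalize_band(values: List[float], band_stats: Dict[str, float], norm: str = "zscore") -> List[float]:
    """Normalize one band with statistics computed on the training split only."""
    if norm == "zscore":
        std = band_stats["std"] or 1.0  # guard against constant bands
        return [(v - band_stats["mean"]) / std for v in values]
    if norm == "percentile":
        lo, hi = band_stats["p2"], band_stats["p98"]
        span = (hi - lo) or 1.0
        # Assumed behaviour: clip to [p2, p98], then rescale to [0, 1].
        return [min(1.0, max(0.0, (v - lo) / span)) for v in values]
    raise ValueError(f"unsupported norm: {norm!r}")


stats = {"band_1": {"mean": 6032.7, "std": 3471.1, "p2": 540.0, "p98": 11752.0}}
pixels = [540.0, 6032.7, 11752.0, 20000.0]

z = normalize_band(pixels, stats["band_1"], "zscore")
p = normalize_band(pixels, stats["band_1"], "percentile")
print(z[1], p[0], p[2], p[3])  # 0.0 0.0 1.0 1.0
```

Because the same statistics are applied to train, val, and test patches, a given reflectance value always maps to the same normalized value.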


Supported models

By default, all models are trained from scratch on arbitrary input channels (no dataset-specific pretrained weights).

Model       Family     Default backbone              Constraints
unet        SMP        resnet34
unet++      SMP        resnet34
deeplabv3+  SMP        resnet34                      batch_size ≥ 2 per GPU (BatchNorm)
fpn         SMP        resnet34
pspnet      SMP        resnet34                      batch_size ≥ 2 per GPU (BatchNorm)
linknet     SMP        resnet34
pan         SMP        resnet34                      input ≥ 128 px (pyramid pooling)
manet       SMP        resnet34
segformer   SegFormer  swin_tiny_patch4_window7_224
vit-tiny    ViT        vit_tiny_patch16_224
vit-small   ViT        vit_small_patch16_224
vit-base    ViT        vit_base_patch16_224

SMP models also support ImageNet-pretrained encoders for 3-channel input; pass weights="imagenet".

BatchNorm note: deeplabv3+ and pspnet will raise an error if a mini-batch contains only 1 sample. Ensure len(train_set) % batch_size != 1, or choose a batch_size that divides your training set evenly.
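
The condition in the note can be checked before training starts. The sketch below is a hypothetical pre-flight helper, not a dl4eo function. It also covers drop_last, a standard torch.utils.data.DataLoader option that discards an incomplete final batch; whether dl4eo.train() exposes that option is not stated here.

```python
def last_batch_size(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Size of the final mini-batch in one epoch (0 if there are no batches)."""
    full, remainder = divmod(n_samples, batch_size)
    if remainder and not drop_last:
        return remainder
    return batch_size if full else 0


def is_batchnorm_safe(n_samples: int, batch_size: int, drop_last: bool = False) -> bool:
    """True unless the final mini-batch would hold exactly one sample."""
    return last_batch_size(n_samples, batch_size, drop_last) != 1


print(is_batchnorm_safe(25, 8))                  # False: 25 % 8 == 1
print(is_batchnorm_safe(25, 5))                  # True: 5 divides 25 evenly
print(is_batchnorm_safe(25, 8, drop_last=True))  # True: the 1-sample batch is dropped
```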


Output structure

base_dir/
├── stack/               # Scene-level S2 stacks (bands + spectral index)
├── images/              # Clipped S2 patches
├── DEM/                 # Per-scene DEM mosaics + per-patch stacks
├── GRD/                 # Downloaded SAR granules (VV, VH)
├── Clipped_SAR/         # SAR reprojected to patch grid
├── stacked/             # S2 + DEM patches  (10 bands)
├── stacked_with_sar/    # S2 + DEM + SAR patches  (primary output)
├── mask/                # Binary (or multi-class) segmentation masks
├── AOI_boxes/           # Per-scene patch grid shapefiles
├── splits.json          # Train / val / test split (after dl4eo.splits)
├── stats.json           # Per-band statistics   (after dl4eo.stats)
└── valid_patches.txt    # QC-passing patch list  (after dl4eo.qc)
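
Before training, it can help to confirm which of these outputs already exist. The sketch below is a hypothetical convenience check built only on the names in the tree above. It is not a dl4eo function, and the list of paths to check is an illustrative choice.

```python
from pathlib import Path
from typing import Dict

# Names taken from the tree above; this particular selection is an assumption.
EXPECTED = ["stacked_with_sar", "mask", "splits.json", "stats.json", "valid_patches.txt"]


def check_outputs(base_dir: str) -> Dict[str, bool]:
    """Report which expected pipeline outputs exist under base_dir."""
    base = Path(base_dir)
    return {name: (base / name).exists() for name in EXPECTED}


for name, present in check_outputs("/data/glacial_lakes").items():
    print(f"{name:20s} {'ok' if present else 'missing'}")
```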

Input requirements

Parameter          Description
aoi_shapefile_dir  Folder containing one or more AOI .shp files (study area polygon)
feature_shapefile  Label vector file (e.g. lake outlines); used for mask generation and patch filtering
date_range         "YYYY-MM-DD/YYYY-MM-DD"

The AOI polygon controls which patches are generated. Only patches that intersect both the AOI and at least one label feature are kept.
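
The keep rule can be illustrated with axis-aligned bounding boxes. This is a deliberate simplification: the real pipeline presumably tests true polygon geometry, and keep_patch and the coordinates below are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def keep_patch(patch: Box, aoi: Box, labels: List[Box]) -> bool:
    """Keep a patch only if it meets the AOI and at least one label feature."""
    return boxes_intersect(patch, aoi) and any(boxes_intersect(patch, lb) for lb in labels)


aoi = (0.0, 0.0, 100.0, 100.0)
labels = [(10.0, 10.0, 20.0, 20.0)]

print(keep_patch((0.0, 0.0, 25.0, 25.0), aoi, labels))        # True: inside AOI, covers a label
print(keep_patch((50.0, 50.0, 75.0, 75.0), aoi, labels))      # False: inside AOI, no label
print(keep_patch((200.0, 200.0, 225.0, 225.0), aoi, labels))  # False: outside AOI
```

One consequence of this rule is that the generated dataset contains no all-background patches, which matters when interpreting class balance and the min_positive_fraction threshold in the QC step.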


Dependencies

Core (installed automatically): numpy, rasterio, geopandas, shapely, fiona, matplotlib, joblib, pystac-client, planetary-computer, requests, scipy

Training (pip install dl4eo[train]): torch>=2.0, lightning>=2.0, segmentation-models-pytorch>=0.3, timm>=0.9, torchmetrics>=1.0

Optional: torchgeo>=0.5 — enables NonGeoDataset base class for PatchDataset


Example use cases

  • Glacial lake mapping and segmentation
  • Flood extent extraction
  • Multimodal image fusion (S2 + S1 + DEM)
  • Patch-based dataset generation for semantic segmentation

Author

Developed by Saurabh Kaushik
Postdoctoral Researcher · University of Wisconsin–Madison
Earth Observation · Deep Learning · Geo-Foundational Models · Cryosphere


License

MIT License


Citation

If you use dl4eo in your research, please cite:

@misc{kaushik2026dl4eo,
  author       = {Saurabh Kaushik},
  title        = {{dl4eo: A Python package for multi-source Earth Observation dataset building and segmentation model training}},
  year         = {2026},
  howpublished = {\url{https://pypi.org/project/dl4eo/}},
}
