Constraint-aware synthetic geospatial data augmentation engine for GeoAI

GeoAugment

GeoAugment is a constraint-aware synthetic geospatial data generation engine designed to address data scarcity in GeoAI, particularly in flood risk analysis, urban systems, and environmental modeling in data-limited regions.

It generates physically plausible synthetic labels and features from limited geospatial inputs (e.g., digital elevation models, or DEMs), enabling robust training of downstream machine learning (ML) and deep learning (DL) models.

GeoAugment does not train machine learning models.
GeoAugment creates high-quality synthetic training data that can later be used by ML or DL frameworks such as PyTorch, TensorFlow, or scikit-learn.


Why GeoAugment Exists

In many regions (especially across the Global South):

  • Labeled flood-risk maps do not exist
  • Historical flood records are incomplete
  • Satellite imagery is noisy, cloudy, or sparse
  • Urban layouts differ significantly from Global North datasets
  • Ground truth data is expensive or unavailable
  • ML/DL models fail due to label scarcity, not model weakness

Most GeoAI pipelines silently assume that clean labels already exist.
GeoAugment exists to solve that upstream problem.


What GeoAugment Is (and Is Not)

GeoAugment IS

  • A synthetic data generator
  • A data augmentation engine
  • A pre-training / pre-modeling tool
  • A CLI + Python library
  • Domain-aware and physically constrained

GeoAugment IS NOT

  • A flood prediction model
  • A neural network
  • An end-to-end ML system
  • A replacement for PyTorch or TensorFlow

GeoAugment stops at data generation.


Core Concepts (Critical Definitions)

This section defines every major technical term used in GeoAugment.


Synthetic Data

Artificially generated data that statistically and structurally resembles real-world data.

In GeoAugment:

  • Synthetic data represents flood risk, susceptibility, or potential
  • It is generated from real geospatial inputs (e.g., DEMs)
  • It is used as training labels for ML models

Flood Risk (Continuous)

A continuous surface representing relative likelihood or severity of flooding.

  • Values typically in [0, 1]
  • 0 = very low risk
  • 1 = very high risk

GeoAugment always generates continuous flood risk first.

Binary flood maps are optional and derived later.
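
A minimal sketch of how a binary map might be derived from the continuous surface, assuming a percentile cutoff similar to the `risk_percentile` setting in the YAML example. The function `to_binary_mask` and the random stand-in surface are hypothetical and are not part of GeoAugment's API.

```python
import numpy as np

def to_binary_mask(risk, percentile=90.0):
    """Mark cells at or above the given percentile of risk as flooded (1)."""
    threshold = np.percentile(risk, percentile)
    return (risk >= threshold).astype(np.uint8)

# Stand-in for a continuous flood-risk surface with values in [0, 1].
rng = np.random.default_rng(42)
risk = rng.random((100, 100))

mask = to_binary_mask(risk, percentile=90.0)
print(mask.mean())  # fraction of cells labelled as flooded
```

Keeping the continuous surface and thresholding late means the same synthetic sample can serve both regression and classification tasks.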


Perturbation

A controlled modification applied to data to introduce variability.

In GeoAugment:

  • Perturbation simulates uncertainty and natural variability
  • Examples:
    • Slight elevation noise
    • Spatial variation in water accumulation
    • Random but constrained flood potential changes

Perturbations are never random noise alone; they are always constrained.
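
One way to picture a constrained perturbation is elevation noise whose magnitude is capped. The sketch below is illustrative only: `perturb_elevation`, `max_shift`, and the synthetic DEM are assumptions, and GeoAugment's own perturbation logic may differ.

```python
import numpy as np

def perturb_elevation(dem, strength=0.15, max_shift=0.3, seed=42):
    """Add Gaussian elevation noise, capped so no cell moves more than max_shift."""
    rng = np.random.default_rng(seed)               # fixed seed keeps runs reproducible
    noise = rng.normal(0.0, strength, size=dem.shape)
    noise = np.clip(noise, -max_shift, max_shift)   # the constraint: bounded change
    return dem + noise

# Hypothetical DEM: elevations rising smoothly from 0 to 50 (arbitrary units).
dem = np.linspace(0.0, 50.0, 64 * 64).reshape(64, 64)
perturbed = perturb_elevation(dem)
```

The cap guarantees that every perturbed sample stays close to the measured terrain, and the seed makes each sample reproducible.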


Latent Field

An internal, hidden spatial field that represents flood-driving forces.

Think of it as:

“Flood potential before we observe it.”

Examples:

  • Accumulation tendency
  • Drainage inefficiency
  • Subsurface water pressure

Latent fields are later constrained and calibrated into usable flood risk.


Spatial Scale

The characteristic size of spatial patterns, measured in pixels.

  • Small scale → noisy, localized patterns
  • Large scale → smooth, broad flood zones

Used to ensure realism:

  • Floods are spatially coherent
  • No pixel-level randomness
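
To make the link between a latent field and spatial scale concrete, the sketch below smooths white noise with a Gaussian low-pass filter in the frequency domain. This is an assumed technique chosen for illustration, not necessarily the one GeoAugment uses; `latent_field` and `roughness` are hypothetical names.

```python
import numpy as np

def latent_field(shape, spatial_scale=30.0, seed=42):
    """White noise smoothed to a characteristic length of about spatial_scale pixels."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(shape)
    # Frequency grids in cycles per pixel.
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    # Fourier transform of a Gaussian kernel with standard deviation spatial_scale.
    kernel = np.exp(-2.0 * (np.pi * spatial_scale) ** 2 * (fx ** 2 + fy ** 2))
    # FFT filtering wraps around the raster edges (periodic boundary).
    smooth = np.fft.ifft2(np.fft.fft2(noise) * kernel).real
    return (smooth - smooth.min()) / (smooth.max() - smooth.min())  # rescale to [0, 1]

def roughness(field):
    """Mean absolute difference between horizontally adjacent cells."""
    return np.abs(np.diff(field, axis=1)).mean()

fine = latent_field((128, 128), spatial_scale=2.0)    # small scale: localized patterns
broad = latent_field((128, 128), spatial_scale=30.0)  # large scale: broad zones
print(roughness(fine), roughness(broad))
```

A larger `spatial_scale` suppresses more high-frequency content, so neighbouring cells change more gradually.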

Constraint

A rule that synthetic data must obey.

GeoAugment enforces:

  • Physical constraints (e.g. water flows downhill)
  • Statistical constraints (e.g. bounded risk values)
  • Spatial constraints (e.g. smoothness)

Constraints prevent hallucinated or impossible outputs.
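
As a rough illustration of the statistical and spatial constraints, the sketch below applies a mean filter and then clips values to [0, 1]. The helper names are hypothetical, and the kernel size of 5 simply mirrors `smoothness_kernel_size` from the YAML example; the library's own enforcement may work differently.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def enforce_smoothness(field, kernel_size=5):
    """Mean-filter a 2D field with a square kernel; edges are padded by reflection."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    pad = kernel_size // 2
    padded = np.pad(field, pad, mode="reflect")
    windows = sliding_window_view(padded, (kernel_size, kernel_size))
    return windows.mean(axis=(-2, -1))

def enforce_bounds(field, low=0.0, high=1.0):
    """Clip values so the field stays inside [low, high]."""
    return np.clip(field, low, high)

rng = np.random.default_rng(0)
raw = rng.normal(0.5, 0.5, size=(64, 64))   # unconstrained values, some outside [0, 1]
smoothed = enforce_smoothness(raw, kernel_size=5)
constrained = enforce_bounds(smoothed)
```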


Downhill Bias

A constraint that increases flood risk at lower elevations.

Without this:

  • High elevations might appear flood-prone
  • Outputs become physically implausible
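
A downhill bias could be sketched as a weighted blend of a latent field with inverted, normalized elevation. The formula below is an assumption made for illustration; `apply_downhill_bias` is a hypothetical helper, and `downhill_weight` only borrows its name from the YAML example.

```python
import numpy as np

def apply_downhill_bias(latent, dem, downhill_weight=1.0):
    """Blend a latent field in [0, 1] with 'lowness' so low ground scores higher."""
    span = dem.max() - dem.min()
    if span == 0:
        return latent                        # flat terrain carries no elevation signal
    lowness = 1.0 - (dem - dem.min()) / span  # 1 = lowest cell, 0 = highest cell
    # Dividing by (1 + weight) keeps the result inside [0, 1].
    return (latent + downhill_weight * lowness) / (1.0 + downhill_weight)

rng = np.random.default_rng(1)
latent = rng.random((50, 50))
# Hypothetical DEM: elevation falls from 100 on the left edge to 0 on the right edge.
dem = np.tile(np.linspace(100.0, 0.0, 50), (50, 1))

risk = apply_downhill_bias(latent, dem, downhill_weight=1.0)
print(risk[:, :10].mean(), risk[:, -10:].mean())  # high ground vs. low ground
```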

Calibration

The process of normalizing and scaling synthetic outputs.

Examples:

  • Mapping raw flood potential to [0, 1]
  • Aligning outputs to a percentile (e.g. top 10% = high risk)

Calibration makes outputs ML-ready.
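
A simple calibration step might look like the sketch below: min-max scaling to [0, 1], followed by a lookup of the value that separates the top 10% of cells. The function `calibrate` is hypothetical, and this is only one of several reasonable calibration schemes.

```python
import numpy as np

def calibrate(raw, risk_percentile=90.0):
    """Min-max scale to [0, 1] and return the value at the chosen percentile."""
    span = raw.max() - raw.min()
    scaled = (raw - raw.min()) / span if span > 0 else np.zeros_like(raw, dtype=float)
    high_risk_cutoff = np.percentile(scaled, risk_percentile)
    return scaled, high_risk_cutoff

rng = np.random.default_rng(7)
raw_potential = rng.normal(3.0, 2.0, size=(80, 80))   # arbitrary raw flood potential

risk, cutoff = calibrate(raw_potential, risk_percentile=90.0)
high_risk_share = (risk >= cutoff).mean()              # about 10% of cells
```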


Tile-Based Dataset Generation

Large rasters are split into fixed-size tiles for ML training.

Benefits:

  • GPU compatibility
  • Memory efficiency
  • Standard ML input sizes

GeoAugment supports overlap to reduce edge artifacts.
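
Overlapping tiles can be produced with a stride smaller than the tile size. The following is a minimal sketch rather than GeoAugment's implementation; `tile_raster` and its parameters are hypothetical, and partial tiles at the raster edge are simply dropped.

```python
import numpy as np

def tile_raster(raster, tile_size=64, overlap=16):
    """Split a 2D raster into square tiles; neighbouring tiles share `overlap` pixels."""
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must satisfy 0 <= overlap < tile_size")
    if min(raster.shape) < tile_size:
        raise ValueError("raster is smaller than one tile")
    stride = tile_size - overlap
    tiles = []
    for row in range(0, raster.shape[0] - tile_size + 1, stride):
        for col in range(0, raster.shape[1] - tile_size + 1, stride):
            tiles.append(raster[row:row + tile_size, col:col + tile_size])
    return np.stack(tiles)

raster = np.arange(256 * 256, dtype=np.float32).reshape(256, 256)
tiles = tile_raster(raster, tile_size=64, overlap=16)
print(tiles.shape)  # (number of tiles, tile_size, tile_size)
```

With a 256 x 256 raster, 64-pixel tiles, and a 16-pixel overlap, the stride is 48 pixels, giving five tile positions along each axis.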


Dry-Run Mode

A validation-only execution mode.

When enabled:

  • YAML config is loaded
  • All parameters are validated
  • No input data (such as the DEM) is read
  • No computation is performed

Used for:

  • Debugging configs
  • CI pipelines
  • Safe experimentation
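
Conceptually, a dry run amounts to checking parameters without touching any raster. The sketch below validates an already-parsed config dictionary; the key names come from the YAML example, while `validate_config` and the accepted ranges are assumptions, not GeoAugment's actual rules.

```python
def validate_config(config):
    """Return a list of problems; an empty list means the config passed."""
    errors = []
    synthesis = config.get("synthesis", {})
    # Assumed validity rules for each synthesis parameter.
    checks = {
        "perturbation_strength": lambda v: v >= 0,
        "spatial_scale": lambda v: v > 0,
        "risk_percentile": lambda v: 0 < v < 100,
    }
    for key, is_valid in checks.items():
        value = synthesis.get(key)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"synthesis.{key}: expected a number, got {value!r}")
        elif not is_valid(value):
            errors.append(f"synthesis.{key}: value {value!r} is out of range")
    return errors

good_config = {"synthesis": {"perturbation_strength": 0.15,
                             "spatial_scale": 30,
                             "risk_percentile": 90}}
bad_config = {"synthesis": {"perturbation_strength": -1, "spatial_scale": 0}}

print(validate_config(good_config))
for problem in validate_config(bad_config):
    print(problem)
```

Collecting all problems before reporting, instead of stopping at the first one, suits CI pipelines where a single run should surface every config error.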

YAML Configuration

A human-readable file format for defining parameters.

Why YAML:

  • Versionable
  • Shareable
  • Reproducible
  • Safer than long CLI commands

Flood Synthesis Pipeline (High-Level)

GeoAugment flood generation follows four explicit stages:

  1. Latent Field Generation
  2. Constraint Enforcement
  3. Calibration
  4. Dataset Export

Each stage is modular and inspectable.


Tile-Based Dataset Generation

Large rasters are split into fixed-size overlapping tiles to:

  • Fit ML/DL input requirements
  • Increase dataset size
  • Preserve spatial locality

Architecture Overview

DEM (.tif)
   ↓
Feature Extraction
   ↓
Constraint-Aware Synthesis
   ↓
Continuous Flood Risk
   ↓
Thresholding (optional)
   ↓
Tile Generation
   ↓
Export (NumPy / PyTorch)

Installation

pip install geoaugment

Command-Line Usage

Validate Configuration (Dry-Run)

geoaugment floods generate --config flood.yaml --dry-run

Output formats

  • npz → NumPy-based pipelines
  • torch → PyTorch training pipelines
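
For orientation, this is roughly how tiled arrays can be written to and read from an `.npz` archive with plain NumPy. The array names `dem` and `risk` and the file name are assumptions; the layout of the files GeoAugment actually writes is not documented here and may differ.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
dem_tiles = rng.random((4, 64, 64)).astype(np.float32)    # stand-in feature tiles
risk_tiles = rng.random((4, 64, 64)).astype(np.float32)   # stand-in label tiles

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "flood_tiles.npz")
    # Store both arrays in one compressed archive under assumed keys.
    np.savez_compressed(path, dem=dem_tiles, risk=risk_tiles)
    # Read them back by key; the archive is closed when the block exits.
    with np.load(path) as archive:
        loaded_dem = archive["dem"]
        loaded_risk = archive["risk"]

print(loaded_dem.shape, loaded_risk.dtype)
```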

Python Usage

from geo_augment.domains.floods.api import synthesize_flood_risk

# dem: elevation raster loaded beforehand (assumed here to be a 2D NumPy array)
synthetic_risk = synthesize_flood_risk(dem, n_samples=3)

Use the output in:

  • PyTorch
  • TensorFlow
  • scikit-learn
  • XGBoost
  • Any GeoAI pipeline

Generate Dataset

geoaugment floods generate \
  --dem dem.tif \
  --out ./dataset \
  --config flood.yaml

YAML Configuration Example

synthesis:
  perturbation_strength: 0.15
  spatial_scale: 30
  risk_percentile: 90
  random_seed: 42

constraints:
  enforce_bounds: true
  enforce_monotonic_downhill: true
  enforce_spatial_smoothness: true
  smoothness_kernel_size: 5
  downhill_weight: 1.0

latent:
  noise_type: gaussian
  normalize: true
  apply_low_frequency_bias: true

Python API Usage

from geo_augment.domains.floods.api import synthesize_flood_labels
from geo_augment.domains.floods.spec import (
    FloodSynthesisSpec,
    FloodConstraints,
    LatentFloodFieldSpec,
)

labels = synthesize_flood_labels(
    dem,
    synthesis_spec=FloodSynthesisSpec(
        perturbation_strength=0.15,
        spatial_scale=30
    ),
    constraints=FloodConstraints(),
    latent_spec=LatentFloodFieldSpec(),
)

Evaluation Utilities

GeoAugment includes basic evaluation helpers:

  • Distribution statistics
  • Visual sanity checks

from geo_augment.evaluation import summarize_distribution, plot_risk_surface
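
The signatures of these helpers are not shown here, so the sketch below only illustrates the kind of distribution statistics that are useful for a sanity check. `describe_risk` is a hypothetical stand-in written with plain NumPy, not the library's `summarize_distribution`.

```python
import numpy as np

def describe_risk(risk, high_risk_cutoff=0.9):
    """Summarize a risk surface: range, central tendency, and high-risk share."""
    return {
        "min": float(risk.min()),
        "max": float(risk.max()),
        "mean": float(risk.mean()),
        "std": float(risk.std()),
        "p50": float(np.percentile(risk, 50)),
        "p90": float(np.percentile(risk, 90)),
        "high_risk_fraction": float((risk >= high_risk_cutoff).mean()),
    }

rng = np.random.default_rng(3)
sample_risk = rng.random((64, 64))   # stand-in for a synthetic risk surface
stats = describe_risk(sample_risk)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Values outside [0, 1], or a high-risk fraction far from the configured percentile, would suggest that calibration did not behave as intended.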

Design Philosophy

  • Explicit over implicit
  • Constraints over randomness
  • Data first, models later
  • Public, reproducible, inspectable

GeoAugment is meant to be infrastructure, not a black box.

Roadmap (What to expect next)

  • Flood domain (current)
  • Road connectivity synthesis
  • Urban morphology synthesis
  • TensorFlow export
  • GeoTIFF export
  • R bindings

Current Domains

  • ✅ Flood Risk (v0.1.0)
  • ⏳ Road Connectivity (planned)
  • ⏳ Urban Morphology (planned)

Who Should Use GeoAugment

  • GeoAI researchers
  • Data scientists in climate, urban planning, disaster risk
  • ML engineers facing geospatial data scarcity
  • Public-sector analytics teams
  • Researchers working with satellite or drone imagery

LICENSE

MIT License

Author

Chidiebere V. Christopher (Data Scientist, Machine Learning Researcher)

Citation

If you use GeoAugment in academic or applied work, please cite the repository.

