Constraint-aware synthetic geospatial data augmentation engine for GeoAI
GeoAugment
GeoAugment is a constraint-aware synthetic geospatial data generation engine designed to address data scarcity in GeoAI, particularly in flood risk analysis, urban systems, and road networks in data-limited regions.
It generates physically plausible synthetic labels and features from limited geospatial inputs (e.g., DEMs), enabling robust training of downstream Machine Learning (ML) and Deep Learning (DL) models.
Developed by Chidiebere V. Christopher, GeoAugment creates high-quality synthetic training data that can later be used by ML or DL frameworks such as PyTorch, TensorFlow, or scikit-learn.
Why GeoAugment Exists
In many regions (especially across the Global South):
- Labeled flood-risk maps do not exist
- Historical flood records are incomplete
- Satellite imagery is noisy, cloudy, or sparse
- Urban layouts differ significantly from those represented in Global North datasets
- Ground truth data is expensive or unavailable
- ML/DL models fail due to label scarcity, not model weakness
Most GeoAI pipelines silently assume that clean labels already exist.
GeoAugment exists to solve that upstream problem.
What GeoAugment Is (and Is Not)
GeoAugment IS
- A synthetic data generator
- A data augmentation engine
- A pre-training / pre-modeling tool
- A CLI + Python library
- Domain-aware and physically constrained
GeoAugment IS NOT
- A flood prediction model
- A neural network
- An end-to-end ML system
- A replacement for PyTorch or TensorFlow
GeoAugment stops at data generation.
Core Concepts (Critical Definitions)
This section defines every major technical term used in GeoAugment.
Synthetic Data
Artificially generated data that statistically and structurally resembles real-world data.
In GeoAugment:
- Synthetic data represents flood risk, susceptibility, or potential
- Generated from real geospatial inputs (e.g. DEMs)
- Used as training labels for ML models
Flood Risk (Continuous)
A continuous surface representing relative likelihood or severity of flooding.
- Values typically in [0, 1], where 0 = very low risk and 1 = very high risk
GeoAugment always generates continuous flood risk first.
Binary flood maps are optional and derived later.
Perturbation
A controlled modification applied to data to introduce variability.
In GeoAugment:
- Perturbation simulates uncertainty and natural variability
- Examples:
  - Slight elevation noise
  - Spatial variation in water accumulation
  - Random but constrained flood potential changes
Perturbations are never pure random noise; they are always constrained.
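A minimal sketch of what a constrained perturbation could look like, assuming the risk surface is a NumPy array with values in [0, 1]. The function `perturb_risk` and its parameters are hypothetical illustrations, not part of GeoAugment's API; `strength` only loosely mirrors the `perturbation_strength` setting in the configuration example.

```python
import numpy as np

def perturb_risk(risk, strength=0.15, seed=42):
    """Add spatially smoothed noise scaled by `strength`, then clip to [0, 1]."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(risk.shape)
    # Average each pixel with its 3x3 neighborhood so nearby pixels vary together.
    padded = np.pad(noise, 1, mode="edge")
    smooth = sum(
        padded[i:i + risk.shape[0], j:j + risk.shape[1]]
        for i in range(3) for j in range(3)
    ) / 9.0
    # The clip is the statistical constraint: values never leave [0, 1].
    return np.clip(risk + strength * smooth, 0.0, 1.0)

base = np.linspace(0.0, 1.0, 64).reshape(8, 8)
perturbed = perturb_risk(base)
```

Fixing the seed makes the perturbation reproducible, which matters when a synthetic dataset has to be regenerated later.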
Latent Field
An internal, hidden spatial field that represents flood-driving forces.
Think of it as:
“Flood potential before we observe it.”
Examples:
- Accumulation tendency
- Drainage inefficiency
- Subsurface water pressure
Latent fields are later constrained and calibrated into usable flood risk.
Spatial Scale
The characteristic size of spatial patterns, measured in pixels.
- Small scale → noisy, localized patterns
- Large scale → smooth, broad flood zones
Used to ensure realism:
- Floods are spatially coherent
- No pixel-level randomness
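To illustrate how a spatial scale controls coherence, the sketch below low-pass filters white noise with a Gaussian kernel applied in the frequency domain. This is one common way to build smooth random fields and is an assumption here, not GeoAugment's confirmed method; `smooth_noise` and `roughness` are hypothetical names.

```python
import numpy as np

def smooth_noise(shape, spatial_scale, seed=0):
    """White noise smoothed by a Gaussian whose sigma is `spatial_scale` pixels."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(shape)
    # Frequency grids in cycles per pixel.
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    # Fourier transform of a Gaussian kernel; FFT filtering wraps around the edges.
    kernel = np.exp(-2.0 * (np.pi * spatial_scale) ** 2 * (fx ** 2 + fy ** 2))
    field = np.fft.ifft2(np.fft.fft2(white) * kernel).real
    # Standardize so fields generated at different scales are comparable.
    return (field - field.mean()) / field.std()

def roughness(field):
    """Mean absolute difference between horizontally adjacent pixels."""
    return float(np.abs(np.diff(field, axis=1)).mean())

fine = smooth_noise((128, 128), spatial_scale=1)
coarse = smooth_noise((128, 128), spatial_scale=10)
```

Comparing `roughness(fine)` with `roughness(coarse)` shows the larger scale producing a much smoother field, which is the behavior the definition above describes.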
Constraint
A rule that synthetic data must obey.
GeoAugment enforces:
- Physical constraints (e.g. water flows downhill)
- Statistical constraints (e.g. bounded risk values)
- Spatial constraints (e.g. smoothness)
Constraints prevent hallucinated or impossible outputs.
Downhill Bias
A constraint that increases flood risk at lower elevations.
Without this:
- High elevations might appear flood-prone
- Outputs become physically implausible
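A simple additive form of downhill bias can be sketched as follows, assuming the DEM and the latent field are NumPy arrays of the same shape. The function `apply_downhill_bias` is hypothetical, and `downhill_weight` is borrowed from the configuration example only as a plausible name; how GeoAugment actually applies the bias is not documented here.

```python
import numpy as np

def apply_downhill_bias(latent, dem, downhill_weight=1.0):
    """Raise flood potential where elevation is low relative to the DEM range."""
    span = dem.max() - dem.min()
    if span == 0:
        # A flat DEM carries no elevation signal, so leave the field unchanged.
        return latent.copy()
    relative_elevation = (dem - dem.min()) / span   # 0 at lowest, 1 at highest cell
    lowness = 1.0 - relative_elevation
    return latent + downhill_weight * lowness

# Elevation falls from 100 m on the left edge to 0 m on the right edge.
dem = np.tile(np.linspace(100.0, 0.0, 5), (5, 1))
latent = np.full(dem.shape, 0.5)
biased = apply_downhill_bias(latent, dem)
```

With a uniform latent field the biased output increases steadily toward the low side of the DEM. With a non-uniform latent field an additive bias only shifts the balance toward low ground; it does not by itself guarantee strict monotonicity.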
Calibration
The process of normalizing and scaling synthetic outputs.
Example:
- Mapping raw flood potential to [0, 1]
- Aligning outputs to a percentile (e.g. top 10% = high risk)
Calibration makes outputs ML-ready.
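Both calibration steps can be sketched in a few lines, assuming min-max scaling and a percentile cut-off. The function `calibrate` is a hypothetical illustration rather than GeoAugment's implementation; its `risk_percentile` argument borrows the name used in the configuration example.

```python
import numpy as np

def calibrate(raw, risk_percentile=90):
    """Min-max scale `raw` into [0, 1] and flag cells at or above the percentile."""
    lo, hi = float(raw.min()), float(raw.max())
    if hi == lo:
        # A constant field has no relative ordering; map it to zero risk.
        scaled = np.zeros_like(raw, dtype=float)
    else:
        scaled = (raw - lo) / (hi - lo)
    threshold = np.percentile(scaled, risk_percentile)
    high_risk = scaled >= threshold
    return scaled, high_risk

# 100 distinct raw values on an arbitrary scale.
raw = np.arange(100, dtype=float).reshape(10, 10) * 3.0 - 50.0
scaled, high_risk = calibrate(raw)
```

The continuous `scaled` surface is the primary output; the boolean `high_risk` mask is the optional binary map derived from it, here marking the top 10% of cells.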
Tile-Based Dataset Generation
Large rasters are split into fixed-size tiles for ML training.
Benefits:
- GPU compatibility
- Memory efficiency
- Standard ML input sizes
GeoAugment supports overlap to reduce edge artifacts.
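The following sketch shows one way to cut overlapping tiles from a 2-D raster, assuming the raster is at least one tile wide and high. The helpers `tile_origins` and `make_tiles`, and the `tile` and `overlap` parameters, are hypothetical and do not describe GeoAugment's own tiling code.

```python
import numpy as np

def tile_origins(length, tile, stride):
    """Start offsets along one axis; the last tile is shifted back to fit the raster."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)   # cover the trailing edge
    return starts

def make_tiles(raster, tile=64, overlap=16):
    """Cut a 2-D raster into tile x tile windows that overlap by `overlap` pixels."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap
    rows = tile_origins(raster.shape[0], tile, stride)
    cols = tile_origins(raster.shape[1], tile, stride)
    return np.stack([raster[r:r + tile, c:c + tile] for r in rows for c in cols])

raster = np.arange(160 * 160, dtype=np.float32).reshape(160, 160)
tiles = make_tiles(raster)   # shape: (number of tiles, 64, 64)
```

Shifting the final tile back instead of padding keeps every tile filled with real data, at the cost of a larger overlap along the last row and column.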
Dry-Run Mode
A validation-only execution mode.
When enabled:
- YAML config is loaded
- All parameters are validated
- No data is read
- No computation is performed
Used for:
- Debugging configs
- CI pipelines
- Safe experimentation
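Conceptually, a dry run checks an already-parsed configuration and returns a report without touching any raster. The sketch below assumes the config has been loaded into a plain dictionary; the functions `validate_config` and `dry_run` and the accepted value ranges are illustrative assumptions, not GeoAugment's actual validation rules.

```python
def _is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def validate_config(config):
    """Return a list of problems found in a parsed config; empty means valid."""
    errors = []
    synthesis = config.get("synthesis", {})
    strength = synthesis.get("perturbation_strength")
    if not _is_number(strength) or not 0 <= strength <= 1:
        errors.append("synthesis.perturbation_strength must be a number in [0, 1]")
    scale = synthesis.get("spatial_scale")
    if not isinstance(scale, int) or isinstance(scale, bool) or scale < 1:
        errors.append("synthesis.spatial_scale must be a positive integer")
    percentile = synthesis.get("risk_percentile")
    if not _is_number(percentile) or not 0 < percentile < 100:
        errors.append("synthesis.risk_percentile must lie strictly between 0 and 100")
    return errors

def dry_run(config):
    """Validate only: no raster is opened and nothing is synthesized."""
    errors = validate_config(config)
    return {"valid": not errors, "errors": errors}

good = {"synthesis": {"perturbation_strength": 0.15, "spatial_scale": 30, "risk_percentile": 90}}
bad = {"synthesis": {"perturbation_strength": 1.5, "spatial_scale": 0}}
report_good = dry_run(good)
report_bad = dry_run(bad)
```

Collecting every problem before reporting, rather than failing on the first one, suits CI pipelines where a single run should surface all config mistakes.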
YAML Configuration
A human-readable file format for defining parameters.
Why YAML:
- Versionable
- Shareable
- Reproducible
- Safer than long CLI commands
Flood Synthesis Pipeline (High-Level)
GeoAugment flood generation follows four explicit stages:
- Latent Field Generation
- Constraint Enforcement
- Calibration
- Dataset Export
Each stage is modular and inspectable.
Tile-Based Dataset Generation
Large rasters are split into fixed-size overlapping tiles to:
- Fit ML/DL input requirements
- Increase dataset size
- Preserve spatial locality
Architecture Overview
DEM (.tif)
↓
Feature Extraction
↓
Constraint-Aware Synthesis
↓
Continuous Flood Risk
↓
Thresholding (optional)
↓
Tile Generation
↓
Export (NumPy / PyTorch)
Installation
pip install geoaugment
Command-Line Usage
Validate Configuration (Dry-Run)
geoaugment floods generate --config flood.yaml --dry-run
Output formats
- npz → NumPy-based pipelines
- torch → PyTorch training pipelines
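For the npz format, a round trip with NumPy looks roughly like this. The array keys `risk` and `labels` are assumptions made for illustration, since the layout GeoAugment writes is not described here, and an in-memory buffer stands in for a file on disk.

```python
import io
import numpy as np

# Stand-in data: four 64x64 tiles of continuous risk and a derived binary mask.
tiles = np.random.default_rng(0).random((4, 64, 64), dtype=np.float32)
labels = tiles >= 0.9

buffer = io.BytesIO()   # replace with a path such as "dataset.npz" to write a file
np.savez_compressed(buffer, risk=tiles, labels=labels)
buffer.seek(0)

# np.load returns a lazy archive; arrays are read when accessed by key.
with np.load(buffer) as archive:
    risk_loaded = archive["risk"]
    labels_loaded = archive["labels"]
```

Arrays loaded this way can be passed directly to scikit-learn or wrapped with `torch.from_numpy` for a PyTorch pipeline.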
Python Usage
from geo_augment.domains.floods.api import synthesize_flood_risk
synthetic_risk = synthesize_flood_risk(dem, n_samples=3)
Use the output in:
- PyTorch
- TensorFlow
- scikit-learn
- XGBoost
- Any GeoAI pipeline
Generate Dataset
geoaugment floods generate \
  --dem dem.tif \
  --out ./dataset \
  --config flood.yaml
YAML Configuration Example
synthesis:
  perturbation_strength: 0.15
  spatial_scale: 30
  risk_percentile: 90
  random_seed: 42
constraints:
  enforce_bounds: true
  enforce_monotonic_downhill: true
  enforce_spatial_smoothness: true
  smoothness_kernel_size: 5
  downhill_weight: 1.0
latent:
  noise_type: gaussian
  normalize: true
  apply_low_frequency_bias: true
Python API Usage
from geo_augment.domains.floods.api import synthesize_flood_labels
from geo_augment.domains.floods.spec import (
    FloodSynthesisSpec,
    FloodConstraints,
    LatentFloodFieldSpec,
)

labels = synthesize_flood_labels(
    dem,
    synthesis_spec=FloodSynthesisSpec(
        perturbation_strength=0.15,
        spatial_scale=30,
    ),
    constraints=FloodConstraints(),
    latent_spec=LatentFloodFieldSpec(),
)
Evaluation Utilities
GeoAugment includes basic evaluation helpers:
- Distribution statistics
- Visual sanity checks
from geo_augment.evaluation import summarize_distribution, plot_risk_surface
Design Philosophy
- Explicit over implicit
- Constraints over randomness
- Data first, models later
- Public, reproducible, inspectable
GeoAugment is meant to be infrastructure, not a black box.
Roadmap (What to expect next)
- Flood domain (current)
- Road connectivity synthesis
- Urban morphology synthesis
- TensorFlow export
- GeoTIFF export
- R bindings
Current Domains
- ✅ Flood Risk (v0.1.0)
- ⏳ Road Connectivity (planned)
- ⏳ Urban Morphology (planned)
Who Should Use GeoAugment
- GeoAI researchers
- Data scientists in climate, urban planning, disaster risk
- ML engineers facing geospatial data scarcity
- Public-sector analytics teams
- Researchers working with satellite or drone imagery
LICENSE
MIT License
Author
Chidiebere V. Christopher (Data Scientist, Machine Learning Researcher)
Citation
If you use GeoAugment in academic or applied work, please cite the repository.