Geospatial Species Distribution Modeling with Ensemble Learning and Reinforcement Learning-based Threshold Optimization

GeoXERL

Geospatial species distribution modeling with eXtreme Ensemble methods and Reinforcement Learning-based threshold optimization.


Overview

GeoXERL is a modular Python toolkit for species distribution modeling (SDM) and geospatial prediction tasks. It combines:

  1. Multi-step data preprocessing — environment variable extraction, presence-point processing, background-point generation, dataset splitting, and feature-stack preparation.
  2. Base model training & evaluation — unified interface for training and batch inference across multiple algorithms.
  3. Ensemble methods — Bagging, Boosting, Stacking, Geographically Weighted Random Forest (GWRF), and SHAP-based RL feature selection.
  4. Reinforcement-learning threshold optimization — Q-Learning and PPO agents that search for the optimal prediction threshold instead of using the default 0.5.

Installation

pip install geoxerl

Or install from source for the latest development version:

git clone https://github.com/wenshunzhang/GeoXERL.git
cd GeoXERL
pip install -e ".[dev]"

Requirements: Python >= 3.8, numpy, pandas, scikit-learn, rasterio, geopandas.

To install optional extras:

pip install "geoxerl[rl]"    # adds stable-baselines3 and gymnasium for PPO
pip install "geoxerl[docs]"  # adds Sphinx for building documentation

Quick start

Command line

# Run each step individually
geoxerl preprocess
geoxerl train
geoxerl ensemble --method stacking
geoxerl optimize

# Or run the full pipeline in one command
geoxerl run-all

# Check version
geoxerl --version

Python API

from geoxerl.data_preprocessing.main import main as preprocess
from geoxerl.base_models.train import main as train_models
from geoxerl.ensemble.stacking import main as run_ensemble
from geoxerl.threshold_optimization.q_main import main as optimize_threshold

# Step 1: preprocess raw environmental rasters
preprocess()

# Step 2: train base models
train_models()

# Step 3: build the ensemble
run_ensemble()

# Step 4: find the optimal prediction threshold via Q-Learning
optimize_threshold()

See the examples/ directory for ready-to-run scripts covering each stage.


Project structure

GeoXERL/
├── geoxerl/                          # Main package
│   ├── __init__.py
│   ├── __version__.py
│   ├── __main__.py                   # Enables python -m geoxerl
│   ├── cli.py                        # Command-line interface
│   ├── data_preprocessing/           # Steps 00-05: env vars -> feature stack
│   │   ├── 00_env_variables_preprocessing.py
│   │   ├── 01_env_variables_preprocessing.py
│   │   ├── 02_presence_points_processing.py
│   │   ├── 03_background_points_generation.py
│   │   ├── 04_dataset_splitting.py
│   │   ├── 05_prepare_feature_stack.py
│   │   ├── config.py
│   │   ├── main.py
│   │   └── utils.py
│   ├── base_models/                  # Model training, evaluation, batch inference
│   │   ├── models.py
│   │   ├── train.py
│   │   ├── evaluate.py
│   │   ├── batch_models.py
│   │   └── config.json
│   ├── ensemble/                     # Bagging, Boosting, Stacking, GWRF, PPO
│   │   ├── bagging.py
│   │   ├── boosting.py
│   │   ├── stacking.py
│   │   ├── gwrf.py
│   │   ├── gwrf_shap_analysis.py
│   │   ├── gwrf_shap_tif.py
│   │   ├── feature_selector_rl2.py
│   │   ├── ppo_main.py
│   │   ├── predict_gwrf.py
│   │   └── metrics.py
│   └── threshold_optimization/       # Q-Learning / PPO threshold search
│       ├── q_learning_optimizer.py
│       ├── q_main.py
│       ├── threshold_analyzer.py
│       ├── data_processor.py
│       ├── visualizer.py
│       └── config.py
├── tests/                            # Unit tests
├── examples/                         # Ready-to-run example scripts
├── docs/                             # Documentation
├── .github/workflows/                # CI/CD (tests + PyPI publish)
├── pyproject.toml
├── README.md
├── CHANGELOG.md
├── CONTRIBUTING.md
└── LICENSE

Module descriptions

data_preprocessing

Processes raw environmental raster layers and species occurrence records into a clean, analysis-ready dataset. Scripts are numbered 00-05 to indicate execution order; main.py runs them all in sequence.

Script   Purpose
00 / 01  Clip, reproject, and derive environmental variables from raw rasters
02       Filter and spatially thin species occurrence records
03       Generate background / pseudo-absence points
04       Split dataset into train / validation / test sets
05       Stack selected features into a single analysis-ready array
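Steps 03 and 04 can be illustrated with a minimal sketch. It is not the package's implementation: the bounding box, point counts, 70/15/15 split ratio, and the helper name sample_background are all assumptions made for illustration, and a real pipeline would typically also mask background points to valid raster cells.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)


def sample_background(bounds, n_points, rng):
    """Draw uniform random (lon, lat) points inside a bounding box (hypothetical helper)."""
    min_lon, min_lat, max_lon, max_lat = bounds
    lon = rng.uniform(min_lon, max_lon, n_points)
    lat = rng.uniform(min_lat, max_lat, n_points)
    return np.column_stack([lon, lat])


# Made-up study area and presence records, used only to make the sketch runnable.
bounds = (100.0, 25.0, 105.0, 30.0)
presence = rng.uniform([100.0, 25.0], [105.0, 30.0], size=(200, 2))
background = sample_background(bounds, 1000, rng)

# Label presences as 1 and background (pseudo-absence) points as 0.
coords = np.vstack([presence, background])
labels = np.concatenate([
    np.ones(len(presence), dtype=int),
    np.zeros(len(background), dtype=int),
])

# Stratified 70/15/15 split: first hold out 30%, then halve the held-out part.
coords_train, coords_tmp, y_train, y_tmp = train_test_split(
    coords, labels, test_size=0.30, stratify=labels, random_state=42
)
coords_val, coords_test, y_val, y_test = train_test_split(
    coords_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

print(len(coords_train), len(coords_val), len(coords_test))
```

Stratifying on the label keeps the presence-to-background ratio the same in all three subsets, which matters when presences are much rarer than background points.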

base_models

Provides a unified interface for fitting individual classifiers (train.py), computing the standard SDM metrics AUC, TSS, and Kappa (evaluate.py), and running inference over large raster stacks (batch_models.py).
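For reference, the three metrics can be computed with scikit-learn as in the sketch below. The function name sdm_metrics and the toy labels are illustrative assumptions, not the interface of evaluate.py. TSS is sensitivity + specificity - 1.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score


def sdm_metrics(y_true, y_prob, threshold=0.5):
    """Return AUC, TSS, and Cohen's kappa for binary presence/absence data.

    Hypothetical helper; assumes both classes are present in y_true.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)

    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    return {
        "auc": roc_auc_score(y_true, y_prob),      # threshold-independent
        "tss": sensitivity + specificity - 1.0,     # threshold-dependent
        "kappa": cohen_kappa_score(y_true, y_pred), # threshold-dependent
    }


# Toy example: 4 presences and 4 absences with made-up predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
metrics = sdm_metrics(y_true, y_prob)
print(metrics)
```

AUC is computed from the raw probabilities, whereas TSS and Kappa depend on the chosen threshold.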

ensemble

Implements three classical ensemble strategies (Bagging, Boosting, Stacking), the geospatially aware GWRF, and an RL-based feature selector:

Method                File                                   Notes
Bagging               bagging.py                             Bootstrap aggregation
Boosting              boosting.py                            Gradient boosting
Stacking              stacking.py                            Meta-learner on base model outputs
GWRF                  gwrf.py                                Geographically Weighted Random Forest with SHAP explainability
PPO feature selector  feature_selector_rl2.py / ppo_main.py  RL agent that learns which features to include
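To make the stacking row concrete, here is a minimal sketch using scikit-learn's StackingClassifier on synthetic data. The choice of base learners, the logistic-regression meta-learner, and all hyperparameters are assumptions for illustration; stacking.py may be organised quite differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a presence/background table of environmental features.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base models produce out-of-fold probabilities (cv=5); the meta-learner
# is then trained on those probabilities rather than on the raw features.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

# Predicted probability of presence, i.e. habitat suitability in SDM terms.
suitability = stack.predict_proba(X_test)[:, 1]
print(suitability[:5])
```

Training the meta-learner on out-of-fold predictions avoids leaking the base models' training fit into the second stage.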

threshold_optimization

Casts threshold selection as a reinforcement learning problem. The Q-Learning optimizer discretizes the threshold space into states and learns a policy through reward signals based on TSS / F1. threshold_analyzer.py and visualizer.py provide post-hoc analysis and plotting tools.
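The idea can be sketched with a small tabular Q-learning loop. This is an illustration under stated assumptions, not the code in q_learning_optimizer.py: the 19-point threshold grid, the three actions (lower, stay, raise), the TSS-only reward, the hyperparameters, and the function names are all hypothetical.

```python
import numpy as np


def tss(y_true, y_prob, threshold):
    """True skill statistic: sensitivity + specificity - 1."""
    y_pred = y_prob >= threshold
    pos = y_true == 1
    sensitivity = np.mean(y_pred[pos])
    specificity = np.mean(~y_pred[~pos])
    return sensitivity + specificity - 1.0


def q_learning_threshold(y_true, y_prob, n_states=19, episodes=300, steps=20,
                         alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Search a discretized threshold grid with tabular Q-learning (sketch)."""
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.05, 0.95, n_states)  # one state per threshold
    rewards = np.array([tss(y_true, y_prob, t) for t in thresholds])
    moves = np.array([-1, 0, 1])                    # lower, stay, raise
    q = np.zeros((n_states, len(moves)))

    for _ in range(episodes):
        state = int(rng.integers(n_states))         # random start each episode
        for _ in range(steps):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = int(rng.integers(len(moves)))
            else:
                action = int(np.argmax(q[state]))
            next_state = int(np.clip(state + moves[action], 0, n_states - 1))
            reward = rewards[next_state]            # TSS at the new threshold
            # Standard Q-learning temporal-difference update.
            td_target = reward + gamma * q[next_state].max()
            q[state, action] += alpha * (td_target - q[state, action])
            state = next_state

    # Roll out the greedy policy, starting from the default threshold 0.5.
    state = int(np.argmin(np.abs(thresholds - 0.5)))
    for _ in range(n_states):
        action = int(np.argmax(q[state]))
        state = int(np.clip(state + moves[action], 0, n_states - 1))
    return thresholds[state], q


# Synthetic, imbalanced scores: 100 presences and 900 background points.
data_rng = np.random.default_rng(1)
y_true = np.concatenate([np.ones(100, dtype=int), np.zeros(900, dtype=int)])
y_prob = np.concatenate([data_rng.beta(3, 4, 100), data_rng.beta(1, 6, 900)])

best_threshold, q_table = q_learning_threshold(y_true, y_prob)
print("threshold:", best_threshold, "TSS:", tss(y_true, y_prob, best_threshold))
print("TSS at 0.5:", tss(y_true, y_prob, 0.5))
```

On a one-dimensional grid this small, an exhaustive sweep over the thresholds would find the TSS maximum directly, so the sketch is mainly useful for showing how states, actions, and rewards map onto the threshold problem.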


Configuration

Each module has its own config file. Edit these before running to set your data paths and hyperparameters:

Module                  Config file
data_preprocessing      geoxerl/data_preprocessing/config.py
base_models             geoxerl/base_models/config.json
ensemble                geoxerl/ensemble/config.py
threshold_optimization  geoxerl/threshold_optimization/config.py

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for setup instructions, code style guidelines, and the pull request checklist.

# Set up development environment
git clone https://github.com/wenshunzhang/GeoXERL.git
cd GeoXERL
pip install -e ".[dev]"
pre-commit install

# Run tests
pytest tests/

Citation

If you use GeoXERL in your research, please cite:

@software{geoxerl2024,
  author  = {Zhang, Wenshun},
  title   = {GeoXERL: Geospatial Ensemble and Reinforcement Learning Toolkit for Species Distribution Modeling},
  year    = {2024},
  url     = {https://github.com/wenshunzhang/GeoXERL},
  version = {0.1.0}
}

License

MIT — see LICENSE for details.


Contact

Wenshun Zhang — zhangwenshun24@mails.ucas.ac.cn

University of Chinese Academy of Sciences
