Geospatial Species Distribution Modeling with Ensemble Learning and Reinforcement Learning-based Threshold Optimization
GeoXERL
Geospatial species distribution modeling with eXtreme Ensemble methods and Reinforcement Learning-based threshold optimization.
Overview
GeoXERL is a modular Python toolkit for species distribution modeling (SDM) and geospatial prediction tasks. It combines:
- Multi-step data preprocessing — environment variable extraction, presence-point processing, background-point generation, dataset splitting, and feature-stack preparation.
- Base model training & evaluation — unified interface for training and batch inference across multiple algorithms.
- Ensemble methods — Bagging, Boosting, Stacking, Geographically Weighted Random Forest (GWRF), and SHAP-based RL feature selection.
- Reinforcement-learning threshold optimization — Q-Learning and PPO agents that search for the optimal prediction threshold instead of using the default 0.5.
Installation
pip install geoxerl
Or install from source for the latest development version:
git clone https://github.com/wenshunzhang/GeoXERL.git
cd GeoXERL
pip install -e ".[dev]"
Requirements: Python >= 3.8, numpy, pandas, scikit-learn, rasterio, geopandas.
To install optional extras:
pip install geoxerl[rl] # adds stable-baselines3 and gymnasium for PPO
pip install geoxerl[docs] # adds Sphinx for building documentation
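After installing, a quick environment check can confirm that the requirements and optional extras listed above are importable. This is a minimal sketch using only the standard library; the helper names `missing_modules` and `check_environment` are illustrative and not part of GeoXERL.

```python
import sys
from importlib.util import find_spec

# Import names for the requirements listed above (scikit-learn imports as "sklearn").
REQUIRED = ["numpy", "pandas", "sklearn", "rasterio", "geopandas"]
# Import names for the optional [rl] extra.
OPTIONAL_RL = ["stable_baselines3", "gymnasium"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


def check_environment():
    """Summarize whether the interpreter and packages meet the stated requirements."""
    return {
        "python_ok": sys.version_info >= (3, 8),
        "missing_required": missing_modules(REQUIRED),
        "missing_rl": missing_modules(OPTIONAL_RL),
    }


if __name__ == "__main__":
    print(check_environment())
```

An empty `missing_required` list means the core dependencies are present; a non-empty `missing_rl` list only matters if you plan to use the PPO components.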
Quick start
Command line
# Run each step individually
geoxerl preprocess
geoxerl train
geoxerl ensemble --method stacking
geoxerl optimize
# Or run the full pipeline in one command
geoxerl run-all
# Check version
geoxerl --version
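If you prefer to drive the individual steps from a script, for example to stop at the first failing stage, the commands above can be wrapped with `subprocess`. The sketch below assumes the `geoxerl` executable is on your PATH; `build_commands` and `run_pipeline` are hypothetical helper names, and only the subcommands shown above are used.

```python
import shlex
import subprocess

# Subcommands exactly as listed in the command-line quick start.
PIPELINE_STEPS = [
    "geoxerl preprocess",
    "geoxerl train",
    "geoxerl ensemble --method stacking",
    "geoxerl optimize",
]


def build_commands(steps=PIPELINE_STEPS):
    """Split each command string into an argument list suitable for subprocess."""
    return [shlex.split(step) for step in steps]


def run_pipeline(steps=PIPELINE_STEPS, dry_run=False):
    """Run the steps in order, stopping at the first failure.

    Returns the list of commands that were executed (or would be, in a dry run).
    """
    executed = []
    for argv in build_commands(steps):
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(argv, check=True)
        executed.append(argv)
    return executed
```

Calling `run_pipeline(dry_run=True)` returns the argument lists without executing anything, which is useful for checking the sequence before a long run.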
Python API
from geoxerl.data_preprocessing.main import main as preprocess
from geoxerl.base_models.train import main as train_models
from geoxerl.ensemble.stacking import main as run_ensemble
from geoxerl.threshold_optimization.q_main import main as optimize_threshold
# Step 1: preprocess raw environmental rasters
preprocess()
# Step 2: train base models
train_models()
# Step 3: build the ensemble
run_ensemble()
# Step 4: find the optimal prediction threshold via Q-Learning
optimize_threshold()
See the examples/ directory for ready-to-run scripts covering each stage.
Project structure
GeoXERL/
├── geoxerl/                      # Main package
│   ├── __init__.py
│   ├── __version__.py
│   ├── __main__.py               # Enables python -m geoxerl
│   ├── cli.py                    # Command-line interface
│   ├── data_preprocessing/       # Steps 00-05: env vars -> feature stack
│   │   ├── 00_env_variables_preprocessing.py
│   │   ├── 01_env_variables_preprocessing.py
│   │   ├── 02_presence_points_processing.py
│   │   ├── 03_background_points_generation.py
│   │   ├── 04_dataset_splitting.py
│   │   ├── 05_prepare_feature_stack.py
│   │   ├── config.py
│   │   ├── main.py
│   │   └── utils.py
│   ├── base_models/              # Model training, evaluation, batch inference
│   │   ├── models.py
│   │   ├── train.py
│   │   ├── evaluate.py
│   │   ├── batch_models.py
│   │   └── config.json
│   ├── ensemble/                 # Bagging, Boosting, Stacking, GWRF, PPO
│   │   ├── bagging.py
│   │   ├── boosting.py
│   │   ├── stacking.py
│   │   ├── gwrf.py
│   │   ├── gwrf_shap_analysis.py
│   │   ├── gwrf_shap_tif.py
│   │   ├── feature_selector_rl2.py
│   │   ├── ppo_main.py
│   │   ├── predict_gwrf.py
│   │   └── metrics.py
│   └── threshold_optimization/   # Q-Learning / PPO threshold search
│       ├── q_learning_optimizer.py
│       ├── q_main.py
│       ├── threshold_analyzer.py
│       ├── data_processor.py
│       ├── visualizer.py
│       └── config.py
├── tests/                        # Unit tests
├── examples/                     # Ready-to-run example scripts
├── docs/                         # Documentation
├── .github/workflows/            # CI/CD (tests + PyPI publish)
├── pyproject.toml
├── README.md
├── CHANGELOG.md
├── CONTRIBUTING.md
└── LICENSE
Module descriptions
data_preprocessing
Processes raw environmental raster layers and species occurrence records into a clean, analysis-ready dataset. Scripts are numbered 00-05 to indicate execution order; main.py runs them all in sequence.
| Script | Purpose |
|---|---|
| 00 / 01 | Clip, reproject, and derive environmental variables from raw rasters |
| 02 | Filter and spatially thin species occurrence records |
| 03 | Generate background / pseudo-absence points |
| 04 | Split dataset into train / validation / test sets |
| 05 | Stack selected features into a single analysis-ready array |
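To make steps 02 and 03 more concrete, here is a minimal sketch of two common techniques: grid-based spatial thinning and distance-constrained random background sampling. The source does not describe which algorithms the numbered scripts actually use, so treat this as an illustration under that assumption; `thin_by_grid` and `sample_background` are hypothetical names, and coordinates are treated as planar.

```python
import numpy as np


def thin_by_grid(coords, cell_size):
    """Keep the first occurrence record in each square grid cell.

    coords: array of shape (n, 2) holding x/y (e.g. lon/lat) values.
    cell_size: cell edge length in the same units as coords.
    """
    coords = np.asarray(coords, dtype=float)
    # Integer cell index of every point.
    cells = np.floor(coords / cell_size).astype(np.int64)
    # np.unique returns the index of the first row seen for each distinct cell.
    _, first_idx = np.unique(cells, axis=0, return_index=True)
    return coords[np.sort(first_idx)]


def sample_background(n_points, bounds, presence, min_dist, seed=0, max_rounds=100):
    """Draw uniform random points inside bounds, rejecting those near presences.

    bounds: (xmin, ymin, xmax, ymax).
    min_dist: minimum allowed Euclidean distance to any presence point.
    """
    rng = np.random.default_rng(seed)
    presence = np.asarray(presence, dtype=float)
    xmin, ymin, xmax, ymax = bounds
    kept = []
    for _ in range(max_rounds):
        cand = rng.uniform([xmin, ymin], [xmax, ymax], size=(n_points, 2))
        # Distance from each candidate to its nearest presence point.
        nearest = np.linalg.norm(
            cand[:, None, :] - presence[None, :, :], axis=2
        ).min(axis=1)
        kept.extend(cand[nearest >= min_dist])
        if len(kept) >= n_points:
            return np.array(kept[:n_points])
    raise RuntimeError("could not place enough background points; relax min_dist")
```

Thinning before background generation keeps clustered survey effort from dominating the presence set, and the distance buffer reduces the chance that a pseudo-absence lands on a known presence.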
base_models
Provides a unified interface for fitting individual classifiers (train.py), computing the standard SDM metrics AUC, TSS (True Skill Statistic), and Kappa (evaluate.py), and running inference over large raster stacks (batch_models.py).
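For reference, the three metrics named above can be computed from binary labels and predicted scores as follows. This is a NumPy sketch of the textbook definitions, not the code in evaluate.py; the function names are illustrative, and each function assumes both classes are present in `y_true`.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels and binary predictions."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    return tp, tn, fp, fn


def tss(y_true, y_pred):
    """True Skill Statistic: sensitivity + specificity - 1."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (observed - expected) / (1.0 - expected)


def auc(y_true, scores):
    """ROC AUC as the probability that a presence outscores a background point."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    # Ties between a presence and a background score count as half.
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))
```

TSS and Kappa depend on the threshold used to binarize the scores, whereas AUC is threshold-independent, which is why threshold choice matters for the first two.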
ensemble
Implements three classical ensemble strategies (Bagging, Boosting, Stacking), a geographically weighted method (GWRF), and an RL-based feature selector:
| Method | File | Notes |
|---|---|---|
| Bagging | bagging.py | Bootstrap aggregation |
| Boosting | boosting.py | Gradient boosting |
| Stacking | stacking.py | Meta-learner on base model outputs |
| GWRF | gwrf.py | Geographically Weighted Random Forest with SHAP explainability |
| PPO feature selector | feature_selector_rl2.py / ppo_main.py | RL agent that learns which features to include |
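As an illustration of the stacking idea (a meta-learner trained on base model outputs), the sketch below uses scikit-learn's `StackingClassifier` on synthetic data. The choice of base learners and meta-learner is an assumption made for the example; the source does not say which models stacking.py combines.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for presence/background samples with environmental predictors.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
    ],
    # The meta-learner is fit on out-of-fold probabilities from the base models.
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)

# Probability of the positive class, i.e. a habitat-suitability-like score.
suitability = stack.predict_proba(X_test)[:, 1]
```

Using cross-validated base predictions (`cv=5`) keeps the meta-learner from being trained on outputs the base models produced for their own training rows.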
threshold_optimization
Casts threshold selection as a reinforcement learning problem. The Q-Learning optimizer discretizes the threshold space into states and learns a policy through reward signals based on TSS / F1. threshold_analyzer.py and visualizer.py provide post-hoc analysis and plotting tools.
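The formulation above can be sketched with a small tabular Q-learning loop. Everything specific here is an assumption made for illustration: the 19-point threshold grid, the three actions (lower, keep, raise), the TSS-only reward, and the hyperparameters are not taken from q_learning_optimizer.py, and `tss_at` and `q_learning_threshold` are hypothetical names.

```python
import numpy as np


def tss_at(y_true, scores, threshold):
    """True Skill Statistic (sensitivity + specificity - 1) at one threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    sens = np.mean(pred[y_true == 1])
    spec = np.mean(~pred[y_true == 0])
    return float(sens + spec - 1.0)


def q_learning_threshold(y_true, scores, n_states=19, episodes=500, steps=40,
                         alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Learn a threshold-adjustment policy and return (threshold, Q-table)."""
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.05, 0.95, n_states)
    # Reward table: TSS obtained at each candidate threshold.
    rewards = np.array([tss_at(y_true, scores, t) for t in thresholds])
    moves = np.array([-1, 0, 1])  # actions: lower, keep, raise the threshold
    q = np.zeros((n_states, len(moves)))

    for _ in range(episodes):
        state = int(rng.integers(n_states))  # random start encourages exploration
        for _ in range(steps):
            if rng.random() < epsilon:
                action = int(rng.integers(len(moves)))
            else:
                action = int(np.argmax(q[state]))
            nxt = int(np.clip(state + moves[action], 0, n_states - 1))
            # Standard tabular Q-learning update.
            target = rewards[nxt] + gamma * np.max(q[nxt])
            q[state, action] += alpha * (target - q[state, action])
            state = nxt

    # Read out the policy: start at the grid point nearest 0.5 and act greedily.
    state = int(np.argmin(np.abs(thresholds - 0.5)))
    for _ in range(n_states):
        state = int(np.clip(state + moves[int(np.argmax(q[state]))], 0, n_states - 1))
    return float(thresholds[state]), q


# Synthetic scores where background points score low, so 0.5 is too strict.
demo_rng = np.random.default_rng(1)
y = np.repeat([0, 1], 200)
s = np.concatenate([demo_rng.uniform(0.0, 0.3, 200), demo_rng.uniform(0.25, 1.0, 200)])
best, q_table = q_learning_threshold(y, s)
```

For a one-dimensional threshold, an exhaustive scan over the grid would find the same optimum more cheaply; the RL framing becomes more interesting when the reward combines several metrics or the search space grows.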
Configuration
Each module has its own config file. Edit these before running to set your data paths and hyperparameters:
| Module | Config file |
|---|---|
| data_preprocessing | geoxerl/data_preprocessing/config.py |
| base_models | geoxerl/base_models/config.json |
| ensemble | geoxerl/ensemble/config.py |
| threshold_optimization | geoxerl/threshold_optimization/config.py |
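For the JSON-based config, one way to adjust settings without editing the file by hand is to load it and apply overrides in a script. This is a generic sketch: `load_config` is a hypothetical helper, and the keys inside config.json are not documented in this README, so the keys in the usage note are placeholders only.

```python
import json
from pathlib import Path


def load_config(path, overrides=None):
    """Read a JSON config file and apply optional top-level overrides."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    # Overrides replace top-level keys only; nested values are not merged.
    config.update(overrides or {})
    return config
```

A call such as `load_config("geoxerl/base_models/config.json", {"some_key": "new value"})` returns the merged dictionary, where `some_key` stands in for whatever keys the real file defines.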
Contributing
Contributions are welcome! Please read CONTRIBUTING.md for setup instructions, code style guidelines, and the pull request checklist.
# Set up development environment
git clone https://github.com/wenshunzhang/GeoXERL.git
cd GeoXERL
pip install -e ".[dev]"
pre-commit install
# Run tests
pytest tests/
Citation
If you use GeoXERL in your research, please cite:
@software{geoxerl2024,
author = {Zhang, Wenshun},
title = {GeoXERL: Geospatial Ensemble and Reinforcement Learning Toolkit for Species Distribution Modeling},
year = {2024},
url = {https://github.com/wenshunzhang/GeoXERL},
version = {0.1.0}
}
License
MIT — see LICENSE for details.
Contact
Wenshun Zhang — zhangwenshun24@mails.ucas.ac.cn
University of Chinese Academy of Sciences