Species distribution modeling support tools
Project description
elapid
elapid
provides python support for species distribution modeling. This includes a custom MaxEnt implementation and general spatial processing tools. It will soon include tools for working with GBIF-format datasets.
The name is an homage to A Biogeographic Analysis of Australian Elapid Snakes (H.A. Nix, 1986), the paper widely credited with defining the essential bioclimatic variables to use in species distribution modeling. It's also a snake pun (a python wrapper for mapping snake biogeography).
The maxent modeling tools and feature transformations are translations of the R maxnet
package. It uses the glmnet
python bindings, and is implemented using sklearn
conventions.
Table of Contents
Background
Maxent is a species distribution modeling (SDM) system, which uses species observations and environmental data to predict where a species might be found under past, present or future environmental conditions.
Its a presence/background model, meaning it uses data on where a species is present and, instead of data on where it is absent, a random sample of the region where you might expect to find that species. This is convenient because it reduces data collection burdens (absence data isn't required) but it also introduces some unique challenges for fitting and interpreting models.
Formally, Maxent estimates habitat suitability (i.e. the fundamental niche) using species occurrence records (y = 1
), randomly sampled "background" location records (y = 0
), and environmental covariate data (x
) for each location of y
.
Maxent doesn't directly estimate relationships between presence/background data and environmental covariates (so, not just y ~ x
). Instead, it fits a series of feature transformatons (z
) to the covariate data (e.g. computing pairwise products between covariates, setting random covariate thresholds). Maxent then estimates the conditional probability of finding a species given a set of environmental conditions as Pr(y = 1 | f(z))
.
elapid
provides python tools for fitting maxent models, computing features, and working with geospatial data. It's goal is to help us better understand where the species are, why they are there, and where they will go in the future.
Installation
This library can be installed via pip
.
pip install elapid
You can also clone the source repository and install it locally.
git clone https://github.com/earth-chris/elapid.git
cd elapid
pip install -e .
With conda
You can use conda
to ensure you have all the required dependencies (glmnet
and rasterio
have some library dependencies). From the repository's base directory:
conda env update
conda activate elapid
pip install -e .
This will create an environment named elapid
and install an editable version of the package that you can use.
Working with elapid
There are two primary Maxent functions: fitting features and fitting models. You can do it all at once with:
import elapid
x, y = elapid.load_sample_data()
model = elapid.MaxentModel()
model.fit(x, y)
Where:
x
is an array or dataframe of environmental covariates of shape (n_samples
,n_covariates
)y
is an array or series of species presence/background labels (rows labeled1
or0
)
The elapid.MaxentModel()
object takes these data, fits features from covariate data, computes sample weights and feature regularization, fits a series of models, and returns an estimator that can be used for applying predictions to new data.
MaxentModel()
behaves like an sklearn estimator
class. Use model.fit(x, y)
to train a model, and model.predict(x)
to generate model predictions.
You can also generate and evaluate features before passing them to the model:
features = elapid.MaxentFeatureTransformer()
z = features.fit_transform(x)
model.fit(z, y, is_features=True)
MaxentFeatureTransformer()
behaves like an sklearn preprocessing
class. Use features.fit(x_train)
to fit features, features.transform(x_test)
to apply to new covariate data, or features.fit_transform(x)
to fit features and return the transformed data.
The base Maxent classes can be modified with parameters that are available in other Maxent implementations:
model = elapid.MaxentModel(
feature_types = ['linear', 'hinge', 'product'], # the feature transformations
tau = 0.5, # prevalence scaler
clamp = True, # set covariate min/max based on range of training data
scorer = 'roc_auc', # metric to optimize (from sklearn.metrics.SCORERS)
beta_multiplier = 1.0, # regularization scaler (high values drop more features)
beta_lqp = 1.0, # linear, quadratic, product regularization scaler
beta_hinge = 1.0, # hinge regularization scaler
beta_threshold = 1.0, # threshold regularization scaler
beta_categorical = 1.0, # categorical regularization scaler
n_hinge_features = 50, # number of hinge features to compute
n_threshold_features = 50, # number of threshold features to compute
convergence_tolerance = 1e-07, # model fit convergence threshold
use_lambdas = 'last', # set to 'best' (least overfit), 'last' (highest score)
n_cpus = 4, # number of cpu cores to use
)
Geospatial support
In addition to the maxent modeling support tools, elapid
includes a series of geospatial data processing routines. These should make it easy to work with species occurrence records and raster covariates in multiple formats. The workflows rely on geopandas
for vectors and rasterio
for rasters.
Here's an example end-to-end workflow (using dummy paths to demonstrate functionality).
import elapid
vector_path = "/home/slug/ariolimax-californicus.shp"
raster_path = "/home/slug/california-climate-veg.tif"
output_path = "/home/slug/ariolimax-californicus-habitat.tif"
model_path = "/home/slug/ariolimax-claifornicus-model.ela"
# sample the raster values at point locations
presence = elapid.raster_values_from_vector(vector_path, raster_path)
pseudoabsence_points = elapid.pseudoabsence_from_raster(raster_path)
pseudoabsence = elapid.raster_values_from_geoseries(pseudoabsence_points, raster_path)
# merge the datasets into one dataframe
pseudoabsence['presence'] = 0
presence['presence'] = 1
y = presence['presence'].append(pseudoabsence['presence']).reset_index(drop=True)
x = presence.drop(['presence'], axis=1).append(pseudoabsence.drop(['presence'], axis=1)).reset_index(drop=True)
# train the model
model = elapid.MaxentModel()
model.fit(x, y)
# apply it to the full extent and save the model for later
elapid.apply_model_to_rasters(model, raster_path, output_path, transform="logistic")
elapid.save_object(model, model_path)
To work with this saved model later, you can run:
model = elapid.load_object(model_path)
Let's dig in to the components of this workflow.
Working with x-y data
Almost all of the data sampling and indexing uses geopandas.GeoSeries
objects. These are the format of the geometry
column for a GeoDataFrame
.
import geopandas as gpd
gdf = gpd.read_file(vector_path)
print(type(gdf.geometry))
> <class 'geopandas.geoseries.GeoSeries'>
Sometimes you don't have a vector of point-format location data. The java
implementation of maxent uses csv files, for example. You can work with those using the xy_to_geoseries
function:
import pandas as pd
csv_path = "/home/cba/ariolimax-californicus.csv"
df = pd.read_csv(csv_path)
presence = elapid.xy_to_geoseries(df.x, df.y, crs="EPSG:32610")
Make sure you specify the projection of your x/y data. The default assumption is lat/lon, which in many cases is not correct.
You can also convert arbitrary arrays of x/y data:
lons = [-122.49, 151.0]
lats = [37.79, -33.87]
locations = elapid.xy_to_geoseries(lons, lats)
print(locations)
> 0 POINT (-122.49000 37.79000)
> 1 POINT (151.00000 -33.87000)
> dtype: geometry
Generating pseudo-absence records
In addition to species occurrence records, maxent requires a set of pseudo-absence (i.e. background) points. These are a random geographic samping of where you might expect to find a species.
You can use elapid
to create a random geographic sampling of points from unmasked locations within a raster's extent:
count = 10000 # the number of points to generate
pseudoabsence_points = elapid.pseudoabsence_from_raster(raster_path, count)
Species occurrence records are often biased in their collection (collected near roads, in protected areas, etc.), so we typically need to be more precise in where we select pseudo-absence points. You could use a vector with a species range map to select records:
range_path = "/home/slug/ariolimax-californicus-range.shp"
pseudoabsence_points = elapid.pseudoabsence_from_vector(range_path, count)
If you've already computed a polygon using geopandas, you can pass it instead to elapid.pseudoabsence_from_geoseries()
, which is what pseudoabsence_from_vector()
does under the hood.
You could also pass a raster bias file, where the raster grid cells contain information on the probability of sampling an area:
# assume covariance between vertebrate and invertebrate banana slugs
bias_path = "/home/slug/proximity-to-ucsc.tif"
pseudoabsence_points = pseudoabsence_from_bias_file(bias_path)
The grid cells can be an arbitrary range of values. What's important is that the values encode a linear range of numbers that are higher where you're more likely to draw a sample. The probability of drawing a sample is dependent on two factors: the range of values provided and the frequency of values across the dataset.
So, for a raster with values of 1
and 2
, you're sampling probability for raster locations of 2
is twice that as 1
locations. If these occur in equal frequency (i.e. half the data are 1
valuess, half are 2
values), then you'll likely sample twice as many areas with 2
values. But if the frequency of 1
values is much greater than 2
values, you'll shift the distribution. But you're still more likely, on a per-draw basis, to draw samples from 2
locations.
The above example prioritizes sampling frequency in the areas around UC Santa Cruz, home to all types of slug, based on the distance to the campus.
Extracting raster values
Once you have your species presence and pseudo-absence records, you can extract the covariate data from each location.
pseudoabsence_covariates = elapid.raster_values_from_geoseries(
pseudoabsence_points,
raster_path,
drop_na = True,
)
This could also be done with raster_values_from_vector(vector_path, raster_path)
if you haven't already loaded the geoseries data into memory.
This function, since it's geographically indexed, doesn't require the point data and the raster data to be in the same projection. elapid
handles reprojection and sampling on the fly.
It also allows you to pass multiple raster files, which can be in different projections, extents, grid sizes, etc. This means you don't have to explicitly re-sample your raster data prior to analysis, which is always a chore.
raster_paths = [
"/home/slug/california-leaf-area-index.tif", # 1-band vegetation data
"/home/slug/global-cloud-cover.tif", # 3-band min, mean, max annual cloud cover
"/home/slug/usa-mean-temperature.tif", # 1-band mean temperature
]
# since you have five raster bands total, specify each band label
labels = [
"LAI",
"CLD-min",
"CLD-mean",
"CLD-max",
"TMP-mean",
]
pseudoabsence_covariates = elapid.raster_values_from_geoseries(
pseudoabsence_points,
raster_paths,
labels = labels
drop_na = True,
)
If you don't specify the labels, elapid
will assign ['band_001', 'band_002', ...]
.
Applying models to rasters
Once you've fit a model, you can apply it to a set of raster covariates to produce gridded suitability maps.
elapid.apply_model_to_rasters(
model,
raster_paths,
output_path,
template_idx = 0,
transform = "cloglog",
nodata=-9999,
)
The list and band order of the rasters must match the order of the covariates used to train the model. It reads each dataset in a block-wise basis, applies the model, and writes gridded predictions.
If the raster datasets are not consistent (different extents, resolutions, etc.), it wll re-project the data on the fly, with the grid size, extent and projection based on a 'template' raster. Use the template_idx
keyword to specify the index of which raster file to use as the template (template_idx=0
sets the first raster as the template).
In the example above, it's important to set the template to the california-leaf-area-index.tif
file. This is because this is the smallest extent with data, and it'll only read and apply the model to the usa
and global
datasets in the area covered by california
. If you were to set the extent to usa-mean-temperature.tif
, it would still technically function, but there would be a large region of nodata
values where there's insufficient covariate coverage.
Package design
This python package was developed to overcome some of the limitations of the Maxent java package. Some gripes with the java
distribution include:
- It requires using specific and inconvenient raster data formats (
mxe
,grd
, orbil
raster data), which create large files that take a long time to read/write. - There's no support for working with common point occurence formats, like shapefiles or geojson.
- Its hard to include multi-temporal environmental covariate data, meaning time series data need to be aggregated (e.g., averaged) in some way prior to analysis.
- Applying a previously-trained model to new data (e.g., under climate projections) is difficult.
- Training and applying batches of models is laborious.
elapid
was designed to support common geospatial data formats and contemporary approaches to training statistical models. It uses sklearn
conventions to fit and apply models, rasterio
to handle raster operations, geopandas
for vector operations, and returns pandas
dataframes.
This places some burden on the user to handle processes like creating train/test cross-validation splits and sampling background data. The hope is that this extra flexibility removes some of the frustrations with the black-box approach of the java implementation and enables users to better tune and evaluate their models.
There's also more flexibility supporting contemporary geospatial data formats, like cloud-optimized geotiffs. This means you can train and apply models using cloud-hosted covariate data (e.g., from s3
) without having to download the data to a local hard-drive.
Contact
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.