Implementation of SDM - Experimental Notes
SDM 101 - Basics Hands-On
Rita Castilho
May 2015
Species distribution modelling (SDM) is a recent scientific development with enormous
application potential in the biological sciences. Most of the information on the geographic
distribution of species stems from fieldwork data accumulated over centuries.
Species distribution modelling is also known under other names, including climate-envelope
modelling, habitat modelling, and environmental or ecological niche modelling (ENM). The
advent of SDM has made it possible to infer hypothetical geographic species distributions by
relating the presence, or presence and absence, of a species to environmental variables
(Franklin 2010). It is possible to predict the environmental conditions that are suitable for a
species by classifying grid cells according to the degree to which they are suitable or unsuitable
for that species, resulting in a predictive model describing the suitability of any site for the
species (Guisan & Thuiller 2005). Common applications of SDM include exploring the
response of geographic species distributions to climate change (Peterson 2011), predicting
range expansions of invasive species (Benedict et al. 2009), supporting conservation planning
(Wilson et al. 2011), identifying areas of endemism (Raxworthy et al. 2007) and facilitating
field surveys of species with poorly known geographic distributions (Guisan & Thuiller 2005;
Raxworthy et al. 2003).
Workflow of SDM
There are many ways to go about SDM; however, the general workflow is:
(0) Choose a target species. You may do so because you are already working on a given
species, or will be working on it in the future. If neither is the case, you still need to choose a
species...
(1) Collect the locations (geographical coordinates) of occurrence of the target species.
(2) Assemble values of environmental predictor variables at these locations taken from spatial
databases;
(3) Use the environmental values to fit a model to estimate similarity to the sites of
occurrence;
(4) Predict the variable(s) of interest across the region of interest.
Steps 0-2 are quick and easy to perform, in contrast with steps 3 and 4, which may be complex
and time-consuming to implement.
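The heart of the workflow (steps 2-4) can be sketched in base R with simulated data; everything below (variable names, values, and the use of a plain logistic regression in place of a dedicated SDM algorithm) is invented purely for illustration:

```r
set.seed(1)

# step 2 (simulated): environmental values at 100 presence and 200 background sites
env_pres <- data.frame(temp = rnorm(100, 22, 2),  prec = rnorm(100, 1200, 150))
env_back <- data.frame(temp = rnorm(200, 15, 6),  prec = rnorm(200, 800, 400))

# step 3: fit a model of presence (1) versus background (0)
d <- rbind(cbind(pb = 1, env_pres), cbind(pb = 0, env_back))
m <- glm(pb ~ temp + prec, family = binomial, data = d)

# step 4: predict suitability for unvisited grid cells
newcells <- data.frame(temp = c(21, 10), prec = c(1150, 600))
suitability <- predict(m, newcells, type = "response")   # values in [0, 1]
```

In a real analysis the simulated values are replaced by raster values extracted at the occurrence points, and the glm by Maxent, GLM/GAM, or similar, as in the sections that follow.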
(5) View predictions.
View the prediction rasters. You can download them individually and import them into a GIS
for further analysis.
Task 1. Run “Wallace” for your target species.
You will see at the top of the map the number of duplicated records that were automatically
removed. You also have the chance to remove awkward or uncertain points one by one.
However, more often than not, there are too many points to eliminate by hand, and automatic
removal would be more efficient and practical. Any suggestions?
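One simple automatic approach (a sketch of a generic recipe, not what Wallace does internally) is to drop impossible coordinates and then keep a single record per grid cell by rounding to the working resolution:

```r
# toy occurrence table; in practice this would be the csv downloaded from Wallace
occ <- data.frame(lon = c(-8.671, -8.6712, -8.90, 200.0),
                  lat = c(37.02, 37.0201, 37.10, 37.00))

# 1. drop impossible coordinates
occ <- occ[occ$lon >= -180 & occ$lon <= 180 & occ$lat >= -90 & occ$lat <= 90, ]

# 2. thin to one record per ~0.01-degree cell (keeps the first record in each cell)
cell <- paste(round(occ$lon, 2), round(occ$lat, 2))
occ <- occ[!duplicated(cell), ]
nrow(occ)   # 2 records survive
```

Dedicated packages such as spThin implement more principled spatial thinning.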
Task 2. Determine the best model. Problems?
Task 3. Discuss the limitations of Wallace implementation.
MAXENT (java)
You can use the Java application of Maxent
(https://www.cs.princeton.edu/~schapire/maxent/) to obtain a species distribution model. For
that you will need two types of input: presence data and environmental layers. From the
Wallace exercise you obtained a csv file, which you can use in Maxent-java. Check the
required format of the occurrence file (open Maxent-java and press Help) and make the
necessary adjustments. Then you need to know where to retrieve the predictors: depending on
whether you are interested in marine or terrestrial data, there are different repositories for
these layers.
Examples of repositories of environmental predictors:
http://www.oracle.ugent.be/download.html (marine)
https://www.nodc.noaa.gov/OC5/woa13/woa13data.html (marine, World Ocean Atlas)
You can change the amount of memory MaxEnt uses by opening the maxent.bat file in a text
editor and changing 512 to 1024 (the value is in megabytes). Once you have changed the
memory, save and close the file. To open MaxEnt, double-click the maxent.bat file. A window
should open that looks like this:
To begin, you must provide a Samples file. This file contains the presence localities in .csv format.
Navigate to your csv file by clicking the Browse button under Samples. Next you have to
provide the Environmental Layers to be used for the model. This will be the folder that
contains all your environmental layers in ASCII format (they must have an .asc file extension)
with the same geographic bounds, cell size, and projection system. Navigate to this folder by
clicking the Browse button under Environmental Layers. Notice how you can change the
environmental layers to either continuous or categorical. If any of the layers you include in
your environmental layers are categorical (e.g. vegetation type), make sure you change them
by clicking on the down arrow and choosing categorical. An Output folder also needs to be
selected. This will be the folder where all the MaxEnt outputs will be stored. We will use the
folder created earlier named Outputs. Navigate to this folder by clicking the Browse located
next to the Output Directory.
You can leave the Projection Layers Folder/File window blank if you do not intend to produce
future scenarios. Make sure that the Create Response Curves, Make Pictures of
Predictions, and Do Jackknife to Measure Variable Importance boxes are all checked.
Keep the Auto Features box checked and leave the Output Format as Logistic and the Output
file type as .asc.
Now, the MaxEnt GUI (graphic user interface) should look like this:
NOTE: You will need to check the “Random seed” box when using test data. If you forget to
check this, MaxEnt will pop up an error and force you to check this box.
Step 4: Reducing disk space and increasing speed (optional setting) (Advanced tab)
When the only output needed from a MaxEnt run is the averaged result of multiple runs
(replications), you can turn off the “write output grids” setting. This prevents MaxEnt from
writing output grids for the individual runs, producing only the summary-statistic grids
(e.g. average, minimum, maximum) across all runs, which speeds up the total run time and
reduces disk usage. To do so, go to Settings, select the Advanced tab, and uncheck “write
output grids”.
The next graph you see when you scroll down is the Sensitivity vs. 1 - Specificity plot. This is
a graph of the Area Under the Receiver Operating Characteristic (ROC) Curve, or AUC. The
AUC values allow you to easily compare performance of one model with another, and are
useful in evaluating multiple MaxEnt models. An AUC value of 0.5 indicates that the
performance of the model is no better than random, while values closer to 1.0 indicate better
model performance.
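AUC can also be computed directly: it is the probability that a randomly chosen presence site receives a higher score than a randomly chosen background site. A base-R illustration with made-up scores:

```r
pres_scores <- c(0.9, 0.8, 0.7, 0.4)   # model output at presence (test) sites
back_scores <- c(0.6, 0.3, 0.2, 0.1)   # model output at background sites

# fraction of (presence, background) pairs ranked correctly (ties count half)
auc <- mean(outer(pres_scores, back_scores, ">") +
            0.5 * outer(pres_scores, back_scores, "=="))
auc   # 0.9375: 15 of the 16 pairs are ranked correctly
```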
Further down the page, you will see a picture of the model. You can click on the picture to see
an enlarged version. You can also find this image in the Plots folder in the outputs as a
Portable Network Graphic (.png) file.
Below the variable contributions is a graph of the Jackknife of Regularized Training Gain.
The jackknife shows the training gain achieved by each variable when the model is run with
that variable in isolation, and the gain remaining when that variable is omitted from the full
model.
If you use the same data for training and for testing, the red and blue lines will be identical. If
you split your data into two partitions, one for training and one for testing, it is normal for the
red (training) line to show a higher AUC than the blue (testing) line. The red (training) line
shows the “fit” of the model to the training data. The blue (testing) line indicates the fit of the
model to the testing data, and is the real test of the model's predictive power.
It is important to note that AUC values tend to be higher for species with narrow ranges,
relative to the study area described by the environmental data. This does not necessarily mean
that the models are better; instead this behavior is an artifact of the AUC statistic.
Observe the graphs below for a given species, and identify the most and least important
variables for predicting the distribution of the occurrence data.
We see that if Maxent uses only pre6190_l1 (average January rainfall) it achieves almost no
gain, so that variable is not (by itself) useful for estimating the distribution of Bradypus.
October rainfall (pre6190_l10), in contrast, allows a reasonably good fit to the training data.
Turning to the lighter blue bars, it appears that no variable contains a substantial amount of
useful information that is not already contained in the others, because omitting each variable
in turn did not decrease the training gain considerably. However, if we had to remove a
variable from the analysis, h_dem would be the one to choose.
Comparing the three jackknife plots can be very informative. The AUC plot shows that annual
precipitation (pre6190_ann) is the most effective single variable for predicting the distribution of
the occurrence data that was set aside for testing, when predictive performance is measured using
AUC, even though it was hardly used by the model built using all variables. The relative
importance of annual precipitation also increases in the test gain plot, when compared against the
training gain plot. In addition, in the test gain and AUC plots, some of the light blue bars
(especially for the monthly precipitation variables) are longer than the red bar, showing that
predictive performance improves when the corresponding variables are not used. This tells us that
these (monthly precipitation) variables are helping Maxent to obtain a good fit to the training data,
but the annual precipitation variable generalizes better, giving comparatively better results on the
set-aside test data. Phrased differently, models made with the monthly precipitation variables
appear to be less transferable. This matters if our goal is to transfer the model, for example by
applying it to future climate variables in order to estimate the species' future distribution under
climate change. It makes sense that monthly precipitation values are less transferable: likely
suitable conditions for this species will depend not on precise rainfall values in selected months,
but on the aggregate average rainfall, and perhaps on rainfall consistency or lack of extended dry
periods. When we are modeling on a continental scale, there will probably be shifts in the precise
timing of seasonal rainfall patterns, affecting the monthly precipitation but not suitable conditions
for the species. The same line of thought can be pursued when dealing with a marine species. Sea
surface temperature for particular months may not be as good descriptor variables as sea surface
temperature range, or minimum or maximum values. Bio-Oracle provides these aggregate
variables, while the World Ocean Atlas (https://www.nodc.noaa.gov/OC5/woa13/woa13data.html)
offers another kind of aggregation (see below).
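For a single grid cell, deriving such aggregate predictors from twelve monthly values is a one-liner each (the SST values below are invented):

```r
sst <- c(14.2, 14.0, 14.8, 15.9, 17.4, 19.6,
         21.8, 22.5, 21.3, 19.0, 16.5, 14.9)   # Jan-Dec SST at one cell

sst_mean  <- mean(sst)          # annual mean
sst_min   <- min(sst)           # coldest month
sst_max   <- max(sst)           # warmest month
sst_range <- sst_max - sst_min  # annual range
```

On full rasters the same summaries are computed cell-wise, e.g. with calc() or overlay() from the raster package.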
In general, it would be better to use variables that are more likely to be directly relevant to the
species being modeled. For example, the Worldclim website (www.worldclim.org) provides
“BIOCLIM” variables, including derived variables such as “rainfall in the wettest quarter”, rather
than monthly values.
A last note on the jackknife outputs: the test gain plot shows that a model made only with January
precipitation (pre6190_l1) results in a negative test gain. This means that the model is slightly
worse than a null model (i.e., a uniform distribution) for predicting the distribution of occurrences
set aside for testing. This can be regarded as more evidence that the monthly precipitation values
are not the best choice for predictor variables.
SDM BY SCRIPTING
There are several ways to get presence data for the target species: (1) databases such as GBIF
(http://www.gbif.org), OBIS (http://iobis.org/) and FishBase (http://fishbase.org); (2) previously
published work; and (3) your own work. There are several problems with using data retrieved
from databases, mostly related to duplicate records and georeferencing errors, for instance
data for marine species georeferenced on land, or the other way around. Such errors can be
identified by overlaying the point-locality layer on a coastline layer (here, wrld_simpl,
loaded with data(wrld_simpl)); any mismatch between the layers indicates a potential
georeferencing error, and the outlying points can be removed.
We will use a script written by Miguel Gandra that automatically retrieves records from these
databases, eliminates duplicates, and removes either the inland or the marine points,
depending on the species.
###################################################################################################################
## Miguel Gandra || [email protected] || April 2015 ##############################################################
###################################################################################################################
# Script to plot a distribution map of a chosen species from GBIF, OBIS and FishBase records.
# http://www.gbif.org
# http://iobis.org
# http://www.fishbase.org
# 1. Define the genus and species name in the variables field (below).
# 2. Set the directory containing the data files (txt from GBIF and csv from OBIS) - user's desktop predefined.
# GBIF data: the script first looks for the "occurrence.txt" file in the chosen directory,
# if it's not available or contains wrong species, the data is downloaded directly from the GBIF portal.
# OBIS data: the script looks for a csv file in the chosen directory (if multiple csv files are found
# it uses the first one listed).
# FishBase data: the script downloads the data directly from the web.
# The "marginal.tolerance" variable sets the margin at which records are considered to be on land.
# A positive value will consider records to be on land even if they are up to x degrees outside the nearest coast.
# A negative value will constrain land records to those that are at least x degrees inside the coastline.
# User can automatically delete points in land by setting the "remove.land.points" variable to TRUE,
# or delete points in water by setting the "remove.ocean.points" variable to TRUE.
######################################################################################
# Automatically install required libraries ##########################################
# (check http://tlocoh.r-forge.r-project.org/mac_rgeos_rgdal.html
# if rgeos and rgdal installation fails on a Mac)
if(!require(dismo)){install.packages("dismo"); library(dismo)}
if(!require(XML)){install.packages("XML"); library(XML)}
if(!require(jsonlite)){install.packages("jsonlite"); library(jsonlite)}
if(!require(graphics)){install.packages("graphics"); library(graphics)}
if(!require(maps)){install.packages("maps"); library(maps)}
if(!require(maptools)){install.packages("maptools"); library(maptools)}
if(!require(rgeos)){install.packages("rgeos"); library(rgeos)}
if(!require(rgdal)){install.packages("rgdal"); library(rgdal)}
######################################################################################
# Variables ##########################################################################
# Example values: adjust these for your own target species and folders.
genus <- "Genus"                  # target genus (placeholder)
species <- "species"              # target species epithet (placeholder)
directory <- "~/Desktop"          # folder containing the data files (user's desktop predefined)
marginal.tolerance <- 0           # margin (in degrees) for the land/ocean classification
remove.land.points <- TRUE        # delete points on land (marine species)
remove.ocean.points <- FALSE      # delete points in water (terrestrial species)
export.csv <- TRUE                # export the cleaned occurrence records to csv
#######################################################################################
# Get GBIF records from txt file or download them directly from the portal ############
gbif.file <- file.path(directory, "occurrence.txt")
txt <- TRUE
if (file.exists(gbif.file)) {
  gbif.data <- read.table(gbif.file, sep="\t", header=TRUE, fill=TRUE, quote=NULL, comment='')
  if (length(grep(species, gbif.data$scientificName)) == 0) {
    txt <- FALSE
    # message() rather than stop(), so the script can fall back to the online download
    message("occurrence.txt file from a different species, downloading data from GBIF")
  } else {
    gbif.coordinates <- data.frame(gbif.data$decimalLongitude, gbif.data$decimalLatitude)
    gbif.coordinates <- na.omit(gbif.coordinates)
  }
} else {
  txt <- FALSE
  message("occurrence.txt file not available, downloading data from GBIF")
}
if (txt == FALSE) {
  gbif.data <- try(gbif(genus, species, geo=TRUE, removeZeros=TRUE))
  if (inherits(gbif.data, "try-error")) {
    gbif.coordinates <- data.frame(matrix(ncol=2, nrow=0))
    warning("GBIF data download failed, check internet connection")
  } else {
    gbif.coordinates <- data.frame(gbif.data$lon, gbif.data$lat)
    gbif.coordinates <- na.omit(gbif.coordinates)
  }
}
#######################################################################################
# Get OBIS records from csv file ######################################################
obis.file <- list.files(directory, pattern="\\.csv$")
obis.file <- file.path(directory, obis.file[1])
if (file.exists(obis.file)) {
  obis.data <- read.csv(obis.file, sep=",", header=TRUE)
  if (length(grep(species, obis.data$sname)) == 0) {
    obis.coordinates <- data.frame(matrix(ncol=2, nrow=0))
    # message() rather than stop(), so the remaining sources are still merged
    message("OBIS csv file from a different species")
  } else {
    obis.coordinates <- data.frame(obis.data$longitude, as.numeric(as.character(obis.data$latitude)))
    obis.coordinates <- na.omit(obis.coordinates)
  }
} else {
  obis.coordinates <- data.frame(matrix(ncol=2, nrow=0))
  message("OBIS csv file not available")
}
#######################################################################################
# Get records from FishBase ###########################################################
#######################################################################################
# Merge records and remove duplicates #################################################
colnames(gbif.coordinates)<-c("long","lat")
colnames(obis.coordinates)<-c("long","lat")
colnames(fishbase.coordinates)<-c("long","lat")
coordinates <- rbind(gbif.coordinates,obis.coordinates,fishbase.coordinates)
total <- nrow(coordinates)
dups <- duplicated(coordinates[,1:2])
dups <- dups[dups==TRUE]
coordinates <- unique(coordinates)
#######################################################################################
# Set geographical area ###############################################################
x <- coordinates[, 1]
y <- coordinates[, 2]
xmin <- min(x) - 5
xmax <- max(x) + 5
ymin <- min(y) - 5
ymax <- max(y) + 5
########################################################################################
# Plot Map ############################################################################
########################################################################################
# Compute land and ocean points #######################################################
data(wrld_simpl)
if (marginal.tolerance == 0) {
  x.ocean <- x[pts.on.land == FALSE]
  y.ocean <- y[pts.on.land == FALSE]
  x.land <- x[pts.on.land == TRUE]
  y.land <- y[pts.on.land == TRUE]
} else if (marginal.tolerance > 0) {
  distances <- gDistance(specie.pts, wrld_simpl, byid=TRUE)
  min.distances <- numeric(length(specie.pts))   # initialise before filling in the loop
  for (i in 1:length(specie.pts)) {min.distances[i] <- min(distances[, i])}
  x.ocean <- x[min.distances > marginal.tolerance]
  x.land <- x[min.distances <= marginal.tolerance]
  y.ocean <- y[min.distances > marginal.tolerance]
  y.land <- y[min.distances <= marginal.tolerance]
} else if (marginal.tolerance < 0) {
  specie.pts <- specie.pts[pts.on.land == TRUE]
  mp <- map("world", xlim=c(xmin,xmax), ylim=c(ymin,ymax), col="gray60", border="gray60", fill=TRUE, resolution=0)
  coastline <- cbind(mp$x, mp$y)[!is.na(mp$x), ]
  coast.pts <- SpatialPoints(coastline, proj4string=CRS(proj4string(wrld_simpl)))
  distances <- gDistance(specie.pts, coast.pts, byid=TRUE)
  min.distances <- numeric(length(specie.pts))   # initialise before filling in the loop
  for (i in 1:length(specie.pts)) {min.distances[i] <- min(distances[, i])}
  x.ocean <- c(x[pts.on.land == FALSE], specie.pts@coords[min.distances <= abs(marginal.tolerance), 1])
  y.ocean <- c(y[pts.on.land == FALSE], specie.pts@coords[min.distances <= abs(marginal.tolerance), 2])
  x.land <- specie.pts@coords[min.distances > abs(marginal.tolerance), 1]
  y.land <- specie.pts@coords[min.distances > abs(marginal.tolerance), 2]
}
########################################################################################
# Plot ocean points ? #################################################################
if (remove.ocean.points == FALSE) {
  points(x.ocean, y.ocean, pch=21, col='black', bg='blue', cex=0.2, lwd=0.2)
}
########################################################################################
# Plot land points ? ##################################################################
if (remove.land.points == FALSE) {
  points(x.land, y.land, pch=21, col='black', bg='red', cex=0.2, lwd=0.2)
}
########################################################################################
# Save pdf #############################################################################
########################################################################################
# Export data as csv ###################################################################
if (export.csv == TRUE) {
  occurrences <- data.frame(x.ocean, y.ocean)   # ocean points; use x.land/y.land instead for a terrestrial species
  colnames(occurrences) <- c("lon", "lat")
  csv.name <- paste(genus, '_', species, ".csv", sep="")
  csv.file <- file.path(directory, csv.name)
  write.csv(occurrences, file=csv.file, row.names=FALSE)
}
########################################################################################
# Print summary table ##################################################################
For processing, the bioclimatic variable layers need to be “stacked” into a single object and
cropped to the geographic coordinate limits of the species distribution, which reduces
computation time.
######################################################################################
## Load predictor rasters
# Make a raster "stack" with one layer per predictor... note that formats other than ASCII (used here) can save memory.
# The files have the same names as the predictors in the species' data file (sans the file extension).
# Download the rasters from http://www.oracle.ugent.be/DATA/90_90_ST/BioOracle_9090ST.rar
# and place them in the folder Desktop/SDM/Oracle.
list.rasters <- list.files("~/Desktop/SDM/Oracle", full.names=TRUE, pattern="\\.asc$")
list.rasters
rasters <- stack(list.rasters)
## Set the coordinate reference system for the raster stack... not strictly necessary if the rasters are unprojected (e.g., WGS84), but we'll do it to avoid warning messages below.
projection(rasters) <- CRS("+proj=longlat +datum=WGS84")
## Crop rasters to the geographic window of the occurrence points
limits <- extent(xmin, xmax, ymin, ymax)   # extent built from the coordinate limits computed earlier
rasters.crop <- crop(rasters, limits)
######################################################################################
By now you should have a set of predictor variables (rasters) and occurrence points. The next
step is to extract the values of the predictors at the point locations. This is straightforward
with the 'extract' function from the raster package. Use that function first for your species
occurrence points, then for 500 random background points. These are then combined into a
single data.frame in which the first column (variable 'pb') indicates whether the row is a
presence or a background point.
################################################################################################################
## Extracting values from rasters
presvals <- extract(rasters.selected, spoints)
presvals
################################################################################################################
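The combination into a single data.frame can be sketched with toy matrices; in the real workflow presvals comes from the extract() call above, and absvals from an extract() on roughly 500 background locations (for example from dismo's randomPoints()):

```r
# toy stand-ins for extracted predictor values (rows = sites, columns = predictors)
presvals <- matrix(rnorm(10 * 3),  ncol = 3, dimnames = list(NULL, c("sst", "sal", "chl")))
absvals  <- matrix(rnorm(500 * 3), ncol = 3, dimnames = list(NULL, c("sst", "sal", "chl")))

# 'pb' = 1 for presence rows, 0 for background rows
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals)))
sdmdata <- data.frame(pb = pb, rbind(presvals, absvals))
str(sdmdata)   # 510 rows: pb plus one column per predictor
```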
5. MODEL FITTING
It is expected that some of the bioclimatic variables will be correlated, so to retain only
quasi-independent predictors we calculate pairwise correlations on the values extracted at the
occurrence records, and exclude highly correlated variables (r > 0.9). This assessment reduces
the number of variables used for ecological niche modelling.
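One way to implement the r > 0.9 screen (a sketch; the variable names and toy data are invented, and v2 is deliberately constructed as a near-copy of v1):

```r
set.seed(2)
v1 <- rnorm(200)
vals <- data.frame(v1 = v1,
                   v2 = v1 + rnorm(200, sd = 0.05),  # near-copy of v1
                   v3 = rnorm(200))                  # independent

cors <- abs(cor(vals))
cors[lower.tri(cors, diag = TRUE)] <- 0            # consider each pair only once
drop <- colnames(vals)[apply(cors > 0.9, 2, any)]  # later member of each correlated pair
keep <- setdiff(colnames(vals), drop)
keep   # "v1" "v3": v2 is excluded as redundant
```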
# AIC (Akaike information criterion): the preferred model is the one with the minimum AIC value
# AIC for the model with all variables ------------------------------------------------------------
k <- length(model.present$coefficients)
aic <- (2*k) - (2*logLik(model.present)[[1]])
round(aic)
# proportion of null deviance explained (a simple goodness-of-fit measure)
gof <- (model.present$null.deviance - model.present$deviance)/model.present$null.deviance
gof
# AIC for the model with selected variables only --------------------------------------------------
k1 <- length(reduced.present.model$coefficients)
aic1 <- (2*k1) - (2*logLik(reduced.present.model)[[1]])
round(aic1)
gof1 <- (reduced.present.model$null.deviance - reduced.present.model$deviance)/reduced.present.model$null.deviance
gof1
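The manual computation above is equivalent to R's built-in AIC() extractor; a quick self-check on a toy binomial glm (hypothetical data):

```r
set.seed(3)
d <- data.frame(y = rbinom(50, 1, 0.5), x = rnorm(50))
m <- glm(y ~ x, family = binomial, data = d)

k <- length(m$coefficients)
aic.manual <- 2 * k - 2 * logLik(m)[[1]]
all.equal(aic.manual, AIC(m))   # TRUE
```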
5.2. Maxent
################################################################################################################
## Train Maxent model
# Call Maxent using the "raster-points" format (x=a raster stack and p=a two-column matrix of coordinates).
# Only x and p are really required, but I'm showing many of the commands in case you want to tweak some later.
# All of the "args" are set to their default values except for "randomtestpoints" which says to randomly
# select 30% of the species' records and use these as test sites (the default for this is 0).
# The R object made by the model will remain in R's memory.
# All other arguments are set to the defaults except "threads" which can be set up to the number of cores
# you have on your computer to speed things up... to see more "args" see bottom of "Help" file from the Maxent program.
rasters.final<-subset(rasters.selected,c("calcite", "chlorange","chlomean","cloudmean","cloudmin","damean",
"temperature", "nitrate", "parmax", "parmean", "phosphate",
"salinity", "silicate"))
rasters.final
################################################################################################################
# look at model output (HTML page)
model.maxent
################################################################################################################
# variable contribution
plot(model.maxent)
## ----------------------------------------------------------------------------------------------------------------------------------------
## write prediction map
# Note that you could save the prediction raster in any number of formats (see ?writeFormats), but GeoTiffs are small and can be read by ArcMap.
# ASCIIs can also be read by other programs but are large; the default raster format (GRD) sometimes can't be read by ArcMap.
# If you don't specify a file name, the results are kept in R's memory.
map.model.maxent <- predict(
object=model.maxent,
x=rasters.crop,
na.rm=TRUE,
format='GTiff',
filename= "~/Desktop/model",
overwrite=TRUE,
progress='text'
)
# look at map
plot(map.model.maxent, main='Present-day')
# add species' records
points(spoints, col='blue', pch=20, cex=0.2)
################################################################################################################
6. MORE R-PACKAGES
Package ‘dismo’
http://cran.r-project.org/web/packages/dismo/dismo.pdf
Functions for species distribution modeling, that is, predicting entire
geographic distributions from occurrences at a number of sites.
Package ‘sdmvspecies’
http://cran.r-project.org/web/packages/sdmvspecies/sdmvspecies.pdf
Creates virtual species for species distribution modelling.
Package ‘ENiRG’
http://cran.r-project.org/web/packages/ENiRG/ENiRG.pdf
The package allows one to perform Ecological Niche Factor Analysis,
calculate habitat-suitability maps and classify habitat into suitability
classes. Computations are executed in a throw-away GRASS environment from R,
in order to be able to perform analyses with large data sets.
Package 'unmarked'
http://cran.r-project.org/web/packages/unmarked/unmarked.pdf
Fits hierarchical models of animal abundance and occurrence to data
collected using survey methods such as point counts, site-occupancy
sampling, distance sampling, removal sampling, and double-observer
sampling. Parameters governing the state and observation processes can be
modeled as functions of covariates.
Wilson CD, Roberts D, Reid N (2011) Applying species distribution modelling to identify areas of high
conservation value for endangered species: A case study using Margaritifera margaritifera (L.).
Biological Conservation 144, 821-829.
Books
Franklin J (2009) Mapping Species Distributions: Spatial Inference and Prediction. Cambridge University
Press, Cambridge, UK. 320 pp. [more methodological]
Peterson AT, Soberón J, Pearson RG, Anderson RP, Martínez-Meyer E, Nakamura M, Araújo MB
(2011) Ecological Niches and Geographic Distributions (Monographs in Population Biology). Princeton
University Press. 328 pp. [more conceptual]