Figure caption: Main causes of deforestation in central Menabe. a-a’: Slash-and-burn agriculture (“hatsake”) for peanut crop. Peanut (a’) is cultivated as a cash crop. Part of the production is at the destination of the national market but most of it is exported outside Madagascar, mainly for the Chinese market. b-b’: Slash-and-burn agriculture for maize crop. maize (b’) is cultivated for auto-consumption and as a cash crop. The production of maize is at the destination of the national market and is used in particular to brew the national beers. c-c’: Cyclone followed by uncontrolled fires. Cyclone “Fanele” (2009) caused tree mortality and accumulation of wood fuel on the ground. As a consequence, uncontrolled fires set on nearby pastures (c’) spread over large areas of forest after 2009. d-d’: Illegal logging. Timbers are used for house and boat construction.
This tutorial is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
Docker is used to run software packages called “containers”. Containers are isolated from each other and bundle their own tools, libraries and configuration files. Containers are created from “images” that specify their precise contents.
To install Docker for your operating system, go to https://docs.docker.com/install/. For Windows user, go directly to https://docs.docker.com/docker-for-windows/install/.
For our tutorial, we have built a specific Docker image including R and Python:
Some computations, such as computation by blocks on large raster files, were more easily coded and faster with Python. That’s why we decided to use Python as a complementary language to R. The reticulate
R package allow us to use Python functions from R and thus to work in a single computing environment within RStudio.
forestatrisk
Docker imageDownload the image from internet with the following command in a terminal: docker pull ghislainv/docker-forestatrisk .
(do not forget the final “.”)
Note: If the internet connection is slow, use a USB key and copy the image somewhere on your computer and load it with the following command: docker load -i <path to image tar file>
Run the container with docker run -d -e PASSWORD=PWD -e ROOT=true -p 8787:8787 -v ORIGIN_FOLDER:/home/rstudio ghislainv/docker-forestatrisk
.
Note:
PWD
, it cannot be rstudio
.ORIGIN_FOLDER
with a folder on your local machine where you want to save the R scripts. For example /c/Users/YOUR_NAME
or /home/YOUR_NAME
.Point your browser to localhost:8787
to access RStudio interface. Log in with rstudio/PWD
.
Note on docker’s containers and images:
docker ps
docker images
or docker image list
docker stop <id container>
docker rm <idcontainer>
In RStudio, open a terminal and clone the GitHub repository locally:
git clone https://github.com/ghislainv/forecasting-deforestation-Mada
Navigate with RStudio in the folder named forecasting-deforestation-Mada
and open the R Notebook file named training.Rmd
. R Markdown files (*.Rmd
) can be used in R to embbed text, code, table and figues inside a unique document. Code are organized into ‘chunks’ that can be executed independently and interactively. An R Notebook is an R Markdown document with output visible immediately beneath the input.
I used this notebook to write the tutorial available at https://ecology.ghislainv.fr/forecasting-deforestation-Mada.
In the preambule, change the author and date with your name and today’s date.
# Environmental variables
Sys.unsetenv("DISPLAY") # Remove DISPLAY for Python plot
# Libraries
pkg <- c("reticulate","glue","curl","dplyr","readr","broom",
"ggplot2","raster","rasterVis","rgdal","leaflet","kableExtra")
load.pkg <- function(x) {
if(!require(x, character.only=TRUE)) {
install.packages(x)
require(x, character.only=TRUE)
}
}
loaded <- lapply(pkg,load.pkg)
## Remove useless objects
rm(pkg,load.pkg,loaded)
# Source R plotting functions
source("R/far_plot.R")
Specify the Python version to use with reticulate and check that it is the right one.
use_python("/usr/bin/python3.5")
py_config()
## python: /usr/bin/python3.5
## libpython: /usr/lib/python3.5/config-3.5m-x86_64-linux-gnu/libpython3.5.so
## pythonhome: /usr:/usr
## version: 3.5.3 (default, Sep 27 2018, 17:25:39) [GCC 6.3.0 20170516]
## numpy: /usr/local/lib/python3.5/dist-packages/numpy
## numpy_version: 1.15.3
##
## python versions found:
## /usr/bin/python3.5
## /usr/bin/python
## /usr/bin/python3
Import the Python modules into R.
far <- import("forestatrisk")
patsy <- import("patsy")
sm <- import("statsmodels.api")
smf <- import("statsmodels.formula.api")
# Generating data
nobs <- 100
x.seq <- runif(n=nobs,min=0,max=100)
a.target <- 2
b.target <- 2
sigma2.target <- 300
y.seq <- a.target + b.target*x.seq +
rnorm(n=nobs,mean=0,sd=sqrt(sigma2.target))
# Data-set
Data <- as.data.frame(cbind(y.seq,x.seq))
# Plot
plot(x.seq,y.seq)
# Estimation
M <- lm(y.seq~x.seq,data=Data)
summary(M)
##
## Call:
## lm(formula = y.seq ~ x.seq, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.340 -11.649 1.587 11.141 41.977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.61161 3.03037 1.852 0.0671 .
## x.seq 2.00341 0.05435 36.864 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.6 on 98 degrees of freedom
## Multiple R-squared: 0.9327, Adjusted R-squared: 0.932
## F-statistic: 1359 on 1 and 98 DF, p-value: < 2.2e-16
M$coefficients
## (Intercept) x.seq
## 5.611605 2.003410
var(M$residuals)
## [1] 240.9543
M$df
## [1] 98
# Graphics
x.pred <- seq(from=0,to=100,length.out=100) # We will predict y for 100 values of x
X.pred <- as.data.frame(x.pred)
names(X.pred) <- "x.seq"
Y.pred <- predict.lm(M,X.pred,interval="confidence",
type="response")
plot(x.seq,y.seq,col=grey(0.7),
xlab="X",
ylab="Y",
main="Simple linear regression") # Row data
# Confidence envelops for predictive posterior
lines(x.pred,Y.pred[,1],col="red",lwd=3)
lines(x.pred,Y.pred[,2],col="blue",lwd=2,lty=2)
lines(x.pred,Y.pred[,3],col="blue",lwd=2,lty=2)
You can execute Python code within a R notebooks. We need to indicate which version of Python to use in the notebook for Python chunk.
import numpy as np
a = np.array([20,5,6,2,9,12])
b = 1
c = a + b
print(type(c))
## <class 'numpy.ndarray'>
print(c)
## [21 6 7 3 10 13]
To model the spatial probability of deforestation, we need a map of the past deforestation and maps of potential explicative environmental variables. Environmental variables can be derived from topography (altitude and slope), accessibility (distance to roads, towns, and forest edge), deforestation history (distance to previous deforestation) or landscape policy (protected area network) for example. In our example, we use the following variables :
tab1 <- read.table("tables/variables.txt",sep=",",header=TRUE)
names(tab1) <- c("Product","Source","Variable derived","Unit","Resolution (m)")
# Kable
knitr::kable(tab1, linesep="") %>%
kable_styling(bootstrap_options = c("striped"))
Product | Source | Variable derived | Unit | Resolution (m) |
---|---|---|---|---|
Deforestation maps (1990-2000-2010) | BioSceneMada (1) | forest/non-forest | – | 30 |
distance to forest edge | m | 30 | ||
distance to previous deforestation | m | 30 | ||
Digital Elevation Model | SRTM v4.1 CSI-CGIAR (2) | altitude | m | 90 |
slope | ° | 90 | ||
Highways | OSM - Geofabrik (3) | distance to roads | m | 150 |
Places | distance to towns | m | 150 | |
Waterways | distance to river | m | 150 | |
Protected areas | Rebioma (4) | presence of protected area | – | 30 |
# Make data directory
if (!dir.exists("data")) {
dir.create("data")
dir.create("data/model")
dir.create("data/mada")
dir.create("data/validation")
}
# Files
files <- c("fordefor2010.tif", "dist_edge.tif", "dist_defor.tif",
"altitude.tif", "slope.tif",
"dist_road.tif", "dist_town.tif", "dist_river.tif",
"sapm.tif", "roads.kml", "towns.kml", "rivers.kml", "sapm.kml")
# Download model data
for (i in files) {
if (!file.exists(glue("data/model/{i}"))) {
f <- glue("https://zenodo.org/record/259582/files/{i}")
curl_download(f, glue("data/model/{i}"), quiet=FALSE)
}
}
# Download validation data
if (!file.exists("data/validation/for2017.tif")) {
f <- "http://bioscenemada.cirad.fr/FileTransfer/for2017.tif"
curl_download(f, "data/validation/for2017.tif", quiet=FALSE)
}
# Download mada outline
if (!file.exists("data/mada/mada38s.shp")) {
f <- "http://bioscenemada.cirad.fr/FileTransfer/mada38s.zip"
curl_download(f, "data/mada/mada38s.zip", quiet=FALSE)
unzip("data/mada/mada38s.zip", exdir="data/mada")
}
In our example, fordefor2010.tif
is a forest raster at 30m for the year 2010 considering the deforestation on the period 2000-2010 in Madagascar. We can plot this raster and zoom on a region with function .plot.forest()
in the forestatrisk
package to see the deforestation data. The remaining forest appears in green and the deforested areas appear in red.
# Make output directory
if (!dir.exists("output")) {
dir.create("output")
}
# Plot forest cover change 2000-2010
fig <- far$plot$fcc(input_fcc_raster="data/model/fordefor2010.tif",
output_file="output/fcc.png",
col=c(255,0,0,255), # rgba color for deforestation
figsize=c(5,5),
dpi=150,
zoom=c(340000,412000,7420000,7500000))
knitr::include_graphics("output/fcc.png")
We use the function .sample()
from the forestatrisk
module to sample 10,000 points (pixel centers) in the deforested areas and in the remaining forest (20,000 points in total). The input_forest_raster
argument defines the path to the forest raster including the deforested pixels (with value=0) and the remaining forest pixels (value=1) after a given period of time. The random seed can be set with argument Seed
to reproduce the data of the random sampling.
The .sample()
function also extracts information from environmental variables for each sampled point. Sampling is done by block to allow the computation on large study areas (e.g. country or continental scale) with a fine spatial resolution (e.g. 30m). The var_dir
argument indicates the directory including all the environmental raster files (they must be GeoTIFF files with extension .tif
) that we want to test.
The .sample()
function identifies the spatial cell for each sample point (sample point are grouped within a spatial cell). Spatial cells and grouped observations are used to estimate the spatial autocorrelation of the deforestation process. The csize
argument define the width (in km) of the square spatial cells. Each spatial cell will be attributed a parameter. To avoid estimating too many parameters, width of the square spatial cells must be sufficiently large. Both points sampling and extraction of environmental data are done by block to avoid memory problems for big datasets.
import forestatrisk
help(forestatrisk.sample)
## Help on function sample in module forestatrisk.sample:
##
## sample(nsamp=10000, Seed=1234, csize=10, var_dir='data', input_forest_raster='forest.tif', output_file='output/sample.txt', blk_rows=0)
## Sample points and extract raster values.
##
## This function (i) randomly draw spatial points in deforested and
## forested areas and (ii) extract environmental variable values for
## each spatial point.
##
## :param nsamp: number of random spatial points.
## :param seed: seed for random number generator.
## :param csize: spatial cell size in km.
## :param var_dir: directory with raster data.
## :param input_forest_raster: name of the forest raster file (1=forest, 0=deforested) in the var_dir directory
## :param output_file: path to file to save sample points.
## :param blk_rows: if > 0, number of lines per block.
## :return: a pandas DataFrame, each row being one observation.
# Training data-set
if (!file.exists("output/sample.txt")) {
samp <- far$sample(nsamp=10000L, Seed=1234L, csize=10L,
var_dir="data/model",
input_forest_raster="fordefor2010.tif",
output_file="output/sample.txt",
blk_rows=1L)
}
samp <- read.table("output/sample.txt", header=TRUE, sep=",")
set.seed(1234)
train <- sample(1:20000, size=10000, replace=FALSE)
data_train <- samp[train,] %>% dplyr::filter(complete.cases(.))
data_valid <- samp[-train,] %>% dplyr::filter(complete.cases(.))
head(data_train)
## altitude dist_defor dist_edge dist_river dist_road dist_town
## 1 67 1146 108 150 37237 474
## 2 760 4082 811 3690 11767 7697
## 3 1080 6081 1776 9828 43487 16278
## 4 34 680 85 300 24553 1485
## 5 993 5544 190 3903 27649 14575
## 6 699 450 450 3750 23028 8255
## fordefor2010 sapm slope X Y cell
## 1 0 0 8 1042455 8235255 3638
## 2 1 1 10 978285 8272545 3307
## 3 1 1 5 949095 8290275 3224
## 4 1 0 14 1027425 8271045 3393
## 5 1 1 5 727185 7544595 9195
## 6 1 0 8 980085 8236275 3632
Sampled observations can be plotted using function .plot.obs()
from the forestatrisk module. Dark red dots indicate deforestation observations and dark green dots indicate forest observations.
# Plot sample points
fig <- far$plot$obs(sample=data_train,
name_forest_var="fordefor2010",
input_fcc_raster="data/model/fordefor2010.tif",
output_file="output/obs.png",
zoom=c(340000,412000,7420000,7500000),
figsize=c(5,5),#c(11.69,8.27),
s=5,dpi=300)
knitr::include_graphics("output/obs.png")
Before modelling the deforestation, it is important to look at the relationship between environmental variables and deforestation. Using formulas from the patsy
Python module, we can specify the relationships that we want to look at. In the example below, we plot the relationships between some continuous environmental variables and the probability of deforestation using function .plot.correlation()
from the forestatrisk
package. Note that -1 must be set at the end of the formula. The function .correlation()
returns a serie of graphics that can be analyzed to choose the right relationship for each continuous variable (linear or polynomial for example).
# Descriptive statistics
# Model formulas
formula_1 <- paste0("fordefor2010 ~ dist_road + dist_town + dist_defor +",
"dist_river + dist_edge + altitude + slope - 1")
# Standardized variables (mean=0, std=1)
formula_2 <- paste0("fordefor2010 ~ scale(dist_road) + scale(dist_town) +",
"scale(dist_defor) + scale(dist_river) + scale(dist_edge) +",
"scale(altitude) + scale(slope) - 1")
formulas <- c(formula_1, formula_2)
# List to store figures
corr_fig <- list()
# Loop on formulas
for (f in 1:length(formulas)) {
# Output file
of <- glue("output/correlation_{f}.pdf")
# Data
dmat <- patsy$dmatrices(formulas[f], data=data_train, eval_env=-1L,
return_type="dataframe")
# Plots
fig <- far$plot$correlation(y=dmat[[1]],data=dmat[[2]],
plots_per_page=3L,figsize=c(7,8),
dpi=100L,output_file=of)
}
In this example (see pdf files produced), we can see that a linear model should be sufficient to represent the relationship between the probability of deforestation and the standardized distance to the nearest road or town. On the contrary, it could be interesting to fit a polynomial model for the the standardized distance to previous deforestation (dist_defor
variable) for which the relationship seems non-linear. Several models can be fitted and compared to see if a second-order or third-order polynomial relationship is justified.
We propose to use the Binomial icar model (Vieilledent et al. 2014) to estimate the deforestation probability of a pixel given a set of environmental variables. The Binomial icar model is a linear Binomial logistic regression model including an intrinsic Conditional Autoregressive (icar) process to account for the spatial autocorrelation of the observations. Parameter inference is done in a hierarchical Bayesian framework. The .model_binomial_icar()
function from the forestatrisk
module includes a Metropolis-within-Gibbs algorithm written in pure C code to reduce computation time.
Figure caption: Parameter inference is done in a hierarchical Bayesian framework. Bayesian statistics rely on the Bayes’ theorem named after Reverend Thomas Bayes. Each parameter has a prior and an approximated posterior probability distribution from which we can compute the mean, standard deviation, credible intervals at 95%, etc.
For the deforestation process it is very important to take into account the spatial autocorrelation of the process with spatial random effects. Indeed, the selected fixed environmental variables are not able to fully explain the spatial variability of the deforestation process, especially when working at large geographical scales, such as the national or continental scale. Spatial random effects allow estimating a higher/lower probability of deforestation in a particular region (associated to unmeasurable or unknow factors) that is different from the mean probability of deforestation derived from the environmental factors included in the model. The Binomial icar model can be described as follow:
Ecological process
\[\begin{equation} y_i \sim \mathcal{B}inomial(\theta_i,t_i) \\ \text{logit}(\theta_i) = X_i \beta + \rho_{j(i)} \end{equation}\]\(y_i\): random variable for the deforestation process (0 if no deforestation, 1 if deforestation)
\(\theta_i\): probability of deforestation
\(t_i\): number of trials (always 1 in our example)
\(X_i\): vector of values for environmental explicative variables
\(\beta\): vector of fixed effect parameters
\(\rho_j\): spatial random effect
\(j(i)\): index of the spatial entity for observation \(i\).
Spatial autocorrelation
The spatial autocorrelation is modelled with an intrinsic conditional autoregressive (icar) process:
\[\begin{equation} \rho_j \sim \mathcal{N}ormal(\mu_j,V_{\rho} / n_j) \end{equation}\]\(\mu_j\): mean of \(\rho_{j'}\) in the neighborhood of \(j\).
\(V_{\rho}\): variance of the spatial random effects.
\(n_j\): number of neighbors for spatial entity \(j\).
Figure caption: Representation of the neighborhood for the intrinsic conditional autoregressive (icar) process. Target spatial cell \(j\) has 8 neighbors in this case. Several observations (black points, equivalent to pixel centers in our case) can be located in each spatial cell. Deforestation probability in one spatial cell \(j\) depends on deforestation probability in neighboring cells.
Before running the model, we add a column indicating the number of trials for each observation (1 in our case as we are considering a Bernoulli process). We then remove any observation with non-available data (NA) from the data-set. We also compute the number of neighbors (nneigh
) and the neighbor identifiers (adj
) for each spatial cell using function .cellneigh
from the forestatrisk
module.
# Spatial cells for spatial-autocorrelation
neighborhood <- far$cellneigh_ctry(raster="data/model/fordefor2010.tif",
vector="data/mada/mada38s.shp",
csize=10L, rank=1L)
nneigh <- neighborhood[[1]]
adj <- neighborhood[[2]]
cell_in <- neighborhood[[3]]
ncell <- neighborhood[[4]]
# Udpate cell number in training dataset
cell_rank <- vector()
for (i in 1:nrow(data_train)) {
cell_rank[i] <- which(cell_in==data_train$cell[i])-1 # ! cells start at zero
}
data_train$cell <- cell_rank
# Udpate cell number in validation dataset
cell_rank <- vector()
for (i in 1:nrow(data_valid)) {
cell_rank[i] <- which(cell_in==data_valid$cell[i])-1 # ! cells start at zero
}
data_valid$cell <- cell_rank
A model formula must also be defined to specify the explicative variables we want to include in the model. The formula allows specifying some variable transformations (such as standardization in our case). See the patsy
module for more information. In our model, we included the following variables: location inside a protected area, altitude, distance to past deforestation (with a degree two polynomial), distance to forest edge, distance to nearest road and distance to nearest town. The formula must end with the name of the variable indicating the spatial cell for each observation (cell
in our case).
# Formula
data_train$trials <- 1 # Set number of trials to one
formula <- paste0("I(1-fordefor2010) + trials ~ C(sapm) + scale(altitude) +
scale(slope) +",
"scale(dist_defor) + np.power(scale(dist_defor),2) + ",
"scale(dist_edge) + ",
"scale(dist_road) + scale(dist_town) + cell")
# Model
mod_icar <- far$model_binomial_iCAR(
# Observations
suitability_formula=formula, data=data_train,
# Spatial structure
n_neighbors=np_array(nneigh,dtype="int32"), neighbors=np_array(adj,dtype="int32"),
# Environment
eval_env=-1L,
# Chains
burnin=1000L, mcmc=1000L, thin=1L,
# Starting values
beta_start=-99)
Once the model has been fitted, we can print a summary of the model showing the parameter estimates. The 95% credible intervals obtained from the posterior distribution of each parameter, except distance to nearest town (dist_town
), do not include zero, indicating that parameters are significantly different from zero. The variance of the spatial random effects (Vrho
) is given together with the deviance value, which can be used to compare different statistical models (lower deviance is better). Looking at the parameter estimates, we can see that the deforestation probability is much lower inside protected areas and that deforestation probability decreases with altitude, slope, distance to past deforestation, forest edge, roads and towns. Parameter values are then coherent regarding the deforestation process and easy to interpret.
sink(file="output/summary_mod_icar.txt")
print(mod_icar)
## Binomial logistic regression with iCAR process
## Model: I(1 - fordefor2010) + trials ~ 1 + C(sapm) + scale(altitude) + scale(slope) + scale(dist_defor) + np.power(scale(dist_defor), 2) + scale(dist_edge) + scale(dist_road) + scale(dist_town) + cell
## Posteriors:
## Mean Std CI_low CI_high
## Intercept -0.263 0.0964 -0.485 -0.105
## C(sapm)[T.1.0] -0.587 0.086 -0.787 -0.426
## scale(altitude) -0.662 0.0602 -0.778 -0.538
## scale(slope) -0.158 0.0348 -0.226 -0.0919
## scale(dist_defor) -0.905 0.0731 -1.06 -0.771
## np.power(scale(dist_defor), 2) 0.071 0.00877 0.0497 0.0893
## scale(dist_edge) -0.545 0.0518 -0.654 -0.454
## scale(dist_road) -0.14 0.0548 -0.24 -0.0382
## scale(dist_town) 0.0438 0.056 -0.0915 0.137
## Vrho 8.45 0.899 7.09 10.4
## Deviance 9.54e+03 75.5 9.38e+03 9.68e+03
sink()
print(mod_icar)
## Binomial logistic regression with iCAR process
## Model: I(1 - fordefor2010) + trials ~ 1 + C(sapm) + scale(altitude) + scale(slope) + scale(dist_defor) + np.power(scale(dist_defor), 2) + scale(dist_edge) + scale(dist_road) + scale(dist_town) + cell
## Posteriors:
## Mean Std CI_low CI_high
## Intercept -0.263 0.0964 -0.485 -0.105
## C(sapm)[T.1.0] -0.587 0.086 -0.787 -0.426
## scale(altitude) -0.662 0.0602 -0.778 -0.538
## scale(slope) -0.158 0.0348 -0.226 -0.0919
## scale(dist_defor) -0.905 0.0731 -1.06 -0.771
## np.power(scale(dist_defor), 2) 0.071 0.00877 0.0497 0.0893
## scale(dist_edge) -0.545 0.0518 -0.654 -0.454
## scale(dist_road) -0.14 0.0548 -0.24 -0.0382
## scale(dist_town) 0.0438 0.056 -0.0915 0.137
## Vrho 8.45 0.899 7.09 10.4
## Deviance 9.54e+03 75.5 9.38e+03 9.68e+03
To check for the convergence of the Markov chain Monte Carlo (MCMC), we can plot the traces and the posterior distributions of the estimated parameters using method .plot()
associated to the hSDM_binomial_icar
class defined in the deforestprob
module. This method returns the figures showing the traces and posterior distributions.
require(coda)
mcmc <- as.mcmc(mod_icar$mcmc)
plot(mcmc[,c(1:3,10,11)])
We use the model to predict the spatial probability of deforestation at the national scale for Madagascar. Before, doing so, we smooth the spatial random effects which have been estimated at a coarse resolution (10km in our example). To do so, we use the function .resample_rho()
from the deforestprob
module to resample the results at a finer resolution using a bilinear interpolation. The function writes a raster file to the disk with a resolution of the raster specified in the argument input_raster
of the function (1km in our case). The function .resample_rho()
returns a figure of the spatial random effects that can be plotted.
# Get the spatial random effects
rho <- rep(-9999,ncell) # -9999 will be considered as nodata
rho[cell_in+1] <- mod_icar$rho
# Resample them
fig <- far$resample_rho(rho=r_to_py(rho), input_raster="data/model/fordefor2010.tif",
output_file="output/rho.tif",
csize_orig=10L, csize_new=1L)
# Plot random effects
fig <- far$plot$rho("output/rho_orig.tif",output_file="output/rho_orig.png")
fig <- far$plot$rho("output/rho.tif",output_file="output/rho.png")
# Plot with R
mada <- rgdal::readOGR(dsn="data/mada",layer="mada38s", verbose=FALSE)
r.rho_orig <- raster("output/rho_orig.tif")
r.rho <- raster("output/rho.tif")
rho_plot(r.rho_orig, mada, output_file="output/rho_orig_ggplot.png",
quantiles_legend=c(0.025,0.975),width=4.5, height=8)
## Results plotted to "output/rho_orig_ggplot.png"
rho_plot(r.rho, mada, output_file="output/rho_ggplot.png",
quantiles_legend=c(0.025,0.975),width=4.5, height=8)
## Results plotted to "output/rho_ggplot.png"
The .predict_raster_binomial_icar()
function of the forestatrisk
module can be used to predict the spatial probability of deforestation from an hSDM_binomial_icar model (i.e. an object of class hSDM_binomial_icar
). The function writes a raster of predictions to the disk. The prediction is done by block to avoid memory problems for big datasets. Functions will return NA for pixels with no forest or for pixels with missing environmental variables.
if (!file.exists("output/prob_icar.tif")) {
far$predict_raster_binomial_iCAR(
mod_binomial_icar, var_dir="data/model",
input_cell_raster="output/rho.tif",
input_forest_raster="data/model/fordefor2010.tif",
output_file="output/prob_icar.tif",
blk_rows=128L)
}
The raster of predictions can then be plotted.
if (!file.exists("output/prob_icar.png")) {
fig <- far$plot$prob("output/prob_icar.tif",
output_file="output/prob_icar.png",
figsize=c(4,4))
}
knitr::include_graphics("output/prob_icar.png")
Given the spatial probability of deforestation and the number of hectares that should be deforested, we can predict the future forest cover using function .deforest()
from the forestatrisk
package. The number of hectares are converted into number of pixels to be deforested. Pixels with the highest probability of deforestation are deforested first. The function computes a probability threshold above which pixels are deforested.
In our example, we consider an annual deforestation of roughly 100,000 ha for Madagascar. Considering the period 2010-2050, this would correspond to 4 Mha of deforestation. This number can of course be more precise and refined considering various deforestation scenarios (demographic growth, economic development, etc.).
if (!file.exists("output/proj2050_icar.tif")) {
deforest <- far$deforest(input_raster="output/prob_icar.tif",
hectares=4000000,
output_file="output/proj2050_icar.tif",
blk_rows=128L)
}
We can plot the predicted future forest cover in 2050 with leaflet. We first reproject the raster to WGS 84 / World Mercator projection (EPSG:3857) and we resample the map at 250 m using function gdalwarp
from GDAL
. The 250 m resolution raster that can be transformed into a lower size image that can be plotted with leaflet.
if [ ! -f output/proj2050_icar_epsg3857_ov32.tif ]
then
gdalwarp -overwrite -s_srs EPSG:32738 -t_srs EPSG:3857 -tr 250 250 \
-ot Byte -r near -co "COMPRESS=LZW" -co "PREDICTOR=2" -co "BIGTIFF=YES" \
output/proj2050_icar.tif output/proj2050_icar_epsg3857_ov32.tif
fi
We also set the color of the map and the extent of the view for leaflet.
# Colors and extent view for leaflet
r <- raster("output/proj2050_icar_epsg3857_ov32.tif")
pal <- colorFactor(c("red", "darkgreen"), c(0,1), na.color = "transparent")
r1 <- raster(nrows=1,ncols=1,ext=extent(340000,412000,7420000,7500000),
crs=CRS("+init=epsg:32738"))
r2 <- projectExtent(r1, crs=CRS("+init=epsg:4326"))
ext2 <- extent(r2)
The red areas represent the deforestation on the period 2010-2050. The green areas represent the remaining forest in 2050. Most of the remaining forest in 2050 are inside the protected areas or located in remote areas, at high altitudes and far from roads and big cities (for example in the Tsaratanana mountain region and around the Masoala peninsula, north-east Madagascar).
# Leaflet map
m <- leaflet() %>% addTiles() %>%
addRasterImage(r, colors=pal, opacity=0.8, project=FALSE) %>%
fitBounds(ext2@xmin,ext2@ymin,ext2@xmax,ext2@ymax)
m
With our model, we found a deviance of 9580.
Using the validation data-set, we can compute several model performance indices (Liu, White, and Newell 2011):
It is known that the value of some of these indices might vary with the intensity of deforestation (Pontius et al. 2008).
In our case, we compute these indices for various values of the deforestation rate (in percentage of deforestation observations) in our validation data-set (1, 5, 10, 25, 50). For each percentage, the threshold used to convert the predicted probabilities of deforestation were choosen in order to respect the deforestation rate (in %) in the observations.
source("R/perf.R")
# Performance icar
set.seed(1234)
performance_icar <- performance_index(data_valid,theta_pred=mod_icar$predict(data_valid))
write.table(performance_icar,file="output/performance_icar.txt")
performance_icar
## perc FOM OA EA Spe Sen TSS K AUC
## 1 1 0.03 0.98 0.98 0.99 0.06 0.05 0.05 0.83
## 2 5 0.15 0.93 0.90 0.96 0.26 0.22 0.22 0.81
## 3 10 0.23 0.87 0.82 0.93 0.37 0.30 0.30 0.83
## 4 25 0.38 0.77 0.62 0.85 0.55 0.39 0.39 0.80
## 5 50 0.58 0.74 0.50 0.73 0.74 0.47 0.47 0.81
Considering the period 2010-2017, we can compare model’s predictions in the future with the observed deforestation on the same period.
First, we compute the observed forest cover change raster on the period 2010-2017 from the forest cover in 2010 and 2017.
if [ ! -f output/proj2017_obs.tif ]
then
# Set nodata different from 255
gdal_translate -a_nodata 99 -co 'COMPRESS=LZW' -co 'PREDICTOR=2' \
data/model/fordefor2010.tif output/fordefor2010_.tif
gdal_translate -a_nodata 99 -co 'COMPRESS=LZW' -co 'PREDICTOR=2' \
data/validation/for2017.tif output/for2017_.tif
# Compute forest-cover change between 2010 and 2017
gdal_calc.py --overwrite -A output/fordefor2010_.tif -B output/for2017_.tif --outfile=output/proj2017_obs.tif --type=Byte --calc='255-254*(A==1)*(B==1)-255*(A==1)*(B==255)' --co 'COMPRESS=LZW' --co 'PREDICTOR=2' --NoDataValue=255
fi
We compute the deforested area on the period 2010-2017.
# Number of deforested pixels
count_d <- far$countpix("output/proj2017_obs.tif", value=0L, blk_rows=128L)
Results are in hectares and number of pixels.
count_d
## $area
## [1] 871336.2
##
## $npix
## [1] 9681513
Second, we derive the predicted forest-cover change map for the period 2010-2017 following a total deforestation of 871,336 hectares between the two dates.
Note: Because forest covers about 9.3 Mha in 2010 (Vieilledent et al. 2018), this corresponds to a total deforestation rate of 9% on the period 2010-2017, and to an annual deforestation rate of about 1.3%.yr\(^{-1}\).
# Model predictions 2010-2017
if (!file.exists("output/proj2017_icar.tif")) {
fcc_icar_2017 <- far$deforest(input_raster="output/prob_binomial_icar.tif",
hectares=871336, #count_d$area,
output_file="output/proj2017_icar.tif",
blk_rows=128L)
}
Third, we compute the indices based on the two maps.
indices_2010_2017_icar <- far$validation(pred="output/proj2017_icar.tif",
obs="output/proj2017_obs.tif",
blk_rows=128L)
as.data.frame(indices_2010_2017_icar)
## FOM K Spe OA TSS Sen
## 1 0.12 0.13 0.92 0.85 0.13 0.21
# Null model
formula_null <- "I(1-fordefor2010) ~ 1"
dmat_null <- patsy$dmatrices(formula_null, data=r_to_py(data_train), NA_action="drop",
return_type="dataframe", eval_env=-1L)
Y <- dmat_null[[1]]
X_null <- dmat_null[[2]]
mod_null <- sm$GLM(Y, X_null, family=sm$families$Binomial())$fit()
print(mod_null$summary())
## Generalized Linear Model Regression Results
## ===============================================================================
## Dep. Variable: I(1 - fordefor2010) No. Observations: 9983
## Model: GLM Df Residuals: 9982
## Model Family: Binomial Df Model: 0
## Link Function: logit Scale: 1.0000
## Method: IRLS Log-Likelihood: -6919.1
## Date: ven., 21 déc. 2018 Deviance: 13838.
## Time: 17:02:11 Pearson chi2: 9.98e+03
## No. Iterations: 3 Covariance Type: nonrobust
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 0.0218 0.020 1.091 0.275 -0.017 0.061
## ==============================================================================
# Simple glm with no spatial random effects
formula_glm <- paste0("I(1-fordefor2010) ~ C(sapm) + scale(altitude) + ",
"scale(slope) + scale(dist_defor) + I(scale(dist_defor)*scale(dist_defor)) + ",
"scale(dist_edge) + scale(dist_road) + scale(dist_town)")
mod_glm <- smf$glm(formula_glm, r_to_py(data_train),
family=sm$families$Binomial(), eval_env=-1L)$fit()
# Summary glm
sink(file="output/summary_mod_binomial_glm.txt")
print(mod_glm$summary())
## Generalized Linear Model Regression Results
## ===============================================================================
## Dep. Variable: I(1 - fordefor2010) No. Observations: 9983
## Model: GLM Df Residuals: 9974
## Model Family: Binomial Df Model: 8
## Link Function: logit Scale: 1.0000
## Method: IRLS Log-Likelihood: -6370.0
## Date: ven., 21 déc. 2018 Deviance: 12740.
## Time: 17:02:12 Pearson chi2: 1.29e+04
## No. Iterations: 5 Covariance Type: nonrobust
## ============================================================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------------------------------------
## Intercept 0.0371 0.028 1.316 0.188 -0.018 0.092
## C(sapm)[T.1.0] -0.3654 0.047 -7.697 0.000 -0.458 -0.272
## scale(altitude) -0.1456 0.026 -5.616 0.000 -0.196 -0.095
## scale(slope) -0.1347 0.025 -5.363 0.000 -0.184 -0.085
## scale(dist_defor) -0.7181 0.044 -16.272 0.000 -0.805 -0.632
## I(scale(dist_defor) * scale(dist_defor)) 0.0707 0.007 9.446 0.000 0.056 0.085
## scale(dist_edge) -0.3965 0.034 -11.536 0.000 -0.464 -0.329
## scale(dist_road) -0.0301 0.023 -1.323 0.186 -0.075 0.015
## scale(dist_town) -0.0631 0.022 -2.807 0.005 -0.107 -0.019
## ============================================================================================================
sink()
print(mod_glm$summary())
## Generalized Linear Model Regression Results
## ===============================================================================
## Dep. Variable: I(1 - fordefor2010) No. Observations: 9983
## Model: GLM Df Residuals: 9974
## Model Family: Binomial Df Model: 8
## Link Function: logit Scale: 1.0000
## Method: IRLS Log-Likelihood: -6370.0
## Date: ven., 21 déc. 2018 Deviance: 12740.
## Time: 17:02:12 Pearson chi2: 1.29e+04
## No. Iterations: 5 Covariance Type: nonrobust
## ============================================================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------------------------------------
## Intercept 0.0371 0.028 1.316 0.188 -0.018 0.092
## C(sapm)[T.1.0] -0.3654 0.047 -7.697 0.000 -0.458 -0.272
## scale(altitude) -0.1456 0.026 -5.616 0.000 -0.196 -0.095
## scale(slope) -0.1347 0.025 -5.363 0.000 -0.184 -0.085
## scale(dist_defor) -0.7181 0.044 -16.272 0.000 -0.805 -0.632
## I(scale(dist_defor) * scale(dist_defor)) 0.0707 0.007 9.446 0.000 0.056 0.085
## scale(dist_edge) -0.3965 0.034 -11.536 0.000 -0.464 -0.329
## scale(dist_road) -0.0301 0.023 -1.323 0.186 -0.075 0.015
## scale(dist_town) -0.0631 0.022 -2.807 0.005 -0.107 -0.019
## ============================================================================================================
formula_rf <- paste0("I(1-fordefor2010) ~ 0 + sapm + altitude + ",
"slope + dist_defor + dist_edge + dist_road + dist_town")
mod_rf <- far$model_random_forest(formula=formula_rf, data=data_train,
eval_env=-1L, n_estimators=500L, n_jobs=2L)
formula_rf_xy <- paste0("I(1-fordefor2010) ~ 0 + sapm + altitude + ",
"slope + dist_defor + dist_edge + dist_road + dist_town + X + Y")
mod_rf_xy <- far$model_random_forest(formula=formula_rf_xy, data=data_train,
eval_env=-1L, n_estimators=500L, n_jobs=2L)
# Deviances
deviance_null <- mod_null$deviance
deviance_glm <- mod_glm$deviance
deviance_icar <- mod_icar$deviance
deviance_full <- 0
# Table
dev <- c(deviance_null, deviance_glm, deviance_icar, deviance_full)
mod_dev <- data.frame(model=c("null", "glm", "icar", "full"),
deviance=round(dev))
perc <- 100*(1-mod_dev$deviance/deviance_null)
mod_dev$perc <- round(perc)
write.table(mod_dev,file="output/deviance_model_comparison.txt",sep=",",row.names=FALSE)
mod_dev
## model deviance perc
## 1 null 13838 0
## 2 glm 12740 8
## 3 icar 9539 31
## 4 full 0 100
performance_null <- performance_index(data_valid,model="null")
performance_glm <- performance_index(data_valid,mod_glm$predict(data_valid))
performance_rf <- performance_index(data_valid,mod_rf$predict(data_valid))
performance_rf_xy <- performance_index(data_valid,mod_rf_xy$predict(data_valid))
# Save
write.table(performance_null, file="output/performance_null.txt",
sep="\t", row.names=FALSE)
write.table(performance_glm, file="output/performance_glm.txt",
sep="\t", row.names=FALSE)
write.table(performance_rf, file="output/performance_rf.txt",
sep="\t", row.names=FALSE)
write.table(performance_rf_xy, file="output/performance_rf_xy.txt",
sep="\t", row.names=FALSE)
write.table(performance_icar, file="output/performance_icar.txt",
sep="\t", row.names=FALSE)
# Save for perc=10%
df_mod_valid <- rbind(performance_null[3,],performance_glm[3,],
performance_rf[3,],performance_rf_xy[3,],
performance_icar[3,])
df_mod_valid <- df_mod_valid %>%
dplyr::mutate(mod=c("null","glm","rf","rf_xy","icar")) %>%
dplyr::select(10,2:9)
write.table(df_mod_valid, file="output/df_mod_valid.txt", sep="\t", row.names=FALSE)
# Print
df_mod_valid
## mod FOM OA EA Spe Sen TSS K AUC
## 1 null 0.06 0.82 0.82 0.90 0.11 0.01 0.01 0.51
## 2 glm 0.10 0.84 0.82 0.91 0.18 0.09 0.09 0.69
## 3 rf 0.15 0.85 0.82 0.92 0.27 0.19 0.18 0.77
## 4 rf_xy 0.23 0.87 0.82 0.93 0.37 0.30 0.30 0.83
## 5 icar 0.23 0.87 0.82 0.93 0.37 0.30 0.30 0.83
Models + environmental variables (raster) => probabilities
if (!file.exists("output/prob_glm.tif")) {
# Compute predictions with logistic regression
fig_prob_glm <- far$predict_raster(mod_glm, var_dir="data/model",
input_forest_raster="data/model/fordefor2010.tif",
output_file="output/prob_glm.tif",
blk_rows=128L, transform=TRUE)
}
if (!file.exists("output/prob_rf.tif")) {
# Compute predictions with logistic regression
fig_prob_rf <- far$predict_raster(mod_rf, var_dir="data/model",
input_forest_raster="data/model/fordefor2010.tif",
output_file="output/prob_rf.tif",
blk_rows=128L)
}
if (!file.exists("output/prob_rf_xy.tif")) {
# Compute predictions with logistic regression
fig_prob_rf <- far$predict_raster(mod_rf_xy, var_dir="data/model",
input_forest_raster="data/model/fordefor2010.tif",
output_file="output/prob_rf_xy.tif",
blk_rows=128L)
}
if (!file.exists("output/proj2017_glm.tif")) {
deforest <- far$deforest(input_raster="output/prob_glm.tif",
hectares=871336,
output_file="output/proj2017_glm.tif",
blk_rows=128L)
}
if (!file.exists("output/proj2017_rf.tif")) {
deforest <- far$deforest(input_raster="output/prob_rf.tif",
hectares=871336,
output_file="output/proj2017_rf.tif",
blk_rows=128L)
}
if (!file.exists("output/proj2017_rf_xy.tif")) {
deforest <- far$deforest(input_raster="output/prob_rf_xy.tif",
hectares=871336,
output_file="output/proj2017_rf_xy.tif",
blk_rows=128L)
}
if (!file.exists("output/val_npix_glm.txt")) {
tval_glm_df <- far$validation_npix(r_pred="output/proj2017_glm.tif",
r_obs="output/proj2017_obs.tif",
output_file="output/val_npix_glm.txt",
value_f=1,
value_d=0,
square_size=33L)
}
tval_glm_df <- read.table("output/val_npix_glm.txt",header=TRUE,sep=",")
e <- extent(raster("output/proj2017_glm.tif"))
cor_glm <- cor_scale(df=tval_glm_df, e=e, square_size=33)
if (!file.exists("output/val_npix_rf.txt")) {
tval_rf_df <- far$validation_npix(r_pred="output/proj2017_rf.tif",
r_obs="output/proj2017_obs.tif",
output_file="output/val_npix_rf.txt",
value_f=1,
value_d=0,
square_size=33L)
}
tval_rf_df <- read.table("output/val_npix_rf.txt",header=TRUE,sep=",")
e <- extent(raster("output/proj2017_rf.tif"))
cor_rf <- cor_scale(df=tval_rf_df, e=e, square_size=33)
if (!file.exists("output/val_npix_rf_xy.txt")) {
tval_rf_xy_df <- far$validation_npix(r_pred="output/proj2017_rf_xy.tif",
r_obs="output/proj2017_obs.tif",
output_file="output/val_npix_rf_xy.txt",
value_f=1,
value_d=0,
square_size=33L)
}
tval_rf_xy_df <- read.table("output/val_npix_rf_xy.txt",header=TRUE,sep=",")
e <- extent(raster("output/proj2017_rf_xy.tif"))
cor_rf_xy <- cor_scale(df=tval_rf_xy_df, e=e, square_size=33)
if (!file.exists("output/val_npix_icar.txt")) {
tval_icar_df <- far$validation_npix(r_pred="output/proj2017_icar.tif",
r_obs="output/proj2017_obs.tif",
output_file="output/val_npix_icar.txt",
value_f=1,
value_d=0,
square_size=33L)
}
tval_icar_df <- read.table("output/val_npix_icar.txt",header=TRUE,sep=",")
e <- extent(raster("output/proj2017_icar.tif"))
cor_icar <- cor_scale(df=tval_icar_df, e=e, square_size=33)
cor_comp <- cor_glm %>%
dplyr::rename(cor_glm=cor) %>%
dplyr::mutate(cor_rf=cor_rf$cor,cor_rf_xy=cor_rf_xy$cor,cor_icar=cor_icar$cor) %>%
dplyr::select(-coeff)
readr::write_csv(cor_comp,path="output/cor_comp.txt")
cor_comp
## box_size nbox cor_glm cor_rf cor_rf_xy cor_icar
## 1 990 1250562 0.07882742 0.2867619 0.3494042 0.2675677
## 2 1980 313026 0.11088693 0.3517335 0.4096922 0.3144153
## 3 4950 50367 0.16215767 0.4520354 0.5031885 0.3865071
## 4 9900 12710 0.18872271 0.5289786 0.5805331 0.4521283
## 5 14850 5665 0.21097195 0.5669719 0.6078905 0.4733076
## 6 24750 2046 0.23378389 0.6061704 0.6428317 0.4948558
## 7 49500 527 0.27305075 0.6699511 0.7024582 0.5513582
## 8 74250 231 0.27429559 0.6296216 0.6950429 0.5236612
## 9 99000 144 0.28805695 0.6909386 0.7781211 0.6076498
## 10 198000 40 0.34713081 0.7274813 0.7619694 0.5518821
Liu, Canran, Matt White, and Graeme Newell. 2011. “Measuring and Comparing the Accuracy of Species Distribution Models with Presence-Absence Data.” Ecography 34 (2):232–43. https://doi.org/10.1111/j.1600-0587.2010.06354.x.
Pontius, Robert, Wideke Boersma, Jean-Christophe Castella, Keith Clarke, Ton de Nijs, Charles Dietzel, Zengqiang Duan, et al. 2008. “Comparing the Input, Output, and Validation Maps for Several Models of Land Change.” The Annals of Regional Science 42 (1). Springer Berlin / Heidelberg:11–37. https://doi.org/10.1007/s00168-007-0138-2.
Vieilledent, Ghislain, Clovis Grinand, Fety A. Rakotomalala, Rija Ranaivosoa, Jean-Roger Rakotoarijaona, Thomas F. Allnutt, and Frédéric Achard. 2018. “Combining Global Tree Cover Loss Data with Historical National Forest Cover Maps to Look at Six Decades of Deforestation and Forest Fragmentation in Madagascar.” Biological Conservation 222:189–97. https://doi.org/10.1016/j.biocon.2018.04.008.
Vieilledent, Ghislain, Cory Merow, Jérôme Guélat, Andrew M. Latimer, Marc Kéry, Alan E. Gelfand, Adam M. Wilson, Frédéric Mortier, and John A. Silander Jr. 2014. hSDM: hierarchical Bayesian species distribution models. https://doi.org/10.5281/zenodo.594920.