Introduction

Figure caption: Main causes of deforestation in central Menabe. a-a’: Slash-and-burn agriculture (“hatsake”) for peanut crop. Peanut (a’) is cultivated as a cash crop. Part of the production is at the destination of the national market but most of it is exported outside Madagascar, mainly for the Chinese market. b-b’: Slash-and-burn agriculture for maize crop. maize (b’) is cultivated for auto-consumption and as a cash crop. The production of maize is at the destination of the national market and is used in particular to brew the national beers. c-c’: Cyclone followed by uncontrolled fires. Cyclone “Fanele” (2009) caused tree mortality and accumulation of wood fuel on the ground. As a consequence, uncontrolled fires set on nearby pastures (c’) spread over large areas of forest after 2009. d-d’: Illegal logging. Timbers are used for house and boat construction.

This tutorial is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Computing environment

Installing Docker

Docker is used to run software packages called “containers”. Containers are isolated from each other and bundle their own tools, libraries and configuration files. Containers are created from “images” that specify their precise contents.

To install Docker for your operating system, go to https://docs.docker.com/install/. For Windows user, go directly to https://docs.docker.com/docker-for-windows/install/.

Docker image

For our tutorial, we have built a specific Docker image including R and Python:

  • R computing environment: R 3.5.1, RStudio, and various R packages (ex. knitr, ggplot2, reticulate).
  • Python computing environment: Python 3, and various Python packages (ex. numpy, pandas, matplotlib, and forestatrisk).

Some computations, such as computation by blocks on large raster files, were more easily coded and faster with Python. That’s why we decided to use Python as a complementary language to R. The reticulate R package allow us to use Python functions from R and thus to work in a single computing environment within RStudio.

Load the forestatrisk Docker image

Download the image from internet with the following command in a terminal: docker pull ghislainv/docker-forestatrisk . (do not forget the final “.”)

Note: If the internet connection is slow, use a USB key and copy the image somewhere on your computer and load it with the following command: docker load -i <path to image tar file>

Run the container with docker run -d -e PASSWORD=PWD -e ROOT=true -p 8787:8787 -v ORIGIN_FOLDER:/home/rstudio ghislainv/docker-forestatrisk.

Note:

  • Set your own password PWD, it cannot be rstudio.
  • Replace ORIGIN_FOLDER with a folder on your local machine where you want to save the R scripts. For example /c/Users/YOUR_NAME or /home/YOUR_NAME.

Point your browser to localhost:8787 to access RStudio interface. Log in with rstudio/PWD.

Note on docker’s containers and images:

  • List running containers: docker ps
  • List images: docker images or docker image list
  • Stop container: docker stop <id container>
  • Remove container: docker rm <idcontainer>

Get the tutorial

In RStudio, open a terminal and clone the GitHub repository locally:
git clone https://github.com/ghislainv/forecasting-deforestation-Mada

Navigate with RStudio in the folder named forecasting-deforestation-Mada and open the R Notebook file named training.Rmd. R Markdown files (*.Rmd) can be used in R to embbed text, code, table and figues inside a unique document. Code are organized into ‘chunks’ that can be executed independently and interactively. An R Notebook is an R Markdown document with output visible immediately beneath the input.

I used this notebook to write the tutorial available at https://ecology.ghislainv.fr/forecasting-deforestation-Mada.

In the preambule, change the author and date with your name and today’s date.

Python in R

Specify the Python version to use with reticulate and check that it is the right one.

use_python("/usr/bin/python3.5")
py_config()
## python:         /usr/bin/python3.5
## libpython:      /usr/lib/python3.5/config-3.5m-x86_64-linux-gnu/libpython3.5.so
## pythonhome:     /usr:/usr
## version:        3.5.3 (default, Sep 27 2018, 17:25:39)  [GCC 6.3.0 20170516]
## numpy:          /usr/local/lib/python3.5/dist-packages/numpy
## numpy_version:  1.15.3
## 
## python versions found: 
##  /usr/bin/python3.5
##  /usr/bin/python
##  /usr/bin/python3

Import the Python modules into R.

far <- import("forestatrisk")
patsy <- import("patsy")
sm <- import("statsmodels.api")
smf <- import("statsmodels.formula.api")

R/Python intro

R

Some exercises
a <- c(20,5,6,2,9,12)
b <- 1
c <- a + b
print(c)
## [1] 21  6  7  3 10 13
Simple regression
# Generating data
nobs <- 100
x.seq <- runif(n=nobs,min=0,max=100)
a.target <- 2
b.target <- 2
sigma2.target <- 300
y.seq <- a.target + b.target*x.seq +
         rnorm(n=nobs,mean=0,sd=sqrt(sigma2.target))

# Data-set
Data <- as.data.frame(cbind(y.seq,x.seq))

# Plot
plot(x.seq,y.seq)

# Estimation
M <- lm(y.seq~x.seq,data=Data)
summary(M)
## 
## Call:
## lm(formula = y.seq ~ x.seq, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.340 -11.649   1.587  11.141  41.977 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.61161    3.03037   1.852   0.0671 .  
## x.seq        2.00341    0.05435  36.864   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.6 on 98 degrees of freedom
## Multiple R-squared:  0.9327, Adjusted R-squared:  0.932 
## F-statistic:  1359 on 1 and 98 DF,  p-value: < 2.2e-16
M$coefficients
## (Intercept)       x.seq 
##    5.611605    2.003410
var(M$residuals)
## [1] 240.9543
M$df
## [1] 98
# Graphics 
x.pred <- seq(from=0,to=100,length.out=100) # We will predict y for 100 values of x
X.pred <- as.data.frame(x.pred)
names(X.pred) <- "x.seq"
Y.pred <- predict.lm(M,X.pred,interval="confidence",
                     type="response")

plot(x.seq,y.seq,col=grey(0.7),
     xlab="X",
     ylab="Y",
     main="Simple linear regression") # Row data

# Confidence envelops for predictive posterior
lines(x.pred,Y.pred[,1],col="red",lwd=3)
lines(x.pred,Y.pred[,2],col="blue",lwd=2,lty=2)
lines(x.pred,Y.pred[,3],col="blue",lwd=2,lty=2)

Python

Some exercises

You can execute Python code within a R notebooks. We need to indicate which version of Python to use in the notebook for Python chunk.

import numpy as np
a = np.array([20,5,6,2,9,12])
b = 1
c = a + b
print(type(c))
## <class 'numpy.ndarray'>
print(c)
## [21  6  7  3 10 13]

Preparing the data-set

Downloading the geospatial data

To model the spatial probability of deforestation, we need a map of the past deforestation and maps of potential explicative environmental variables. Environmental variables can be derived from topography (altitude and slope), accessibility (distance to roads, towns, and forest edge), deforestation history (distance to previous deforestation) or landscape policy (protected area network) for example. In our example, we use the following variables :

tab1 <- read.table("tables/variables.txt",sep=",",header=TRUE)
names(tab1) <- c("Product","Source","Variable derived","Unit","Resolution (m)")
# Kable
knitr::kable(tab1, linesep="") %>%
  kable_styling(bootstrap_options = c("striped"))
Product Source Variable derived Unit Resolution (m)
Deforestation maps (1990-2000-2010) BioSceneMada (1) forest/non-forest 30
distance to forest edge m 30
distance to previous deforestation m 30
Digital Elevation Model SRTM v4.1 CSI-CGIAR (2) altitude m 90
slope ° 90
Highways OSM - Geofabrik (3) distance to roads m 150
Places distance to towns m 150
Waterways distance to river m 150
Protected areas Rebioma (4) presence of protected area 30
  1. http://bioscenemada.cirad.fr, deforestation maps from Vieilledent et al. (2018)
  2. http://srtm.csi.cgiar.org, SRTM 90m Digital Elevation Database v4.1
  3. http://www.geofabrik.de, data extracts from the OpenStreetMap project for Madagascar,
  4. http://rebioma.net, SAPM (“Système des Aires Protégées à Madagascar”), 20/12/2010 version
# Make data directory
if (!dir.exists("data")) {
  dir.create("data")
  dir.create("data/model")
  dir.create("data/mada")
  dir.create("data/validation")
}

# Files
files <- c("fordefor2010.tif", "dist_edge.tif", "dist_defor.tif",
           "altitude.tif", "slope.tif",
           "dist_road.tif", "dist_town.tif", "dist_river.tif",
           "sapm.tif", "roads.kml", "towns.kml", "rivers.kml", "sapm.kml")

# Download model data
for (i in files) {
  if (!file.exists(glue("data/model/{i}"))) {
    f <- glue("https://zenodo.org/record/259582/files/{i}")
    curl_download(f, glue("data/model/{i}"), quiet=FALSE)
  }
}

# Download validation data
if (!file.exists("data/validation/for2017.tif")) {
  f <- "http://bioscenemada.cirad.fr/FileTransfer/for2017.tif"
  curl_download(f, "data/validation/for2017.tif", quiet=FALSE)
}

# Download mada outline
if (!file.exists("data/mada/mada38s.shp")) {
  f <- "http://bioscenemada.cirad.fr/FileTransfer/mada38s.zip"
  curl_download(f, "data/mada/mada38s.zip", quiet=FALSE)
  unzip("data/mada/mada38s.zip", exdir="data/mada")
}

In our example, fordefor2010.tif is a forest raster at 30m for the year 2010 considering the deforestation on the period 2000-2010 in Madagascar. We can plot this raster and zoom on a region with function .plot.forest() in the forestatrisk package to see the deforestation data. The remaining forest appears in green and the deforested areas appear in red.

# Make output directory
if (!dir.exists("output")) {
  dir.create("output")
}
# Plot forest cover change 2000-2010
fig <- far$plot$fcc(input_fcc_raster="data/model/fordefor2010.tif",
             output_file="output/fcc.png",
             col=c(255,0,0,255),  # rgba color for deforestation
             figsize=c(5,5),
             dpi=150,
             zoom=c(340000,412000,7420000,7500000))
knitr::include_graphics("output/fcc.png")

Sampling points

We use the function .sample() from the forestatrisk module to sample 10,000 points (pixel centers) in the deforested areas and in the remaining forest (20,000 points in total). The input_forest_raster argument defines the path to the forest raster including the deforested pixels (with value=0) and the remaining forest pixels (value=1) after a given period of time. The random seed can be set with argument Seed to reproduce the data of the random sampling.

The .sample() function also extracts information from environmental variables for each sampled point. Sampling is done by block to allow the computation on large study areas (e.g. country or continental scale) with a fine spatial resolution (e.g. 30m). The var_dir argument indicates the directory including all the environmental raster files (they must be GeoTIFF files with extension .tif) that we want to test.

The .sample() function identifies the spatial cell for each sample point (sample point are grouped within a spatial cell). Spatial cells and grouped observations are used to estimate the spatial autocorrelation of the deforestation process. The csize argument define the width (in km) of the square spatial cells. Each spatial cell will be attributed a parameter. To avoid estimating too many parameters, width of the square spatial cells must be sufficiently large. Both points sampling and extraction of environmental data are done by block to avoid memory problems for big datasets.

import forestatrisk
help(forestatrisk.sample)
## Help on function sample in module forestatrisk.sample:
## 
## sample(nsamp=10000, Seed=1234, csize=10, var_dir='data', input_forest_raster='forest.tif', output_file='output/sample.txt', blk_rows=0)
##     Sample points and extract raster values.
##     
##     This function (i) randomly draw spatial points in deforested and
##     forested areas and (ii) extract environmental variable values for
##     each spatial point.
##     
##     :param nsamp: number of random spatial points.
##     :param seed: seed for random number generator.
##     :param csize: spatial cell size in km.
##     :param var_dir: directory with raster data.
##     :param input_forest_raster: name of the forest raster file (1=forest,     0=deforested) in the var_dir directory
##     :param output_file: path to file to save sample points.
##     :param blk_rows: if > 0, number of lines per block.
##     :return: a pandas DataFrame, each row being one observation.
# Training data-set
if (!file.exists("output/sample.txt")) {
  samp <- far$sample(nsamp=10000L, Seed=1234L, csize=10L,
                     var_dir="data/model",
                     input_forest_raster="fordefor2010.tif",
                     output_file="output/sample.txt",
                     blk_rows=1L)
}
samp <- read.table("output/sample.txt", header=TRUE, sep=",")
set.seed(1234)
train <- sample(1:20000, size=10000, replace=FALSE)
data_train <- samp[train,] %>% dplyr::filter(complete.cases(.))
data_valid <- samp[-train,] %>% dplyr::filter(complete.cases(.))
head(data_train)
##   altitude dist_defor dist_edge dist_river dist_road dist_town
## 1       67       1146       108        150     37237       474
## 2      760       4082       811       3690     11767      7697
## 3     1080       6081      1776       9828     43487     16278
## 4       34        680        85        300     24553      1485
## 5      993       5544       190       3903     27649     14575
## 6      699        450       450       3750     23028      8255
##   fordefor2010 sapm slope       X       Y cell
## 1            0    0     8 1042455 8235255 3638
## 2            1    1    10  978285 8272545 3307
## 3            1    1     5  949095 8290275 3224
## 4            1    0    14 1027425 8271045 3393
## 5            1    1     5  727185 7544595 9195
## 6            1    0     8  980085 8236275 3632

Sampled observations can be plotted using function .plot.obs() from the forestatrisk module. Dark red dots indicate deforestation observations and dark green dots indicate forest observations.

# Plot sample points
fig <- far$plot$obs(sample=data_train,
             name_forest_var="fordefor2010",
             input_fcc_raster="data/model/fordefor2010.tif",
             output_file="output/obs.png",
             zoom=c(340000,412000,7420000,7500000),
             figsize=c(5,5),#c(11.69,8.27),
             s=5,dpi=300)
knitr::include_graphics("output/obs.png")

Descriptive statistics

Before modelling the deforestation, it is important to look at the relationship between environmental variables and deforestation. Using formulas from the patsy Python module, we can specify the relationships that we want to look at. In the example below, we plot the relationships between some continuous environmental variables and the probability of deforestation using function .plot.correlation() from the forestatrisk package. Note that -1 must be set at the end of the formula. The function .correlation() returns a serie of graphics that can be analyzed to choose the right relationship for each continuous variable (linear or polynomial for example).

# Descriptive statistics

# Model formulas
formula_1 <- paste0("fordefor2010 ~ dist_road + dist_town + dist_defor +",
                    "dist_river + dist_edge + altitude + slope - 1")
# Standardized variables (mean=0, std=1)
formula_2 <- paste0("fordefor2010 ~ scale(dist_road) + scale(dist_town) +",
                    "scale(dist_defor) + scale(dist_river) + scale(dist_edge) +",
                    "scale(altitude) + scale(slope) - 1")
formulas <- c(formula_1, formula_2)

# List to store figures
corr_fig <- list()

# Loop on formulas
for (f in 1:length(formulas)) {
    # Output file
    of <- glue("output/correlation_{f}.pdf")
    # Data
    dmat <- patsy$dmatrices(formulas[f], data=data_train, eval_env=-1L,
                                  return_type="dataframe")
    # Plots
    fig <- far$plot$correlation(y=dmat[[1]],data=dmat[[2]],
                         plots_per_page=3L,figsize=c(7,8),
                         dpi=100L,output_file=of)
}

In this example (see pdf files produced), we can see that a linear model should be sufficient to represent the relationship between the probability of deforestation and the standardized distance to the nearest road or town. On the contrary, it could be interesting to fit a polynomial model for the the standardized distance to previous deforestation (dist_defor variable) for which the relationship seems non-linear. Several models can be fitted and compared to see if a second-order or third-order polynomial relationship is justified.

Deforestation model

We propose to use the Binomial icar model (Vieilledent et al. 2014) to estimate the deforestation probability of a pixel given a set of environmental variables. The Binomial icar model is a linear Binomial logistic regression model including an intrinsic Conditional Autoregressive (icar) process to account for the spatial autocorrelation of the observations. Parameter inference is done in a hierarchical Bayesian framework. The .model_binomial_icar() function from the forestatrisk module includes a Metropolis-within-Gibbs algorithm written in pure C code to reduce computation time.

Figure caption: Parameter inference is done in a hierarchical Bayesian framework. Bayesian statistics rely on the Bayes’ theorem named after Reverend Thomas Bayes. Each parameter has a prior and an approximated posterior probability distribution from which we can compute the mean, standard deviation, credible intervals at 95%, etc.

For the deforestation process it is very important to take into account the spatial autocorrelation of the process with spatial random effects. Indeed, the selected fixed environmental variables are not able to fully explain the spatial variability of the deforestation process, especially when working at large geographical scales, such as the national or continental scale. Spatial random effects allow estimating a higher/lower probability of deforestation in a particular region (associated to unmeasurable or unknow factors) that is different from the mean probability of deforestation derived from the environmental factors included in the model. The Binomial icar model can be described as follow:

Ecological process

\[\begin{equation} y_i \sim \mathcal{B}inomial(\theta_i,t_i) \\ \text{logit}(\theta_i) = X_i \beta + \rho_{j(i)} \end{equation}\]

\(y_i\): random variable for the deforestation process (0 if no deforestation, 1 if deforestation)

\(\theta_i\): probability of deforestation

\(t_i\): number of trials (always 1 in our example)

\(X_i\): vector of values for environmental explicative variables

\(\beta\): vector of fixed effect parameters

\(\rho_j\): spatial random effect

\(j(i)\): index of the spatial entity for observation \(i\).

Spatial autocorrelation

The spatial autocorrelation is modelled with an intrinsic conditional autoregressive (icar) process:

\[\begin{equation} \rho_j \sim \mathcal{N}ormal(\mu_j,V_{\rho} / n_j) \end{equation}\]

\(\mu_j\): mean of \(\rho_{j'}\) in the neighborhood of \(j\).

\(V_{\rho}\): variance of the spatial random effects.

\(n_j\): number of neighbors for spatial entity \(j\).

Figure caption: Representation of the neighborhood for the intrinsic conditional autoregressive (icar) process. Target spatial cell \(j\) has 8 neighbors in this case. Several observations (black points, equivalent to pixel centers in our case) can be located in each spatial cell. Deforestation probability in one spatial cell \(j\) depends on deforestation probability in neighboring cells.

Model formula

A model formula must also be defined to specify the explicative variables we want to include in the model. The formula allows specifying some variable transformations (such as standardization in our case). See the patsy module for more information. In our model, we included the following variables: location inside a protected area, altitude, distance to past deforestation (with a degree two polynomial), distance to forest edge, distance to nearest road and distance to nearest town. The formula must end with the name of the variable indicating the spatial cell for each observation (cell in our case).

# Formula
data_train$trials <- 1  # Set number of trials to one
formula <- paste0("I(1-fordefor2010) + trials ~ C(sapm) + scale(altitude) +
                  scale(slope) +",
                  "scale(dist_defor) + np.power(scale(dist_defor),2) + ",
                  "scale(dist_edge) + ",
                  "scale(dist_road) + scale(dist_town) + cell")

Model summary

Once the model has been fitted, we can print a summary of the model showing the parameter estimates. The 95% credible intervals obtained from the posterior distribution of each parameter, except distance to nearest town (dist_town), do not include zero, indicating that parameters are significantly different from zero. The variance of the spatial random effects (Vrho) is given together with the deviance value, which can be used to compare different statistical models (lower deviance is better). Looking at the parameter estimates, we can see that the deforestation probability is much lower inside protected areas and that deforestation probability decreases with altitude, slope, distance to past deforestation, forest edge, roads and towns. Parameter values are then coherent regarding the deforestation process and easy to interpret.

sink(file="output/summary_mod_icar.txt")
print(mod_icar)
## Binomial logistic regression with iCAR process
##   Model: I(1 - fordefor2010) + trials ~ 1 + C(sapm) + scale(altitude) + scale(slope) + scale(dist_defor) + np.power(scale(dist_defor), 2) + scale(dist_edge) + scale(dist_road) + scale(dist_town) + cell
##   Posteriors:
##                                      Mean        Std     CI_low    CI_high
##                      Intercept     -0.263     0.0964     -0.485     -0.105
##                 C(sapm)[T.1.0]     -0.587      0.086     -0.787     -0.426
##                scale(altitude)     -0.662     0.0602     -0.778     -0.538
##                   scale(slope)     -0.158     0.0348     -0.226    -0.0919
##              scale(dist_defor)     -0.905     0.0731      -1.06     -0.771
## np.power(scale(dist_defor), 2)      0.071    0.00877     0.0497     0.0893
##               scale(dist_edge)     -0.545     0.0518     -0.654     -0.454
##               scale(dist_road)      -0.14     0.0548      -0.24    -0.0382
##               scale(dist_town)     0.0438      0.056    -0.0915      0.137
##                           Vrho       8.45      0.899       7.09       10.4
##                       Deviance   9.54e+03       75.5   9.38e+03   9.68e+03
sink()
print(mod_icar)
## Binomial logistic regression with iCAR process
##   Model: I(1 - fordefor2010) + trials ~ 1 + C(sapm) + scale(altitude) + scale(slope) + scale(dist_defor) + np.power(scale(dist_defor), 2) + scale(dist_edge) + scale(dist_road) + scale(dist_town) + cell
##   Posteriors:
##                                      Mean        Std     CI_low    CI_high
##                      Intercept     -0.263     0.0964     -0.485     -0.105
##                 C(sapm)[T.1.0]     -0.587      0.086     -0.787     -0.426
##                scale(altitude)     -0.662     0.0602     -0.778     -0.538
##                   scale(slope)     -0.158     0.0348     -0.226    -0.0919
##              scale(dist_defor)     -0.905     0.0731      -1.06     -0.771
## np.power(scale(dist_defor), 2)      0.071    0.00877     0.0497     0.0893
##               scale(dist_edge)     -0.545     0.0518     -0.654     -0.454
##               scale(dist_road)      -0.14     0.0548      -0.24    -0.0382
##               scale(dist_town)     0.0438      0.056    -0.0915      0.137
##                           Vrho       8.45      0.899       7.09       10.4
##                       Deviance   9.54e+03       75.5   9.38e+03   9.68e+03

Plot traces and posteriors

To check for the convergence of the Markov chain Monte Carlo (MCMC), we can plot the traces and the posterior distributions of the estimated parameters using method .plot() associated to the hSDM_binomial_icar class defined in the deforestprob module. This method returns the figures showing the traces and posterior distributions.

require(coda)
mcmc <- as.mcmc(mod_icar$mcmc)
plot(mcmc[,c(1:3,10,11)])

Forecasting deforestation

Resampling the spatial random effects

We use the model to predict the spatial probability of deforestation at the national scale for Madagascar. Before, doing so, we smooth the spatial random effects which have been estimated at a coarse resolution (10km in our example). To do so, we use the function .resample_rho() from the deforestprob module to resample the results at a finer resolution using a bilinear interpolation. The function writes a raster file to the disk with a resolution of the raster specified in the argument input_raster of the function (1km in our case). The function .resample_rho() returns a figure of the spatial random effects that can be plotted.

# Get the spatial random effects
rho <- rep(-9999,ncell)  # -9999 will be considered as nodata
rho[cell_in+1] <- mod_icar$rho

# Resample them
fig <- far$resample_rho(rho=r_to_py(rho), input_raster="data/model/fordefor2010.tif",
                 output_file="output/rho.tif",
                 csize_orig=10L, csize_new=1L)

# Plot random effects
fig <- far$plot$rho("output/rho_orig.tif",output_file="output/rho_orig.png")
fig <- far$plot$rho("output/rho.tif",output_file="output/rho.png")

# Plot with R
mada <- rgdal::readOGR(dsn="data/mada",layer="mada38s", verbose=FALSE)
r.rho_orig <- raster("output/rho_orig.tif")
r.rho <- raster("output/rho.tif")
rho_plot(r.rho_orig, mada, output_file="output/rho_orig_ggplot.png",
         quantiles_legend=c(0.025,0.975),width=4.5, height=8)
## Results plotted to "output/rho_orig_ggplot.png"

rho_plot(r.rho, mada, output_file="output/rho_ggplot.png",
         quantiles_legend=c(0.025,0.975),width=4.5, height=8)
## Results plotted to "output/rho_ggplot.png"

Computing spatial probability of deforestation

The .predict_raster_binomial_icar() function of the forestatrisk module can be used to predict the spatial probability of deforestation from an hSDM_binomial_icar model (i.e. an object of class hSDM_binomial_icar). The function writes a raster of predictions to the disk. The prediction is done by block to avoid memory problems for big datasets. Functions will return NA for pixels with no forest or for pixels with missing environmental variables.

if (!file.exists("output/prob_icar.tif")) {
  far$predict_raster_binomial_iCAR(
    mod_binomial_icar, var_dir="data/model",
    input_cell_raster="output/rho.tif",
    input_forest_raster="data/model/fordefor2010.tif",
    output_file="output/prob_icar.tif",
    blk_rows=128L)
}

The raster of predictions can then be plotted.

if (!file.exists("output/prob_icar.png")) {
  fig <- far$plot$prob("output/prob_icar.tif",
                       output_file="output/prob_icar.png",
                       figsize=c(4,4))
}
knitr::include_graphics("output/prob_icar.png")

Predicting future forest cover

Given the spatial probability of deforestation and the number of hectares that should be deforested, we can predict the future forest cover using function .deforest() from the forestatrisk package. The number of hectares are converted into number of pixels to be deforested. Pixels with the highest probability of deforestation are deforested first. The function computes a probability threshold above which pixels are deforested.

In our example, we consider an annual deforestation of roughly 100,000 ha for Madagascar. Considering the period 2010-2050, this would correspond to 4 Mha of deforestation. This number can of course be more precise and refined considering various deforestation scenarios (demographic growth, economic development, etc.).

if (!file.exists("output/proj2050_icar.tif")) {
  deforest <- far$deforest(input_raster="output/prob_icar.tif",
                           hectares=4000000,
                           output_file="output/proj2050_icar.tif",
                           blk_rows=128L)
}

We can plot the predicted future forest cover in 2050 with leaflet. We first reproject the raster to WGS 84 / World Mercator projection (EPSG:3857) and we resample the map at 250 m using function gdalwarp from GDAL. The 250 m resolution raster that can be transformed into a lower size image that can be plotted with leaflet.

if [ ! -f output/proj2050_icar_epsg3857_ov32.tif ]
then
  gdalwarp -overwrite -s_srs EPSG:32738 -t_srs EPSG:3857 -tr 250 250 \
         -ot Byte -r near -co "COMPRESS=LZW" -co "PREDICTOR=2" -co "BIGTIFF=YES" \
         output/proj2050_icar.tif output/proj2050_icar_epsg3857_ov32.tif
fi

We also set the color of the map and the extent of the view for leaflet.

# Colors and extent view for leaflet
r <- raster("output/proj2050_icar_epsg3857_ov32.tif")
pal <- colorFactor(c("red", "darkgreen"), c(0,1), na.color = "transparent")
r1 <- raster(nrows=1,ncols=1,ext=extent(340000,412000,7420000,7500000),
             crs=CRS("+init=epsg:32738"))
r2 <- projectExtent(r1, crs=CRS("+init=epsg:4326"))
ext2 <- extent(r2)

The red areas represent the deforestation on the period 2010-2050. The green areas represent the remaining forest in 2050. Most of the remaining forest in 2050 are inside the protected areas or located in remote areas, at high altitudes and far from roads and big cities (for example in the Tsaratanana mountain region and around the Masoala peninsula, north-east Madagascar).

# Leaflet map
m <- leaflet() %>% addTiles() %>%
  addRasterImage(r, colors=pal, opacity=0.8, project=FALSE) %>%
  fitBounds(ext2@xmin,ext2@ymin,ext2@xmax,ext2@ymax)
m

Model performance

Deviance explained

With our model, we found a deviance of 9580.

Spatial validation

Using the validation data-set, we can compute several model performance indices (Liu, White, and Newell 2011):

  • FOM: figure of merit
  • OA: overall accuracy
  • EA: expected accuracy
  • Spe: specificity
  • Sen: sensitivity
  • TSS: true skill statistics
  • K: kappa of Cohen
  • AUC: area under the (ROC) curve

It is known that the value of some of these indices might vary with the intensity of deforestation (Pontius et al. 2008).

In our case, we compute these indices for various values of the deforestation rate (in percentage of deforestation observations) in our validation data-set (1, 5, 10, 25, 50). For each percentage, the threshold used to convert the predicted probabilities of deforestation were choosen in order to respect the deforestation rate (in %) in the observations.

source("R/perf.R")

# Performance icar
set.seed(1234)
performance_icar <- performance_index(data_valid,theta_pred=mod_icar$predict(data_valid))
write.table(performance_icar,file="output/performance_icar.txt")
performance_icar
##   perc  FOM   OA   EA  Spe  Sen  TSS    K  AUC
## 1    1 0.03 0.98 0.98 0.99 0.06 0.05 0.05 0.83
## 2    5 0.15 0.93 0.90 0.96 0.26 0.22 0.22 0.81
## 3   10 0.23 0.87 0.82 0.93 0.37 0.30 0.30 0.83
## 4   25 0.38 0.77 0.62 0.85 0.55 0.39 0.39 0.80
## 5   50 0.58 0.74 0.50 0.73 0.74 0.47 0.47 0.81

Temporal validation

Considering the period 2010-2017, we can compare model’s predictions in the future with the observed deforestation on the same period.

First, we compute the observed forest cover change raster on the period 2010-2017 from the forest cover in 2010 and 2017.

if [ ! -f output/proj2017_obs.tif ]
then
# Set nodata different from 255
gdal_translate -a_nodata 99 -co 'COMPRESS=LZW' -co 'PREDICTOR=2' \
               data/model/fordefor2010.tif output/fordefor2010_.tif

gdal_translate -a_nodata 99 -co 'COMPRESS=LZW' -co 'PREDICTOR=2' \
               data/validation/for2017.tif output/for2017_.tif
               
# Compute forest-cover change between 2010 and 2017
gdal_calc.py --overwrite -A output/fordefor2010_.tif -B output/for2017_.tif --outfile=output/proj2017_obs.tif --type=Byte --calc='255-254*(A==1)*(B==1)-255*(A==1)*(B==255)' --co 'COMPRESS=LZW' --co 'PREDICTOR=2' --NoDataValue=255
fi

We compute the deforested area on the period 2010-2017.

# Number of deforested pixels
count_d <- far$countpix("output/proj2017_obs.tif", value=0L, blk_rows=128L)

Results are in hectares and number of pixels.

count_d
## $area
## [1] 871336.2
## 
## $npix
## [1] 9681513

Second, we derive the predicted forest-cover change map for the period 2010-2017 following a total deforestation of 871,336 hectares between the two dates.

Note: Because forest covers about 9.3 Mha in 2010 (Vieilledent et al. 2018), this corresponds to a total deforestation rate of 9% on the period 2010-2017, and to an annual deforestation rate of about 1.3%.yr\(^{-1}\).

# Model predictions 2010-2017
if (!file.exists("output/proj2017_icar.tif")) {
  fcc_icar_2017 <- far$deforest(input_raster="output/prob_binomial_icar.tif",
                                hectares=871336, #count_d$area,
                                output_file="output/proj2017_icar.tif",
                                blk_rows=128L)
}

Third, we compute the indices based on the two maps.

indices_2010_2017_icar <- far$validation(pred="output/proj2017_icar.tif",
                                         obs="output/proj2017_obs.tif",
                                         blk_rows=128L)
as.data.frame(indices_2010_2017_icar)
##    FOM    K  Spe   OA  TSS  Sen
## 1 0.12 0.13 0.92 0.85 0.13 0.21

Model comparison

Model types

  • null model
  • glm model: simple glm with no spatial random effect model
  • rf model: random forest model with no x,y
  • rfxy: random forest model with x,y
# Null model
formula_null <- "I(1-fordefor2010) ~ 1"
dmat_null <- patsy$dmatrices(formula_null, data=r_to_py(data_train), NA_action="drop",
                             return_type="dataframe", eval_env=-1L)
Y <- dmat_null[[1]]
X_null <- dmat_null[[2]]
mod_null <- sm$GLM(Y, X_null, family=sm$families$Binomial())$fit()
print(mod_null$summary())
##                   Generalized Linear Model Regression Results                  
## ===============================================================================
## Dep. Variable:     I(1 - fordefor2010)   No. Observations:                 9983
## Model:                             GLM   Df Residuals:                     9982
## Model Family:                 Binomial   Df Model:                            0
## Link Function:                   logit   Scale:                          1.0000
## Method:                           IRLS   Log-Likelihood:                -6919.1
## Date:               ven., 21 déc. 2018   Deviance:                       13838.
## Time:                         17:02:11   Pearson chi2:                 9.98e+03
## No. Iterations:                      3   Covariance Type:             nonrobust
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept      0.0218      0.020      1.091      0.275      -0.017       0.061
## ==============================================================================
# Simple glm with no spatial random effects
formula_glm <- paste0("I(1-fordefor2010) ~ C(sapm) + scale(altitude) + ",
                       "scale(slope) + scale(dist_defor) + I(scale(dist_defor)*scale(dist_defor)) + ",
                       "scale(dist_edge) + scale(dist_road) + scale(dist_town)")
mod_glm <- smf$glm(formula_glm, r_to_py(data_train),
                                     family=sm$families$Binomial(), eval_env=-1L)$fit()
# Summary glm
sink(file="output/summary_mod_binomial_glm.txt")
print(mod_glm$summary())
##                   Generalized Linear Model Regression Results                  
## ===============================================================================
## Dep. Variable:     I(1 - fordefor2010)   No. Observations:                 9983
## Model:                             GLM   Df Residuals:                     9974
## Model Family:                 Binomial   Df Model:                            8
## Link Function:                   logit   Scale:                          1.0000
## Method:                           IRLS   Log-Likelihood:                -6370.0
## Date:               ven., 21 déc. 2018   Deviance:                       12740.
## Time:                         17:02:12   Pearson chi2:                 1.29e+04
## No. Iterations:                      5   Covariance Type:             nonrobust
## ============================================================================================================
##                                                coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------------------------------------
## Intercept                                    0.0371      0.028      1.316      0.188      -0.018       0.092
## C(sapm)[T.1.0]                              -0.3654      0.047     -7.697      0.000      -0.458      -0.272
## scale(altitude)                             -0.1456      0.026     -5.616      0.000      -0.196      -0.095
## scale(slope)                                -0.1347      0.025     -5.363      0.000      -0.184      -0.085
## scale(dist_defor)                           -0.7181      0.044    -16.272      0.000      -0.805      -0.632
## I(scale(dist_defor) * scale(dist_defor))     0.0707      0.007      9.446      0.000       0.056       0.085
## scale(dist_edge)                            -0.3965      0.034    -11.536      0.000      -0.464      -0.329
## scale(dist_road)                            -0.0301      0.023     -1.323      0.186      -0.075       0.015
## scale(dist_town)                            -0.0631      0.022     -2.807      0.005      -0.107      -0.019
## ============================================================================================================
sink()
print(mod_glm$summary())
##                   Generalized Linear Model Regression Results                  
## ===============================================================================
## Dep. Variable:     I(1 - fordefor2010)   No. Observations:                 9983
## Model:                             GLM   Df Residuals:                     9974
## Model Family:                 Binomial   Df Model:                            8
## Link Function:                   logit   Scale:                          1.0000
## Method:                           IRLS   Log-Likelihood:                -6370.0
## Date:               ven., 21 déc. 2018   Deviance:                       12740.
## Time:                         17:02:12   Pearson chi2:                 1.29e+04
## No. Iterations:                      5   Covariance Type:             nonrobust
## ============================================================================================================
##                                                coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------------------------------------
## Intercept                                    0.0371      0.028      1.316      0.188      -0.018       0.092
## C(sapm)[T.1.0]                              -0.3654      0.047     -7.697      0.000      -0.458      -0.272
## scale(altitude)                             -0.1456      0.026     -5.616      0.000      -0.196      -0.095
## scale(slope)                                -0.1347      0.025     -5.363      0.000      -0.184      -0.085
## scale(dist_defor)                           -0.7181      0.044    -16.272      0.000      -0.805      -0.632
## I(scale(dist_defor) * scale(dist_defor))     0.0707      0.007      9.446      0.000       0.056       0.085
## scale(dist_edge)                            -0.3965      0.034    -11.536      0.000      -0.464      -0.329
## scale(dist_road)                            -0.0301      0.023     -1.323      0.186      -0.075       0.015
## scale(dist_town)                            -0.0631      0.022     -2.807      0.005      -0.107      -0.019
## ============================================================================================================
formula_rf <- paste0("I(1-fordefor2010) ~ 0 + sapm + altitude + ",
                     "slope + dist_defor + dist_edge + dist_road + dist_town")

mod_rf <- far$model_random_forest(formula=formula_rf, data=data_train, 
                              eval_env=-1L, n_estimators=500L, n_jobs=2L)
formula_rf_xy <- paste0("I(1-fordefor2010) ~ 0 + sapm + altitude + ",
                        "slope + dist_defor + dist_edge + dist_road + dist_town + X + Y")
mod_rf_xy <- far$model_random_forest(formula=formula_rf_xy, data=data_train, 
                                 eval_env=-1L, n_estimators=500L, n_jobs=2L)

Temporal validation

Compute performance indexes

if (!file.exists("output/val_npix_glm.txt")) {
    tval_glm_df <- far$validation_npix(r_pred="output/proj2017_glm.tif",
                                                                         r_obs="output/proj2017_obs.tif",
                                                                         output_file="output/val_npix_glm.txt", 
                                                                         value_f=1,
                                                                         value_d=0,
                                                                         square_size=33L)
}
tval_glm_df <- read.table("output/val_npix_glm.txt",header=TRUE,sep=",")
e <- extent(raster("output/proj2017_glm.tif"))
cor_glm <- cor_scale(df=tval_glm_df, e=e, square_size=33)
if (!file.exists("output/val_npix_rf.txt")) {
    tval_rf_df <- far$validation_npix(r_pred="output/proj2017_rf.tif",
                                                                         r_obs="output/proj2017_obs.tif",
                                                                         output_file="output/val_npix_rf.txt", 
                                                                         value_f=1,
                                                                         value_d=0,
                                                                         square_size=33L)
}
tval_rf_df <- read.table("output/val_npix_rf.txt",header=TRUE,sep=",")
e <- extent(raster("output/proj2017_rf.tif"))
cor_rf <- cor_scale(df=tval_rf_df, e=e, square_size=33)
if (!file.exists("output/val_npix_rf_xy.txt")) {
    tval_rf_xy_df <- far$validation_npix(r_pred="output/proj2017_rf_xy.tif",
                                                                         r_obs="output/proj2017_obs.tif",
                                                                         output_file="output/val_npix_rf_xy.txt", 
                                                                         value_f=1,
                                                                         value_d=0,
                                                                         square_size=33L)
}
tval_rf_xy_df <- read.table("output/val_npix_rf_xy.txt",header=TRUE,sep=",")
e <- extent(raster("output/proj2017_rf_xy.tif"))
cor_rf_xy <- cor_scale(df=tval_rf_xy_df, e=e, square_size=33)
if (!file.exists("output/val_npix_icar.txt")) {
    tval_icar_df <- far$validation_npix(r_pred="output/proj2017_icar.tif",
                                                                         r_obs="output/proj2017_obs.tif",
                                                                         output_file="output/val_npix_icar.txt", 
                                                                         value_f=1,
                                                                         value_d=0,
                                                                         square_size=33L)
}
tval_icar_df <- read.table("output/val_npix_icar.txt",header=TRUE,sep=",")
e <- extent(raster("output/proj2017_icar.tif"))
cor_icar <- cor_scale(df=tval_icar_df, e=e, square_size=33)
cor_comp <- cor_glm %>%
    dplyr::rename(cor_glm=cor) %>%
    dplyr::mutate(cor_rf=cor_rf$cor,cor_rf_xy=cor_rf_xy$cor,cor_icar=cor_icar$cor) %>%
    dplyr::select(-coeff)
readr::write_csv(cor_comp,path="output/cor_comp.txt")
cor_comp
##    box_size    nbox    cor_glm    cor_rf cor_rf_xy  cor_icar
## 1       990 1250562 0.07882742 0.2867619 0.3494042 0.2675677
## 2      1980  313026 0.11088693 0.3517335 0.4096922 0.3144153
## 3      4950   50367 0.16215767 0.4520354 0.5031885 0.3865071
## 4      9900   12710 0.18872271 0.5289786 0.5805331 0.4521283
## 5     14850    5665 0.21097195 0.5669719 0.6078905 0.4733076
## 6     24750    2046 0.23378389 0.6061704 0.6428317 0.4948558
## 7     49500     527 0.27305075 0.6699511 0.7024582 0.5513582
## 8     74250     231 0.27429559 0.6296216 0.6950429 0.5236612
## 9     99000     144 0.28805695 0.6909386 0.7781211 0.6076498
## 10   198000      40 0.34713081 0.7274813 0.7619694 0.5518821

References

Liu, Canran, Matt White, and Graeme Newell. 2011. “Measuring and Comparing the Accuracy of Species Distribution Models with Presence-Absence Data.” Ecography 34 (2):232–43. https://doi.org/10.1111/j.1600-0587.2010.06354.x.

Pontius, Robert, Wideke Boersma, Jean-Christophe Castella, Keith Clarke, Ton de Nijs, Charles Dietzel, Zengqiang Duan, et al. 2008. “Comparing the Input, Output, and Validation Maps for Several Models of Land Change.” The Annals of Regional Science 42 (1). Springer Berlin / Heidelberg:11–37. https://doi.org/10.1007/s00168-007-0138-2.

Vieilledent, Ghislain, Clovis Grinand, Fety A. Rakotomalala, Rija Ranaivosoa, Jean-Roger Rakotoarijaona, Thomas F. Allnutt, and Frédéric Achard. 2018. “Combining Global Tree Cover Loss Data with Historical National Forest Cover Maps to Look at Six Decades of Deforestation and Forest Fragmentation in Madagascar.” Biological Conservation 222:189–97. https://doi.org/10.1016/j.biocon.2018.04.008.

Vieilledent, Ghislain, Cory Merow, Jérôme Guélat, Andrew M. Latimer, Marc Kéry, Alan E. Gelfand, Adam M. Wilson, Frédéric Mortier, and John A. Silander Jr. 2014. hSDM: hierarchical Bayesian species distribution models. https://doi.org/10.5281/zenodo.594920.