The GPBoost
algorithm extends linear mixed effects and Gaussian process models
by replacing the linear fixed effects function with a non-parametric
non-linear function modeled using tree-boosting. This article shows how
the GPBoost algorithm implemented in the GPBoost
library can be used for modeling data with a spatial and grouped
structure. We demonstrate the functionality of the GPBoost
library using European GDP data, which is an example of areal spatial
econometric data. All results are obtained using GPBoost
version 1.2.1. This demo uses the R package, but the corresponding
Python package provides the same functionality.
Applying a GPBoost model (= combined tree-boosting and random effects / GP models) involves the following main steps:
Define a GPModel in which you specify the random effects part,
i.e., grouped random effects and/or Gaussian processes
Create a gpb.Dataset with the response variable and
fixed effects predictor variables
Choose tuning parameters, e.g., using the function
gpb.grid.search.tune.parameters
Train the model with the gpboost /
gpb.train function
Interpret the trained model and/or make predictions
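In compact form, these steps can be sketched as follows. This is a schematic outline only: `group`, `coords`, `X`, and `y` stand in for your own data, and the concrete arguments for the GDP data are shown throughout this article.

```r
library(gpboost)
# 1. Define the random effects part: grouped random effects and/or a Gaussian process
gp_model <- GPModel(group_data = group, gp_coords = coords,
                    likelihood = "gaussian", cov_function = "exponential")
# 2. Create a gpb.Dataset with the response and the fixed effects predictor variables
boost_data <- gpb.Dataset(data = X, label = y)
# 3. Choose tuning parameters, e.g., with gpb.grid.search.tune.parameters
params <- list(learning_rate = 0.01, max_depth = 2)
# 4. Train; gpboost() and gpb.train() are equivalent
gpboost_model <- gpb.train(data = boost_data, gp_model = gp_model,
                           nrounds = 100, params = params)
# 5. Interpret the trained model and/or make predictions with predict()
```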
The data used in this demo is European gross domestic product (GDP) data. It can be downloaded from https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv. The data was collected by Massimo Giannini, University of Rome Tor Vergata, from Eurostat and kindly provided to Fabio Sigrist for a talk at the University of Rome Tor Vergata on June 16, 2023.
Data was collected for 242 European regions for the two years 2000 and 2021, i.e., the total number of data points is 484. The response variable y is the logarithmic GDP per capita. There are four predictor variables: L, K, pop, and edu. Further, there are centroid spatial coordinates of the regions (Long and Lat), a spatial region ID (group), and a spatial cluster ID (cl).
We first load the data and create a map illustrating the (log) GDP / capita over space. We create two maps: one with all data and another one when excluding some remote islands. In the commented code below, we also show how to create a spatial plot when no shape file for spatial polygons is available.
library(gpboost)
library(ggplot2)
library(gridExtra)
library(viridis)
library(sf)
# Load data
data <- read.csv("https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv")
FID <- data$FID
data <- as.matrix(data[,names(data)!="FID"]) # convert to matrix since the boosting part currently does not support data.frames
covars <- c("L", "K", "pop", "edu")
# Load shape file for spatial plots
cur_tempfile <- tempfile()
download.file(url = "https://raw.githubusercontent.com/fabsig/GPBoost/master/data/shape_European_regions.zip", destfile = cur_tempfile)
out_directory <- tempfile()
unzip(cur_tempfile, exdir = out_directory)
shape <- st_read(dsn = out_directory)
# Create spatial plot of GDP
data_plot <- merge(shape, data.frame(FID = FID, y = data[,"y"]), by="FID")
p1 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="GDP / capita (log)", option = "B") +
ggtitle("GDP / capita (log)")
# Same plot excluding islands
p2 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="GDP / capita (log)", option = "B") +
xlim(2700000,6400000) + ylim(1500000,5200000) +
ggtitle("GDP / capita (log) -- excluding islands")
grid.arrange(p1, p2, ncol=2)
# # Spatial plot without a shape file
# p1 <- ggplot(data = data.frame(Lat=data[,"Lat"], Long=data[,"Long"],
# GDP=data[,"y"]), aes(x=Long,y=Lat,color=GDP)) +
# geom_point(size=2, alpha=0.5) + scale_color_viridis(option = "B") +
# ggtitle("GDP / capita (log)")
# p2 <- ggplot(data = data.frame(Lat=data[,"Lat"], Long=data[,"Long"],
# GDP=data[,"y"]), aes(x=Long,y=Lat,color=GDP)) +
# geom_point(size=3, alpha=0.5) + scale_color_viridis(option = "B") +
# ggtitle("GDP / capita (log) -- Europe excluding islands") + xlim(-10,28) + ylim(35,67)
# grid.arrange(p1, p2, ncol=2)
In the following, we use a Gaussian process model with an exponential
covariance function to model spatial random effects. Additionally, we
include grouped random effects for the cluster variable cl. In the
GPBoost library, Gaussian process random effects are
defined by the gp_coords argument and grouped random
effects via the group_data argument of the
GPModel constructor. The above-mentioned predictor
variables are used in the fixed effects tree-ensemble function. We fit
the model using the gpboost, or equivalently the
gpb.train, function. Note that we use tuning parameters
that are selected below using cross-validation.
gp_model <- GPModel(group_data = data[, c("cl")],
gp_coords = data[, c("Long", "Lat")],
likelihood = "gaussian", cov_function = "exponential")
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
min_data_in_leaf = 10, lambda_l2 = 0)
nrounds <- 37
# gp_model$set_optim_params(params = list(trace=TRUE)) # To monitor hyperparameter estimation
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model,
nrounds = nrounds, params = params,
verbose = 1) # same as gpb.train
It is important that tuning parameters are appropriately chosen for
boosting. There are no universal default values and every data set will
likely need different tuning parameters. Below we show how tuning
parameters can be chosen using the
gpb.grid.search.tune.parameters function. We use the mean
square error (mse) as prediction accuracy measure on the
validation data. Alternatively, one can also use, e.g., the test
negative log-likelihood (test_neg_log_likelihood = default
value if nothing is specified) which also takes prediction uncertainty
into account. Depending on the data set and the grid size, this can
take some time. Instead of a deterministic grid search as below, one can
also do a random grid search to speed things up (see
num_try_random).
gp_model <- GPModel(group_data = data[, "cl"],
gp_coords = data[, c("Long", "Lat")],
likelihood = "gaussian", cov_function = "exponential")
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
param_grid = list("learning_rate" = c(1,0.1,0.01),
"min_data_in_leaf" = c(10,100,1000),
"max_depth" = c(1,2,3,5,10),
"lambda_l2" = c(0,1,10))
other_params <- list(num_leaves = 2^10)
set.seed(1)
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
num_try_random = NULL, nfold = 4,
data = boost_data, gp_model = gp_model,
nrounds = 1000, early_stopping_rounds = 10,
verbose_eval = 1, metric = "mse") # metric = "test_neg_log_likelihood"
opt_params
# ***** New best test score (l2 = 0.0255393919591794) found for the following parameter combination: learning_rate: 0.01, min_data_in_leaf: 10, max_depth: 2, lambda_l2: 0, nrounds: 37
Information on the estimated random effects model can be obtained by
calling the summary function on the gp_model.
For Gaussian processes, GP_var is the marginal variance,
i.e., the amount of spatial correlation or structured spatial variation,
and GP_range is the range, or scale, parameter that
measures how fast correlation decays over space. For an exponential
covariance function, three times this value (approx. 17 here) is the
distance where the (residual) spatial correlation is essentially zero
(below 5%). As the results below show, the amount of spatial correlation
is relatively small since the marginal variance of 0.0089 is small
compared to the total variance of the response variable which is approx.
0.29. Additionally, the error term and the cl grouped random effects also
have small variances (0.013 and 0.022). We thus conclude that most of
the variance in the response variable is explained by the fixed effects
predictor variables.
summary(gp_model)
## =====================================================
## Covariance parameters (random effects):
## Param.
## Error_term 0.0131
## Group_1 0.0221
## GP_var 0.0089
## GP_range 5.7502
## =====================================================
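The claim about the effective range can be checked directly: for an exponential covariance function, the correlation at distance d is exp(-d / range), so at three times the range the correlation has dropped to exp(-3), just below 5%.

```r
# Correlation of an exponential covariance at distance d: exp(-d / range)
gp_range <- 5.7502     # estimated GP_range from the summary above
3 * gp_range           # approx. 17.25, the "effective range" quoted in the text
exp(-3)                # approx. 0.0498, i.e., the correlation falls below 5%
```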
We can plot the estimated (“predicted”) spatial random effects at the
training data locations by calling the predict function on
the training data; see the code below. Such a plot shows the spatial
effect when factoring out the effect of the fixed effects predictor
variables. Note that these spatial effects take into account both the
spatial Gaussian process and the grouped region cluster random effects.
If one wants to obtain only spatial random effects from the Gaussian
process part, one can use the
predict_training_data_random_effects function (see the
commented code below). Alternatively, one can also do spatial
extrapolation (=“kriging”), but this does not make much sense for areal
data.
pred <- predict(gpboost_model, group_data_pred = data[1:242, c("cl")],
gp_coords_pred = data[1:242, c("Long", "Lat")],
data = data[1:242, covars], predict_var = TRUE, pred_latent = TRUE)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean), by="FID")
plot_mu <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="Spatial effect", option = "B") +
ggtitle("Spatial effect (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_cov), by="FID")
plot_sd <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="Std. dev.", option = "B") +
ggtitle("Uncertainty (std. dev.)") + xlim(2700000,6400000) + ylim(1500000,5200000)
grid.arrange(plot_mu, plot_sd, ncol=2)
# Only spatial effects from the Gaussian process
# rand_effs <- predict_training_data_random_effects(gp_model, predict_var = TRUE)[1:242, c("GP", "GP_var")]
# data_plot <- merge(shape, data.frame(FID = FID, y = rand_effs[,1]), by="FID")
# plot_mu <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
# scale_fill_viridis(name="Spatial effect (mean)", option = "B") +
# ggtitle("Spatial effect from Gaussian process (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
# data_plot <- merge(shape, data.frame(FID = FID, y = rand_effs[,2]), by="FID")
# plot_sd <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
# scale_fill_viridis(name="Uncertainty (std. dev.)", option = "B") +
# ggtitle("Uncertainty (std. dev.)") + xlim(2700000,6400000) + ylim(1500000,5200000)
# grid.arrange(plot_mu, plot_sd, ncol=2)
There are several tools for understanding the form of the fixed effects function. Below we consider variable importance measures, interaction measures, and dependence plots. Specifically, we look at SHAP values, SHAP dependence plots, a SHAP force plot, split-based variable importances, Friedman’s H-statistic for interactions, and (one- and two-dimensional) partial dependence plots.
As the results below show, the information obtained from SHAP values and SHAP dependence plots is the same as when looking at traditional variable importance measures and partial dependence plots. The most important variables are ‘K’ and ‘edu’. From the dependence plots, we conclude that there are non-linearities. For instance, the effect of K is almost flat for large and small values of K and increasing in-between. Further, the effect of edu is steeper for small values of edu and flattens for larger values of edu. Friedman’s H-statistic indicates that there are some interactions. For the two variables with the largest amount of interaction, L and pop, we create a two-dimensional partial dependence plot below. Further, the SHAP force plot shows that the effect of the predictor variables is different for the year 2000 compared to the year 2021.
# SHAP values
library(SHAPforxgboost)
shap.plot.summary.wrap1(gpboost_model, X = data[,covars]) + ggtitle("SHAP values")
# SHAP dependence plots
shap_long <- shap.prep(gpboost_model, X_train = data[,covars])
shap.plot.dependence(data_long = shap_long, x = covars[2],
color_feature = covars[4], smooth = FALSE, size = 2) +
ggtitle("SHAP dependence plot for K")
shap.plot.dependence(data_long = shap_long, x = covars[4],
color_feature = covars[2], smooth = FALSE, size = 2) +
ggtitle("SHAP dependence plot for edu")
# SHAP force plot
shap_contrib <- shap.values(gpboost_model, X_train = data[,covars])
plot_data <- cbind(shap_contrib$shap_score, ID = 1:dim(data)[1])
shap.plot.force_plot(plot_data, zoom_in=FALSE, id = "ID") +
scale_fill_discrete(name="Variable") + xlab("Regions") +
geom_vline(xintercept=242.5, linewidth=1) + ggtitle("SHAP force plot") +
geom_text(x=100,y=-1.2,label="2000") + geom_text(x=400,y=-1.2,label="2021")
# Split-based feature importances
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
main = "Split-based variable importances")
# H-statistic for interactions
library(flashlight)
fl <- flashlight(model = gpboost_model, data = data.frame(y = data[,"y"], data[,covars]),
y = "y", label = "gpb",
predict_fun = function(m, X) predict(m, data.matrix(X[,covars]),
gp_coords_pred = matrix(-100, ncol = 2, nrow = dim(X)[1]),
group_data_pred = matrix(-1, ncol = 1, nrow = dim(X)[1]),
pred_latent = TRUE)$fixed_effect)
plot(imp <- light_interaction(fl, v = covars, pairwise = TRUE)) +
ggtitle("H interaction statistic") # takes a few seconds
# Partial dependence plots
gpb.plot.partial.dependence(gpboost_model, data[,covars], variable = 2, xlab = covars[2],
ylab = "gdp", main = "Partial dependence plot for K" )
gpb.plot.partial.dependence(gpboost_model, data[,covars], variable = 4, xlab = covars[4],
ylab = "gdp", main = "Partial dependence plot edu" )
# Two-dimensional partial dependence plot (to visualize interactions)
i <- 1; j <- 3  # i vs j
gpb.plot.part.dep.interact(gpboost_model, data[,covars], variables = c(i,j), xlab = covars[i],
ylab = covars[j], main = "Pairwise partial dependence plot for pop and L")
In the above model, we have used the same random effects for both
years 2000 and 2021. Alternatively, one can also use independent spatial
and grouped random effects for different time periods (independent
under the prior, conditional on the data, there is dependence …).
In the GPBoost library, this can be done via the
cluster_ids argument. cluster_ids needs to be
a vector whose length equals the sample size, and every entry indicates the
“cluster” to which an observation belongs. Different clusters then
have independent spatial and grouped random effects, but the
hyperparameters (e.g., marginal variance, variance components, etc.) and
the fixed effects function are the same across clusters.
Below we show how we can fit such a model and create two separate spatial maps. As the results show, the spatial effects are quite different for the two years 2000 and 2021. In particular, there is less (residual) spatial variation (or heterogeneity) for the year 2021 compared to 2000. This is confirmed by the standard deviation of the predicted spatial random effects, which is almost twice as large for 2000 as for 2021 (see below). A further conclusion is that in 2000, there were more regions with “low” GDP numbers (spatial effects below 0), and this is no longer the case for 2021.
gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
likelihood = "gaussian", cov_function = "exponential",
cluster_ids = c(rep(1,242), rep(2,242)))
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they would have to be chosen again
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
params = params, verbose = 0)
# Separate spatial maps for the years 2000 and 2021
pred <- predict(gpboost_model, group_data_pred = data[, c("cl")],
gp_coords_pred = data[, c("Long", "Lat")],
data = data[, covars], predict_var = TRUE, pred_latent = TRUE,
cluster_ids_pred = c(rep(1,242), rep(2,242)))
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[1:242]), by="FID")
plot_mu_2000 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="Spatial effect", option = "B") +
ggtitle("Spatial effect for 2000 (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[243:484]), by="FID")
plot_mu_2021 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
scale_fill_viridis(name="Spatial effect", option = "B") +
ggtitle("Spatial effect for 2021 (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
grid.arrange(plot_mu_2000, plot_mu_2021, ncol=2)
sd(pred$random_effect_mean[1:242]) # 0.2321267
## [1] 0.2321267
sd(pred$random_effect_mean[243:484]) # 0.1286398
## [1] 0.1286398
In the above model, there is no interaction between the random effects and the fixed effects predictor variables. I.e., there is no interaction between the spatial coordinates and the fixed effects predictor variables. Such interaction can be modeled by additionally including the random effects input variables (= the spatial coordinates or the categorical grouping variable) in the fixed effects function. The code below shows how this can be done. As the variable importance plot below shows, the coordinates are not used in the tree-ensemble, and there is thus no such interaction detectable for this data set.
gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
likelihood = "gaussian", cov_function = "exponential")
covars_interact <- c(c("Long", "Lat"), covars) ## add spatial coordinates to fixed effects predictor variables
boost_data <- gpb.Dataset(data = data[, covars_interact], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they would have to be chosen again
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
params = params, verbose = 0)
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
main = "Var. imp. when including coordinates in the fixed effects")
For large data sets, calculations with Gaussian processes are slow,
and one has to use an approximation. The GPBoost library
implements several ones. For instance, setting
gp_approx="vecchia" in the GPModel constructor
will use a Vecchia approximation. The data set used in this article is
relatively small, and we can do all calculations exactly.
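For illustration, the following sketch shows how such an approximation could be specified. The num_neighbors argument controls the accuracy/speed trade-off of the Vecchia approximation; the value 20 here is an arbitrary choice, not a recommendation.

```r
# Sketch: Gaussian process with a Vecchia approximation for large data sets
# (not needed for the small data set in this article)
gp_model_vecchia <- GPModel(gp_coords = data[, c("Long", "Lat")],
                            likelihood = "gaussian", cov_function = "exponential",
                            gp_approx = "vecchia", num_neighbors = 20)
```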
Above, we have used a Gaussian process to model spatial random
effects. Since the data is areal data, another option is to use models
that rely on neighborhood information such as CAR and SAR models. These
models are currently not yet implemented in the GPBoost
library (they might be added in the future; contact the author).
Another option is to use a grouped random effects model defined by
the categorical region ID variable for modeling spatial effects. The
code below shows how this can be done in GPBoost. However,
this model essentially ignores the spatial structure.
gp_model <- GPModel(group_data = data[, c("group", "cl")], likelihood = "gaussian")
(Generalized) linear mixed effects and Gaussian process models can
also be run in the GPBoost library. The code below shows
how this can be done with the fitGPModel function. Note
that one needs to manually add a column of 1’s to include an intercept.
With the option params = list(std_dev=TRUE), one obtains
standard deviations and p-values in the summary
function.
X_incl_1 = cbind(Intercept = rep(1,dim(data)[1]), data[, covars])
gp_model <- fitGPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
likelihood = "gaussian", cov_function = "exponential",
y = data[, "y"], X = X_incl_1, params = list(std_dev=TRUE))
summary(gp_model)
## =====================================================
## Model summary:
## Log-lik AIC BIC
## 195.97 -373.94 -336.30
## Nb. observations: 484
## Nb. groups: 2 (Group_1)
## -----------------------------------------------------
## Covariance parameters (random effects):
## Param. Std. dev.
## Error_term 0.0215 0.0017
## Group_1 0.0003 0.0008
## GP_var 0.0126 0.0047
## GP_range 7.2823 3.6585
## -----------------------------------------------------
## Linear regression coefficients (fixed effects):
## Param. Std. dev. z value P(>|z|)
## Intercept 16.2816 0.4559 35.7128 0.0000
## L 0.4243 0.0565 7.5042 0.0000
## K 0.6493 0.0212 30.6755 0.0000
## pop 0.0134 0.0097 1.3812 0.1672
## edu 0.0100 0.0009 10.9645 0.0000
## =====================================================