
Luca Salvati

Regression Analysis as statistical ‘data mining’

Regression analysis (RA) is a frequently used statistical technique in the economic, social, behavioural, physical, and biological sciences. Its main objective is to quantify the relationship between a dependent variable and one or more independent variables (also called ‘predictors’ or ‘explanatory variables’).
Linear regression explores relationships (e.g. Table 1) that can be readily described by straight lines (Figure 1) on a graph (or their generalization to many dimensions). A very large number of problems can be solved by linear RA, and even more by means of transformations of the original variables that result in linear relationships among the transformed variables.

Table 1. An example of correlation analysis: weather conditions (1993-2008).


Year Precipitation (mm) Aridity index
1993 639,2 0,93
1994 591,4 0,79
1995 597,2 0,76
1996 893,4 1,02
1997 670,0 0,81
1998 782,4 0,86
1999 731,8 0,80
2000 577,1 0,72
2001 497,0 0,66
2002 804,6 0,95
2003 477,2 0,58
2004 906,2 0,99
2005 801,4 0,91
2006 479,9 0,59
2007 443,6 0,56
Figure 1. Searching for graphical ideas: the straight line indicates the equation [y = x]. Points are ten-day comparisons between drought duration and intensity in a Mediterranean place; the filled line indicates the departures of the values measured over time from the simple y = x model. [Scatterplot: drought duration (x-axis, 0-100) against drought intensity (y-axis, 0-100).]

1. Introduction

1.1. Overview of regression analysis

In the statistical sciences, RA includes any technique for modelling several variables when the focus is on the relationship between a dependent variable and one or more independent variables. In other words, and more specifically, RA helps us understand how the typical (e.g. mean) value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Most commonly, RA estimates the conditional expectation of the dependent variable given the independent variables, that is, the average value of the dependent variable when the independent variables are held fixed. Less commonly, the focus is on a quantile, or other location parameter, of the conditional distribution of the dependent variable given the independent variables. Such procedures are meaningful, for example, when analysing strictly nonlinear relationships, or even linear relationships when the variables are not normally distributed.
In all cases (e.g. Figure 2), the estimation target is a mathematical function of the independent variables called the regression function. A further interesting application of RA is to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Figure 2. Possible datasets for RA. Three scatterplots against per-capita disposable income (euro): a non-linear relationship (top panel); a linear relationship? (bottom left, y-axis: water supplied, litres per capita); a squared polynomial relationship with an outlier (bottom right).

1.2. Scope of RA

Regression analysis is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables, but researchers should be cautious when drawing conclusions. Finally, regression analysis is also widely used for prediction (including forecasting of time-series data) and for the analysis of data with a spatially complex structure. Notably, the use of RA for (spatial and temporal) prediction has substantial overlap with the field of machine learning.

Figure 3. Examples of output of time series RA. Two panels showing AW (mm) over the years 1980-2000: Messina (Sicilia) and Montagna del Matese (Campania).

1.3. Computational details

A large body of techniques for carrying out RA has been developed in recent times. Traditional methods, such as linear regression (LR) and ordinary least squares (OLS) regression, are parametric. This means that the regression function used in the computation procedure is defined in terms of a finite number of unknown parameters that are estimated directly from the data. By contrast, nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.
In practice, the performance of a given RA method depends on the form of the data-generating process and on how that process relates to the regression approach being used in the study. Since the true form of the data-generating process is not known, RA depends (at least to some extent) on making assumptions about this process. These assumptions are sometimes (but not always) testable via statistical inference if a sufficiently large amount of data is available.

Notably, regression models for prediction are often useful even when the
assumptions are moderately violated, although they may not perform optimally.
However, as observed earlier, when carrying out inference using a regression model,
especially involving small effects or questions of causality based on observational
data, regression techniques should be used cautiously as they can easily give
misleading results.

Box 1
Back to the past

The earliest form of regression was the method of least squares (originally
known in French as méthode des moindres carrés), which was published by Legendre in
1805 and by Gauss in 1809. Legendre and Gauss both applied the method to the
problem of determining, from astronomical observations, the orbits of bodies about
the Sun. Gauss published a further development of the theory of least squares in 1821,
including a version of the Gauss–Markov theorem.
The term ‘regression’, however, was coined by F. Galton in the nineteenth
century to describe a biological phenomenon. The phenomenon was that the heights of
descendants of tall ancestors tend to regress down towards a normal average (a
phenomenon also known as regression toward the mean). This work was later
extended by U. Yule and K. Pearson to a more general context. In the work of Yule and
Pearson, the joint distribution of the response and explanatory variables is assumed to
be Gaussian.
This assumption was weakened by R.A. Fisher in his works of 1922 and 1925.
Fisher assumed that the conditional distribution of the response variable is Gaussian,
but the joint distribution need not be. In this respect, Fisher's assumption is closer to
Gauss's formulation of 1821.

1.4. Work in progress

Regression methods continue to be an area of active research. In recent decades, original methodologies have been developed for robust regression; regression involving correlated responses such as time series and growth curves; regression in which the predictor or response variables are curves, images, graphs, or other complex data objects; regression methods accommodating various types of missing data; nonparametric regression; Bayesian methods for regression; regression in which the predictor variables are measured with error; regression with more predictor variables than observations; and causal inference with regression.

Figure 4. A traditional regression output.

1.5. Regression assumptions

The classical assumptions underlying regression analysis are:


 The sample must be representative of the population for which inference or prediction is intended.
 The error is assumed to be a random variable with a mean of zero conditional
on the explanatory variables.
 The variables are error-free. If this is not so, modelling may be done using
errors-in-variables model techniques.
 The predictors must be linearly independent, i.e. it must not be possible to
express any predictor as a linear combination of the others. Otherwise, the
problem of multicollinearity occurs.
 The errors are uncorrelated, that is, the variance-covariance matrix of the errors
is diagonal and each non-zero element is the variance of the error.
 The variance of the error is constant across observations (namely,
homoscedasticity). If not, weighted least squares or other procedures might be
used.

These points represent sufficient (but not all necessary) conditions for the least-squares estimator to possess desirable properties; in particular, these assumptions imply that the parameter estimates will be unbiased, consistent, and efficient in the class of linear unbiased estimators. Many of these assumptions may be relaxed in more advanced treatments.

Box 2

In brief, the assumptions of the regression model are:


 The relationship between X and Y is linear.
 The expected value of the error term is zero
 The variance of the error term is constant for all the values of the independent
variable, X. This is the assumption of homoscedasticity.
 There is no autocorrelation.
 The independent variable is uncorrelated with the error term.
 The error term is normally distributed.

1.6. Details on regression technique

Regression models involve the following variables:


 The unknown parameters denoted as β; this may be a scalar or a vector of
length k.
 The independent variables, X.
 The dependent variable, Y.

A regression model relates Y to a function of X and β.

The approximation can be formalized as E(Y|X) = f(X, β). To carry out regression
analysis, the form of the function f must be specified. Sometimes the form of this
function is based on knowledge about the relationship between Y and X that does not
rely on the data. If no such knowledge is available, a flexible or convenient form for f is
chosen.

1.6.1. Assumptions

Assume now that the vector of unknown parameters β is of length k. In order to
perform a regression analysis, information about the dependent variable Y should be
provided:
 If N data points of the form (Y, X) are observed, where N < k, most classical approaches to regression analysis cannot be performed: since the system of equations defining the regression model is underdetermined, there is not enough data to estimate β.
 If exactly N = k data points are observed, and the function f is linear, the
equations Y = f(X, β) can be solved exactly rather than approximately. This
reduces to solving a set of N equations with N unknowns (the elements of β),
which has a unique solution as long as the X are linearly independent. If f is
nonlinear, a solution may not exist, or many solutions may exist.
 The most common situation is where N > k data points are observed. In this
case, there is enough information in the data to estimate a unique value for β
that best fits the data in some sense, and the regression model when applied to
the data can be viewed as an over-determined system in β.

In the last case, the regression analysis provides the tools for:

1. Finding a solution for unknown parameters β that will, for example, minimize
the distance between the measured and predicted values of the dependent
variable Y (also known as method of least squares).
2. Under certain statistical assumptions, the regression analysis uses the surplus
of information to provide statistical information about the unknown
parameters β and predicted values of the dependent variable Y.

Regression needs independent measurements to be performed correctly. In the case of general linear regression, the above statement is equivalent to the requirement that the matrix XᵀX is regular (i.e. invertible).
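
As a small illustrative sketch (added here, with invented data), this condition can be checked numerically before solving the least-squares problem:

import numpy as np

# N = 5 observations, k = 2 parameters (intercept and slope): an over-determined system
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

XtX = X.T @ X
# X^T X is regular (invertible) when its rank equals the number of parameters k
print(np.linalg.matrix_rank(XtX) == X.shape[1])   # True: the least-squares problem is well defined

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution for beta
print(beta_hat)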

1.7. Model specification

In linear regression, the model specification is that the dependent variable yi is a linear combination of the parameters (but need not be linear in the independent variables). At this point, we can define simple and multiple linear regression. In simple linear regression for modelling n data points there is one independent variable, xi, and two parameters, β0 and β1, giving a straight line (Figure 5):

yi = β0 + β1xi + εi,   i = 1, …, n.

In multiple linear regression, there are several independent variables or functions of independent variables. For example, adding a term in xi² to the preceding regression gives a parabola:

yi = β0 + β1xi + β2xi² + εi,   i = 1, …, n.

This is still a functional form compatible with linear regression; although the
expression on the right hand side is quadratic in the independent variable xi, it is
linear in the parameters β0, β1 and β2.

Figure 5. Illustration of linear regression on three different data sets (blue, green, violet; regression lines refer to the green and blue point datasets). [Scatterplot of ISD against log(Ya/SAU), with points grouped as Nord, Centro, Sud.]

In both cases, εi is an error term and the subscript i indexes a particular observation. Given a random sample from the population, we estimate the population parameters and obtain the sample linear regression model:

ŷi = b0 + b1xi,

where b0 and b1 denote the estimates of β0 and β1. The term ei is the residual, ei = yi − ŷi. As discussed earlier, one method of estimation is ordinary least squares. This method obtains parameter estimates that minimize the sum of squared residuals (SSE):

SSE = Σi ei² = Σi (yi − b0 − b1xi)².

Minimization of this function results in a set of normal equations, a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators b0 and b1.

1.8. Parameter estimates

In the case of simple regression, the formulas for the least squares estimates are:

b1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²   and   b0 = ȳ − b1x̄,

where x̄ is the mean (average) of the x values and ȳ is the mean of the y values. In linear least squares (straight-line fitting) the derivation of these formulas is easy, and several numerical examples are possible. Under the assumption that the population error term has a constant variance, the estimate of that variance is given by:

s² = SSE / (n − 2).

Its square root, s, is called the root mean square error (RMSE) of the regression. The standard errors of the parameter estimates are given by the following formulas:

SE(b1) = s / sqrt(Σi (xi − x̄)²)   and   SE(b0) = s · sqrt(1/n + x̄² / Σi (xi − x̄)²).
Under the further assumption that the population error term is normally
distributed, the researcher can use these estimated standard errors to create confidence
intervals and conduct hypothesis tests about the population parameters.
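
These formulas can be verified with a short numerical sketch (added here for illustration), using the first rows of Table 1:

import numpy as np

x = np.array([639.2, 591.4, 597.2, 893.4, 670.0, 782.4])   # precipitation (mm), from Table 1
y = np.array([0.93, 0.79, 0.76, 1.02, 0.81, 0.86])          # aridity index, from Table 1
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope estimate
b0 = y_bar - b1 * x_bar                                            # intercept estimate

residuals = y - (b0 + b1 * x)
sse = np.sum(residuals ** 2)
s2 = sse / (n - 2)            # estimated error variance
rmse = np.sqrt(s2)            # root mean square error of the regression

se_b1 = rmse / np.sqrt(np.sum((x - x_bar) ** 2))
se_b0 = rmse * np.sqrt(1.0 / n + x_bar ** 2 / np.sum((x - x_bar) ** 2))

# Approximate 95% confidence interval for the slope (a t quantile would be exact for small n)
print(b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)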

1.9. Generalizing the linear model

In the more general multiple regression model, there are p independent variables:

yi = β1xi1 + β2xi2 + … + βpxip + εi.

The least squares parameter estimates are obtained from p normal equations. The residual can be written as:

ei = yi − b1xi1 − … − bpxip,

where b1, …, bp are the estimates of β1, …, βp. The normal equations are:

Σi Σh xijxih bh = Σi xijyi,   j = 1, …, p.

Note that in the normal equations depicted above there is no β0: an intercept is included simply by setting xi1 = 1 for all observations, so that b1 plays the role of the intercept estimate. Thus, in what follows, X denotes the n × p matrix of regressors, Y the n × k matrix of responses, and B the p × k matrix of coefficients.
In matrix notation, the normal equations for k responses (usually k = 1) are written as:

(XᵀX) B = Xᵀ Y,

with generalized inverse (⁻) solution (subscripts showing matrix dimensions):

B(p×k) = (XᵀX)⁻(p×p) Xᵀ(p×n) Y(n×k).
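
A minimal numerical sketch of the matrix solution (added here; the data are simulated, and the Moore-Penrose pseudoinverse is used as the generalized inverse):

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Normal equations: (X^T X) B = X^T y, solved with a generalized inverse
beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y
print(beta_hat)   # close to beta_true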

1.10. Regression diagnostics

Once a regression model has been constructed, it may be important to verify the goodness of fit of the statistical model and the significance of the estimated parameters, expressed in terms of probability levels. Commonly used checks of goodness of fit are (i) the R-squared (adjusted, if necessary), (ii) descriptive analyses of the pattern of residuals, and (iii) simplified (or more complex) hypothesis testing. Statistical significance can be checked by an F-test of the overall fit, followed by t-tests of individual parameters.
The interpretation of these diagnostic statistics rests heavily on the model
assumptions. Although examination of the residuals can be used to invalidate a
model, the results of a Student t-test or Fisher-Snedecor F-test are sometimes more
difficult to interpret when the model's assumptions are violated.
For example, if the error term does not have a normal distribution, in small samples the estimated parameters will not follow normal distributions, complicating inference. With relatively large samples, however, a central limit theorem can be invoked such that hypothesis testing may proceed using asymptotic approximations.
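
As a minimal sketch (assuming the statsmodels package and simulated data), these diagnostics can be read off a fitted OLS model as follows:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(scale=0.2, size=100)

X = sm.add_constant(x)              # adds the intercept column
fit = sm.OLS(y, X).fit()

print(fit.rsquared, fit.rsquared_adj)   # goodness of fit (R-squared, adjusted R-squared)
print(fit.fvalue, fit.f_pvalue)         # F-test of the overall fit
print(fit.tvalues, fit.pvalues)         # t-tests of the individual parameters
print(fit.resid[:5])                    # residuals, for descriptive diagnostic plots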

1.11. Extensions of RA

When the model function is not linear in the parameters, the sum of squares
must be minimized by an iterative procedure. This introduces many complications
which are summarized in differences between linear and non-linear least squares.

The response variable may be non-continuous (e.g. "limited" to lie on some
subsets of the real line). Nonlinear models for binary dependent variables include the
probit and logit model. The multivariate probit model makes it possible to estimate
jointly the relationship between several binary dependent variables and some
independent variables. For categorical variables with more than two values there is
the multinomial logit. For ordinal variables with more than two values, there are the
ordered logit and ordered probit models.
As far as interpolation and extrapolation are concerned, regression models
predict a value of the y variable given known values of the x variables. Prediction
within the range of values is known as interpolation. Prediction outside the range of
the data is known as extrapolation, which is more risky.
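
A minimal sketch of binary logit and probit fits, assuming the statsmodels package and simulated data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=200)
p = 1.0 / (1.0 + np.exp(-(0.3 + 1.2 * x)))      # true event probabilities
y = rng.binomial(1, p)                           # binary (0/1) response

X = sm.add_constant(x)
logit_fit = sm.Logit(y, X).fit(disp=0)           # logistic (logit) link
probit_fit = sm.Probit(y, X).fit(disp=0)         # probit link
print(logit_fit.params, probit_fit.params)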

1.12. Other techniques

Although the parameters of RA are usually estimated using the method of least squares, other methods can be used (Figure 6), including:
 Bayesian linear regression.
 Least absolute deviations, which is more robust in the presence of outliers,
leading to quantile regression.

Figure 6. An unexplored nexus: what is the better model for the data in this scatterplot, linear or squared polynomial regression? [Scatterplot of D ESAI (y-axis) against log(Y/P) (x-axis).]

 Nonparametric regression, that requires a large number of observations and is
computationally intensive.
 Distance metric learning, which is learned by the search of a meaningful
distance metric in a given input space.
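
Among the alternatives listed above, least absolute deviations corresponds to quantile regression at the median. A minimal sketch, assuming the statsmodels package is available and using simulated data with one gross outlier:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=60)
y = 2.0 + 0.8 * x + rng.normal(scale=0.5, size=60)
y[0] += 25.0                                   # one gross outlier

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                   # pulled towards the outlier
lad_fit = sm.QuantReg(y, X).fit(q=0.5)         # median (least absolute deviations) fit
print(ols_fit.params, lad_fit.params)          # the median fit is less affected by the outlier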

Box 3
Regression techniques in brief

Aims and scope: regression models can be used for three main purposes:
- To describe or model a set of data with one dependent variable and one (or more)
independent variables.
- To predict or estimate the values of the dependent variable based on given value(s)
of the independent variable(s).
- To control or administer standards from a useable statistical relationship.

Rationale: Simple Linear Regression is the method for finding the "line of best fit" between the dependent variable, y, and the independent variable, x.
‘Simple’: because only one independent variable is considered.
‘Linear’: also means linear in the parameters, since each parameter appears only to the first power.
Linear in the independent variable: the independent variable itself only appears to the first power.
The Least Squares Regression Line is the line which minimizes the sum of the squared errors of the data points; it is an averaging line of the data.
Regression analysis tries to fit a model to one dependent variable based on one or
more independent variables.

Figure 7 shows several features:


- The actual data points (x,y) are the blue dots.
- The Least Squares Regression Line of the dependent (y) variable based on the
independent (x) variable is shown in black.
- The errors (residuals) are the vertical distances between the observed values of y and
the predictions of the "line of best fit" which are shown in red.
- The goal, in general, is to minimize the errors from the actual data to the regression
line. The least squares line minimizes the sum of the square of the errors.

Figure 7. A graphical description of regression residuals.

Box 4
Regression analysis and the ‘convergence’ debate

The convergence debate is an important topic in the economics literature and a continuing source of examples and exercises for statisticians, especially those expert in regression analysis. From the economic perspective, the
neoclassical growth model predicts ‘absolute’ convergence (Quah 1996) which applies
to poor economies that tend to grow faster than rich ones. This notion was generalised
through the concept of ‘conditional convergence’ (Sala-i-Martin 1996) which applies
when the growth rate of an economy is positively related to the distance between the
economy’s level of income and its own steady-state (Barro and Sala-i-Martin 2004).
The two concepts coincide if a group of countries or regions tends to converge to the
same steady-state (Mainardi, 2007). Thus, one way to test the convergence hypothesis
is to check whether systems with similar characteristics (e.g. systems which are likely
to converge to the same steady state) converge in an ‘absolute’ sense. The convergence
issue is a traditional research theme for growth economics, but it represents a
meaningful concept in political matters, too. As an example, convergence in a certain
social or environmental process may indicate that policies carried out at different
scales with the aim to reduce territorial disparities are effective (Neumayer 2001). An
example of an analysis with detailed results of a simple polynomial (squared) regression between the A variable (or its logarithm) and one predictor only (namely the B variable), estimated at three different spatial levels, is reported in Table 2.

Table 2. An example of a table with detailed results of a simple polynomial (squared) regression between (A) (or its logarithm) and one predictor only (namely the B variable), estimated at three different spatial levels (standard errors of the estimates are reported in brackets; stars indicate significance at p < 0.02 (*) and p < 0.001 (**)). The three coefficients indicated in the table are those from the equation (A) = a + b (B) + c (B²).

Coefficient (A) Log(A)
Spatial level 1 (n = 20)
a 8.30(3.60)* 1.99(0.86)*
b -3.22(1.43)* -9.12(4.35)*
c -5.32(2.26)* -0.10(0.04)*
Adj-R2 0.33* 0.30*
Spatial level 2 (n = 103)
a 5.11(0.89)** 1.19(0.19)**
b -2.00(0.36)** -5.76(1.03)**
c -3.22(0.56)** -0.05(0.01)**
Adj-R2 0.33** 0.31**
Spatial level 3 (n = 780)
a 2.58(0.45)** 0.64(0.10)**
b -0.98(0.18)** -2.86(0.52)**
c -1.65(0.28)** -0.02(0.00)**
Adj-R2 0.13** 0.12**

Figure 8. Non-linear relationship between A and the variation in A between 1990 and 2000, estimated at spatial level 2 (panel A) and level 3 (panel B). See also Table 2.

1.13. Some caveats on RA

The illustration of regression analysis given in this chapter affords some basis for optimism that such techniques can be helpful in many research topics, while also suggesting considerable grounds for caution in their use.
A crucial issue fuelling caution in the use of RA is the relationship between the statistical significance test and the burden of proof. Unfortunately, in most of the cases studied, there is no simple relationship between this burden of proof and the statistical significance test. At one extreme, we can imagine that the parameter estimate in the regression study is the only information we have about the studied phenomenon.
Very rarely, however, is the regression estimate the only information available,
and when the standard errors are high the estimate may be among the least reliable
information available. Further, regression analysis is subject to considerable
manipulation. It is not obvious precisely which variables should be included in a
model, or what proxies to use for included variables that cannot be measured
precisely.
There is considerable room for experimentation, and this experimentation can become the typical field of application for “data mining” (and sometimes multidimensional analysis), whereby an investigator tries various specifications until the desired result appears. It is clear that if the best result a researcher can present contains high standard errors and low statistical significance, it is often plausible to suppose that numerous even less impressive results remain hidden.
For these reasons, those who use regression analysis tend to report results that satisfy the conventional significance tests (often the five-percent significance level, or even more stringent levels, e.g. the one-percent or 0.1-percent significance level) and to suppose that less significant results are not terribly interesting. A vigorous cross-examination based on a number of matters, often suggested by the researcher himself, is necessary to validate the regression results.
Still more difficult issues arise when an exact parameter estimate is needed for
some purpose. The fact that the parameter is “statistically significant” simply means
that by conventional tests, one can reject the hypothesis that its true value is zero. But
there are surely many other hypotheses about the parameter value that cannot be
rejected, and indeed the likelihood that regression will produce a perfectly accurate
estimate of any parameter is negligible.
About the only guidance that can be given from a statistical standpoint is the
obvious—parameter estimates with proportionally low standard errors are less likely
to be wide of the mark than others.

1.14. Examples for analysis: from ‘exploratory data mining’ to advanced regression models

1.14.1. Regression and geographical matters


There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions of regression. Geographically weighted regression, for example, is one technique to deal with such data (Fotheringham et al., 2002). Also, variables may include values aggregated by areas.
As a matter of fact, with aggregated data (i.e. data measured over administrative, political, or economic spatial domains) the Modifiable Areal Unit Problem occurs and can cause extreme variation in regression parameters (Fotheringham and Wong, 1991). When analyzing data aggregated by political boundaries, postal codes, or census areas, results may differ substantially with a different choice of units.

1.14.2. Preliminary correlation analysis


A preliminary analysis can be carried out on the selected covariates in order to specify the regression models and avoid redundancy and collinearity among variables, which could bias model estimates. This analysis includes the following steps: (i) computation of a correlation matrix among predictors (using both the Pearson product-moment coefficient and the Spearman rank correlation coefficient), and (ii) preliminary stepwise OLS regressions of the dependent variable on the three classes of predictors, taken alone or pooled together. This step is particularly necessary when most of the variables considered in the study have never been introduced as covariates in linear (or non-linear) models studying similar processes. Based on this approach, some variables could be excluded from the final analysis due to their correlation with the other covariates.
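
A minimal sketch of the correlation screening step, assuming the pandas package and hypothetical variable names:

import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "precipitation": rng.normal(700, 120, size=30),   # hypothetical covariates
    "temperature": rng.normal(15, 2, size=30),
    "income": rng.normal(20000, 4000, size=30),
})

pearson = df.corr(method="pearson")     # Pearson product-moment correlation matrix
spearman = df.corr(method="spearman")   # Spearman rank correlation matrix
print(pearson.round(2))
print(spearman.round(2))

# Flag pairs of predictors whose absolute correlation exceeds a chosen threshold
threshold = 0.8
high = (pearson.abs() > threshold) & (pearson.abs() < 1.0)
print(high.any().any())   # True if at least one redundant pair should be reconsidered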

1.14.3. Modeling different functional forms


The relationship can then be tested by specifying different (reduced) forms, starting with the simplest one, which relates the change in (A) during the investigated period, as the dependent variable, to the change in B (or its logarithm) as the predictor. The vector Xi, which includes the additional covariates, is then added to the core model as a control. At the first stage, the following equation can be estimated:

Δ(A) = b0 + b1(B) + b2(B)² + b3(B)³ + e

where b0 is the intercept and b(●) are the coefficient terms. The vector Xi of additional regression variables is then incorporated into the selected form as follows:

Δ(A) = b0 + b1(B) + b2(B)² + b3(B)³ + bm(Xi) + e

Equations can be preliminarily estimated through simple OLS regression. Collinearity among variables is checked throughout by way of the variance inflation factor and the condition index. Outputs reporting the variables that enter each model with significant coefficients, together with their standard errors, are to be analyzed and commented upon.
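
A minimal sketch of this estimation and collinearity check, assuming the statsmodels package; B and the simulated values below are placeholders for the study variables:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
B = rng.uniform(1, 5, size=80)
dA = 0.5 - 0.8 * B + 0.1 * B**2 + rng.normal(scale=0.2, size=80)   # stands in for the change in (A)

X = sm.add_constant(np.column_stack([B, B**2, B**3]))   # b1(B) + b2(B)^2 + b3(B)^3 terms
fit = sm.OLS(dA, X).fit()
print(fit.params, fit.bse)       # coefficients and standard errors

# Collinearity check: variance inflation factor for each regressor (excluding the constant)
vif = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print(vif)   # powers of B are strongly collinear; centring B first would reduce the VIFs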

1.14.4. Modeling extensions of RA

Notably, OLS regression assumes spatial randomness, i.e. that any grouping of high or low values of the study variable in space is independent of location. If this assumption is not true, i.e. a spatial structure exists in the variable, as detected by the presence of spatial correlation, standard OLS estimates are inefficient (Rupasingha et al. 2004). The spatial variation of both the dependent variable and the main predictor can be explored through exploratory spatial data analysis techniques.
Central to the spatial framework is the choice of the matrix that describes the
interaction structure of the cross-section units, i.e. the definition of proximity. For each
spatial unit a relevant neighboring set must be defined consisting of those units that
potentially interact with it. Although in regional data analysis proximity is usually
defined in terms of contiguity, if the basic units are defined by administrative
boundaries this definition may not be appropriate, because partitions of the territory
based on administrative criteria may not coincide with those based on the study’s target criteria.

Figure 9. A curve describing the (A)-(B) relationship and the possible functional form.
An alternative approach can be used, i.e. a spatial weight matrix based on Euclidean distances between the gravitational centers of the spatial units. Potential interactions between locations are summarized by the matrix W = { wij }, where wij = 1 if districts i and j are within a fixed distance, d, of each other and 0 otherwise. We considered eight values of d ranging from 25 to 200 kilometres, with a step of 25 kilometres. By increasing d incrementally, it is possible to assess how far the links between spatial units extend, i.e. the range of spatial correlation (Anselin, 2001).
The assessment of global spatial autocorrelation could be carried out by
Moran’s I and Geary’s c statistics (Cliff and Ord, 1981). Along with the test statistics,
the standardized z-value for each statistic, the associated significance level, p1,
assuming the (asymptotic) distributions of I and c are normal, and an alternative
indicator of statistical significance (p2), are to be calculated.
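
As an illustrative sketch of both steps (the distance-based weight matrix and Moran's I), written directly with numpy on simulated coordinates; dedicated packages such as PySAL offer tested implementations and the associated significance tests:

import numpy as np

rng = np.random.default_rng(6)
n = 40
coords = rng.uniform(0, 200, size=(n, 2))          # gravitational centres (km), simulated
z = rng.normal(size=n)                              # study variable, simulated

d = 50.0                                            # distance threshold (km)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
W = ((dist > 0) & (dist <= d)).astype(float)        # w_ij = 1 within distance d, 0 otherwise

# Moran's I: (n / sum of weights) * sum_ij w_ij (z_i - z_bar)(z_j - z_bar) / sum_i (z_i - z_bar)^2
zc = z - z.mean()
I = (n / W.sum()) * (zc @ W @ zc) / np.sum(zc ** 2)
print(I)   # values near -1/(n-1) are consistent with spatial randomness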
Unfortunately, Moran’s I and Geary’s c tests provide only a general measure of
spatial correlation. To model spatial correlation in association with the explanatory
variables, two levels of variation should be considered: large-scale changes in the
mean due to spatial location, and small scale variation due to interactions with
neighbors. Two approaches could thus be developed in order to address these issues.
First, a spatial regression model is built up in the following form:

Zi = μi + εi

where Zi is the random process at location i, μi is the mean at the same site, which is a linear, squared or cubic model with (i) B alone (i.e. the restricted model) and (ii) all the additional covariates (i.e. the full model), ε ~ N(0, Σ), and Σ is the covariance matrix of the random variables at all locations. The small-scale variation is modeled by fitting two different covariance models to Σ, including conditional spatial autoregression (CAR) and moving average (MA) structures. The spatial weight matrix introduced in these models was chosen according to the results of Moran’s and Geary’s statistics.
The second approach arises from the consideration that socio-economic processes are usually not constant over space, exhibiting a certain amount of spatial non-stationarity. If the data-generating process is non-stationary over space, global
statistics (and model fitting) which summarize the major characteristics of a given
spatial data configuration (and the relationship among variables) might be locally
misleading due to a bias in the estimates. Different types of (local) techniques were
developed in order to deal with spatial non-stationarity, including the Geographically
Weighted Regression (GWR) proposed by Fotheringham et al. (2002).
The methodological framework underlying GWR is quite similar to that of
local linear regression models, as it uses a kernel function to calculate weights for the
estimation of local weighted regression models. Contrary to the standard regression
model, where the regression coefficients are location-invariant, the specification of a
basic GWR model for each location s = 1, …, n, is:
y(s) = X(s)b(s) + e(s)
where y(s) is the dependent variable at location s, X(s) is the row vector of explanatory
variables at location s, b(s) is the column vector of regression coefficients at location s,
and e (s) is the random error at location s.
Hence, regression parameters, estimated at each location by weighted least
squares, vary in space implying that each coefficient in the model is a function of s, a
point within the geographical space of the study area. As a result, GWR gives rise to a
distribution of local estimated parameters. The weighting scheme is expressed as a
kernel function that places more weight on the observations closer to the location s. In
this study we adopted one of the most commonly used specifications of the kernel function, which is the bi-square nearest neighbor function (Fotheringham et al. 2002).
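
A minimal sketch of the local estimation step at a single location s, assuming a fixed bandwidth h and simulated data; operational GWR software also selects the bandwidth (e.g. by cross-validation) and repeats the fit at every location:

import numpy as np

rng = np.random.default_rng(7)
n = 60
coords = rng.uniform(0, 100, size=(n, 2))            # locations of the observations
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

s = coords[0]                   # focal location
h = 30.0                        # bandwidth (fixed here; nearest-neighbour versions adapt it)
d = np.linalg.norm(coords - s, axis=1)
w = np.where(d < h, (1 - (d / h) ** 2) ** 2, 0.0)     # bi-square kernel weights

# Weighted least squares at location s: b(s) = (X^T W X)^(-1) X^T W y
Wmat = np.diag(w)
b_s = np.linalg.solve(X.T @ Wmat @ X, X.T @ Wmat @ y)
print(b_s)   # local coefficient estimates at s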

1.15. A Special issue: Binary Logistic Regression

Binary Logistic Regression (BLR) is useful when you want to model the event
probability for a categorical response variable with two outcomes. For example:
- a loan officer wants to know whether the next customer is likely to default;
- a doctor wants to accurately diagnose a possibly cancerous tumor;

- a catalog company wants to increase the proportion of mailings that result in sales.
Using the Binary Logistic Regression, the catalog company can send mailings
to the people who are most likely to respond, the doctor can determine whether the
tumor is more likely to be benign or malignant, and the loan officer can assess the risk
of extending credit to a particular customer.
Since the probability of an event must lie between 0 and 1, it is impractical to model probabilities with linear regression techniques, because the linear regression model allows the dependent variable to take values greater than 1 or less than 0. The logistic regression model is a type of generalized linear model that extends the linear regression model by linking the range of real numbers to the 0-1 range.
Start by considering the existence of an unobserved continuous variable, Z, which can be thought of as the "propensity towards" the event of interest. In the case of the loan officer, Z is the customer's propensity to default on a loan, with larger values of Z corresponding to greater probabilities of defaulting.
In BLR, and more generally in logistic regression models (LRM), the relationship between Z and the probability of the event of interest is described by a link function (see the box below).

Box 5
A classical link function in BLR.

The logistic link maps Z onto a probability between 0 and 1: Prob(event) = 1 / (1 + exp(−Z)).

Notably, the model assumes that Z is linearly related to the predictors, so:

zi = b0 + b1xi1 + b2xi2 + ... + bpxip

where:
xij is the jth predictor for the ith case
bj is the jth coefficient
p is the number of predictors

Of course, if Z values were observable, you would simply fit a linear regression to Z and be done. However, since Z is unobserved, the predictors must be related to the probability of interest by substituting for Z. The regression coefficients are then estimated through iterative procedures, such as the maximum likelihood method.
We can turn to one of the previous examples. If you are a loan officer at a bank, you want to be able to identify characteristics that are indicative of people who are likely to default on loans, and use those characteristics to identify good and bad credit risks. Suppose that information on 800 past and prospective customers is available in a spreadsheet, and that the first 600 cases are customers who were previously given loans.
One possible exercise would be to use a random sample of these customers to create a logistic regression model, setting the remaining customers aside in order to validate the analysis. The model can then be used to classify the remaining 200 prospective customers as good or bad credit risks.
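
A minimal sketch of this exercise, assuming the scikit-learn package and hypothetical predictors (the 800-customer spreadsheet described above is not reproduced; simulated values stand in for it):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n = 600                                             # customers with a known outcome
X = np.column_stack([rng.normal(35, 8, n),          # hypothetical predictor: age
                     rng.normal(40, 15, n)])        # hypothetical predictor: debt-to-income ratio
p = 1 / (1 + np.exp(-(-4 + 0.02 * X[:, 0] + 0.07 * X[:, 1])))
y = rng.binomial(1, p)                              # 1 = defaulted, 0 = repaid

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_valid, y_valid))                # classification accuracy on the held-out cases

# New (prospective) customers can then be scored with predict_proba
prospects = np.array([[30.0, 20.0], [45.0, 70.0]])
print(model.predict_proba(prospects)[:, 1])         # estimated probability of default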
After building a model, the next step is to determine whether it reasonably approximates the behaviour of your data. One typical test of model fit, implemented by several commercial software packages, is the Hosmer-Lemeshow goodness-of-fit statistic. Moreover, it is possible to produce some ‘residual plots’, which act as diagnostic plots. Two helpful plots are the change in deviance versus the predicted probabilities, and Cook's distances versus the predicted probabilities.
The change in deviance plot helps you to identify cases that are poorly fit by
the model. Larger changes in deviance indicate poorer fits. By identifying the cases
that are poorly fit by the model, you can focus on how those customers are different,
and hopefully discover another predictor that will improve the model.

There are usually several models that pass the diagnostic checks, so you need
tools to choose between them. They include: (i) automated variable selection, (ii)
Pseudo R-Squared Statistics, and (iii) classification and validation.
Automated variable selection addresses the problem that, when constructing a model, the researcher generally wants to include only predictors that contribute significantly to the model. Several methods for stepwise selection of the "best" predictors to include in the model can be applied within the Binary Logistic Regression procedure. From step to step, the improvement in classification indicates how well the model performs: a better model should correctly identify a higher percentage of the cases.
Moreover, since the r-squared statistic, which measures the variability in the dependent variable that is explained by a linear regression model, cannot be computed for logistic regression models, pseudo r-squared statistics are introduced. These statistics are designed to have properties similar to those of the true r-squared statistic. Finally, cross-tabulating observed response categories with predicted categories helps the researcher determine how well the model identifies defaulters.
Once the appropriate model has been selected, it is time to interpret the regression coefficients and the other statistics. Of course, the meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While the B coefficient is convenient for testing the usefulness of predictors, a transformation of it, Exp(B), is easier to interpret. Exp(B) represents the ratio-change in the odds of the event of interest for a one-unit change in the predictor.
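
As a small illustration (the coefficients below are hypothetical, not output from any study), the odds ratios are obtained by exponentiating the fitted B coefficients:

import numpy as np

b = np.array([0.02, 0.07])          # hypothetical fitted B coefficients
odds_ratios = np.exp(b)             # Exp(B): ratio-change in the odds per one-unit increase
print(odds_ratios)                  # e.g. about 1.07 means the odds rise by roughly 7% per unit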
In summary, the BLR procedure is a very useful tool for predicting the value of a categorical response variable with two possible outcomes. However, several apparently similar techniques are available for most analysis cases. If there are more than two possible outcomes and they do not have an inherent ordering, Multinomial Logistic Regression (MLR) is the correct procedure; MLR can also be used as an alternative when there are only two outcomes.

References

Anselin, L., 2001. Spatial effects in econometric practice in environmental and resource economics. American Journal of Agricultural Economics, 83(3), 705-710.
Arbia, G. and Paelinck, J.H.P., 2003. Economic convergence or divergence? Modelling the interregional dynamics of EU regions 1985-1999. Geographical Systems, 5, 1-24.
Barro, R.J. and Sala-i-Martin, X., 2004. Economic growth. Cambridge, Massachusetts: MIT Press.
Cliff, A. and Ord, J.K., 1981. Spatial processes, models and applications. London: Pion.
Ezcurra, R., 2007. Is there cross-country convergence in carbon dioxide emissions? Energy Policy, 35, 1363-1372.
Ezcurra, R., Pascual, P. and Rapun, M., 2007. The spatial distribution of income inequality in the European Union. Environment and Planning A, 39, 869-890.
Fotheringham, A.S., Brunsdon, C. and Charlton, M., 2002. Geographically weighted regression: the analysis of spatially varying relationships. Chichester: Wiley.
Hosmer, D.W. and Lemeshow, S., 2000. Applied logistic regression. New York: John Wiley & Sons.
Kleinbaum, D.G., 1994. Logistic regression: a self-learning text. New York: Springer-Verlag.
Maddison, D., 2006. Environmental Kuznets curves: a spatial econometric approach. Journal of Environmental Economics and Management, 51, 218-230.
Mainardi, S., 2007. Resource exploitation and cross-region growth trajectories: nonparametric estimates for Chile. Journal of Environmental Management, 85, 27-43.
Neumayer, E., 2001. Improvement without convergence: pressure on the environment in European Union countries. Journal of Common Market Studies, 39, 927-937.
Norusis, M., 2004. SPSS 13.0 statistical procedures companion. Upper Saddle River, N.J.: Prentice Hall.
Patacchini, E., 2008. Local analysis of economic disparities in Italy: a spatial statistics approach. Statistical Methods and Applications, 17, 85-112.
Quah, D.T., 1996. Empirics for economic growth and convergence. European Economic Review, 40, 1353-1375.
Sala-i-Martin, X., 1996. Regional cohesion: evidence and theories of regional growth and convergence. European Economic Review, 40, 1325-1352.
Sala-i-Martin, X., 2002. 15 years of new growth economics: what have we learnt? Columbia University, Department of Economics, Discussion Paper Series, No. 0102-47.

