Regression Analysis
[Figure: scatterplot of drought intensity versus drought duration]
1. Introduction
It is also of interest, in RA, to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.
[Figure: two panels plotting percentage values against per-capita disposable income (euro)]
1.2. Scope of RA
[Figure: time series of AW (mm), 1980-2000, for Messina (Sicilia) and Montagna del Matese (Campania)]
A large body of techniques for carrying out RA has been developed in recent decades. Traditional methods, such as linear regression (LR) and ordinary least squares (OLS) regression, are parametric: the regression function used in the computation procedure is defined in terms of a finite number of unknown parameters that are estimated directly from the data. By contrast, nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.
In practice, the performance of a given RA method depends on the form of the data-generating process and on how this process relates to the regression approach being used in the study. Since the true form of the data-generating process is generally not known, RA depends (at least to some extent) on making assumptions about this process. These assumptions are sometimes (but not always) testable via statistical inference if a sufficiently large amount of data is available.
Notably, regression models used for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally. However, as observed earlier, when a regression model is used for inference, especially where small effects or questions of causality are addressed with observational data, regression techniques should be applied cautiously, as they can easily give misleading results.
Box 1
Back to the past
The earliest form of regression was the method of least squares (originally
known in French as méthode des moindres carrés), which was published by Legendre in
1805 and by Gauss in 1809. Legendre and Gauss both applied the method to the
problem of determining, from astronomical observations, the orbits of bodies about
the Sun. Gauss published a further development of the theory of least squares in 1821,
including a version of the Gauss–Markov theorem.
The term ‘regression’, however, was coined by F. Galton in the nineteenth
century to describe a biological phenomenon. The phenomenon was that the heights of
descendants of tall ancestors tend to regress down towards a normal average (a
phenomenon also known as regression toward the mean). This work was later
extended by U. Yule and K. Pearson to a more general context. In the work of Yule and
Pearson, the joint distribution of the response and explanatory variables is assumed to
be Gaussian.
This assumption was weakened by R.A. Fisher in his works of 1922 and 1925.
Fisher assumed that the conditional distribution of the response variable is Gaussian,
but the joint distribution need not be. In this respect, Fisher's assumption is closer to
Gauss's formulation of 1821.
Regression remains an area of active research, and recently developed methods accommodate various types of missing data, nonparametric regression, Bayesian methods for regression, regression in which the predictor variables are measured with error, regression with more predictor variables than observations, and causal inference with regression.
These points represent sufficient (but not necessary) conditions for the least-squares estimator to possess desirable properties; in particular, these assumptions imply that the parameter estimates will be unbiased, consistent, and efficient in the class of linear unbiased estimators. Many of these assumptions may be relaxed in more advanced treatments.
Box 2
The approximation can be formalized as E(Y|X) = f(X, β). To carry out regression
analysis, the form of the function f must be specified. Sometimes the form of this
function is based on knowledge about the relationship between Y and X that does not
rely on the data. If no such knowledge is available, a flexible or convenient form for f is
chosen.
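As a small illustration (an addition to the text; the form of f and all values below are invented), data can be simulated around a chosen regression function and a convenient quadratic form fitted when the true f is treated as unknown:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 80)
f_true = 1.0 + 0.5 * x - 0.1 * x**2            # assumed "true" regression function E(Y|X)
y = f_true + rng.normal(0.0, 0.3, x.size)      # variation of Y around the regression function

coeffs = np.polyfit(x, y, deg=2)               # a convenient quadratic form chosen for f
print("fitted coefficients (highest degree first):", coeffs)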
1.6.1. Assumptions
Assume now that the vector of unknown parameters β is of length k. In order to
perform a regression analysis, information about the dependent variable Y should be
provided:
If N data points of the form (Y, X) are observed, where N < k, most classical
approaches to regression analysis cannot be performed: since the system of
equations defining the regression model is underdetermined, there is not enough
data to estimate β.
If exactly N = k data points are observed, and the function f is linear, the
equations Y = f(X, β) can be solved exactly rather than approximately. This
reduces to solving a set of N equations with N unknowns (the elements of β),
which has a unique solution as long as the X are linearly independent. If f is
nonlinear, a solution may not exist, or many solutions may exist.
The most common situation is where N > k data points are observed. In this
case, there is enough information in the data to estimate a unique value for β
that best fits the data in some sense, and the regression model when applied to
the data can be viewed as an over-determined system in β.
In the last case, the regression analysis provides the tools for:
1. Finding a solution for the unknown parameters β that will, for example, minimize
the distance between the measured and predicted values of the dependent
variable Y (also known as the method of least squares).
2. Under certain statistical assumptions, the regression analysis uses the surplus
of information to provide statistical information about the unknown
parameters β and predicted values of the dependent variable Y.
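To make the three cases concrete, the following sketch (not part of the original text; all data values are invented) solves an exactly determined system with N = k and an over-determined one with N > k in the least-squares sense:

import numpy as np

# Exact case: N = k = 2 observations, model y = b0 + b1 * x
X_exact = np.array([[1.0, 1.0],
                    [1.0, 3.0]])                 # columns: intercept, x
y_exact = np.array([2.0, 6.0])
beta_exact = np.linalg.solve(X_exact, y_exact)   # unique solution if X is non-singular

# Over-determined case: N = 5 > k = 2, solved in the least-squares sense
X_over = np.column_stack([np.ones(5), np.array([0.0, 1.0, 2.0, 3.0, 4.0])])
y_over = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
beta_ls, residuals, rank, _ = np.linalg.lstsq(X_over, y_over, rcond=None)

print("exact solution:", beta_exact)
print("least-squares solution:", beta_ls)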
variables). At this point, we can define simple and multiple linear regression. In simple linear regression for modelling n data points there is one independent variable, xi, and two parameters, β0 and β1, giving a straight line (Figure 5):

yi = β0 + β1 xi + ei,  i = 1, …, n.

In multiple linear regression there are several independent variables or functions of independent variables. For example, adding a term in xi² to the preceding regression gives a parabola:

yi = β0 + β1 xi + β2 xi² + ei,  i = 1, …, n.

This is still a functional form compatible with linear regression; although the expression on the right-hand side is quadratic in the independent variable xi, it is linear in the parameters β0, β1 and β2.
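As a brief illustration (an addition to the text, with invented data), the quadratic model above can be fitted by ordinary linear least squares once the design matrix with columns 1, x and x² is formed:

import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.2, 1.9, 3.1, 4.8, 7.2, 10.1, 13.4])

X = np.column_stack([np.ones_like(x), x, x**2])   # columns for beta0, beta1, beta2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("beta0, beta1, beta2 =", beta)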
Figure 5. Illustration of linear regression on three different data sets (blue, green, violet:
regression lines refer to green and blue point datasets).
[Figure 5 plot: ISD versus log(Ya/SAU), with point sets for Nord, Centro and Sud]
The term ei is the residual: ei = yi − ŷi, the difference between the observed value of the dependent variable and the value predicted by the model. As discussed earlier, one method of estimation is ordinary least squares. This method obtains parameter estimates that minimize the sum of squared residuals (SSE):

SSE = Σ ei²

Minimization of this function results in a set of normal equations, simultaneous linear equations in the parameters, which are solved to yield the parameter estimators, b0 and b1 (the least squares estimates of β0 and β1).

In the case of simple regression, the formulas for the least squares estimates are:

b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²  and  b0 = ȳ − b1 x̄

where x̄ is the mean (average) of the x values and ȳ is the mean of the y values. In linear least squares (straight-line fitting) the derivation of these formulas is easy and several numerical examples are possible. Under the assumption that the population error term has a constant variance, the estimate of that variance is given by:

s² = SSE / (n − 2)

This is called the mean square error (MSE) of the regression; its square root, s, is the root mean square error (RMSE) of the regression. The standard errors of the parameter estimates are given by the following formulas:

se(b0) = s √(1/n + x̄² / Σ (xi − x̄)²)  and  se(b1) = s / √(Σ (xi − x̄)²)
Under the further assumption that the population error term is normally
distributed, the researcher can use these estimated standard errors to create confidence
intervals and conduct hypothesis tests about the population parameters.
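The following sketch (an illustration added here, with invented data) applies the closed-form formulas above and builds t-based 95% confidence intervals for the two parameters:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])
n = x.size

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / sxx     # slope estimate
b0 = y_bar - b1 * x_bar                          # intercept estimate

resid = y - (b0 + b1 * x)
mse = np.sum(resid ** 2) / (n - 2)               # estimate of the error variance
se_b1 = np.sqrt(mse / sxx)
se_b0 = np.sqrt(mse * (1.0 / n + x_bar ** 2 / sxx))

t_crit = stats.t.ppf(0.975, df=n - 2)            # two-sided 95% critical value
print("slope:", b1, "+/-", t_crit * se_b1)
print("intercept:", b0, "+/-", t_crit * se_b0)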
The least squares parameter estimates are obtained from p normal equations. The residual can be written as:

ei = yi − b1 xi1 − … − bp xip

In matrix notation, the normal equations for k responses (usually k = 1) are written as:

(X′X) B = X′Y

whose solution, when X′X is invertible, is B = (X′X)⁻¹ X′Y.
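A minimal sketch (with simulated data, added for illustration) showing that solving the normal equations directly agrees with a standard least-squares routine:

import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal_eq, beta_lstsq))       # the two routes agree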
1.11. Extensions of RA
When the model function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure. This introduces many complications, which are summarized in the differences between linear and non-linear least squares.
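As an illustration (not from the original text; the model and data are invented), such an iterative non-linear least-squares fit can be carried out with a general-purpose routine such as scipy's curve_fit:

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(-b * x)            # non-linear in the parameter b

rng = np.random.default_rng(2)
x = np.linspace(0.0, 4.0, 40)
y = model(x, 3.0, 0.8) + rng.normal(0.0, 0.1, x.size)

params, cov = curve_fit(model, x, y, p0=(1.0, 0.5))   # p0: starting values for the iterations
print("estimates:", params)
print("standard errors:", np.sqrt(np.diag(cov)))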
The response variable may be non-continuous (e.g. "limited" to lie on some subset of the real line). Nonlinear models for binary dependent variables include the probit and logit models. The multivariate probit model makes it possible to estimate jointly the relationship between several binary dependent variables and some independent variables. For categorical variables with more than two values there is the multinomial logit model. For ordinal variables with more than two values, there are the ordered logit and ordered probit models.
As far as interpolation and extrapolation are concerned, regression models predict a value of the y variable given known values of the x variables. Prediction within the range of the observed data is known as interpolation. Prediction outside this range is known as extrapolation, which is riskier.
Although the parameters of RA are usually estimated using the method of least squares, other methods can be used (Figure 6), including:
Bayesian linear regression.
Least absolute deviations, which is more robust in the presence of outliers, leading to quantile regression (a brief sketch follows this list).
Figure 6. An unexplored nexus: which model better describes the data in this scatterplot, linear or quadratic (squared polynomial) regression?
[Figure 6 plot: D ESAI versus log(Y/P)]
Nonparametric regression, which requires a large number of observations and is computationally intensive.
Distance metric learning, in which a meaningful distance metric is learned by searching a given input space.
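As announced above, here is a sketch of least absolute deviations (an illustrative addition with invented data): the absolute-error criterion is minimized numerically and compared with OLS in the presence of a single gross outlier.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 0.7 * x + rng.normal(0.0, 0.3, x.size)
y[5] += 15.0                                   # a single gross outlier

def sum_abs_resid(beta):
    return np.sum(np.abs(y - (beta[0] + beta[1] * x)))

lad = minimize(sum_abs_resid, x0=[0.0, 0.0], method="Nelder-Mead")
ols = np.polyfit(x, y, deg=1)                  # ordinary least squares for comparison
print("LAD slope:", lad.x[1], "OLS slope:", ols[0])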
Box 3
In brief regression techniques
Aims and scope: regression models can be used for three main purposes:
- To describe or model a set of data with one dependent variable and one (or more)
independent variables.
- To predict or estimate the values of the dependent variable based on given value(s)
of the independent variable(s).
- To control or administer standards based on a usable statistical relationship.
Rationale: Simple Linear Regression is the method for finding the "line of best fit"
between the dependent variable, y, and the independent variable, x.
'Simple': because only one independent variable is considered.
Linear: also means linear in the parameters, since every parameter appears only to the first power.
Linear in the Independent Variable: the independent variable only appears to the first power.
The Least Squares Regression Line is the line which minimizes the sum of the squared errors of the data points; it is an averaging line of the data.
Regression analysis tries to fit a model to one dependent variable based on one or
more independent variables.
Figure 7. A graphical description of regression residuals.
Box 4
The ‘convergence’ regression analysis
Table 2. An example of a table with detailed results of a simple polynomial (squared) regression between (A) (or its logarithm) and one predictor only (the B variable), estimated at three different spatial levels (standard errors of the estimates are reported in brackets; stars indicate significance at p < 0.02 (*) and p < 0.001 (**)). The three coefficients indicated in the table are those from the equation (A) = a + b(B) + c(B)².
Figure 8. Non-linear relationship between A and the variation in A between 1990 and 2000, estimated at spatial levels 2 (panel A) and 3 (panel B). See also Table 2.
significance level, and to suppose that less significant results are not terribly interesting. A vigorous cross-examination based on a number of issues, often suggested by the researcher, is necessary to validate the regression results.
Still more difficult issues arise when an exact parameter estimate is needed for
some purpose. The fact that the parameter is “statistically significant” simply means
that by conventional tests, one can reject the hypothesis that its true value is zero. But
there are surely many other hypotheses about the parameter value that cannot be
rejected, and indeed the likelihood that regression will produce a perfectly accurate
estimate of any parameter is negligible.
About the only guidance that can be given from a statistical standpoint is the obvious: parameter estimates with proportionally low standard errors are less likely to be wide of the mark than others.
1.14. Examples for analysis: from ‘exploratory data mining’ to advanced regression models
considered in the study are never introduced as covariates in linear (or non-linear)
models studying similar processes. Based on this approach, some variables could be
excluded from the final analysis due to their correlation with the other covariates.
Notably, OLS regression assumes spatial randomness, meaning that any grouping of high or low values of the study variable in space is assumed to arise independently. If this assumption does not hold, i.e. a spatial structure exists in the variable, as detected by the presence of spatial correlation, standard OLS estimates are inefficient (Rupasingha et al. 2004). The spatial variation of both the dependent variable and the main predictor can be explored through exploratory spatial data analysis techniques.
Central to the spatial framework is the choice of the matrix that describes the
interaction structure of the cross-section units, i.e. the definition of proximity. For each
spatial unit a relevant neighboring set must be defined consisting of those units that
potentially interact with it. Although in regional data analysis proximity is usually
defined in terms of contiguity, if the basic units are defined by administrative
boundaries this definition may not be appropriate, because partitions of the territory
based on administrative criteria may not coincide with the one based on the study’s
target criteria.
Figure 9. A curve describing the (A)-(B) relationship and its possible functional form.
fitting two different covariance models, including conditional spatial autoregression (CAR) and moving average (MA) structures. The spatial weight matrix introduced in these models was chosen according to the results of Moran's I and Geary's c statistics.
The second approach arises from the consideration that socio-economic processes are usually not constant over space, showing a certain amount of spatial non-stationarity. If the data-generating process is non-stationary over space, global statistics (and model fits) that summarize the major characteristics of a given spatial data configuration (and the relationships among variables) might be locally misleading due to a bias in the estimates. Different types of (local) techniques were developed in order to deal with spatial non-stationarity, including the Geographically Weighted Regression (GWR) proposed by Fotheringham et al. (2002).
The methodological framework underlying GWR is quite similar to that of
local linear regression models, as it uses a kernel function to calculate weights for the
estimation of local weighted regression models. In contrast to the standard regression model, where the regression coefficients are location-invariant, the specification of a basic GWR model for each location s = 1, …, n, is:
y(s) = X(s)b(s) + e(s)
where y(s) is the dependent variable at location s, X(s) is the row vector of explanatory
variables at location s, b(s) is the column vector of regression coefficients at location s,
and e (s) is the random error at location s.
Hence, regression parameters, estimated at each location by weighted least
squares, vary in space implying that each coefficient in the model is a function of s, a
point within the geographical space of the study area. As a result, GWR gives rise to a
distribution of local estimated parameters. The weighting scheme is expressed as a
kernel function that places more weight on the observations closer to the location s. In
this study we adopted one of the most commonly used specifications of the kernel
function, which is the bi-square nearest neighbor function (Fotheringham et al. 2002).
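The estimation step just described can be sketched as follows (an illustrative addition with simulated data; the adaptive bi-square bandwidth and the number of neighbors are assumptions, not values from the text):

import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 2
coords = rng.uniform(0.0, 100.0, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = 1.0 + 0.02 * coords[:, 0] * X[:, 1] + rng.normal(0.0, 0.2, n)   # spatially varying slope

n_neighbors = 15
local_betas = np.empty((n, k))
for s in range(n):
    d = np.linalg.norm(coords - coords[s], axis=1)
    h = np.sort(d)[n_neighbors]                            # adaptive bandwidth
    w = np.where(d < h, (1.0 - (d / h) ** 2) ** 2, 0.0)    # bi-square kernel weights
    XtW = X.T * w                                          # X' diag(w)
    local_betas[s] = np.linalg.solve(XtW @ X, XtW @ y)     # weighted least squares at location s

print("range of local slope estimates:",
      local_betas[:, 1].min(), local_betas[:, 1].max())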
Binary Logistic Regression (BLR) is useful when you want to model the event
probability for a categorical response variable with two outcomes. For example:
- a loan officer wants to know whether the next customer is likely to default;
- a doctor wants to accurately diagnose a possibly cancerous tumor;
- a catalog company wants to increase the proportion of mailings that result in sales.
Using the Binary Logistic Regression, the catalog company can send mailings
to the people who are most likely to respond, the doctor can determine whether the
tumor is more likely to be benign or malignant, and the loan officer can assess the risk
of extending credit to a particular customer.
Since the probability of an event must lie between 0 and 1, it is impractical to model probabilities with linear regression techniques, because the linear regression model allows the dependent variable to take values greater than 1 or less than 0. The logistic regression model is a type of generalized linear model that extends the linear regression model by linking the range of real numbers to the 0-1 range.
Start by considering the existence of an unobserved continuous variable, Z,
which can be thought of as the "propensity towards" the event of interest. In the case of the loan officer, Z is the customer's propensity to default on a loan, with larger values of Z corresponding to greater probabilities of defaulting.
In the BLR, and more in general in the logistic regression models (LRM), the
relationship between Z and the probability of the event of interest is described by a
link function (see the box below).
Box 5
A classical link function in BLR is the logistic (logit) link, which maps the unobserved propensity Z onto a probability: Prob(event) = 1 / (1 + e^(−z)), or equivalently z = ln[p / (1 − p)].
Notably, the model assumes that Z is linearly related to the predictors, so:
zi = b0 + b1xi1 + b2xi2 + ... + bpxip
where:
xij is the jth predictor for the ith case
bj is the jth coefficient
p is the number of predictors
Of course, if the Z values were observable, you would simply fit a linear regression to Z and be done. However, since Z is unobserved, the predictors must be related to the probability of interest by substituting the linear predictor for Z in the link function. The regression coefficients are estimated through iterative procedures, such as the maximum likelihood method.
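As an illustration of maximum-likelihood estimation for this model (an addition with simulated data; the "default" outcome and predictors are invented), the negative log-likelihood can be minimized numerically:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n = 600
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
b_true = np.array([-1.0, 0.8, -0.5])
p_true = 1.0 / (1.0 + np.exp(-(X @ b_true)))
y = rng.binomial(1, p_true)                                  # 1 = default, 0 = no default

def neg_log_lik(b):
    z = X @ b
    # log-likelihood of the logit model: sum of y*z - log(1 + exp(z))
    return -np.sum(y * z - np.logaddexp(0.0, z))

fit = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS")
print("estimated coefficients:", fit.x)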
We can turn to one of the previous examples. If you are a loan officer at a bank,
then you want to be able to identify characteristics that are indicative of people who
are likely to default on loans, and use those characteristics to identify good and bad
credit risks. Suppose that information on 800 past and prospective customers is available, contained in a spreadsheet, and note that the first 600 cases are customers who were previously given loans.
One possible exercise would be to use a random sample of these customers to create a logistic regression model, setting the remaining customers aside in order to validate the analysis. The model can then be used to classify the remaining 200 prospective customers as good or bad credit risks.
After building a model, the next step is to determine whether it reasonably approximates the behaviour of your data. One typical test of model fit, implemented in several commercial software packages, is the Hosmer-Lemeshow goodness-of-fit statistic. Moreover, it is possible to develop 'residual plots', which serve as true diagnostic plots. Two helpful plots are the change in deviance versus predicted probabilities and Cook's distances versus predicted probabilities.
The change in deviance plot helps you to identify cases that are poorly fit by
the model. Larger changes in deviance indicate poorer fits. By identifying the cases
that are poorly fit by the model, you can focus on how those customers are different,
and hopefully discover another predictor that will improve the model.
There are usually several models that pass the diagnostic checks, so you need
tools to choose between them. They include: (i) automated variable selection, (ii)
Pseudo R-Squared Statistics, and (iii) classification and validation.
Automated variable selection addresses the problem that, when constructing a model, the researcher generally wants to include only predictors that contribute significantly to the model. Several methods for stepwise selection of the "best" predictors can be applied within the Binary Logistic Regression procedure. From step to step, the improvement in classification indicates how well the model performs: a better model correctly identifies a higher percentage of the cases.
Moreover, as the r-squared statistic, which measures the variability in the
dependent variable that is explained by a linear regression model, cannot be
computed for logistic regression models, the pseudo r-squared statistics are
introduced. These statistics are designed to have similar properties to the true r-
squared statistic. Finally, crosstabulating observed response categories with predicted categories helps the researcher determine how well the model identifies defaulters.
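For example (using hypothetical log-likelihood values, not results from the text), McFadden's pseudo r-squared, one common choice, compares the fitted log-likelihood with that of an intercept-only (null) model:

import numpy as np

# hypothetical log-likelihoods from a fitted model and a null (intercept-only) model
ll_model, ll_null = -310.4, -402.7
pseudo_r2 = 1.0 - ll_model / ll_null      # McFadden's pseudo r-squared
print(round(pseudo_r2, 3))                # about 0.23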
Once the appropriate model has been selected, it is time to interpret the regression coefficients and the other statistics. Of course, the meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While the B coefficient is convenient for testing the usefulness of predictors, a transformation of it, Exp(B), is easier to interpret. Exp(B) represents the ratio-change in the odds of the event of interest for a one-unit change in the predictor.
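For instance (with a hypothetical coefficient, not a value from the text), Exp(B) is obtained by exponentiating the estimated coefficient:

import numpy as np

b = -0.26          # hypothetical logit coefficient for one predictor
print(np.exp(b))   # about 0.77: each extra unit multiplies the odds of the event by ~0.77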
In synthesis, the BLR procedure is a very useful tool for predicting the value of
a categorical response variable with two possible outcomes. However, several apparently similar techniques are available for many analysis situations. If there are more
than two possible outcomes and they do not have an inherent ordering, the
Multinomial Logistic Regression (MLR) should be used as the correct procedure. MLR
can also be used as an alternative when there are only two outcomes.
References
Arbia, G. and Paelinck, J.H.P., 2003. Economic convergence or divergence? Modelling the interregional dynamics of EU regions 1985-1999. Geographical Systems, 5, 1-24.
Barro, R.J. and Sala-i-Martin, X., 2004. Economic growth. Cambridge, Massachusetts: MIT Press.
Cliff, A. and Ord, J.K., 1981. Spatial processes: models and applications. London: Pion.
Ezcurra, R., 2007. Is there cross-country convergence in carbon dioxide emissions? Energy Policy, 35, 1363-1372.
Ezcurra, R., Pascual, P. and Rapun, M., 2007. The spatial distribution of income inequality in the European Union. Environment and Planning A, 39, 869-890.
Fotheringham, A.S., Brunsdon, C. and Charlton, M., 2002. Geographically weighted regression: the analysis of spatially varying relationships. Chichester: Wiley.
Hosmer, D.W. and Lemeshow, S., 2000. Applied logistic regression. New York: John Wiley & Sons.
Kleinbaum, D.G., 1994. Logistic regression: a self-learning text. New York: Springer-Verlag.
Maddison, D., 2006. Environmental Kuznets curves: a spatial econometric approach. Journal of Environmental Economics and Management, 51, 218-230.
Mainardi, S., 2007. Resource exploitation and cross-region growth trajectories: nonparametric estimates for Chile. Journal of Environmental Management, 85, 27-43.
Neumayer, E., 2001. Improvement without convergence: pressure on the environment in European Union countries. Journal of Common Market Studies, 39, 927-937.
Norusis, M., 2004. SPSS 13.0 statistical procedures companion. Upper Saddle River, NJ: Prentice Hall.
Patacchini, E., 2008. Local analysis of economic disparities in Italy: a spatial statistics approach. Statistical Methods and Applications, 17, 85-112.
Quah, D.T., 1996. Empirics for economic growth and convergence. European Economic Review, 40, 1353-1375.
Sala-i-Martin, X., 1996. Regional cohesion: evidence and theories of regional growth and convergence. European Economic Review, 40, 1325-1352.
Sala-i-Martin, X., 2002. 15 years of new growth economics: what have we learnt? Columbia University, Department of Economics, Discussion Paper Series No. 0102-47.