
ENE 890

Environmental Data Analysis


Fall 2007
Lecture 3 – October 15
Multiple Linear Regression
• Extension of simple linear regression (SLR) to the case of multiple explanatory variables.
• The goal of this relationship is to explain as much as possible of the variation observed in the response (y) variable, leaving as little variation as possible to unexplained "noise".
Why Use MLR?
• When are multiple explanatory variables required?
• When scientific knowledge and experience tell us they are likely to be useful.
• When residuals of an SLR indicate a temporal trend (suggesting time as an additional explanatory variable), a spatial trend (suggesting spatial coordinates as explanatory variables), or seasonality (suggesting variables which indicate the season in which each data point was collected).

2
MLR Model

The MLR model relates y to k explanatory variables:

y = β0 + β1 x1 + β2 x2 + … + βk xk + ε

• There are k explanatory variables, some of which may be related or correlated with each other.
• It is best to avoid calling these "independent" variables; they may or may not be independent of each other.
• Calling them explanatory variables describes their purpose: to explain the variation in the response variable.

3
Hypothesis Tests for Multiple Regression
Nested F Tests
Let model "s" be the simpler MLR model

y = β0 + β1 x1 + … + βk xk + ε

It has k+1 parameters including the intercept, with degrees of freedom dfs = n − (k+1). Its sum of squared errors is SSEs.

Let model "c" be the more complex regression model

y = β0 + β1 x1 + … + βk xk + βk+1 xk+1 + … + βm xm + ε

It has m+1 parameters and degrees of freedom dfc = n − (m+1). Its sum of squared errors is SSEc.

Test: whether c provides a sufficiently better explanation of the variation in y than does s (H0: the additional coefficients βk+1, …, βm are all zero).

The models are "nested" because all of the k explanatory variables in s are also present in c, and thus s is nested within c.

The test statistic is

F = [(SSEs − SSEc) / (dfs − dfc)] / (SSEc / dfc)

If F exceeds the tabulated value of the F distribution, Fα(dfs − dfc, dfc), then H0 is rejected.

If F is small, the additional variables are adding little to the model, and s would be chosen over c.
4
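A minimal sketch of the nested F test in Python with NumPy and SciPy; the data and variable names here are illustrative, not from the lecture's data set.

```python
import numpy as np
from scipy import stats

def sse(y, X):
    """Sum of squared errors from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def nested_f_test(y, X_simple, X_complex, alpha=0.05):
    """Compare a simpler model (s) nested within a more complex model (c)."""
    n = len(y)
    df_s = n - X_simple.shape[1]          # n - (k+1)
    df_c = n - X_complex.shape[1]         # n - (m+1)
    sse_s, sse_c = sse(y, X_simple), sse(y, X_complex)
    F = ((sse_s - sse_c) / (df_s - df_c)) / (sse_c / df_c)
    p_value = stats.f.sf(F, df_s - df_c, df_c)
    return F, p_value, p_value < alpha    # True -> reject H0, keep the complex model

# Illustrative use: x1 appears in both models, x2 only in the complex model
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = 2.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(scale=0.5, size=50)
Xs = np.column_stack([np.ones(50), x1])
Xc = np.column_stack([np.ones(50), x1, x2])
print(nested_f_test(y, Xs, Xc))
```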
Variance-Covariance Matrix
Values of the k explanatory variables for each of the n observations, along with a vector of 1s for the intercept term, can
be combined into a matrix X:

X is used in MLR to compute the variance-covariance matrix σ²·(X'X)−1, where (X'X)−1 is the "X prime X inverse" matrix. For three explanatory variables, (X'X)−1 is a 4×4 symmetric matrix with elements

(X'X)−1 = [ C00 C01 C02 C03
            C10 C11 C12 C13
            C20 C21 C22 C23
            C30 C31 C32 C33 ]

When multiplied by the error variance σ² (estimated by the variance of the residuals, s²), the diagonal elements C00 through C33 become the variances of the regression coefficients, while the off-diagonal elements become the covariances between the coefficients.
Both (X'X)−1 and s² can be output from MLR software.
5
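As a rough sketch of how (X'X)−1 and s² combine into coefficient variances, standard errors, and the confidence intervals discussed on the next slide; the data are synthetic and purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.4, size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])   # vector of 1s for the intercept term
b, *_ = np.linalg.lstsq(X, y, rcond=None)       # estimated coefficients b0..b3
resid = y - X @ b
p = X.shape[1]
s2 = resid @ resid / (n - p)                    # s^2, the estimate of the error variance

XtX_inv = np.linalg.inv(X.T @ X)                # the "X prime X inverse" matrix
cov_b = s2 * XtX_inv                            # variance-covariance matrix of the coefficients
se_b = np.sqrt(np.diag(cov_b))                  # standard errors: s * sqrt(Cjj)

# 95% confidence intervals on each beta_j
t_crit = stats.t.ppf(0.975, df=n - p)
ci = np.column_stack([b - t_crit * se_b, b + t_crit * se_b])
print(ci)
```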
Confidence Intervals for Slope Coefficients

Interval estimates for the regression coefficients β0 through βk are often printed by MLR software.

A 100·(1−α)% confidence interval on βj is

bj − t(α/2, n−k−1)·s·√Cjj  ≤  βj  ≤  bj + t(α/2, n−k−1)·s·√Cjj

where Cjj is the diagonal element of (X'X)−1 corresponding to the jth explanatory variable.

The standard error of the regression coefficient, also often printed, is

se(bj) = s·√Cjj
6
Regression Diagnostics
• It is important to use graphical tools to diagnose deficiencies in MLR.
• The following residuals plots are very important:
  • normal probability plots of residuals,
  • residuals versus predicted values (to identify curvature or heteroscedasticity),
  • residuals versus time sequence or location (to identify trends),
  • residuals versus any candidate explanatory variables not in the model (to identify variables, or appropriate transformations of them, which may be used to improve the model fit).
7
Partial Residual Plots
As with SLR, curvature in a plot of residuals versus an explanatory variable included in the model indicates that a transformation of that explanatory variable is required; their relationship should be linear.
To see this relation, residuals should not be plotted directly against the explanatory variables, because the other explanatory variables will influence these plots.
For example, curvature in the relationship between e and x1 may show up in the plot of e versus x2, erroneously indicating that a transformation of x2 is required.
To avoid such effects, partial residual plots (also called adjusted variable plots) should be constructed.

The partial residual is

ej* = y − ŷ(j)

where ŷ(j) is the predicted value of y from a regression equation in which xj is left out of the model. All other candidate explanatory variables are present.

This partial residual is then plotted versus an adjusted explanatory variable

xj* = xj − x̂(j)

where x̂(j) is the xj predicted from a regression against all other explanatory variables. So xj is treated as a response variable in order to compute its adjusted value.

The partial plot (ej* versus xj*) describes the relationship between y and the jth explanatory variable after all effects of the other explanatory variables have been removed.
Only the partial plot accurately indicates whether a transformation of xj is necessary.
8
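A minimal sketch of computing the adjusted coordinates (xj*, ej*) for one candidate variable; the data are synthetic and the plotting itself is left out.

```python
import numpy as np

def ols_resid(y, X):
    """Residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def partial_plot_coords(y, X, j):
    """
    Coordinates for an adjusted-variable (partial residual) plot of column j of X.
    X includes an intercept column; column j is the variable of interest.
    """
    others = np.delete(X, j, axis=1)      # all other columns, intercept included
    e_star = ols_resid(y, others)         # y with the other variables' effects removed
    x_star = ols_resid(X[:, j], others)   # x_j treated as a response and adjusted likewise
    return x_star, e_star                 # plot e_star versus x_star

# Illustrative use with two explanatory variables and curvature in x2
rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=(2, 60))
y = 3 + 2 * x1 + x2**2 + rng.normal(scale=0.3, size=60)
X = np.column_stack([np.ones(60), x1, x2])
xs, es = partial_plot_coords(y, X, j=2)   # curvature should show up in this plot
```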
Leverage and Influence
• Regression diagnostics are much more important in MLR than in SLR.
• Because the explanatory variables are multidimensional, it is very difficult when performing multiple regression to recognize points of high leverage or high influence from any set of plots.
• One observation may not be exceptional in terms of each of its explanatory variables taken one at a time, but viewed in combination it can be very exceptional. Numerical diagnostics can accurately detect such anomalies.
• The leverage statistic hi = x0'(X'X)−1x0 expresses the distance of a given point x0 from the center of the sample observations. It has two important uses in MLR:
  • A direct extension of its use in SLR: to identify points unusual in the values of the explanatory variables. Such points warrant further checking as possible errors, or may indicate a poor model.
  • When making predictions: the leverage value for a prediction point should not exceed the largest hi in the original data set.

9
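A short sketch of the leverage computation hi = xi'(X'X)−1xi, flagged against the 3p/n screening value used in Example 1; the data are synthetic.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix: h_i = x_i' (X'X)^-1 x_i for each row x_i of X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

rng = np.random.default_rng(3)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
h = leverages(X)
p = X.shape[1]
flagged = np.where(h > 3 * p / n)[0]   # observations unusual in the space of the x's
print(h.max(), flagged)
```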
Example 1
Variations in chemical concentrations within a steeply
dipping aquifer are to be described by location and
depth.
The data are concentrations (C) plus three
coordinates: distance east (DE), distance north
(DN), and well depth (D).
Data were generated using C = 30 + 0.5 D + ε.
Any acceptable regression model should closely
reproduce this true model, and should find C to be
independent of DE and DN.
10
11
Three pairwise plots of the explanatory variables do not reveal any "outliers" in the data set.

Compared to the critical leverage statistic hi = 3p/n = 0.6 and the critical influence statistic DFFITS = 2·(p/n)^0.5 = 0.9, the 16th observation is found to be a point of high leverage and high influence.

When the axes are rotated, observation 16 is seen to lie outside the plane of occurrence of the rest of the data, even though its individual values for the three explanatory variables are not unusual.
12
• The depth value for observation 16 contains a "typographical error": it should be 23.111 instead of 13.111.
• What does this error, and the resulting high leverage point, do to a regression of concentration versus the three explanatory variables?
• From the t-ratios it is seen that DN and perhaps DE appear to be significantly related to Conc, but depth (D) is not.
• This is exactly opposite of what is known to be true.
13
• One outlier has had a severe detrimental effect on the regression coefficients and model structure.
• Points of high leverage and influence should always be examined before accepting a regression model, to determine whether they represent errors.
• Suppose that the "typographical error" was detected and corrected.
• The table shows that the resulting regression relationship is drastically changed:

Based on the t-statistics, DE and DN are not significantly related to C, while depth is.

The intercept of 29 is close to the true value of 30, and the slope for depth (0.7) is not far from the true value of 0.5.

For observation 16, hi = 0.19 and DFFITS = 0.48, both well below their critical values.

Thus no observations have undue influence on the regression equation.
14


• Since DE and DN do not appear to belong in the regression model, dropping them produces the equation below, with values very close to the true values from which the data were generated.
• Thus, by using regression diagnostics to inspect observations deemed unusual, a poor regression model was turned into an acceptable one.

15
Multi-Collinearity
• Multi-collinearity occurs when at least one explanatory variable is closely related to one or more other explanatory variables.
• It results in several undesirable consequences for the regression equation:
  • Equations acceptable in terms of overall F-tests may have slope coefficients with magnitudes which are unrealistically large, and whose partial F- or t-tests are found to be insignificant.
  • Coefficients may be unrealistic in sign (e.g., a negative slope for a regression of streamflow versus precipitation). Usually this occurs when two variables describing approximately the same thing counter-balance each other in the equation, having opposite signs.
  • Slope coefficients are unstable. A small change in one or a few data values could cause a large change in the coefficients.
  • Automatic procedures such as stepwise, forward, and backward methods produce different models judged to be "best".
• An excellent diagnostic for measuring multi-collinearity is the variance inflation factor (VIF). For variable j the VIF is

VIFj = 1 / (1 − Rj²)

where Rj² is the R² from a regression of the jth explanatory variable on all of the other explanatory variables -- the equation used for adjustment of xj in partial plots.

The ideal is VIFj ≅ 1, corresponding to Rj² ≅ 0.

Serious problems are indicated when VIFj > 10 (Rj² > 0.9).

A useful interpretation of VIF is that multi-collinearity "inflates" the width of the confidence interval for the jth regression coefficient by the amount √VIFj compared to what it would be with a perfectly independent set of explanatory variables.
16
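A small sketch of computing VIFj = 1/(1 − Rj²) for each explanatory variable by regressing it on the others; the correlated data below are synthetic and only for illustration.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (X should NOT contain the intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2_j = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2_j)
    return out

rng = np.random.default_rng(4)
de = rng.uniform(0, 10, 100)
dn = rng.uniform(0, 10, 100)
X = np.column_stack([de, dn, de**2, dn**2, de * dn])   # uncentered squares and product
print(vif(X))                                          # expect several values above 10
```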
Multi-Collinearity
• Multicollinearity refers to linear inter-correlation among explanatory variables.
• If variables correlate highly, they are redundant in the same model.
• A principal danger of such data redundancy is overfitting in regression models.
• The best regression models are those in which the predictor variables each correlate highly with the dependent (outcome) variable but correlate at most only minimally with each other.
• Such a model is often called "low noise" and will be statistically robust (that is, it will predict reliably across numerous samples drawn from the same statistical population).
• SPSS: Collinearity diagnostics. Eigenvalues of the scaled and uncentered cross-products matrix, condition indices, and variance-decomposition proportions are displayed along with variance inflation factors (VIF) and tolerances for individual variables.

17
Solutions for multi-collinearity
• Center the data. Centering redefines the explanatory variables by subtracting a constant from the original variable and then recomputing the derived variables. This constant should be one which produces about as many positive values as negative values, such as the mean or median. When all of the derived explanatory variables are recomputed as functions (squares, products, etc.) of these centered variables, their multi-collinearity will be reduced (a brief sketch follows below).
• Eliminate variables.
• Collect additional data.

18
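A brief sketch of how centering before forming squared and product terms reduces the VIFs; here the VIFs are taken from the diagonal of the inverse correlation matrix of the predictors, and the data are synthetic.

```python
import numpy as np

def vifs(X):
    """VIFs via the diagonal of the inverse of the predictors' correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

rng = np.random.default_rng(5)
de = rng.uniform(0, 10, 100)
dn = rng.uniform(0, 10, 100)
dec, dnc = de - np.median(de), dn - np.median(dn)   # center on the median

X_raw      = np.column_stack([de,  dn,  de**2,  dn**2,  de * dn])
X_centered = np.column_stack([dec, dnc, dec**2, dnc**2, dec * dnc])

print(vifs(X_raw))        # several values well above 10
print(vifs(X_centered))   # much closer to 1 after centering
```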
Example 2 -- centering
• lnC is to be related to distance east and distance north of a city.
• Since the square of distance east (DESQ) must be strongly related to DE, and similarly DNSQ to DN, and DE•DN to both DE and DN, multi-collinearity between these variables will be detected by their VIFs.

19
Using the rule that any VIF above 10 indicates a strong dependence between variables, the table shows that all variables have high VIFs. Therefore all of the slope coefficients are unstable, and no conclusions can be drawn from the value of 10.5 for DE, or 15.1 for DN, etc.
This cannot be considered a good regression model, even though the R² is large.

20
• DE and DN are centered by subtracting their medians.
• The three derived variables DESQ, DNSQ and DEDN are recomputed, and the regression rerun.
• The table gives the results, showing that the multi-collinearity is completely removed.
• The coefficients for DE and DN are now more reasonable in size, while the coefficients for the derived variables are exactly the same.
• The t-statistics for DE and DN have changed because their uncentered values were unstable and their t-tests unreliable.
• Note that s and R² are also unchanged. In fact, this is exactly the same model as the uncentered equation, only expressed in a different, centered coordinate system.

21
Choosing the Best MLR Model
• An appropriate approach to variable selection is needed.
• The benefit of adding variables to a multiple regression model is that more of the variance of the response variable is explained.
• The cost of adding variables is that the degrees of freedom decrease, making it more difficult to find significance in hypothesis tests and increasing the width of confidence intervals.
• A good model will explain as much of the variance of y as possible with a small number of explanatory variables.
• Consider only explanatory variables which should have some effect on the dependent variable. Simply minimizing the SSE or maximizing R² is not a sufficient criterion.
• Variables should enter the model only if they make a significant improvement in the model.
• There are at least two types of approaches for evaluating whether a new variable sufficiently improves the model:
  • The first approach uses some overall measure of model quality; it has many advantages.
  • The second approach uses partial F- or t-tests and, when automated, is often called a "stepwise" procedure.
22
Overall Measures of Quality
Mallows' Cp:
The aim is to achieve a good compromise between the desire to explain as much variance in y as possible (minimize bias) by including all relevant variables, and the desire to minimize the variance of the resulting estimates (minimize the standard error) by keeping the number of coefficients small:

Cp = p + (n − p)·(sp² − ŝ²) / ŝ²

where
n is the number of observations,
p is the number of coefficients (number of explanatory variables plus 1),
sp² is the mean square error (MSE) of this p-coefficient model,
ŝ² is the best estimate of the true error, usually taken to be the minimum MSE among the 2^k possible models.

The best model is the one with the lowest Cp value.

23
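A rough sketch of evaluating Cp over all subsets of a small set of candidate variables; subset enumeration is only practical for modest k, the full-model MSE stands in here for the best estimate of true error, and the data are synthetic.

```python
import numpy as np
from itertools import combinations

def fit_mse(y, X):
    """Mean square error s_p^2 = SSE / (n - p) for an OLS fit."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r / (len(y) - X.shape[1])

def mallows_cp(y, Xfull, cols):
    """Cp = p + (n - p) * (s_p^2 - s_hat^2) / s_hat^2, with s_hat^2 from the full model."""
    n = len(y)
    s_hat2 = fit_mse(y, Xfull)                 # stand-in for the best estimate of true error
    X = Xfull[:, [0] + list(cols)]             # intercept plus the chosen columns
    p = X.shape[1]
    return p + (n - p) * (fit_mse(y, X) - s_hat2) / s_hat2

rng = np.random.default_rng(6)
n = 80
Z = rng.normal(size=(n, 4))
y = 1 + 2 * Z[:, 0] - Z[:, 1] + rng.normal(scale=0.5, size=n)   # columns 3 and 4 are irrelevant
Xfull = np.column_stack([np.ones(n), Z])

results = {c: mallows_cp(y, Xfull, c)
           for r in range(1, 5) for c in combinations(range(1, 5), r)}
best = min(results, key=results.get)           # subset with the lowest Cp
print(best, results[best])
```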
PRESS statistic

One of the best measures of the quality of a regression equation is the "PRESS" statistic, the "PRediction Error Sum of Squares":

PRESS = Σ (yi − ŷ(i))²,  i = 1, …, n

where ŷ(i) is the value predicted for observation i by a model fit without observation i.

PRESS is a validation-type estimator of error.

By minimizing PRESS, the model with the least error in the prediction of future observations is selected.

PRESS and Cp generally agree as to which model is "best", even though their criteria for selection are not identical.

24
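A compact sketch of PRESS; for OLS the leave-one-out prediction errors can be obtained without refitting n times, via the leverage values, e(i) = ei / (1 − hi). The data are synthetic.

```python
import numpy as np

def press(y, X):
    """PRESS = sum of squared leave-one-out prediction errors for an OLS model."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)   # leverages
    return float(np.sum((e / (1 - h)) ** 2))

# Illustrative comparison of two candidate models
rng = np.random.default_rng(7)
n = 60
x1, x2 = rng.normal(size=(2, n))
y = 1 + x1 + rng.normal(scale=0.5, size=n)        # x2 is pure noise
X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x1, x2])
print(press(y, X1), press(y, X2))                  # the model with the smaller PRESS is preferred
```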
Adjusted R² (R²a):
An R² value adjusted for the number of explanatory variables (or equivalently, the degrees of freedom) in the model.

The model with the highest R²a is identical to the one with the smallest standard error (s) or mean squared error (MSE).

Weakness of R²: it must increase when any additional variable is added to the regression, no matter how little explanatory power that variable has. R²a offsets the loss in degrees of freedom by including as a weight the ratio of total to error degrees of freedom:

R²a = 1 − [(n − 1) / (n − p)]·(SSE / SSy)

where SSy is the total sum of squares of y.
25
Stepwise Procedures
• It is possible to automate the variable selection effort with automated model selection methods, in which the computer algorithm determines which model is preferred (a bare-bones sketch of forward selection follows below).
• Forward selection starts with only an intercept and adds variables to the equation one at a time. Once in, each variable stays in the model. All variables not in the model are evaluated with partial F or t statistics in comparison to the existing model. The variable with the highest significant partial F or t statistic is included, and the process repeats until either all available variables are included or no new variables are significant.
• Backward elimination starts with all explanatory variables in the model and eliminates the one with the lowest partial F statistic (or lowest |t|). It stops when all remaining variables are significant.
• Stepwise regression combines the ideas of forward and backward selection. It alternates between adding and removing variables, checking the significance of individual variables within and outside the model. Variables significant when entering the model will be eliminated if they later test as insignificant.

26
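A bare-bones sketch of forward selection using the partial F statistic for each candidate variable; a real analysis would use a statistical package's stepwise routine, and the data here are synthetic.

```python
import numpy as np
from scipy import stats

def sse(y, X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

def forward_select(y, Z, alpha=0.05):
    """Add columns of Z (no intercept) one at a time while the partial F is significant."""
    n = len(y)
    chosen, remaining = [], list(range(Z.shape[1]))
    while remaining:
        X_cur = np.column_stack([np.ones(n), Z[:, chosen]]) if chosen else np.ones((n, 1))
        sse_cur, best = sse(y, X_cur), None
        for j in remaining:
            X_try = np.column_stack([X_cur, Z[:, j]])
            df_c = n - X_try.shape[1]
            sse_try = sse(y, X_try)
            F = (sse_cur - sse_try) / (sse_try / df_c)   # partial F with 1 numerator df
            if stats.f.sf(F, 1, df_c) < alpha and (best is None or F > best[1]):
                best = (j, F)
        if best is None:                                  # no remaining variable is significant
            break
        chosen.append(best[0])
        remaining.remove(best[0])
    return chosen

rng = np.random.default_rng(8)
Z = rng.normal(size=(100, 5))
y = 2 + 3 * Z[:, 0] - 2 * Z[:, 2] + rng.normal(size=100)
print(forward_select(y, Z))          # expect columns 0 and 2 to be selected
```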
Example
• An automotive industry group keeps track of the sales for a variety of personal motor vehicles.
• In an effort to be able to identify over- and under-performing models, you want to establish a relationship between vehicle sales and vehicle characteristics.
• The data contain information concerning different makes and models of cars.
• Use linear regression to identify models that are not selling well.
27
ANOVA (Dependent Variable: Log-transformed sales)

                 Sum of Squares    df    Mean Square      F       Sig.
Regression           130.300        10      13.030      13.305    .000
Residual             138.082       141        .979
Total                268.383       151

The ANOVA table reports a significant F statistic, indicating that the model is statistically significant.

Model Summary

   R      R Square   Adjusted R Square   Std. Error of the Estimate
 .697       .486           .449                   .98960

As a whole, the regression does a good job of modeling sales. Nearly half the variation in sales is explained by the model.

Coefficients (Dependent Variable: Log-transformed sales)

                        B      Std. Error    Beta       t       Sig.
(Constant)           -3.017      2.741               -1.101     .273
Vehicle type           .883       .331       .293      2.670    .008
Price in thousands    -.046       .013      -.502     -3.596    .000
Engine size            .356       .190       .281      1.871    .063
Horsepower            -.002       .004      -.092      -.509    .611
Wheelbase              .042       .023       .241      1.785    .076
Width                 -.028       .042      -.073      -.676    .500
Length                 .015       .014       .148      1.032    .304
Curb weight            .156       .350       .075       .447    .655
Fuel capacity         -.057       .047      -.167     -1.203    .231
Fuel efficiency        .081       .040       .262      2.023    .045

Even though the model fit looks positive, the coefficients table shows that there are too many predictors in the model. There are several non-significant coefficients, indicating that these variables do not contribute much to the model.

To determine the relative importance of the significant predictors, look at the standardized coefficients. Even though Price in thousands has a small coefficient compared to Vehicle type, Price in thousands actually contributes more to the model because it has a larger absolute standardized coefficient.
28
The second section of the coefficients table shows that there might be a problem with multicollinearity.

Coefficients (Dependent Variable: Log-transformed sales)

                        B     Std. Error   Beta      t      Sig.  Zero-order  Partial   Part  Tolerance   VIF
(Constant)           -3.017     2.741             -1.101    .273
Vehicle type           .883      .331      .293     2.670   .008     .274       .219    .161     .304    3.293
Price in thousands    -.046      .013     -.502    -3.596   .000    -.552      -.290   -.217     .187    5.337
Engine size            .356      .190      .281     1.871   .063    -.135       .156    .113     .162    6.159
Horsepower            -.002      .004     -.092     -.509   .611    -.389      -.043   -.031     .112    8.896
Wheelbase              .042      .023      .241     1.785   .076     .292       .149    .108     .200    4.997
Width                 -.028      .042     -.073     -.676   .500     .037      -.057   -.041     .313    3.193
Length                 .015      .014      .148     1.032   .304     .215       .087    .062     .178    5.605
Curb weight            .156      .350      .075      .447   .655    -.041       .038    .027     .131    7.644
Fuel capacity         -.057      .047     -.167    -1.203   .231    -.016      -.101   -.073     .189    5.303
Fuel efficiency        .081      .040      .262     2.023   .045     .121       .168    .122     .217    4.604

For most predictors, the values of the partial and part correlations drop sharply from the zero-order correlation. This means that much of the variance in sales that is explained by price is also explained by other variables.

The tolerance is the percentage of the variance in a given predictor that cannot be explained by the other predictors. Thus, the small tolerances show that 70%-90% of the variance in a given predictor can be explained by the other predictors. When the tolerances are close to 0, there is high multicollinearity.

A variance inflation factor greater than 2 is usually considered problematic, and the smallest VIF in the table is 3.193.

Partial Correlation. The correlation that remains between two variables after removing the correlation that is due to their mutual association with the other variables: the correlation between the dependent variable and an independent variable when the linear effects of the other independent variables in the model have been removed from both.

Part Correlation. The correlation between the dependent variable and an independent variable when the linear effects of the other independent variables in the model have been removed from the independent variable. It is related to the change in R squared when a variable is added to an equation. Sometimes called the semipartial correlation.
29
The collinearity diagnostics confirm that there are serious problems with multicollinearity.

Collinearity Diagnostics

Dimension   Eigenvalue   Condition Index
    1          9.920           1.000
    2           .733           3.678
    3           .259           6.193
    4           .050          14.051
    5           .019          22.589
    6           .008          35.942
    7           .005          44.275
    8           .003          58.480
    9           .002          76.175
   10           .001         130.747
   11           .000         148.267

Several eigenvalues are close to 0, indicating that the predictors are highly intercorrelated and that small changes in the data values may lead to large changes in the estimates of the coefficients.

The condition indices are computed as the square roots of the ratios of the largest eigenvalue to each successive eigenvalue. Values greater than 15 indicate a possible problem with collinearity; greater than 30, a serious problem. Six of these indices are larger than 30, suggesting a very serious problem with collinearity.

Now try to fix the collinearity problems by rerunning the regression using the stepwise method of model selection, in order to include only the most useful variables in the model.

30
Collinearity Diagnostics (Dependent Variable: Log-transformed sales)

Model   Dimension   Eigenvalue   Condition Index
  1         1          1.004          1.000
            2           .996          1.004
  2         1          1.109          1.000
            2           .999          1.054
            3           .891          1.116

There are no eigenvalues close to 0, and all of the condition indices are much less than 15. The strategy has worked, and the model built using stepwise methods does not have problems with collinearity.

Model Summary (Dependent Variable: Log-transformed sales)

Model     R      R Square   Adjusted R Square   Std. Error of the Estimate
  1     .552       .304           .300                  1.11553
  2     .655       .430           .422                  1.01357
a. Predictors: (Constant), Zscore: Price in thousands
b. Predictors: (Constant), Zscore: Price in thousands, Zscore: Wheelbase

The new model's ability to explain sales compares favorably with that of the previous model. Look in particular at the adjusted R-square statistics, which are nearly identical. A model with extra predictors will always have a larger R-square, but the adjusted R-square compensates for model complexity to provide a fairer comparison of model performance.

Coefficients (Dependent Variable: Log-transformed sales)

Model                                B      Std. Error   Beta      t        Sig.   Tolerance    VIF
  1   (Constant)                   3.286       .090               36.316    .000
      Zscore: Price in thousands   -.732       .090     -.552     -8.104    .000     1.000     1.000
  2   (Constant)                   3.290       .082               40.020    .000
      Zscore: Price in thousands   -.783       .083     -.590     -9.487    .000      .988     1.012
      Zscore: Wheelbase             .470       .082      .356      5.718    .000      .988     1.012

31
The stepwise algorithm chooses price and size (in terms of the vehicle wheelbase) as predictors. Sales are negatively affected by price and positively affected by size; your conclusion is that cheaper, bigger cars sell well.

Excluded Variables (Dependent Variable: Log-transformed sales)

Model 1 (Predictors in the Model: (Constant), Zscore: Price in thousands)

                           Beta In      t      Sig.   Partial Correlation   Tolerance    VIF    Minimum Tolerance
Zscore: Type                 .251      3.854   .000          .301              .998     1.002        .998
Zscore: Engine size          .342      4.128   .000          .320              .611     1.636        .611
Zscore: Horsepower           .257      2.062   .041          .167              .293     3.417        .293
Zscore: Wheelbase            .356      5.718   .000          .424              .988     1.012        .988
Zscore: Width                .244      3.517   .001          .277              .892     1.121        .892
Zscore: Length               .308      4.790   .000          .365              .976     1.025        .976
Zscore: Curb weight          .346      4.600   .000          .353              .722     1.385        .722
Zscore: Fuel capacity        .266      3.687   .000          .289              .820     1.219        .820
Zscore: Fuel efficiency     -.198     -2.584   .011         -.207              .758     1.319        .758

Model 2 (Predictors in the Model: (Constant), Zscore: Price in thousands, Zscore: Wheelbase)

                           Beta In      t      Sig.   Partial Correlation   Tolerance    VIF    Minimum Tolerance
Zscore: Type                 .129      1.928   .056          .157              .835     1.197        .827
Zscore: Engine size          .145      1.576   .117          .128              .445     2.246        .445
Zscore: Horsepower           .028       .229   .819          .019              .256     3.910        .256
Zscore: Width               -.025      -.275   .784         -.023              .470     2.126        .470
Zscore: Length               .027       .237   .813          .020              .290     3.448        .290
Zscore: Curb weight          .105      1.028   .306          .084              .365     2.741        .365
Zscore: Fuel capacity        .002       .024   .981          .002              .443     2.259        .443
Zscore: Fuel efficiency      .014       .164   .870          .014              .559     1.790        .559

Price was chosen first because it is the predictor that is most highly correlated with sales. The remaining predictors are then analyzed to determine which, if any, is the most suitable for inclusion at the next step.

Beta In is the value of the standardized coefficient for the predictor if it is included next. In the first step, all of the significance values are less than 0.05, so any of the remaining predictors would be adequate if included in the model. To choose the best variable to add to the model, look at the partial correlation, which is the linear correlation between the proposed predictor and the dependent variable after removing the effect of the current model. Thus, wheelbase is chosen next because it has the highest partial correlation.

After adding wheelbase to the model, none of the remaining predictors are significant. However, vehicle type just barely misses the 0.05 cutoff, so you may want to add it manually in a future analysis to see how it changes the results.

Engine size would have a larger beta coefficient if added to the model, but it is not as desirable as vehicle type, because engine size has a relatively low tolerance compared to vehicle type, indicating that it is more highly correlated with price and wheelbase.
32
Casewise Diagnostics (Dependent Variable: Log-transformed sales)

Case Number   Model      Std. Residual   Log-transformed sales   Predicted Value   Residual
     53       Explorer       2.297               5.62                 3.2953         2.32778
     84       3000GT        -4.905              -2.21                 2.7638        -4.97111
    109       Cutlass       -3.610                .11                 3.7651        -3.65892
    116       Breeze        -2.252               1.66                 3.9393        -2.28296
    118       Prowler       -2.139                .63                 2.7955        -2.16849
    132       SW            -2.012               1.65                 3.6927        -2.03967

The shape of the histogram of residuals follows the shape of the normal curve fairly well, but there are one or two large negative residuals. For more information on these cases, see the casewise diagnostics.

The cases with large negative residuals are the 3000GT and the Cutlass. Relative to other cars of their size and price, these two models underperformed in the market. The Breeze, Prowler, and SW also appear to have underperformed, to a lesser extent. The Explorer seems to be the only overperformer.

In the scatterplot of residuals versus predicted values, the two most underperforming vehicles stand out, while the Breeze, Prowler, SW, and Explorer are quite close to the majority of cases. The apparent underperformance of the Breeze, Prowler, and SW and overperformance of the Explorer could therefore be due to random chance.

Note the clusters of cases far to the left and to the right of the general cluster of cases. While the vehicles in these clusters do not have large residuals, their distance from the general cluster may have given these cases unnecessary influence in determining the regression coefficients.
33
To check the residuals by price:

The unusual points noted in the residuals by predicted values scatterplot are high-priced vehicles.
34
To check the residuals by wheelbase:

The points to the right of the general cluster in this plot correspond to points to the right of the general cluster in the residuals by predicted values scatterplot.
35
To check Cook's distance by the centered leverage value:

The resulting scatterplot shows several unusual points.

The point with the largest Cook's distance is the 3000GT. It does not have a high leverage value, and D < 2.4, so while it adds a lot of variability to the regression estimates, the 3000GT did not affect the slope of the regression equation.

Similarly, many of the cases with high leverage values do not have large Cook's distances, so they are not likely to have exerted too much influence on the model.
36
Summary of Model Selection Criteria
• Should y be transformed? To decide whether to transform the y variable, plot residuals versus predicted values for the untransformed data. Compare this to a residuals plot for the best transformed model, looking for three things:
  • constant variance across the range of ŷ,
  • normality of residuals, and
  • a linear pattern, not curvature.
• Should x (or several x's) be transformed? Transformation of an x variable should be made using partial plots. Check for the same three patterns of constant variance, normality and linearity. Considerable help can be obtained from statistics such as R² (maximize it), or SSE or PRESS (minimize it). Many transformations can be rapidly checked with such statistics, but a residuals plot should always be inspected prior to making any final decision.
• Which of several models, each with the same y and with the same number of explanatory variables, is preferable? Use of R², SSE, or PRESS is appropriate here, but back it up with a residuals plot.
• Which of several nested models, each with the same y, is preferable? Use the partial F test between any pair of nested models to find which is best. One may also select the model based on minimum Cp or minimum PRESS.
• Which of several models is preferable when each uses the same y variable but they are not necessarily nested? Cp or PRESS must be used in this situation.

37
Analysis of Covariance
• Often there are factors which influence the dependent variable that are not appropriately expressed as a continuous variable (they are categorical variables).
• Examples of such grouped or qualitative variables include location (stations, aquifers, positions in a cross section), time (day and night; winter and summer), or region (east-west).
• These factors are perfectly valid explanatory variables in a multiple regression context.
• They can be incorporated by the use of binary or "dummy" variables, essentially blending regression and analysis of variance into an analysis of covariance.

38
Use of One Binary Variable
To the simple one-variable regression model

Y = β0 + β1 X + ε                        (model 1)

an additional factor is believed to have an important influence on Y for any given value of X.
Perhaps this factor is a seasonal one: cold season versus warm season, where some precise definition exists to classify all observations as either cold or warm.

A second variable, a binary variable Z, is added to the equation, where

Z = 0 for the cold season and Z = 1 for the warm season,

to produce the model

Y = β0 + β1 X + β2 Z + ε                 (model 2)

X: quantitative covariate; Z: qualitative predictor to be tested.

When the slope coefficient β2 is significant, model 2 would be preferred to the SLR model (model 1). This also says that the relationship between Y and X is affected by season.
39
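A minimal sketch of fitting the season-shift model Y = β0 + β1X + β2Z + ε and testing H0: β2 = 0 with a t test; the data are synthetic, with Z = 0 for the cold season and 1 for the warm season.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 50
x = rng.uniform(0, 10, n)
z = (rng.random(n) < 0.5).astype(float)           # 0 = cold season, 1 = warm season
y = 2 + 0.8 * x + 3 * z + rng.normal(scale=0.7, size=n)

X = np.column_stack([np.ones(n), x, z])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - 3)                      # three betas are estimated
cov_b = s2 * np.linalg.inv(X.T @ X)

t_b2 = b[2] / np.sqrt(cov_b[2, 2])                # t statistic for H0: beta2 = 0
p = 2 * stats.t.sf(abs(t_b2), df=n - 3)
print(b, t_b2, p)                                  # a small p implies two parallel lines are needed
```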
Consider H0: β2 = 0 versus H1: β2 ≠ 0. The null hypothesis is tested using a student's t-test with (n−3) degrees of freedom, since 3 betas are being estimated.

If the partial |t| ≥ tα/2, H0 is rejected, implying that there are two models:

Y = b0 + b1 X                    (cold season, Z = 0)
Y = (b0 + b2) + b1 X             (warm season, Z = 1)

The regression lines differ for the two seasons. Both seasons have the same slope, but different intercepts, and will plot as two parallel lines.
40
• Suppose that the relationship between X and Y for the two seasons is suspected not only to differ in intercept, but in slope as well. The model becomes

Y = β0 + β1 X + β2 Z + β3 Z·X + ε        (model 3)

The intercept equals β0 for the cold season and β0 + β2 for the warm season;
the slope equals β1 for the cold season and β1 + β3 for the warm season.

This is referred to as an "interaction model" because of the use of the explanatory variable Z·X, the interaction (product) of the original predictor X and the binary variable Z.

To determine whether the simple regression model with no Z terms can be improved upon by this model, the following hypotheses are tested with a nested F test:

H0: β2 = β3 = 0 versus H1: β2 and/or β3 ≠ 0,

where s refers to the simpler (no Z terms) model and c refers to the more complex model.
41
• If H0 is rejected, the interaction model (model 3) should also be compared to the shift-in-intercept-only model (model 2) to determine whether there is a change in slope in addition to the change in intercept, or whether the rejection of the SLR model (model 1) in favor of model 3 was due only to a shift in intercept.
• The null hypothesis H0': β3 = 0 is compared to H1': β3 ≠ 0 using the test statistic t = b3 / se(b3).

Assuming H0 and H0' are both rejected, the interaction model can be expressed as two separate equations:

Y = b0 + b1 X                          (cold season)
Y = (b0 + b2) + (b1 + b3) X            (warm season)
42
Multiple Binary Variables
• In some cases, the factor of interest must be expressed as more than two categories: 4 seasons, 12 months, 5 stations, 3 flow conditions (rising limb, falling limb, base flow), etc.
• Assume there are precise definitions of the 3 flow conditions so that all discharge (Xi) and concentration (Yi) pairs are classified as either rising, falling, or base flow.
• Two binary variables are required to express these three categories -- there is always one less binary variable required than the number of categories. With R = 1 for the rising limb (0 otherwise) and D = 1 for the falling limb (0 otherwise), the model is

Y = β0 + β1 X + β2 R + β3 D + ε

so that base flow (R = D = 0) is the reference category.
43
44
To test for differences between each pair of categories:

1. Is rising different from base flow? This is tested using the t-statistic on the coefficient β2.
   If |t| > tα/2 with n−4 degrees of freedom, reject H0: β2 = 0.

2. Is falling different from base flow? This is tested using the t-statistic on the coefficient β3.
   If |t| > tα/2 with n−4 degrees of freedom, reject H0: β3 = 0.

3. Is rising different from falling? Here the standard error of the difference (b2−b3) must be known.
   The null hypothesis is H0: (β2 − β3) = 0.
   The estimated variance of b2−b3 is Var(b2−b3) = Var(b2) + Var(b3) − 2Cov(b2, b3),
   where Cov is the covariance between b2 and b3.
   To determine these terms, the matrix (X'X)−1 and s² (the mean square error) are required. Then the test statistic is

   t = (b2 − b3) / √Var(b2−b3),

   again compared to tα/2 with n−4 degrees of freedom (see the sketch after this slide).
45
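A short sketch of the rising-versus-falling comparison: the variance of b2 − b3 is assembled from s²(X'X)−1 and the t statistic is (b2 − b3)/se(b2 − b3). The flow-condition coding follows the slide; the data are synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 60
x = rng.uniform(1, 100, n)                       # discharge
cond = rng.integers(0, 3, n)                     # 0 = base flow, 1 = rising, 2 = falling
R = (cond == 1).astype(float)                    # binary variable for the rising limb
D = (cond == 2).astype(float)                    # binary variable for the falling limb
y = 5 + 0.02 * x + 1.5 * R + 0.5 * D + rng.normal(scale=0.4, size=n)

X = np.column_stack([np.ones(n), x, R, D])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - 4)
C = s2 * np.linalg.inv(X.T @ X)                  # covariance matrix of the coefficients

var_diff = C[2, 2] + C[3, 3] - 2 * C[2, 3]       # Var(b2) + Var(b3) - 2 Cov(b2, b3)
t = (b[2] - b[3]) / np.sqrt(var_diff)
p = 2 * stats.t.sf(abs(t), df=n - 4)
print(t, p)                                      # is rising different from falling?
```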
• Even greater complexity can be added to these kinds of models, using multiple binary variables and interaction terms such as
  Y = β0 + β1 X + β2 R + β3 D + β4 R·X + β5 D·X + ε.
• The procedures for selecting models follow the pattern described above.
• The significance of an individual β coefficient, given all the other βs, can be determined from the t statistic.
• The comparison of two models, where the set of explanatory variables for one model is a subset of those used in the other model, is computed by a nested F test.
• The determination of whether two coefficients in a given model differ significantly from each other is computed by using a t test after estimating the variance of the difference between the coefficients, based on the elements of the (X'X)−1 matrix and s².

46
Example
• Study on the faba bean plant (Vicia faba) for use as an ecotoxicological indicator species.
• One specific endpoint investigated was DNA damage in nuclei from V. faba root tip cells. Greater damage is indicated by higher percentages of separated DNA.
• Of interest was whether changing the treatment of the cells with different endonucleases produced differences in DNA damage.
• The study protocol also called for a variety of treatment times, since increasing treatment time was thought to increase genotoxic response and enhance the assay's potential as an ecotoxicological screen.
• Below are the mean percentage responses of separated DNA as a function of treatment time (x = 0, 10, 20, or 30 min) and endonuclease treatment (FokI, EcoRI, DNaseI+MgCl2, or DNaseI+MnCl2).
• Exposure time: quantitative covariate.
• Endonuclease treatment: qualitative predictor to be tested.

Percent separated DNA by endonuclease treatment and treatment time (0, 10, 20, 30 min):

FokI            20.5  65.4  68.4  78.1
                13.4  56    54.7  72.8
                17.2  57.3  62.8  81.9
                19.6  59.2  61.8  79.4
EcoRI            6.8  10.5  49    62.1
                13    12.6  40.6  50.2
DNaseI+MgCl2    18.2  24.5  28    48.2
                21.9  28    31.6  47.8
                 2.6  56.2
                 2.4  56
                 9.9  52.7
                14.4  42.6
DNaseI+MnCl2    58.9  91.5  96.6
                65.2  89.9  96.6
47
Tests of Between-Subjects Effects (Dependent Variable: LogPi)

Source             Type III Sum of Squares    df    Mean Square      F        Sig.   Partial Eta Squared
Corrected Model           18.212               4       4.553       70.527     .000         .873
Intercept                  4.717               1       4.717       73.068     .000         .641
time                       8.561               1       8.561      132.617     .000         .764
Treatment                 11.008               3       3.669       56.839     .000         .806
Error                      2.647              41        .065
Total                     21.410              46
Corrected Total           20.859              45
a. R Squared = .873 (Adjusted R Squared = .861)

• The F-test for H0: β2 = β3 = β4 = β5 = 0, i.e., whether endonuclease treatment makes a difference in the generation of DNA damage, produces a test statistic of Fcalc = 56.84.
• There is a clear difference among treatments after adjusting for possible differences in percent DNA damage due to treatment time.

48
Polynomial regression
In many environmental applications, the mean response is affected in a much more complex fashion than can be described by simple linear relationships. Higher-order polynomial terms are added:

Yi = β0 + β1 xi + β2 xi² + … + βp xi^p + εi

This is a special form of multiple regression.

Centering x on its mean can be used to avoid multicollinearity among the polynomial terms.

To test the overall effect of the x variable, assess H0: β1 = β2 = … = βp = 0 via an F statistic, or use t-tests on individual regression coefficients.
For example, H0: β2 = 0, using tcalc = b2/se(b2), assesses whether quadratic curvature is required in a model that already has a linear term present, by comparing the full model Yi = β0 + β1 xi + β2 xi² to the reduced model Yi = β0 + β1 xi.
49
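A brief sketch of a quadratic polynomial fit (p = 2) with the predictor centered on its mean to limit collinearity between x and x²; the data are synthetic, not the rice-yield data of the following example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 40
x = rng.uniform(15, 25, n)                                 # e.g. minimum temperature
xc = x - x.mean()                                          # centering reduces multicollinearity
y = 3.5 - 0.35 * xc + 0.04 * xc**2 + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), xc, xc**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - 3)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

t_quad = b[2] / se[2]                                      # is the quadratic term needed?
p_quad = 2 * stats.t.sf(abs(t_quad), df=n - 3)
print(b, t_quad, p_quad)
```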
Example
Yield of irrigated rice as a function of minimum temperature, with a goal of understanding the environmental conditions that affect yield.

A quadratic polynomial may provide a good approximation, so consider the polynomial regression model with p = 2.
ANOVAb

Sum of
Model Squares df Mean Square F Sig.
1 Regression 53.932 2 26.966 78.094 .000a
Residual 12.776 37 .345
Total 66.708 39
a. Predictors: (Constant), mintsquare, mint
b. Dependent Variable: yield

There is a significant effect of temperature on yield.

Coefficientsa

Unstandardized Standardized
Coefficients Coefficients 95% Confidence Interval for B Correlations Collinearity Statistics
Model B Std. Error Beta t Sig. Lower Bound Upper Bound Zero-order Partial Part Tolerance VIF
1 (Constant) 3.509 .119 29.449 .000 3.267 3.750
mint -.356 .034 -.776 -10.546 .000 -.424 -.287 -.843 -.866 -.759 .956 1.046
mintsquare .041 .009 .319 4.341 .000 .022 .060 .482 .581 .312 .956 1.046
a. Dependent Variable: yield

Confidence intervals fail to contain zero, so each term contributes significantly to the model.
50
Reference

Statistical Methods in Water Resources


by D.R. Helsel and R.M. Hirsch

51
