
Diagnostics and Transformations

DSO-530

F-testing, multicollinearity,
standardized and studentized residuals,
outliers and leverage, nonconstant variance,
non-normality, nonlinearity, and transformations

Mladen Kolar ([email protected])


The F-test
The F-test tries to formalize the idea of a big R².

The test statistic is

f = [SSR/(p − 1)] / [SSE/(n − p)] = [R²/(p − 1)] / [(1 − R²)/(n − p)]

If f is big, then the regression is “worthwhile”:


▶ Big SSR relative to SSE?
▶ R2 close to one?

What we are really testing:

H0 : β1 = β2 = · · · = βd = 0
H1 : at least one βj ̸= 0.

Hypothesis testing only gives a yes/no answer.


▶ Which βj ̸= 0?
▶ How many?

The test is contained in the .summary() for any MLR fit.

smf.ols("price ~ make", data=pickup).fit().summary()

R-squared: 0.021
F-statistic: 0.4628
Prob (F-statistic): 0.633

smf.ols("price ~ make + miles",


data=pickup).fit().summary()

R-squared: 0.446
F-statistic: 11.25
Prob (F-statistic): 1.51e-05
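As a sanity check, the reported F-statistic can be recomputed from R² with the formula above. A minimal sketch, assuming the pickup DataFrame from these slides is loaded:

import statsmodels.formula.api as smf

fit = smf.ols("price ~ make + miles", data=pickup).fit()
f = (fit.rsquared / fit.df_model) / ((1 - fit.rsquared) / fit.df_resid)
print(f, fit.fvalue, fit.f_pvalue)   # f should match fit.fvalue (about 11.25)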

Multicollinearity
Multicollinearity refers to strong linear dependence between
some of the covariates in a multiple regression model.

The usual marginal effect interpretation is lost:


▶ a change in one X variable comes along with changes in the others.

Coefficient standard errors will be large, such that multicollinearity leads to large uncertainty about the bj ’s.

Example: how employee ratings of their supervisor relate to
performance metrics.

The Data:
Y: Overall rating of supervisor
X1: Opportunity to learn new things
X2: Does not allow special privileges
X3: Raises based on performance

Suppose that you regress Y onto X1 and X2 = 10 × X1 .

Then

E[Y |X1 , X2 ] = β0 + β1 X1 + β2 X2 = β0 + β1 X1 + β2 (10X1 )

and the marginal effect of X1 on Y is

∂E[Y |X1 , X2 ] / ∂X1 = β1 + 10β2

▶ X1 and X2 do not act independently!

Multicollinearity is not a big problem in and of itself; you just
need to know that it is there.

If you recognize multicollinearity:


▶ Understand that the βj are not true marginal effects.
▶ Consider dropping variables to get a simpler model.
▶ Expect to see big standard errors on your coefficients
(i.e., your coefficient estimates are unstable).
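One way to quantify how strong the collinearity is (a sketch, not from the slides) uses variance inflation factors from statsmodels; the DataFrame name df and the columns X1–X3 below are assumed placeholders for the covariates:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["X1", "X2", "X3"]])          # design matrix with intercept
vif = {col: variance_inflation_factor(X.values, i)   # VIF_j = 1 / (1 - R_j^2)
       for i, col in enumerate(X.columns)}
print(vif)                                           # values above ~5-10 signal strong collinearity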

Recall model assumptions

Y |X ∼ N (β0 + β1 X1 + · · · + βd Xd , σ²)

Key assumptions of our linear regression model:


(i) The conditional mean of Y is linear in X.
(ii) The additive errors (deviations from line)
▶ are Normally distributed
▶ independent from each other
▶ identically distributed (i.e., they have constant variance)

Inference and prediction rely on this model being true!

If the model assumptions do not hold, then all bets are off:
▶ prediction can be systematically biased
▶ standard errors and confidence intervals are wrong
(but how wrong?)

We will focus on using graphical methods (plots!) to detect violations of the model assumptions.

You’ll see that


▶ It is more of an art than a science,
▶ but it is grounded in mathematics.

Plotting e vs Ŷ is your #1 tool for finding fit problems.

Why?
▶ Because it gives a quick visual indicator of whether or not
the model assumptions are true.

What should we expect to see if they are true?


1. Each εi has the same variance (σ²).
2. Each εi has the same mean (0).
3. The εi collectively have the same Normal distribution.

Remember: Ŷ is made from X in SLR and MLR, so one plot summarizes across the X. [more on MLR later]
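A minimal sketch of this plot, assuming fit is any fitted statsmodels OLS results object:

import matplotlib.pyplot as plt

plt.scatter(fit.fittedvalues, fit.resid)       # e versus Y-hat
plt.axhline(0, linestyle="--", color="gray")   # residuals should scatter evenly around 0
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()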

How do we check these?

Well, the true εi residuals are unknown, so we must look instead at the least squares estimated residuals.

▶ We estimate Yi = b0 + b1 Xi + ei , such that the sample least squares regression residuals are ei = Yi − Ŷi .

What should the ei look like if the SLR/MLR model is true?

Nonconstant variance
One of the most common violations (problems?) in real data
▶ E.g. A trumpet shape in the scatterplot

We can try to stabilize the variance . . . or do robust inference

Variance stabilizing transformations
This is one of the most common model violations; luckily, it is
usually fixable by transforming the response (Y ) variable.

log(Y ) is the most common variance stabilizing transform.


▶ If Y has only positive values (e.g. sales) or is a count
(e.g. # of customers), take log(Y ) (always natural log).

In general, think about the scale on which you expect linearity.

Reconsider the regression of truck price onto year, after
removing trucks older than 1993 (truck[year>1992,]).
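A sketch of the same subset-and-transform step in Python, assuming the trucks live in the pickup DataFrame with price and year columns (truck[year>1992,] is the R notation for the same subset):

import numpy as np
import statsmodels.formula.api as smf

recent = pickup[pickup["year"] > 1992]                         # drop trucks older than 1993
raw_fit = smf.ols("price ~ year", data=recent).fit()           # original scale
log_fit = smf.ols("np.log(price) ~ year", data=recent).fit()   # variance-stabilizing log(Y)
# compare the residual-vs-fitted plots of raw_fit and log_fit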

Warning: be careful when interpreting the transformed model.

If E[log(Y )|X] = b0 + b1 X, then E[Y |X] ≈ e^(b0) · e^(b1 X).


We have a multiplicative model now!

Also, you cannot compare R² values for regressions corresponding to different transformations of the response.
▶ Y and f (Y ) may not be on the same scale,
▶ therefore var(Y ) and var(f (Y )) may not be either.

Look at residuals to see which model is better.

The log-log model
The other common covariate transform is log(X).
▶ When X-values are bunched up, log(X) helps spread
them out and reduces the leverage of extreme values.
▶ Recall that both reduce s_b1 (the standard error of b1).

In practice, this is often used in conjunction with a log(Y ) response transformation.
▶ The log-log model is

log(Y ) = β0 + β1 log(X) + ε.

▶ It is super useful, and has some special properties ...

Recall that
▶ log is always natural log, with base e = 2.718 . . ., and
▶ log(ab) = log(a) + log(b)
▶ log(a^b) = b log(a).

Consider the multiplicative model E[Y |X] = A · X^B.

Take logs of both sides to get

log(E[Y |X]) = log(A) + log(X^B) = log(A) + B log(X) ≡ β0 + β1 log(X).

The log-log model is appropriate whenever things are linearly related on a multiplicative, or percentage, scale.
(See handout on Brightspace.)
Consider a country’s GDP as a function of IMPORTS:
▶ Since trade multiplies, we might expect to see %GDP
increase with %IMPORTS.

Elasticity and the log-log model
In a log-log model, the slope β1 is sometimes called elasticity.

An elasticity is (roughly) the % change in Y per 1% change in X:

β1 ≈ d%Y / d%X

For example, economists often assume that GDP has import elasticity of 1. Indeed:
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 1.8915 0.343 5.520 0.000 1.183 2.600
np.log(IMPORTS) 0.9693 0.088 11.007 0.000 0.787 1.152

(Can we test whether the elasticity is 1?)
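One quick way to answer that question: form a t-statistic for H0 : β1 = 1 from the estimate and standard error reported above. A back-of-the-envelope sketch:

b1, se = 0.9693, 0.088
t = (b1 - 1) / se      # about -0.35: well inside +/- 2, so an elasticity of 1 is plausible
print(t)
# (statsmodels results objects also have a .t_test() method for tests like this)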


Price elasticity
In marketing, the slope coefficient β1 in the regression

log(sales) = β0 + β1 log(price) + ε

is called price elasticity:


▶ the % change in sales per 1% change in price.

The model implies that E[sales|price] = A · price^β1 , such that β1 is the constant rate of change.

————
Economists have “demand elasticity” curves, which are just more
general and harder to measure.

Example: we have Nielsen SCANTRACK data on supermarket
sales of a canned food brand produced by Consolidated Foods.

Run the regression to determine price elasticity:
smf.ols("np.log(Sales) ~ np.log(Price)", data=confood).fit().summary()

coef std err t P>|t| [0.025 0.975]


---------------------------------------------------------------------------------
Intercept 4.8029 0.174 27.534 0.000 4.453 5.153
np.log(Price) -5.1477 0.510 -10.097 0.000 -6.172 -4.124

Sales decrease by about 5% for every 1% price increase.


Summary of transformations
Use plots of residuals vs. X or Ŷ to determine the next step.

Log transform is your best friend (log(X), log(Y ), or both).

Add polynomial terms (e.g. X²) to get nonlinear functions (see the sketch below).


▶ Use statistical tests to back up your choices.
▶ Careful with extrapolation.

Be careful to get the interpretation correct after transforming.


▶ You can’t use R2 to compare models under different
transformations of Y .
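A sketch of the polynomial-term idea above, reusing the pickup example (any response/covariate pair works the same way); I() is patsy's way to protect arithmetic inside a formula:

import statsmodels.formula.api as smf

quad = smf.ols("price ~ year + I(year**2)", data=pickup).fit()
print(quad.summary())   # check whether the quadratic coefficient is significant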

What should the ei look like?

If the SLR model is true, it turns out that:

ei ∼ N (0, σ²[1 − hi ]),   where   hi = 1/n + (Xi − X̄)² / Σ_{j=1}^n (Xj − X̄)².

The hi term is referred to as the ith observation’s leverage:


▶ It is that point’s share of the data (1/n) plus its
proportional contribution to variability in X.

Notice that as n → ∞, hi → 0 and residuals ei “obtain” the same distribution as the unknown errors εi , i.e., ei ∼ N (0, σ²).

—————————————
See handout on course page for derivations.
Studentized residuals
Since ei ∼ N (0, σ²[1 − hi ]), we know that

ei / (σ √(1 − hi )) ∼ N (0, 1).

We thus define a Studentized residual as

ri = ei / (s−i √(1 − hi )),

where s²−i = (1/(n − p − 1)) Σ_{j≠i} e_j² is σ̂² calculated without ei .

Studentized residuals are used to detect outliers and influential points.
Outliers and Studentized residuals
Since the studentized residuals should be ≈ N (0, 1), we
should be concerned about any ri outside of about [−3, 3].

These aren’t hard and fast cutoffs. As n gets bigger, we will expect to
see some very rare events (big εi ) and not get worried unless |ri | > 4.
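A sketch of flagging candidate outliers this way, assuming fit is a statsmodels OLS results object:

import numpy as np

r = fit.get_influence().resid_studentized_external   # studentized residuals r_i
suspects = np.where(np.abs(r) > 3)[0]                # row indices worth a closer look
print(suspects, r[suspects])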
How to deal with outliers
When should you delete outliers?
▶ Only when you have a really good reason!

There is nothing wrong with running a regression with and without potential outliers to see whether results are significantly impacted.

Any time outliers are dropped, the reasons for doing so should
be clearly noted.
▶ I maintain that both a statistical and a non-statistical
reason are required. (What?)

Normality and studentized residuals
A more subtle issue is the normality of the distribution on ε.

We can look at the residuals to judge normality if n is big enough (say > 20; less than that makes it too hard to call).

In particular, if we have decent size n, we want the shape of the studentized residual distribution to “look” like N (0, 1).

The most obvious tactic is to look at a histogram of ri .

For example, consider the residuals from a regression of Rent
on SqFt which ignores houses with ≥ 2000 sqft.

Assessing normality via Q-Q plots
Higher fidelity diagnostics are provided by normal Q-Q plots.

Q-Q stands for quantile-quantile:


▶ plot the sample quantiles (e.g. 10th percentile, etc.)
▶ against true percentiles from a N (0, 1) distribution (e.g.
−1.96 is the true 2.5% quantile).

If ri ∼ N (0, 1), these quantiles should be equal
▶ i.e., they lie on a line through 0 with slope 1

statsmodels has a function for normal Q-Q plots:
import statsmodels.api as sm
import matplotlib.pyplot as plt

# assuming `model` is a fitted OLS results object
residuals = model.get_influence().resid_studentized_external
sm.qqplot(residuals, line="45")   # 45-degree reference line (slope 1 through 0)
plt.title("Q-Q Plot of Studentized Residuals")
plt.show()

Example: recall our pickup data regression of price on year.

Our go-to suite of three diagnostic plots tells us that:


▶ Data are more curved than straight (i.e. line doesn’t fit).
▶ Residuals are skewed to the right.
▶ There is a huge positive ei for an old “classic” truck.

Residual diagnostics for MLR
Consider the residuals from the sales data:
[Figure: residual scatterplots against the fitted values, P1, and P2]

We use the same residual diagnostics (scatterplots, QQ, etc).


▶ Plot raw residuals against Ŷ to see overall fit.
▶ Compare e against each X to identify problems.

Diagnosing the problem and finding a solution involves looking at lots of residual plots (against different Xj ’s), as sketched below.
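A sketch of that step for this example; the DataFrame name sales_df and the fitted results object fit are assumed names:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
axes[0].scatter(fit.fittedvalues, fit.resid)    # residuals vs fitted values
axes[0].set_xlabel("fitted")
for ax, name in zip(axes[1:], ["P1", "P2"]):
    ax.scatter(sales_df[name], fit.resid)       # residuals vs each covariate
    ax.set_xlabel(name)
axes[0].set_ylabel("residuals")
plt.show()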
For example, the sales, P1, and P2 variables were
pre-transformed from raw values to a log scale.

On the original scale, things don’t look so good:

[Figure: residual scatterplots against the fitted values, exp(P1), and exp(P2) on the original scale]
In particular, the studentized residuals are heavily right skewed.
(“studentizing” is the same, but leverage is now distance in d-dim.)
[Figure: histogram of the studentized residuals (x-axis: Studentized Residuals, y-axis: Frequency), heavily right skewed]

▶ Our log-log transform fixes this problem.
Wrapping up
Use the three go-to diagnostic plots to check assumptions
▶ Plot residuals vs. X or Ŷ to determine next step

Think about the correct scale for linearity


▶ Use polynomials for nonlinearities
▶ log() transform is your best friend, gives elasticities
▶ Always pay attention to interpretation!
▶ You can’t use R2 to compare models under different
transformations of Y !

Glossary and Equations
Leverage is hi = 1/n + (Xi − X̄)² / Σ_{j=1}^n (Xj − X̄)²

Studentized residuals are ri = ei / (s−i √(1 − hi )), approximately ∼ N (0, 1)

Elasticity is the slope in a log-log model: β1 ≈ d%Y / d%X.

(See handout on course website.)
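Both quantities are available directly from statsmodels; a short sketch, assuming fit is an OLS results object:

infl = fit.get_influence()
h = infl.hat_matrix_diag                 # leverage h_i
r = infl.resid_studentized_external      # studentized residuals r_i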

Glossary and Equations

F-test
▶ H0 : β(dbase+1) = β(dbase+2) = · · · = β(dfull) = 0.
▶ H1 : at least one βj ̸= 0 among those tested.
▶ Null hypothesis distribution:
▶ f = [R²/(p − 1)] / [(1 − R²)/(n − p)] ∼ F(p−1, n−p)
