05 Diagnostics
DSO-530
F-testing, multicollinearity,
standardized and studentized residuals,
outliers and leverage, nonconstant variance,
non-normality, nonlinearity, and transformations
f = [SSR/(p − 1)] / [SSE/(n − p)] = [R²/(p − 1)] / [(1 − R²)/(n − p)]
What we are really testing:
H0 : β1 = β2 = · · · = βd = 0
H1 : at least one βj ̸= 0.
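The F-statistic above can be computed directly from R²; a minimal sketch (the numbers fed in below are made up for illustration, not from the pickup data):

```python
from scipy.stats import f as f_dist

def overall_f_test(r2, n, p):
    """Overall F-statistic for H0: beta_1 = ... = beta_d = 0,
    from R^2, sample size n, and p coefficients (d slopes + intercept)."""
    f = (r2 / (p - 1)) / ((1 - r2) / (n - p))
    p_value = f_dist.sf(f, p - 1, n - p)  # P(F_{p-1, n-p} > f)
    return f, p_value

# Hypothetical example: R^2 = 0.5, n = 12 observations, one slope (p = 2).
f, pv = overall_f_test(0.5, n=12, p=2)
print(f)  # 10.0
```

Note that f grows both with R² and with the degrees of freedom n − p, which is why a modest R² can still be highly significant in a large sample.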
smf.ols("price ~ make", data=pickup).fit().summary()
  R-squared: 0.021,  F-statistic: 0.4628,  Prob (F-statistic): 0.633

  R-squared: 0.446,  F-statistic: 11.25,  Prob (F-statistic): 1.51e-05
Multicollinearity
Multicollinearity refers to strong linear dependence between
some of the covariates in a multiple regression model.
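One standard way to quantify this linear dependence is the variance inflation factor (VIF) of each covariate; a numpy-only sketch on simulated data (x1, x2, x3 are made up, with x2 nearly a copy of x1):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the others."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                  # unrelated
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])  # x1 and x2 large, x3 near 1
```

A common rule of thumb is to look closely at any covariate with VIF above about 10.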
Example: how employee ratings of their supervisor relate to
performance metrics.
The Data:
Y: Overall rating of supervisor
X1: Opportunity to learn new things
X2: Does not allow special privileges
X3: Raises based on performance
Suppose that you regress Y onto X1 and X2 = 10 × X1 .
Then
∂E[Y | X1, X2] / ∂X1 = β1 + 10β2.
Only the combination β1 + 10β2 is identified; the individual slopes β1 and β2 are not.
Multicollinearity is not a big problem in and of itself; you just
need to know that it is there.
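To see why: a small simulation (all numbers made up) where X2 = 10 · X1 exactly. The individual slopes are not identified, but the fitted values, and hence predictions, are unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.5, size=50)

# Model A: y on x1 alone.
A = np.column_stack([np.ones(50), x1])
beta_a, *_ = np.linalg.lstsq(A, y, rcond=None)

# Model B: add X2 = 10 * X1, an exact linear copy of x1.
B = np.column_stack([np.ones(50), x1, 10 * x1])
beta_b, *_ = np.linalg.lstsq(B, y, rcond=None)

# The fitted values agree, and so does the combination beta1 + 10*beta2.
print(np.allclose(A @ beta_a, B @ beta_b))                 # True
print(np.isclose(beta_a[1], beta_b[1] + 10 * beta_b[2]))   # True
```

So predictions survive multicollinearity; it is the individual coefficient estimates and their standard errors that suffer.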
Recall model assumptions
Y | X ∼ N(β0 + β1 X1 + · · · + βd Xd , σ²)
Inference and prediction rely on this model being true!
If the model assumptions do not hold, then all bets are off:
▶ prediction can be systematically biased
▶ standard errors and confidence intervals are wrong
(but how wrong?)
Plotting e vs Ŷ is your #1 tool for finding fit problems.
Why?
▶ Because it gives a quick visual indicator of whether or not
the model assumptions are true.
How do we check these?
Nonconstant variance
One of the most common assumption violations in real data.
▶ E.g. a trumpet shape in the residual scatterplot
Variance stabilizing transformations
This is one of the most common model violations; luckily, it is
usually fixable by transforming the response (Y ) variable.
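A sketch of why logging Y helps, on simulated data with multiplicative errors (all values made up): on the raw scale the residual spread grows with the fitted values, while on the log scale it is roughly constant:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 200)
# Multiplicative errors: the spread of y grows with its mean (trumpet shape).
y = np.exp(0.5 + 0.3 * x + rng.normal(scale=0.2, size=200))

def resid_spread(response):
    """Fit a line by least squares; compare residual spread in the
    lower vs. upper half of the x range."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    e = response - X @ beta
    return e[:100].std(), e[100:].std()

lo, hi = resid_spread(y)                   # raw scale: spread grows with x
lo_log, hi_log = resid_spread(np.log(y))   # log scale: roughly constant
print(hi / lo, hi_log / lo_log)
```

The first ratio is well above 1 (the trumpet), the second close to 1 (stabilized variance).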
Reconsider the regression of truck price onto year, after
removing trucks older than 1993 (truck[year>1992,]).
Warning: be careful when interpreting the transformed model.
The log-log model
The other common covariate transform is log(X).
▶ When X-values are bunched up, log(X) helps spread
them out and reduces the leverage of extreme values.
▶ Recall that both reduce s_b1 , the standard error of the slope.
log(Y ) = β0 + β1 log(X) + ε.
Recall that
▶ log is always natural log, with base e = 2.718 . . ., and
▶ log(ab) = log(a) + log(b)
▶ log(a^b) = b · log(a).
Elasticity and the log-log model
In a log-log model, the slope β1 is sometimes called elasticity.
β1 ≈ d%Y / d%X
log(sales) = β0 + β1 log(price) + ε
————
Economists have “demand elasticity” curves, which are just more
general and harder to measure.
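As a concrete check of the elasticity interpretation (the slope value below is hypothetical, not from any fitted model):

```python
beta1 = -2.0  # hypothetical elasticity: a 1% price rise -> ~2% sales drop

# In the log-log model, log(sales) shifts by beta1 * log(1.01) when price
# rises by 1%, so sales are multiplied by 1.01 ** beta1.
sales_ratio = 1.01 ** beta1
pct_change = (sales_ratio - 1) * 100
print(round(pct_change, 2))  # close to -2, as the elasticity predicts
```

The approximation β1 ≈ d%Y / d%X is exact only in the limit of small percentage changes, which is why the computed change is near, but not exactly, −2%.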
Example: we have Nielsen SCANTRACK data on supermarket
sales of a canned food brand produced by Consolidated Foods.
Run the regression to determine price elasticity:
import numpy as np
import statsmodels.formula.api as smf

smf.ols("np.log(Sales) ~ np.log(Price)", data=confood).fit().summary()
What should the ei look like?
e_i ∼ N(0, σ²[1 − h_i]),   h_i = 1/n + (X_i − X̄)² / Σ_j (X_j − X̄)².
—————————————
See handout on course page for derivations.
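The leverage formula can be checked numerically against the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ; a numpy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=30)
X = np.column_stack([np.ones(30), x])  # intercept + one covariate

# Leverage h_i is the i-th diagonal entry of the hat matrix.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h_hat = np.diag(H)

# Simple-regression shortcut from the slide.
h_formula = 1 / 30 + (x - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()

print(np.allclose(h_hat, h_formula))  # True
```

A useful side fact: the leverages always sum to p, the number of coefficients (here 2), so the average leverage is p/n.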
Studentized residuals
Since e_i ∼ N(0, σ²[1 − h_i]), we know that
e_i / (σ √(1 − h_i)) ∼ N(0, 1).
Outliers and Studentized residuals
Since the studentized residuals should be ≈ N (0, 1), we
should be concerned about any ri outside of about [−3, 3].
These aren’t hard and fast cutoffs. As n gets bigger, we expect to
see some very rare events (big εi ), so we might not worry unless |ri | > 4.
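A sketch of flagging outliers with externally studentized residuals on simulated data (the data and the planted outlier are made up); s_{−i} is obtained via the standard leave-one-out identity rather than refitting n times:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[0] += 8.0  # plant one gross outlier at observation 0

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)          # leverages
e = y - H @ y           # raw residuals

# Externally studentized residuals: estimate sigma^2 without observation i,
# using (n-p) s^2 = (n-p-1) s_{-i}^2 + e_i^2 / (1 - h_i).
p = 2
s2 = (e @ e) / (n - p)
s2_minus_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
r = e / np.sqrt(s2_minus_i * (1 - h))

print(np.where(np.abs(r) > 3)[0])  # the planted outlier should appear
```

Using s_{−i} instead of s matters: a big outlier inflates s and can partially mask itself in the ordinary standardized residual.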
How to deal with outliers
When should you delete outliers?
▶ Only when you have a really good reason!
Any time outliers are dropped, the reasons for doing so should
be clearly noted.
▶ I maintain that both a statistical and a non-statistical
reason are required. (What?)
Normality and studentized residuals
A more subtle issue is the normality of the distribution of ε.
For example, consider the residuals from a regression of Rent
on SqFt which ignores houses with ≥ 2000 sqft.
Assessing normality via Q-Q plots
Higher fidelity diagnostics are provided by normal Q-Q plots.
statsmodels has a function for normal Q-Q plots:
import statsmodels.api as sm
import matplotlib.pyplot as plt

# `model` is a fitted OLS results object from smf.ols(...).fit()
residuals = model.get_influence().resid_studentized_external
sm.qqplot(residuals, line='45')
plt.title("Q-Q Plot of Studentized Residuals")
plt.show()
Example: recall our pickup data regression of price on year
Residual diagnostics for MLR
Consider the residuals from the sales data:
[Figure: six residual scatterplots from the sales data, plotting residuals against the fitted values, P1, and P2.]
In particular, the studentized residuals are heavily right skewed.
(“Studentizing” is the same, but leverage is now distance in d dimensions.)
▶ Our log-log transform fixes this problem.
[Figure: histogram of studentized residuals, showing heavy right skew.]
Wrapping up
Use the three go-to diagnostic plots to check assumptions
▶ Plot residuals vs. X or Ŷ to determine the next step
Glossary and Equations
Leverage is h_i = 1/n + (X_i − X̄)² / Σ_j (X_j − X̄)².
Studentized residuals are r_i = e_i / (s_−i √(1 − h_i)), approximately ∼ N(0, 1).
Elasticity is the slope in a log-log model: β1 ≈ d%Y / d%X.
(See handout on course website.)
Glossary and Equations
F-test
▶ H0 : β_{dbase+1} = β_{dbase+2} = · · · = β_{dfull} = 0.
▶ H1 : at least one βj ̸= 0 for j > dbase.
▶ Null distribution: f = [R²/(p − 1)] / [(1 − R²)/(n − p)] ∼ F_{p−1, n−p}.