
Diagnostics and Transformations

DSO-530

F-testing, multicollinearity,
standardized and studentized residuals,
outliers and leverage, nonconstant variance,
non-normality, nonlinearity, and transformations

Mladen Kolar ([email protected])


The F-test
The F-test tries to formalize the idea of a big R².

The test statistic is

f = [SSR/(p − 1)] / [SSE/(n − p)] = [R²/(p − 1)] / [(1 − R²)/(n − p)]

If f is big, then the regression is “worthwhile”:


▶ Big SSR relative to SSE?
▶ R2 close to one?

What we are really testing:

H0 : β1 = β2 = · · · = βd = 0
H1 : at least one βj ̸= 0.

Hypothesis testing only gives a yes/no answer.


▶ Which βj ̸= 0?
▶ How many?

The test is contained in the .summary() for any MLR fit.

smf.ols("price ~ make", data=pickup).fit().summary()

R-squared: 0.021
F-statistic: 0.4628
Prob (F-statistic): 0.633

smf.ols("price ~ make + miles",


data=pickup).fit().summary()

R-squared: 0.446
F-statistic: 11.25
Prob (F-statistic): 1.51e-05
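As a sanity check, the reported F-statistic can be recomputed from R² with the formula above. A minimal sketch, assuming the pickup DataFrame from these slides is loaded:

import statsmodels.formula.api as smf

fit = smf.ols("price ~ make + miles", data=pickup).fit()
f = (fit.rsquared / fit.df_model) / ((1 - fit.rsquared) / fit.df_resid)
print(f, fit.fvalue, fit.f_pvalue)   # f should match fit.fvalue (about 11.25)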

Multicollinearity
Multicollinearity refers to strong linear dependence between
some of the covariates in a multiple regression model.

The usual marginal effect interpretation is lost:


▶ a change in one X variable comes along with changes in the others.

Coefficient standard errors will be large, such that multicollinearity leads to large uncertainty about the bj ’s.

Example: how employee ratings of their supervisor relate to
performance metrics.

The Data:
Y: Overall rating of supervisor
X1: Opportunity to learn new things
X2: Does not allow special privileges
X3: Raises based on performance

Suppose that you regress Y onto X1 and X2 = 10 × X1 .

Then

E[Y |X1 , X2 ] = β0 + β1 X1 + β2 X2 = β0 + β1 X1 + β2 (10X1 )

and the marginal effect of X1 on Y is

∂E[Y |X1 , X2 ] / ∂X1 = β1 + 10β2

▶ X1 and X2 do not act independently!

Multicollinearity is not a big problem in and of itself; you just
need to know that it is there.

If you recognize multicollinearity:


▶ Understand that the βj are not true marginal effects.
▶ Consider dropping variables to get a simpler model.
▶ Expect to see big standard errors on your coefficients
(i.e., your coefficient estimates are unstable).
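One way to quantify how strong the collinearity is (a sketch, not from the slides) uses variance inflation factors from statsmodels; the DataFrame name df and the columns X1–X3 below are assumed placeholders for the covariates:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["X1", "X2", "X3"]])          # design matrix with intercept
vif = {col: variance_inflation_factor(X.values, i)   # VIF_j = 1 / (1 - R_j^2)
       for i, col in enumerate(X.columns)}
print(vif)                                           # values above ~5-10 signal strong collinearity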

Recall model assumptions

Y |X ∼ N (β0 + β1 X1 + · · · + βd Xd , σ²)

Key assumptions of our linear regression model:


(i) The conditional mean of Y is linear in X.
(ii) The additive errors (deviations from line)
▶ are Normally distributed
▶ independent from each other
▶ identically distributed (i.e., they have constant variance)

Inference and prediction rely on this model being true!

If the model assumptions do not hold, then all bets are off:
▶ prediction can be systematically biased
▶ standard errors and confidence intervals are wrong
(but how wrong?)

We will focus on using graphical methods (plots!) to detect violations of the model assumptions.

You’ll see that


▶ It is more of an art than a science,
▶ but it is grounded in mathematics.

Plotting e vs Ŷ is your #1 tool for finding fit problems.

Why?
▶ Because it gives a quick visual indicator of whether or not
the model assumptions are true.

What should we expect to see if they are true?


1. Each εi has the same variance (σ²).
2. Each εi has the same mean (0).
3. The εi collectively have the same Normal distribution.

Remember: Ŷ is made from X in SLR and MLR, so one plot summarizes across the X. [more on MLR later]
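A minimal sketch of this plot, assuming fit is any fitted statsmodels OLS results object:

import matplotlib.pyplot as plt

plt.scatter(fit.fittedvalues, fit.resid)       # e versus Y-hat
plt.axhline(0, linestyle="--", color="gray")   # residuals should scatter evenly around 0
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()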

How do we check these?

Well, the true εi residuals are unknown, so we must look instead at the least squares estimated residuals.

▶ We estimate Yi = b0 + b1 Xi + ei , such that the sample least squares regression residuals are ei = Yi − Ŷi .

What should the ei look like if the SLR/MLR model is true?

Nonconstant variance
One of the most common violations (problems?) in real data
▶ E.g. A trumpet shape in the scatterplot

We can try to stabilize the variance . . . or do robust inference

Variance stabilizing transformations
This is one of the most common model violations; luckily, it is
usually fixable by transforming the response (Y ) variable.

log(Y ) is the most common variance stabilizing transform.


▶ If Y has only positive values (e.g. sales) or is a count
(e.g. # of customers), take log(Y ) (always natural log).

In general, think about the scale on which you expect linearity.

Reconsider the regression of truck price onto year, after
removing trucks older than 1993 (truck[year>1992,]).
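A sketch of the same subset-and-transform step in Python, assuming the trucks live in the pickup DataFrame with price and year columns (truck[year>1992,] is the R notation for the same subset):

import numpy as np
import statsmodels.formula.api as smf

recent = pickup[pickup["year"] > 1992]                         # drop trucks older than 1993
raw_fit = smf.ols("price ~ year", data=recent).fit()           # original scale
log_fit = smf.ols("np.log(price) ~ year", data=recent).fit()   # variance-stabilizing log(Y)
# compare the residual-vs-fitted plots of raw_fit and log_fit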

Warning: be careful when interpreting the transformed model.

If E[log(Y )|X] = b0 + b1 X, then E[Y |X] ≈ e^(b0) · e^(b1 X).


We have a multiplicative model now!

Also, you cannot compare R² values for regressions corresponding to different transformations of the response.
▶ Y and f (Y ) may not be on the same scale,
▶ therefore var(Y ) and var(f (Y )) may not be either.

Look at residuals to see which model is better.

The log-log model
The other common covariate transform is log(X).
▶ When X-values are bunched up, log(X) helps spread
them out and reduces the leverage of extreme values.
▶ Recall that both reduce s_b1 (the standard error of b1).

In practice, this is often used in conjunction with a log(Y ) response transformation.
▶ The log-log model is

log(Y ) = β0 + β1 log(X) + ε.

▶ It is super useful, and has some special properties ...

Recall that
▶ log is always natural log, with base e = 2.718 . . ., and
▶ log(ab) = log(a) + log(b)
▶ log(a^b) = b log(a).

Consider the multiplicative model E[Y |X] = A · X^B.

Take logs of both sides to get

log(E[Y |X]) = log(A) + log(X^B) = log(A) + B log(X) ≡ β0 + β1 log(X).

The log-log model is appropriate whenever things are linearly related on a multiplicative, or percentage, scale.
(See handout on Brightspace.)
Consider a country’s GDP as a function of IMPORTS:
▶ Since trade multiplies, we might expect to see %GDP
increase with %IMPORTS.

Elasticity and the log-log model
In a log-log model, the slope β1 is sometimes called elasticity.

An elasticity is (roughly) the % change in Y per 1% change in X:

β1 ≈ d%Y / d%X

For example, economists often assume that GDP has import elasticity of 1. Indeed:
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 1.8915 0.343 5.520 0.000 1.183 2.600
np.log(IMPORTS) 0.9693 0.088 11.007 0.000 0.787 1.152

(Can we test whether the elasticity is 1?)
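One quick way to answer that question: form a t-statistic for H0 : β1 = 1 from the estimate and standard error reported above. A back-of-the-envelope sketch:

b1, se = 0.9693, 0.088
t = (b1 - 1) / se      # about -0.35: well inside +/- 2, so an elasticity of 1 is plausible
print(t)
# (statsmodels results objects also have a .t_test() method for tests like this)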


Price elasticity
In marketing, the slope coefficient β1 in the regression

log(sales) = β0 + β1 log(price) + ε

is called price elasticity:


▶ the % change in sales per 1% change in price.

The model implies that E[sales|price] = A · price^β1 , such that β1 is the constant rate of change.

————
Economists have “demand elasticity” curves, which are just more
general and harder to measure.

Example: we have Nielsen SCANTRACK data on supermarket
sales of a canned food brand produced by Consolidated Foods.

Run the regression to determine price elasticity:
smf.ols("np.log(Sales) ~ np.log(Price)", data=confood).fit().summary()

coef std err t P>|t| [0.025 0.975]


---------------------------------------------------------------------------------
Intercept 4.8029 0.174 27.534 0.000 4.453 5.153
np.log(Price) -5.1477 0.510 -10.097 0.000 -6.172 -4.124

Sales decrease by about 5% for every 1% price increase.


Summary of transformations
Use plots of residuals vs. X or Ŷ to determine the next step.

Log transform is your best friend (log(X), log(Y ), or both).

Add polynomial terms (e.g. X²) to get nonlinear functions (see the sketch below).


▶ Use statistical tests to back up your choices.
▶ Careful with extrapolation.

Be careful to get the interpretation correct after transforming.


▶ You can’t use R2 to compare models under different
transformations of Y .
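A sketch of the polynomial-term idea above, reusing the pickup example (any response/covariate pair works the same way); I() is patsy's way to protect arithmetic inside a formula:

import statsmodels.formula.api as smf

quad = smf.ols("price ~ year + I(year**2)", data=pickup).fit()
print(quad.summary())   # check whether the quadratic coefficient is significant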

What should the ei look like?

If the SLR model is true, it turns out that:

ei ∼ N (0, σ²[1 − hi ]),   where   hi = 1/n + (Xi − X̄)² / Σ_{j=1}^n (Xj − X̄)².

The hi term is referred to as the ith observation’s leverage:


▶ It is that point’s share of the data (1/n) plus its
proportional contribution to variability in X.

Notice that as n → ∞, hi → 0 and residuals ei “obtain” the same distribution as the unknown errors εi , i.e., ei ∼ N (0, σ²).

—————————————
See handout on course page for derivations.
Studentized residuals
Since ei ∼ N (0, σ²[1 − hi ]), we know that

ei / (σ √(1 − hi )) ∼ N (0, 1).

We thus define a Studentized residual as

ri = ei / (s−i √(1 − hi )),

where s²−i = (1/(n − p − 1)) Σ_{j≠i} e_j² is σ̂² calculated without ei .

Studentized residuals are used to detect outliers and influential points.
Outliers and Studentized residuals
Since the studentized residuals should be ≈ N (0, 1), we
should be concerned about any ri outside of about [−3, 3].

These aren’t hard and fast cutoffs. As n gets bigger, we will expect to
see some very rare events (big εi ) and not get worried unless |ri | > 4.
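A sketch of flagging candidate outliers this way, assuming fit is a statsmodels OLS results object:

import numpy as np

r = fit.get_influence().resid_studentized_external   # studentized residuals r_i
suspects = np.where(np.abs(r) > 3)[0]                # row indices worth a closer look
print(suspects, r[suspects])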
How to deal with outliers
When should you delete outliers?
▶ Only when you have a really good reason!

There is nothing wrong with running a regression with and without potential outliers to see whether results are significantly impacted.

Any time outliers are dropped, the reasons for doing so should
be clearly noted.
▶ I maintain that both a statistical and a non-statistical
reason are required. (What?)

Normality and studentized residuals
A more subtle issue is the normality of the distribution on ε.

We can look at the residuals to judge normality if n is big enough (say > 20; less than that makes it too hard to call).

In particular, if we have decent size n, we want the shape of the studentized residual distribution to “look” like N (0, 1).

The most obvious tactic is to look at a histogram of ri .

For example, consider the residuals from a regression of Rent
on SqFt which ignores houses with ≥ 2000 sqft.

Assessing normality via Q-Q plots
Higher fidelity diagnostics are provided by normal Q-Q plots.

Q-Q stands for quantile-quantile:


▶ plot the sample quantiles (e.g. 10th percentile, etc.)
▶ against true percentiles from a N (0, 1) distribution (e.g.
−1.96 is the true 2.5% quantile).

If ri ∼ N (0, 1), these quantiles should be equal
▶ i.e., they lie on a line through 0 with slope 1

statsmodels has a function for normal Q-Q plots:
import statsmodels.api as sm
import matplotlib.pyplot as plt

# assuming `model` is a fitted OLS results object
residuals = model.get_influence().resid_studentized_external
sm.qqplot(residuals, line="45")   # 45-degree reference line (slope 1 through 0)
plt.title("Q-Q Plot of Studentized Residuals")
plt.show()

Example: recall our pickup data regression of price on year.

Our go-to suite of three diagnostic plots tells us that:


▶ Data are more curved than straight (i.e. line doesn’t fit).
▶ Residuals are skewed to the right.
▶ There is a huge positive ei for an old “classic” truck.

Residual diagnostics for MLR
Consider the residuals from the sales data:
[Figure: residual scatterplots against the fitted values, P1, and P2]

We use the same residual diagnostics (scatterplots, QQ, etc).


▶ Plot raw residuals against Ŷ to see overall fit.
▶ Compare e against each X to identify problems.

Diagnosing the problem and finding a solution involves looking at lots of residual plots (against different Xj ’s), as sketched below.
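A sketch of that step for this example; the DataFrame name sales_df and the fitted results object fit are assumed names:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
axes[0].scatter(fit.fittedvalues, fit.resid)    # residuals vs fitted values
axes[0].set_xlabel("fitted")
for ax, name in zip(axes[1:], ["P1", "P2"]):
    ax.scatter(sales_df[name], fit.resid)       # residuals vs each covariate
    ax.set_xlabel(name)
axes[0].set_ylabel("residuals")
plt.show()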
For example, the sales, P1, and P2 variables were
pre-transformed from raw values to a log scale.

On the original scale, things don’t look so good:

[Figure: residual scatterplots against the fitted values, exp(P1), and exp(P2) on the original scale]
In particular, the studentized residuals are heavily right skewed.
(“studentizing” is the same, but leverage is now distance in d-dim.)
[Figure: histogram of the studentized residuals (x-axis: Studentized Residuals, y-axis: Frequency), heavily right skewed]

▶ Our log-log transform fixes this problem.
Wrapping up
Use the three go-to diagnostic plots to check assumptions
▶ Plot residuals vs. X or Ŷ to determine next step

Think about the correct scale for linearity


▶ Use polynomials for nonlinearities
▶ log() transform is your best friend, gives elasticities
▶ Always pay attention to interpretation!
▶ You can’t use R2 to compare models under different
transformations of Y !

Glossary and Equations
Leverage is hi = 1/n + (Xi − X̄)² / Σ_{j=1}^n (Xj − X̄)²

Studentized residuals are ri = ei / (s−i √(1 − hi )), approximately ∼ N (0, 1)

Elasticity is the slope in a log-log model: β1 ≈ d%Y / d%X.

(See handout on course website.)
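Both quantities are available directly from statsmodels; a short sketch, assuming fit is an OLS results object:

infl = fit.get_influence()
h = infl.hat_matrix_diag                 # leverage h_i
r = infl.resid_studentized_external      # studentized residuals r_i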

Glossary and Equations

F-test
▶ H0 : β(dbase+1) = β(dbase+2) = · · · = β(dfull) = 0.
▶ H1 : at least one βj ̸= 0 among those tested.
▶ Null hypothesis distribution:
▶ f = [R²/(p − 1)] / [(1 − R²)/(n − p)] ∼ F(p−1, n−p)
