Regression
Analyses
SUN Chenshuo
Department of Information Systems & Analytics
Application
Advertising
Sales of a particular product in 200 different
markets
Advertising budgets in each of the markets, for
3 different media
TV, radio, and newspaper
Client: can control advertising budget, not sales
directly
A marketing plan to improve sales
Advertising
Sales (thousands of units)
Budget for TV, radio, newspaper in thousands of dollars
Questions
Is there a relationship between advertising budget and sales?
If so, how strong?
Given a budget, can we predict sales with high accuracy? Is
the relationship linear?
Which media contribute to sales?
All three? Just one or two?
How accurately can we estimate the effect of each
medium on sales?
Simple Linear Regression
Simple Linear Regression
Quantitative response
Y
Predictor variable X
Linear relationship: Y ≈ β0 + β1X
“regressing Y onto X”
sales ≈ β0 + β1TV
Simple Linear Regression
Y ≈ β0 + β1X
Model parameters or coefficients β0, β1
Training data used to estimate β̂0, β̂1
Prediction: ŷ = β̂0 + β̂1x
Simple Linear Regression:
Coefficient Estimates
& their Accuracy
Finding the Estimates
Training data: (x1, y1), (x2, y2), …, (xn, yn)
Find β̂0, β̂1 that fit the training data
well
Linear model: Line as close as
possible to the 200 points
Training: through Least Squares
Finding the Estimates
ŷ = β̂0 + β̂1x
i-th residual: ei = yi − ŷi
Residual Sum of Squares (RSS)
RSS = e1² + e2² + … + en²
Least Squares: Find β̂0, β̂1 that minimise RSS
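A minimal sketch of the least squares computation in NumPy, using synthetic data in place of the advertising dataset (all names and values here are illustrative, not the course data):

import numpy as np

rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)               # synthetic TV budgets
sales = 7.0 + 0.05 * tv + rng.normal(0, 3, 200)  # synthetic sales with noise

# Closed-form least squares estimates for simple linear regression
x_bar, y_bar = tv.mean(), sales.mean()
beta1_hat = np.sum((tv - x_bar) * (sales - y_bar)) / np.sum((tv - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

residuals = sales - (beta0_hat + beta1_hat * tv)
rss = np.sum(residuals ** 2)                     # the quantity being minimised
print(beta0_hat, beta1_hat, rss)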
[Figure: least squares fit for the regression of 'sales' onto 'TV'; β̂0 = 7.03, β̂1 = 0.0475]
Linear Regression
sales = 7.03 + 0.0475 TV
An additional $1,000 spent on TV
advertising is associated with approximately
47.5 additional units of the product sold
Unbiasedness
Using one set of observations, an estimate
may over/under estimate a parameter
Average of a large number of estimates
from a large number of datasets:
accurate estimate of parameter
No systematic over/under-estimation
Least squares estimates are unbiased
Accuracy of Coefficient Estimates
Least Square Estimates are unbiased
Average over many datasets will be
accurate
A single estimate may over/under-
estimate
By how much?
Accuracy of Coefficient Estimates
Standard Error
Confidence Interval
P-Value
Standard Error
Confidence Intervals
95% confidence interval: range of values
With 95% probability the range will
contain the true unknown parameter
value
[β̂0 − 2·SE(β̂0), β̂0 + 2·SE(β̂0)]
[β̂1 − 2·SE(β̂1), β̂1 + 2·SE(β̂1)]
Confidence Intervals
sales = 7.03 + 0.0475 TV
β0 : [6.130, 7.935]
β1 : [0.042, 0.053]
In the absence of any advertising, sales will,
on average, be between 6,130 and 7,935 units.
For each $1,000 increase in TV advertising, there will be
an average increase in sales of between 42 and 53 units.
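These intervals can be read off a fitted model; a sketch with statsmodels on the synthetic data from the earlier snippet (variable names are assumptions, not the course's code):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)
sales = 7.0 + 0.05 * tv + rng.normal(0, 3, size=200)

X = sm.add_constant(tv)            # adds the intercept column for beta0
model = sm.OLS(sales, X).fit()     # ordinary least squares fit

print(model.params)                # beta0_hat, beta1_hat
print(model.bse)                   # standard errors of the estimates
print(model.conf_int(alpha=0.05))  # 95% confidence intervals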
Statistical Hypothesis Testing
State Null and Alternative Hypotheses mathematically
Select relevant test statistic
Obtain distribution of test statistic under the null hypothesis
Select significance level: probability threshold below which null
hypothesis is rejected (usually 0.01 or 0.05)
Compute observed value of test statistic from data
P-value: probability, under the null hypothesis, of the observed value
(and more extreme values)
Reject null hypothesis if p-value is less than significance level
Hypothesis Test
Null Hypothesis
There is no relationship between X & Y
H0 : β1 = 0
Alternative Hypothesis
There is some relationship between X & Y
H1 : β1 ≠ 0
Hypothesis Test
t-statistic: t = β̂1 / SE(β̂1)
Is our estimate sufficiently far from zero, so we can be confident of
rejecting the null hypothesis?
If SE is low, then smaller non-zero estimates may be sufficient
If SE is high, then larger estimates are required
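A sketch of the computation, with a made-up slope estimate and standard error; the two-sided p-value uses the t distribution with n − 2 degrees of freedom:

import numpy as np
from scipy import stats

beta1_hat = 0.0475   # hypothetical slope estimate
se_beta1 = 0.0027    # hypothetical standard error
n = 200              # number of observations

t = beta1_hat / se_beta1                    # t-statistic under H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
print(t, p_value)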
Hypothesis Test
Small p-value (< 0.05)
Assuming the null hypothesis (i.e. in the absence
of any real association between X and
Y)
Unlikely to observe such a substantial
association (between X and Y) due to chance
Reject the null hypothesis
Accuracy of Coefficient Estimates
Standard Error
Confidence Interval
P-Value
Simple Linear Regression:
Model Fit
Accuracy of the model
Y ≈ β0 + β1X
Y = g(X) + ϵ
Y = β0 + β1X + ϵ
Even if we knew the true coefficients, we may not
have accurate predictions due to errors (that
have not been modelled)
Model Fit
Residual Standard Error (RSE)
Estimate of the standard deviation of the error
RSE = √( RSS / (n − 2) )
RSS = e1² + e2² + … + en² = ∑i (yi − ŷi)²
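A quick check on synthetic data (same assumed setup as the earlier snippets): the RSE should recover the noise standard deviation used to generate the data.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 300, 200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, 200)   # true error sd is 3

# Closed-form simple linear regression fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y - y_hat) ** 2)
rse = np.sqrt(rss / (len(y) - 2))   # estimate of the error standard deviation
print(rse)                          # should be close to 3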
Model Fit
R² = 1 − RSS/TSS = 1 − ∑i (yi − ŷi)² / ∑i (yi − ȳ)²
TSS: total variance in Y (before regression)
RSS: variance left unexplained (after regression)
R-squared: proportion of variability in Y that can be
explained using X
ȳ: mean of the yi
Model Fit
R-squared: proportion of variability in Y that can be
explained using X
~1: Good fit, large proportion explained
~0: Regression did not explain much variability
Linear model wrong and/or inherent error high
Model Fit
sales = 7.03 + 0.0475 TV
R-squared: 0.61
61% of variability in sales is
explained by regression on
TV
Model Fit
For simple linear regression
R-squared = squared linear correlation
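A sketch verifying this identity on synthetic data (same assumed setup as the earlier snippets):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 300, 200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, 200)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
corr = np.corrcoef(x, y)[0, 1]   # sample linear correlation
print(r2, corr ** 2)             # equal for simple linear regression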
Simple Linear Regression
sales ≈ β0 + β1TV
[Figure: least squares fit of 'sales' onto 'TV', annotated with β̂0 and β̂1]
Least Squares Estimates
Accuracy of Estimates: SE, Confidence Intervals, P-value
Model Fit: RSE, R-squared
Multiple Linear Regression
Multiple Predictors
[Figure: three scatterplots of Sales against TV, Radio, and Newspaper budgets]
Multiple Linear Regression
Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ
βj : average effect on Y of one unit increase
in Xj holding all other predictors fixed
sales = β0 + β1TV + β2radio + β3newspaper + ϵ
Regression Coefficients
ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂pxp
Least Squares Estimate
Find coefficients β̂0, …, β̂p to minimise RSS:
RSS = ∑i (yi − ŷi)²
Multiple regression: average effect of Xj keeping other predictors fixed
Simple regression: average effect of Xj ignoring other predictors
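A sketch of the multiple least squares fit via NumPy's lstsq, on synthetic stand-ins for the three media budgets (all names and true coefficients are made up):

import numpy as np

rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.7, n)  # newspaper has no true effect

# Design matrix with an intercept column; lstsq minimises the RSS
X = np.column_stack([np.ones(n), tv, radio, newspaper])
coefs, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coefs)   # beta0_hat ... beta3_hat; newspaper's coefficient should be near 0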
Predictors -> Response
Is at least one of the predictors useful in
predicting the response?
Hypothesis Test
Null hypothesis
β1 = β2 = … = βp = 0
Alternative hypothesis
At least one βj is non-zero
Hypothesis Test
F = [ (TSS − RSS)/p ] / [ RSS/(n − p − 1) ]
TSS = ∑i (yi − ȳ)², RSS = ∑i (yi − ŷi)²
F-statistic
Under the null, F is expected to be close to 1; for large n, an F-statistic even modestly above 1 is evidence against the null
F-statistic: 570.3
P-value: significant
At least one of the 3 media budgets has an effect on sales
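Computing F directly from TSS and RSS, continuing the synthetic multiple-regression sketch above; the p-value comes from the F distribution with (p, n − p − 1) degrees of freedom:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 3
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.7, n)

X = np.column_stack([np.ones(n), tv, radio, newspaper])
coefs, *_ = np.linalg.lstsq(X, sales, rcond=None)
y_hat = X @ coefs

rss = np.sum((sales - y_hat) ** 2)
tss = np.sum((sales - sales.mean()) ** 2)
F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)   # upper-tail probability
print(F, p_value)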
Model Fit
RSE
R-squared
Predictors              Response   R-squared   RSE
TV                      Sales      0.61        3.26
TV, Radio               Sales      0.89719     1.681
TV, Radio, Newspaper    Sales      0.8972      1.686
Linear Models:
Assumptions & Limitations
Linear Model: Assumptions
Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ
Additive
Linear
Linear Model: Assumptions
Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ
Additive
Effect of a change in a predictor on the
response is independent of other predictor
values
Linear
Change in response due to a unit change in a
predictor is constant, irrespective of the
value of the predictor
Non-linear Relations
Y = β0 + β1X1 + β2X2 + β3X1² + β4X2³ + …
Polynomial Regression
Model is linear in terms of coefficients
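A sketch of why this still fits with least squares: the polynomial terms are just extra columns in the design matrix, so the model stays linear in the coefficients (synthetic single-predictor data, made-up names):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 1, 200)   # quadratic ground truth

# The squared term is just another column; least squares applies unchanged
X = np.column_stack([np.ones_like(x), x, x**2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)   # should be near [1.0, 2.0, -0.5]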
Potential Problems
Non-linearity of the data
Correlation of error terms
Outliers
Collinearity
Non-linearity
Conclusions from Linear Model unreliable
Predictions may be inaccurate
Binary
Classification
Classification
Will a credit card customer
default on his/her credit card
payment?
Data: annual income, monthly credit card
balance, previous defaulters
Classification
Outcome: Default, binary (Yes/No)
Predictor: Balance
Pr( default=Yes | balance )
Regression?
[Figure: probability of default vs. balance, two fits compared]
Left: linear regression -> negative probabilities!
Right: all probabilities lie between 0 and 1
Logistic Regression
odds: p(Y) / (1 − p(Y)) = e^(β0 + β1X) = e^(β0) · e^(β1X)
logit / log-odds: log( p(Y) / (1 − p(Y)) ) = β0 + β1X
Logistic function (sigmoid) lies between 0 and 1
A unit increase in X changes the log-odds by β1, or multiplies the odds by e^(β1)
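A numerical illustration of the odds interpretation, with made-up coefficients:

import numpy as np

def sigmoid(z):
    # logistic function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -10.0, 0.005   # illustrative coefficients, not fitted values
x = 1500.0

p1 = sigmoid(beta0 + beta1 * x)
p2 = sigmoid(beta0 + beta1 * (x + 1))   # one-unit increase in x
odds1 = p1 / (1 - p1)
odds2 = p2 / (1 - p2)
print(odds2 / odds1, np.exp(beta1))     # the odds ratio equals e^beta1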
Regression Coefficients
Likelihood: ℓ(β0, β1) = ∏ i:yi=1 p(xi) · ∏ j:yj=0 (1 − p(xj))
Find coefficients that maximise likelihood
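In practice the likelihood is maximised numerically. A sketch with scikit-learn on synthetic default-style data (names, sizes, and coefficients are all assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=1000)
p_true = 1 / (1 + np.exp(-(-10.0 + 0.005 * balance)))   # assumed true model
default = rng.random(1000) < p_true                     # simulated Yes/No outcomes

# Very large C => negligible regularisation, i.e. close to plain maximum likelihood
clf = LogisticRegression(C=1e6, max_iter=1000).fit(balance.reshape(-1, 1), default)
print(clf.intercept_, clf.coef_)   # estimates of beta0, beta1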
Multiple Logistic Regression
log( p(Y) / (1 − p(Y)) ) = β0 + β1X1 + … + βpXp
Prediction:
p̂(y) = e^(β̂0 + β̂1x1 + … + β̂pxp) / (1 + e^(β̂0 + β̂1x1 + … + β̂pxp))
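The prediction is just the sigmoid of the fitted linear combination; a sketch checking this against scikit-learn's predict_proba (same synthetic setup as above):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=1000).reshape(-1, 1)
default = rng.random(1000) < 1 / (1 + np.exp(-(-10.0 + 0.005 * balance[:, 0])))
clf = LogisticRegression(C=1e6, max_iter=1000).fit(balance, default)

x_new = np.array([[1800.0]])                 # a new balance value
z = clf.intercept_ + x_new @ clf.coef_.T     # beta0_hat + beta1_hat * x
p_manual = 1 / (1 + np.exp(-z))              # the formula above
print(p_manual.ravel(), clf.predict_proba(x_new)[:, 1])   # these agree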
Decision Boundary
Logistic Regression
Target Variable: probabilities
Statistical Models
Linear Models
Linear Regression
Logistic Regression - Classification
Key Idea: Approximate Data with a Hyperplane
Evaluate fit of the model
Accuracy of model parameter estimates
Classification with 3-NN
Important