IS4242 W3: Regression Analyses

This document discusses using linear regression to analyze the relationship between advertising budgets and product sales across 200 markets. Specifically, it examines how TV, radio, and newspaper advertising budgets relate to sales. It introduces simple and multiple linear regression models. Key points covered include estimating regression coefficients, assessing the accuracy of estimates using standard errors, confidence intervals, and p-values, testing hypotheses about coefficients, and evaluating model fit using R-squared and residual standard error. The document also discusses the assumptions of linear models and their limitations.

Regression Analyses
SUN Chenshuo
Department of Information Systems & Analytics
Application: Advertising
Sales of a particular product in 200 different markets
Advertising budgets in each of the markets, for 3 different media: TV, radio and newspaper
Client: can control advertising budget, not sales directly
A marketing plan to improve sales
Advertising

Sales (thousands of units)

Budget for TV, radio, newspaper in thousands of dollars


Questions
Is there a relationship between advertising budget and sales?
If so, how strong?
Given a budget, can we predict sales with high accuracy?
Is the relationship linear?
Which media contribute to sales? All three? Just one or two?
How accurately can we estimate the effect of each medium on sales?
Simple Linear Regression
Quantitative response Y
Predictor variable X
Linear relationship: Y ≈ β0 + β1X
“regressing Y onto X”
sales ≈ β0 + β1 TV
Simple Linear Regression
Y ≈ β0 + β1X
Model parameters or coefficients: β0, β1
Training data used to estimate β̂0, β̂1
Prediction: ŷ = β̂0 + β̂1x
Simple Linear Regression: Coefficient Estimates & their Accuracy
Finding the Estimates
Training data: (x1, y1), (x2, y2), …, (xn, yn)
Find β̂0, β̂1 that fit the training data well
Linear model: a line as close as possible to the 200 points
Training: through Least Squares
Finding the Estimates
ŷ = β̂0 + β̂1x
i-th residual: ei = yi − ŷi
Residual Sum of Squares: RSS = e1² + e2² + … + en²
Least Squares: find β̂0, β̂1 that minimise RSS
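
A minimal numpy sketch of these closed-form least squares estimates, assuming an Advertising.csv file with TV and sales columns (the file and column names are assumptions, not part of the slides):

    import numpy as np
    import pandas as pd

    # Assumed file and column names; adjust to your copy of the advertising data
    df = pd.read_csv("Advertising.csv")
    x, y = df["TV"].to_numpy(), df["sales"].to_numpy()

    # Closed-form least squares estimates that minimise RSS
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0_hat = y.mean() - beta1_hat * x.mean()

    residuals = y - (beta0_hat + beta1_hat * x)
    rss = np.sum(residuals ** 2)
    print(beta0_hat, beta1_hat, rss)   # roughly 7.03 and 0.0475 for this dataset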
[Figure: least squares fit for the regression of sales onto TV (x-axis: TV budget, y-axis: sales), with β̂0 = 7.03 and β̂1 = 0.0475]


Linear Regression
sales = 7.03 + 0.0475 TV
An additional $1000 spent on TV advertising sells approximately 47.5 additional units of the product
Unbiasedness
Using one set of observations, an estimate may over- or under-estimate a parameter
Average of a large number of estimates from a large number of datasets: accurate estimate of the parameter
No systematic over/under-estimation: least squares estimates are unbiased


Accuracy of Coefficient Estimates
Least squares estimates are unbiased
Average over many datasets will be accurate
A single estimate may over- or under-estimate
By how much?
Accuracy of Coefficient Estimates

Standard Error
Confidence Interval
P-Value
Standard Error
Confidence Intervals
95% confidence interval: a range of values such that, with 95% probability, the range will contain the true unknown parameter value
[β̂0 − 2·SE(β̂0), β̂0 + 2·SE(β̂0)]
[β̂1 − 2·SE(β̂1), β̂1 + 2·SE(β̂1)]
Confidence Intervals
sales = 7.03 + 0.0475 TV
β0: [6.130, 7.935]
β1: [0.042, 0.053]
In the absence of any advertising, sales will, on average, be between 6130 and 7935 units.
For each $1000 increase in TV advertising, there will be an average increase in sales of 42 to 53 units.
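
A hedged statsmodels sketch (same assumed Advertising.csv and column names as above) that reports these standard errors and 95% intervals directly:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("Advertising.csv")           # assumed file/column names
    res = smf.ols("sales ~ TV", data=df).fit()    # least squares fit of sales onto TV

    print(res.params)       # estimates: beta0_hat, beta1_hat
    print(res.bse)          # standard errors SE(beta0_hat), SE(beta1_hat)
    print(res.conf_int())   # 95% confidence intervals, approx [6.13, 7.94] and [0.042, 0.053]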
Statistical Hypothesis Testing
State null and alternative hypotheses mathematically
Select a relevant test statistic
Obtain the distribution of the test statistic under the null hypothesis
Select a significance level: probability threshold below which the null hypothesis is rejected (usually 0.01 or 0.05)
Compute the observed value of the test statistic from the data
P-value: probability, under the null hypothesis, of the observed value (and more extreme values)
Reject the null hypothesis if the p-value is less than the significance level
Hypothesis Test
Null Hypothesis: there is no relationship between X and Y
H0: β1 = 0
Alternative Hypothesis: there is some relationship between X and Y
H1: β1 ≠ 0
Hypothesis Test
t-statistic: t = β̂1 / SE(β̂1)
Is our estimate sufficiently far from zero, so that we can be confident of rejecting the null hypothesis?
If SE is low, then smaller non-zero estimates may be sufficient
If SE is high, then larger estimates are required
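
A short sketch of the same t-statistic computed by hand and compared with the value statsmodels reports (again assuming the Advertising.csv columns used above):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("Advertising.csv")           # assumed file/column names
    res = smf.ols("sales ~ TV", data=df).fit()

    t_manual = res.params["TV"] / res.bse["TV"]   # t = beta1_hat / SE(beta1_hat)
    print(t_manual, res.tvalues["TV"])            # manual and built-in t agree
    print(res.pvalues["TV"])                      # p-value for H0: beta1 = 0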
Hypothesis Test
Small p-value (< 0.05):
Assuming the null hypothesis (i.e. in the absence of any real association between X and Y), it would be unlikely to observe such a substantial association (between X and Y) due to chance
Reject the null hypothesis


Accuracy of Coefficient Estimates

Standard Error
Confidence Interval
P-Value
Simple Linear Regression:
Model Fit
Accuracy of the model
Y ≈ β0 + β1X
Y = g(X) + ϵ
Y = β0 + β1X + ϵ
Even if we knew the true coefficients, we may not have accurate predictions due to errors (that have not been modelled)
Model Fit
Residual Standard Error (RSE): an estimate of the standard deviation of the error
RSE = √( RSS / (n − 2) )
RSS = e1² + e2² + … + en² = Σi=1..n (yi − ŷi)²
Model Fit
R² = 1 − RSS/TSS = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²
TSS: total variance in Y (before regression)
RSS: variance left unexplained (after regression)
R-squared: proportion of variability in Y that can be explained using X
ȳ: mean of the yi
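
The RSE and R-squared formulas above can be checked by hand against statsmodels; a sketch under the same Advertising.csv assumption:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("Advertising.csv")     # assumed file/column names
    res = smf.ols("sales ~ TV", data=df).fit()

    y, y_hat = df["sales"], res.fittedvalues
    rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares

    rse = np.sqrt(rss / (len(y) - 2))       # RSE = sqrt(RSS / (n - 2))
    r2 = 1 - rss / tss                      # R^2 = 1 - RSS/TSS
    print(rse, r2)                          # should match np.sqrt(res.mse_resid), res.rsquared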
Model Fit
R-squared: proportion of variability in Y that can be explained using X
~1: good fit, a large proportion explained
~0: the regression did not explain much variability; the linear model is wrong and/or the inherent error is high


Model Fit
sales = 7.03 + 0.0475 TV
R-squared: 0.61
61% of the variability in sales is explained by the regression on TV
Model Fit
For simple linear regression: R-squared = squared linear correlation


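A quick check of this identity for the sales-on-TV regression (same assumed data file):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("Advertising.csv")                # assumed file/column names
    res = smf.ols("sales ~ TV", data=df).fit()

    corr = np.corrcoef(df["TV"], df["sales"])[0, 1]    # sample correlation of TV and sales
    print(corr ** 2, res.rsquared)                     # the two values coincide (~0.61)
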
Simple Linear Regression (recap)
sales ≈ β0 + β1 TV
Least Squares Estimates: β̂0, β̂1
Accuracy of Estimates: SE, Confidence Intervals, P-value
Model Fit: RSE, R-squared
[Figure: least squares fit of sales onto TV, annotated with β̂0 and β̂1]
Multiple Linear Regression

Multiple Predictors
[Figure: scatterplots of sales against TV, radio, and newspaper budgets]
Multiple Linear Regression
Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ
βj: average effect on Y of a one-unit increase in Xj, holding all other predictors fixed
sales = β0 + β1 TV + β2 radio + β3 newspaper + ϵ
Regression Coefficients
ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂pxp
Least Squares Estimate: find coefficients β̂0, …, β̂p to minimise RSS
RSS = Σi=1..n (yi − ŷi)²
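
A hedged sketch of the multiple regression fit (assuming the Advertising.csv columns TV, radio, newspaper, sales):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("Advertising.csv")                              # assumed file/column names
    res = smf.ols("sales ~ TV + radio + newspaper", data=df).fit()   # minimises RSS over beta0..beta3

    print(res.params)      # each coefficient: average effect holding the other predictors fixed
    print(res.summary())   # SEs, t-statistics, p-values, R-squared, F-statistic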
Average effect of Xj keeping other predictors fixed vs. average effect of Xj ignoring other predictors


Predictors -> Response
Is at least one of the predictors useful in predicting the response?
Hypothesis Test
Null hypothesis: β1 = β2 = … = βp = 0
Alternative hypothesis: at least one βj is non-zero


Hypothesis Test
F-statistic: F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]
TSS = Σi (yi − ȳ)², RSS = Σi (yi − ŷi)²
For large n, an F-statistic greater than 1 is evidence against the null
Advertising data: F-statistic = 570.3, p-value: significant
At least one of the 3 media budgets has an effect on sales
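
A sketch computing this F-statistic by hand and comparing it with the statsmodels value (same assumed data):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("Advertising.csv")                              # assumed file/column names
    res = smf.ols("sales ~ TV + radio + newspaper", data=df).fit()

    y = df["sales"]
    rss = np.sum(res.resid ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    n, p = len(y), 3                                                 # 200 markets, 3 predictors

    f_manual = ((tss - rss) / p) / (rss / (n - p - 1))               # F = [(TSS-RSS)/p] / [RSS/(n-p-1)]
    print(f_manual, res.fvalue, res.f_pvalue)                        # ~570.3 with a tiny p-value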
Model Fit
RSE
R-squared

Predictors              Response   R-squared   RSE
TV                      Sales      0.61        3.26
TV, Radio               Sales      0.89719     1.681
TV, Radio, Newspaper    Sales      0.8972      1.686
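
The table can be reproduced with a small loop over the three model formulas; a sketch under the same data assumptions:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("Advertising.csv")                    # assumed file/column names

    for formula in ["sales ~ TV",
                    "sales ~ TV + radio",
                    "sales ~ TV + radio + newspaper"]:
        res = smf.ols(formula, data=df).fit()
        rse = np.sqrt(res.mse_resid)                       # residual standard error
        print(f"{formula:32s} R^2 = {res.rsquared:.4f}  RSE = {rse:.3f}")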
Linear Models:
Assumptions & Limitations
Linear Model: Assumptions

Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ

Additive

Linear
Linear Model: Assumptions
Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ
Additive: the effect of a change in a predictor on the response is independent of the values of the other predictors
Linear: the change in the response due to a unit change in a predictor is constant, irrespective of the value of that predictor
Non-linear Relations
Y = β0 + β1X1 + β2X1² + β3X2 + β4X2² + …
Polynomial Regression
The model is still linear in terms of the coefficients


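A sketch of adding polynomial terms through a model formula; the particular squared terms below are illustrative choices, not from the slides:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("Advertising.csv")                    # assumed file/column names

    # Squared terms added via I(); the model remains linear in its coefficients
    res = smf.ols("sales ~ TV + I(TV**2) + radio + I(radio**2)", data=df).fit()
    print(res.params)
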
Potential Problems

Non-linearity of the data


Correlation of error terms
Outliers

Collinearity
Non-linearity
Conclusions from Linear Model unreliable

Predictions may be inaccurate


Binary Classification

Classification
Will a credit card customer default on his/her credit card payment?
Data: annual income, monthly credit card balance, previous defaulters
Classification
Outcome: Default, binary (Yes/No)
Predictor: Balance
Pr( default = Yes | balance )


Regression?
[Figure: probability of default versus balance, two fits]
Left: linear regression -> negative probabilities!
Right: all probabilities lie between 0 and 1


Logistic Regression

odds: p(Y) / (1 − p(Y)) = e^(β0 + β1X) = e^(β0) · e^(β1X)
logit / log-odds: log( p(Y) / (1 − p(Y)) ) = β0 + β1X
The logistic function (sigmoid) lies between 0 and 1
A unit increase in X changes the log-odds by β1, or equivalently multiplies the odds by e^(β1)
Regression Coefficients
Likelihood: ℓ(β0, β1) = ∏_{i: yi=1} P(xi) × ∏_{j: yj=0} (1 − P(xj))
Find the coefficients that maximise the likelihood
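
A hedged sketch of fitting this model by maximum likelihood with statsmodels, assuming a Default.csv file with a default (Yes/No) column and a balance column (file and column names are assumptions):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("Default.csv")                             # assumed file/column names
    df["default_yes"] = (df["default"] == "Yes").astype(int)    # encode the binary outcome as 0/1

    # Maximum likelihood fit of the logistic regression of default on balance
    res = smf.logit("default_yes ~ balance", data=df).fit()
    print(res.params)    # beta0_hat, beta1_hat on the log-odds scale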
Multiple Logistic Regression
log( p(Y) / (1 − p(Y)) ) = β0 + β1X1 + … + βpXp
Prediction:
p̂(y) = e^(β̂0 + β̂1x1 + … + β̂pxp) / (1 + e^(β̂0 + β̂1x1 + … + β̂pxp))
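
A minimal sketch of this prediction formula; the coefficient and predictor values below are purely hypothetical placeholders:

    import numpy as np

    def predict_proba(beta_hat, x):
        # Logistic regression prediction: sigmoid of the linear predictor
        # beta_hat = [beta0_hat, beta1_hat, ..., betap_hat], x = [x1, ..., xp]
        z = beta_hat[0] + np.dot(beta_hat[1:], x)
        return np.exp(z) / (1.0 + np.exp(z))    # e^z / (1 + e^z), always between 0 and 1

    # Hypothetical coefficients and balance value, for illustration only
    print(predict_proba(np.array([-10.65, 0.0055]), np.array([1500.0])))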
Decision Boundary

Logistic Regression

Target Variable: probabilities


Statistical Models

Linear Models

Linear Regression

Logistic Regression - Classification
Statistical Models
Linear Models

Linear Regression

Logistic Regression - Classification


Key Idea: Approximate Data with a Hyperplane

Evaluate fit of the model

Accuracy of model parameter estimates


Classification with 3-NN