Module: Statistical Modeling
Lecture 2.3: Testing the Validation of the Model

OLS Regression Analysis Basics
This lecture aims to equip students with a comprehensive understanding of the OLS technique and its
applications in regression analysis and forecasting.
Unless we are specific about how Xi and µi are created or generated, there is no way we can make any
statistical inference about the Yi and also, as we shall see, about β0 and β1. Thus, the
assumptions made about the Xi variable(s) and the error term are extremely critical to the valid
interpretation of the regression estimates.
Y=β0 + β1 X + µ
The regression model is linear in the parameters, though it may or may not be linear in the
variables.
Values taken by the regressor X may be considered fixed in repeated samples (the case of fixed
regressor) or they may be sampled along with the dependent variable Y (the case of stochastic
regressor).
Implication: In the latter case, it is assumed that the X variable(s) and the error term are
independent, that is, cov(Xi, µi) = 0.
The variance of µ, denoted σ², is the same for all values of x.
Implication: The variance of y about the regression line equals σ² and is the same for all values
of x.
The values of µ are independent.
Implication: The value of µ for a particular value of x is not related to the value of µ for any
other value of x; thus, the value of y for a particular value of x is not related to the value of y
for any other value of x.
The error term µ is normally distributed. Even though the error term µ is not observable, we
assume that it is a normally distributed random variable with mean 0 and standard
deviation σ; that is, E(µ) = 0.
Implication: β0 and β1 are constants, therefore E(β0)=β0 and E(β1)=β1; thus, for a given
value of x, the expected value of y is: E(y)=β0 + β1 X. And because y is a linear function of µ,
y is also a normally distributed random variable.
Figure 2.3.1 illustrates the model assumptions and their implications; note that in this
graphical interpretation, the value of E(y) changes according to the specific value of x
considered. However, regardless of the x value, the probability distribution of µ and hence the
probability distribution of y are normal, each with the same variance. The
specific value of the error µ at any particular point depends on whether the actual value of y is
greater than or less than E(y).
Figure 2.3.1 ASSUMPTIONS FOR THE REGRESSION MODEL
1. Linearity: The relationship between the dependent and independent variables must be
linear.
2. Independence of Errors: The errors in the model should be independent of each other.
3. Homoscedasticity: The variance of the errors should be constant across all values of the
independent variable.
4. Normality of Errors: The errors should be normally distributed (see the residual-diagnostics sketch below).
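These assumptions are usually checked informally from the residuals of a fitted model. The following is a minimal sketch in Python, using hypothetical data and two common diagnostics (Durbin-Watson and Shapiro-Wilk); it is an illustration, not a method prescribed in this lecture.

    import numpy as np
    from scipy import stats

    # hypothetical data for illustration
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8])

    # fit y = b0 + b1*x by ordinary least squares
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)

    # 1. and 3. Linearity / homoscedasticity: residuals listed (or plotted) against x
    #    should show no systematic pattern and roughly constant spread.
    print(list(zip(x, np.round(resid, 2))))

    # 2. Independence of errors: Durbin-Watson statistic (values near 2 suggest
    #    no autocorrelation in the residuals).
    dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
    print("Durbin-Watson:", round(dw, 2))

    # 4. Normality of errors: Shapiro-Wilk test applied to the residuals.
    print("Shapiro-Wilk:", stats.shapiro(resid))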
Remember that
• ‘Estimator’ - βˆ1 and βˆ2 are estimators of the true values of β1 and β2
• ‘Linear’ - βˆ1 and βˆ2 are linear estimators - that means that the formulae for βˆ1 and
βˆ2 are linear combinations of the random variables (in this case, y)
• ‘Unbiased’ - on average, the values of βˆ1 and βˆ2 will be equal to their true values
• ‘Best’ - means that the OLS estimator βˆ has minimum variance among the class of
linear unbiased estimators. This can be shown by taking any alternative linear unbiased
estimator and demonstrating that in all cases it must have a variance no smaller than that of the OLS estimator.
Under the assumptions listed above, the OLS estimator can be shown to have the desirable
properties that it is consistent, unbiased and efficient. Unbiasedness and efficiency have already
been discussed above, and consistency is an additional desirable property.
Figure 2.3.2: OLS estimators have the lowest variance among linear unbiased estimators.
Table 2-3-1: The least-squares method yields …
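The variance and standard-error expressions referred to below take their standard form (writing β1 for the intercept and β2 for the slope, as in Example 2-3-1):

var(βˆ2) = σ² / ∑xi²  and  se(βˆ2) = √[σ² / ∑xi²]
var(βˆ1) = σ² ∑Xi² / (n ∑xi²)  and  se(βˆ1) = √[σ² ∑Xi² / (n ∑xi²)]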
where xi = (Xi - X̄), so that ∑xi² = ∑(Xi - X̄)²; var = variance and se = standard error; and σ² is the constant or
homoscedastic variance of µi. All the quantities entering into the preceding equations except
σ² can be estimated from the data. σ² itself is estimated by the following formula:
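σˆ² = ∑ûi² / (n - 2)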
where σˆ² is the OLS estimator of the true but unknown σ², and where the expression n - 2 (more
generally n - k, where k is the number of parameters included in the model) is known as the number
of degrees of freedom (df), ∑ûi² being the sum of the squared residuals or the residual sum
of squares (RSS).
Once ∑ûi² is known, σˆ² can easily be computed from the formula above.
Example 2-3-1:
The table below shows the calculation of a regression equation of quantity demanded (Y) on price (X),
where the estimated coefficients are b1 = 44.25 (intercept) and b2 = -0.25 (slope). Use these values together with the table below to:
1. Visualize the relationship between the data points and their fitted line. Plot the actual data
points and overlay the regression line, and clearly indicate the error sum of squares (SSE), the
regression sum of squares (SSR), and the total sum of squares (SST) on the graph.
2. Calculate the fitted values (Ŷi).
3. Calculate the residual sum of squares (RSS).
4. Calculate the standard error (SE) of the overall model, SE(b1) and SE(b2).
Observation | Price ($), X | Amount Demanded (Q), Y | (X - X̄) | (Y - Ȳ) | (X - X̄)(Y - Ȳ) | (X - X̄)²
1           | 10           | 40                     | -3       | -1       | 3                 | 9
2           | 12           | 38                     | -1       | -3       | 3                 | 1
3           | 13           | 43                     | 0        | 2        | 0                 | 0
4           | 12           | 45                     | -1       | 4        | -4                | 1
5           | 16           | 37                     | 3        | -4       | -12               | 9
6           | 15           | 43                     | 2        | 2        | 4                 | 4
∑           | 78           | 246                    | /        | /        | -6                | 24
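A minimal Python sketch of the calculations requested in Example 2-3-1, using the data in the table above (the numerical values in the comments are approximate):

    import numpy as np
    import matplotlib.pyplot as plt

    X = np.array([10, 12, 13, 12, 16, 15], dtype=float)   # price
    Y = np.array([40, 38, 43, 45, 37, 43], dtype=float)   # amount demanded
    n = len(X)

    x = X - X.mean()                                   # deviations from the mean
    b2 = np.sum(x * (Y - Y.mean())) / np.sum(x ** 2)   # slope = -6/24 = -0.25
    b1 = Y.mean() - b2 * X.mean()                      # intercept = 44.25

    Y_hat = b1 + b2 * X                                # fitted values (task 2)
    resid = Y - Y_hat                                  # residuals ûi
    RSS = np.sum(resid ** 2)                           # residual sum of squares (task 3), ≈ 48.5

    sigma2_hat = RSS / (n - 2)                         # σˆ² ≈ 12.1
    SE_model = np.sqrt(sigma2_hat)                     # standard error of the regression ≈ 3.48
    SE_b2 = np.sqrt(sigma2_hat / np.sum(x ** 2))                          # ≈ 0.71
    SE_b1 = np.sqrt(sigma2_hat * np.sum(X ** 2) / (n * np.sum(x ** 2)))   # ≈ 9.35
    print(Y_hat, RSS, SE_model, SE_b1, SE_b2)

    # task 1: scatter of the data with the fitted line overlaid
    plt.scatter(X, Y, label="observed")
    xs = np.linspace(X.min(), X.max(), 50)
    plt.plot(xs, b1 + b2 * xs, label="fitted line")
    plt.xlabel("Price ($)")
    plt.ylabel("Amount demanded (Q)")
    plt.legend()
    plt.show()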
Is the true slope different from zero? This is an important question because if β1 = 0, then X
does not influence Y and the regression model collapses to a constant β0 plus a random error
term: Y = β0 + µ.
In other words, in a simple linear regression equation, the mean or expected value of y is a
linear function of x: E(Y)= β0 + β1 X. If the value of β1 is zero, E(Y)= β0 + β1 X= β0. In this
case, the mean value of Y does not depend on the value of X and hence we would conclude that
X and Y are not linearly related. Alternatively, if the value of β1 is not equal to zero, we would
conclude that the two variables are related. Thus, to test for a significant regression
relationship, we must conduct a hypothesis test to determine whether the value of β1 is zero.
Two tests are commonly used. Both require an estimate of σ², the variance of the error term µ in the
regression model.
• t Test
The simple linear regression model is Y= β0 + β1X+µ. If X and Y are linearly related, we must
have β1≠0. The purpose of the t test is to see whether we can conclude that β1≠0. We will use
the sample data to test the following hypotheses about the parameter β1.
Suppose we want to test the hypothesis that the regression coefficient βk = 0. To test this
hypothesis, we use the following t statistic:
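t = (βˆk - 0) / SE(βˆk)

Under H0, this statistic follows a Student's t distribution with n - 2 degrees of freedom in the simple regression case (n - k degrees of freedom in general, with k estimated parameters).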
Example 2-3-2: Use the data from Example 2-3-1 to test β0 and β1.
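The worked solution is not reproduced here; the sketch below shows one way to obtain the two t statistics from the quantities computed in Example 2-3-1 (the figures in the comments are approximate):

    from scipy import stats

    # quantities from Example 2-3-1 (intercept 44.25, slope -0.25; SEs approximate)
    b_intercept, se_intercept = 44.25, 9.35
    b_slope, se_slope = -0.25, 0.71
    n = 6
    df = n - 2

    t_intercept = b_intercept / se_intercept   # ≈ 4.73
    t_slope = b_slope / se_slope               # ≈ -0.35
    t_crit = stats.t.ppf(0.975, df)            # two-tailed 5% critical value ≈ 2.78

    # |t_intercept| > t_crit: the intercept differs significantly from zero;
    # |t_slope| < t_crit: the slope does not, at the 5% level with only 6 observations.
    print(round(t_intercept, 2), round(t_slope, 2), round(t_crit, 2))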
• When the interval for β0 includes zero, it suggests the intercept is not statistically
significant, meaning there may be no systematic effect on the dependent variable when
the independent variable is absent or minimal.
The form of a confidence interval for β0 and β1 is as follows:
βˆ0 - tα/2 SE(βˆ0) ≤ β0 ≤ βˆ0 + tα/2 SE(βˆ0), or in brief βˆ0 ± tα/2 SE(βˆ0)
βˆ1 - tα/2 SE(βˆ1) ≤ β1 ≤ βˆ1 + tα/2 SE(βˆ1), or in brief βˆ1 ± tα/2 SE(βˆ1)
• The point estimators are βˆ0 and βˆ1, and the margin of error is tα/2 SE(βˆ). The confidence
coefficient associated with this interval is 1 - α, and tα/2 is the t value providing an
area of α/2 in the upper tail of a t distribution with n - 2 degrees of freedom.
Example 2-3-3
Suppose that we want to develop a 95% confidence interval estimate for a regression model
between gross leasable area (X) and retail sales (Y) in shopping malls. Based on the output in
the following table, calculate the confidence intervals, knowing that n = 28 (so that df = n - 2 = 26).
where: Ŷ (Retail Sales, in $ billions) = 0.3852 + 0.2590 X (Gross Leasable Area, in million sq ft)
REGRESSION OUTPUT
Variables | Coefficients | Std. Error | t (df = 26) | p-value  | 95% Lower     | 95% Upper
Intercept | 0.3852       | 1.9853     | 0.194       | 0.8479   | ………………… | …………………
Area      | 0.2590       | 0.0084     | 30.972      | 1.22E-19 | ………………… | …………………
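A sketch of the interval calculations for this output, using df = 26 as shown in the table (the bounds in the comments are approximate):

    from scipy import stats

    t_crit = stats.t.ppf(0.975, 26)          # ≈ 2.056 for a 95% interval

    # intercept: 0.3852 ± 2.056 × 1.9853  ->  roughly (-3.70, 4.47), includes zero
    ci_intercept = (0.3852 - t_crit * 1.9853, 0.3852 + t_crit * 1.9853)

    # slope (Area): 0.2590 ± 2.056 × 0.0084  ->  roughly (0.2417, 0.2763)
    ci_area = (0.2590 - t_crit * 0.0084, 0.2590 + t_crit * 0.0084)

    print(ci_intercept, ci_area)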
In multiple regression, we have several predictors (X’s), and the F Test helps determine
whether they collectively have an explanatory power over Y. The test is crucial in assessing the
overall validity of the model before interpreting individual coefficients, as a significant F Test
indicates that at least one predictor contributes meaningfully to the prediction of Y.
Note: With only one independent variable, the F test will provide the same conclusion as the t test; that is, if the t
test indicates β1 ≠ 0 and hence a significant relationship, the F test will also indicate a significant relationship.
But with more than one independent variable, only the F test can be used to test for an overall significant
relationship.
• Null Hypothesis (H0): β1=β2=…=βk=0 (All coefficients are equal to zero. This implies
that the independent variables do not jointly explain any of the variability in Y).
• Alternative Hypothesis (H1): At least one βi ≠ 0 (at least one predictor has a non-zero
coefficient, indicating that the model has explanatory power).
• Decision Rule: If the F statistic is large enough (or the p-value is small enough), we
reject the null hypothesis, concluding that the regression model is statistically
significant. That is: reject H0 if Fcalculated > Fcritical or if the p-value < α.
F = MSR / MSE = (SSR / p) / (SSE / (n - p - 1))
where:
• MSR (Mean Square Regression): This measures the variance explained by the
regression model and is calculated as SSR/p, where SSR (Sum of Squares Regression)
is the explained variation by the model, and p is the number of predictors.
• MSE (Mean Square Error): This measures the unexplained variance, or the variance of
the residuals. It is calculated as SSE/(n−p−1), where SSE (Sum of Squares Error)
represents the residual or unexplained variation, n is the sample size, and p is the
number of predictors.
• F = explained variation per degree of freedom divided by unexplained variation per degree of freedom,
where:
• MSE (Mean Square Error) is the average unexplained variance, calculated by dividing
the residual sum of squares by its degrees of freedom, n - k (with k the total number of
estimated parameters; this equals n - p - 1 when an intercept and p slopes are estimated).
The F statistic follows an F distribution, and its critical value depends on the degrees of freedom
associated with the explained and unexplained variances.
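A minimal sketch of this calculation in Python, assuming SSR and SSE have already been obtained for a model with p predictors (the function and variable names are illustrative):

    from scipy import stats

    def f_test(SSR, SSE, n, p, alpha=0.05):
        """Overall F test for a regression with p predictors and n observations."""
        MSR = SSR / p                                # mean square regression
        MSE = SSE / (n - p - 1)                      # mean square error
        F = MSR / MSE
        p_value = stats.f.sf(F, p, n - p - 1)        # upper-tail probability
        F_crit = stats.f.ppf(1 - alpha, p, n - p - 1)
        return F, p_value, F > F_crit                # reject H0 when F exceeds the critical value

    # example call with made-up sums of squares
    print(f_test(SSR=120.0, SSE=80.0, n=30, p=3))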
1. If the F statistic is large and the p-value is small (typically less than 0.05), we reject
H0, suggesting that the model significantly explains the variance in Y (A higher F value
suggests a more significant model, implying that the predictors explain a substantial
portion of the variance in the dependent variable).
2. If the F statistic is small and the p-value is high, we fail to reject H0, suggesting that
the independent variables do not collectively explain a significant portion of the variance
in Y.
For instance, in a regression model predicting income (dependent variable) based on education
level, years of experience, and industry (independent variables), a significant F Test result
would indicate that at least one of these predictors is meaningfully related to income.
In short:
• The F Test is a global test for the regression model, assessing the collective impact of
all predictors.
• The F Test does not tell us which specific variables are significant; we would look at
individual t-tests for each coefficient for that information.
Remember that:
• R2 values range from 0 to 1, where 0 indicates that the model explains none of the
variance in the dependent variable, while 1 means the model explains all the
variance.
• A higher R2 value suggests a better fit of the model to the data, meaning the
independent variables explain a large portion of the variability in the dependent
variable.
• R2 is calculated by comparing the total sum of squares (TSS), which reflects the total
variation in the data, with the sum of squares of residuals (RSS), which indicates the
unexplained variation. The formula is:
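R² = 1 - RSS/TSS = (TSS - RSS)/TSS
(equivalently, 1 - SSE/SST in the SSR/SSE/SST notation used later in this lecture).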
• This ratio gives an insight into the extent to which the regression model reduces
prediction error compared to a simple mean model.
• This quantity varies from 0 to 1, and higher values indicate a better regression.
• Caution should be used in making general interpretations of R2 because a high value can
result from either a small SSE, a large SST, or both.
• While a high R2 indicates a strong model fit, it does not necessarily mean the model is
appropriate, as it doesn’t account for the potential overfitting or omitted variable bias.
• For multiple regressions, an adjusted R2 is often used to adjust for the number of
predictors in the model, providing a more reliable metric when comparing models with
different numbers of variables.
In summary, R2 serves as a useful measure for evaluating the goodness of fit of a regression
model, helping analysts understand the explanatory power of their models, although it should
be interpreted in context with other diagnostic measures.
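As a small illustration of the last two points, the following sketch computes R² and adjusted R² from the sums of squares (the function name and the example numbers are illustrative):

    def r_squared(SSE, SST, n, k):
        """Return R² and adjusted R² for a model with k predictors and n observations."""
        r2 = 1.0 - SSE / SST
        adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
        return r2, adj_r2

    # example: SSE = 80, SST = 200, n = 30 observations, k = 3 predictors
    print(r_squared(80.0, 200.0, 30, 3))   # ≈ (0.60, 0.55)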
In a regression, we seek to explain the variation in the dependent variable around its mean. We
express the total variation as a sum of squares, denoted SST.
The explained variation in Y (denoted SSR) is the sum of the squared differences between the
conditional mean ŷi (conditioned on a given value xi) and the unconditional mean ȳ (the same for
all xi).
The unexplained variation in Y (denoted SSE) is the sum of squared residuals, sometimes
referred to as the error sum of squares.² The three sums of squares are written out below.
² But bear in mind that the residual ei (observable) is not the same as the true error µi (unobservable).
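In symbols:
SST = ∑(yi - ȳ)², SSR = ∑(ŷi - ȳ)², SSE = ∑(yi - ŷi)², and SST = SSR + SSE.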
If the fit is good, SSE will be relatively small compared to SST. If each observed data value yi
is exactly the same as its estimate yˆi (i.e., a perfect fit), then SSE will be zero. There is no
upper limit on SSE. Table 2.3.2 shows the calculation of SSE for the exam scores.
Figure 2-3-3: Decomposing the deviation of an observed y-value from the mean into the
deviations explained and not explained by the regression
And the estimated regression equation for these data is represented as:
where the random variable t(n-2) follows a Student's t distribution with (n - 2) degrees of freedom.
In this example, we reject H0 if tcalc > 2.048 or if tcalc < -2.048.
In summary, testing for the significance of the correlation coefficient ensures that the
relationship you observe between variables is meaningful, not random, and that your findings
can be generalized to the population you are studying.
Example 2-3-4:
A research team was attempting to determine if political risk in countries is related to
inflation for these countries. In this research a survey of political risk analysts produced a
mean political risk score for each of 49 countries.
The political risk score is scaled such that the higher the score, the greater the political risk.
The sample correlation between political risk score and inflation for these countries was
0.43.
We wish to determine whether the population correlation, ρ, between these measures is different
from 0; specifically, we want to test: H0: ρ = 0
against: H1: ρ > 0
• Use the previous information and Appendix 1 to test H0: ρ = 0.
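A minimal sketch of this test in Python, using the standard t statistic for a sample correlation coefficient (in the manual approach the critical value would be read from a t table; the figures in the comments are approximate):

    import math
    from scipy import stats

    r, n = 0.43, 49
    t_calc = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # ≈ 3.27
    t_crit = stats.t.ppf(0.95, n - 2)                       # one-tailed 5% critical value ≈ 1.68

    # t_calc exceeds t_crit, so H0: ρ = 0 is rejected in favour of ρ > 0
    print(round(t_calc, 2), round(t_crit, 2))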
References:
• David P. Doane and Lori E. Seward (2016). Applied Statistics in Business and Economics. 5th edition. McGraw-Hill, Boston.
• Damodar N. Gujarati and Dawn C. Porter (2009). Basic Econometrics. 5th edition. McGraw-Hill, Boston.
• Damodar Gujarati (2012). Econometrics by Example. Palgrave Macmillan, London.
• Neil A. Weiss (2012). Introductory Statistics. 9th edition. Pearson Education, Boston.
• Neil A. Weiss (2017). Introductory Statistics. 10th edition. Pearson Education, Boston.
• David R. Anderson, Dennis J. Sweeney, and Thomas A. Williams (2008). Statistics for Business and Economics. 10th edition. Thomson South-Western, Mason, OH.
• Paul Newbold, William L. Carlson, and Betty M. Thorne (2013). Statistics for Business and Economics. 8th edition. Pearson Education, Boston.
Appendix 1