
Introduction to Regression: The Classical Linear Regression Model (CLRM)

Why Do We Do Regressions?

Regression analysis is a fundamental econometric method used to:

 Reduce uncertainty by quantifying relationships between variables.
 Support planning and decision-making, especially in economics, finance, and business.

However, building a good model requires careful effort:

 Including too many variables may introduce irrelevant information (overfitting, unnecessary complexity).
 Including too few variables can lead to misspecification by omitting important influences or using the wrong functional form.

The Classical Linear Regression Model (CLRM)

CLRM is used to understand the relationship between two or more variables, typically
involving:

 A dependent variable (Y) – the one we want to explain or predict.
 An independent variable (X) – the one used to explain changes in Y.

In its simplest form (with just one X), the model assumes a linear relationship between X and
Y:

E(Yₜ) = a + βXₜ

 E(Yₜ): Expected value of Y at time t.
 a: Intercept (value of Y when X = 0).
 β: Slope (change in Y due to a one-unit change in X).
 Xₜ: Value of the independent variable at time t.

However, real-world data seldom follows the expected relationship exactly. So we add a
disturbance term (uₜ) to capture the difference between actual and expected values:

Yₜ = a + βXₜ + uₜ
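To make the two equations concrete, here is a minimal sketch (not from the chapter) that simulates data satisfying the model in Python; the values a = 2.0, β = 0.5 and the error standard deviation are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100
a, beta = 2.0, 0.5                 # hypothetical "true" population parameters
X = rng.uniform(0, 10, size=n)     # values of the independent variable
u = rng.normal(0, 1, size=n)       # disturbance term: what the line cannot explain

expected_Y = a + beta * X          # E(Y_t) = a + beta * X_t
Y = expected_Y + u                 # actual values: Y_t = a + beta * X_t + u_t
```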

Why Does the Disturbance Term uₜ Exist?


Several reasons:

1. Omitted Variables: Not all relevant factors affecting Y may be included.
2. Aggregation: Simplifying many variables into one may leave residual variation.
3. Model Misspecification: The model structure may be incorrect (e.g., using Xₜ instead of Xₜ₋₁).
4. Functional Misspecification: The true relationship might be non-linear.
5. Measurement Error: Mistakes in data collection for Y or X.

Can We Estimate the Population Regression Function?

 The true (population) regression is unknown and unobservable.
 But we can estimate it using a sample of data.
 The first step is often to create a scatter plot of Y vs X.

Fitting a Line to the Data:

Several naïve methods to fit a line:

1. Drawing a line by eye.
2. Connecting the first and last data points.
3. Connecting the averages of early and late observations.

These are subjective and imprecise.

The Proper Method: Ordinary Least Squares (OLS)

OLS is the standard and statistically justified method for estimating the regression line. It:

 Minimizes the sum of squared residuals (the differences between actual and predicted
Y values).
 Provides estimators for a and β with desirable properties under the CLRM assumptions.

OLS is the focus of the next part of the discussion or chapter.

Ordinary Least Squares (OLS) Method of Estimation (No Derivations)
1. Purpose of OLS

OLS is used to estimate the relationship between a dependent variable Y and an explanatory
(independent) variable X using sample data. The goal is to find the line that best fits the data,
represented as:

Yₜ = a + βXₜ + uₜ

Since the population parameters a and β are unknown, we use sample data to estimate them:

Ŷₜ = â + β̂Xₜ

2. Why Use the OLS Method?

OLS works by minimizing the sum of the squared residuals — the differences between the
actual values and the predicted values. It’s the most popular estimation method because it has
desirable statistical properties:

 Unbiased: On average, it gives the correct parameter values.
 Efficient: It provides the most precise estimates under the classical assumptions.
 Consistent: Estimates improve as sample size increases.

3. Why Minimize Squared Residuals?

Minimizing the squared residuals has useful properties:

1. Eliminates sign issues: Squaring avoids positive and negative errors canceling each
other out.
2. Penalizes large errors more heavily: Squared terms give more weight to larger
deviations.
3. Produces efficient, unbiased estimates under the CLRM assumptions.

4. OLS Estimators

OLS provides estimates for:

 The slope (β̂): tells us how much Y changes when X increases by one unit.
 The intercept (â): the predicted value of Y when X = 0.
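As an illustration (the chapter deliberately skips derivations), the standard closed-form OLS solutions for the one-regressor case can be computed directly. This sketch assumes X and Y are NumPy arrays such as the simulated ones above.

```python
import numpy as np

def ols_simple(X, Y):
    """Return (a_hat, beta_hat) for the fitted line Y_hat = a_hat + beta_hat * X."""
    x_bar, y_bar = X.mean(), Y.mean()
    # Slope: cross-deviations of X and Y divided by the squared deviations of X
    beta_hat = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
    # Intercept: the OLS line passes through the point of sample means
    a_hat = y_bar - beta_hat * x_bar
    return a_hat, beta_hat

a_hat, beta_hat = ols_simple(X, Y)
```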

Once estimated, the regression line can be used to:


 Interpret relationships
 Predict outcomes
 Assess model accuracy

5. Practical Use

To apply OLS:

1. Collect sample data for X and Y.
2. Use software (or formulas) to estimate â and β̂, as in the sketch below.
3. Use the equation Ŷ = â + β̂X to make predictions or analyze the relationship.
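The chapter mentions packages such as EViews and Stata; one freely available alternative is Python's statsmodels. The sketch below applies the three steps to arrays X and Y like those simulated earlier; the prediction point X = 7.5 is arbitrary.

```python
import statsmodels.api as sm

# Step 2: estimate a_hat and beta_hat
X_design = sm.add_constant(X)          # adds the column of ones needed for the intercept
results = sm.OLS(Y, X_design).fit()
a_hat, beta_hat = results.params       # estimated intercept and slope
print(results.summary())               # coefficients, standard errors, t-statistics, R², etc.

# Step 3: use the fitted equation to predict Y at a chosen X value
y_pred = a_hat + beta_hat * 7.5
```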

The Assumptions of the Classical Linear Regression Model (CLRM)

To ensure that OLS estimators are unbiased, consistent, and efficient, the following
assumptions must hold:

1. Linearity

The model is linear in parameters:

Yₜ = α + βXₜ + uₜ

The relationship between Y and X is assumed to be linear in form.

2. Variation in Xt

There must be variability in the independent variable X. If all X values are the same, we
cannot estimate a relationship.

3. Non-Stochastic Xt

The values of X are fixed in repeated samples and not random. This means:

 Xₜ is not influenced by the random error uₜ
 Xₜ and uₜ are uncorrelated

4. Zero Mean of the Disturbance Term

The expected value of the error term is zero:

E(uₜ) = 0

This ensures that the regression line reflects the average relationship between X and Y.

5. Homoskedasticity

The error terms have constant variance:

Var(uₜ) = σ²

This means the spread of the errors is the same across all values of X.

6. No Serial Correlation

The error terms are not correlated with each other:

Cov(uₜ, uₛ) = 0 for t ≠ s

This assumption is particularly important for time series data to avoid biased or inefficient
estimates.

7. Normality of the Error Terms

The disturbances are normally distributed:

uₜ ~ N(0, σ²)

This assumption is especially important for conducting hypothesis tests and constructing
confidence intervals.

8. Sufficient Sample Size and No Perfect Multicollinearity

 The number of observations must exceed the number of estimated parameters.
 There must be no exact linear relationships among the explanatory variables (in multiple regression contexts).

Summary of the CLRM assumptions:

Assumption | Mathematical Expression | Possible Violation | Implication | Discussed in Chapter
1. Linearity | Yₜ = α + βXₜ + uₜ | Wrong regressors, non-linearity, changing parameters | Model misspecification, biased or inconsistent estimates | Chapter 8
2. X is variable | Var(X) ≠ 0 | Errors in variables, little or no variation in X | Inability to estimate slope reliably | Chapter 8
3. X is non-stochastic and fixed in repeated samples | Cov(Xₛ, uₜ) = 0 | Endogeneity, simultaneity, or autoregression | Biased and inconsistent OLS estimates | Chapter 10
4. Mean of disturbance is zero | E(uₜ) = 0 | Systematic error in model | Biased intercept estimate | —
5. Homoskedasticity | Var(uₜ) = σ² | Unequal error variance (heteroskedasticity) | Inefficient estimates, invalid standard errors | Chapter 6
6. No serial correlation | Cov(uₜ, uₛ) = 0 for t ≠ s | Errors are correlated over time (autocorrelation) | Inefficient estimates, misleading inference | Chapter 7
7. Normality of residuals | uₜ ~ N(0, σ²) | Outliers, skewness, or kurtosis | Invalid statistical tests and confidence intervals | Chapter 8
8. No perfect multicollinearity | No exact linear relationships among independent variables | Redundant or linearly dependent regressors | Inability to estimate model uniquely | Chapter 5

Properties of the OLS Estimators

BLUE Property (Best Linear Unbiased Estimator)

Under the assumptions of the Classical Linear Regression Model (CLRM), the Ordinary Least
Squares (OLS) estimators are the Best Linear Unbiased Estimators (BLUE). This means that
among all linear and unbiased estimators, the OLS estimators have the smallest possible
variance.
To establish this, we break down the OLS estimators into two components: a non-random
component, which reflects the true parameter values, and a random component, which reflects
sampling variability. This randomness originates from the error term in the regression model.

Linearity

OLS estimators are linear functions of the observed dependent variable values. This means they
can be expressed as weighted averages of the dependent variable. Since the explanatory variables
(X values) are treated as fixed (non-stochastic), this confirms that the OLS estimators are linear.

Unbiasedness

An estimator is unbiased if its expected value equals the true parameter it estimates. Under the
CLRM assumptions, especially the assumption that the error terms have zero mean and are
uncorrelated with the regressors, both OLS estimators—β̂ (slope) and â (intercept)—are
unbiased. This implies that, on average, OLS will correctly estimate the true population
parameters.

Efficiency (Minimum Variance)

In addition to being linear and unbiased, OLS estimators are also efficient—they have the lowest
possible variance among all linear and unbiased estimators. This is proven by comparing the
OLS estimator to a general linear unbiased estimator and showing that the OLS estimator
satisfies the conditions for minimum variance.

Consistency

An estimator is consistent if, as the sample size increases indefinitely, the estimator converges to
the true parameter value. Even when the assumption that X is fixed is relaxed, the OLS
estimators remain consistent, provided that the regressors and the error term are uncorrelated.
This means that with a large enough sample, OLS will still produce values close to the true
population parameters.
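A quick way to see unbiasedness and consistency at work is a small Monte Carlo simulation (not part of the chapter): re-estimate the slope over many artificial samples and watch the average stay at the true value while the spread shrinks as the sample size grows. The parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
a, beta = 2.0, 0.5                     # hypothetical true parameters

def estimate_beta(n):
    X = rng.uniform(0, 10, size=n)
    Y = a + beta * X + rng.normal(0, 1, size=n)
    x_bar, y_bar = X.mean(), Y.mean()
    return np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)

for n in (20, 200, 2000):
    betas = np.array([estimate_beta(n) for _ in range(2000)])
    # Mean stays near 0.5 at every n (unbiasedness); the spread falls as n grows (consistency)
    print(n, round(betas.mean(), 3), round(betas.std(), 3))
```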

Overall Goodness of Fit

To evaluate how well the regression model fits the data, we decompose each actual value of the
dependent variable into two parts: the predicted value from the regression equation and the
residual (or error). This decomposition allows us to assess how much of the total variation in the
dependent variable is explained by the model.

The total variation is called the Total Sum of Squares (TSS). It can be broken down into:

 Explained Sum of Squares (ESS): The part of the variation explained by the regression
model.
 Residual Sum of Squares (RSS): The part of the variation not explained by the model.

The key measure that arises from this decomposition is the coefficient of determination (R²),
which is calculated as:

R² = ESS / TSS

R² indicates the proportion of the variation in the dependent variable that is explained by the
model:

 R² = 0: the model explains none of the variation.
 R² = 1: the model explains all the variation.
 R² between 0 and 1: the model explains some, but not all, of the variation.

An R² of 0.4, for example, means that 40% of the variation in the dependent variable is explained
by the regression model. It does not mean that the model is twice as good as one with R² = 0.2.
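As a sketch of this decomposition, the three sums of squares and R² can be computed directly from the fitted values; this assumes a fitted statsmodels model called results, as in the earlier example, but any vector of predictions would do.

```python
import numpy as np

Y_hat = results.fittedvalues            # predicted values from the regression
residuals = Y - Y_hat

TSS = np.sum((Y - Y.mean()) ** 2)       # total variation in Y
ESS = np.sum((Y_hat - Y.mean()) ** 2)   # variation explained by the model
RSS = np.sum(residuals ** 2)            # variation left unexplained

R2 = ESS / TSS                          # equivalently 1 - RSS / TSS when an intercept is included
print(round(R2, 3), round(results.rsquared, 3))   # the two numbers should agree
```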


Problems Associated with R² in Regression Analysis

There are several serious issues with using R² to evaluate single regression equations or to
compare different equations:

1. Spurious Regression: High R² values can appear even when variables are unrelated,
especially if they exhibit similar trends. This can mislead researchers into thinking a
relationship exists when it doesn’t.
2. Omitted Variable Bias: If an omitted variable (Zₜ) that actually determines the
dependent variable (Yₜ) is highly correlated with the included independent variable (Xₜ),
the R² may falsely indicate Xₜ is important.
3. Correlation ≠ Causation: A high R² only indicates correlation between observed and
predicted values, not causality. Determining causal relationships should rely on theory,
previous studies, and intuition.
4. Time Series vs. Cross-Sectional Data: Time series models often produce high R²
values, even if badly specified, due to trend components. Cross-sectional data usually
yield lower R² values because of more noise. Thus, R² comparisons across these data
types are invalid.
5. Low R² Doesn’t Imply a Poor Model: A low R² could result from the wrong functional
form, incorrect time period, or missing lagged variables — not necessarily from choosing
the wrong independent variable.
6. Incomparable R² from Different Models: R² values from models using different
transformations of Y (e.g., Yₜ vs. ln(Yₜ)) are not comparable, as R² reflects the proportion
of explained variance of the specific dependent variable used.

Hypothesis Testing and Confidence Intervals in OLS

Under the assumptions of the Classical Linear Regression Model (CLRM):

 OLS estimators (intercept and slope) follow a normal distribution.
 When standard errors are estimated, the relevant test statistics follow a Student's t-distribution with n − 2 degrees of freedom.
 The t-distribution is similar to the normal distribution but has fatter tails, especially with small samples.

Testing the Significance of OLS Coefficients

Steps in Hypothesis Testing:

1. Set Hypotheses: Choose between two-tailed (e.g., β = 0 vs. β ≠ 0) or one-tailed tests (e.g., β = 0 vs. β > 0), depending on prior knowledge.
2. Calculate the t-statistic: Often provided by software like EViews or Stata.
3. Find the Critical t-Value: Based on degrees of freedom (n − 2) and significance level.
4. Decision Rule: Reject the null if the absolute value of the t-statistic exceeds the critical value.

If testing hypotheses other than β = 0 (e.g., β = 1), the null must be manually specified, and the t-
statistic calculated accordingly.
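As a sketch of steps 2 to 4 (and of the p-value approach described below), the test statistic, critical value, and decision can be computed with scipy from the statsmodels fit used earlier; the 5% level and the null value of zero are illustrative choices.

```python
from scipy import stats

beta_hat = results.params[1]           # estimated slope
se_beta = results.bse[1]               # its estimated standard error
df = int(results.nobs) - 2             # degrees of freedom: n - 2 for one regressor plus an intercept

beta_null = 0                          # value of beta under H0 (use 1 instead to test beta = 1)
t_stat = (beta_hat - beta_null) / se_beta

t_crit = stats.t.ppf(1 - 0.05 / 2, df)            # two-tailed critical value at the 5% level
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))  # exact probability under the null

reject_null = abs(t_stat) > t_crit     # equivalently: p_value < 0.05
print(t_stat, t_crit, p_value, reject_null)
```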

Rules of Thumb for Large Samples

 For a 5% significance level:
   o Two-tailed test: critical t ≈ ±2.
   o One-tailed test: critical t ≈ 1.65 in absolute value (the sign depends on the direction of the test).
 These approximations are valid when the degrees of freedom exceed 30.
 For smaller samples, exact t-table values should be used.

The p-Value Approach


 p-values give the exact probability of observing the test statistic under the null
hypothesis.
 A smaller p-value indicates stronger evidence against the null.
 If the p-value ≤ significance level (e.g., 0.05), the coefficient is statistically significant.
 More informative than just comparing t-values, especially when the choice of
significance level (1%, 5%, 10%) is arbitrary.

Confidence Intervals

 Confidence intervals indicate the range of values within which the true coefficient likely
falls, given a certain confidence level (e.g., 95%).
 They are constructed using the estimated coefficient, its standard error, and the
appropriate critical t-value.
 The same logic applies to both slope and intercept estimates.
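Continuing the same sketch, a 95% confidence interval for the slope combines the estimate, its standard error, and the critical t-value; statsmodels also reports the interval directly.

```python
from scipy import stats

beta_hat = results.params[1]
se_beta = results.bse[1]
df = int(results.nobs) - 2
t_crit = stats.t.ppf(1 - 0.05 / 2, df)   # 95% confidence leaves 5% split across the two tails

ci_lower = beta_hat - t_crit * se_beta
ci_upper = beta_hat + t_crit * se_beta
print(round(ci_lower, 3), round(ci_upper, 3))

# statsmodels reports the same interval; rows correspond to [intercept, slope]
print(results.conf_int(alpha=0.05))
```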

🧪 How to Test If Your Model’s Numbers Matter

When you build a model, you want to know: Is this number actually important, or just
random? Here's how you check:

1. Set Up the Test
You make a guess — like “this number is zero” — and test it. If it’s not zero, that means
the variable matters.
2. Use a t-Statistic
It’s a number that helps you figure out how far your result is from your guess (usually
zero). Software like EViews will give it to you.
3. Find the “Critical” Value
It’s the cutoff. If your t-statistic is bigger than this number, you can say your variable is
important.
4. Make a Decision
If the t-stat is bigger than the critical value, you say, “Yep, this variable matters!”

🔍 What About p-Values?

A p-value tells you how likely it is that you got your result just by chance.

 Small p-value (like 0.01 or 0.04) = It’s probably not random → the variable is
important.
 Big p-value (like 0.3) = It’s probably just noise → the variable might not matter.

Rule: If the p-value is smaller than 0.05 (or whatever limit you choose), you say the variable is
statistically significant.

📏 Confidence Intervals (CIs)

A confidence interval tells you: “We’re pretty sure the real number is somewhere in this range.”

For example:
If you say the slope is 3, and your 95% CI is [1.5, 4.5], that means you’re 95% confident the real
slope is between 1.5 and 4.5.
