Advanced Research Methods and Software Application
Baro Beyene
OSU
2022
Content
Chapter 8
Stata Software Application on Some Econometric Models
1. Linear Regression Analysis
This section describes the use of Stata to do regression
analysis. Regression analysis involves estimating an
equation that best describes the data.
One variable is considered the dependent variable,
while the others are considered independent (or
explanatory) variables.
Stata is capable of many types of regression analysis and the associated statistical tests. In this section, we touch on only a few of the more common commands and procedures: regress, kdensity, pnorm, qnorm, sktest, mvtest normal, hettest, imtest, vif, linktest, and ovtest.
Consider the dataset food_security to estimate a linear model of the determinants of (log) daily calorie intake (lncalav). Suppose the factors influencing daily calorie intake per capita of households are farming system (farmsy), gender (femal), family size (famsz), land allocated to staples (landst), access to irrigation (irrig), fertilizer quantity used for crop production (frtqt), oxen used for draught power (oxen), (log of) annual gross income in ETB (lninom), and access to off-farm activities (ofarm).
Based on this information, work out the following problems:
a. Estimate the OLS model for determinants of (log) daily calorie
intake per adult equivalent of farm households in the study area.
b. Which variables are positively/adversely and significantly
affecting daily calorie intake?
c. How do you interpret the fitness of this OLS model to identify
determinants of calorie intake?
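A minimal sketch of the estimation for part (a), assuming the dataset is saved as food_security.dta and the variables are named as listed above:

use food_security, clear
* OLS regression of (log) daily calorie intake on its hypothesized determinants
regress lncalav farmsy femal famsz landst irrig frtqt oxen lninom ofarm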
[OLS regression output for the food_security model omitted]
• According to the OLS model outputs, five variables (famsz, irrig, frtqt, lninom, and ofarm) significantly affect the daily calorie intake of households.
• Family size, access to irrigation, and access to off-farm activities adversely affect daily calorie intake, while the remaining two variables enhance the calorie intake of households.
• About 19.3% of the variation in daily calorie intake is explained by this OLS model.
• However, interpretation of the OLS model outputs is valid if and only if the basic assumptions of the classical linear regression model are satisfied.
• There are many post-estimation tests used to check whether the basic assumptions of the multiple linear regression model are satisfied.
• Tests for heteroscedasticity, omitted variables, and multicollinearity are the most important post-estimation tests and must be reported with the OLS model outputs.
Post-estimation tests of OLS regression (diagnostic tests):
- Multicollinearity
- Heteroscedasticity
- Autocorrelation (time series data)
- Normality
- Model misspecification
Diagnostic Tests/Post-Estimation Tests
1. Tests for Normality of Residuals
– kdensity -- produces a kernel density plot with a normal distribution overlaid.
– pnorm -- graphs a standardized normal probability (P-P) plot.
– qnorm -- plots the quantiles of a variable against the quantiles of a normal distribution.
– mvtest normal residual -- performs the Doornik-Hansen test for multivariate normality.
– sktest -- performs the skewness-kurtosis test (analogous to the Jarque-Bera test).
H0: the residual distribution is normal
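A sketch of this normality battery in Stata, run immediately after the regress command above (the residual variable name is illustrative):

predict residual, residuals    // save the OLS residuals
kdensity residual, normal      // kernel density plot with normal overlay
pnorm residual                 // standardized normal probability (P-P) plot
qnorm residual                 // quantiles of the residuals vs. the normal
sktest residual                // skewness-kurtosis test of normality
mvtest normal residual         // Doornik-Hansen test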
2. Tests for Heteroscedasticity
– hettest -- performs the Breusch-Pagan / Cook-Weisberg test for heteroscedasticity.
– H0: No heteroscedasticity
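For example, run immediately after the regress command (estat hettest is the post-estimation form; the older hettest shorthand also works):

estat hettest    // Breusch-Pagan / Cook-Weisberg test; H0: constant variance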
– imtest -- computes White's general test for heteroscedasticity.
– H0: No heteroscedasticity
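A sketch using the post-estimation form (estat imtest with the white option computes White's test):

estat imtest, white    // White's general test; H0: homoskedasticity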
3. Tests for Multicollinearity
– vif -- calculates the variance inflation factor for the independent variables in the linear model.
This test is based on regressing each explanatory variable on the other explanatory variables; if the auxiliary R2 is greater than 0.9, there is a problem of multicollinearity between the explanatory variables.
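For example, run immediately after the regress command (estat vif is the post-estimation form; an auxiliary R2 above 0.9 corresponds to a VIF above 10):

estat vif    // rule of thumb: VIF > 10 signals serious multicollinearity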
4. Tests for Model Specification
– linktest -- performs a link test for model specification.
– H0: No specification problem
– For a correctly specified model, the squared prediction (_hatsq) should not contribute significantly to the test regression.
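A sketch, run directly after the fitted model:

linktest    // _hat should be significant; _hatsq should be insignificant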
– ovtest -- performs the Ramsey regression specification error test (RESET) for omitted variables.
– H0: No omitted variable
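A sketch, run directly after the fitted model (estat ovtest is the post-estimation form; the older ovtest shorthand also works):

estat ovtest    // Ramsey RESET; H0: model has no omitted variables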
2. Linear Probability Model
Linear Probability Model (LPM)
[Figure: the LPM regression line for a binary outcome (Y = 0 or 1) plotted against income (X), with reference points X1 and X2 on the horizontal axis]
Because the fitted line is linear in X, the LPM can produce predicted probabilities below 0 or above 1, which motivates the nonlinear alternatives below.
These nonlinear regression models include:
A. The Logit Model (binary choice)
B. The Probit Model (binary choice)
C. Multinomial Logit and Probit Models (MNL & MNP)
D. Ordered Logit and Probit Models
Numerical Example on LPM
Using the LPM data from your Stata training folder, regress poverty on family size and migration, and test for heteroskedasticity, normality, and multicollinearity.
. reg poverty fs migration

      Source |       SS           df       MS      Number of obs   =        20
-------------+----------------------------------   F(2, 17)        =     13.35
       Model |  2.93291409         2  1.46645705   Prob > F        =    0.0003
    Residual |  1.86708591        17  .109828583   R-squared       =    0.6110
-------------+----------------------------------   Adj R-squared   =    0.5653
       Total |         4.8        19  .252631579   Root MSE        =     .3314

------------------------------------------------------------------------------
     poverty |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          fs |   .0794074    .034043     2.33   0.032      .007583    .1512318
   migration |  -.3628224    .203684    -1.78   0.093     -.792558    .0669132
       _cons |   .2660414   .2456304     1.08   0.294    -.2521936    .7842763
------------------------------------------------------------------------------
There are negative predicted probabilities for some observations, and predicted probabilities greater than one.
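A sketch of how these out-of-range predictions can be inspected (the variable name phat is illustrative):

predict phat                       // fitted values = predicted probabilities in the LPM
summarize phat                     // a minimum below 0 or a maximum above 1 reveals the problem
list phat if phat < 0 | phat > 1   // show the offending observations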
Test of normality: mvtest normal e

. mvtest normal e

Test for multivariate normality
Doornik-Hansen    chi2(2) = 9.916    Prob>chi2 = 0.0070

Test of heteroskedasticity: hettest

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of poverty
chi2(1) = 2.80
Prob > chi2 = 0.0942

The Doornik-Hansen test rejects normality of the residuals (p = 0.0070), while the Breusch-Pagan test fails to reject homoskedasticity at the 5% level (p = 0.0942).
3. The Logit and Probit Models
Overcoming the shortcomings of the LPM requires a nonlinear functional form for the probability. This is possible if we assume that the dependent variable or the error term (Ui) follows some cumulative distribution function.
The two important nonlinear functions proposed for this are the logistic CDF and the normal CDF.
Pr(Yi = 1 | Xi) = Pi = G(β0 + β1Xi) = G(Zi)
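In LaTeX notation, the two choices of G are (a standard formulation, stated here for reference rather than taken from the slides):

% Logit: G is the logistic CDF
P_i = \Lambda(Z_i) = \frac{e^{Z_i}}{1 + e^{Z_i}}

% Probit: G is the standard normal CDF
P_i = \Phi(Z_i) = \int_{-\infty}^{Z_i} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\,dt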
[Figure: the cumulative normal distribution function and the logistic distribution function, S-shaped curves of Pi bounded between Pi = 0 and Pi = 1]
Numerical Example on Logit and Probit
Suppose that we want to examine the factors affecting students' academic performance in a particular course.
Assume that academic performance is measured by the grades (A, B, C, D, F) scored by the students, and let the dependent variable grade equal 1 if a student scored an A and 0 otherwise.
Further assume that data on three independent variables, namely previous CGPA (gpa), PC ownership (pc), and average score in exercises (ase), were collected from 32 students.
A. Log-Odds (Logit) Interpretation of the Logit Model
logit grade gpa ase pc

Logistic regression                             Number of obs   =         32
                                                LR chi2(3)      =      15.40
                                                Prob > chi2     =     0.0015
Log likelihood = -12.889633                     Pseudo R2       =     0.3740

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gpa |   2.826113   1.262941     2.24   0.025     .3507938    5.301432
         ase |   .0951577   .1415542     0.67   0.501    -.1822835    .3725988
          pc |   2.378688   1.064564     2.23   0.025       .29218    4.465195
       _cons |  -13.02135   4.931325    -2.64   0.008    -22.68657    -3.35613
------------------------------------------------------------------------------
B. Odds Ratio Interpretation of Logit Model
logit grade gpa ase pc, or

Logistic regression                             Number of obs   =         32
                                                LR chi2(3)      =      15.40
                                                Prob > chi2     =     0.0015
Log likelihood = -12.889633                     Pseudo R2       =     0.3740

------------------------------------------------------------------------------
       grade | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gpa |   16.87972   21.31809     2.24   0.025     1.420194    200.6239
         ase |   1.099832   .1556859     0.67   0.501     .8333651    1.451502
          pc |   10.79073   11.48743     2.23   0.025     1.339344    86.93802
       _cons |   2.21e-06   .0000109    -2.64   0.008     1.40e-10      .03487
------------------------------------------------------------------------------
C. Probability Interpretation of the Logit Model

. mfx

Marginal effects after logit
      y  = Pr(grade) (predict)
         =  .25282025

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z     P>|z|  [    95% C.I.    ]      X
---------+--------------------------------------------------------------------
     gpa |   .5338589      .23704    2.25   0.024    .069273  .998445   3.11719
     ase |   .0179755      .02624    0.69   0.493   -.033448  .069399   21.9375
      pc*|   .4564984      .18105    2.52   0.012     .10164  .811357     .4375
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
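As a side note, in newer versions of Stata the margins command can reproduce these mfx results; a sketch (atmeans evaluates the effects at the means of the regressors, as mfx does):

logit grade gpa ase pc
margins, dydx(*) atmeans    // marginal effects at the means, matching mfx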
D. Probit Estimation
probit grade gpa ase pc

Probit regression                               Number of obs   =         32
                                                LR chi2(3)      =      15.55
                                                Prob > chi2     =     0.0014
Log likelihood = -12.818803                     Pseudo R2       =     0.3775

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gpa |    1.62581   .6938825     2.34   0.019     .2658255    2.985795
         ase |   .0517289   .0838903     0.62   0.537    -.1126929    .2161508
          pc |   1.426332   .5950379     2.40   0.017     .2600795    2.592585
       _cons |   -7.45232   2.542472    -2.93   0.003    -12.43547   -2.469166
------------------------------------------------------------------------------
E. Probability Interpretation of Probit Model
. mfx

Marginal effects after probit
      y  = Pr(grade) (predict)
         =  .26580809

------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z     P>|z|  [    95% C.I.    ]      X
---------+--------------------------------------------------------------------
     gpa |   .5333471      .23246    2.29   0.022    .077726  .988968   3.11719
     ase |   .0169697      .02712    0.63   0.531   -.036184  .070123   21.9375
      pc*|    .464426      .17028    2.73   0.006    .130682   .79817     .4375
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
Logit: as GPA increases by one point, the log of the odds of scoring an A increases by about 2.83, and the effect is statistically significant.
Odds ratio: as GPA increases by one point, the odds of getting an A (relative to the other grades B, C, D, F) are multiplied by about 16.88.
Marginal effect: both logit and probit give similar results. As GPA increases by one point, the probability of getting grade A increases by about 53 percentage points (evaluated at the means of the regressors).
Thank You!