
STATA TRAINING MATERIAL

ECONOMETRIC ANALYSIS Yesuf Mohammednur Awel

6.0 ECONOMETRIC ANALYSIS (COMMANDS IN REGRESSION)

6.1 Introduction

In the workplace, you are often given questions or issues to address, like: what is the
effect of modern input use (like fertilizer) on productivity? What is the relationship
between CEO performance and CEO salary? What is the effect of one more year of
education on a laborer's wage? Does class size affect student performance? Does
population growth inhibit economic growth? What determines adoption of modern
inputs? These and many more all require quantitative answers. In such situations you
opt for statistical models that enable you to quantify the relationship between a factor
and the response variable, an approach generally known as regression analysis.

6.2 Classical Ordinary Least Squares Regression / Linear Regression Model


regress fits a model of a dependent variable (depvar) on a list of independent
variables (varlist) using linear regression.

Syntax
regress depvar [indepvars] [if] [in] [weight] [, options]

Options Description
------------------------------------------------------------------------------------------------------
Model
noconstant suppress constant term
hascons has user-supplied constant
tsscons compute total sum of squares with constant; seldom used
SE/Robust
vce(vcetype) vcetype may be robust, bootstrap, or jackknife
robust synonym for vce(robust)
cluster(varname) adjust standard errors for intragroup correlation
mse1 force mean squared error to 1
hc2 use u^2_j/(1-h_jj) as observation's variance
hc3 use u^2_j/(1-h_jj)^2 as observation's variance
Reporting
level(#) set confidence level; default is level(95)
beta report standardized beta coefficients
eform(string) report exponentiated coefficients and label as string
noheader suppress the table header
plus make table extendable
-------------------------------------------------------------------------------------------------------
depvar and the varlist following depvar may contain time-series operators

Example 6.1
For instance, there are numerous occasions when it behoves a researcher to test the
relationship between a dependent variable and several potential predictors of that
dependent variable. If a person was interested in buying a car, s/he might wish to
know the statistically significant predictors of good gas mileage. From the STATA
data set, auto.dta, s/he has data on weight, length, trunk size, and headroom. S/He
wants to ascertain which of these variables predict miles per gallon. To do so, s/he
would construct a regression model. To set up the STATA command for a classical
ordinary least squares statistical model, s/he would select miles per gallon, mpg, as
her/his dependent variable. This would become the first variable in her/his list of
variables. S/He follows this dependent variable with the other variables, which s/he
hypothesizes to predict mpg. S/He issues the regression command, which commences
with regress:
clear
sysuse auto.dta
regress mpg weight

The regress command above generates the following simple linear regression results:
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 134.62
Model | 1591.9902 1 1591.9902 Prob > F = 0.0000
Residual | 851.469256 72 11.8259619 R-squared = 0.6515
-------------+------------------------------ Adj R-squared = 0.6467
Total | 2443.45946 73 33.4720474 Root MSE = 3.4389

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0060087 .0005179 -11.60 0.000 -.0070411 -.0049763
_cons | 39.44028 1.614003 24.44 0.000 36.22283 42.65774
------------------------------------------------------------------------------

Interpretation:
The upper left corner is the analysis-of-variance (ANOVA) table. The column
headings SS, df, MS stand for ‘Sum of Squares’, ‘degrees of freedom’ and ‘Mean
square’ respectively. In this example, the total sum of squares is 2443.45, of which
1591.99 is accounted for by the model and 851.46 is left unexplained (in the residual).

Summary statistics are displayed in the top right corner. There are 74 observations
included in the regression analysis. The F-statistic (F(1,72)=134.62) tests the
hypothesis that all coefficients (excluding the constant) are zero. If the null hypothesis
is correct, the probability of observing an F-statistic as large as 134.62 is 0.0000
(Stata's way of indicating a number smaller than 0.00005). That is, the p-value is the
significance level of the test when we use the value of the test statistic, 134.62, as the
critical value for the test. The R-squared for the regression is 0.65 and the Adjusted
R-squared is 0.64, telling us that about 65% (64% after adjustment) of the variation
in the dependent variable, mpg, is explained by the regressor, weight.

Finally, Stata produces a table of estimated coefficients. The first line indicates that
the dependent variable in this regression model was mpg. This table provides
information on: the estimated coefficient (Coef.), its standard error (Std. Err.),
the t-statistic (t) which tests the hypothesis that the coefficient is equal to zero, the
probability of observing this t-statistic if the null hypothesis were valid (P>|t|), and a
confidence interval for the estimated coefficient ([95% Conf. Interval]). Our
fitted model is:

mpg = 39.44028 - 0.0060087 weight


This regression tells us that for every additional pound (lb) in the weight of a car, the
gas mileage (mpg) on average decreases by 0.0060 units (miles per gallon). This
decrease is statistically significant as indicated by the 0.000 probability associated
with this coefficient.

We can redo the above regression changing the confidence level to, for instance, 90%
as below:

reg mpg weight, level(90)

Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 134.62
Model | 1591.9902 1 1591.9902 Prob > F = 0.0000
Residual | 851.469256 72 11.8259619 R-squared = 0.6515
-------------+------------------------------ Adj R-squared = 0.6467
Total | 2443.45946 73 33.4720474 Root MSE = 3.4389

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [90% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0060087 .0005179 -11.60 0.000 -.0068716 -.0051457
_cons | 39.44028 1.614003 24.44 0.000 36.75088 42.12969
------------------------------------------------------------------------------

Similarly, it is possible to suppress the constant term of a regression model using the
noconstant option as follows:

reg mpg weight, noconstant

Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 73) = 259.18
Model | 28094.8545 1 28094.8545 Prob > F = 0.0000
Residual | 7913.14549 73 108.399253 R-squared = 0.7802
-------------+------------------------------ Adj R-squared = 0.7772
Total | 36008 74 486.594595 Root MSE = 10.411

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | .006252 .0003883 16.10 0.000 .0054781 .007026
------------------------------------------------------------------------------

Estimating a regression model suppressing the constant forces the regression
(prediction) line to pass through the origin.
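One way to see this, as a hedged sketch (lfit's estopts() option passes options through to the underlying regress call), is to overlay the ordinary fit and the no-constant fit on the same scatter:

* Compare the ordinary linear fit with the fit forced through the origin
twoway (scatter mpg weight) (lfit mpg weight) (lfit mpg weight, estopts(noconstant))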

Example 6.2
The person in example 6.1 felt that in addition to the weight of the car length, trunk
size and headroom has some thing to do with gas mileage, hence included the last
three variables as independent variable in her/his model and estimated the following
model, usually referred as Multiple Linear Regression Model.

mpg_i = β0 + β1 weight_i + β2 length_i + β3 trunk_i + β4 headroom_i + ε_i

The command for a multiple linear regression model is the same as for the simple
linear regression model, except that more than one independent variable is listed in
varlist, as shown here.

regress mpg weight length trunk headroom

This command generates the following ANOVA and regression parameter estimates.

Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 4, 69) = 33.73
Model | 1616.74273 4 404.185683 Prob > F = 0.0000
Residual | 826.716728 69 11.9814019 R-squared = 0.6617
-------------+------------------------------ Adj R-squared = 0.6420
Total | 2443.45946 73 33.4720474 Root MSE = 3.4614

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0038778 .0016121 -2.41 0.019 -.007094 -.0006617
length | -.0742607 .060633 -1.22 0.225 -.1952201 .0466988
trunk | -.0342372 .1582544 -0.22 0.829 -.3499461 .2814716
headroom | .0161296 .6405386 0.03 0.980 -1.26171 1.293969
_cons | 47.38497 6.54025 7.25 0.000 34.33753 60.43241
------------------------------------------------------------------------------

Interpretation:
Most of the output is interpreted in the same manner as for the simple linear regression
estimates, except the coefficients. The coefficients are interpreted as follows: for every
additional pound (lb) in the weight of a car, keeping the effect of the other regressors
constant (ceteris paribus), the gas mileage (mpg) on average decreases by 0.0038
units (miles per gallon). This decrease is statistically significant as indicated by the
0.019 probability associated with this coefficient.

From the full model regression output shown above, it appears that only weight is a
significant predictor of the gas mileage. The hypotheses that headroom, trunk size and
length of the vehicle are statistically significant predictors of the gas mileage are
disconfirmed by this regression analysis, based on this data set. Once the original
hypotheses are tested, this model can be trimmed to a parsimonious one showing the
relationship between the dependent variable and only the significant predictors.

The resulting formula from this full regression model is:

mpg_i = 47.38 - 0.0038 weight_i - 0.0742 length_i - 0.0342 trunk_i + 0.0161 headroom_i

You may be wondering what the -0.0038 coefficient on weight really means, and how
you might compare the strength of that coefficient to the coefficient for another
variable, say length. To address this problem, we can add an option to the regress
command called beta, which will give us the standardized regression coefficients. The
beta coefficients are used by some researchers to compare the relative strength of the
various predictors within the model. Because the beta coefficients are all measured in
standard deviations, instead of the units of the variables, they can be compared to one
another. In other words, the beta coefficients are the coefficients that you would
obtain if the outcome and predictor variables were all transformed to standard scores,
also called z-scores, before running the regression.
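A minimal sketch of that equivalence (the z_* variable names are illustrative, not part of the dataset): standardize the outcome and the predictors with egen's std() function and re-run the regression; the resulting coefficients should reproduce the Beta column up to rounding.

* Convert all variables to z-scores, then regress the standardized outcome
egen z_mpg = std(mpg)
egen z_weight = std(weight)
egen z_length = std(length)
egen z_trunk = std(trunk)
egen z_headroom = std(headroom)
regress z_mpg z_weight z_length z_trunk z_headroom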

regress mpg weight length trunk headroom, beta
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 4, 69) = 33.73
Model | 1616.74273 4 404.185683 Prob > F = 0.0000
Residual | 826.716728 69 11.9814019 R-squared = 0.6617
-------------+------------------------------ Adj R-squared = 0.6420
Total | 2443.45946 73 33.4720474 Root MSE = 3.4614

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
weight | -.0038778 .0016121 -2.41 0.019 -.5209278
length | -.0742607 .060633 -1.22 0.225 -.2858029
trunk | -.0342372 .1582544 -0.22 0.829 -.0253126
headroom | .0161296 .6405386 0.03 0.980 .0023586
_cons | 47.38497 6.54025 7.25 0.000 .
------------------------------------------------------------------------------

Interpretation:
Because the coefficients in the Beta column are all in the same standardized units you
can compare these coefficients to assess the relative strength of each of the
predictors. In this example, weight has the largest Beta coefficient, -0.52 (in absolute
value), and headroom has the smallest Beta, 0.002. Thus, a one standard deviation
increase in weight leads to a 0.52 standard deviation decrease in predicted mpg, with
the other variables held constant. And, a one standard deviation increase in
headroom, in turn, leads to a 0.002 standard deviation increase in predicted mpg with
the other variables in the model held constant.

Remember that the difference between the numbers listed in the Coef. column and
the Beta column is in the units of measurement.

6.3 Post Estimation Commands

6.3.1 Hypothesis Testing

test -- Test linear hypotheses after estimation


Usually, after estimating a regression function, we want to test whether an estimated
coefficient equals some value, or whether two or more coefficients are linearly
related, so-called joint hypothesis testing.

Syntax for linear hypotheses test after estimation


test coeflist
test exp=exp[=...]

Description
test performs tests of linear hypotheses about the estimated parameters from the most
recently fitted model.

So far, we have concerned ourselves with testing a single variable at a time, for
example looking at the coefficient for weight and determining if it is significant. We
can also test sets of variables, using the test command, to see if the set of variables
is jointly significant.

test weight
( 1) weight = 0

F( 1, 69) = 5.79
Prob > F = 0.0188

If you compare this output with the output from the last regression, you can see that
the result of the F-test, 5.79, is the same as the square of the result of the t-test in the
regression ((-2.41)^2 ≈ 5.79).
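You can check the arithmetic directly in Stata; the small discrepancy comes only from rounding in the displayed t-statistic.

* For a single-coefficient test, t squared equals the F-statistic
display (-2.41)^2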

test weight length

( 1) weight = 0
( 2) length = 0

F( 2, 69) = 32.83
Prob > F = 0.0000

The significant F-test, 32.83 (p-value = 0.0000, which is very small), means that the
collective contribution of these two variables is significant. One way to think of this
is that there is a significant difference between a model with weight and length as
compared to a model without them, i.e., there is a significant difference between the
"full" model and the "reduced" model.

test length headroom

( 1) length = 0
( 2) headroom = 0

F( 2, 69) = 0.75
Prob > F = 0.4762

test length headroom trunk

( 1) length = 0
( 2) headroom = 0
( 3) trunk = 0

F( 3, 69) = 0.69
Prob > F = 0.5621

The insignificant F-test, 0.75 (p-value = 0.4762, which is large), means that the
collective contribution of these two variables (length and headroom) is insignificant.
One way to think of this is that there is no significant difference between a model
with length and headroom as compared to a model without them, i.e., there is no
significant difference between the "full" model and the "reduced" model. The same
interpretation and conclusion can be reached for the last joint hypothesis test.

Syntax for non-linear hypotheses test after estimation


testnl exp = exp [= exp ...] [, options]
testnl (exp = exp [= exp ...]) [(exp = exp [= exp ...]) ...] [, options]


Description
testnl tests (linear or nonlinear) hypotheses about the estimated parameters from
the most recently fitted model. testnl produces Wald-type tests of smooth nonlinear
(or linear) hypotheses about the estimated parameters from the most recently fitted
model. The p-values are based on the "delta method", an approximation appropriate
in large samples. testnl supports survey regression-type commands (svy: regress, svy:
logit, etc.). testnl may also be used to test linear hypotheses. test is faster if you want
to test only linear hypotheses. testnl is the only option for testing linear and nonlinear
hypotheses simultaneously.

Options
mtest[(opt)] specifies that tests be performed for each condition separately. opt
specifies the method for adjusting p-values for multiple testing. Valid values for opt
are
bonferroni Bonferroni's method
holm Holm's method
sidak Sidak's method
noadjust no adjustment is to be made
Specifying mtest without an argument is equivalent to mtest(noadjust).

Example 6.3
regress mpg weight length trunk headroom

testnl _b[weight]/_b[length] = _b[trunk]

(1) _b[weight]/_b[length] = _b[trunk]
F(1, 69) = 0.22
Prob > F = 0.6398

testnl (_b[trunk] = _b[headroom])

(1) _b[trunk] = _b[headroom]
F(1, 69) = 0.00
Prob > F = 0.9453

testnl (_b[weight]/_b[length] = _b[trunk]) (_b[trunk] = _b[headroom])

(1) _b[weight]/_b[length] = _b[trunk]
(2) _b[trunk] = _b[headroom]
F(2, 69) = 0.14
Prob > F = 0.8696

testnl (_b[weight]/_b[length] = _b[trunk]) (_b[trunk] = _b[headroom]), mtest
(1) _b[weight]/_b[length] = _b[trunk]
(2) _b[trunk] = _b[headroom]
---------------------------------------
| F(df,69) df p
-------+-------------------------------
(1) | 0.22 1 0.6398 #
(2) | 0.00 1 0.9453 #
-------+-------------------------------
all | 0.14 2 0.8696
---------------------------------------
# unadjusted p-values


Syntax for linear combination of coefficients after estimation


lincom exp [, options]

Description
lincom computes point estimates, standard errors, t or z statistics, p-values, and
confidence intervals for linear combinations of coefficients after any estimation
command. Results can optionally be displayed as odds ratios, hazard ratios, incidence-
rate ratios, or relative risk ratios.

Example 6.4
regress mpg weight length trunk headroom

lincom weight + length

( 1) weight + length = 0

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | -.0781385 .0591877 -1.32 0.191 -.1962148 .0399378
------------------------------------------------------------------------------

lincom 9*weight + 2*length - 1.2

( 1) 9 weight + 2 length = 1.2

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | -1.383422 .1084071 -12.76 0.000 -1.599688 -1.167156
------------------------------------------------------------------------------
In addition to getting the regression table, it can be useful to see a scatterplot of the
predicted and outcome variables with the regression line plotted. After you run a
regression, you can create a variable that contains the predicted values using the
predict command.
predict mpghat

Draw a scatter plot to see the sampled observations together with the linear fit line,
i.e., the fitted line predicted by OLS (mpghat in this case).
twoway (scatter mpg weight) (lfit mpg weight)

As we saw earlier, the predict command can be used to generate predicted (fitted)
values after running regress. You can also obtain residuals by using the predict
command followed by a variable name, in this case residualhat, with the residual
option.

This command can be shortened to predict residualhat, resid or even predict
residualhat, r. The table below shows some of the other values that can be created
with the predict command.
Value to be created Option after Predict
--------------------------------------------------- --------------------
predicted values of y (y is the dependent variable) no option needed
residuals resid
standardized residuals rstandard
studentized or jackknifed residuals rstudent
leverage lev or hat
standard error of the residual stdr
Cook's D cooksd
standard error of predicted individual y stdf
standard error of predicted mean y stdp
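As an illustrative run-through of a few of these options (the new variable names are arbitrary):

* Create several diagnostic quantities after refitting the model
regress mpg weight length trunk headroom
predict yhat                 // fitted values (no option needed)
predict rstd, rstandard      // standardized residuals
predict rstu, rstudent       // studentized residuals
predict h, hat               // leverage
predict cd2, cooksd          // Cook's D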


Finally, as part of doing a multiple regression analysis you might be interested in
seeing the correlations among the variables in the regression model. You can do this
with the correlate command as shown below.
corr mpg weight length trunk headroom

(obs=74)
| mpg weight length trunk headroom
-------------+---------------------------------------------
mpg | 1.0000
weight | -0.8072 1.0000
length | -0.7958 0.9460 1.0000
trunk | -0.5816 0.6722 0.7266 1.0000
headroom | -0.4138 0.4835 0.5163 0.6620 1.0000

If we look at the correlations with mpg, we see weight and length have the two
strongest correlations with mpg. The correlation between mpg and weight is
negative, meaning that as the value of one variable goes down, the value of the other
variable tends to go up. Knowing that these variables are strongly associated with
mpg might give us a clue that they would be statistically significant regressors in the
regression model.

We can also use the pwcorr command to do pairwise correlations. The most
important difference between correlate and pwcorr is the way in which missing data
are handled. With correlate, an observation or case is dropped if any variable has a
missing value; in other words, correlate uses listwise, also called casewise, deletion.
pwcorr uses pairwise deletion, meaning that the observation is dropped only if there
is a missing value for the pair of variables being correlated. Two options that you can
use with pwcorr, but not with correlate, are the sig option, which will give the
significance levels for the correlations, and the obs option, which will give the number
of observations used in the correlation. Such an option is not necessary with corr as
Stata lists the number of observations at the top of the output.

pwcorr mpg weight length trunk headroom


| mpg weight length trunk headroom
-------------+---------------------------------------------
mpg | 1.0000
weight | -0.8072 1.0000
length | -0.7958 0.9460 1.0000
trunk | -0.5816 0.6722 0.7266 1.0000
headroom | -0.4138 0.4835 0.5163 0.6620 1.0000

pwcorr mpg weight length trunk headroom, obs sig


| mpg weight length trunk headroom
-------------+---------------------------------------------
mpg | 1.0000
| 74
|
weight | -0.8072 1.0000
| 0.0000
| 74 74
|
length | -0.7958 0.9460 1.0000
| 0.0000 0.0000
| 74 74 74
|
trunk | -0.5816 0.6722 0.7266 1.0000
| 0.0000 0.0000 0.0000
| 74 74 74 74
|
headroom | -0.4138 0.4835 0.5163 0.6620 1.0000
| 0.0002 0.0000 0.0000 0.0000
| 74 74 74 74 74


6.3.2 Checking Normality of Variables and Transforming Variables

So far we have covered the basics of estimating both simple and multiple linear
regression models, interpreting the regression results, performing joint hypothesis
tests and checking the linear association among variables. Some researchers believe
that linear regression requires that the outcome (dependent) and predictor
(explanatory) variables be normally distributed. We need to clarify this issue.
Actually, it is the residuals that need to be normally distributed. In fact, the residuals
need to be normal only for the hypothesis tests (like t-tests) to be valid. The
estimation of the regression coefficients does not require normally distributed
residuals. As we are interested in having valid t-tests, we will investigate issues
concerning normality.

A common cause of non-normally distributed residuals is non-normally distributed
dependent and/or explanatory variables. So, let us explore the distribution of our
variables and how we might transform them to a more normal shape. Let's start by
making a histogram of the variable weight, which is an explanatory variable in the
regression models above.

histogram weight
[Figure: histogram of weight; x-axis Weight (lbs.), y-axis Density]

We can use the normal option to superimpose a normal curve on this graph and the
bin(15) option to use 15 bins. The distribution looks skewed to the right.

histogram weight, normal bin(15)


[Figure: histogram of weight with a normal curve overlaid, 15 bins; x-axis Weight (lbs.), y-axis Density]

Histograms are sensitive to the number of bins or columns that are used in the display.
An alternative to histograms is the kernel density plot, which approximates the
probability density of the variable. Kernel density plots have the advantage of being
smooth and of being independent of the choice of origin, unlike histograms. Stata
implements kernel density plots with the kdensity command.

kdensity weight, normal


[Figure: kernel density estimate of weight with a normal density overlaid; x-axis Weight (lbs.), y-axis Density]

Not surprisingly, the kdensity plot also indicates that the variable weight does not
look normal.

There are three other types of graphs that are often used to examine the distribution
of variables: symmetry plots, normal quantile plots and normal probability plots.


Symmetry plots: A symmetry plot graphs the distance above the median for the i-th
value against the distance below the median for the i-th value. A variable that is
symmetric would have points that lie on the diagonal line. As we would expect, this
distribution is not symmetric.

symplot weight

[Figure: symmetry plot of weight; x-axis Distance below median, y-axis Distance above median]

Normal quantile plots: A normal quantile plot graphs the quantiles of a variable
against the quantiles of a normal (Gaussian) distribution. qnorm is sensitive to non-
normality near the tails, and indeed we see considerable deviations from normal, the
diagonal line, in the tails. This plot is typical of variables that are skewed to the right.

qnorm weight
[Figure: normal quantile plot of weight; x-axis Inverse Normal, y-axis Weight (lbs.)]


Normal probability plot: The normal probability plot is also useful for examining
the distribution of variables. pnorm is sensitive to deviations from normality nearer
to the center of the distribution. Again, we see indications of non-normality in weight.

pnorm weight
[Figure: normal probability plot of weight; x-axis Empirical P[i] = i/(N+1), y-axis Normal F[(weight-m)/s]]

Having concluded that weight is not normally distributed, how should we address this
problem? First, we may try entering the variable as-is into the regression, but if we
see problems, which we likely would, then we may try to transform weight to make it
more normally distributed. Potential transformations include taking the log, the
square root or raising the variable to a power. Selecting the appropriate transformation
is somewhat of an art. Stata includes the ladder and gladder commands to help in the
process.

ladder reports numeric results and gladder produces a graphical display.

Let's start with ladder and look for the transformation with the smallest chi-square.

ladder weight

Transformation formula chi2(2) P(chi2)
------------------------------------------------------------------
cubic weight^3 12.94 0.002
square weight^2 4.49 0.106
raw weight 5.66 0.059
square-root sqrt(weight) 8.81 0.012
log log(weight) 10.37 0.006
reciprocal root 1/sqrt(weight) 9.58 0.008
reciprocal 1/weight 8.04 0.018
reciprocal square 1/(weight^2) 8.03 0.018
reciprocal cubic 1/(weight^3) 12.37 0.002


The square transform has the smallest chi-square. Let's verify these results graphically
using gladder.

gladder weight

[Figure: gladder output, "Histograms by transformation" - histograms of weight under each ladder-of-powers transformation (cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic)]

This also indicates that the square transformation would help to make weight more
normally distributed. Let's use the generate command to create the variable
weightsqr, which will be weight squared (weight^2), and then see if we have
normalized it.
generate weightsqr = weight^2

histogram weightsqr, normal


[Figure: histogram of weightsqr with a normal curve overlaid; x-axis weightsqr, y-axis Density]


We can now see a little improvement in the normality of the variable weight after the
transformation. We would then use the symplot, qnorm and pnorm commands to
help us assess whether weightsqr seems normal, as well as seeing how weightsqr
impacts the residuals, which is really the important consideration.
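For instance, along the lines just suggested:

* Re-examine the transformed variable with the same diagnostic plots
symplot weightsqr
qnorm weightsqr
pnorm weightsqr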

Exercise:
Check whether the dependent variable, mpg, is normally distributed.

6.3.3 Checking Data Problems and Validity of OLS Assumptions in a Data set

Without verifying that your data have met the assumptions underlying OLS
regression, your results may be misleading. In the following we explore how you can
use Stata to check on how well your data meet the assumptions of OLS regression. In
particular, we will consider the following assumptions.

• Linearity - the relationships between the predictors and the outcome variable
should be linear
• Normality - the errors should be normally distributed; technically, normality is
necessary only for the hypothesis tests to be valid, while estimation of the
coefficients only requires that the errors be identically and independently
distributed
• Homogeneity of variance (homoscedasticity) - the error variance should be
constant
• Collinearity - predictors that are highly collinear, i.e., linearly related, can
cause problems in estimating the regression coefficients
• Independence - the errors associated with one observation are not correlated
with the errors of any other observation
• Errors-in-variables - predictor variables are measured without error (we will
cover this in further topics in regression)
• Model specification - the model should be properly specified (including all
relevant variables, and excluding irrelevant variables)

Additionally, there are issues that can arise during the analysis that, while strictly
speaking not assumptions of regression, are nonetheless of great concern to data
analysts.

Influence - individual observations that exert undue influence on the coefficients

Outliers - observations that are very extreme compared to the other observations
and may cause problems in estimating the regression coefficients

Let's check them out!

A) Unusual and Influential data

A single observation that is substantially different from all other observations can
make a large difference in the results of your regression analysis. If a single
observation (or small group of observations) substantially changes your results, you
would want to know about this and investigate further. There are three ways that an
observation can be unusual.


Outliers: In linear regression, an outlier is an observation with a large residual. In
other words, it is an observation whose dependent-variable value is unusual given its
values on the predictor variables. An outlier may indicate a sample peculiarity or may
indicate a data entry error or other problem.

Leverage: An observation with an extreme value on a predictor variable is called a
point with high leverage. Leverage is a measure of how far an independent variable
deviates from its mean. These leverage points can have an effect on the estimate of
regression coefficients.

Influence: An observation is said to be influential if removing the observation
substantially changes the estimate of coefficients. Influence can be thought of as the
product of leverage and outliers.

To identify the aforementioned three types of observations, let's look at the auto.dta
data set shipped with Stata.

sysuse auto.dta

describe
Contains data from C:\Program Files\Stata9\ado\base/a/auto.dta
obs: 74 1978 Automobile Data
vars: 12 13 Apr 2005 17:45
size: 3,478 (99.7% of memory free) (_dta has notes)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
make str18 %-18s Make and Model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %8.0g Repair Record 1978
headroom float %6.1f Headroom (in.)
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn Circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear Ratio
foreign byte %8.0g origin Car type
-------------------------------------------------------------------------------
Sorted by: foreign

summarize mpg weight length headroom trunk


Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
mpg | 74 21.2973 5.785503 12 41
weight | 74 3019.459 777.1936 1760 4840
length | 74 187.9324 22.26634 142 233
headroom | 74 2.993243 .8459948 1.5 5
trunk | 74 13.75676 4.277404 5 23

Suppose we want to estimate the determinants of auto fuel efficiency, measured by
miles per gallon (mpg), and we suspect weight, length, headroom and trunk to
affect mpg. We build a multiple regression model where mpg is the dependent
variable and weight, length, headroom and trunk are predictor variables. We will
first look at the scatter plots of mpg against each of the predictor variables before the
regression analysis, so we will have some idea about potential problems. We can
create a scatter plot matrix of all the variables and also plot scatter plots individually
between the dependent variable and a predictor.

graph matrix mpg weight length headroom trunk
[Figure: scatter plot matrix of mpg, weight, length, headroom and trunk]

scatter mpg weight
scatter mpg length
scatter mpg headroom
scatter mpg trunk

Second, we run the regression as below:

regress mpg weight length headroom trunk
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 4, 69) = 33.73
Model | 1616.74273 4 404.185683 Prob > F = 0.0000
Residual | 826.716728 69 11.9814019 R-squared = 0.6617
-------------+------------------------------ Adj R-squared = 0.6420
Total | 2443.45946 73 33.4720474 Root MSE = 3.4614

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0038778 .0016121 -2.41 0.019 -.007094 -.0006617
length | -.0742607 .060633 -1.22 0.225 -.1952201 .0466988
headroom | .0161296 .6405386 0.03 0.980 -1.26171 1.293969
trunk | -.0342372 .1582544 -0.22 0.829 -.3499461 .2814716
_cons | 47.38497 6.54025 7.25 0.000 34.33753 60.43241
------------------------------------------------------------------------------

Let's examine the studentized residuals as a first means for identifying outliers. The
predict command can generate raw, standardized, or studentized residuals. The
following command generates residuals in the data set, called resid.
predict resid, residuals

To generate standardized residuals (the residual adjusted for its standard error),
called stdres, the user executes the command:
predict stdres, rstandard

To generate studentized residuals (the standardized residual with the observation in
question deleted), the analyst executes the following command:
predict studres, rstudent


You may also generate the leverage values (a measure of influence) to identify
observations with potentially great influence on the regression coefficient estimates.
predict lev, leverage

Generally, a point with leverage greater than (2k+2)/n should be carefully examined.
Here k is the number of predictors and n is the number of observations.
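Applying that rule of thumb to this model as a quick sketch (k = 4 predictors and n = 74 observations here; lev was generated above):

* Flag observations whose leverage exceeds (2k+2)/n
list make mpg weight lev if lev > (2*4+2)/74 & !missing(lev)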

Outliers have to be detected, identified, and analyzed. Residuals can be tabulated to
see whether observations qualify as outliers. Standardized residuals whose absolute
values are in excess of 3.5 could qualify. Studentized residuals are distributed as a t
distribution with n - k - 1 degrees of freedom; they can be examined as t values.
Observations with high leverage values (leverage values range from 1/n to 1) would
qualify. One could sort the residuals by their standardized values and then tabulate
them with the commands:
sort stdres

tabulate stdres

Part of the Stata output for the above command is as below:

Standardized |
   residuals |      Freq.     Percent        Cum.
------------+-----------------------------------
-1.967619 | 1 1.35 1.35
-1.731058 | 1 1.35 2.70
-1.673048 | 1 1.35 4.05
-1.601529 | 1 1.35 5.41
-1.437683 | 1 1.35 6.76
-1.337968 | 1 1.35 8.11
-1.062173 | 1 1.35 9.46
-.9713771 | 1 1.35 10.81
-.9347258 | 1 1.35 12.16
. . . . . .
. . . . . .
. . . . . .
-.0991644 | 1 1.35 51.35
-.0194167 | 1 1.35 52.70
-.0175309 | 1 1.35 54.05
.0119123 | 1 1.35 55.41
.0196775 | 1 1.35 56.76
.036817 | 1 1.35 58.11
.0534289 | 1 1.35 59.46
.0743084 | 1 1.35 60.81
.1008119 | 1 1.35 62.16
.1147917 | 1 1.35 63.51
. . . . . .
. . . . . .
. . . . . .

1.994285 | 1 1.35 95.95


2.359877 | 1 1.35 97.30
2.378415 | 1 1.35 98.65
4.116227 | 1 1.35 100.00
------------+-----------------------------------
Total | 74 100.00


Observations with absolute values greater than 3.5 merit closer examination. For
these outliers, some adjustment may be necessary. We should pay attention to
studentized residuals that exceed +2 or -2, get even more concerned about residuals
that exceed +2.5 or -2.5, and even more concerned about residuals that exceed +3
or -3.
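A short sketch of flagging observations by these thresholds (studres was generated with predict above):

* List observations whose studentized residuals exceed 2 in absolute value
list make studres if abs(studres) > 2 & !missing(studres)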

There are robust methods used to detect outliers that require additional treatment.
Nonetheless, the observations need to be examined for typographical errors. Any such
errors need to be corrected. Given a large enough data set, they can be deleted. If the
data set is small, then some sort of smoothing or missing data replacement can be
invoked.

Another important diagnostic graph is the leverage-versus-residual plot, which shows
the influence of the outliers over the residuals. The command for this plot is
lvr2plot; the influential outliers are revealed in the upper right sector. If there are
enough of them that have substantial leverage, then the analyst may wish to resort to
robust regression techniques.

lvr2plot
[Figure: Leverage vs Squared Residual Plot (lvr2plot); x-axis Normalized residual squared, y-axis Leverage]

The two reference lines are the means for leverage, horizontal, and for the normalized
residual squared, vertical.

Now let's move on to overall measures of influence, specifically let's look at Cook's D
and DFITS. These measures both combine information on the residual and leverage.
Cook's D and DFITS are very similar except that they scale differently but they give
us similar answers.


The lowest value that Cook's D can assume is zero, and the higher the Cook's D is,
the more influential the point. The conventional cut-off point is 4/n. We can list any
observation above the cut-off point by doing the following, and see that the Cook's D
for the 2nd, 13th, 42nd, 57th, 60th and 71st observations is larger than the cut-off
point.

predict d, cooksd

list mpg weight length headroom trunk d if d>4/74


+-----------------------------------------------------+
| mpg weight length headroom trunk d |
|-----------------------------------------------------|
2. | 17 3,350 173 3.0 11 .0734097 |
13. | 21 4,290 204 3.0 13 .1144936 |
42. | 28 3,260 170 2.0 11 .182582 |
57. | 35 2,020 165 2.0 8 .06417 |
60. | 21 2,130 161 2.5 16 .0719379 |
|-----------------------------------------------------|
71. | 41 2,040 155 3.0 15 .3851261 |
+-----------------------------------------------------+

Now let's take a look at DFITS. The cut-off point for DFITS is 2*sqrt(k/n). DFITS
can be either positive or negative, with numbers close to zero corresponding to the
points with small or zero influence.

predict ginf2, dfits

list mpg weight length headroom trunk ginf2 if ginf2> 2*sqrt(4/74)


+-----------------------------------------------------+
| mpg weight length headroom trunk ginf2 |
|-----------------------------------------------------|
13. | 21 4,290 204 3.0 13 .7697008 |
42. | 28 3,260 170 2.0 11 .977092 |
57. | 35 2,020 165 2.0 8 .5864823 |
66. | 35 2,050 164 2.5 11 .4819245 |
71. | 41 2,040 155 3.0 15 1.585998 |
+-----------------------------------------------------+

The above measures are general measures of influence. You can also consider more
specific measures of influence that assess how each coefficient is changed by deleting
an observation. This measure is called DFBETA and is created for each of the
predictors. Apparently this is more computationally intensive than summary statistics
such as Cook's D, since the more predictors a model has, the more computation it may
involve.
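A minimal sketch using Stata's dfbeta command, which after regress creates one _dfbeta_* variable per predictor; the 2/sqrt(n) cut-off used below is a common convention rather than something stated in this material.

* Compute DFBETAs for every predictor, then flag large values for the first one
dfbeta
list make _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(74) & !missing(_dfbeta_1)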

We have explored a number of the statistics that we can get after the regress
command. There are also several graphs that can be used to search for unusual and
influential observations; we leave them for further reading.

B) Checking Validity of OLS Assumptions in a Data set

1. Test for Linearity of the Model:

When we do linear regression, we assume that the relationship between the response
variable and the predictors is linear. This is the assumption of linearity. If this
assumption is violated, the linear regression will try to fit a straight line to data that
does not follow a straight line. Checking the linear assumption in the case of simple
regression is straightforward, since we only have one predictor. All we have to do is a
scatter plot between the response variable and the predictor to see if nonlinearity is
present, such as a curved band or a big wave-shaped curve.

Example 6.5

use auto.dta, clear


regress mpg weight

Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 134.62
Model | 1591.9902 1 1591.9902 Prob > F = 0.0000
Residual | 851.469256 72 11.8259619 R-squared = 0.6515
-------------+------------------------------ Adj R-squared = 0.6467
Total | 2443.45946 73 33.4720474 Root MSE = 3.4389

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0060087 .0005179 -11.60 0.000 -.0070411 -.0049763
_cons | 39.44028 1.614003 24.44 0.000 36.22283 42.65774
------------------------------------------------------------------------------

Below we use the scatter command to show a scatterplot predicting mpg from
weight, use lfit to show a linear fit, and then lowess to show a lowess smoother
predicting mpg from weight. The different types of plots confirm a linear relationship
between mpg and weight.

twoway (scatter mpg weight) (lfit mpg weight) (lowess mpg weight)


[Figure: scatter of mpg against weight with linear fit and lowess smoother overlaid; x-axis Weight (lbs.); legend: Mileage (mpg), Fitted values, lowess mpg weight]

Example 6.6

Checking the linearity assumption is not so straightforward in the case of multiple
regression. We will try to illustrate some of the techniques that you can use. The
most straightforward thing to do is to plot the standardized residuals against each of
the predictor variables in the regression model. If there is a clear nonlinear pattern,
there is a problem of nonlinearity. Otherwise, we should see for each of the plots just
a random scatter of points. Let's continue to use the dataset auto.dta, specifying a
multiple linear regression model as below.

regress mpg weight length

Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 2, 71) = 69.34
Model | 1616.08062 2 808.040312 Prob > F = 0.0000
Residual | 827.378835 71 11.653223 R-squared = 0.6614
-------------+------------------------------ Adj R-squared = 0.6519
Total | 2443.45946 73 33.4720474 Root MSE = 3.4137

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0038515 .001586 -2.43 0.018 -.0070138 -.0006891
length | -.0795935 .0553577 -1.44 0.155 -.1899736 .0307867
_cons | 47.88487 6.08787 7.87 0.000 35.746 60.02374
------------------------------------------------------------------------------
predict r, resid

scatter r weight
[Figure: residuals plotted against weight; x-axis Weight (lbs.), y-axis Residuals]

scatter r length
[Figure: residuals plotted against length; x-axis Length (in.), y-axis Residuals]


The two residual-versus-predictor plots above do not strongly indicate a clear
departure from linearity. Another command for detecting non-linearity is acprplot.
acprplot graphs an augmented component-plus-residual plot, a.k.a. an augmented
partial residual plot. It can be used to identify nonlinearities in the data. Let's use the
acprplot command for weight and length and use the lowess lsopts(bwidth(1))
options to request lowess smoothing with a bandwidth of 1.

In the first plot below, the smoothed line is close to the ordinary regression line,
except for some deviations at the end and middle points. The second plot does seem
more problematic at the left end; this may come from some potentially influential
points. Overall, they don't look too bad and we shouldn't be too concerned about
non-linearities in the data.

acprplot weight, lowess lsopts(bwidth(1))


[Figure: augmented component-plus-residual plot for weight; x-axis Weight (lbs.), y-axis Augmented component plus residual]

acprplot length, lowess lsopts(bwidth(1))


[Figure: augmented component-plus-residual plot for length; x-axis Length (in.), y-axis Augmented component plus residual]


We have seen how to use acprplot to detect nonlinearity. However, this example
didn't show much nonlinearity. A more interesting example can be found in the
accompanying do-file of this material.

2. Test for Normal Errors:

Residuals are assumed to be normally distributed. Like the other assumptions, this
assumption needs to be tested. The Normality of residuals is only required for valid
hypothesis testing, that is, the normality assumption assures that the p-values for the t-
tests and F-test will be valid. Normality is not required in order to obtain unbiased
estimates of the regression coefficients. OLS regression merely requires that the
residuals (errors) be identically and independently distributed.

After we run a regression analysis, we can use the predict command to create
residuals and then use commands such as kdensity, qnorm, pnorm, sktest to check
the normality of the residuals.

Graphical tests for normality:


A graphical approach to testing the normality of the residuals is to request a qq plot of
the residuals against the normal distribution with the command:

sysuse auto.dta, clear


regress mpg weight length trunk headroom
predict resid, residual
qnorm resid
[Figure: Residual-Normal Quantile Plot; x-axis Inverse Normal, y-axis Residuals]

The above command generates the plot in figure 4.xx. The straight diagonal line
represents the theoretical normal distribution, while the dots represent the residuals
of the model. The closer they cleave to the straight diagonal line, the more normal
the distribution is said to be. As you see above, the results from qnorm show a very
slight deviation from normality at the upper tail.


You can also use the kdensity command to produce a kernel density plot, with the
normal option requesting that a normal density be overlaid on the plot. kdensity
stands for kernel density estimate. It can be thought of as a histogram with narrow
bins and a moving average.

kdensity resid, normal

[Figure: kernel density estimate of the residuals with a normal density overlaid; x-axis Residuals, y-axis Density]

The pnorm command graphs a standardized normal probability (P-P) plot while
qnorm plots the quantiles of a variable against the quantiles of a normal distribution.
pnorm is sensitive to non-normality in the middle range of data and qnorm is
sensitive to non-normality near the tails. As you see below, the results from pnorm
show deviation from normal at the upper half.

pnorm resid
[Figure: standardized normal probability plot of the residuals; x-axis Empirical P[i] = i/(N+1), y-axis Normal F[(resid-m)/s]]


Numerical tests for normality:

sktest: Previously, on ANOVA, the Kolmogorov-Smirnov test was used to determine
whether the distribution of the residuals was statistically significantly different from
that of a theoretical normal distribution. This can be done by

sktest resid

Skewness/Kurtosis tests for Normality


------- joint ------
Variable | Pr(Skewness) Pr(Kurtosis) adj chi2(2) Prob>chi2
-------------+-------------------------------------------------------
resid | 0.000 0.002 18.29 0.0001

The sktest reveals that the p-value is small (p-value = 0.0001), indicating that we can
reject the hypothesis that the residuals are normally distributed.

swilk: Another test available is the swilk test which performs the Shapiro-Wilk W test
for normality. The p-value is based on the assumption that the distribution is normal.

swilk resid
Shapiro-Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+-------------------------------------------------
resid | 74 0.92004 5.149 3.575 0.00018

Analogous to sktest, the swilk test also rejects the claim that the residuals are normally
distributed, since the p-value is small (p-value = 0.00018).

3. Test for Constant Variance Errors (Homoscedasticity):

One of the main assumptions for the ordinary least squares regression is the
homogeneity of variance of the residuals. If the model is well-fitted, there should be
no pattern to the residuals plotted against the fitted values. If the variance of the
residuals is non-constant then the residual variance is said to be "heteroscedastic."
There are graphical and non-graphical methods for detecting heteroscedasticity.

Graphical methods for detecting heteroscedasticity

A commonly used graphical method is to plot the residuals versus the fitted
(predicted) values. We do this by issuing the rvfplot command, which requests a
plot of the residuals against the fitted values that can reveal heteroskedasticity:

rvfplot
[Figure: Residual versus Fitted values plot; x-axis Fitted values, y-axis Residuals]

Alternatively, we can also use the rvfplot command with the yline(0) option to put a
reference line at y=0. We see that the pattern of the data points is getting a little
narrower towards the right end, which is an indication of heteroscedasticity.

rvfplot, yline(0)
[Figure: residuals versus fitted values with a reference line at y = 0; x-axis Fitted values, y-axis Residuals]


Non-graphical methods for detecting heteroscedasticity

Stata allows us to detect heteroscedasticity of the error term using the Breusch-Pagan
test. The Breusch-Pagan test for the violation of homoscedasticity is performed with
the hettest command.

hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity


Ho: Constant variance
Variables: fitted values of mpg

chi2(1) = 14.54
Prob > chi2 = 0.0001

The significant result from the Breusch-Pagan / Cook-Weisberg test indicates that the
regression of the residuals on the predicted values reveals significant
heteroskedasticity.

imtest
Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
Source | chi2 df p
---------------------+-----------------------------
Heteroskedasticity | 18.45 14 0.1872
Skewness | 10.40 4 0.0342
Kurtosis | 1.49 1 0.2220
---------------------+-----------------------------
Total | 30.34 19 0.0477
---------------------------------------------------

imtest performs Cameron & Trivedi's decomposition of the information matrix (IM) test, which is closely related to White's test. It tests the null hypothesis that the variance of the residuals is homogeneous; if the p-value is very small, we reject that null in favour of the alternative that the variance is not homogeneous. Here the heteroskedasticity component alone is insignificant (p = 0.1872), but the overall statistic (p = 0.0477) still casts some doubt on the null hypothesis that the variance is homogeneous.

Both tests shown above are very sensitive to model assumptions, such as the
assumption of normality. Therefore it is a common practice to combine the tests with
diagnostic plots to make a judgment on the severity of the heteroscedasticity and to
decide if any correction is needed for heteroscedasticity.

4. Test for Multicollinearity:

When there is a perfect linear relationship among the predictors, the estimates for a
regression model cannot be uniquely computed. The term collinearity implies that two
variables are near perfect linear combinations of one another. When more than two
variables are involved it is often called multicollinearity, although the two terms are
often used interchangeably.

The primary concern is that as the degree of multicollinearity increases, the regression
model estimates of the coefficients become unstable and the standard errors for the
coefficients can get wildly inflated. In this section, we will explore some Stata
commands that help to detect multicollinearity.


We can use the vif command after the regression to check for multicollinearity. vif
stands for variance inflation factor. As a rule of thumb, a variable whose VIF values
are greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is
used by many researchers to check on the degree of collinearity. A tolerance value
lower than 0.1 is comparable to a VIF value of 10. It means that the variable could be
considered as a linear combination of other independent variables.

We can test for multicollinearity with the pwcorr command and the vif command,
although STATA checks for this problem and automatically drops perfectly collinear
predictor variables prior to estimation. The command

pwcorr weight length trunk headroom, obs sig

generates the correlation matrix of the predictor variables shown below.


| weight length trunk headroom
-------------+------------------------------------
weight | 1.0000
|
| 74
|
length | 0.9460 1.0000
| 0.0000
| 74 74
|
trunk | 0.6722 0.7266 1.0000
| 0.0000 0.0000
| 74 74 74
|
headroom | 0.4835 0.5163 0.6620 1.0000
| 0.0000 0.0000 0.0000
| 74 74 74 74

The higher the correlations between the predictor variables, the greater the multicollinearity. The variance inflation factor, VIF, is the reciprocal of the complement of the squared multiple correlation among the predictors: VIF = 1/(1 − R²), where R² is obtained from regressing the given predictor on the other predictors. VIF values greater than 10 indicate possible problems.

vif
Variable | VIF 1/VIF
-------------+----------------------
length | 11.11 0.090047
weight | 9.56 0.104549
trunk | 2.79 0.358189
headroom | 1.79 0.558932
-------------+----------------------
Mean VIF | 6.31

From the above Stata output, length is a multicollinear variable with VIF = 11.11 > 10.
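As a quick check of the formula, the VIF for length can be reproduced by hand (a minimal sketch, assuming the auto data used above are still in memory): regress length on the other predictors and apply VIF = 1/(1 − R²).

quietly regress length weight trunk headroom
display "VIF for length = " 1/(1 - e(r2))

This should match the value of 11.11 reported by vif.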

5. Model Specification

Other questions that arise are that of proper functional form. Is this a linear model?
Does the dependent variable or independent variable have to be transformed? Or is
there any important variable omitted or an irrelevant variable included in the model?


A model specification error can occur when one or more relevant variables are
omitted from the model or one or more irrelevant variables are included in the model.
If relevant variables are omitted from the model, the common variance they share
with included variables may be wrongly attributed to those variables, and the error
term is inflated. On the other hand, if irrelevant variables are included in the model,
the common variance they share with included variables may be wrongly attributed to
them. Model specification errors can substantially affect the estimate of regression
coefficients.

There are a couple of methods to detect specification errors. The linktest command
performs a model specification link test for single-equation models. linktest is based
on the idea that if a regression is properly specified, one should not be able to find any
additional independent variables that are significant except by chance. linktest creates
two new variables, the variable of prediction, _hat, and the variable of squared
prediction, _hatsq. The model is then refit using these two variables as predictors.
_hat should be significant since it is the predicted value. On the other hand, _hatsq should not be significant: if our model is specified correctly, the squared predictions should have little explanatory power. That is, we would not expect _hatsq to be a significant predictor if our model is specified correctly. So we will be looking at the p-value for _hatsq.

linktest

Source | SS df MS Number of obs = 74


-------------+------------------------------ F( 2, 71) = 74.22
Model | 1652.88437 2 826.442185 Prob > F = 0.0000
Residual | 790.575089 71 11.1348604 R-squared = 0.6765
-------------+------------------------------ Adj R-squared = 0.6673
Total | 2443.45946 73 33.4720474 Root MSE = 3.3369

------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_hat | -.3188663 .7367366 -0.43 0.666 -1.787877 1.150145
_hatsq | .0312539 .0173478 1.80 0.076 -.0033365 .0658444
_cons | 13.22945 7.562782 1.75 0.085 -1.850308 28.30921
------------------------------------------------------------------------------

The Stata output reveals that our regression model is not correctly specified: the coefficient on the predicted dependent variable (_hat) is statistically insignificant, while the coefficient on its square (_hatsq) is significant at the 10% level. The linktest thus rejects the hypothesis that the model is correctly specified.

estat ovtest performs two flavors of the Ramsey regression specification error test
(RESET) for omitted variables. This test amounts to fitting y=xb+zt+u and then
testing t=0. If option rhs is not specified, powers of the fitted values are used for z. If
rhs is specified, powers of the individual elements of x are used.
Syntax for estat ovtest

estat ovtest
The output of the above command is as below:

Ramsey RESET test using powers of the fitted values of mpg


Ho: model has no omitted variables
F(3, 66) = 2.09
Prob > F = 0.1101


The ovtest result, on the contrary, fails to reject the null hypothesis of no omitted variables, indicating no model specification error. Given two contradictory results, we had better reconsider our model.

6. Test for Independence in Errors (Test for Autocorrelation):

Another OLS assumption is that the errors associated with one observation are not correlated with the errors of any other observation. This assumption covers several different situations.

Consider the case of collecting data from different villages in different regions. It is
likely that the households within each village will tend to be more alike than
households from different villages, that is, their errors are not independent. [We will
see the regress command with cluster option on further topics in regression].

Another way in which the assumption of independence can be broken is when data are collected on the same variables over time (time series data). In this situation it is likely that the errors for observations in adjacent time periods will be more highly correlated than those for observations more separated in time. This is known as autocorrelation. When your data can be considered time series, you should use the estat dwatson command (formerly dwstat), which performs a Durbin-Watson test for correlated residuals.

• estat dwatson computes the Durbin-Watson d statistic to test for first-order serial correlation in the disturbance when all the regressors are strictly exogenous.

Syntax for estat dwatson

estat dwatson

• estat durbinalt performs Durbin's alternative test for serial correlation in the
disturbance. This test does not require that all the regressors be strictly
exogenous.

Syntax for estat durbinalt


estat durbinalt [, durbinalt_options]

• estat bgodfrey performs the Breusch-Godfrey test for higher-order serial correlation in the disturbance. This test does not require that all the regressors be strictly exogenous.

Syntax for estat bgodfrey


estat bgodfrey [, bgodfrey_options]

Example 6.7

Using the time series data sp500.dta, we seek to investigate the relationship between the closing price (close), the opening price (open) and volume (volume). To estimate this
relationship, first we need to use the tsset command to let Stata know which variable is the time variable. Then we can use the regress command to estimate the model as below:

sysuse sp500.dta
tsset date
regress close open volume
Source | SS df MS Number of obs = 248
-------------+------------------------------ F( 2, 245) = 3534.52
Model | 1798400.01 2 899200.006 Prob > F = 0.0000
Residual | 62329.1814 245 254.404822 R-squared = 0.9665
-------------+------------------------------ Adj R-squared = 0.9662
Total | 1860729.19 247 7533.31657 Root MSE = 15.95

------------------------------------------------------------------------------
close | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
open | .9804932 .0122914 79.77 0.000 .9562828 1.004704
volume | .0001216 .0004141 0.29 0.769 -.0006941 .0009373
_cons | 21.1053 17.04749 1.24 0.217 -12.47304 54.68364
------------------------------------------------------------------------------

estat dwatson
Number of gaps in sample: 55

Durbin-Watson d-statistic( 3, 248) = 1.551443

The Durbin-Watson statistic ranges from 0 to 4 with a midpoint of 2. The observed value in our example is 1.55. We need to compare the d-statistic computed by Stata with the critical values from a Durbin-Watson table.

The hypothesis
Ho: No Serial Correlation (no autocorrelation)
H1: There is Serial Correlation (there is autocorrelation)

Comparing the computed and tabulated d-statistics

The computed d-statistic is 1.551443, and the tabulated critical values are dL = 1.75 and dU = 1.79 (for k = 3 and n = 200), so that 4 − dL = 2.25 and 4 − dU = 2.21.

Decision Rule

1. If d is less than dL or greater than 4 − dL, we reject the null hypothesis of no autocorrelation in favour of the alternative, which implies the existence of autocorrelation.

2. If d lies between dU and 4 − dU, accept the null hypothesis of no autocorrelation.

3. If, however, d lies between dL and dU or between 4 − dU and 4 − dL, the D.W. test is inconclusive.
Conclusion
Since the computed d-statistic is less than dL, we reject the null hypothesis of no autocorrelation. [Caution: the reliability of the Durbin-Watson test is subject to fulfilment of the basic assumptions that underlie the test.]
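As a small scripted sketch of this comparison (the critical values are entered by hand from the table, as above), note that estat dwatson stores the computed statistic in r(dw):

estat dwatson
display "computed d = " r(dw)
display "d < dL, i.e. reject Ho? " (r(dw) < 1.75)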

estat durbinalt
Number of gaps in sample: 55

Durbin's alternative test for autocorrelation


---------------------------------------------------------------------------
lags(p) | chi2 df Prob > chi2
-------------+-------------------------------------------------------------
1 | 0.096 1 0.7567
---------------------------------------------------------------------------
H0: no serial correlation

estat bgodfrey
Number of gaps in sample: 55

Breusch-Godfrey LM test for autocorrelation


---------------------------------------------------------------------------
lags(p) | chi2 df Prob > chi2
-------------+-------------------------------------------------------------
1 | 0.098 1 0.7548
---------------------------------------------------------------------------
H0: no serial correlation

Unlike the Durbin-Watson statistic, the last two commands can also check for higher-order serial correlation; by default, as here, they test for first-order serial correlation.

A simple visual check would be to plot the residuals versus the time variable.

predict resid, r
scatter resid date
[Graph: residuals plotted against date]

Example 6.8

Using the time series data sp500.dta, we re-specify the model in Example 6.7 above to additionally take into account the effect of the previous closing price (lagclose) on today’s closing price (close). To estimate this relationship, first we need to use the tsset command to let Stata know which variable is the time variable. Then we can use the regress command to estimate the model as below:

sysuse sp500.dta

tsset date

gen lagclose= close[_n-1]
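Since the data have been tsset, an equivalent way to create the lag is Stata's time-series lag operator, which (unlike the [_n-1] subscript) respects gaps in the time variable:

gen lagclose = L.close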

regress close open volume lagclose


Source | SS df MS Number of obs = 247
-------------+------------------------------ F( 3, 243) = 2375.93
Model | 1791678.01 3 597226.002 Prob > F = 0.0000
Residual | 61081.8429 243 251.365609 R-squared = 0.9670
-------------+------------------------------ Adj R-squared = 0.9666
Total | 1852759.85 246 7531.5441 Root MSE = 15.855

------------------------------------------------------------------------------
close | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
open | -1.898847 4.687278 -0.41 0.686 -11.13173 7.334033
volume | .0001084 .0004126 0.26 0.793 -.0007043 .0009211
lagclose | 2.882026 4.687858 0.61 0.539 -6.351996 12.11605
_cons | 18.22684 16.99459 1.07 0.285 -15.24867 51.70235
------------------------------------------------------------------------------

When the explanatory variables include a lag of the dependent variable, the Durbin-Watson statistic (estat dwatson command) is not appropriate for checking autocorrelation. The appropriate commands are estat durbinalt and/or estat bgodfrey.

estat durbinalt
Number of gaps in sample: 55

Durbin's alternative test for autocorrelation


---------------------------------------------------------------------------
lags(p) | chi2 df Prob > chi2
-------------+-------------------------------------------------------------
1 | 0.769 1 0.3804
---------------------------------------------------------------------------
H0: no serial correlation

estat bgodfrey
Number of gaps in sample: 55

Breusch-Godfrey LM test for autocorrelation


---------------------------------------------------------------------------
lags(p) | chi2 df Prob > chi2
-------------+-------------------------------------------------------------
1 | 0.783 1 0.3763
---------------------------------------------------------------------------
H0: no serial correlation

estat durbinalt, lags(2)


Number of gaps in sample: 55

Durbin's alternative test for autocorrelation


---------------------------------------------------------------------------
lags(p) | chi2 df Prob > chi2
-------------+-------------------------------------------------------------
2 | 1.918 2 0.3833
---------------------------------------------------------------------------
H0: no serial correlation

estat bgodfrey, l(2)


Number of gaps in sample: 55

Breusch-Godfrey LM test for autocorrelation


---------------------------------------------------------------------------
lags(p) | chi2 df Prob > chi2
-------------+-------------------------------------------------------------
2 | 1.950 2 0.3771
---------------------------------------------------------------------------
H0: no serial correlation
The last two commands use the lags() option to set the number of lags, here set to 2.

6.4 Linear Regression with a large dummy-variable set

If we want to estimate a model with a large number of dummy variables as regressors, we can use the following command.

Syntax
• areg depvar [indepvars] [if] [in] [weight], absorb(varname) [options]

Example 6.9

Using auto.dta, an example dataset shipped with Stata, type the command below and see the results.

areg price weight length, absorb(rep78)

Linear regression, absorbing indicators Number of obs = 69


F( 2, 62) = 22.98
Prob > F = 0.0000
R-squared = 0.4341
Adj R-squared = 0.3793
Root MSE = 2294.5

------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | 5.478309 1.158582 4.73 0.000 3.162337 7.794281
length | -109.5065 39.26104 -2.79 0.007 -187.9882 -31.02482
_cons | 10154.62 4270.525 2.38 0.021 1617.96 18691.27
-------------+----------------------------------------------------------------
rep78 | F(4, 62) = 2.079 0.094 (5 categories)

The regressor rep78 has 5 categories, which would translate into 5 dummy variables. If we were to use the regress command, we would have to construct a dummy variable for each category before estimating the model. An alternative is to use the areg command for the same estimation without creating the dummy variables.
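For comparison, a sketch of the equivalent regress approach (the dummy names repdum1-repdum5 are created here purely for illustration): one category is omitted to avoid the dummy-variable trap, and testparm should reproduce the joint F test that areg reports for the absorbed variable.

tab rep78, gen(repdum)
regress price weight length repdum2-repdum5
testparm repdum2-repdum5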

6.5 Non-linearity in an OLS Regression (Different Functional Forms)

In empirical analysis of the relationship between a dependent variable and some set of
independent variables, it is possible that we specify different functional forms
(logarithmic, semi-logarithmic, quadratic and others) to allow nonlinear relationship
between the dependent and the independent variables.

Logarithmic functions are used quite often in applied economics to estimate the elasticity and semi-elasticity of a dependent variable with respect to changes in an independent variable. Besides, for strictly positive variables a logarithmic transformation can help mitigate heteroscedasticity and the effect of outliers on the coefficient estimates.

A logarithmic model can be specified as:


Log-log specification: log yi = β0 + β1 log x1i + β2 log x2i + εi

The coefficients here are interpreted as elasticities of yi with respect to the xi. Specifically, β1 measures the percentage change in yi due to a one-percent change in x1i, ceteris paribus. An appropriate strategy for working out the interpretation of coefficients
for different functional forms is to partially differentiate the dependent variable with respect to the independent variable: ∂log yi / ∂log x1i = β1, which is equivalent to %∆yi / %∆x1i = β1.

Log-level specification: log yi = β0 + β1 log x1i + β2 x2i + εi

Here 100·β2 is interpreted as a semi-elasticity: a one-unit change in x2i results in a 100·β2 percent change in yi, ceteris paribus.
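For example (a purely illustrative value), if β2 = 0.08, then one more unit of x2i is associated with approximately an 8 percent increase in yi.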

Level-log specification: yi = β0 + β1 log x1i + β2 log x2i + β3 x3i + εi

β1/100 measures the effect of a one-percent change in x1i on yi, ceteris paribus.

Quadratic functions are also used quite often in applied economics to capture decreasing or increasing marginal effects. A regression model with a quadratic term in an independent variable can be given as follows:

yi = β0 + β1 x1i + β2 x1i^2 + β3 x2i + νi

In this model the effect of x1i on yi is not simply β1! Rather it is given as ∂yi/∂x1i = β1 + 2 β2 x1i.

Other functional forms also require care in interpreting the coefficients.

Estimation of the aforementioned models in Stata is possible using the regress command once we transform the variables. See the following examples:

Example 6.10

use bwages.dta, clear

***summarizing variables
sum wage exper educ

*** data generation


gen expersq= exper^2
gen expercb= exper^3

*** observing the data before estimation


graph matrix wage exper educ hhsex2

*** Linear Regression analysis


reg wage exper educ hhsex2

***Regression Analysis with Nonlinearity [Quadratic specifications]


reg wage exper expersq educ hhsex2
reg wage exper expersq expercb educ hhsex2
reg wage exper expersq expercb hhsex2
reg wage expersq hhsex2
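Tying back to the quadratic discussion above, a small sketch (assuming bwages.dta is still in memory) evaluates the marginal effect of exper at its sample mean, ∂wage/∂exper = β1 + 2β2·exper, from the stored coefficients of the first quadratic specification:

quietly reg wage exper expersq educ hhsex2
quietly summarize exper
display "marginal effect of exper at its mean = " _b[exper] + 2*_b[expersq]*r(mean)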


Example 6.11

use housing.dta, clear

*** data generation


gen lnprice= log(price)
gen lnrooms= log(bedrooms)
gen lnlotsize= log(lotsize)
gen lnbathrms= log(bathrms)
gen lnstories= log(stories)
tab driveway, gen(dway)
tab recroom, gen(recdy)
tab fullbase, gen(basedmy)
tab gashw, gen(hotwater)
tab airco, gen(aircond)
gen lngarage=log(garagepl)
replace lngarage= 0.001 if lngarage==.
tab prefarea, gen(location)

*** Linear Regression analysis


reg price lotsize bedrooms bathrms aircond2 dway2 recdy2 basedmy2
hotwater2 garagepl location2 stories

***Regression Analysis with Nonlinearity [Logarithmic specifications]


reg lnprice lnlotsize bedrooms bathrms aircond2 dway2 recdy2
basedmy2 hotwater2 garagepl location2 stories

reg price lnlotsize bedrooms bathrms aircond2 dway2 recdy2 basedmy2
hotwater2 garagepl location2 stories


7.0 Further Topics in Regression

In this part we will see various commands that go beyond OLS models. The topics
will include probability models (like probit, logit, multinomial logit and ordered
probit), robust regression methods, regression with censored and truncated data and
regression with measurement error.

7.1 Limited Dependent Variable Regression/Model

Not all dependent variables are continuous, so here we quickly go through some limited dependent variable models, as they are useful in poverty analysis and in other situations where we do not have a continuous dependent variable. For instance, some of the questions we may want to address are: What is the probability that an individual will get a higher degree? What determines labour force participation? What factors drive the incidence of civil war?

Categories of limited dependent variable models

1. Discrete response models
Binary response models (whether a firm exports, or a farmer has adopted a technology)
Multinomial response models (whether an individual is in wage employment, unemployed or in self-employment)
Ordered response models (e.g. credit ratings from A to D)

2. Corner solution models and censored regression models
e.g. modelling household health expenditures: the dependent variable is non-negative, continuous above zero and has a lot of observations at zero

3. Sample selection models
e.g. estimating the returns to education on a sample of wage employees misses observations that are not participating in the labor market, hence the sample is selected.

7.1.1 Probit Regression

Syntax for Probit regression


probit depvar [indepvars] [if] [in] [weight] [, probit_options]

Syntax for Probit regression, reporting marginal effects


dprobit [depvar indepvars [if] [in] [weight]] [, dprobit_options]

Both probit and dprobit fit a maximum-likelihood probit model; dprobit is an alternative to probit.

Rather than reporting the coefficients, dprobit reports the marginal effect, that is the
change in the probability for an infinitesimal change in each independent, continuous
variable and, by default, reports the discrete change in the probability for dummy
variables. probit may be typed without arguments after dprobit estimation to see the
model in coefficient form. If estimating on grouped data, refer to bprobit.


Example 7.1

If you want to study the determinants of participation in the credit market at the household level, first identify the dependent variable that captures participation in the credit market and the possible explanatory variables that affect participation. Usually, participation in the credit market is given by a discrete variable that takes 1 if the household participates in the credit market and 0 otherwise. This is a case where you have a limited dependent variable, and estimating such a model is possible using probit or logit regression. The following commands describe how to estimate probit and logit models of credit market participation.

Open the household level data


use hhdata.dta, clear

Create dummy variables


tab particip, gen(creditm)
tab offfarm,gen(nonfarm)
tab hhsex, gen(sexhh)
tab hhedu, gen(eduhh)
tab hhskill, gen(skillhh)
tab village, gen(tabia)
tab region, gen(zone)
tab density, gen(population)

Generate new variables labour, poverty and natural log of poverty


gen labour= adufem + adumale
gen poverty= deflexp /aduleqv
gen lnpoverty= log(poverty)

Check the correlation among the explanatory variables to avoid the inclusion of highly multicollinear variables
correlate farmsize hhage adufem adumale oxen tlu primary secondary
dismarkt localmat

Estimate credit market participation using probit regression


probit particip sexhh2 eduhh2 skillhh2 farmsize hhage labour tlu
primary secondary dismarkt, robust

predict p                       /* predicted probability of participation */
gen predp= 0
replace predp= 1 if p > 0.5     /* classify as participant if p > 0.5 */
tab predp particip              /* classification table: predicted vs. actual */
drop p

probit particip

probit particip sexhh2 eduhh2 skillhh2 farmsize hhage labour tlu
primary secondary dismarkt if tabia==1

dprobit particip sexhh2 eduhh2 skillhh2 farmsize hhage labour tlu
primary secondary dismarkt, robust

bysort tabia: probit particip sexhh2 eduhh2 skillhh2 farmsize hhage
labour tlu primary secondary dismarkt, robust


7.1.2 Logit/Logistic Regression

Syntax
logit depvar [indepvars] [if] [in] [weight] [, options]

logit fits a maximum-likelihood logit model. depvar=0 indicates a negative outcome, while depvar!=0 & depvar!=. indicates a positive outcome (typically depvar=1). Also see logistic; logistic displays estimates as odds ratios. Many users prefer the logistic command to logit. Results are the same regardless of which you use; both are the maximum-likelihood estimator. A number of auxiliary commands can be run after logit, probit, or logistic estimation; these are described in logistic postestimation (you can refer to the Stata manual). A list of related estimation commands is given in logistic estimation commands. If estimating on grouped data, refer to glogit.

Example 7.2

Using the same data above, estimate credit market participation using logit regression
and see the results
logit particip sexhh2 eduhh2 skillhh2 farmsize hhage adufem adumale
tlu primary secondary dismarkt , robust

logit particip sexhh2 eduhh2 skillhh2 farmsize hhage labour tlu
primary secondary dismarkt if tabia==1

*** Stata has no dlogit command; to obtain marginal effects after logit, use mfx
logit particip sexhh2 eduhh2 skillhh2 farmsize hhage labour tlu
primary secondary dismarkt, robust
mfx

bysort tabia: logit particip sexhh2 eduhh2 skillhh2 farmsize hhage
labour tlu primary secondary dismarkt, robust

7.1.3 Multinomial Logit Model

You may have a limited dependent variable that takes more than two discrete values that cannot be ordered; in such a situation you estimate the relationship using a multinomial logit model.

Syntax
mlogit depvar [indepvars] [if] [in] [weight] [, options]

mlogit fits maximum-likelihood multinomial logit models, also known as polytomous logistic regression.

Example 7.3

Using the same data above, estimate land market participation using multinomial logit
model and see the results
mlogit landmark sexhh2 eduhh2 skillhh2 farmsize hhage adufem adumale
tlu primary secondary dismarkt zone2 zone3 zone4, robust


7.1.4 Ordered Probit/Logit Model

You may have a limited dependent variable that takes more than two discrete values that can be ordered; in such a situation you estimate the relationship using an ordered probit/logit model.

Syntax
oprobit depvar [indepvars] [if] [in] [weight] [, options]
ologit depvar [indepvars] [if] [in] [weight] [, options]

oprobit fits ordered probit models of ordinal variable depvar on the independent
variables indepvars. The actual values taken on by the dependent variable are
irrelevant, except that larger values are assumed to correspond to "higher" outcomes.
Up to 50 outcomes are allowed in Stata/SE and Intercooled Stata, and up to 20
outcomes in Small Stata.

ologit fits ordered logit models of ordinal variable depvar on the independent
variables indepvars. The actual values taken on by the dependent variable are
irrelevant, except that larger values are assumed to correspond to "higher" outcomes.
Up to 50 outcomes are allowed in Stata/SE and Intercooled Stata, and up to 20
outcomes are allowed in Small Stata.

Example 7.5

Using the same data above, estimate land market participation using ordered probit
model and see the results
oprobit landmark sexhh2 eduhh2 skillhh2 farmsize hhage adufem adumale
tlu primary secondary dismarkt zone2 zone3 zone4, robust

7.1.5 Tobit Model

A limited dependent variable may take either zero or positive continuous values. In such a situation Tobit estimation is appropriate. For instance, labor supply, health expenditure, alcohol expenditure and so on take nonnegative continuous values.

Example 7.6

Using the same data above, estimate loan equation using tobit model and see the
results

tobit credit20 sexhh2 eduhh2 skillhh2 farmsize hhage adufem adumale
tlu primary secondary dismarkt, ll(0)

mfx


7.1.6 Heckman sample selection model

Example 7.7

Using the same data above, estimate heckman two-step selection to see the effect of
participation on household welfare (poverty level) and see the results

First step, probit estimation of participation equation :


probit particip sexhh2 eduhh2 skillhh2 farmsize hhage labour tlu
primary secondary dismarkt, robust
predict xbhat, xb
gen imr = normalden(xbhat)/normal(xbhat)

(The xb option is used because predict's default after probit is the predicted probability, whereas the inverse Mills ratio φ(xβ)/Φ(xβ) must be evaluated at the probit index.)

Second step, OLS estimation of household welfare (poverty) determinants or correlates:
reg lnpoverty sexhh2 eduhh2 skillhh2 farmsize hhage adufem adumale
tlu primary secondary dismarkt zone2 zone3 zone4 imr, robust

Alternatively, one could let the heckman command carry out the estimation directly; with the twostep option it performs Heckman's two-step procedure, while omitting twostep gives full maximum likelihood estimation:

heckman lnpoverty sexhh2 eduhh2 skillhh2 farmsize hhage adufem
adumale tlu primary secondary dismarkt zone2 zone3 zone4, twostep
select(particip= sexhh2 eduhh2 skillhh2 farmsize hhage labour tlu
primary secondary dismarkt)
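Note that the selection equation includes labour while the outcome equation instead contains adufem and adumale; such an exclusion restriction helps identify the model beyond the mere nonlinearity of the inverse Mills ratio.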

7.2 Robust Regression Methods

It seems to be a rare dataset that meets all of the assumptions underlying multiple regression. We know that failure to meet assumptions can lead to biased estimates of coefficients and especially biased estimates of the standard errors. This fact explains a lot of the activity in the development of robust regression methods.

The idea behind robust regression methods is to make adjustments in the estimates
that take into account some of the flaws in the data itself. We are going to look at
three approaches to robust regression: 1) regression with robust standard errors
including the cluster option, 2) robust regression using iteratively reweighted least
squares, and 3) quantile regression, more specifically, median regression.

Before we look at these approaches, let's look at a standard OLS regression using the
nlsw88.dta dataset.

sysuse nlsw88.dta, clear

We will look at a model that predicts the hourly wage (wage) using the educational
level (grade), work experience (ttl_exp) and job tenure (tenure). First let's look at
the descriptive statistics for these variables. Note the missing values for grade and
tenure.

sum wage grade ttl_exp tenure
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
wage | 2246 7.766949 5.755523 1.004952 40.74659
grade | 2244 13.09893 2.521246 0 18
ttl_exp | 2246 12.53498 4.610208 .1153846 28.88461
tenure | 2231 5.97785 5.510331 0 25.91667

Below we see the regression predicting hourly wage (wage) using the educational
level (grade), work experience (ttl_exp) and job tenure (tenure). We see that all of
the variables are significant except for tenure.

reg wage grade ttl_exp tenure


Source | SS df MS Number of obs = 2229
-------------+------------------------------ F( 3, 2225) = 128.82
Model | 10963.7014 3 3654.56714 Prob > F = 0.0000
Residual | 63124.2663 2225 28.3704568 R-squared = 0.1480
-------------+------------------------------ Adj R-squared = 0.1468
Total | 74087.9678 2228 33.2531274 Root MSE = 5.3264

------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .64945 .0456072 14.24 0.000 .560013 .738887
ttl_exp | .2329752 .0303483 7.68 0.000 .1734612 .2924891
tenure | .0378054 .0250756 1.51 0.132 -.0113686 .0869794
_cons | -3.86429 .6318619 -6.12 0.000 -5.103391 -2.625189
------------------------------------------------------------------------------

We can use the test command to test total work experience and job tenure, and we
find the overall test of these two variables is significant.

test ttl_exp tenure


( 1) ttl_exp = 0
( 2) tenure = 0
F( 2, 2225) = 54.83
Prob > F = 0.0000

Here is the residual versus fitted plot for this regression. Notice that the pattern of the residuals is not exactly as we would hope: the spread of the residuals is not uniform across the range of fitted values, suggesting some heteroscedasticity.

rvfplot
[Graph: residuals versus fitted values plot]


Below we show the avplots. Although the plots are small, you can see some points
that are of concern. There is not a single extreme point but a handful of points that
stick out. For example, in the top right graph you can see a handful of points that
stick out from the rest. If this were just one or two points, we might look for mistakes
or for outliers, but we would be more reluctant to consider such a large number of
points as outliers.

avplots
[Graph: added-variable plots for grade (coef = .64945, t = 14.24), ttl_exp (coef = .23297518, t = 7.68) and tenure (coef = .03780541, t = 1.51)]

Here is the lvr2plot for this regression. We see 4 points that are somewhat high in
both their leverage and their residuals.

lvr2plot
[Graph: leverage versus normalized residual squared plot]


None of these results are dramatic problems, but the rvfplot suggests that there might be some outliers and some possible heteroscedasticity, and the avplots have some observations that look to have high leverage, although the lvr2plot does not show such influential observations. We might wish to use something other than OLS regression to estimate this model. In the next several sections we will look at some robust regression methods.

7.3 Regression with Robust Standard Errors

The Stata regress command includes a robust option for estimating the standard
errors using the Huber-White sandwich estimators. Such robust standard errors can
deal with a collection of minor concerns about failure to meet assumptions, such as
minor problems about normality, heteroscedasticity, or some observations that exhibit
large residuals, leverage or influence. For such minor problems, the robust option may
effectively deal with these concerns.

With the robust option, the point estimates of the coefficients are exactly the same as
in ordinary OLS, but the standard errors take into account issues concerning
heterogeneity and lack of normality. Here is the same regression as above using the
robust option. Note the changes in the standard errors and t-tests (but no change in
the coefficients). In this particular example, using robust standard errors did not
change any of the conclusions from the original OLS regression.

reg wage grade ttl_exp tenure, robust

Linear regression Number of obs = 2229


F( 3, 2225) = 142.34
Prob > F = 0.0000
R-squared = 0.1480
Root MSE = 5.3264

------------------------------------------------------------------------------
| Robust
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .64945 .0446901 14.53 0.000 .5618113 .7370887
ttl_exp | .2329752 .0290066 8.03 0.000 .1760924 .289858
tenure | .0378054 .0243765 1.55 0.121 -.0099976 .0856084
_cons | -3.86429 .5998228 -6.44 0.000 -5.040561 -2.688019
------------------------------------------------------------------------------

7.4 Using the Cluster Option

As described earlier, OLS regression assumes that the residuals are independent. The
nlsw88 dataset contains data on 2246 individuals that come from 12 industries. It is
very possible that the wage within each industry may not be independent, and this
could lead to residuals that are not independent within industry. We can use the
cluster option to indicate that the observations are clustered into industry (based on
industry) and that the observations may be correlated within industry, but would be
independent between industries.

By the way, if we did not know the number of industries, we could quickly find out
how many industries there are as shown below, by quietly tabulating industry and
then displaying the macro r(r) which gives the numbers of rows in the table, which is
the number of industries in our data.

quietly tabulate industry
display r(r)
12

Now, we can run regress with the cluster option. We do not need to include the robust
option since robust is implied with cluster. Note that the standard errors have changed
substantially, much more so, than the change caused by the robust option by itself.

reg wage grade ttl_exp tenure, cluster( industry)


Linear regression Number of obs = 2215
F( 3, 11) = 83.87
Prob > F = 0.0000
R-squared = 0.1464
Number of clusters (industry) = 12 Root MSE = 5.3391

------------------------------------------------------------------------------
| Robust
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .6505642 .0482345 13.49 0.000 .5444008 .7567276
ttl_exp | .2322409 .0440692 5.27 0.000 .1352454 .3292365
tenure | .0362508 .0268599 1.35 0.204 -.0228675 .0953691
_cons | -3.85212 .6186058 -6.23 0.000 -5.213662 -2.490578
------------------------------------------------------------------------------

As with the robust option, the estimate of the coefficients are the same as the OLS
estimates, but the standard errors take into account that the observations within
industry are non-independent. Even though the standard errors are larger in this
analysis, the two variables that were significant in the OLS analysis are significant in
this analysis as well. These standard errors are computed based on aggregate wages
for the 12 industries, since these industry-level wages should be independent. If you have a very small number of clusters compared to your overall sample size, it is possible that the standard errors will be quite a bit larger than the OLS results, which is the case here.

7.5 Robust Regression

The Stata rreg command performs a robust regression using iteratively reweighted
least squares, i.e., rreg assigns a weight to each observation with higher weights
given to better behaved observations. In fact, extremely deviant cases, those with
Cook's D greater than 1, can have their weights set to missing so that they are not
included in the analysis at all.

We will use rreg with the generate option so that we can inspect the weights used to
weight the observations. Note that in this analysis both the coefficients and the
standard errors differ from the original OLS regression. Below we show the same
analysis using robust regression using the rreg command.

rreg wage grade ttl_exp tenure, gen (wt)

Huber iteration 1: maximum difference in weights = .90508229


Huber iteration 2: maximum difference in weights = .25138724
Huber iteration 3: maximum difference in weights = .08779832
Huber iteration 4: maximum difference in weights = .02881517
Biweight iteration 5: maximum difference in weights = .29168281
Biweight iteration 6: maximum difference in weights = .05586415
Biweight iteration 7: maximum difference in weights = .01749881
Biweight iteration 8: maximum difference in weights = .0049518

Robust regression Number of obs = 2229
F( 3, 2225) = 376.09
Prob > F = 0.0000

------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .5179226 .0233127 22.22 0.000 .4722057 .5636395
ttl_exp | .1511732 .0155129 9.74 0.000 .1207519 .1815946
tenure | .1214573 .0128177 9.48 0.000 .0963213 .1465932
_cons | -2.6662 .3229847 -8.25 0.000 -3.299583 -2.032817
------------------------------------------------------------------------------

If you compare the robust regression results (directly above) with the OLS results previously presented, you can see that the coefficients, standard errors, t values and p values are different, consistent with the problems we found in the data when we performed the OLS analysis. Given this difference, we could further investigate the reasons why the OLS and robust regression results differ. [Of the two sets of results, the robust regression results would probably be the more trustworthy.]

Let's calculate and look at the predicted (fitted) values (p), the residuals (r), and the
leverage (hat) values (h). Note that we are including if e(sample) in the commands
because rreg can generate weights of missing and you wouldn't want to have
predicted values and residuals for those observations.

predict p if e(sample)
(option xb assumed; fitted values)
(17 missing values generated)

predict r if e(sample), resid


(17 missing values generated)

predict h if e(sample), hat


(17 missing values generated)

Now, let's check on the various predicted values and the weighting. First, we will sort by wt; then we will look at the first 100 observations. Notice that the smallest weights are zero, slowly growing into the 0.2 range.

sort wt
list idcode wage p r h wt in 1/100
+-----------------------------------------------------------------+
| idcode wage p r h wt |
|-----------------------------------------------------------------|
1. | 236 24.66183 8.54511 16.11672 0 0 |
2. | 3250 40.19808 2.772077 37.42601 0 0 |
3. | 856 39.23074 5.928342 33.30239 0 0 |
4. | 1209 25.80515 7.041585 18.76356 0 0 |
5. | 2942 17.34411 5.361766 11.98234 0 0 |
|-----------------------------------------------------------------|
6. | 2792 30.19324 8.044289 22.14895 0 0 |
7. | 3937 21.38486 5.048339 16.33652 0 0 |
8. | 3648 40.19808 3.987089 36.21099 0 0 |
9. | 3507 40.19808 9.363749 30.83433 0 0 |
10. | 1863 20.64412 8.206309 12.43781 0 0 |
.. |-----------------------------------------------------------------|
.. |-----------------------------------------------------------------|
71. | 4801 38.70926 6.55222 32.15704 0 0 |
72. | 2049 30.92161 7.202032 23.71957 0 0 |
73. | 2613 40.74659 6.901216 33.84538 0 0 |
74. | 1664 20.64412 9.186622 11.4575 .0000133 .0072777 |
75. | 3999 19.35587 7.912395 11.44348 .0000101 .00763929 |
|-----------------------------------------------------------------|

76. | 1480 20.64412 9.59372 11.0504 .0000447 .02228932 |
77. | 2188 17.77777 6.879003 10.89877 .0001206 .02967965 |
78. | 3734 18.58292 7.820083 10.76284 .000088 .03683624 |
79. | 1746 17.66505 6.942526 10.72252 .0000822 .03939117 |
80. | 1065 18.58292 7.971502 10.61142 .0000955 .04648615 |
.. |-----------------------------------------------------------------|
.. |-----------------------------------------------------------------|
..
96. | 4045 19.35587 10.49595 8.859919 .000759 .20543025 |
97. | 2106 17.02898 8.235084 8.793897 .0007966 .2131063 |
98. | 2548 17.41545 8.769694 8.645755 .0002875 .22976365 |
99. | 716 16.64251 8.007967 8.634542 .0004617 .23055002 |
100. | 641 18.38969 9.779596 8.610098 .0007486 .23394647 |
+-----------------------------------------------------------------+

Now, let's look at the last 100 observations. The weights for the last 100 observations are all very close to one, except that the values for observations 2230 onward are missing due to the missing predictors. Note that the observations above with the lowest weights are also those with the largest residuals (residuals over 8), while the observations below with the highest weights have very low residuals (all less than 0.14).

list idcode wage p r h wt in 2146/2246


+-----------------------------------------------------------------+
| idcode wage p r h wt |
|-----------------------------------------------------------------|
2146. | 3353 10.83736 10.70236 .1349934 .0037811 .99977254 |
2147. | 2373 5.837359 5.705997 .1313627 .0020209 .99977285 |
2148. | 1557 5.611914 5.736039 -.1241249 .0016357 .99977292 |
2149. | 2178 5.032206 4.905342 .1268642 .0011721 .99977884 |
2150. | 4116 5.636071 5.759209 -.1231381 .0010271 .99978204 |
.. |-----------------------------------------------------------------|
.. |-----------------------------------------------------------------|
..
2226. | 2183 4.17069 4.158518 .0121717 .002497 .99999789 |
2227. | 2876 3.526568 3.538591 -.0120237 .0026671 .99999881 |
2228. | 3728 9.114332 9.100998 .0133343 .0013403 .99999917 |
2229. | 4102 5.233495 5.237946 -.0044506 .0008686 .9999995 |
2230. | 1076 6.964568 . . . . |
|-----------------------------------------------------------------|
2231. | 1186 3.526568 . . . . |
2232. | 1620 2.491696 . . . . |
2233. | 5079 4.146536 . . . . |
2234. | 3027 2.938808 . . . . |
2235. | 4394 2.214171 . . . . |
|-----------------------------------------------------------------|
2236. | 1155 7.045088 . . . . |
2237. | 990 7.045088 . . . . |
2238. | 1577 3.285022 . . . . |
2239. | 3214 5.072463 . . . . |
2240. | 384 4.830918 . . . . |
|-----------------------------------------------------------------|
2241. | 2957 2.415459 . . . . |
2242. | 961 7.045088 . . . . |
2243. | 327 4.025765 . . . . |
2244. | 2606 3.961351 . . . . |
2245. | 3199 1.151368 . . . . |
|-----------------------------------------------------------------|
2246. | 5098 2.648529 . . . . |
+-----------------------------------------------------------------+
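To quantify the inverse relationship between the weights and the size of the residuals seen in these listings, a small sketch (the variable absr is created here only for illustration):

gen absr = abs(r)
pwcorr wt absr, obs
drop absr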

After using rreg, it is possible to generate predicted values, residuals and leverage (hat) values, but most of the regression diagnostic commands are not available after rreg. We will have to create some of them ourselves. Here, of course, is the graph of residuals versus fitted (predicted) values with a line at zero. This plot looks much like the OLS plot, except that in OLS all of the observations would be weighted equally, but as we saw above the observations with the greatest residuals are weighted less and hence have less influence on the results.

scatter r p, yline(0)

[Graph: residuals versus fitted values from rreg, with reference line at y = 0]

To get lvr2plot we have to go through several steps in order to get the normalized
squared residuals and the means of both the residuals and the leverage (hat) values.

First, we generate the residual squared (r2) and then divide it by the sum of the
squared residuals. We then compute the mean of this value and save it as a local
macro called rm (which we will use for creating the leverage vs. residual plot).

generate r2=r^2
(17 missing values generated)

sum r2
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
r2 | 2229 29.70332 136.1406 .0000198 1400.706

replace r2 = r2/r(sum)
(2229 real changes made)

sum r2
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
r2 | 2229 .0004486 .0020562 2.99e-10 .0211559

local rm = r(mean)

Next we compute the mean of the leverage and save it as a local macro called hm.

sum h
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
h | 2229 .0017796 .0011727 0 .0106942
local hm = r(mean)


Now, we can plot the leverage against the residual squared as shown below.
Comparing the plot below with the plot from the OLS regression, this plot is much
better behaved. There are no longer points in the upper right quadrant of the graph.

scatter h r2, yline(`hm') xline(`rm')

[Graph: leverage versus normalized residual squared, with reference lines at the means]

Let's close out this analysis by deleting our temporary variables.

drop wt p r h r2

7.6 Quantile Regression

Quantile regression, in general, and median regression, in particular, might be


considered as an alternative to rreg. The Stata command qreg does quantile
regression. qreg without any options will actually do a median regression in which
the coefficients will be estimated by minimizing the absolute deviations from the
median. Of course, as an estimate of central tendency, the median is a resistant
measure that is not as greatly affected by outliers as is the mean. It is not clear that median regression is a resistant estimation procedure; in fact, there is some evidence that it can be affected by high leverage values.

Here is what the quantile regression looks like using Stata's qreg command.
Comparing the quantile regression results with the OLS results previously presented,
you can see that the coefficients, standard errors, t values and p values are different.
Moreover, the coefficient for the variable job tenure (tenure) is statistically
significant in contrast to the result in OLS.

qreg wage grade ttl_exp tenure

Iteration 1: WLS sum of weighted deviations = 6819.5446

Iteration 1: sum of abs. weighted deviations = 6818.4424


Iteration 2: sum of abs. weighted deviations = 6601.0626
Iteration 3: sum of abs. weighted deviations = 6499.3684

Iteration 4: sum of abs. weighted deviations = 6462.9259
Iteration 5: sum of abs. weighted deviations = 6451.8073
Iteration 6: sum of abs. weighted deviations = 6443.9252
Iteration 7: sum of abs. weighted deviations = 6442.3824
Iteration 8: sum of abs. weighted deviations = 6442.0261
Iteration 9: sum of abs. weighted deviations = 6440.5183
Iteration 10: sum of abs. weighted deviations = 6440.4227
Iteration 11: sum of abs. weighted deviations = 6438.443
Iteration 12: sum of abs. weighted deviations = 6438.3123
Iteration 13: sum of abs. weighted deviations = 6438.162
Iteration 14: sum of abs. weighted deviations = 6437.946
Iteration 15: sum of abs. weighted deviations = 6437.9068
Iteration 16: sum of abs. weighted deviations = 6437.8968
Iteration 17: sum of abs. weighted deviations = 6437.8601
Iteration 18: sum of abs. weighted deviations = 6437.8527
Iteration 19: sum of abs. weighted deviations = 6437.8321
Iteration 20: sum of abs. weighted deviations = 6437.8175
Iteration 21: sum of abs. weighted deviations = 6437.8175

Median regression Number of obs = 2229


Raw sum of deviations 7774.197 (about 6.2801929)
Min sum of deviations 6437.817 Pseudo R2 = 0.1719

------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .54382 .0292851 18.57 0.000 .486391 .601249
ttl_exp | .1645623 .0194886 8.44 0.000 .1263445 .2027802
tenure | .1203565 .016107 7.47 0.000 .0887702 .1519428
_cons | -3.17323 .4055463 -7.82 0.000 -3.968518 -2.377941
------------------------------------------------------------------------------

The qreg command has even fewer diagnostic options than rreg does. About the only
values we can obtain are the predicted values and the residuals.

predict p if e(sample)
(option xb assumed; fitted values)
(17 missing values generated)

predict r if e(sample), r
(17 missing values generated)

scatter r p, yline(0)
[Graph: residuals versus fitted values from the median regression, with reference line at y = 0]


Stata has three additional commands that can do quantile regression.

iqreg estimates interquantile regressions, regressions of the difference in quantiles. The estimated variance-covariance matrix of the estimators is obtained via bootstrapping.

sqreg estimates simultaneous-quantile regression. It produces the same coefficients as qreg for each quantile. sqreg obtains a bootstrapped variance-covariance matrix of the estimators that includes between-quantiles blocks. Thus, one can test and construct confidence intervals comparing coefficients describing different quantiles.

bsqreg is the same as sqreg with one quantile. sqreg is, therefore, faster than bsqreg.
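As an illustrative sketch (the quantiles and the number of bootstrap replications below are arbitrary choices), one could use sqreg to compare the wage equation across quartiles:

sqreg wage grade ttl_exp tenure, quantiles(.25 .5 .75) reps(100)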

7.7 Regression with Censored or Truncated Data

Analyzing data that contain censored values or are truncated is common in many
research disciplines.

Censored data are data in which we do not always observe the outcome on the dependent variable: at an upper (or lower) threshold we only know that the outcome was above (or below) the threshold. With censored data we can still use information on the censored outcomes because we always observe the explanatory variables.

Truncated data are data in which the sampling scheme entirely excludes part of the population on the basis of outcomes on the dependent variable. With truncated data we observe no information on units that are not covered by the sampling scheme.
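In Stata terms, as a schematic sketch (y, x1 and x2 are placeholder names, not variables from any dataset used here), censored outcomes are typically handled with tobit and truncated samples with truncreg, both taking the threshold in the ll() option:

tobit y x1 x2, ll(0)        /* zeros on y are observed, but censored */
truncreg y x1 x2, ll(0)     /* cases with y <= 0 were never sampled  */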

We will begin by looking at analyzing data with censored values.

7.7.1 Regression with Censored Data

Using data from Wooldridge (2000), MROZ.DTA, we want to estimate a reduced form annual hours equation for married women, where the dependent variable is annual hours worked (hours) and the explanatory variables are nwifeinc educ exper expersq age kidslt6 kidsge6.

Of the 753 women in the sample, 428 worked for a wage outside the home during the
year; 325 of the women worked zero hours. For the women who worked positive
hours, the range is fairly broad, ranging from 12 to 4,950. Thus, annual hours worked is left censored at zero, and estimating a reduced form annual hours equation for married women requires a model that recognizes the censoring of the data. A reasonable candidate is the Tobit model. For comparative purposes, we also estimate a linear model (using all 753 observations) by OLS.

The data set contains 22 variables on 753 observations. Let's use the describe command to look at the description of the data, then some descriptive statistics and correlations among the variables.

use MROZ.DTA, clear

describe
describe
Contains data from C:\Documents and Settings\Yesuf\Desktop\Stata\Wooldridge2e\data\MROZ.DTA
obs: 753
vars: 22 2 Mar 1999 11:30
size: 39,909 (96.2% of memory free)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
inlf byte %9.0g =1 if in lab frce, 1975
hours int %9.0g hours worked, 1975
kidslt6 byte %9.0g # kids < 6 years
kidsge6 byte %9.0g # kids 6-18
age byte %9.0g woman's age in yrs
educ byte %9.0g years of schooling
wage float %9.0g est. wage from earn, hrs
repwage float %9.0g rep. wage at interview in 1976
hushrs int %9.0g hours worked by husband, 1975
husage byte %9.0g husband's age
huseduc byte %9.0g husband's years of schooling
huswage float %9.0g husband's hourly wage, 1975
faminc float %9.0g family income, 1975
mtr float %9.0g fed. marg. tax rte facing woman
motheduc byte %9.0g mother's years of schooling
fatheduc byte %9.0g father's years of schooling
unem float %9.0g unem. rate in county of resid.
city byte %9.0g =1 if live in SMSA
exper byte %9.0g actual labor mkt exper
nwifeinc float %9.0g (faminc - wage*hours)/1000
lwage float %9.0g log(wage)
expersq int %9.0g exper^2
-------------------------------------------------------------------------------
Sorted by:

sum hours nwifeinc educ exper expersq age kidslt6 kidsge6


Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hours | 753 740.5764 871.3142 0 4950
nwifeinc | 753 20.12896 11.6348 -.0290575 96
educ | 753 12.28685 2.280246 5 17
exper | 753 10.63081 8.06913 0 45
expersq | 753 178.0385 249.6308 0 2025
-------------+--------------------------------------------------------
age | 753 42.53785 8.072574 30 60
kidslt6 | 753 .2377158 .523959 0 3
kidsge6 | 753 1.353254 1.319874 0 8

count if hours>0
428
count if hours==0
325
corr hours nwifeinc educ exper expersq age kidslt6 kidsge6
(obs=753)

| hours nwifeinc educ exper expersq age kidslt6 kidsge6
-------------+------------------------------------------------------------------------
hours | 1.0000
nwifeinc | -0.1247 1.0000
educ | 0.1060 0.2776 1.0000
exper | 0.4050 -0.1722 0.0663 1.0000
expersq | 0.3352 -0.1652 0.0241 0.9376 1.0000
age | -0.0331 0.0586 -0.1202 0.3340 0.3803 1.0000
kidslt6 | -0.2221 0.0382 0.1087 -0.1940 -0.1838 -0.4339 1.0000
kidsge6 | -0.0906 0.0248 -0.0589 -0.2995 -0.2999 -0.3854 0.0842 1.0000

Now, let's run a standard OLS regression on the data and generate predicted scores in
olsp.

regress hours nwifeinc educ exper expersq age kidslt6 kidsge6
Source | SS df MS Number of obs = 753
-------------+------------------------------ F( 7, 745) = 38.50
Model | 151647606 7 21663943.7 Prob > F = 0.0000
Residual | 419262118 745 562767.944 R-squared = 0.2656
-------------+------------------------------ Adj R-squared = 0.2587
Total | 570909724 752 759188.463 Root MSE = 750.18

------------------------------------------------------------------------------
hours | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nwifeinc | -3.446636 2.544 -1.35 0.176 -8.440898 1.547626
educ | 28.76112 12.95459 2.22 0.027 3.329283 54.19297
exper | 65.67251 9.962983 6.59 0.000 46.11365 85.23138
expersq | -.7004939 .3245501 -2.16 0.031 -1.337635 -.0633524
age | -30.51163 4.363868 -6.99 0.000 -39.07858 -21.94469
kidslt6 | -442.0899 58.8466 -7.51 0.000 -557.6148 -326.565
kidsge6 | -32.77923 23.17622 -1.41 0.158 -78.2777 12.71924
_cons | 1330.482 270.7846 4.91 0.000 798.8906 1862.074
------------------------------------------------------------------------------

predict olsp
(option xb assumed; fitted values)

As pointed out above, the Tobit model is a reasonable candidate for regression with censored data; hence we use the tobit command.

Syntax for tobit:

tobit depvar [indepvars] [weight] [if exp] [in range], ll[(#)] ul[(#)] [level(#) offset(varname) maximize_options]

You can declare both lower and upper censored values. The censored values are fixed
in that the same lower and upper values apply to all observations.

ll[(#)] and ul[(#)] indicate the lower and upper limits for censoring, respectively. You
may specify one or both. Observations with depvar <= ll() are left-censored;
observations with depvar >= ul() are right-censored; and remaining observations are
not censored. You do not have to specify the censoring value at all. It is enough to
type ll, ul, or both. When you do not specify a censoring value, tobit assumes that the
lower limit is the minimum observed in the data (if ll is specified) and the upper limit
is the maximum (if ul is specified).
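
As a minimal sketch of these options (y, x1 and x2 are placeholder names, not variables from any dataset used here):

*** censoring limits given explicitly
tobit y x1 x2, ll(0) ul(100)

*** limits taken from the observed minimum and maximum of y
tobit y x1 x2, ll ul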

There are two other commands in Stata that allow you more flexibility in doing
regression with censored data.

cnreg estimates a model in which the censored values may vary from observation to
observation.

intreg estimates a model where the response variable for each observation is either
point data, interval data, left-censored data, or right-censored data.
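
As a rough sketch of their syntax (all variable names here are hypothetical): cnreg expects a censoring indicator, while intreg expects two dependent variables holding the interval bounds.

*** cens = -1 if left-censored, 0 if uncensored, 1 if right-censored
cnreg y x1 x2, censored(cens)

*** y_lo and y_hi bound each observation; a missing y_lo means left-censored
intreg y_lo y_hi x1 x2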


Now, let's run a tobit regression on the data and generate predicted scores in tobitp.

tobit hours nwifeinc educ exper expersq age kidslt6 kidsge6, ll(0)
Tobit regression Number of obs = 753
LR chi2(7) = 271.59
Prob > chi2 = 0.0000
Log likelihood = -3819.0946 Pseudo R2 = 0.0343

------------------------------------------------------------------------------
hours | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nwifeinc | -8.814243 4.459096 -1.98 0.048 -17.56811 -.0603724
educ | 80.64561 21.58322 3.74 0.000 38.27453 123.0167
exper | 131.5643 17.27938 7.61 0.000 97.64231 165.4863
expersq | -1.864158 .5376615 -3.47 0.001 -2.919667 -.8086479
age | -54.40501 7.418496 -7.33 0.000 -68.96862 -39.8414
kidslt6 | -894.0217 111.8779 -7.99 0.000 -1113.655 -674.3887
kidsge6 | -16.218 38.64136 -0.42 0.675 -92.07675 59.64075
_cons | 965.3053 446.4358 2.16 0.031 88.88528 1841.725
-------------+----------------------------------------------------------------
/sigma | 1122.022 41.57903 1040.396 1203.647
------------------------------------------------------------------------------
Obs. summary: 325 left-censored observations at hours<=0
428 uncensored observations
0 right-censored observations

predict tobitp
(option xb assumed; fitted values)

Summarizing the olsp and tobitp scores shows that the tobit predicted values have a
larger standard deviation and a greater range of values.

summarize hours olsp tobitp

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hours | 753 740.5764 871.3142 0 4950
olsp | 753 740.5764 449.0646 -719.7679 1614.693
tobitp | 753 296.7653 835.3766 -2641.953 1976.24

Comparing the Tobit and OLS estimates: the Tobit coefficient estimates have the same signs as the corresponding OLS estimates, and the statistical significance of the estimates is similar, except for the coefficient on nwifeinc. Though it is tempting to compare the magnitudes of the OLS and Tobit estimates, such comparisons are not very informative: to compare the Tobit coefficient estimates directly with their OLS counterparts, we need to multiply them by an adjustment factor (see Wooldridge, 2000).
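
As a rough sketch of that adjustment for the partial effects on the expected value of hours, one can average the scale factor Phi(xb/sigma) over the sample, plugging in the sigma estimate (1122.022) reported in the tobit output above; the new variable name adj is only illustrative.

*** adj = Phi(xb/sigma), reusing the fitted values tobitp generated earlier
gen adj = normal(tobitp/1122.022)
summarize adj

Multiplying a Tobit slope by the mean of adj makes its magnitude roughly comparable to the corresponding OLS estimate.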

7.7.2 Regression with Truncated Data

Truncated data occur when some observations are excluded from the analysis because of the value of the dependent variable. We will illustrate analysis with truncation using the dataset nlsw88.dta. These data contain only employed individuals with positive wages and hours worked. Since the sample excludes individuals who are not employed, the data can be considered truncated at zero, i.e., wages need to be greater than zero for an individual to be included in the sample.


If we want to model wages as predicted by years of schooling, work experience and job tenure using nlsw88.dta, a reasonable candidate is truncated regression. Let's do so.

Let’s use the data set and display some descriptive statistics, and correlations among
the variables.

use nlsw88.dta

sum wage grade ttl_exp tenure

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
wage | 2246 7.766949 5.755523 1.004952 40.74659
grade | 2244 13.09893 2.521246 0 18
ttl_exp | 2246 12.53498 4.610208 .1153846 28.88461
tenure | 2231 5.97785 5.510331 0 25.91667

corr wage grade ttl_exp tenure
(obs=2229)

| wage grade ttl_exp tenure
-------------+------------------------------------
wage | 1.0000
grade | 0.3256 1.0000
ttl_exp | 0.2632 0.1979 1.0000
tenure | 0.1783 0.1228 0.5764 1.0000

For the sake of comparison, we first estimate the model using OLS:

regress wage grade ttl_exp tenure

Source | SS df MS Number of obs = 2229
-------------+------------------------------ F( 3, 2225) = 128.82
Model | 10963.7014 3 3654.56714 Prob > F = 0.0000
Residual | 63124.2663 2225 28.3704568 R-squared = 0.1480
-------------+------------------------------ Adj R-squared = 0.1468
Total | 74087.9678 2228 33.2531274 Root MSE = 5.3264

------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .64945 .0456072 14.24 0.000 .560013 .738887
ttl_exp | .2329752 .0303483 7.68 0.000 .1734612 .2924891
tenure | .0378054 .0250756 1.51 0.132 -.0113686 .0869794
_cons | -3.86429 .6318619 -6.12 0.000 -5.103391 -2.625189
------------------------------------------------------------------------------

Given the truncated nature of the data, we estimate the model using truncated
regression. The syntax diagram for truncated regression is as below:

Syntax for Truncated Regression

truncreg depvar [indepvars] [if] [in] [weight] [, options]

truncreg fits a regression model of depvar on varlist from a sample drawn from a
restricted part of the population. Under the normality assumption for the whole
population, the error terms in the truncated regression model have a truncated normal
distribution, which is a normal distribution that has been scaled upward so that the
distribution integrates to one over the restricted range.
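
In other words, for a lower truncation point ll(), estimation is based on the conditional density f(y | x, y > ll) = f(y | x) / Pr(y > ll | x); dividing by Pr(y > ll | x), which is less than one, is the upward scaling referred to above.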


ll(varname|#) and ul(varname|#) indicate the lower and upper limits for truncation, respectively. You may specify one or both. Observations with depvar < ll() are left-truncated, observations with depvar > ul() are right-truncated, and the remaining observations are not truncated.
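
For instance, if wages were only recorded when they fell between 1 and 50 dollars per hour, both limits could be supplied (a hypothetical example; output omitted):

truncreg wage grade ttl_exp tenure, ll(1) ul(50)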

Estimating the wage model using truncated regression, truncreg:

truncreg wage grade ttl_exp tenure, ll(0)
(note: 0 obs. truncated)

Fitting full model:

Iteration 0: log likelihood = -6675.5723
Iteration 1: log likelihood = -6503.8152
Iteration 2: log likelihood = -6497.6854
Iteration 3: log likelihood = -6497.6435
Iteration 4: log likelihood = -6497.6434

Truncated regression
Limit: lower = 0 Number of obs = 2229
upper = +inf Wald chi2(3) = 252.42
Log likelihood = -6497.6434 Prob > chi2 = 0.0000

------------------------------------------------------------------------------
wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1 |
grade | 1.357205 .1056178 12.85 0.000 1.150198 1.564212
ttl_exp | .5526835 .0690394 8.01 0.000 .4173688 .6879983
tenure | .0454751 .0485286 0.94 0.349 -.0496392 .1405894
_cons | -21.8007 1.949961 -11.18 0.000 -25.62256 -17.97885
-------------+----------------------------------------------------------------
sigma |
_cons | 7.653959 .2447991 31.27 0.000 7.174161 8.133756
------------------------------------------------------------------------------

The coefficients from the truncreg command are substantially larger than the OLS results; for example, the coefficient for grade is 1.357, more than twice the OLS estimate of 0.649. Similarly, the coefficient for ttl_exp is 0.552, more than twice the OLS estimate of 0.232. While truncreg may improve the estimates on a restricted data file compared with OLS, it is certainly not a substitute for analyzing the complete unrestricted data, i.e., a non-truncated, randomly drawn sample from the population.

7.8 Regression with Measurement Error

As you will most likely recall, one of the assumptions of regression is that the predictor variables are measured without error. The problem is that measurement error in the predictor variables leads to underestimation of the regression coefficients. Stata's eivreg command takes measurement error into account when estimating the coefficients for the model.
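
To see why, note that in the simple one-regressor case with classical measurement error, the OLS slope converges not to beta but to lambda*beta, where lambda = Var(true x) / [Var(true x) + Var(measurement error)] is the reliability ratio, a number below one. eivreg essentially undoes this attenuation using the reliability you supply for each mismeasured predictor.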

Let's look at a regression using the bwages.dta dataset from Verbeek (2000):

use bwages.dta, clear

***summarizing variables
sum wage exper educ

*** data generation
gen expersq= exper^2
gen expercb= exper^3

*** OLS Regression analysis
reg wage exper educ male
Source | SS df MS Number of obs = 1472
-------------+------------------------------ F( 3, 1468) = 281.98
Model | 17333519.1 3 5777839.71 Prob > F = 0.0000
Residual | 30080026.9 1468 20490.4816 R-squared = 0.3656
-------------+------------------------------ Adj R-squared = 0.3643
Total | 47413546.1 1471 32232.1863 Root MSE = 143.14

------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
exper | 7.756359 .3865812 20.06 0.000 6.998049 8.51467
educ | 80.11866 3.252994 24.63 0.000 73.73765 86.49967
male | 54.30332 7.774967 6.98 0.000 39.05209 69.55455
_cons | 8.620323 15.60731 0.55 0.581 -21.99468 39.23532
------------------------------------------------------------------------------

The predictor exper is the worker's years of experience. If we assume that the reported
experience, exper, is measured with error, the above OLS result is biased. We don't
know the exact reliability of exper, but using 0.8 for the reliability would probably
not be far off. We will now estimate the same regression model with the Stata eivreg
command, which stands for errors-in-variables regression.

***Errors-in-variable Regression
eivreg wage exper educ male, r(exper .8)

Errors-in-variables regression

assumed
variable reliability
------------------------ Number of obs = 1472
exper 0.8000 F( 3, 1468) = 305.88
* 1.0000 Prob > F = 0.0000
R-squared = 0.4152
Root MSE = 137.438

------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
exper | 9.966844 .4769471 20.90 0.000 9.031273 10.90241
educ | 85.20147 3.198329 26.64 0.000 78.92769 91.47525
male | 48.71597 7.503267 6.49 0.000 33.99771 63.43424
_cons | -43.22058 16.54975 -2.61 0.009 -75.68425 -10.7569
------------------------------------------------------------------------------

Note that the F-ratio and the R-squared increased, along with the regression coefficient for exper. Additionally, there is an increase in the standard error for exper.

Check out the regressions below and note the difference.

*** OLS Regression
reg wage exper expersq educ male

***Errors-in-variable Regression
eivreg wage exper expersq educ male, r(exper .95)

