6.1 Introduction
In the workplace, you are often given questions or issues to address, such as: What is the effect of modern input use (like fertilizer) on productivity? What is the relationship between a CEO's performance and the CEO's salary? What is the effect of one more year of education on a worker's wage? Does class size affect student performance? Does population growth inhibit economic growth? What determines the adoption of modern inputs? These and many more all require quantitative answers. In such situations you opt for statistical models that enable you to quantify the relationship between a factor and the response variable, an approach generally known as regression analysis.
Syntax
regress depvar [indepvars] [if] [in] [weight] [, options]
Options Description
------------------------------------------------------------------------------------------------------
Model
noconstant suppress constant term
hascons has user-supplied constant
tsscons compute total sum of squares with constant; seldom used
SE/Robust
vce(vcetype) vcetype may be robust, bootstrap, or jackknife
robust synonym for vce(robust)
cluster(varname) adjust standard errors for intragroup correlation
mse1 force mean squared error to 1
hc2 use u^2_j/(1-h_jj) as observation's variance
hc3 use u^2_j/(1-h_jj)^2 as observation's variance
Reporting
level(#) set confidence level; default is level(95)
beta report standardized beta coefficients
eform(string) report exponentiated coefficients and label as string
noheader suppress the table header
plus make table extendable
-------------------------------------------------------------------------------------------------------
depvar and the varlist following depvar may contain time-series operators
Example 6.1
For instance, there are numerous occasions when it behoves a researcher to test the
relationship between a dependent variable and several potential predictors of that
dependent variable. If a person was interested in buying a car, s/he might wish to
know the statistically significant predictors of good gas mileage. From the STATA
data set, auto.dta, s/he has data on weight, length, trunk size, and headroom. S/He
wants to ascertain which of these variables predict miles per gallon. To do so, s/he
would construct a regression model. To set up the STATA command for a classical
ordinary least squares statistical model, s/he would select miles per gallon, mpg, as
her/his dependent variable. This would become the first variable in her/his list of
variables. S/He follows this dependent variable with the other variables, which s/he
hypothesizes to predict mpg. S/He issues the regression command, which commences
with regress:
clear
sysuse auto.dta
regress mpg weight
The regress command above generates the following simple linear regression results:
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 134.62
Model | 1591.9902 1 1591.9902 Prob > F = 0.0000
Residual | 851.469256 72 11.8259619 R-squared = 0.6515
-------------+------------------------------ Adj R-squared = 0.6467
Total | 2443.45946 73 33.4720474 Root MSE = 3.4389
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0060087 .0005179 -11.60 0.000 -.0070411 -.0049763
_cons | 39.44028 1.614003 24.44 0.000 36.22283 42.65774
------------------------------------------------------------------------------
Interpretation:
The upper left corner is the analysis-of-variance (ANOVA) table. The column
headings SS, df, MS stand for ‘Sum of Squares’, ‘degrees of freedom’ and ‘Mean
square’ respectively. In this example, the total sum of squares is 2443.45, of which
1591.99 is accounted for by the model and 851.46 is left unexplained (in the residual).
Summary statistics are displayed on the top right corner. There are 74 observations
included in the regression analysis. The F-statistic (F(1,72) = 134.62) tests the
hypothesis that all coefficients (excluding the constant) are zero. If the null hypothesis
were true, the probability of observing an F-statistic as large as 134.62 would be 0.0000
(Stata's way of indicating a number smaller than 0.00005); that is, the p-value is the
significance level of the test when we use the value of the test statistic, 134.62, as the
critical value for the test. The R-squared for the regression is 0.65 and the adjusted
R-squared is 0.64, meaning that about 65% (64% after adjusting for degrees of freedom) of the
variation in the dependent variable, mpg, is explained by the regressor, weight.
Finally, Stata produces a table of estimated coefficients. The first line indicates that
the dependent variable in this regression model was mpg. This table provides
information on: the estimated coefficient (Coef.), its standard error (Std. Err.),
the t-statistic (t), which tests the hypothesis that the coefficient is equal to zero, the
probability of observing this t-statistic if the null hypothesis were true (P>|t|), and a
confidence interval for the estimated coefficient ([95% Conf. Interval]). Our
fitted model is:
predicted mpg = 39.440 - 0.006*weight
This regression tells us that for every additional pound (lb) in the weight of a car, the
gas mileage on average decreases by about 0.006 miles per gallon. This decrease is
statistically significant, as indicated by the 0.000 probability associated with this
coefficient.
We can redo the above regression changing the confidence level to, for instance, 90%, as below:
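A command consistent with the output below uses the level() option documented above:
regress mpg weight, level(90)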
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [90% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0060087 .0005179 -11.60 0.000 -.0068716 -.0051457
_cons | 39.44028 1.614003 24.44 0.000 36.75088 42.12969
------------------------------------------------------------------------------
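The next table reports no constant term and a positive slope; the command is not shown in the original, but the output is consistent with re-estimating without an intercept, for example (an assumption, not confirmed by the text):
regress mpg weight, noconstant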
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | .006252 .0003883 16.10 0.000 .0054781 .007026
------------------------------------------------------------------------------
Example 6.2
The person in Example 6.1 felt that, in addition to the weight of the car, its length, trunk
size and headroom have something to do with gas mileage, and hence included these
three variables as independent variables in the model, estimating what is usually
referred to as a Multiple Linear Regression Model.
The command for a multiple linear regression is the same as for the simple linear
regression, except that more than one independent variable is listed in the varlist, as
shown here:
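regress mpg weight length trunk headroom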
This command generates the following ANOVA and regression parameter estimates.
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0038778 .0016121 -2.41 0.019 -.007094 -.0006617
length | -.0742607 .060633 -1.22 0.225 -.1952201 .0466988
trunk | -.0342372 .1582544 -0.22 0.829 -.3499461 .2814716
headroom | .0161296 .6405386 0.03 0.980 -1.26171 1.293969
_cons | 47.38497 6.54025 7.25 0.000 34.33753 60.43241
------------------------------------------------------------------------------
Interpretation:
Most of the output is interpreted in the same manner as for the simple linear regression,
except the coefficients. The coefficients are interpreted as follows: for every
additional pound (lb) in the weight of a car, keeping the effect of the other regressors
constant (ceteris paribus), gas mileage (mpg) on average decreases by 0.0038
miles per gallon. This decrease is statistically significant, as indicated by the
0.019 probability associated with this coefficient.
From the full model regression output shown above, it appears that only weight is a
significant predictor of gas mileage. The hypotheses that headroom, trunk size and
length of the vehicle are statistically significant predictors of gas mileage
are disconfirmed by this regression analysis, based on this data set. Once the original
hypotheses have been tested, this model can be trimmed to a parsimonious one showing the
relationship between the dependent variable and only the significant predictors.
You may be wondering what the -0.0038 coefficient on weight really means, and how you
might compare the strength of that coefficient to the coefficient for another variable,
say length. To address this problem, we can add the beta option to the regress
command, which will give us the standardized regression coefficients. The beta
coefficients are used by some researchers to compare the relative strength of the
various predictors within the model. Because the beta coefficients are all measured in
standard deviations, instead of the units of the variables, they can be compared to one
another. In other words, the beta coefficients are the coefficients that you would
obtain if the outcome and predictor variables were all transformed into standard scores,
also called z-scores, before running the regression.
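The command that produces the table below adds the beta option to the earlier regression:
regress mpg weight length trunk headroom, beta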
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
weight | -.0038778 .0016121 -2.41 0.019 -.5209278
length | -.0742607 .060633 -1.22 0.225 -.2858029
trunk | -.0342372 .1582544 -0.22 0.829 -.0253126
headroom | .0161296 .6405386 0.03 0.980 .0023586
_cons | 47.38497 6.54025 7.25 0.000 .
------------------------------------------------------------------------------
Interpretation:
Because the coefficients in the Beta column are all in the same standardized units you
can compare these coefficients to assess the relative strength of each of the
predictors. In this example, weight has the largest Beta coefficient, -0.52 (in absolute
value), and headroom has the smallest Beta, 0.002. Thus, a one standard deviation
increase in weight leads to a 0.52 standard deviation decrease in predicted mpg, with
the other variables held constant. And, a one standard deviation increase in
headroom, in turn, leads to a 0.002 standard deviation increase in predicted mpg with
the other variables in the model held constant.
Remember that the difference between the numbers listed in the Coef. column and
the Beta column is in the units of measurement.
Description
test tests linear hypotheses about the estimated parameters from the most recently fitted model.
So far, we have concerned ourselves with testing a single variable at a time, for
example looking at the coefficient for weight and determining if that is significant.
We can also test sets of variables, using the test command, to see if the set of
variables are significant.
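For example, testing the single coefficient on weight after the multiple regression above produces the output below:
test weight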
F( 1, 69) = 5.79
Prob > F = 0.0188
If you compare this output with the output from the last regression you can see that
the result of the F-test, 5.79, is the same as the square of the result of the t-test in the
regression ((-2.41)^2 = 5.79, up to rounding).
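The next block is consistent with jointly testing weight and length:
test weight length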
( 1) weight = 0
( 2) length = 0
F( 2, 69) = 32.83
Prob > F = 0.0000
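Similarly, testing length and headroom jointly:
test length headroom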
( 1) length = 0
( 2) headroom = 0
F( 2, 69) = 0.75
Prob > F = 0.4762
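And testing length, headroom and trunk jointly:
test length headroom trunk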
( 1) length = 0
( 2) headroom = 0
( 3) trunk = 0
F( 3, 69) = 0.69
Prob > F = 0.5621
Description
testnl tests (linear or nonlinear) hypotheses about the estimated parameters from
the most recently fitted model. testnl produces Wald-type tests of smooth nonlinear
(or linear) hypotheses about the estimated parameters from the most recently fitted
model. The p-values are based on the "delta method", an approximation appropriate
in large samples. testnl supports survey regression-type commands (svy: regress, svy:
logit, etc.). testnl may also be used to test linear hypotheses. test is faster if you want
to test only linear hypotheses. testnl is the only option for testing linear and nonlinear
hypotheses simultaneously.
Options
mtest[(opt)] specifies that tests be performed for each condition separately. Opt
specifies the method for adjusting p-values for multiple testing. Valid values for opt
are
bonferroni Bonferroni's method
holm Holm's method
sidak Sidak's method
noadjust no adjustment is to be made
Specifying mtest without an argument is equivalent to mtest(noadjust).
Example 6.3
regress mpg weight length trunk headroom
testnl (_b[weight]/_b[length] = _b[trunk]) (_b[trunk] = _b[headroom]), mtest
(1) _b[weight]/_b[length] = _b[trunk]
(2) _b[trunk] = _b[headroom]
---------------------------------------
| F(df,69) df p
-------+-------------------------------
(1) | 0.22 1 0.6398 #
(2) | 0.00 1 0.9453 #
-------+-------------------------------
all | 0.14 2 0.8696
---------------------------------------
# unadjusted p-values
Description
lincom computes point estimates, standard errors, t or z statistics, p-values, and
confidence intervals for linear combinations of coefficients after any estimation
command. Results can optionally be displayed as odds ratios, hazard ratios, incidence-
rate ratios, or relative risk ratios.
Example 6.4
regress mpg weight length trunk headroom
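The lincom commands themselves are not reproduced in the original; the first table below is consistent with the linear combination of the weight and length coefficients (its point estimate equals their sum):
lincom weight + length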
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | -.0781385 .0591877 -1.32 0.191 -.1962148 .0399378
------------------------------------------------------------------------------
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | -1.383422 .1084071 -12.76 0.000 -1.599688 -1.167156
------------------------------------------------------------------------------
In addition to getting the regression table, it can be useful to see a scatterplot of the
predicted and outcome variables with the regression line plotted. After you run a
regression, you can create a variable that contains the predicted values using the
predict command.
predict mpghat
Draw a scatter plot to see the sampled observations together with the linear fit line, that is, the fitted line predicted by OLS (mpghat in this case).
twoway (scatter mpg weight) (lfit mpg weight)
As we saw earlier, the predict command can be used to generate predicted (fitted)
values after running regress. You can also obtain residuals by using the predict
command followed by a new variable name, in this case residualhat, with the residual option.
predict residualhat, residual
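The correlation matrix below is produced by the correlate command on the variables listed in its header:
correlate mpg weight length trunk headroom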
(obs=74)
| mpg weight length trunk headroom
-------------+---------------------------------------------
mpg | 1.0000
weight | -0.8072 1.0000
length | -0.7958 0.9460 1.0000
trunk | -0.5816 0.6722 0.7266 1.0000
headroom | -0.4138 0.4835 0.5163 0.6620 1.0000
If we look at the correlations with mpg, we see weight and length have the two
strongest correlations with mpg. The correlation between mpg and weight is
negative, meaning that as the value of one variable goes down, the value of the other
variable tends to go up. Knowing that these variables are strongly associated with
mpg might give us a clue that they would be statistically significant regressors in the
regression model.
We can also use the pwcorr command to do pairwise correlations. The most
important difference between correlate and pwcorr is the way in which missing data
is handled. With correlate, an observation or case is dropped if any variable has a
missing value; in other words, correlate uses listwise (also called casewise) deletion.
pwcorr uses pairwise deletion, meaning that the observation is dropped only if there
is a missing value for the pair of variables being correlated. Two options that you can
use with pwcorr, but not with correlate, are the sig option, which will give the
significance levels for the correlations and the obs option, which will give the number
of observations used in the correlation. Such an option is not necessary with corr as
Stata lists the number of observations at the top of the output.
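A minimal example of pwcorr with both options, using the same variables as above:
pwcorr mpg weight length trunk headroom, sig obs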
So far we have covered the basics of estimating both simple and multiple linear regression
models, interpreting the regression results, performing joint hypothesis tests and
checking the linear association among variables. Some researchers believe that linear
regression requires that the outcome (dependent) and predictor (explanatory) variables
be normally distributed. We need to clarify this issue: actually, it is the residuals that
need to be normally distributed, and only for the hypothesis tests (like t-tests) to be
valid. The estimation of the regression coefficients does not require normally
distributed residuals. As we are interested in having valid t-tests, we will investigate
issues concerning normality.
histogram weight
(Figure: histogram of weight, density scale)
We can use the normal option to superimpose a normal curve on this graph and the
bin(15) option to use 15 bins. The distribution looks skewed to the right.
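The corresponding command would be:
histogram weight, normal bin(15)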
Histograms are sensitive to the number of bins or columns that are used in the display.
An alternative to histograms is the kernel density plot, which approximates the
probability density of the variable. Kernel density plots have the advantage of being
smooth and of being independent of the choice of origin, unlike histograms. Stata
implements kernel density plots with the kdensity command.
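For example:
kdensity weight, normal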
Not surprisingly, the kdensity plot also indicates that the variable weight does not
look normal.
There are three other types of graphs that are often used to examine the distribution of
variables; symmetry plots, normal quantile plots and normal probability plots.
Symmetry plots: A symmetry plot graphs the distance above the median for the i-th
value against the distance below the median for the i-th value. A variable that is
symmetric would have points that lie on the diagonal line. As we would expect, this
distribution is not symmetric.
symplot weight
(Figure: symplot of Weight (lbs.), distance above the median versus distance below the median)
Normal quantile plots: A normal quantile plot graphs the quantiles of a variable
against the quantiles of a normal (Gaussian) distribution. qnorm is sensitive to non-
normality near the tails, and indeed we see considerable deviations from normal, the
diagonal line, in the tails. This plot is typical of variables that are strongly skewed.
qnorm weight
(Figure: qnorm of Weight (lbs.), quantiles of weight against quantiles of the normal distribution)
Normal probability plot: The normal probability plot is also useful for examining
the distribution of variables. pnorm is sensitive to deviations from normality nearer
to the center of the distribution. Again, we see indications of non-normality in weight.
pnorm weight
(Figure: pnorm of weight, standardized normal probability plot)
Having concluded that weight is not normally distributed, how should we address this
problem? First, we may try entering the variable as-is into the regression, but if we
see problems, which we likely would, then we may try to transform weight to make it
more normally distributed. Potential transformations include taking the log, the
square root or raising the variable to a power. Selecting the appropriate transformation
is somewhat of an art. Stata includes the ladder and gladder commands to help in the
process.
Let's start with ladder and look for the transformation with the smallest chi-square.
ladder weight
The square transform has the smallest chi-square. Let's verify these results graphically
using gladder.
gladder weight
(Figure: gladder output, histograms of Weight (lbs.) by transformation)
This also indicates that the square transformation would help to make weight more
normally distributed. Let's use the generate command to create the variable
weightsqr, which will be weight squared (weight^2), and then see if we have normalized it.
generate weightsqr= weight^2
We can now see a little improvement in the normality of the variable weight after the
transformation. We would then use the symplot, qnorm and pnorm commands to
help us assess whether weightsqr seems normal, as well as see how weightsqr
impacts the residuals, which is really the important consideration.
Exercise:
Check whether the dependent variable, mpg, is normally distributed.
6.3.3 Checking Data Problems and Validity of OLS Assumptions in a Data set
Without verifying that your data have met the assumptions underlying OLS
regression, your results may be misleading. In the following we explore how you can
use Stata to check on how well your data meet the assumptions of OLS regression. In
particular, we will consider the following assumptions.
• Linearity - the relationships between the predictors and the outcome variable should be linear
• Normality - the errors should be normally distributed; technically, normality is necessary only for the hypothesis tests to be valid, while estimation of the coefficients only requires that the errors be identically and independently distributed
• Homogeneity of variance (homoscedasticity) - the error variance should be constant
• Collinearity - predictors that are highly collinear, i.e., linearly related, can cause problems in estimating the regression coefficients
• Independence - the errors associated with one observation are not correlated with the errors of any other observation
• Errors-in-variables - predictor variables are measured without error (we will cover this in further topics in regression)
• Model specification - the model should be properly specified (including all relevant variables, and excluding irrelevant variables)
Additionally, there are issues that can arise during the analysis that, while strictly
speaking not assumptions of regression, are nonetheless of great concern to data analysts.
A single observation that is substantially different from all other observations can
make a large difference in the results of your regression analysis. If a single
observation (or a small group of observations) substantially changes your results, you
would want to know about this and investigate further. There are three ways that an
observation can be unusual: it can be an outlier (an unusual value of the dependent
variable given its predictors), it can have high leverage (an extreme value on one or
more predictors), or it can be influential (removing it substantially changes the estimates).
To identify the aforementioned three types of unusual observations, let's look at the
auto.dta data set installed with Stata.
sysuse auto.dta
describe
Contains data from C:\Program Files\Stata9\ado\base/a/auto.dta
obs: 74 1978 Automobile Data
vars: 12 13 Apr 2005 17:45
size: 3,478 (99.7% of memory free) (_dta has notes)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
make str18 %-18s Make and Model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %8.0g Repair Record 1978
headroom float %6.1f Headroom (in.)
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn Circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear Ratio
foreign byte %8.0g origin Car type
-------------------------------------------------------------------------------
Sorted by: foreign
(Figure: scatterplot matrix of Mileage (mpg), Weight (lbs.), Length (in.), Headroom (in.), and Trunk space (cu. ft.))
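The regression that produces the table below is the multiple regression used earlier, with the regressors listed in the order shown:
regress mpg weight length headroom trunk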
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0038778 .0016121 -2.41 0.019 -.007094 -.0006617
length | -.0742607 .060633 -1.22 0.225 -.1952201 .0466988
headroom | .0161296 .6405386 0.03 0.980 -1.26171 1.293969
trunk | -.0342372 .1582544 -0.22 0.829 -.3499461 .2814716
_cons | 47.38497 6.54025 7.25 0.000 34.33753 60.43241
------------------------------------------------------------------------------
Let's examine the residuals as a first means for identifying outliers. The predict
command can generate ordinary, standardized, or studentized residuals. The following
command generates ordinary residuals in the data set, called resid.
predict resid, residuals
You may also generate the leverage (influence) values to identify observations that may
have a potentially great influence on the regression coefficient estimates.
predict lev, leverage
Generally, a point with leverage greater than (2k+2)/n should be carefully examined.
Here k is the number of predictors and n is the number of observations.
Observations with high leverage values (leverage values range from 1/n to 1) would
qualify. One could sort the residuals by their standardized values and then tabulate
them with the commands below (the standardized residuals must first be generated; see
the sketch after these commands):
sort stdres
tabulate stdres
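A minimal sketch of the full sequence, assuming the standardized residuals are stored in a variable named stdres (the name used above) and generated with predict's rstandard option:
* generate standardized residuals after the regression above
predict stdres, rstandard
* sort from most negative to most positive
sort stdres
* inspect the extremes of the sorted list
list make mpg stdres in 1/10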
There are robust methods used to detect outliers that require additional treatment.
Nonetheless, the observations need to be examined for typographical errors. Any such
errors need to be corrected. Given a large enough data set, they can be deleted. If the
data set is small, then some sort of smoothing or missing data replacement can be
invoked.
Another important diagnostic graph is the leverage-versus-residual plot, which shows
the influence of the outliers over the residuals. The command for this plot is
lvr2plot; the influential outliers are revealed in the upper right sector. If there are
enough of them with substantial leverage, then the analyst may wish to resort to robust
regression techniques.
lvr2plot
(Figure: lvr2plot, leverage versus normalized residual squared)
The two reference lines are the means for leverage, horizontal, and for the normalized
residual squared, vertical.
Now let's move on to overall measures of influence, specifically Cook's D
and DFITS. These measures both combine information on the residual and leverage.
Cook's D and DFITS are very similar except that they are scaled differently, but they give
us similar answers.
The lowest value that Cook's D can assume is zero, and the higher Cook's D is, the
more influential the point. The conventional cut-off point is 4/n. We can list any
observation above the cut-off point by doing the following, and we see that Cook's D
for the 2nd, 13th, 42nd, 57th, 60th, and 71st observations is larger than the cut-off point.
predict d, cooksd
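A hedged way to list the observations above the 4/n cut-off (here n = 74):
list make mpg weight d if d > 4/74 & !missing(d)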
Now let's take a look at DFITS. The cut-off point for DFITS is 2*sqrt(k/n). DFITS
can be either positive or negative, with numbers close to zero corresponding to the
points with small or zero influence.
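A sketch of computing DFITS and listing observations beyond the cut-off (k = 4 predictors and n = 74 in this example):
predict dfit, dfits
list make mpg dfit if abs(dfit) > 2*sqrt(4/74) & !missing(dfit)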
The above measures are general measures of influence. You can also consider more
specific measures that assess how each coefficient is changed by deleting an
observation. This measure is called DFBETA and is created for each of the
predictors. This is more computationally intensive than a summary statistic
such as Cook's D, since the more predictors a model has, the more computation is involved.
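A sketch using Stata's dfbeta command, which creates one DFBETA variable per predictor (named _dfbeta_1, _dfbeta_2, and so on); the 2/sqrt(n) cut-off used below is a common rule of thumb rather than something stated in the text:
dfbeta
* inspect observations with a large DFBETA for the first predictor
list make _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(74) & !missing(_dfbeta_1)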
We have explored a number of the statistics that we can get after the regress
command. There are also several graphs that can be used to search for unusual and
influential observations, we left them for further reading.
When we do linear regression, we assume that the relationship between the response
variable and the predictors is linear. This is the assumption of linearity. If this
assumption is violated, the linear regression will try to fit a straight line to data that
does not follow a straight line. Checking the linear assumption in the case of simple
regression is straightforward, since we only have one predictor. All we have to do is a
scatter plot between the response variable and the predictor to see if nonlinearity is
present, such as a curved band or a big wave-shaped curve.
Example 6.5
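The output below matches the earlier simple regression of mpg on weight:
regress mpg weight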
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0060087 .0005179 -11.60 0.000 -.0070411 -.0049763
_cons | 39.44028 1.614003 24.44 0.000 36.22283 42.65774
------------------------------------------------------------------------------
Below we use the scatter command to show a scatterplot predicting mpg from
weight, lfit to show a linear fit, and lowess to show a lowess smoother
predicting mpg from weight. The different types of plots confirm the linear relationship
between mpg and weight.
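One way to produce these plots in a single graph:
twoway (scatter mpg weight) (lfit mpg weight) (lowess mpg weight)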
Example 6.6
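The output below is consistent with regressing mpg on weight and length:
regress mpg weight length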
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0038515 .001586 -2.43 0.018 -.0070138 -.0006891
length | -.0795935 .0553577 -1.44 0.155 -.1899736 .0307867
_cons | 47.88487 6.08787 7.87 0.000 35.746 60.02374
------------------------------------------------------------------------------
predict r, resid
scatter r weight
(Figure: residuals versus weight)
scatter r length
(Figure: residuals versus length)
The two residual-versus-predictor plots above do not strongly indicate a clear
departure from linearity. Another command for detecting non-linearity is acprplot.
acprplot graphs an augmented component-plus-residual plot, a.k.a. augmented
partial residual plot, and can be used to identify nonlinearities in the data. Let's use the
acprplot command for weight and length with the lowess lsopts(bwidth(1))
options to request lowess smoothing with a bandwidth of 1:
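acprplot weight, lowess lsopts(bwidth(1))
acprplot length, lowess lsopts(bwidth(1))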
In the first plot below, the smoothed line is close to the ordinary regression line, except
for some deviations at the end and middle points. The second plot does seem more
problematic at the left end, which may come from some potentially influential points.
Overall, they don't look too bad and we shouldn't be too concerned about non-linearities
in the data.
We have seen how to use acprplot to detect nonlinearity. However, this example
didn't clearly show much nonlinearity. A more interesting example appears in the
accompanying do file of this material.
Residuals are assumed to be normally distributed. Like the other assumptions, this
assumption needs to be tested. The Normality of residuals is only required for valid
hypothesis testing, that is, the normality assumption assures that the p-values for the t-
tests and F-test will be valid. Normality is not required in order to obtain unbiased
estimates of the regression coefficients. OLS regression merely requires that the
residuals (errors) be identically and independently distributed.
After we run a regression analysis, we can use the predict command to create
residuals and then use commands such as kdensity, qnorm, pnorm, sktest to check
the normality of the residuals.
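For example, assuming the residuals have been saved as resid (as with predict resid, residuals above):
qnorm resid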
(Figure: quantile-normal plot of the residuals against the inverse normal)
The above command generates the plot shown. The straight diagonal line
represents the theoretical normal distribution, while the plotted points represent the
residuals of the model. The closer the points lie to the straight diagonal line, the more
normal the distribution is said to be. As you see above, the results from qnorm show
only a very slight deviation from normality at the upper tail.
You can also use the kdensity command to produce a kernel density plot with the
normal option requesting that a normal density be overlaid on the plot. kdensity
stands for kernel density estimate. It can be thought of as a histogram with narrow
bins and moving average.
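For example:
kdensity resid, normal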
(Figure: kernel density estimate of the residuals with a normal overlay)
The pnorm command graphs a standardized normal probability (P-P) plot while
qnorm plots the quantiles of a variable against the quantiles of a normal distribution.
pnorm is sensitive to non-normality in the middle range of data and qnorm is
sensitive to non-normality near the tails. As you see below, the results from pnorm
show deviation from normal at the upper half.
pnorm resid
(Figure: pnorm of the residuals, standardized normal probability plot)
sktest resid
The sktest reveals that the p-value is small (0.0001), indicating that we can
reject the hypothesis that the residuals are normally distributed.
swilk: Another test available is the swilk test which performs the Shapiro-Wilk W test
for normality. The p-value is based on the assumption that the distribution is normal.
swilk resid
Shapiro-Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+-------------------------------------------------
resid | 74 0.92004 5.149 3.575 0.00018
Analogous to sktest, the swilk test also rejects the claim that the residuals are normally
distributed, since the p-value is small (0.00018).
One of the main assumptions for the ordinary least squares regression is the
homogeneity of variance of the residuals. If the model is well-fitted, there should be
no pattern to the residuals plotted against the fitted values. If the variance of the
residuals is non-constant then the residual variance is said to be "heteroscedastic."
There are graphical and non-graphical methods for detecting heteroscedasticity.
rvfplot
(Figure: rvfplot, residuals versus fitted values)
Alternatively, we can also use the rvfplot command with the yline(0) option to put a
reference line at y=0. We see that the pattern of the data points is getting a little
narrower towards the right end, which is an indication of heteroscedasticity.
rvfplot, yline(0)
(Figure: rvfplot with yline(0), residuals versus fitted values with a reference line at zero)
hettest
chi2(1) = 14.54
Prob > chi2 = 0.0001
The significant result from the Breusch-Pagan/Cook-Weisberg test indicates that the regression of the
residuals on the fitted values reveals significant heteroskedasticity.
imtest
Cameron & Trivedi's decomposition of IM-test
---------------------------------------------------
Source | chi2 df p
---------------------+-----------------------------
Heteroskedasticity | 18.45 14 0.1872
Skewness | 10.40 4 0.0342
Kurtosis | 1.49 1 0.2220
---------------------+-----------------------------
Total | 30.34 19 0.0477
---------------------------------------------------
imtest performs Cameron & Trivedi's decomposition of the information matrix (IM) test
(White's test is obtained with the white option). The heteroskedasticity component tests
the null hypothesis that the variance of the residuals is homogeneous; a very small
p-value would lead us to reject that null. Here the heteroskedasticity component
(p = 0.187) does not reject homoskedasticity, although the overall IM test (p = 0.048)
is marginally significant.
Both tests shown above are very sensitive to model assumptions, such as the
assumption of normality. Therefore it is a common practice to combine the tests with
diagnostic plots to make a judgment on the severity of the heteroscedasticity and to
decide if any correction is needed for heteroscedasticity.
When there is a perfect linear relationship among the predictors, the estimates for a
regression model cannot be uniquely computed. The term collinearity implies that two
variables are near perfect linear combinations of one another. When more than two
variables are involved it is often called multicollinearity, although the two terms are
often used interchangeably.
The primary concern is that as the degree of multicollinearity increases, the regression
model estimates of the coefficients become unstable and the standard errors for the
coefficients can get wildly inflated. In this section, we will explore some Stata
commands that help to detect multicollinearity.
We can use the vif command after the regression to check for multicollinearity. vif
stands for variance inflation factor. As a rule of thumb, a variable whose VIF values
are greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is
used by many researchers to check on the degree of collinearity. A tolerance value
lower than 0.1 is comparable to a VIF value of 10. It means that the variable could be
considered as a linear combination of other independent variables.
We can test for multicollinearity with the pwcorr command and the vif command,
although Stata checks for this problem and automatically drops perfectly collinear
predictor variables prior to estimation.
The higher the correlations between the predictor variables, the more the
multicollinearity. The variance inflation factor for predictor j is
VIF_j = 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the
other predictors. VIF values greater than 10 indicate possible problems.
vif
Variable | VIF 1/VIF
-------------+----------------------
length | 11.11 0.090047
weight | 9.56 0.104549
trunk | 2.79 0.358189
headroom | 1.79 0.558932
-------------+----------------------
Mean VIF | 6.31
From the above Stata output, length is a multicollinear variable with VIF=11.11 >10.
5. Model Specification
Other questions that arise concern the proper functional form. Is this a linear model?
Does the dependent variable or an independent variable have to be transformed? Is
there an important variable omitted or an irrelevant variable included in the model?
A model specification error can occur when one or more relevant variables are
omitted from the model or one or more irrelevant variables are included in the model.
If relevant variables are omitted from the model, the common variance they share
with included variables may be wrongly attributed to those variables, and the error
term is inflated. On the other hand, if irrelevant variables are included in the model,
the common variance they share with included variables may be wrongly attributed to
them. Model specification errors can substantially affect the estimate of regression
coefficients.
There are a couple of methods to detect specification errors. The linktest command
performs a model specification link test for single-equation models. linktest is based
on the idea that if a regression is properly specified, one should not be able to find any
additional independent variables that are significant except by chance. linktest creates
two new variables, the variable of prediction, _hat, and the variable of squared
prediction, _hatsq. The model is then refit using these two variables as predictors.
_hat should be significant since it is the predicted value. On the other hand, _hatsq
shouldn't, because if our model is specified correctly, the squared predictions should
not have much explanatory power. That is we wouldn't expect _hatsq to be a
significant predictor if our model is specified correctly. So we will be looking at the
p-value for _hatsq.
linktest
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_hat | -.3188663 .7367366 -0.43 0.666 -1.787877 1.150145
_hatsq | .0312539 .0173478 1.80 0.076 -.0033365 .0658444
_cons | 13.22945 7.562782 1.75 0.085 -1.850308 28.30921
------------------------------------------------------------------------------
The Stata output suggests that our regression model may not be correctly specified: the
coefficient on the predicted value (_hat) is statistically insignificant, while the
coefficient on its square (_hatsq) is significant at the 10% level (p = 0.076). The
linktest therefore casts doubt on the hypothesis that the model is correctly specified.
estat ovtest performs two flavors of the Ramsey regression specification error test
(RESET) for omitted variables. This test amounts to fitting y=xb+zt+u and then
testing t=0. If option rhs is not specified, powers of the fitted values are used for z. If
rhs is specified, powers of the individual elements of x are used.
Syntax for estat ovtest
estat ovtest
The ovtest result, on the contrary, fails to reject the null hypothesis of no omitted
variables, indicating no model specification error. Given two contradictory results, we
had better reconsider our model.
Another OLS assumption is that the errors associated with one observation are not
correlated with the errors of any other observation. This assumption covers several
different situations.
Consider the case of collecting data from different villages in different regions. It is
likely that the households within each village will tend to be more alike than
households from different villages, that is, their errors are not independent. [We will
see the regress command with cluster option on further topics in regression].
Another way in which the assumption of independence can be broken is when data are
collected on the same variables over time (time-series data). In this situation it is
likely that the errors for observations in adjacent time periods will be more
highly correlated than for observations more separated in time. This is known as
autocorrelation. When you have data that can be considered time series, you
should use the estat dwatson command (formerly dwstat), which performs a Durbin-Watson
test for correlated residuals.
• estat durbinalt performs Durbin's alternative test for serial correlation in the
disturbance. This test does not require that all the regressors be strictly
exogenous.
Example 6.7
Using time series data sp500.dta, we seek to investigate the relationship between
closing price (close) opening price (open) and volume (volume). To estimate this
relationship first we need to use the tsset command to let Stata know which variable
is the time variable. Then we can use regress command to estimate the model as
below:
sysuse sp500.dta
tsset date
regress close open volume
Source | SS df MS Number of obs = 248
-------------+------------------------------ F( 2, 245) = 3534.52
Model | 1798400.01 2 899200.006 Prob > F = 0.0000
Residual | 62329.1814 245 254.404822 R-squared = 0.9665
-------------+------------------------------ Adj R-squared = 0.9662
Total | 1860729.19 247 7533.31657 Root MSE = 15.95
------------------------------------------------------------------------------
close | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
open | .9804932 .0122914 79.77 0.000 .9562828 1.004704
volume | .0001216 .0004141 0.29 0.769 -.0006941 .0009373
_cons | 21.1053 17.04749 1.24 0.217 -12.47304 54.68364
------------------------------------------------------------------------------
estat dwatson
Number of gaps in sample: 55
The hypotheses being tested are:
Ho: no serial correlation (no autocorrelation)
H1: there is serial correlation (autocorrelation)
estat bgodfrey
Number of gaps in sample: 55
The estat bgodfrey command (the Breusch-Godfrey test) can also check for higher-order
autocorrelation rather than only first-order serial correlation.
A simple visual check would be to plot the residuals versus the time variable.
predict resid, r
scatter resid date
(Figure: residuals versus date)
Example 6.8
Using the time series data sp500.dta, we re-specify the model in Example 6.7 above to
additionally take into account the effect of the previous closing price (lagclose) on
today's closing price (close). To estimate this relationship, first we need to use the tsset
command to let Stata know which variable is the time variable. Then we can use the
regress command to estimate the model as below:
sysuse sp500.dta
tsset date
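The commands that create the lag and estimate the model are not reproduced; a sketch consistent with the output below is:
generate lagclose = L.close
regress close open volume lagclose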
------------------------------------------------------------------------------
close | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
open | -1.898847 4.687278 -0.41 0.686 -11.13173 7.334033
volume | .0001084 .0004126 0.26 0.793 -.0007043 .0009211
lagclose | 2.882026 4.687858 0.61 0.539 -6.351996 12.11605
_cons | 18.22684 16.99459 1.07 0.285 -15.24867 51.70235
------------------------------------------------------------------------------
When an explanatory variable is a lag of the dependent variable, the Durbin-Watson
statistic (estat dwatson) is not appropriate for checking autocorrelation. The
appropriate commands are estat durbinalt and/or estat bgodfrey.
estat durbinalt
Number of gaps in sample: 55
estat bgodfrey
Number of gaps in sample: 55
Syntax
• areg depvar [indepvars] [if] [in] [weight], absorb(varname) [options]
Example 6.9
Using auto.dta, an example dataset shipped with Stata, type the command below and see the results.
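A command consistent with the output below (price regressed on weight and length, absorbing the rep78 categories) is:
areg price weight length, absorb(rep78)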
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | 5.478309 1.158582 4.73 0.000 3.162337 7.794281
length | -109.5065 39.26104 -2.79 0.007 -187.9882 -31.02482
_cons | 10154.62 4270.525 2.38 0.021 1617.96 18691.27
-------------+----------------------------------------------------------------
rep78 | F(4, 62) = 2.079 0.094 (5 categories)
The regressor rep78 has 5 categories, which would result in 5 dummy variables. If we
were to use the regress command, we would have to construct a dummy variable for
each category and estimate the model. An alternative is to use the areg command for the
same estimation without creating the dummy variables.
In empirical analysis of the relationship between a dependent variable and a set of
independent variables, we may specify different functional forms (logarithmic,
semi-logarithmic, quadratic and others) to allow a nonlinear relationship between the
dependent and the independent variables. One way to interpret the slope coefficient
under different functional forms is to partially differentiate the dependent variable with
respect to the independent variable; for the log-log form, ∂log(yi)/∂log(x1i) = β1,
which is equivalent to %Δyi/%Δx1i = β1 (an elasticity).
Quadratic functions are also used quite often in applied economics to capture
decreasing or increasing marginal effects. A regression model with a quadratic term in
one of the independent variables can be written as:
yi = β0 + β1 x1i + β2 x1i^2 + β3 x2i + νi
Example 6.10
***summarizing variables
sum wage exper educ
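A minimal sketch of a quadratic specification, assuming a wage dataset containing the variables summarized above (wage, exper, educ); the variable name expersq is introduced here only for illustration:
* create the quadratic term in experience
generate expersq = exper^2
* wage regression with a quadratic in experience
regress wage exper expersq educ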
Example 6.11
In this part we will see various commands that go beyond OLS models. The topics
will include probability models (like probit, logit, multinomial logit and ordered
probit), robust regression methods, regression with censored and truncated data and
regression with measurement error.
Not all dependent variables are continuous, so here we quickly go through some of
the limited dependent variable models, as they are useful in poverty analysis and other
situations where we don't always have a continuous dependent variable. For instance,
some of the questions we may want to address are: What is the probability that an
individual will get a higher degree? What determines labour force participation? What
factors drive the incidence of civil war?
Rather than reporting the coefficients, dprobit reports the marginal effect, that is the
change in the probability for an infinitesimal change in each independent, continuous
variable and, by default, reports the discrete change in the probability for dummy
variables. probit may be typed without arguments after dprobit estimation to see the
model in coefficient form. If estimating on grouped data, refer to bprobit.
Example 7.1
Check the correlation among the explanatory variables to avoid including highly
multicollinear variables:
correlate farmsize hhage adufem adumale oxen tlu primary secondary
dismarkt localmat
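The estimation command itself is not reproduced in the original; a plausible sketch, assuming the covariate list used in the correlation check above, is:
dprobit particip farmsize hhage adufem adumale oxen tlu primary secondary dismarkt localmat, robust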
predict p
gen predp= 0
replace predp= 1 if p > 0.5
tab predp particip
drop p
probit particip
Syntax
logit depvar [indepvars] [if] [in] [weight] [, options]
Example 7.2
Using the same data above, estimate credit market participation using logit regression
and see the results
logit particip sexhh2 eduhh2 skillhh2 farmsize hhage adufem adumale
tlu primary secondary dismarkt , robust
You may have a limited dependent variable that takes more than two discrete values
that cannot be ordered; in such a situation you estimate the relationship using a
multinomial logit model.
Syntax
mlogit depvar [indepvars] [if] [in] [weight] [, options]
Example 7.3
Using the same data above, estimate land market participation using multinomial logit
model and see the results
mlogit landmark sexhh2 eduhh2 skillhh2 farmsize hhage adufem adumale
tlu primary secondary dismarkt zone2 zone3 zone4, robust
You may have a limited dependent variable that takes more than two discrete values
that can be ordered; in such a situation you estimate the relationship using an ordered
probit/logit model.
Syntax
oprobit depvar [indepvars] [if] [in] [weight] [, options]
ologit depvar [indepvars] [if] [in] [weight] [, options]
oprobit fits ordered probit models of ordinal variable depvar on the independent
variables indepvars. The actual values taken on by the dependent variable are
irrelevant, except that larger values are assumed to correspond to "higher" outcomes.
Up to 50 outcomes are allowed in Stata/SE and Intercooled Stata, and up to 20
outcomes in Small Stata.
ologit fits ordered logit models of ordinal variable depvar on the independent
variables indepvars. The actual values taken on by the dependent variable are
irrelevant, except that larger values are assumed to correspond to "higher" outcomes.
Up to 50 outcomes are allowed in Stata/SE and Intercooled Stata, and up to 20
outcomes are allowed in Small Stata.
Example 7.5
Using the same data above, estimate land market participation using ordered probit
model and see the results
oprobit landmark sexhh2 eduhh2 skillhh2 farmsize hhage adufem adumale
tlu primary secondary dismarkt zone2 zone3 zone4, robust
A limited dependent variable may take either zero or positive continuous values. In
such situations, tobit estimation is appropriate. For instance, labor supply, health
expenditure, and alcohol expenditure all take nonnegative continuous values.
Example 7.6
Using the same data above, estimate the loan equation using a tobit model and see the results.
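The tobit command is not shown in the original; a hedged sketch, assuming loan is the dependent variable (censored at zero) and borrowing covariates from the earlier examples, is:
tobit loan farmsize hhage adufem adumale tlu dismarkt, ll(0)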
mfx
Example 7.7
Using the same data above, estimate a Heckman two-step selection model to see the effect of
participation on household welfare (poverty level) and see the results.
Alternatively, one could estimate the selection model in one step using full
maximum likelihood estimation, as below:
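Neither command is reproduced in the original; a minimal sketch with hypothetical variable names (welfare as the outcome, particip as the selection indicator, and x1 x2 z1 as stand-in regressors) is:
* two-step estimator
heckman welfare x1 x2, select(particip = x1 x2 z1) twostep
* full maximum likelihood in one step
heckman welfare x1 x2, select(particip = x1 x2 z1)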
It seems to be a rare dataset that meets all of the assumptions underlying multiple
regression. We know that failure to meet the assumptions can lead to biased estimates of
the coefficients and, especially, biased estimates of the standard errors. This fact explains a
lot of the activity in the development of robust regression methods.
The idea behind robust regression methods is to make adjustments in the estimates
that take into account some of the flaws in the data itself. We are going to look at
three approaches to robust regression: 1) regression with robust standard errors
including the cluster option, 2) robust regression using iteratively reweighted least
squares, and 3) quantile regression, more specifically, median regression.
Before we look at these approaches, let's look at a standard OLS regression using the
nlsw88.dta dataset.
We will look at a model that predicts the hourly wage (wage) using the educational
level (grade), work experience (ttl_exp) and job tenure (tenure). First let's look at
the descriptive statistics for these variables. Note the missing values for grade and
tenure.
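The commands are not shown in the original; a sketch consistent with the text is:
sysuse nlsw88.dta, clear
summarize wage grade ttl_exp tenure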
Below we see the regression predicting hourly wage (wage) using the educational
level (grade), work experience (ttl_exp) and job tenure (tenure). We see that all of
the variables are significant except for tenure.
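The command consistent with the output below is:
regress wage grade ttl_exp tenure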
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .64945 .0456072 14.24 0.000 .560013 .738887
ttl_exp | .2329752 .0303483 7.68 0.000 .1734612 .2924891
tenure | .0378054 .0250756 1.51 0.132 -.0113686 .0869794
_cons | -3.86429 .6318619 -6.12 0.000 -5.103391 -2.625189
------------------------------------------------------------------------------
We can use the test command to test total work experience and job tenure, and we
find the overall test of these two variables is significant.
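For example:
test ttl_exp tenure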
Here is the residual versus fitted plot for this regression. Notice that the pattern of the
residuals is not exactly as we would hope: the spread of the residuals is not constant
across the fitted values, being noticeably wider at one end of the graph than at the
other, which suggests some heteroscedasticity.
rvfplot
(Figure: rvfplot, residuals versus fitted values)
Below we show the avplots. Although the plots are small, you can see some points
that are of concern. There is not a single extreme point but a handful of points that
stick out. For example, in the top right graph you can see a handful of points that
stick out from the rest. If this were just one or two points, we might look for mistakes
or for outliers, but we would be more reluctant to consider such a large number of
points as outliers.
avplots
(Figure: avplots, added-variable plots; the panel for tenure shows coef = .0378, se = .0251, t = 1.51)
Here is the lvr2plot for this regression. We see 4 points that are somewhat high in
both their leverage and their residuals.
lvr2plot
(Figure: lvr2plot, leverage versus normalized residual squared)
None of these results are dramatic problems, but the rvfplot suggests that there might
be some outliers and some possible heteroscedasticity; the avplots have some
observations that look to have high leverage, although the lvr2plot does not show
such influential observations. We might wish to use something other than OLS
regression to estimate this model. In the next several sections we will look at some
robust regression methods.
The Stata regress command includes a robust option for estimating the standard
errors using the Huber-White sandwich estimators. Such robust standard errors can
deal with a collection of minor concerns about failure to meet assumptions, such as
minor problems about normality, heteroscedasticity, or some observations that exhibit
large residuals, leverage or influence. For such minor problems, the robust option may
effectively deal with these concerns.
With the robust option, the point estimates of the coefficients are exactly the same as
in ordinary OLS, but the standard errors take into account issues concerning
heterogeneity and lack of normality. Here is the same regression as above using the
robust option. Note the changes in the standard errors and t-tests (but no change in
the coefficients). In this particular example, using robust standard errors did not
change any of the conclusions from the original OLS regression.
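The command consistent with the output below is:
regress wage grade ttl_exp tenure, robust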
------------------------------------------------------------------------------
| Robust
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .64945 .0446901 14.53 0.000 .5618113 .7370887
ttl_exp | .2329752 .0290066 8.03 0.000 .1760924 .289858
tenure | .0378054 .0243765 1.55 0.121 -.0099976 .0856084
_cons | -3.86429 .5998228 -6.44 0.000 -5.040561 -2.688019
------------------------------------------------------------------------------
As described earlier, OLS regression assumes that the residuals are independent. The
nlsw88 dataset contains data on 2246 individuals that come from 12 industries. It is
very possible that the wage within each industry may not be independent, and this
could lead to residuals that are not independent within industry. We can use the
cluster option to indicate that the observations are clustered into industry (based on
industry) and that the observations may be correlated within industry, but would be
independent between industries.
By the way, if we did not know the number of industries, we could quickly find out
how many there are as shown below, by quietly tabulating industry and then
displaying the stored result r(r), which gives the number of rows in the table, i.e., the
number of industries in our data.
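For example:
quietly tabulate industry
display r(r)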
Now, we can run regress with the cluster option. We do not need to include the robust
option since robust is implied with cluster. Note that the standard errors have changed
substantially, much more so, than the change caused by the robust option by itself.
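A command consistent with the output below is:
regress wage grade ttl_exp tenure, cluster(industry)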
------------------------------------------------------------------------------
| Robust
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .6505642 .0482345 13.49 0.000 .5444008 .7567276
ttl_exp | .2322409 .0440692 5.27 0.000 .1352454 .3292365
tenure | .0362508 .0268599 1.35 0.204 -.0228675 .0953691
_cons | -3.85212 .6186058 -6.23 0.000 -5.213662 -2.490578
------------------------------------------------------------------------------
As with the robust option, the estimates of the coefficients are the same as the OLS
estimates, but the standard errors take into account that the observations within
industry are non-independent. Even though the standard errors are larger in this
analysis, the two variables that were significant in the OLS analysis are significant in
this analysis as well. These standard errors are computed based on aggregate wages
for the 12 industries, since these industry-level wages should be independent. If you
have a very small number of clusters relative to your overall sample size, it is
possible that the standard errors could be considerably larger than the OLS results,
which is the case here.
The Stata rreg command performs a robust regression using iteratively reweighted
least squares, i.e., rreg assigns a weight to each observation with higher weights
given to better behaved observations. In fact, extremely deviant cases, those with
Cook's D greater than 1, can have their weights set to missing so that they are not
included in the analysis at all.
We will use rreg with the genwt() option so that we can inspect the weights used to
weight the observations. Note that in this analysis both the coefficients and the
standard errors differ from the original OLS regression. Below we show the same
analysis using robust regression using the rreg command.
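A command along these lines would do this, with genwt(wt) saving the weights in a new variable wt (the variable names match the listing further below):
rreg wage grade ttl_exp tenure, genwt(wt)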
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .5179226 .0233127 22.22 0.000 .4722057 .5636395
ttl_exp | .1511732 .0155129 9.74 0.000 .1207519 .1815946
tenure | .1214573 .0128177 9.48 0.000 .0963213 .1465932
_cons | -2.6662 .3229847 -8.25 0.000 -3.299583 -2.032817
------------------------------------------------------------------------------
If you compare the robust regression results (directly above) with the OLS results
presented earlier, you can see that the coefficients, standard errors, t values and p
values are different, which is consistent with the problems we found in the data when we
performed the OLS analysis. Given this difference, we could further investigate the
reasons why the OLS and robust regression results differ (of the two, the robust
regression results would probably be the more trustworthy).
Let's calculate and look at the predicted (fitted) values (p), the residuals (r), and the
leverage (hat) values (h). Note that we include if e(sample) in the commands
because rreg can set some weights to missing, and we would not want predicted
values and residuals for those observations.
predict p if e(sample)
(option xb assumed; fitted values)
(17 missing values generated)
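The residuals and leverage (hat) values mentioned above can be obtained in the same way (a sketch; the names r and h match those used in the listing below):
predict r if e(sample), residuals
predict h if e(sample), hat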
Now, let's check on the various predicted values and the weighting. First, we will sort
by wt and then look at the first 100 observations. Notice that the smallest weights
are zero and that the weights grow only slowly into the 0.2 range.
sort wt
list idcode wage p r h wt in 1/100
+-----------------------------------------------------------------+
| idcode wage p r h wt |
|-----------------------------------------------------------------|
1. | 236 24.66183 8.54511 16.11672 0 0 |
2. | 3250 40.19808 2.772077 37.42601 0 0 |
3. | 856 39.23074 5.928342 33.30239 0 0 |
4. | 1209 25.80515 7.041585 18.76356 0 0 |
5. | 2942 17.34411 5.361766 11.98234 0 0 |
|-----------------------------------------------------------------|
6. | 2792 30.19324 8.044289 22.14895 0 0 |
7. | 3937 21.38486 5.048339 16.33652 0 0 |
8. | 3648 40.19808 3.987089 36.21099 0 0 |
9. | 3507 40.19808 9.363749 30.83433 0 0 |
10. | 1863 20.64412 8.206309 12.43781 0 0 |
   (rows 11 through 70 omitted)
71. | 4801 38.70926 6.55222 32.15704 0 0 |
72. | 2049 30.92161 7.202032 23.71957 0 0 |
73. | 2613 40.74659 6.901216 33.84538 0 0 |
74. | 1664 20.64412 9.186622 11.4575 .0000133 .0072777 |
75. | 3999 19.35587 7.912395 11.44348 .0000101 .00763929 |
|-----------------------------------------------------------------|
Now, let's look at the last 100 observations (the listing command is sketched below).
The weights for the last 100 observations are all very close to one, except that the
values for observations 2230 through the end are missing because of missing predictors.
Note that the observations above with the lowest weights are also those with the
largest residuals (residuals over 8), while the observations with the highest weights
have very low residuals (all less than 0.14).
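A command along these lines would produce that listing:
list idcode wage p r h wt in -100/l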
After using rreg, it is possible to generate predicted values, residuals and leverage
(hat), but most of the regression diagnostic commands are not available after rreg.
We will have to create some of them for ourselves. Here, of course, is the graph of
residuals versus fitted (predicted) with a line at zero. This plot looks much like the
OLS plot, except that in the OLS all of the observations would be weighted equally,
but as we saw above the observations with the greatest residuals are weighted less and
hence have less influence on the results.
scatter r p, yline(0)
[Scatterplot of the residuals (r) against the fitted values (p), with a horizontal reference line at zero]
To construct an lvr2plot by hand, we have to go through several steps in order to get the normalized
squared residuals and the means of both the residuals and the leverage (hat) values.
First, we generate the residual squared (r2) and then divide it by the sum of the
squared residuals. We then compute the mean of this value and save it as a local
macro called rm (which we will use for creating the leverage vs. residual plot).
generate r2=r^2
(17 missing values generated)
sum r2
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
r2 | 2229 29.70332 136.1406 .0000198 1400.706
replace r2 = r2/r(sum)
(2229 real changes made)
sum r2
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
r2 | 2229 .0004486 .0020562 2.99e-10 .0211559
local rm = r(mean)
Next we compute the mean of the leverage and save it as a local macro called hm.
sum h
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
h | 2229 .0017796 .0011727 0 .0106942
local hm = r(mean)
Now, we can plot the leverage against the residual squared as shown below.
Comparing the plot below with the plot from the OLS regression, this plot is much
better behaved. There are no longer points in the upper right quadrant of the graph.
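A command along these lines, using the saved locals hm and rm for the reference lines, produces the plot:
scatter h r2, yline(`hm') xline(`rm')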
[Plot of the corrected (for wt) hat diagonals against the normalized squared residuals]
drop wt p r h r2
Here is what the quantile regression looks like using Stata's qreg command.
Comparing the quantile regression results with the OLS results previously presented,
you can see that the coefficients, standard errors, t values and p values are different.
Moreover, the coefficient for the variable job tenure (tenure) is statistically
significant in contrast to the result in OLS.
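A command along these lines fits the default median (0.5 quantile) regression with the same predictors:
qreg wage grade ttl_exp tenure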
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .54382 .0292851 18.57 0.000 .486391 .601249
ttl_exp | .1645623 .0194886 8.44 0.000 .1263445 .2027802
tenure | .1203565 .016107 7.47 0.000 .0887702 .1519428
_cons | -3.17323 .4055463 -7.82 0.000 -3.968518 -2.377941
------------------------------------------------------------------------------
The qreg command has even fewer diagnostic options than rreg does. About the only
values we can obtain are the predicted values and the residuals.
predict p if e(sample)
(option xb assumed; fitted values)
(17 missing values generated)
predict r if e(sample), r
(17 missing values generated)
scatter r p, yline(0)
[Scatterplot of the qreg residuals against the fitted values, with a horizontal reference line at zero]
bsqreg is the same as sqreg with one quantile. sqreg is, therefore, faster than bsqreg.
Analyzing data that contain censored values or are truncated is common in many
research disciplines.
Censored data is data in which we do not always observe the outcome on the
dependent variable because, at an upper (or lower) threshold, we only know that the
outcome was above (or below) the threshold. With censored data we can still use
information on the censored outcomes because we always observe the explanatory variables.
Truncated data is data in which the sampling scheme entirely excludes part of the
population on the basis of outcomes on the dependent variable. With truncated data we
observe no information on units that are not covered by the sampling scheme.
Using data from Wooldridge (2000), MROZ.DTA, we want to estimate a reduced-form
annual hours equation for married women, where the dependent variable is annual
hours worked (hours) and the explanatory variables are nwifeinc, educ, exper,
expersq, age, kidslt6, and kidsge6.
Of the 753 women in the sample, 428 worked for a wage outside the home during the
year, while 325 of the women worked zero hours. For the women who worked positive
hours, the range is fairly broad, from 12 to 4,950. Annual hours worked is therefore
left-censored at zero, and estimating a reduced-form annual hours equation for
married women requires a model that recognizes the censoring of the data. A
reasonable candidate is the Tobit model. For comparison, we also estimate a
linear model (using all 753 observations) by OLS.
The data set contains 22 variables for 753 observations; the describe command gives a
description of the data set. Let's look at the description of the data, some descriptive
statistics, and the correlations among the variables, as sketched below.
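Commands along these lines would do so (the file name is taken from above):
use MROZ.DTA, clear
describe
summarize hours nwifeinc educ exper expersq age kidslt6 kidsge6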
count if hours>0
428
count if hours==0
325
corr hours nwifeinc educ exper expersq age kidslt6 kidsge6
(obs=753)
Now, let's run a standard OLS regression on the data and generate predicted scores in
olsp.
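A command along these lines produces the OLS results shown below:
regress hours nwifeinc educ exper expersq age kidslt6 kidsge6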
------------------------------------------------------------------------------
hours | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nwifeinc | -3.446636 2.544 -1.35 0.176 -8.440898 1.547626
educ | 28.76112 12.95459 2.22 0.027 3.329283 54.19297
exper | 65.67251 9.962983 6.59 0.000 46.11365 85.23138
expersq | -.7004939 .3245501 -2.16 0.031 -1.337635 -.0633524
age | -30.51163 4.363868 -6.99 0.000 -39.07858 -21.94469
kidslt6 | -442.0899 58.8466 -7.51 0.000 -557.6148 -326.565
kidsge6 | -32.77923 23.17622 -1.41 0.158 -78.2777 12.71924
_cons | 1330.482 270.7846 4.91 0.000 798.8906 1862.074
------------------------------------------------------------------------------
predict olsp
(option xb assumed; fitted values)
As pointed out above, the tobit model is a reasonable candidate for regression with
censored data; hence we use the tobit command.
You can declare both lower and upper censored values. The censored values are fixed
in that the same lower and upper values apply to all observations.
ll[(#)] and ul[(#)] indicate the lower and upper limits for censoring, respectively. You
may specify one or both. Observations with depvar <= ll() are left-censored;
observations with depvar >= ul() are right-censored; and remaining observations are
not censored. You do not have to specify the censoring value at all. It is enough to
type ll, ul, or both. When you do not specify a censoring value, tobit assumes that the
lower limit is the minimum observed in the data (if ll is specified) and the upper limit
is the maximum (if ul is specified).
There are two other commands in Stata that allow you more flexibility in doing
regression with censored data.
cnreg estimates a model in which the censored values may vary from observation to
observation.
intreg estimates a model where the response variable for each observation is either
point data, interval data, left-censored data, or right-censored data.
Now, let's run a tobit regression on the data and generate predicted scores in tobitp.
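Since hours is left-censored at zero, a command along these lines fits the model:
tobit hours nwifeinc educ exper expersq age kidslt6 kidsge6, ll(0)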
------------------------------------------------------------------------------
hours | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nwifeinc | -8.814243 4.459096 -1.98 0.048 -17.56811 -.0603724
educ | 80.64561 21.58322 3.74 0.000 38.27453 123.0167
exper | 131.5643 17.27938 7.61 0.000 97.64231 165.4863
expersq | -1.864158 .5376615 -3.47 0.001 -2.919667 -.8086479
age | -54.40501 7.418496 -7.33 0.000 -68.96862 -39.8414
kidslt6 | -894.0217 111.8779 -7.99 0.000 -1113.655 -674.3887
kidsge6 | -16.218 38.64136 -0.42 0.675 -92.07675 59.64075
_cons | 965.3053 446.4358 2.16 0.031 88.88528 1841.725
-------------+----------------------------------------------------------------
/sigma | 1122.022 41.57903 1040.396 1203.647
------------------------------------------------------------------------------
Obs. summary: 325 left-censored observations at hours<=0
428 uncensored observations
0 right-censored observations
predict tobitp
(option xb assumed; fitted values)
Summarizing the olsp and tobitp scores shows that the tobit predicted values have a
larger standard deviation and a greater range of values.
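For example (a quick check; output omitted):
summarize olsp tobitp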
Comparing the tobit and OLS estimates, we see, first, that the tobit coefficient
estimates have the same sign as the corresponding OLS estimates and that the
statistical significance of the estimates is similar, except for the coefficient on
nwifeinc. Second, though it is tempting to compare the magnitudes of the OLS and
tobit estimates, such comparisons are not very informative. To compare the tobit
coefficient estimates directly with their OLS counterparts, we need to multiply them
by an adjustment factor (see Wooldridge, 2000).
Truncated data occur when some observations are not included in the analysis
because of the value of the dependent variable. We will illustrate analysis with
truncation using the dataset nlsw88.dta. These data contain only employed individuals
with positive wages and hours worked. Because the sample excludes individuals who
are not employed, the data can be considered truncated at zero, i.e., wages must be
greater than zero for an individual to be included in the sample.
Let’s use the data set, display some descriptive statistics and the correlations among
the variables, and rerun the OLS regression for comparison.
use nlsw88.dta
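Commands along these lines produce the descriptive statistics, the correlations, and the OLS results shown below (the variable list is assumed from that output):
summarize wage grade ttl_exp tenure
correlate wage grade ttl_exp tenure
regress wage grade ttl_exp tenure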
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .64945 .0456072 14.24 0.000 .560013 .738887
ttl_exp | .2329752 .0303483 7.68 0.000 .1734612 .2924891
tenure | .0378054 .0250756 1.51 0.132 -.0113686 .0869794
_cons | -3.86429 .6318619 -6.12 0.000 -5.103391 -2.625189
------------------------------------------------------------------------------
Given the truncated nature of the data, we estimate the model using truncated
regression. The syntax diagram for truncated regression is as below:
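truncreg depvar [indepvars] [if] [in] [weight] [, ll(varname|#) ul(varname|#) options]
This is an abridged version of the syntax; see help truncreg for the full option list.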
truncreg fits a regression model of depvar on varlist from a sample drawn from a
restricted part of the population. Under the normality assumption for the whole
population, the error terms in the truncated regression model have a truncated normal
distribution, which is a normal distribution that has been scaled upward so that the
distribution integrates to one over the restricted range.
ll(varname|#) and ul(varname|#) indicate the lower and upper limits for truncation,
respectively. You may specify one or both. Observations with depvar < ll() are left-
truncated, observations with depvar > ul() are right-truncated, and the remaining
observations are not truncated.
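Here the truncation is from below at zero, so a command along these lines fits the model:
truncreg wage grade ttl_exp tenure, ll(0)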
Truncated regression
Limit: lower = 0 Number of obs = 2229
upper = +inf Wald chi2(3) = 252.42
Log likelihood = -6497.6434 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1 |
grade | 1.357205 .1056178 12.85 0.000 1.150198 1.564212
ttl_exp | .5526835 .0690394 8.01 0.000 .4173688 .6879983
tenure | .0454751 .0485286 0.94 0.349 -.0496392 .1405894
_cons | -21.8007 1.949961 -11.18 0.000 -25.62256 -17.97885
-------------+----------------------------------------------------------------
sigma |
_cons | 7.653959 .2447991 31.27 0.000 7.174161 8.133756
------------------------------------------------------------------------------
The coefficients from the truncreg command are considerably larger than the OLS results;
for example, the coefficient for grade is 1.357, roughly twice the OLS estimate of
0.649. Similarly, the coefficient for ttl_exp is 0.553, more than twice the OLS estimate of
0.233. While truncreg may improve the estimates on a restricted data file as
compared with OLS, it is certainly not a substitute for analyzing the complete
unrestricted data, i.e., a non-truncated, randomly drawn sample from the population.
As you will most likely recall, one of the assumptions of regression is that the
predictor variables are measured without error. The problem is that measurement error
in predictor variables leads to underestimation of the regression coefficients. Stata's
eivreg command takes measurement error into account when estimating the
coefficients for the model.
Let's look at a regression using the bwages.dta dataset from Verbeek (2000):
***summarizing variables
sum wage exper educ
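A regression command along these lines produces the OLS output below (the variable list is assumed from that output):
regress wage exper educ male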
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
exper | 7.756359 .3865812 20.06 0.000 6.998049 8.51467
educ | 80.11866 3.252994 24.63 0.000 73.73765 86.49967
male | 54.30332 7.774967 6.98 0.000 39.05209 69.55455
_cons | 8.620323 15.60731 0.55 0.581 -21.99468 39.23532
------------------------------------------------------------------------------
The predictor exper is years of experience of a worker. If we assume that the reported
experience, exper, is measured with error, the above OLS result is biased. We don't
know the exact reliability of exper, but using 0.8 for the reliability would probably
not be far off. We will now estimate the same regression model with the Stata eivreg
command, which stands for errors-in-variables regression.
***Errors-in-variable Regression
eivreg wage exper educ male, r(exper .8)
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
exper | 9.966844 .4769471 20.90 0.000 9.031273 10.90241
educ | 85.20147 3.198329 26.64 0.000 78.92769 91.47525
male | 48.71597 7.503267 6.49 0.000 33.99771 63.43424
_cons | -43.22058 16.54975 -2.61 0.009 -75.68425 -10.7569
------------------------------------------------------------------------------
Note that the F-ratio and the R-squared increased along with the regression coefficient for
exper. Additionally, there is an increase in the standard error for exper.
***Errors-in-variable Regression
eivreg wage exper expersq educ male, r(exper .95)