STATA TUTORIAL PART ONE
Stata is easy to learn and is powerful software that is
widely used in research
Get started
Organizing do-files
The do-file and creating a log-file
Input and load data
Combining data
Reshaping the data
Useful commands to modify data
Label variables
Create summary statistics
Import data
GET STARTED
Open Stata from the Start menu under Programs
Use help [commandname] to get the help files!
. help regress
If you want to search all the sources that have to do with
regression in general, type
. findit regress
Also, use the Internet for help! You can search for code
written by someone else. For example, Stata does not
have a built-in command to calculate the Gini index. To search
for user-written ones, type
. net search gini
To read about how to update Stata, type
. help update
ORGANIZING DO-FILES
For reproducibility of your results!
Write all the code in do-files:
Idea: original data is safe and you can always get back to
the raw data if something goes wrong. The new data and
new variables are created in do-files. Working directly in
Stata is useful to explore the data but as soon as it produces
something important you should write it in a do-file.
Always create logs in the do-file. Each do-file should have a log file
where all results are saved in text.
Separate do-files that create data from do-files that analyze data.
crdata1.do
crdata2.do
andata1.do
andata2.do
etc.
STARTING A PROJECT / CREATE DOFILE
Create a separate directory for each project. Tell Stata
to change directory by typing (quote the path if it contains a space):
cd "Z:/pathname/Econometrics 1"
A do-file can be created in many ways:
In the menu: under Window > Do-file Editor
Clicking the do-file editor icon, a little pen and lines
Type doedit in the Command window
Save the do-file:
Under File in the do-file menu
Ctrl+S
Make sure the do-file is saved where you want it
THE DO-FILE AND CREATING A LOGFILE
Open and save the log-file in the do-files.
The do-file can have the following structure:
capture log close
log using crdata.log, replace
set more off
[
..body.
]
log close
THE DO-FILE AND CREATING A LOGFILE
log close closes any log file that is open.
capture is a powerful command in do-files: it lets the
do-file or program continue even if the command that follows
it returns an error that would otherwise stop the program, for
example log close when there is no log file open.
log using [filename] opens the log file.
replace overwrites [filename] if it already exists;
otherwise a new file is created.
set more off tells Stata not to pause when running the
program. With set more on (the default in older versions of
Stata), Stata waits until you press a key before continuing when a
--more-- message is displayed. This is mostly annoying but
sometimes useful.
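To make the role of capture concrete, here is a minimal do-file sketch. It assumes that no log file is open when it starts; capture then swallows the error from log close and stores the return code in _rc, which is zero after a success and nonzero after a failure.

```stata
* Assumes no log file is open at this point.
capture log close               // would otherwise stop the do-file with an error
display "return code: " _rc     // nonzero, because there was no log to close

capture display "this works"    // capture also suppresses the command's output
display "return code: " _rc     // 0, because the command succeeded
```

Checking _rc right after a captured command is a common way to branch on whether the command worked.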
INPUT AND LOAD DATA
Create a dataset manually (command window):
. clear
. input id female wage2005 wage2006 wage2007
  1. 1 0 94 96 98
  2. 2 1 75 79 77
  3. 3 0 70 69 70
  4. end
. save mydata1
. clear
. input id school public
  1. 1 8 1
  2. 2 7 0
  3. 3 1 1
  4. 4 10 1
  5. end
. save mydata2
INPUT AND LOAD DATA
That last command saves your newly created
data as mydata2.dta
You can also export the data to other formats, like Excel:
File -> Export -> Data to Excel spreadsheet
. export excel using mydata2
If you want to work with this data, type
. use mydata2
This command loads a Stata file, that is, a file with
the extension .dta
COMBINING DATA
append using filename [filename], [option]
appends new data (using dataset) vertically into the data in
memory (master dataset)
. append using mydata1.dta
. list
merge 1:1 [varlist] using [filename], [option]
combines new data (using dataset) horizontally with the data
in memory (master dataset) by an identifying variable
specified in varlist. The data in the master dataset are never
replaced by the using dataset unless Stata is explicitly
asked to do this
. use mydata1, clear
. merge 1:1 id using mydata2.dta
. list
COMBINING DATA - HORIZONTALLY
+---------------------------------------------------------------------------------+
| id   female   wage2005   wage2006   wage2007   school   public           _merge |
|---------------------------------------------------------------------------------|
|  1        0         94         96         98        8        1      matched (3) |
|  2        1         75         79         77        7        0      matched (3) |
|  3        0         70         69         70        1        1      matched (3) |
|  4        .          .          .          .       10        1   using only (2) |
+---------------------------------------------------------------------------------+
Observations 1-3 contain information from both datasets
(_merge==3)
Observation 4 contains information from mydata2.dta only (_merge==2)
Save the new dataset:
. save merged1, replace
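After a merge it is usually worth checking _merge before moving on. The sketch below rebuilds the two small example datasets in a do-file (the tempfile name schools is only illustrative), tabulates _merge, and keeps the matched observations. Whether unmatched observations should be dropped depends on the analysis; this is just one possible choice.

```stata
* Using data: schooling for ids 1-4, stored in a temporary file
clear
input id school public
1 8 1
2 7 0
3 1 1
4 10 1
end
tempfile schools
save `schools'

* Master data: wages for ids 1-3
clear
input id female wage2005 wage2006 wage2007
1 0 94 96 98
2 1 75 79 77
3 0 70 69 70
end

merge 1:1 id using `schools'
tab _merge                  // 3 matched, 1 found only in the using data

* Keep the matched observations and drop the indicator
keep if _merge == 3
drop _merge
count
```

Dropping _merge also matters if you plan to merge again later, because merge refuses to run while a variable called _merge already exists.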
RESHAPING THE DATA
reshape converts data from wide to long formats and
vice versa
. reshape long wage, i(id) j(year)
(note: j = 2005 2006 2007)
Data                               wide   ->   long
Number of obs.                        4   ->   12
Number of variables                   7   ->   6
j variable (3 values)                     ->   year
xij variables:
         wage2005 wage2006 wage2007       ->   wage
RESHAPING THE DATA
. list
     +---------------------------------------------+
     | id   year   female   wage   school   public |
     |---------------------------------------------|
  1. |  1   2005        0     94        8        1 |
  2. |  1   2006        0     96        8        1 |
  3. |  1   2007        0     98        8        1 |
  4. |  2   2005        1     75        7        0 |
  5. |  2   2006        1     79        7        0 |
     |---------------------------------------------|
  6. |  2   2007        1     77        7        0 |
  7. |  3   2005        0     70        1        1 |
  8. |  3   2006        0     69        1        1 |
  9. |  3   2007        0     70        1        1 |
 10. |  4   2005        .      .       10        1 |
     |---------------------------------------------|
 11. |  4   2006        .      .       10        1 |
 12. |  4   2007        .      .       10        1 |
     +---------------------------------------------+
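The "vice versa" direction works the same way. As a small self-contained sketch (using only the wage data from the earlier example), the code below goes from wide to long and back again with reshape wide.

```stata
clear
input id female wage2005 wage2006 wage2007
1 0 94 96 98
2 1 75 79 77
3 0 70 69 70
end

reshape long wage, i(id) j(year)    // 3 observations become 9
list, sepby(id)

reshape wide wage, i(id) j(year)    // back to one row per id
describe
```

In both directions i() names the variable that identifies a unit and j() names the variable that holds the suffix (here the year).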
USEFUL COMMANDS TO MODIFY
VARIABLES
count shows the number of observations in the dataset:
. save merged1.dta, replace
. count
You can also count the number of observations that fulfill a condition
. count if female==1 & wage>=76
gen creates a new variable. For example, to create a squared wage
variable and a log transformation:
. gen wagesqr=wage^2
. gen lnwage=log(wage)
. gen femaledummy=1 if female==1 & female !=.
. replace femaledummy=0 if female==0 & female !=.
. tab year, gen(year_dum)
replace changes the contents of an existing variable
. replace wagesqr=wagesqr/1000
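The gen/replace pair used for femaledummy can also be written in one line. The sketch below uses a made-up three-row dataset with one missing value to show that a logical expression evaluates to 1 or 0, and that the if qualifier keeps missing values missing.

```stata
clear
input id female wage
1 0 94
2 1 75
3 . 70
end

* (female == 1) is 1 when true and 0 when false;
* the if qualifier leaves femaledummy missing where female is missing
gen femaledummy = (female == 1) if !missing(female)
list
```

Without the if qualifier the third observation would get a 0, which would silently treat a missing value as male.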
USEFUL COMMANDS TO MODIFY
VARIABLES
rename [oldname] [newname] changes the name of an existing
variable
. rename school schyears
recode varlist (rule) (rule) [if] [in], gen(newvar) recodes
categorical variables
. recode female (1=0) (0=1), gen(male)
. recode schyears (1/6=0 primary) (7/10=1 secondary) if public==1, gen(secsch)
drop [varlist] deletes variables in varlist
. drop secsch
keep [varlist] keeps the variables in varlist
. keep id year female wage schyears public
LABEL VARIABLES
label var [varname] "label" attaches a descriptive label to a variable
. label var public "The school attended was public"
. describe public
label define [labelname] # "label" # "label" defines a set of value labels
. label define pp 0 "private" 1 "public"
pp is now a named set of labels for the values 0 and 1.
label values [varname] [labelname] attaches the new
definition pp to the values in variable public
. label values public pp
. list
CREATING SUMMARY STATISTICS
egen creates a new variable holding a summary statistic of some variable, over all observations or within groups
. egen avgwage=mean(wage)
. egen maxwage=max(wage), by(schyears)
collapse clist [if] [in] [weight] [, options] replaces the data in memory
with a dataset of summary statistics. The statistics in clist can be sum,
min, max, median, sd etc. The default is mean, but here we take the sum of
all the wages by school years, if the school was public.
. save merged2.dta, replace
. collapse (sum) wage if public==1, by(schyears)
. list
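The difference between egen and collapse is easy to miss: egen adds a column, while collapse replaces the data in memory. The sketch below uses a small made-up dataset and wraps collapse in preserve/restore, which is one way to get the original data back afterwards (saving first, as above, is another).

```stata
clear
input id schyears public wage
1 8 1 94
2 7 0 75
3 8 1 70
4 10 1 60
end

* egen adds a variable; the number of observations is unchanged
egen avgwage = mean(wage), by(schyears)
list

* collapse replaces the data in memory, so protect it first
preserve
collapse (sum) wage if public == 1, by(schyears)
list                       // one row per value of schyears
restore

count                      // the original observations are back
```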
IMPORT DATA
Download CPS92_08 Data (Excel Dataset) from:
http://wps.pearsoned.co.uk/ema_ge_stock_ie_3/193/49605/12699039.cw/index.html
If you do a Google search for cps92_08, it is on the first
page you find, under CPS data
Open cps92_08.xlsx and save it as cps92_08.csv
instead
File -> Export -> Change File Type -> CSV
IMPORT DATA
Work from the do-file. Write clear to get rid of any old
dataset in memory before loading new data.
insheet using [filename], [option] imports a .csv file. If
the file uses a semicolon as delimiter, add delimit(;) as an option
Save it as a Stata file (.dta)
capture log close
log using mycps.log, replace text
set more off
clear
insheet using cps92_08.csv, delimit(;)
li if _n<50
destring ahe, dpcomma replace
save mycps.dta
log close
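In newer versions of Stata (13 and later), import delimited is the usual replacement for insheet. The sketch below is self-contained: it first writes a tiny semicolon-delimited file with decimal commas and then reads it back. The file name example_cps.csv and the numbers in it are made up for illustration.

```stata
* Build a small semicolon-delimited file with decimal commas (made-up values)
clear
input year str6 ahe
1992 "11,5"
2008 "18,25"
end
export delimited using example_cps.csv, delimiter(";") replace

* Read it back: first row holds the variable names
import delimited using example_cps.csv, delimiter(";") varnames(1) clear

* ahe arrives as text because of the decimal comma; convert it to a number
destring ahe, dpcomma replace
list
```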
ABBREVIATIONS IN STATA
Commands and variables can be abbreviated
You abbreviate a variable by using the shortest prefix
that uniquely identifies it
. list ah ba ag in 1/50
Commands work the same way. It may be hard to
know exactly what identifies a command since there
are lots of them. In the help file, the shortest allowed
abbreviation of a command is underlined
Some common abbreviations:
gen generate
li list
des describe
reg regress
STATA TUTORIAL: PART TWO
Examine dataset
Organize your variables
Producing tables
Correlation
Regression analysis
Extracting regression results
Testing hypothesis
Graphing Data
EXAMINE DATASET
Open Stata and load the cps92_08 dataset that we saved
as mycps.dta. Type desc for an overview of the dataset,
or inspect for a quick summary of each variable:
. cd Z:\...
. use mycps, clear
. desc
. inspect
Click the Break button in the menu if you want to tell
Stata to stop running, or press any key to continue when
a --more-- message is displayed. Type set more off
before running to turn off the pause message.
EXAMINE DATASET
You may also want to use the Data Editor in the Stata menu
to browse through your data in a spreadsheet.
Data > Data Editor, or
. br
Use list [varlist] [if] [in], [options] to examine certain
variables or a particular range of observations in one or all
variables:
. list age bachelor
. list in 1/20
. list age bachelor in 1/20
. list age bachelor female if female==1 in 1/20
Tip: Use help in Stata to find out all the options for each
command. For example, type help list
ORGANIZE YOUR VARIABLES
To organize your data, generate an id number for each
observation.
. gen id=_n
. bysort female: gen id_f=_n
. list id_f female in 8720/8750
To number males and females separately, use the bysort
prefix, since the data need to be sorted by group
first.
bysort is a shorter way of writing sort followed by by:
. sort female
. by female: gen id_f2=_n
. list id_f id_f2 female in 8720/8750
. drop id_f2
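Alongside _n (the running observation number), Stata also provides _N (the total number of observations), and both work within groups under bysort. A minimal sketch on a made-up five-row dataset:

```stata
clear
input female
0
1
1
0
1
end

bysort female: gen id_f = _n     // running number within each group
bysort female: gen n_f  = _N     // group size, repeated on every row
list, sepby(female)
```

Here n_f is 2 for the rows with female equal to 0 and 3 for the rows with female equal to 1.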
PRODUCING TABLES - TABULATE
tab produces one-way and two-way tables of frequency counts
tab1 produces one-way tables for each variable
tab [varname] [if] [in], [options]
tab1 [varlist] [if] [in], [options]
. tab bachelor
frequency table
. tab1 bachelor female
frequency table for several variables
. tab bachelor female
crosstabulation
. tab bachelor female, column row
column and/or row percentages
. tab bachelor female, column nofreq
hides frequencies and shows only percentages
. tab bachelor female if ahe>15, column nofreq
use a condition
PRODUCING TABLES - TABLE
table calculates and displays tables of summary statistics
table [rowvar] [colvar] [if] [in], [options]
. table female age
frequency table
. table female age, c(mean ahe)
content is the mean of ahe instead of frequencies
. table female age, c(mean ahe) center format(%9.2f)
tells Stata to center the output and show two decimals
. table female age, by(bachelor) c(mean ahe) center format(%9.2f)
can be used with the by option
TABLES - LABELLING VARIABLES
OK, let's make it easier to tell which one is bachelor and
which one is female. It is a good idea to label the variable values.
. label define mf 0 male 1 female
. label values female mf
. label define bach 0 "high school" 1 "bachelor"
. label values bachelor bach
. table female bachelor, by(year) c(mean ahe) center format(%9.2f)
PRODUCING TABLES - TABSTAT AND SUM
tabstat is another option to display a table of summary
statistics
tabstat [varlist] [if] [in], [options]
. tabstat ahe age, stat(mean var sd min max N)
lists the requested summary statistics, here mean, variance, sd, min, max and N
. tabstat ahe age, by(female) stat(mean sd)
the by option produces summary statistics for men and women
in a single table
summarize also produces summary statistics
summarize [varlist] [if] [in], [options]
. sum ahe age
. bysort female: sum ahe age
CORRELATION
To get the correlation between two or more variables use
correlate [varlist] [if] [in], [options]
. corr ahe age bachelor
the correlation between wage, age and education
. corr ahe female
the correlation between wage and gender
REGRESSION ANALYSIS
The effect of age on hourly wage rates.
Use OLS regression of dependent variable ahe on
independent variable age
. reg ahe age
What does this result tell us about the effect of age on hourly wage
rate?
Don't forget to check the
t-value: statistically significant?
p-value: significance level?
R-squared: how well does the model explain the variation in the
dependent variable?
REGRESSION ANALYSIS
Two independent variables
. reg ahe age female
What does the coefficient for female tell us?
Use the by prefix and/or the if qualifier to run separate
analyses for groups of observations.
. bysort year: reg ahe age female
Runs a separate regression for each year with a single command
. reg ahe age female if year==2008
Runs a regression for the 2008 observations only.
REGRESSION ANALYSIS
Heteroskedasticity? The assumption of homoskedastic error
terms, i.e. errors with constant variance, is hardly ever satisfied
in the real world. To get heteroskedasticity-robust standard errors, add the
option robust after the model specification.
. reg ahe age female, robust
robust can be abbreviated to just r
. reg ahe age female, r
EXTRACTING REGRESSION RESULTS
ereturn list shows the results stored by the last regression
. ereturn list
. matrix list e(b)
<- shows the regression coefficients
est store [name] saves your last regression results under
a name, here model1
. est store model1
est table [namelist] displays a table of estimation results
. est table
. est table, b se t stats(N r2 F)
If you want to see more result statistics, just add the desired
statistics as options after the comma
EXTRACTING REGRESSION RESULTS
Example: Let's run another regression and
store it as model2
. reg ahe age female bachelor, robust
. est store model2
. est table model1 model2, b se stats(N r2 F)
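Individual results can also be pulled out of the last regression directly. The sketch below uses the auto dataset that ships with Stata as a stand-in for the CPS data, so the variable names (price, mpg, weight) are not the tutorial's own; _b[], _se[] and e() work the same way after any regress.

```stata
sysuse auto, clear                 // example data shipped with Stata
regress price mpg weight, robust

display "coefficient on mpg: " _b[mpg]
display "standard error:     " _se[mpg]
display "t statistic:        " _b[mpg] / _se[mpg]
display "R-squared:          " e(r2)
display "observations:       " e(N)
```

These expressions can be used anywhere a number is allowed, for example in gen or scalar.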
PREDICTIONS
predict [newvar], [option] computes the predicted
(fitted) value or the residual for each observation after
estimation.
. predict yhat
Calculates predicted values of the hourly wage from our
regression and stores them as yhat
. predict uhat, resid
Calculates the residuals and stores them as uhat
. des yhat uhat
Check your new variables
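A quick sanity check on the two new variables is that fitted value plus residual gives back the dependent variable. The sketch below again uses the auto dataset shipped with Stata as a stand-in for the CPS data; the variable name check is only illustrative.

```stata
sysuse auto, clear
regress price mpg weight

predict yhat            // fitted values (the default)
predict uhat, resid     // residuals

* Each observation satisfies price = yhat + uhat, up to rounding
gen check = price - yhat - uhat
summarize check uhat    // both means should be essentially zero
```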
TESTING HYPOTHESIS (T-TEST)
T-test (mean comparison test). To test the equality of means,
we use:
ttest [varname] == # [if] [in], level(#)
One-sample mean comparison test
. ttest ahe==15
Tests if the mean of a specified variable (ahe) is equal to a
certain hypothesized value (15)
. ttest ahe==15, level(99)
The confidence level is 95% by default; here it is changed
to 99%
Two-group mean-comparison test
. ttest ahe, by(female)
tests if men and women on average earn the same wage
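ttest stores its results in r(), which is handy when the numbers are needed later in a do-file. A minimal sketch, using the auto dataset shipped with Stata instead of the CPS data (foreign plays the role of the grouping variable):

```stata
sysuse auto, clear
ttest price, by(foreign)

display "difference in means: " r(mu_1) - r(mu_2)
display "t statistic:         " r(t)
display "two-sided p-value:   " r(p)
```

Type return list after the test to see everything that was stored.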
GRAPHING DATA
hist [varname] [if] [in], [options] produces histograms
. hist ahe
. hist uhat
. hist uhat, normal
Superimposes a normal curve
graph twoway scatter [varlist] [if] [in], [options]
or short
twoway scatter [varlist] [if] [in], [options]
produces a scatter plot of two or more variables
. twoway scatter ahe age
. twoway scatter ahe age, ti("Hourly wage vs Age")
Writes a title for your graph
GRAPHING DATA
There are many options for graphing. Type help twoway to
find out more. For example,
. twoway scatter ahe age, ms(o) mc(red)
changes the marker symbol to o and the marker color to red
. twoway lfit ahe age
draws the fitted regression line of ahe on age, to see any relationship more clearly
. twoway lfit ahe age, lc(blue)
changes the color of the fitted line to blue
. twoway (lfit ahe age) (scatter ahe age), ti("Hourly wage vs Age")
fitted line and scatter plot in the same graph
. twoway (lfit ahe age, lw(0.5) lc(blue)) (scatter ahe age, ms(o) mc(red)), ti("Hourly wage vs Age") xline(30, lw(1) lc(black))
adds a vertical line at age 30 with line width 1 and line color black
GRAPHING DATA
graph box [yvarlist] [if] [in], [options] creates
box plots. This command draws vertical boxes
. graph box ahe, over(female) over(year)
graph hbox [yvarlist] [if] [in], [options] creates
horizontal boxes
. graph hbox ahe, over(female) over(bachelor)
graph export exports the graph, here as a PostScript file
. graph export mygraph.ps
Copy the graph directly to MS Word by right-clicking it and choosing
Copy.
SOME EXTRA TIPS - LOOPS
Loops are very useful in some circumstances and
come in two flavours:
foreach, which loops over a list of something:
. foreach var in age bachelor female {
.     reg ahe `var', r
. }
forvalues, which loops over numbers:
. tab age, gen(age_dum)
. forvalues i = 1/10 {
.     qui sum ahe if age_dum`i' == 1
.     scalar a = r(mean)
.     di "Average hourly earnings is " a " for age group `i'"
. }
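In the forvalues example the counter runs over the dummy numbers 1 to 10, not over the ages themselves. One way to loop over the actual values of a variable is levelsof combined with foreach. The sketch below uses the auto dataset shipped with Stata, with rep78 standing in for age and price for ahe.

```stata
sysuse auto, clear
levelsof rep78, local(levels)        // stores the distinct values 1 2 3 4 5
foreach l of local levels {
    quietly summarize price if rep78 == `l'
    display "Mean price is " r(mean) " if rep78 = `l'"
}
```

This avoids creating dummy variables first, and the loop adapts automatically if the set of values changes.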
SOME EXTRA TIPS - ESTOUT
Estout is a way to produce nice output tables. It is not
a standard feature of Stata so it might have to be
installed:
. ssc install estout
It has many features, explore the help-file to get to
know it! We present an example here:
. reg ahe age, r
. eststo mod1
. reg ahe age bachelor, r
. eststo mod2
. reg ahe age bachelor female, r
. eststo mod3
. esttab mod1 mod2 mod3 using table1, rtf replace