Chapter 1: A Gentle Introduction and UE Chapter 1
1.1 Acquiring Stata
First things first. Before you can use Stata, you have to get access to it. How do you
get it? Your college or university may provide Stata in official computer labs. If it
doesn’t (or if you want a personal copy), you can buy and download Stata directly
(www.stata.com). Fortunately, reasonable student pricing is available.
With access to Stata, you “open” it as you would any program on your computer
(like Word, Excel, etc.). When you open Stata on a PC, you should see something like
Figure 1.1.
FIGURE 1.1
Stata also runs on Macs and, while it looks slightly different, the commands and
functionality are nearly identical to those on a PC. Figure 1.2 shows Stata on a Mac.
Using Stata 1-1
FIGURE 1.2
Let’s talk about what you see. There are five “panels” or “windows” in Stata. The
biggest one, squarely in the middle of the screen, is the “Results” window. Nicely, it
shows you the results of what you tell Stata to do.
At the top left is the “Review” window. This area provides a history of all the
commands you have given Stata. The top right is where the variables in your dataset
will show up and the bottom right is where you’ll see properties of the variables.
The bottom, center window is the “Command” window. As the name suggests, this is
where you can tell Stata what to do, where you actually “program.” (Don’t panic! You
can work in Stata by typing commands one at a time or you can roll all your
commands up into a single program—called in Stata language a “do-file.”)
1.2 Getting Data in Stata
With Stata open, we should move along and open a dataset.
Stata Format
There are a number of ways to get data in Stata. The easiest, of course, is for data to
be in Stata format to begin with. As with most software packages, a particular
“extension” is associated with each type of file. A Microsoft Word document has the
extension “.docx” and “.pdf” is the extension for Adobe Acrobat. Stata data sets have
a “.dta” extension.
Opening a “.dta” file in Stata is pretty straightforward. Click on “File” at the top left.
Then click on “Open” and from there select the folder your data set is in, and then
click the dataset name. You are off and running. In Figure 1.3, I show doing exactly
this for the Magic Hill dataset (named HTWT1.dta) introduced in Section 1.4 of
Using Econometrics. I had previously saved that file to my hard drive (from the Using
Econometrics Student Companion website).
HTWT1.dta has two variables:
Y: weight (in pounds) of the ith customer
X: height (in inches above 5 feet) of the ith customer
FIGURE 1.3
Figure 1.4 displays what you should see in Stata after loading the HTWT1.dta file.
FIGURE 1.4
Notice in the Variables window (blue arrow) there are two variables X and Y—just
as you expected. The Results window gives a record of what I did (red arrow). In
this case, I opened a data set. In Stata-speak, to open a data set is to “use” a data set.
The line use "/Volumes/ECONOMICS/Economics/Data/HTWT1.dta" is really a Stata
programming command. Don’t stress out. We will break that (and many other)
commands down a bit later. For now, appreciate that even though you opened a
data set by a “point and click” approach, Stata recorded what you did in its language.
That is a nice nugget to keep in mind.
Also notice that the command was recorded in the Review window.
Of course, to open a Stata dataset you could also find the file on your computer
and “double-click” on it, just like you open files with most other common software.
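The same step can also be typed directly into the Command window. The lines below are a minimal sketch, assuming HTWT1.dta has been saved in Stata's current working directory; the folder path on your computer will differ from the one recorded in the Results window above.

```stata
* Assumption: HTWT1.dta sits in Stata's current working directory.
* The clear option drops any dataset already in memory before loading.
use "HTWT1.dta", clear

* List the variables in the loaded dataset (here, Y and X).
describe
```

If the file lives elsewhere, put the full path inside the quotation marks.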
While you will have access to all the datasets used in Using Econometrics in Stata
format, that won’t be the case with many other datasets. With that in mind, we
should cover a couple of common approaches to get data into Stata.
The Hard Way: Manual Data Entry
Often in life, there is a “hard way” to do something. Note “hard” does not necessarily
mean “ineffective way.” The “hard way” to get data into Stata is to manually input it.
Let’s say you have the following data that you need to get into Stata.
Income     Experience   Name
$35,000    8            Bruce
$45,000    6            Sue
$52,500    9            Maria
$37,500    15           Woody
$20,000    1            John
Income is defined as annual income in dollars, Experience is in years, and Name is,
well, the person’s name.
As before, open Stata as you would any other program. At the very top, you will see
an icon that looks like a spreadsheet with a pencil. Figure 1.5 shows this:
FIGURE 1.5
If you click on this icon, it will open a “Data Editor” window. Figure 1.6 shows the
Data Editor. As the name suggests, this is where you can edit data.
FIGURE 1.6
The Data Editor looks very similar to a spreadsheet. It is organized in rows and
columns. In Stata, each column is a variable. Each row is an observation.
Start in the top-left cell (indicated by blue arrow) and type 35000 and hit “enter.”
Figure 1.7 shows what you should see:
FIGURE 1.7
Notice that the column is now named “var1” and the row is numbered as “1”
automatically. We should go ahead and tell Stata that we want the name of this
variable to be “Income” and not “var1.” Since we are in the data editor, an easy way
to do this is to double-click on the “var1” under the Properties window at the far
right of the page (shown by a blue arrow in Figure 1.8).
FIGURE 1.8
Name the variable “Income.” Figure 1.9 shows what you see after doing this.
FIGURE 1.9
Naming variables is important for obvious reasons. You want to make sure variable
names are informative but not excessively long. Also, keep in mind that Stata is case
sensitive. To Stata “Income” and “income” are different words.
The next step should be to enter experience and the name of the first person
(Bruce). You would enter “8” in the first row, second column and then “Bruce” in the
first row, third column. Figure 1.10 shows this.
FIGURE 1.10
Again, notice that Stata automatically named the experience variable “var2” and the
name variable “var3.” Naturally, we would want to rename these “Experience” and
“Name” as we did for Income.
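Renaming can also be done with a typed command instead of the Properties window. The sketch below is one possible way to do it, not the approach shown in the figures: it recreates Bruce's row with the default names (using the input command purely for illustration) and then applies Stata's rename command.

```stata
* Recreate the first row with the default names the Data Editor assigned.
clear
input var1 var2 str10 var3
35000 8 "Bruce"
end

* General form: rename old_name new_name
rename var1 Income
rename var2 Experience
rename var3 Name

* Names are case sensitive: Income now exists, income does not.
describe
```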
At this point we have all the information for Bruce in the data set. The first row in
the data set contains all of Bruce’s information. It is worth repeating that a row in
Stata is an observation.
After we rename the variables, we should go ahead and enter the information for
the other four people. Figure 1.11 shows what you should see after all the
information is entered.
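For readers who prefer typing to clicking, the same five rows can be entered with commands. This is a sketch of an alternative route (the input command), not the Data Editor method used in the figures; it should leave you with the same dataset.

```stata
* Start with an empty dataset.
clear

* Declare the variables; str10 marks Name as text of up to 10 characters.
input Income Experience str10 Name
35000  8 "Bruce"
45000  6 "Sue"
52500  9 "Maria"
37500 15 "Woody"
20000  1 "John"
end

* Display the data to check the entries.
list
```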
FIGURE 1.11
You have now worked through getting data into Stata the hard way. I would suggest
at this point you save your data set. “Save early and save often” is a VERY good rule
to live by! The easiest way to do this is to click on “file>save as” as you would with
any other software (such as Word) as shown in Figure 1.12. Naturally, after you
have saved and named the file the first time, to save you just click on “save.”
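Saving has a command equivalent as well. In the sketch below the file name IncomeExperience.dta is hypothetical (pick any name you like), and the file is written to Stata's current working directory.

```stata
* Rebuild the five-person dataset so this example runs on its own.
clear
input Income Experience str10 Name
35000  8 "Bruce"
45000  6 "Sue"
52500  9 "Maria"
37500 15 "Woody"
20000  1 "John"
end

* Hypothetical file name. The replace option lets Stata overwrite
* an existing file with the same name, which is what "save" does
* after the first "save as".
save "IncomeExperience.dta", replace

* Reload the saved file to confirm it was written.
use "IncomeExperience.dta", clear
```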
FIGURE 1.12
I want to make a really important note at this point, something you might have
stumbled onto. Notice that when I entered the income for Bruce I did NOT use a
comma or a dollar sign ($). In Stata, there are essentially two types of data: numeric
and non-numeric. Numeric data only have numbers (and a decimal, if called for).
Data with anything other than numbers is non-numeric. While this is an
oversimplification, it is a good place to start. The takeaway at the moment is that Stata
would have seen “$35,000” as a non-numeric entry no different than it saw “Bruce.”
Since we need it to be a number, we entered “35000” as the value.
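If formatted values such as “$35,000” do end up in Stata as text, they can usually be converted afterward. The sketch below uses Stata's destring command; the variable name IncomeText is hypothetical, and the dollar sign is built with char(36) only to avoid any chance of Stata reading “$” as the start of a global macro.

```stata
* Build one text entry equal to $35,000 (char(36) is the dollar sign).
clear
set obs 1
generate str10 IncomeText = char(36) + "35,000"

* The storage type str10 shows Stata treats this as non-numeric.
describe IncomeText

* Strip dollar signs, spaces, and commas, and store the result
* in a new numeric variable called Income.
destring IncomeText, generate(Income) ignore("$ ,")
list
```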
After you have saved your data set, you can now close your data editor window.
You’ll probably notice that there are many lines in the Results and Review windows.
This is shown in Figure 1.13 (blue and red arrows, respectively).
FIGURE 1.13
What you see is Stata making a record of everything you did as you entered the data
in the form of Stata commands. As before, this is a helpful (and sensible) feature of
Stata and something we will explore more formally later.
The Less Hard Way: Importing Data
Another common way to get data into Stata is to “import” it from another form.
While Stata can import a number of data forms, perhaps the most common import is
from a Microsoft Excel spreadsheet. With that in mind, we’ll take some time to walk
through the process.
We will use the same data we entered manually. I have recorded the data in an
Excel file, shown in Figure 1.14.
FIGURE 1.14
To import this into Stata, click “file>import” in Stata and select “Excel spreadsheet
(*.xls; *.xlsx).” This is shown in Figure 1.15 (blue arrow).
FIGURE 1.15
Once you do that, another window will open, shown in Figure 1.16.
FIGURE 1.16
From here, click on “Browse…” (blue arrow) which will allow you to select the file
you want to import. My file is named ExcelImportData.xlsx. You should see
something along the lines of Figure 1.17.
FIGURE 1.17
Before clicking “OK” we should talk about a couple of settings. The first, identified by
a blue arrow, asks whether you want to have the first row in your Excel file be the
variable names. In our case, we should check this box because row 1 has our
variable names. If the first Excel row doesn’t contain the names, of course, don’t
check it!
The second setting asks whether we want to import the data as “strings” (indicated
by red arrow). While “strings” has a formal computer science definition, for our
purposes it means “not a number.” Clearly, this is not what we want. We need our
income and experience data to be numbers in Stata. So, you should not check that
box.
After clicking the first box and NOT the second box, hit OK. This will automatically
pull the Excel data into Stata. You should then see something like Figure 1.18.
FIGURE 1.18
And you are in the same place as if you had manually entered the data (though this
is a good bit more fun!). If this were a real project, you’d want to go ahead and save
your newly imported data set.
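The import dialog corresponds to a single typed command. The sketch below assumes ExcelImportData.xlsx is in Stata's current working directory; the two check boxes discussed above map to the firstrow and allstring options.

```stata
* Assumption: ExcelImportData.xlsx is in the current working directory.
* firstrow  : treat the first Excel row as variable names (first check box)
* clear     : replace any data currently in memory
* allstring : NOT used here; it would import every column as strings
import excel using "ExcelImportData.xlsx", firstrow clear

* Check that Income and Experience came in as numeric variables.
describe
```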
1.3 Some Basics of Using Data
Once you have data in Stata, you can actually do interesting things. You will use
some commands frequently in Stata and here we’ll work through some of the
common ones. We’ll use the income and experience data introduced above and pick
up right after the Excel data import.
Summary Statistics
One question that might come up is, what is the average income for our data set? Put
another way, what is the sample mean of income?
To get summary statistics (which include mean, standard deviation, minimum and
maximum), you would give the command:
summarize variablename
As a general rule throughout this document, actual Stata commands will be
given in blue font and other elements in Stata command lines, such as
variables, will be in red. Both will be italicized.
Taking the above syntax and applying it to our income and experience data, you
would type the following in the Command window
summarize Income
and hit “enter.”
You should see something like Figure 1.19.
FIGURE 1.19
The results of your command are reported, nicely, in the Results window along with
a record of what command you gave Stata. This single command gives quite a bit of
information. Let’s walk through each:
1. Obs.: the number of observations used in the calculation.
2. Mean: the sample mean (i.e. average) of the data set.
3. Std. Dev.: the standard deviation of the sample.
4. Min: the minimum value found in the data set.
5. Max: the maximum value found in the data set.
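Each of the five numbers is also kept by Stata after the command runs, so you can reuse it. The sketch below re-enters the income and experience data (so it runs on its own) and then displays the stored results; for these five incomes the mean works out to 190,000 / 5 = 38,000.

```stata
* Re-enter the example data so this block is self-contained.
clear
input Income Experience
35000  8
45000  6
52500  9
37500 15
20000  1
end

summarize Income

* summarize leaves its results behind in r():
display r(N)
display r(mean)
display r(sd)
display r(min)
display r(max)
```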
If you wanted even more information on income, you could ask for “detailed”
summary statistics. To do this, you would add “,detail” to the end of the command.
summarize Income, detail
Figure 1.20 provides a picture of what you should see after this command, zooming
in on only what is displayed in the Results window.
FIGURE 1.20
Adding “detail” to the command gives you much more information. Our data set only
has 5 observations so this is not as interesting as if we had thousands of
observations. Still, the point is that you can easily get quite a bit of information
about a variable in Stata—whether it has 5 observations or 5 million.
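One of the extra items the detail option reports is the median (the 50th percentile). As a small sketch using the same five incomes, the median is the middle value once the incomes are sorted, which is 37,500 here.

```stata
* Re-enter the incomes so this block is self-contained.
clear
input Income
35000
45000
52500
37500
20000
end

summarize Income, detail

* With the detail option, the median is stored in r(p50).
display r(p50)
```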
It’s easy to get summary statistics for more than one variable at the same time. The
general syntax in Stata is:
summarize variablename1 variablename2 variablename3
You can add as many variables to the statement as you want. Or, if you are lazy (no
comment), you can just type:
summarize
That will give summary statistics on every variable in the data set. Doing that for our
data set generates something along the lines of Figure 1.21.
FIGURE 1.21
You get a listing of every variable in your data set along with calculated summary
statistics. Notice, however, there is something funny about the name variable. Stata
reports it has no observations and does not provide any summary statistics. What is
going on?
If you think about it for a moment, Name is a text variable. It records the name of
each individual in the sample. When was the last time you tried to average names? I
thought so. Stata is being polite when it reports 0 observations.
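A few commands make the distinction visible. The sketch below is illustrative only: describe shows the storage type of Name, summarize reports 0 observations for it, and tabulate and count are commands that do make sense for text.

```stata
* Re-enter the example data so this block is self-contained.
clear
input Income str10 Name
35000 "Bruce"
45000 "Sue"
52500 "Maria"
37500 "Woody"
20000 "John"
end

* The storage type str10 marks Name as text.
describe Name

* No numeric summary is possible, so Obs is reported as 0.
summarize Name

* Counting is meaningful for text: how often each name appears,
* and how many rows have a non-empty name.
tabulate Name
count if !missing(Name)
```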
Creating Variables
Another useful ability to have in Stata is to be able to create variables from existing
variables. For example, using our current income and experience data, we might
wonder what each person is paid per year of experience. Put another way, we could
create a variable named IncPerYrExp (note, I tried to make the variable name
informative but not too long) which is defined as income divided by years of
experience.
The general syntax in Stata to create a new variable is:
generate newvariable = some_mathematical_function
Where “newvariable” is the name you give to the variable you are creating.
To create IncPerYrExp as defined above for our data set, I’d give the following
command:
generate IncPerYrExp = Income / Experience
Figure 1.22 shows what you should see in Stata after this command.
FIGURE 1.22
Not much exciting happens. But, notice that in the Variables window you have one
more variable than before: IncPerYrExp (indicated by blue arrow).
You can click on the icon to “see” it. If you do that, you will see something
along the lines of Figure 1.23.
FIGURE 1.23
Nicely, Stata did just what we asked: create a new variable, name it IncPerYrExp and
define it as income divided by experience. Perfect.
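You can also check the new variable without leaving the Results window. The sketch below re-enters the data and uses the list command to print the values; for Bruce, the arithmetic is 35,000 / 8 = 4,375.

```stata
* Re-enter the example data so this block is self-contained.
clear
input Income Experience str10 Name
35000  8 "Bruce"
45000  6 "Sue"
52500  9 "Maria"
37500 15 "Woody"
20000  1 "John"
end

generate IncPerYrExp = Income / Experience

* Print the original and new variables side by side.
list Name Income Experience IncPerYrExp
```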
The generate command is quite flexible and can handle a number of mathematical
expressions. The following are examples of what you could do (even if you wouldn’t
want to). Can you decipher what is going on in each one?
generate IncMinusExp = Income - Experience
generate IncPlusExp = Income + Experience
generate IncInThousands = Income/1000
generate Inc_Squared = Income*Income
generate Inc_Squared2 = Income^2
generate ln_Inc = ln(Income)
The last generate command is one to note. It creates a variable (ln_Inc) which is the
natural log of income. It uses a mathematical function: ln(…). Stata has
many functions and we will cover more as needed.
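As a small illustration of how such expressions combine with generate, the sketch below uses ln() together with two other built-in Stata functions, sqrt() and exp(). The new variable names are hypothetical, chosen only for this example.

```stata
* Re-enter the example data so this block is self-contained.
clear
input Income Experience
35000  8
45000  6
52500  9
37500 15
20000  1
end

* Natural log of income.
generate ln_Inc = ln(Income)

* Square root of experience (3 for the person with 9 years).
generate sqrt_Exp = sqrt(Experience)

* exp() undoes ln(), so this recovers income up to rounding.
generate Inc_back = exp(ln_Inc)

list
```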
1.4 Beyond Data Manipulation: OLS Regression
Hopefully, you are starting to feel a bit more comfortable with Stata and working
with data in Stata. There is much more to learn and do (Stata is pretty amazing!) and
we are on our way.
Sections 1.4 and 1.5 of Using Econometrics present two examples of regression
analysis. It seems entirely appropriate to use one of those to show how Stata can be
used to generate regression results.
The good news is that running a regression in Stata is pretty straightforward. The
basic syntax is:
regress dependentvariable independentvariable
The “regress” command tells Stata to take the specified variables and perform a
regression. Let’s work through the Magic Hill example on page 17 in Using
Econometrics.
The Magic Hill data can be downloaded from the Using Econometrics Student
Companion website. The name of the data set is HTWT1.dta and it has two
variables:
Y: weight (in pounds) of the ith customer
X: height (in inches above 5 feet) of the ith customer
The model proposed in Using Econometrics, Section 1.4, Equation 1.18, is:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
After loading the data into Stata, type the following command in the Command
window and hit enter.
regress Y X
Figure 1.24 indicates what you will see right before hitting enter. Figure 1.25 shows
what you should see right after hitting enter.
FIGURE 1.24
FIGURE 1.25
A lot has happened in the Results window. For now, focus on the three arrows (blue,
red, and green). The blue arrow points to the regression command. The red arrow
points to the column of variables in the regression: Y, X, and something called
“_cons”. That “something” is the model’s estimated intercept term, otherwise known
as $\hat{\beta}_0$.
The green arrow points to the “Coef.” column, which reports the estimated
coefficients. The first number in the Coef. column is 6.377093. That is the estimate
for $\beta_1$, the parameter for X. It matches the 6.38 (rounded) of Equation 1.19 in Using
Econometrics. Just below that is _cons, the estimate of $\beta_0$, the intercept. It is
103.3971, which rounds to 103.40.
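After regress runs, Stata keeps the estimates available for further calculation. The sketch below assumes HTWT1.dta is in Stata's current working directory; the height of 10 inches above 5 feet and the variable name Yhat are illustrative choices, not taken from Using Econometrics. With the estimates reported above, the fitted weight at X = 10 is 103.3971 + 6.377093 × 10, or roughly 167 pounds.

```stata
* Assumption: HTWT1.dta is in the current working directory.
use "HTWT1.dta", clear
regress Y X

* The estimated intercept and slope are stored as _b[_cons] and _b[X].
display _b[_cons]
display _b[X]

* Fitted weight for a customer 10 inches above 5 feet (illustrative value).
display _b[_cons] + _b[X]*10

* Fitted values for every customer, saved in a new variable Yhat.
predict Yhat, xb
```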