STATA

STATA is a powerful statistical software package for data management, statistical analysis, graphics, simulations, and custom programming. It is widely used in economics, sociology, political science, biomedicine, and epidemiology. Here is a brief overview of what STATA offers:
1. Data Management: Efficient handling of large datasets.
2. Statistical Analysis:
Descriptive statistics (e.g., mean, median, standard deviation).
Inferential statistics (e.g., t-tests, chi-square tests).
Regression analysis (linear, logistic, multinomial, etc.).
Panel data analysis, time-series analysis, survival analysis, and more.
3. Graphics: High-quality, customizable graphs and plots (e.g., histograms,
scatter plots, line graphs).
4. Programming:
Creation of custom functions and procedures using Stata's programming
language.
Integration with other tools such as Python, and with Mata (Stata's built-in matrix programming language).
5. Extensions: Regular updates and new features provided by StataCorp.
Example Commands:
Regression Analysis: regress wage education experience
Summary Statistics: summarize wage
Data Manipulation: generate log_wage = log(wage)
Graphing: scatter wage experience

Loading Data into STATA: You can load data into STATA using the insheet command:
insheet using "yourfile.txt", clear [code]
Alternatively, you can use the newer import delimited command, which has superseded insheet in recent versions of STATA:
import delimited "yourfile.csv", clear [code]
To calculate Pearson's chi-square test of association between two categorical variables s and d, use the tabulate command:
tabulate s d, chi2 [code]


The advantages of using do-files in STATA (a short example do-file follows this list):
Reproducibility: Do-files allow you to document your entire analysis process,
making it easy to reproduce results or share your workflow with others.
Automation: Automate repetitive tasks, saving time and reducing human
error risk.
Documentation: Serve as a detailed record of your data analysis steps, which
is useful for reviewing and understanding your analysis later.
Debugging: Identifying and correcting errors in your analysis is easier since
you can run individual lines or sections of code.
Version Control: Track changes over time by saving different versions of
your do-files, enabling you to revert to previous versions if needed.
Efficiency: Perform complex analyses quickly by running pre-written
commands instead of entering them manually each time.
Consistency: Ensure consistent application of data cleaning, transformation,
and analysis steps across different datasets or projects.
Scalability: Handle large and complex projects more efficiently by breaking
the analysis into manageable sections within the do-file.
Collaboration: Facilitate collaboration by providing a clear and organized
script that team members can understand and modify.
Customization: Customize and extend your analysis by adding or modifying
commands in the do-file.
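To make these advantages concrete, here is a minimal sketch of a do-file; the file names analysis.do and analysis.log are illustrative, and the built-in auto.dta dataset stands in for real project data.

* analysis.do -- a small, reproducible workflow
log using analysis.log, replace      // record commands and output
sysuse auto, clear                   // load the built-in example dataset
generate log_price = log(price)      // a data-management step
summarize price log_price mpg        // descriptive statistics
regress log_price mpg weight         // an analysis step
log close                            // close the log file

Running do analysis.do from the Command window (or clicking Execute in the Do-file Editor) reruns the entire analysis from start to finish.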
(2) Relational Operators (a brief usage example follows):
>   greater than
<   less than
>=  greater than or equal to
<=  less than or equal to
==  equal to
!=  not equal to
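As a brief, hypothetical illustration (the variable names wage and experience are placeholders), relational operators are most often used in if qualifiers and when generating indicator variables:

* restrict a command to observations meeting a condition
summarize wage if experience >= 10

* create a 0/1 indicator; the expression in parentheses evaluates to 1 or 0
generate high_wage = (wage > 20) if !missing(wage)

* == tests equality (a single = is used only for assignment in generate/replace)
list wage experience if high_wage == 1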

The descriptions and examples of the specified STATA commands, with their original code syntax, are:
1. display- The display command prints text or the results of expressions in
the STATA console. Example:
display 2 + 3 [code]
2. insheet- The insheet command is used to import data from a CSV or other
delimited text file. Example:
insheet using "data.csv", clear [code]
3. use- The use command loads a STATA dataset (.dta file) into memory.
Example:
use auto.dta [code]
4. codebook- The codebook command provides a detailed description of the
variables in the dataset, including type, range, unique values, and labels.
Example:
codebook price weight length [code]
5. encode- The encode command converts a string variable to a numeric
variable with value labels. Example:
encode region, gen(region_num) [code]
6. replace- The replace command modifies the values of an existing variable.
Example: Replace all values of wage less than 5 with 5:
replace wage = 5 if wage < 5 [code]
7. substr- The substr function extracts a substring from a string variable or
literal. Example: Extract the first three characters of the variable name:
generate first_three_letters = substr(name, 1, 3) [code]
8. sort- The sort command sorts the dataset by one or more variables.
Example: Sort the dataset by education:
sort education [code]
Sort the dataset by education and wage:
sort education wage [code]
9. merge- The merge command combines datasets based on one or more key
variables. Example: Merge two datasets by ID:
use auto.dta, clear
merge 1:1 make using auto2.dta [code]
10. append- The append command adds the observations from one dataset to
another. Example:
* Load the first dataset
use "auto1.dta", clear [code]
* Append the second dataset
append using "auto2.dta" [code]
* Save the combined dataset
save "combined_auto.dta", replace [code]
11. log- This command opens a log file to record commands and their output.
Syntax:
log using filename.log, replace
Example:
log using analysis.log, replace [code]
12. egen- This command is used to generate new variables based on various
functions and transformations.
Syntax:
egen newvar = function(variable)
Example:
egen mean_price = mean(price) [code]
13. notes- This command allows you to attach annotations or comments to the dataset as a whole or to individual variables.
Syntax:
notes varname: text
Example:
notes price: This variable represents the price of the car [code]
14. tabstat- This command computes summary statistics for variables.
Syntax:
tabstat varlist, statistics(options)
Example:
tabstat mpg weight, statistics(mean sd min max) [code]
15. usespss- This user-written command (installable with: ssc install usespss) is used to import datasets from SPSS (.sav) format into STATA. In recent versions of STATA, the built-in import spss command serves the same purpose.
Syntax:
usespss filename.sav, clear
Example:
usespss "data.sav", clear [code]
16. fdause- This command is used to load SAS XPORT Transport format datasets (the format specified by the U.S. FDA, typically with a .xpt extension) into STATA.
Syntax:
fdause filename.xpt, clear
Example:
fdause "data.xpt", clear [code]
17. label define- This command is used to create value labels for categorical
variables.
Syntax:
label define labelname value "label"
Example:
label define foreign_label 0 "Domestic" 1 "Foreign" [code]
18. label values- This command is used to assign value labels to variables.
Syntax:
label values variablename labelname
Example:
label values foreign foreign_label [code]
How to draw a pie diagram and normality probability plot in STATA
1. Drawing a Pie Diagram: STATA's graph pie command creates a pie chart for categorical data. Example: Suppose you have a dataset with a categorical variable that you want to visualize as a pie chart.
* Load the auto dataset
sysuse auto, clear [code]
* Use the existing categorical variable 'foreign', which indicates whether a
* car is foreign or domestic, to define the pie slices
graph pie, over(foreign) title("Distribution of Car Types") [code]
2. Drawing a Normality Probability Plot: A normality probability plot (also
known as a Q-Q plot) is used to assess whether a variable follows a normal
distribution. In STATA, you can create this plot using the qnorm command.
Example: Suppose you want to assess the normality of a variable wage.
* Load the auto dataset
sysuse auto, clear [code]
* Generate a Q-Q plot for the 'price' variable
qnorm price [code]
Explanation of the Commands:
● sysuse auto, clear: Loads the built-in auto.dta dataset and clears any
existing data in memory.
● graph pie, over(foreign) title("Distribution of Car Types"): Creates a
pie chart for the foreign variable and adds a title.
● qnorm price: Generates a Q-Q plot for the price variable to assess its
normality.
The procedure for two sample means test in STATA:
The procedure for performing a two-sample mean test (also known as an
independent samples t-test) in STATA is:
1. Ensure your dataset is loaded into STATA.
2. Check the structure of your dataset and the variables involved in the test.
3. Identify the dependent variable (continuous variable) and the independent
variable (binary/categorical variable) that indicates the group membership.
For example, wage (continuous) and gender (binary: 0 for male, 1 for
female).
4. Ensure the binary variable has exactly two groups.
5. Check Assumptions -
Normality: Check if the dependent variable is approximately normally
distributed within each group.
histogram wage, by(gender) [code]
Equal Variances: You may also want to check if the variances are equal
between the groups using a variance ratio test.
robvar wage, by(gender) [code]
6. Use the ttest command to perform the two-sample mean test. Specify the continuous variable and the grouping variable.
7. Look at the output, which will provide:
● The mean and standard deviation for each group.
● The t-statistic value.
● Degrees of freedom.
● p-value for the test.
A small p-value (typically < 0.05) indicates that there is a statistically
significant difference between the means of the two groups.
Example with Sample Data
Suppose you have a dataset called auto.dta and you want to compare the
mean price of cars based on the foreign variable (0 for domestic, 1 for
foreign).
sysuse auto, clear [Load the dataset]
describe [Inspect the data]
summarize [Check the groups]
histogram price, by(foreign) [Check assumptions,Normality]
robvar price, by(foreign) [Check assumptions,Equal variances]
ttest price, by(foreign) [Perform the t-test]
Performing a one-way analysis of variance (ANOVA) in STATA involves
comparing the means of three or more groups to determine if there are
statistically significant differences between them. Here's the procedure:
Procedure for One-Way ANOVA in STATA
1. Ensure your dataset is loaded into STATA.
use "path/to/yourdata.dta", clear [code]
2. Check the structure of your dataset and the variables involved in the
analysis.
describe [code]
summarize [code]
3. Identify the dependent variable (continuous variable) and the independent
variable (categorical variable) that indicate group membership.
For example, score (continuous) and treatment (categorical: control,
treatment1, treatment2).
4. Use the anova command to perform the one-way ANOVA. Specify the
continuous variable and the grouping variable.
anova score treatment [code]
This command will test the null hypothesis that the means of score are equal
across all levels of treatment.
5. Look at the output, which will provide:
● F-statistic value.
● Degrees of freedom between groups and within groups.
● p-value for the test.
A small p-value (typically < 0.05) indicates that there is a statistically
significant difference in at least one pair of group means.
6. Post-Hoc Tests (Optional)
If the ANOVA results are significant, you may want to conduct post-hoc tests
to determine which specific group means differ from each other. Common
post-hoc tests include Tukey's HSD, Bonferroni, or Sidak tests.
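As a hedged sketch of this optional step (using the score and treatment names from the example in this section), the oneway command offers built-in multiple-comparison adjustments, and pwcompare can be used after anova:

* Bonferroni-, Sidak-, and Scheffe-adjusted pairwise comparisons
oneway score treatment, bonferroni sidak scheffe

* Tukey-adjusted pairwise comparisons after fitting the ANOVA model
anova score treatment
pwcompare treatment, mcompare(tukey) effects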
Example with Sample Data
Suppose you have a dataset study.dta with the variables score and treatment,
and you want to compare the mean scores of students across different
treatment groups.
use study.dta, clear [Load the dataset]
describe [Inspect the data]
summarize [Check the groups]
anova score treatment [Perform the one-way ANOVA]
Interpret the results and review the output provided by STATA to determine
if there is a significant difference in mean scores across the treatment groups.
By following these steps, you can effectively perform and interpret a one-
way ANOVA in STATA to assess differences between multiple groups.
To import data into STATA, you can use the use command for STATA
datasets (.dta files) or the import delimited command for text files like CSV.
Here's how you can use both methods:
Importing STATA Dataset (cars.dta)
use "cars.dta", clear [code]
Explanation:
use: This command is used to load a dataset into STATA.
"cars.dta": This is the file path to the STATA dataset named cars.dta.
clear: This option clears any existing data in memory before loading the new
dataset.
Example: Suppose you have a dataset named cars.dta stored in the directory
C:\Data. To load this dataset into STATA, you would use:
use "C:\Data\cars.dta", clear [code]
Importing CSV File (sales.csv)
import delimited "sales.csv", clear [code]
Explanation:
import delimited: This command is used to import data from a delimited text
file (such as CSV) into STATA.
"sales.csv": This is the file path to the CSV file named sales.csv.
clear: This option clears any existing data in memory before loading the new
dataset.
Example: Suppose you have a CSV file named sales.csv stored in the
directory C:\Data. To import this CSV file into STATA, you would use:
import delimited "C:\Data\sales.csv", clear [code]
Explanation of Usage: In both cases, the use and import delimited commands
are followed by the file path to the respective dataset files.
The clear option is included to ensure that any existing data in memory is
cleared before loading the new dataset. This prevents any potential conflicts
or issues with existing data.
It's important to provide the correct file path to the dataset files. If the files
are not in the current directory, you need to specify the full path or the
relative path to the files.
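One convenient way to avoid typing full paths, shown here as a brief sketch (the folder C:\Data is only an example), is to set the working directory first with cd:

* point STATA at the folder that contains the data files
cd "C:\Data"

* relative file names are then sufficient
use "cars.dta", clear
import delimited "sales.csv", clear

* pwd displays the current working directory if you lose track of it
pwd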
The rules for writing variable names in STATA are:
1. Length: Variable names can be up to 32 characters long.
2. Characters: Variable names can consist of letters (uppercase and
lowercase), numbers, and underscores (_).
3. First Character: Variable names must start with a letter or an underscore
(_). They cannot start with a number.
4. Reserved Words: Variable names cannot be STATA reserved words, which have special meanings in STATA syntax; examples include if, in, using, _n, and _N. Other command names (such as by or merge) are legal as variable names but are best avoided because they invite confusion.
5. Case Sensitivity: STATA is case sensitive, so variable names myvar and
MyVar are treated as different variables.
6. Special Characters: Avoid using special characters such as spaces,
punctuation marks, mathematical symbols, or any other non-alphanumeric
characters in variable names. Underscores (_) are allowed and commonly
used as word separators.
7. Keywords: Avoid using common keywords or abbreviations that might be
ambiguous or easily confused with STATA commands or functions.
8. Descriptive: Choose variable names that are descriptive and indicative of
the content or purpose of the variable. This makes your code more
understandable and maintainable.
Examples of valid Variable Names:
age
income
education_level
var1
var_2
Var3
Examples of Invalid Variable Names:
3var (starts with a number)
my var (contains a space)
if (reserved word)
income$ (contains a special character)
using (reserved word)
(Note: My_Var is itself a valid name; because of case sensitivity it is simply a different variable from MyVar.)
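A short sketch of how STATA reacts to these rules; the variable names are arbitrary and the error message is paraphrased rather than quoted verbatim.

* valid names are accepted
generate education_level = 12
generate var_2 = education_level^2

* an invalid name (starting with a digit) is rejected with an "invalid name" error:
* generate 3var = 1

* names differing only in case refer to different variables
generate myvar = 1
generate MyVar = 2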
Difference between i) correlate and pwcorr:

Basic Function: correlate computes Pearson correlation coefficients (and, with the covariance option, the covariance matrix); pwcorr computes pairwise Pearson correlation coefficients.
Syntax: correlate is simple and straightforward; pwcorr is more detailed, with options (sig, obs, star()) for customizing the output.
Default Output: correlate displays a correlation matrix; pwcorr displays a pairwise correlation matrix, optionally with p-values and significance stars.
Significance Levels: correlate does not provide p-values or significance levels; pwcorr provides p-values (with the sig option) and can flag significant correlations.
Handling Missing Data: correlate uses listwise (casewise) deletion; pwcorr uses pairwise deletion, so each correlation uses all observations available for that pair of variables.
Customizability: correlate has limited options (mainly the covariance matrix and means); pwcorr is more flexible about displaying significance levels and observation counts.
Example: correlate mpg weight price versus pwcorr mpg weight price, sig
ii) infile and insheet:

Basic Function: infile reads raw data from a text (ASCII) file; insheet reads data from spreadsheet-style text files such as CSV or tab-delimited files (e.g., files exported from Excel).
File Format: infile is suitable for free- or fixed-format text files; insheet is suitable for comma- and tab-delimited text files.
Syntax: infile requires specifying the variable names (and, for fixed format, a dictionary describing the layout); insheet is generally simpler and auto-detects variable names and formats from the file.
Missing Data: infile requires missing values to be handled explicitly in the raw file; insheet handles blank cells in a spreadsheet-friendly manner.
Use Case: infile is best for importing well-structured, large raw datasets; insheet is best for importing data from spreadsheets and other easy-to-read formats.
Flexibility: infile is flexible with custom formats but requires more detailed specification; insheet is simpler for common formats but less flexible with custom data structures.
Example Command: infile var1 var2 var3 using data.txt versus insheet using yourfile.txt, clear

iii) describe and summarize:

Basic Function: describe describes the dataset and its variables; summarize provides summary statistics for the dataset or for specified variables.
Output Details: describe reports variable types, display formats, value labels, and storage types; summarize reports statistics such as the mean, standard deviation, minimum, and maximum.
Variable Information: describe gives detailed metadata about the variables; summarize gives a statistical summary of the variable values.
Syntax: describe is simple and focused on the data structure; summarize can be extended with options (notably detail) for additional statistical measures.
Advanced Options: describe has limited options, primarily for variable description; summarize has options for detailed statistics, including percentiles.
Use Case: describe is best for understanding the structure and metadata of the dataset; summarize is best for obtaining quick summary statistics of the dataset.
Example Command: describe var1 var2 versus summarize var1 var2, detail

To obtain the covariance matrix of variables in STATA, you can use the correlate command with the covariance option (which may be abbreviated cov). With this option, correlate displays the covariance matrix of the specified variables instead of the correlation matrix.

Example: Assume you have three variables: age, weight, and height.

correlate age weight height, covariance [code]

This will output the covariance matrix for the variables age, weight, and height.

Outputs from pcorr age weight height Command

The pcorr command in STATA computes the partial correlation of the first variable listed with each of the other variables, controlling for the remaining variables in the model.

Example Command:

pcorr age weight height [code]

Outputs Obtained:

1. Partial Correlation Coefficients:
o The command reports the partial correlation of the first variable listed (age) with each of the remaining variables (weight and height), controlling for the other variables in the list; recent versions of STATA also report the corresponding semipartial correlations.
o For example, the partial correlation between age and weight, controlling for height.
2. Significance Levels (p-values):
o It provides the significance levels (p-values) for the partial
correlation coefficients.
o These p-values indicate whether the partial correlations are
statistically significant.
3. Number of Observations:
o The number of observations used in the calculation of partial
correlations is also provided.

Detailed Example:

Suppose you have a dataset with age, weight, and height. Running the pcorr command will give you output of roughly this form (partial correlations of age with each other variable, controlling for the remaining one):

Variable | Partial corr.   P-value
---------+------------------------
  weight |
  height |

In this output:

● Partial corr.: the partial correlation of age with each listed variable, controlling for the other variables.
● P-value: the significance level for each partial correlation coefficient.

Summary:

● Getting Covariance Matrix: Use correlate var1 var2 var3, covariance to obtain the covariance matrix of the specified variables.
● Outputs from pcorr Command: Provides the partial correlation of the first listed variable with each of the other variables, the associated p-values, and the number of observations used.

Statistics with summarize Command: When the detail option is used with the summarize command in STATA, additional statistics are provided. These include the mean, standard deviation, variance, skewness, kurtosis, the minimum and maximum, the sum of weights, the four smallest and four largest values, and the percentiles (1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th).
Purpose of mean Command: In STATA, the mean command estimates the mean of a variable together with its standard error and confidence interval. To get the mean, median, standard deviation, coefficient of variation, skewness, and kurtosis of a variable d, you can use:
summarize d, detail [code]
To obtain the mean, median, standard deviation, coefficient of variation,
skewness, and kurtosis of a variable y in STATA, you can use the summarize
command with the detail option, along with some additional calculations for
the coefficient of variation. Here are the steps:
** Calculate detailed statistics (mean, median, SD, skewness, kurtosis)
summarize y, detail [code]
** Calculate the coefficient of variation from the stored results
display "Coefficient of Variation: " r(sd) / r(mean) [code]
What is the difference between STATA commands: regress yield i.block
i.treat and anova yield block treat? What will happen if you use only
regress command following anova command above?
Ans: The regress and anova commands in STATA perform different types of
analyses, even though they can be used to analyze the same types of data.

1. regress yield i.block i.treat:


o Purpose: Performs ordinary least squares (OLS) regression.
o Usage: regress yield i.block i.treat
o Interpretation: This command fits a linear regression model
where yield is the dependent variable and block and treat are
independent variables. The i. prefix indicates that block and treat
are categorical variables (factors).
o Output: Coefficients, standard errors, t-values, p-values, R-
squared, etc.
2. anova yield block treat:
o Purpose: Performs analysis of variance (ANOVA).
o Usage: anova yield block treat
o Interpretation: This command fits an ANOVA model where
yield is the dependent variable and block and treat are factors.
ANOVA is used to partition the variance in yield into
components attributable to block, treat, and residual error.
o Output: ANOVA table with sums of squares, degrees of
freedom, mean squares, F-statistics, and p-values.
Difference:

● Focus: regress focuses on estimating the effects of predictors on the dependent variable and providing detailed regression diagnostics, while anova focuses on partitioning variance and testing hypotheses about factor effects.
● Output: regress provides coefficients and related statistics for each predictor, whereas anova provides sums of squares, mean squares, F-statistics, and p-values for the factors.

What happens if you use only regress following anova?:

● Running regress after anova will not reproduce the ANOVA table. Instead, STATA displays the model as a standard linear regression: coefficients, standard errors, t-values, p-values, and goodness-of-fit measures such as R-squared. (Because anova fits the model by least squares internally, typing regress with no arguments immediately after anova simply redisplays that same fitted model in regression form.) The regression output is still useful for understanding the effects of the predictors, but it does not give the partitioning of variance that the ANOVA table provides.

In practice, both commands can be used to analyze the same data, but they
emphasize different aspects of the model:

● Use anova when you are primarily interested in testing the significance
of categorical factors and partitioning the variance.
● Use regress when you are interested in estimating the effects of
predictors and examining detailed regression diagnostics.

Here is an example of running both commands:

**ANOVA
anova yield block treat [code]

**Linear regression
regress yield i.block i.treat [code]

The first command provides the ANOVA table, while the second provides
the regression coefficients and related statistics.
The variables angina, sex, csmoke, htn and dm are all binary, where angina depends on sex, csmoke, htn and dm. Write down the differences between the logistic angina sex csmoke htn dm and logistic angina i.sex i.csmoke i.htn i.dm commands in STATA. How is the coding of the angina variable important in the above commands?

Ans: Here are the differences between the logistic angina sex csmoke htn dm
and logistic angina i.sex i.csmoke i.htn i.dm commands in STATA:

Command: logistic angina sex csmoke htn dm

1. Treatment of Variables:
o Without the i. prefix, STATA treats the predictors (sex, csmoke, htn, dm) as continuous by default.
o For predictors coded 0/1 the estimates happen to be numerically identical either way, but the model carries no information that these variables are categorical.
2. Model Interpretation:
o The reported odds ratios (or, with logit, coefficients on the log-odds scale) correspond to a one-unit increase in each predictor.
o If a binary predictor is coded with values other than 0 and 1 (for example 1/2), treating it as continuous invites misinterpretation.
3. Coding Assumptions:
o The model does not declare the predictor variables as categorical, so factor-variable postestimation tools (such as margins and contrasts) cannot treat them as such, and the specification is easier to misread.

Command: logistic angina i.sex i.csmoke i.htn i.dm

1. Treatment of Variables:
o The i. prefix tells STATA to treat these variables as indicator
(dummy) variables.
o This is appropriate for binary variables, as it correctly specifies
them as categorical.
2. Model Interpretation:
o Coefficients represent the change in the log-odds of angina when
each binary predictor changes from 0 to 1.
o This provides a more meaningful and accurate interpretation for
binary variables.
3. Coding Assumptions:
o The model correctly specifies that the predictor variables are
categorical (binary), ensuring accurate interpretation of the
results.

Importance of Coding of angina Variable:

The coding of the angina variable is critical because it represents the outcome
variable (dependent variable) in the logistic regression model. In binary
logistic regression, the coding convention typically follows:

● 0: Absence of the outcome (e.g., absence of angina)


● 1: Presence of the outcome (e.g., presence of angina)

The correct coding ensures that the logistic regression model interprets the
effects of predictors (sex, csmoke, htn, dm) correctly concerning the presence
or absence of the outcome (angina).

Difference in Interpretation based on Coding:

● If angina is coded as 0 for absence and 1 for presence:


o A positive coefficient for a predictor variable (e.g., sex) indicates that
the odds of angina presence increase with that predictor.
o A negative coefficient indicates that the odds of angina presence
decrease with that predictor.
● If angina is coded differently (e.g., 1 for absence and 2 for presence):
o The model is no longer estimated as intended. STATA's logit and logistic commands treat 0 as a negative outcome and any other nonmissing value as a positive outcome, so with 1/2 coding every observation would be counted as having angina. The variable should be recoded to 0/1 before fitting the model.

Impact on Model Interpretation:

Using the correct coding ensures that the interpretation of coefficients in logistic regression is meaningful and aligned with the research question. Incorrect coding may lead to misinterpretation of results and incorrect conclusions about the relationship between the predictor variables and the outcome.
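A minimal sketch tying these points together, assuming a hypothetical dataset heart.dta in which angina is initially coded 1 = absent, 2 = present:

* load the hypothetical dataset
use "heart.dta", clear

* recode the outcome to the 0/1 convention expected by logistic/logit
generate angina01 = (angina == 2) if !missing(angina)
label define yesno 0 "Absent" 1 "Present"
label values angina01 yesno

* fit the model with the predictors declared as categorical
* (logistic reports odds ratios by default)
logistic angina01 i.sex i.csmoke i.htn i.dm

* logit reports the same model on the log-odds (coefficient) scale
logit angina01 i.sex i.csmoke i.htn i.dm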
***2018***

6(a) Suppose you have a data set with three variables WAZ=weight-for-age Z-score of
children, SES=Socio-Economic Status, and DIVISION=division. Simply write the
STATA command for getting the following results:

(i) Find five number summary of WAZ,

Syntax: summarize WAZ, detail

(ii) Recode WAZ as 1 for WAZ < -2.00 SD and 0 for WAZ ≥ -2.00 SD into STUNT
variable,

Syntax: generate STUNT = (WAZ < -2) if !missing(WAZ)

(iii) Summary statistics of WAZ by DIVISION,

Syntax: bysort DIVISION: summarize WAZ

(iv) Boxplot of WAZ by SES,

Syntax: graph box WAZ, over(SES)

(v) QQ-plot of WAZ.

Syntax: qnorm WAZ

6(b) Suppose you have another variable called SEX for the sex of the children. How do
you test that mean WAZ is significantly higher for males than females? Also, how do you
test equality of mean WAZ by SES in STATA? Just write down the STATA code.

To test if mean WAZ is significantly higher for males than females:

ttest WAZ, by(SEX) [Syntax]
(The output also reports the one-sided alternatives, Ha: diff > 0 and Ha: diff < 0, which address the directional hypothesis that mean WAZ is higher for males.)

To test equality of mean WAZ by SES:

oneway WAZ SES [Syntax]

Or you can use ANOVA:

anova WAZ SES [Syntax]


***2019*** 6. Free format ASCII Data file
V1 V2 V3
1 3 5
2 16 3
5 12 2

i. Read the data into STATA using insheet

Syntax: insheet v1 v2 v3 using "your_file_path_here.csv", clear
(insheet expects comma- or tab-delimited data; for a space-delimited free-format file, infile v1 v2 v3 using "your_file_path_here.txt", clear is the more natural choice.)

ii. Add a new variable sex with values 1 and 2

Syntax:

gen sex = .
replace sex = 1 if _n <= 2
replace sex = 2 if _n == 3

(With only three observations, one possible assignment is to code the first two subjects as 1 and the third as 2.)

iii. Define value labels for sex (1=male, 2=female):

Syntax: label define sex_lbl 1 "male" 2 "female"

label values sex sex_lbl

iv. Generate id, a subject index (from 1 to 3)

Syntax: gen id = _n

v. Rename the variables v1 to v3 to time1 to time3

Syntax: rename (v1 v2 v3) (time1 time2 time3)

vi. Convert the dataset to a long shape

Syntax: reshape long time, i(id) j(occasion)


vii. Generate a variable d that is equal to the squared difference between the variable time
at each occasion and the average of time for each subject

Syntax: bysort id: egen time_mean = mean(time)

gen d = (time - time_mean)^2

viii. Drop the observations corresponding to the third occasion for id=2

Syntax: drop if id == 2 & occasion == 3

ix. Compute the total time on the first occasion for each sex

Syntax: collapse (sum) time if occasion == 1, by(sex)
