Econometrics Analysis of Experimental Data
Prof. Dr. Sabrina Jeworrek
Chair for Organizational Behavior and Human Resource Management
Parametric and Non-parametric Treatment Testing
Part I
Basic Principles of Treatment Testing
Parametric or non-parametric?
• A treatment test always has a null hypothesis and an alternative
hypothesis. Usually, the null hypothesis is that there is no effect.
* If the alternative hypothesis specifies the direction of the effect, a one-tailed test can be conducted and the p-value is half that of the two-tailed test (usually, not always!).
• But which test to choose? Statistical procedures may be grouped
into two major classifications: parametric and nonparametric.
* The assumptions of parametric statistics (i.e. normality and equal variances) are more specific and stringent than those of nonparametric ones.
* But: The more rigorous the assumptions, the more trustworthy the conclusions, because the test exploits more of the richness of the data.
• Example: Two independent groups reveal similar median values but
the mean values are different.
* Taking the variance into account, as is the case for parametric tests, the test could yield a statistically significant result whereas a nonparametric test would not.
Preferential use of parametric tests whenever the data meet the
assumptions.
Parametric or non-parametric? An overview
Nonparametric Statistics                                   | Parametric Statistics
Continuous distribution                                    | Assumptions of normality and equal variances
Uses median as location parameter                          | Uses mean, variance, and standard deviation as location parameters
Random sample                                              | Random sample
Independence of responses                                  | Independence of responses
Uses nominal, ordinal, interval, and sometimes ratio data  | Uses interval and ratio data
Large and small data sets                                  | Large data sets (minimum of 30 or more cases)
Weaker statistical power than parametric statistics        | More powerful than nonparametric tests for rejecting the null hypothesis
Source: Kraska-Miller (2014), p. 35.
The importance of nonparametric methods
• Outcomes of interest are often dichotomous or on an ordinal scale.
* Example: the probability of passing an exam.
* Distributional assumptions of parametric tests can only hold if the variables of interest are measured on a cardinal scale.
• Even if the data is cardinal in nature, researchers might transform it into ordinal data because of their research question, e.g. splitting income into high-, medium-, and low-income groups.
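A minimal Stata sketch of such a transformation (the variable income is hypothetical, not part of the lecture's data set):

* Hedged sketch: split a (hypothetical) cardinal income variable into
* three equally sized ordinal groups (0 = low, 1 = medium, 2 = high)
egen income_group = cut(income), group(3)
tabulate income_group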
Using Parametric Statistics?
Even if data is cardinal, assumptions may not be met. But:
1. Non-normal distributions:
* Sufficiently large sample sizes make it possible to appeal to the central limit theorem (CLT): the standardized sample mean approximately follows a normal distribution even when the sample is drawn from a non-normal distribution.
* If sample sizes are low, bootstrapping helps to ensure that inferences made from parametric tests are valid regardless of the distribution.
2. Heterogeneous variances:
Some parametric tests are still robust. Additionally, for some tests there is a test statistic for unequal variances alongside the one for equal variances.
Let's get started! The description of the experiment.
Data is obtained from the following game:
• There are two different types of players, a seller and a buyer.
• The seller is given 60 units which she can invest (binary decision!) to
generate a greater amount (i.e. 100 units).
* If the seller does not invest: she keeps the 60 units, which are paid out at the end of the experiment.
• If the seller has invested, the buyer proposes how to split the 100
units.
• The seller chooses whether to accept this offer. In case of rejection,
both parties receive zero.
Combination of a binary trust game and an ultimatum game.
The treatments
The game is played with three different treatment groups:
T1: Control group, no communication except the actions
themselves.
T2: Buyer can send a message to the seller before the seller
makes the investment decision.
T3: Seller can send a message to the buyer along with the
investment decision.
Exemplary research question: Does the amount offered by the buyer
increase due to the seller's message?
You can do the analysis on your own. Use the following data set:
"IndSamples MeanDifferences.dta". This part of the lecture is based on
Moffatt (2016), chapter 3.
Testing for differences in means between two independent samples
1. Testing for normality as first step (in case of cardinal data)
2. The parametric t-test
3. The nonparametric Wilcoxon rank sum test
4. Mood's median test (also nonparametric)
1.1. Testing for normality: Visual inspection
Getting a first impression using a histogram:
hist depvar, disc freq normal

(hist depvar is the minimum for executing the command; the normal option superimposes a normal density with the same mean and standard deviation as the data on the histogram.)

[Histogram: frequency of the buyer's offer to the seller, offers ranging from 0 to 100, with superimposed normal density]
depvar = dependent variable = outcome of interest
1.2. Testing for normality: Formal tests
The skewness-kurtosis test
sktest depvar

Skewness/Kurtosis tests for Normality
                                                        joint
    Variable   Obs   Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)   Prob>chi2
       offer    51         0.0021         0.4595          8.60      0.0136

→ Rejection of symmetry (Pr(Skewness) = 0.0021), but normal kurtosis (Pr(Kurtosis) = 0.4595).
Separate estimation by treatment is important!

. sktest offer if treatment==1

Skewness/Kurtosis tests for Normality
                                                        joint
    Variable   Obs   Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)   Prob>chi2
       offer    14         0.5120         0.0088          6.52      0.0383

. sktest offer if treatment==2

Skewness/Kurtosis tests for Normality
                                                        joint
    Variable   Obs   Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)   Prob>chi2
       offer    16         0.0003         0.0012         16.65      0.0002
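A minimal sketch, assuming the treatment variable takes the values 1 to 3, to run the test for all groups in one go:

* Hedged sketch: loop over the three treatment groups
forvalues t = 1/3 {
    display _newline "sktest for treatment `t':"
    sktest offer if treatment == `t'
}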
1.2. Testing for normality: Formal tests II
The Shapiro-Wilk test
swilk depvar

Shapiro-Wilk W test for normal data
    Variable   Obs        W        V       z   Prob>z
       offer    51  0.87616    5.916   3.795  0.00007
Conclusion:
Normality is strongly rejected: nonparametric tests should be used,
especially given that sample sizes are low. The validity of parametric tests
has to be doubted.
2.1. The parametric t-test
• Used for testing whether two samples (and, hence, their means)
belong to the same population.
• How does the test work?
1. Calculating the t-test statistic:
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$
with sample means $\bar{x}_1, \bar{x}_2$, sample sizes $n_1, n_2$, and the pooled standard deviation $s_p$ as a weighted average of the two individual sample standard deviations.
2. Calculating the degrees of freedom: $df = n_1 + n_2 - 2$
3. Comparing the test statistic with the $t_{df}$ distribution: if $|t|$ is larger than the critical value, the difference in means is statistically significant at the chosen significance level.
2.2. T-test in Stata

ttest depvar if treatment==1 | treatment==3, by(treatment)

(The if condition selects the two groups you want to compare; an alternative formulation: if treatment!=2)
Two-sample t test with equal variances
Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
1 14 48.57143 8.619371 32.25073 29.95041 67.19245
3 21 63.33333 4.230464 19.38642 54.50874 72.15793
combined 35 57.42857 4.383753 25.93463 48.51971 66.33743
diff -14.7619 8.711774 -32.48614 2.962333
diff = mean(1) - mean(3) t = -1.6945
Ho: diff = 0 degrees of freedom = 33
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0498 Pr(|T| > |t|) = 0.0996 Pr(T > t) = 0.9502
• In case of a p-value smaller than 0.05, you reject the null hypothesis that there is
no difference between the two groups.
• Here: Using the two-sided(!) test, there is only "mild/suggestive evidence" for a
treatment effect.
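As a quick plausibility check (a sketch using nothing but the group statistics reported above), the t formula from the previous slide reproduces Stata's test statistic:

* Hedged check: t statistic computed by hand from the reported group statistics
scalar sp = sqrt((13*32.25073^2 + 20*19.38642^2)/33)      // pooled std. dev.
display (48.57143 - 63.33333) / (sp * sqrt(1/14 + 1/21))  // ≈ -1.6945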
2.3. Taking unequal variances into account
sdtest depvar if treatment==1 | treatment==3, by(treatment)
Ha: ratio < 1 Ha: ratio != 1 Ha: ratio > 1
Pr(F < f) = 0.9801 2*Pr(F > f) = 0.0398 Pr(F > f) = 0.0199
The variance-ratio test suggests that variances are not equal
ttest depvar if treatment==1 | treatment==3, by(treatment) unequal
Two-sample t test with unequal variances
Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
1 14 48.57143 8.619371 32.25073 29.95041 67.19245
3 21 63.33333 4.230464 19.38642 54.50874 72.15793
combined 35 57.42857 4.383753 25.93463 48.51971 66.33743
diff -14.7619 9.601583 -34.83782 5.314009
diff = mean(1) - mean(3) t = -1.5374
Ho: diff = 0 Satterthwaite's degrees of freedom = 19.29
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0702 Pr(|T| > |t|) = 0.1404 Pr(T > t) = 0.9298
2.4. The bootstrap technique
• Uses the possibly rich cardinal information in the data even in case of non-normality.
• Procedure:
1. Obtain the test statistic t from the parametric test (as usual).
2. Generate a "healthy" number of bootstrap samples: samples of the same size as the original sample, drawn from the original sample with replacement.
3. For each bootstrap sample b, compute the test statistic $t^*_b$.
4. Compute the standard deviation $\hat{\sigma}_{t^*}$ of the bootstrap test statistics.
5. Obtain the new test statistic $t / \hat{\sigma}_{t^*}$, which is compared with the standard normal distribution.
2.4. The bootstrap technique II
• Depending on the chosen significance level (1%, 5%, 10%), the
number of bootstrap samples should be chosen (99, 999, 9999).
bootstrap t=r(t), rep(999) nodrop: ttest depvar if treatment==1 |
treatment==3, by(treatment)
Bootstrap results Number of obs = 103
Replications = 999
command: ttest offer if treatment==1 | treatment==3, by(treatment)
t: r(t)
Observed Bootstrap Normal-based
Coef. Std. Err. z P>|z| [95% Conf. Interval]
t -1.694477 1.150798 -1.47 0.141 -3.949999 .5610444
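The reported z statistic is simply the observed t divided by the bootstrap standard error; a one-line sketch to verify:

* Hedged check: z = observed t / bootstrap std. err.
display -1.694477 / 1.150798   // ≈ -1.47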
3.1. The nonparametric Wilcoxon rank sum test
• Also called Mann-Whitney U test
• Determining the test statistic: the "sum of ranks" $R_1$ of the first sample (all observations of both samples are ranked jointly).
• The test statistic is approximately normally distributed:
$$ z = \frac{R_1 - n_1(n_1 + n_2 + 1)/2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}} $$
• An example for the sum of ranks:

  offer   20   30   50    50   55   60   65   70   80
  treat    1    1    2     1    2    2    1    2    2
  rank     1    2   3.5   3.5   5    6    7    8    9

  R1 = 1 + 2 + 3.5 + 7 = 13.5
  R2 = 3.5 + 5 + 6 + 8 + 9 = 31.5
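A minimal sketch to reproduce the toy example in Stata (the nine observations from above, entered by hand):

* Hedged sketch: rank-sum test on the nine-observation example above
clear
input offer treat
20 1
30 1
50 2
50 1
55 2
60 2
65 1
70 2
80 2
end
ranksum offer, by(treat)   // reports rank sum 13.5 for group 1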
3.2. Rank sum test in Stata
ranksum depvar if treatment==1 | treatment==3, by(treatment)
Two-sample Wilcoxon rank-sum (Mann-Whitney) test
treatment obs rank sum expected
1 14 220.5 252
3 21 409.5 378
combined 35 630 630
unadjusted variance 882.00
adjustment for ties -26.56
adjusted variance 855.44
Ho: offer(treatm~t==1) = offer(treatm~t==3)
z = -1.077
Prob > |z| = 0.2815
Given that nonparametric tests are more suitable for this specific data, we find no
evidence for a statistically significant difference between the treatments.
Means have to be displayed separately: bysort treatment: sum depvar
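The z statistic can be reproduced from the reported rank sum, its expected value, and the adjusted variance (a quick sketch):

* Hedged check: z = (rank sum - expected) / sqrt(adjusted variance)
display (220.5 - 252) / sqrt(855.44)   // ≈ -1.077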
4. Mood's Median Test
• Compares the medians of two independent groups.
• Similar to the rank sum test, no distributional assumptions.
• The rank sum test, however, is more powerful since it uses the rank of
each observation instead of only whether a score lies above or below the
median of the distribution.
• Command: median. For further information: help median.
• To conclude: The amount offered by the buyer does not increase
due to the seller's message. Obtained p-values:

  t-test     Mann-Whitney     bootstrapped t-test
  0.0996     0.2815           0.141
Summary: Using t-test or rank sum test?
• We want to compare the means of two samples. If not, other tests
have to be used.
• In case of ordinal data, you have no choice: only the rank sum test is
appropriate!
• In case of cardinal data, check for normality first.
* If the data is normally distributed, check for equal variances next.
* If the data is not normally distributed, check the sample sizes:
if the sample size is low, you can use the bootstrap technique or
switch to the rank sum test.
How to present results?
Data analysis is not enough…
• You can simply report the results in the main text of your paper
(means for each group, the corresponding p-values, and which test
was used), but:
* To help the reader find the most important results, graphical
illustrations are extremely useful.
• Simple bar charts would do the job, but experimental economists
would like to see confidence intervals in order to judge the variation
of outcomes.
• cibar depvar, over(treatment)

[Bar chart: mean of offer by treatment, with confidence intervals; y-axis from 30 to 80]

Some graphical adjustments are still necessary; use the graph editor. You
can also add text, e.g. p-values…
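A note in case the command is not found: cibar is a user-written command; assuming it is available from SSC (which it should be), it can be installed and used as follows:

* Hedged sketch: install the user-written cibar command (assumed to be on SSC)
ssc install cibar
cibar offer, over(treatment)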
… and what about economic significance?
• Comparing means already gives you an idea about the economic
importance, but it is hardly comparable across different studies.
• Cohen's d (see introductory lecture) as one common measure.
* Hedges' g is very similar but intended for sample sizes < 20.
• esize twosample depvar if treatment!=2, by(treatment) coh hed

Effect size based on mean comparison
Obs per group:
  treatment==1 = 14
  treatment==3 = 21

Effect size      Estimate    [95% conf. interval]
Cohen's d       -.5846503   -1.271109    .1102795
Hedges's g      -.5712442   -1.241962    .1077508

→ Comparably large effect size, but statistically insignificant. So what is your conclusion?
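Cohen's d is simply the mean difference divided by the pooled standard deviation; a quick sketch using the group statistics from the t-test output reproduces the estimate:

* Hedged check: Cohen's d = (m1 - m2) / pooled std. dev.
display (48.57143 - 63.33333) / sqrt((13*32.25073^2 + 20*19.38642^2)/33)   // ≈ -0.585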
Multiple hypothesis testing (MHT)
(Hard to find in older publications but is becoming increasingly important…)
The Problem with Multiple Hypotheses
• False discoveries drive resource allocation and future streams of
thought; private and social costs might therefore be quite high.
• Multiple hypothesis testing (MHT) as one key reason for false
positives. Testing…
- several dependent variables that might be affected by the
treatment
- for heterogeneity
- multiple treatment groups
• If all p-values are mutually independent, the probability of at least
one true null hypothesis being rejected would equal
$$ 1 - (1 - \alpha)^N $$
with N as the number of tested hypotheses and α as the significance level.
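For instance, with α = 0.05 and N = 10 independent tests, the probability of at least one false rejection is already about 40% (a one-line sketch):

* Hedged check: family-wise error rate for 10 independent tests at alpha = 0.05
display 1 - 0.95^10   // ≈ 0.401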
Classical Correction Methods
• Bonferroni correction: Test each individual hypothesis at a
significance level of α/N.
⇒ Extremely conservative, especially if there is a large number of
tests, and it reduces the statistical power.
• Bonferroni-Holm correction: Stepwise algorithm.
Example:
Testing H1 to H4 at α = 0.05 with p1=0.01, p2=0.04, p3=0.03 and p4=0.005.
1. The smallest p-value (0.005) is compared to α/N, which is 0.0125.
Hence, the corresponding null hypothesis can be rejected.
2. The second smallest p-value (0.01) is compared with α/(N-1) ≈ 0.0167. Since
0.01 < 0.0167, the null hypothesis is again rejected.
3. Continue as above for the remaining hypotheses and stop at the first
non-rejection (here: 0.03 > α/(N-2) = 0.025, so neither this nor any larger
p-value leads to a rejection).
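A minimal Stata sketch of these comparisons (p-values sorted in ascending order, thresholds computed as above):

* Hedged sketch: Holm's step-down comparisons for the example above
local alpha = 0.05
display 0.005 < `alpha'/4   // 1 -> reject (threshold 0.0125)
display 0.010 < `alpha'/3   // 1 -> reject (threshold ~0.0167)
display 0.030 < `alpha'/2   // 0 -> stop   (threshold 0.0250)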
MHT in Experimental Economics
• New method that incorporates information about the joint dependence
structure of the test statistics when determining which null hypotheses
to reject. Developed by List, Shaikh, and Xu (2019).
• Example: mhtexp x y z, treatment(treatment) subgroup(gender)
outcome subgroup treatme~1 treatme~2 diff_in~s Remark3_1 Thm3_1 Remark3_7 Bonf Holm
r1 1 1 0 1 26.15011 .0336667 .0336667 .0336667 .0673333 .0336667
r2 2 1 0 1 .6393227 .0046667 .0086667 .0086667 .0093333 .0093333
outcome subgroup treatme~1 treatme~2 diff_in~s Remark3_1 Thm3_1 Remark3_7 Bonf Holm
r1 1 1 0 1 26.15011 .0336667 .065 .065 .101 .0673333
r2 2 1 0 1 .6393227 .0046667 .013 .013 .014 .014
r3 3 1 0 1 .021702 .8763333 .8763333 .8763333 1 .8763333
outcome subgroup treatme~1 treatme~2 diff_in~s Remark3_1 Thm3_1 Remark3_7 Bonf Holm
r1 1 1 0 1 41.16463 .0136667 .0763333 .0763333 .082 .082
r2 1 2 0 1 4.027027 .813 .9926667 .9926667 1 1
r3 2 1 0 1 .6791667 .028 .1336667 .1336667 .168 .14
r4 2 2 0 1 .5527463 .101 .3476667 .3476667 .606 .404
r5 3 1 0 1 .0271739 .8946667 .9893333 .9893333 1 1
r6 3 2 0 1 .0243243 .9043333 .9043333 .9043333 1 .9043333
• Attention: The subgroup variable should not be coded as 0 vs. 1; use 1 vs. 2 instead!
The treatment variable, however, has to contain values of 0.
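mhtexp is a user-written command as well; assuming it is available from SSC (which it should be), it can be installed with:

* Hedged sketch: install the user-written mhtexp command (assumed to be on SSC)
ssc install mhtexp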