Hypothesis tests and
z-scores
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
A/B testing
In 2013, Electronic Arts (EA) released
SimCity 5
They wanted to increase pre-orders of the
game
They used A/B testing to test different
advertising scenarios
This involves splitting users into control and
treatment groups
1 Image credit: "Electronic Arts" by majaX1 CC BY-NC-SA 2.0
Retail webpage A/B test
Control: webpage with ad | Treatment: webpage without ad (screenshots of the two variants)
A/B test results
The treatment group (no ad) got 43.4% more purchases than the control group (with ad)
Intuition that "showing an ad would increase sales" was false
Was this result statistically significant or just chance?
Need EA's data to determine this
Techniques from Sampling in Python + this course to do so
Stack Overflow Developer Survey 2020
import pandas as pd
print(stack_overflow)
respondent age_1st_code ... age hobbyist
0 36.0 30.0 ... 34.0 Yes
1 47.0 10.0 ... 53.0 Yes
2 69.0 12.0 ... 25.0 Yes
3 125.0 30.0 ... 41.0 Yes
4 147.0 15.0 ... 28.0 No
... ... ... ... ... ...
2259 62867.0 13.0 ... 33.0 Yes
2260 62882.0 13.0 ... 28.0 Yes
[2261 rows x 8 columns]
Hypothesizing about the mean
A hypothesis:
The mean annual compensation of the population of data scientists is $110,000
The point estimate (sample statistic):
mean_comp_samp = stack_overflow['converted_comp'].mean()
119574.71738168952
Generating a bootstrap distribution
import numpy as np

# Step 3. Repeat steps 1 & 2 many times, appending to a list
so_boot_distn = []
for i in range(5000):
    so_boot_distn.append(
        # Step 2. Calculate point estimate
        np.mean(
            # Step 1. Resample
            stack_overflow.sample(frac=1, replace=True)['converted_comp']
        )
    )
1 Bootstrap distributions are taught in Chapter 4 of Sampling in Python
Visualizing the bootstrap distribution
import matplotlib.pyplot as plt
plt.hist(so_boot_distn, bins=50)
plt.show()
Standard error
std_error = np.std(so_boot_distn, ddof=1)
5607.997577378606
z-scores
standardized value = (value − mean) / standard deviation

z = (sample stat − hypothesized parameter value) / standard error
z = (sample stat − hypothesized parameter value) / standard error
stack_overflow['converted_comp'].mean()
119574.71738168952
mean_comp_hyp = 110000
std_error
5607.997577378606
z_score = (mean_comp_samp - mean_comp_hyp) / std_error
1.7073326529796957
Testing the hypothesis
Is 1.707 a high or low number?
This is the goal of the course!
Hypothesis testing use case:
Determine whether sample statistics are close to or far away from expected (or
"hypothesized" values)
Standard normal (z) distribution
Standard normal distribution: normal distribution with mean = 0 + standard deviation = 1
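In `scipy.stats`, `norm` defaults to exactly this standard normal (`loc=0`, `scale=1`). A minimal sketch confirming two of its well-known properties:

```python
from scipy.stats import norm

# norm defaults to the standard normal distribution:
# loc (mean) = 0, scale (standard deviation) = 1
print(norm.cdf(0))                 # 0.5 -- half the area lies below the mean
print(norm.cdf(1) - norm.cdf(-1))  # ~0.68 -- within one std. dev. of the mean
```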
Let's practice!
p-values
Criminal trials
Two possible true states:
1. Defendant committed the crime
2. Defendant did not commit the crime
Two possible verdicts:
1. Guilty
2. Not guilty
Initially the defendant is assumed to be not guilty
Prosecution must present evidence "beyond reasonable doubt" for a guilty verdict
Age of first programming experience
age_first_code_cut classifies when a Stack Overflow user first started programming
"adult" means they started at 14 or older
"child" means they started before 14
Previous research: 35% of software developers started programming as children
Is there evidence that a greater proportion of data scientists started programming as children?
Definitions
A hypothesis is a statement about an unknown population parameter
A hypothesis test is a test of two competing hypotheses
The null hypothesis (H0 ) is the existing idea
The alternative hypothesis (HA ) is the new "challenger" idea of the researcher
For our problem:
H0 : The proportion of data scientists starting programming as children is 35%
HA : The proportion of data scientists starting programming as children is greater than 35%
1"Naught" is British English for "zero". For historical reasons, "H-naught" is the international convention for
pronouncing the null hypothesis.
Criminal trials vs. hypothesis testing
Either HA or H0 is true (not both)
Initially, H0 is assumed to be true
The test ends in either "reject H0 " or "fail to reject H0 "
If the sample evidence significantly favors HA, reject H0; otherwise, fail to reject H0
Significance level is "beyond a reasonable doubt" for hypothesis testing
One-tailed and two-tailed tests
Hypothesis tests check if the sample statistics
lie in the tails of the null distribution
Test Tails
alternative different from null two-tailed
alternative greater than null right-tailed
alternative less than null left-tailed
HA : The proportion of data scientists starting
programming as children is greater than 35%
This is a right-tailed test
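The table above can be sketched as a small helper; `p_value_from_z` is an illustrative name (not course code) showing which area of the null distribution each tail type measures, using the normal CDF from `scipy.stats`:

```python
from scipy.stats import norm

def p_value_from_z(z_score, tails):
    """Convert a z-score into a p-value for the given test tails."""
    if tails == "right":     # alternative greater than null
        return 1 - norm.cdf(z_score)
    elif tails == "left":    # alternative less than null
        return norm.cdf(z_score)
    else:                    # "two": alternative different from null
        return 2 * (1 - norm.cdf(abs(z_score)))

print(p_value_from_z(1.96, "two"))  # ~0.05
```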
p-values
p-value: the probability of obtaining a result at least as extreme as the one
observed, assuming the null hypothesis is true
Large p-value, large support for H0
Statistic likely not in the tail of the null
distribution
Small p-value, strong evidence against H0
Statistic likely in the tail of the null
distribution
"p" in p-value → probability
"small" means "close to zero"
Calculating the z-score
prop_child_samp = (stack_overflow['age_first_code_cut'] == "child").mean()
0.39141972578505085
prop_child_hyp = 0.35
std_error = np.std(first_code_boot_distn, ddof=1)
0.010351057228878566
z_score = (prop_child_samp - prop_child_hyp) / std_error
4.001497129152506
Calculating the p-value
norm.cdf() is the normal CDF from scipy.stats.
Left-tailed test → use norm.cdf().
Right-tailed test → use 1 - norm.cdf().
from scipy.stats import norm
1 - norm.cdf(z_score, loc=0, scale=1)
3.1471479512323874e-05
Let's practice!
Statistical
significance
p-value recap
p-values quantify the strength of evidence against the null hypothesis
Large p-value → fail to reject null hypothesis
Small p-value → reject null hypothesis
Where is the cutoff point?
Significance level
The significance level of a hypothesis test (α) is the threshold point for "beyond a
reasonable doubt"
Common values of α are 0.2 , 0.1 , 0.05 , and 0.01
If p ≤ α, reject H0 , else fail to reject H0
α should be set prior to conducting the hypothesis test
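The decision rule can be sketched as a tiny helper (`decide` is a hypothetical name, not part of the course code):

```python
def decide(p_value, alpha):
    """Decision rule: reject H0 when the p-value is at most alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.03, alpha=0.05))  # reject H0
print(decide(0.20, alpha=0.05))  # fail to reject H0
```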
Calculating the p-value
alpha = 0.05
prop_child_samp = (stack_overflow['age_first_code_cut'] == "child").mean()
prop_child_hyp = 0.35
std_error = np.std(first_code_boot_distn, ddof=1)
z_score = (prop_child_samp - prop_child_hyp) / std_error
p_value = 1 - norm.cdf(z_score, loc=0, scale=1)
3.1471479512323874e-05
Making a decision
alpha = 0.05
print(p_value)
3.1471479512323874e-05
p_value <= alpha
True
Reject H0 in favor of HA
Confidence intervals
For a significance level of α, it's common to choose a confidence interval level of 1 - α
α = 0.05 → 95% confidence interval
import numpy as np
lower = np.quantile(first_code_boot_distn, 0.025)
upper = np.quantile(first_code_boot_distn, 0.975)
print((lower, upper))
(0.37063246351172047, 0.41132242370632466)
Types of errors
                     Truly didn't commit crime    Truly committed crime
Verdict: not guilty  correct                      they got away with it
Verdict: guilty      wrongful conviction          correct

                     actual H0                    actual HA
chosen H0            correct                      false negative
chosen HA            false positive               correct
False positives are Type I errors; false negatives are Type II errors.
Possible errors in our example
If p ≤ α, we reject H0 :
A false positive (Type I) error: data scientists didn't start coding as children at a higher rate
If p > α, we fail to reject H0 :
A false negative (Type II) error: data scientists started coding as children at a higher rate
Let's practice!
Performing t-tests
Two-sample problems
Compare sample statistics across groups of a variable
converted_comp is a numerical variable
age_first_code_cut is a categorical variable with levels "child" and "adult"
Are users who first programmed as a child compensated higher than those that started as
adults?
Hypotheses
H0: The mean compensation (in USD) is the same for those that coded first as a child and those that coded first as an adult.
H0: μ_child = μ_adult
H0: μ_child − μ_adult = 0
HA: The mean compensation (in USD) is greater for those that coded first as a child compared to those that coded first as an adult.
HA: μ_child > μ_adult
HA: μ_child − μ_adult > 0
Calculating groupwise summary statistics
stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()
age_first_code_cut
adult 111313.311047
child 132419.570621
Name: converted_comp, dtype: float64
Test statistics
Sample mean estimates the population mean
x̄: a sample mean
x̄_child: sample mean compensation for coding first as a child
x̄_adult: sample mean compensation for coding first as an adult
x̄_child − x̄_adult: a test statistic
z-score - a (standardized) test statistic
Standardizing the test statistic
z = (sample stat − population parameter) / standard error

t = (difference in sample stats − difference in population parameters) / standard error

t = ((x̄_child − x̄_adult) − (μ_child − μ_adult)) / SE(x̄_child − x̄_adult)
Standard error
SE(x̄_child − x̄_adult) ≈ √(s²_child / n_child + s²_adult / n_adult)
s is the standard deviation of the variable
n is the sample size (number of observations/rows in sample)
Assuming the null hypothesis is true
t = ((x̄_child − x̄_adult) − (μ_child − μ_adult)) / SE(x̄_child − x̄_adult)

H0: μ_child − μ_adult = 0 → t = (x̄_child − x̄_adult) / SE(x̄_child − x̄_adult)

t = (x̄_child − x̄_adult) / √(s²_child / n_child + s²_adult / n_adult)
Calculations assuming the null hypothesis is true
xbar = stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()

age_first_code_cut
adult    111313.311047
child    132419.570621
Name: converted_comp, dtype: float64

s = stack_overflow.groupby('age_first_code_cut')['converted_comp'].std()

age_first_code_cut
adult    271546.521729
child    255585.240115
Name: converted_comp, dtype: float64

n = stack_overflow.groupby('age_first_code_cut')['converted_comp'].count()

age_first_code_cut
adult    1376
child     885
Name: converted_comp, dtype: int64
Calculating the test statistic
t = (x̄_child − x̄_adult) / √(s²_child / n_child + s²_adult / n_adult)
import numpy as np

# Extract the per-group values from the grouped summaries
xbar_child, xbar_adult = xbar['child'], xbar['adult']
s_child, s_adult = s['child'], s['adult']
n_child, n_adult = n['child'], n['adult']

numerator = xbar_child - xbar_adult
denominator = np.sqrt(s_child ** 2 / n_child + s_adult ** 2 / n_adult)
t_stat = numerator / denominator

1.8699313316221844
Let's practice!
Calculating p-values
from t-statistics
t-distributions
The t statistic follows a t-distribution
t-distributions have a parameter named degrees of freedom, or df
They look like normal distributions, but with fatter tails
Degrees of freedom
Larger degrees of freedom → t-distribution
gets closer to the normal distribution
Normal distribution → t-distribution with
infinite df
Degrees of freedom: maximum number of
logically independent values in the data
sample
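A quick sketch of this convergence using `scipy.stats`: the gap between the t-distribution's CDF and the normal CDF shrinks as df grows.

```python
from scipy.stats import norm, t

# As degrees of freedom grow, the t-distribution's CDF approaches
# the standard normal CDF
for df in [2, 30, 1000]:
    gap = abs(t.cdf(1.7, df=df) - norm.cdf(1.7))
    print(df, gap)
```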
Calculating degrees of freedom
Dataset has 5 independent observations
Four of the values are 2, 6, 8, and 5
The sample mean is 5
The last value must be 4
Here, there are 4 degrees of freedom
df = nchild + nadult − 2
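The worked example above as a sketch: once the sample mean is fixed, only four of the five values are free to vary.

```python
# Four of the five values and the sample mean are known
values = [2, 6, 8, 5]   # the four known values
sample_mean = 5
n = 5                   # total number of observations
last_value = sample_mean * n - sum(values)
print(last_value)  # 4 -- the fifth value is fully determined
```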
Hypotheses
H0 : The mean compensation (in USD) is the same for those that coded first as a child and
those that coded first as an adult
HA : The mean compensation (in USD) is greater for those that coded first as a child
compared to those that coded first as an adult
Use a right-tailed test
Significance level
α = 0.1
If p ≤ α then reject H0 .
Calculating p-values: one proportion vs. a value
from scipy.stats import norm
1 - norm.cdf(z_score)
SE(x̄_child − x̄_adult) ≈ √(s²_child / n_child + s²_adult / n_adult)
z-statistic: needed when using one sample statistic to estimate a population parameter
t-statistic: needed when using multiple sample statistics to estimate a population parameter
Calculating p-values: two means from different groups
numerator = xbar_child - xbar_adult
denominator = np.sqrt(s_child ** 2 / n_child + s_adult ** 2 / n_adult)
t_stat = numerator / denominator
1.8699313316221844
degrees_of_freedom = n_child + n_adult - 2
2259
Calculating p-values: two means from different groups
Use t-distribution CDF not normal CDF
from scipy.stats import t
1 - t.cdf(t_stat, df=degrees_of_freedom)
0.030811302165157595
Evidence that Stack Overflow data scientists who started coding as a child earn more.
Let's practice!
Paired t-tests
US Republican presidents dataset
state county repub_percent_08 repub_percent_12
0 Alabama Hale 38.957877 37.139882
1 Arkansas Nevada 56.726272 58.983452
2 California Lake 38.896719 39.331367
3 California Ventura 42.923190 45.250693
.. ... ... ... ...
96 Wisconsin La Crosse 37.490904 40.577038
97 Wisconsin Lafayette 38.104967 41.675050
98 Wyoming Weston 76.684241 83.983328
99 Alaska District 34 77.063259 40.789626
[100 rows x 4 columns]
100 rows; each row represents county-level votes in a presidential election.
1 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ
Hypotheses
Question: Was the percentage of Republican candidate votes lower in 2008 than 2012?
H0 : μ2008 − μ2012 = 0
HA : μ2008 − μ2012 < 0
Set α = 0.05 significance level.
Data is paired → the two voting percentages in each row refer to the same county
Want to capture voting patterns in model
From two samples to one
sample_data = repub_votes_potus_08_12
sample_data['diff'] = sample_data['repub_percent_08'] - sample_data['repub_percent_12']
import matplotlib.pyplot as plt
sample_data['diff'].hist(bins=20)
Calculate sample statistics of the difference
xbar_diff = sample_data['diff'].mean()
-2.877109041242944
Revised hypotheses
Old hypotheses:
H0: μ_2008 − μ_2012 = 0
HA: μ_2008 − μ_2012 < 0

New hypotheses:
H0: μ_diff = 0
HA: μ_diff < 0

t = (x̄_diff − μ_diff) / √(s²_diff / n_diff)
df = n_diff − 1
Calculating the p-value
t = (x̄_diff − μ_diff) / √(s²_diff / n_diff)
df = n_diff − 1

n_diff = len(sample_data)

100

s_diff = sample_data['diff'].std()
t_stat = (xbar_diff - 0) / np.sqrt(s_diff**2 / n_diff)

-5.601043121928489

degrees_of_freedom = n_diff - 1

99

from scipy.stats import t
p_value = t.cdf(t_stat, df=n_diff-1)

9.572537285272411e-08
Testing differences between two means using ttest()
import pingouin
pingouin.ttest(x=sample_data['diff'],
y=0,
alternative="less")
T dof alternative p-val CI95% cohen-d \
T-test -5.601043 99 less 9.572537e-08 [-inf, -2.02] 0.560104
BF10 power
T-test 1.323e+05 1.0
1Details on Returns from pingouin.ttest() are available in the API docs for pingouin at https://pingouin-
stats.org/generated/pingouin.ttest.html#pingouin.ttest.
ttest() with paired=True
pingouin.ttest(x=sample_data['repub_percent_08'],
y=sample_data['repub_percent_12'],
paired=True,
alternative="less")
T dof alternative p-val CI95% cohen-d \
T-test -5.601043 99 less 9.572537e-08 [-inf, -2.02] 0.217364
BF10 power
T-test 1.323e+05 0.696338
Unpaired ttest()
pingouin.ttest(x=sample_data['repub_percent_08'],
y=sample_data['repub_percent_12'],
paired=False, # The default
alternative="less")
T dof alternative p-val CI95% cohen-d BF10 \
T-test -1.536997 198 less 0.062945 [-inf, 0.22] 0.217364 0.927
power
T-test 0.454972
Using an unpaired t-test on paired data increases the chance of false negative errors
Let's practice!
ANOVA tests
Job satisfaction: 5 categories
stack_overflow['job_sat'].value_counts()
Very satisfied 879
Slightly satisfied 680
Slightly dissatisfied 342
Neither 201
Very dissatisfied 159
Name: job_sat, dtype: int64
Visualizing multiple distributions
Is mean annual compensation different for
different levels of job satisfaction?
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x="converted_comp",
y="job_sat",
data=stack_overflow)
plt.show()
Analysis of variance (ANOVA)
A test for differences between groups
alpha = 0.2
pingouin.anova(data=stack_overflow,
dv="converted_comp",
between="job_sat")
Source ddof1 ddof2 F p-unc np2
0 job_sat 4 2256 4.480485 0.001315 0.007882
0.001315 <α
At least two categories have significantly different compensation
Pairwise tests
μ_very dissatisfied ≠ μ_slightly dissatisfied
μ_very dissatisfied ≠ μ_neither
μ_very dissatisfied ≠ μ_slightly satisfied
μ_very dissatisfied ≠ μ_very satisfied
μ_slightly dissatisfied ≠ μ_neither
μ_slightly dissatisfied ≠ μ_slightly satisfied
μ_slightly dissatisfied ≠ μ_very satisfied
μ_neither ≠ μ_slightly satisfied
μ_neither ≠ μ_very satisfied
μ_slightly satisfied ≠ μ_very satisfied
Set significance level to α = 0.2.
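The ten hypotheses above are all the ways of choosing 2 of the 5 categories:

```python
from math import comb

k = 5              # five job satisfaction categories
print(comb(k, 2))  # 10 -- one hypothesis per pair of categories
```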
pairwise_tests()
pingouin.pairwise_tests(data=stack_overflow,
dv="converted_comp",
between="job_sat",
padjust="none")
Contrast A B Paired Parametric ... dof alternative p-unc BF10 hedges
0 job_sat Slightly satisfied Very satisfied False True ... 1478.622799 two-sided 0.000064 158.564 -0.192931
1 job_sat Slightly satisfied Neither False True ... 258.204546 two-sided 0.484088 0.114 -0.068513
2 job_sat Slightly satisfied Very dissatisfied False True ... 187.153329 two-sided 0.215179 0.208 -0.145624
3 job_sat Slightly satisfied Slightly dissatisfied False True ... 569.926329 two-sided 0.969491 0.074 -0.002719
4 job_sat Very satisfied Neither False True ... 328.326639 two-sided 0.097286 0.337 0.120115
5 job_sat Very satisfied Very dissatisfied False True ... 221.666205 two-sided 0.455627 0.126 0.063479
6 job_sat Very satisfied Slightly dissatisfied False True ... 821.303063 two-sided 0.002166 7.43 0.173247
7 job_sat Neither Very dissatisfied False True ... 321.165726 two-sided 0.585481 0.135 -0.058537
8 job_sat Neither Slightly dissatisfied False True ... 367.730081 two-sided 0.547406 0.118 0.055707
9 job_sat Very dissatisfied Slightly dissatisfied False True ... 247.570187 two-sided 0.259590 0.197 0.119131
[10 rows x 11 columns]
As the number of groups increases...
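A rough sketch of why this matters: if each pairwise test is run at significance level α, the chance of at least one false positive (the family-wise error rate) grows quickly with the number of groups. This treats the tests as independent, which is only an approximation:

```python
# Family-wise error rate: chance of at least one false positive across
# m tests, each run at significance level alpha (independence assumed)
alpha = 0.2
for k in [3, 5, 8]:           # number of groups being compared
    m = k * (k - 1) // 2      # number of pairwise tests
    fwer = 1 - (1 - alpha) ** m
    print(f"{k} groups -> {m} tests -> FWER ~ {fwer:.3f}")
```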
Bonferroni correction
pingouin.pairwise_tests(data=stack_overflow,
dv="converted_comp",
between="job_sat",
padjust="bonf")
Contrast A B ... p-unc p-corr p-adjust BF10 hedges
0 job_sat Slightly satisfied Very satisfied ... 0.000064 0.000638 bonf 158.564 -0.192931
1 job_sat Slightly satisfied Neither ... 0.484088 1.000000 bonf 0.114 -0.068513
2 job_sat Slightly satisfied Very dissatisfied ... 0.215179 1.000000 bonf 0.208 -0.145624
3 job_sat Slightly satisfied Slightly dissatisfied ... 0.969491 1.000000 bonf 0.074 -0.002719
4 job_sat Very satisfied Neither ... 0.097286 0.972864 bonf 0.337 0.120115
5 job_sat Very satisfied Very dissatisfied ... 0.455627 1.000000 bonf 0.126 0.063479
6 job_sat Very satisfied Slightly dissatisfied ... 0.002166 0.021659 bonf 7.43 0.173247
7 job_sat Neither Very dissatisfied ... 0.585481 1.000000 bonf 0.135 -0.058537
8 job_sat Neither Slightly dissatisfied ... 0.547406 1.000000 bonf 0.118 0.055707
9 job_sat Very dissatisfied Slightly dissatisfied ... 0.259590 1.000000 bonf 0.197 0.119131
[10 rows x 11 columns]
More methods
padjust : string
Method used for testing and adjustment of p-values.
'none' : no correction [default]
'bonf' : one-step Bonferroni correction
'sidak' : one-step Sidak correction
'holm' : step-down method using Bonferroni adjustments
'fdr_bh' : Benjamini/Hochberg FDR correction
'fdr_by' : Benjamini/Yekutieli FDR correction
Let's practice!
One-sample
proportion tests
Chapter 1 recap
Is a claim about an unknown population proportion feasible?
1. Standard error of sample statistic from bootstrap distribution
2. Compute a standardized test statistic
3. Calculate a p-value
4. Decide which hypothesis made most sense
Now, calculate the test statistic without using the bootstrap distribution
Standardized test statistic for proportions
p: population proportion (unknown population parameter)
p̂: sample proportion (sample statistic)
p₀: hypothesized population proportion

z = (p̂ − mean(p̂)) / SE(p̂) = (p̂ − p) / SE(p̂)

Assuming H0 is true, p = p₀, so

z = (p̂ − p₀) / SE(p̂)
Simplifying the standard error calculations
SE(p̂) = √(p₀ × (1 − p₀) / n) → Under H0, SE(p̂) depends only on the hypothesized p₀ and the sample size n

Assuming H0 is true,

z = (p̂ − p₀) / √(p₀ × (1 − p₀) / n)

Only uses sample information (p̂ and n) and the hypothesized parameter (p₀)
Why z instead of t?
t = (x̄_child − x̄_adult) / √(s²_child / n_child + s²_adult / n_adult)

s is calculated from x̄
x̄ estimates the population mean
s estimates the population standard deviation
→ increased uncertainty in our estimate of the parameter
t-distribution: fatter tails than a normal distribution
p̂ only appears in the numerator, so z-scores are fine
Stack Overflow age categories
H0 : Proportion of Stack Overflow users under thirty = 0.5
HA : Proportion of Stack Overflow users under thirty ≠ 0.5
alpha = 0.01
stack_overflow['age_cat'].value_counts(normalize=True)
Under 30 0.535604
At least 30 0.464396
Name: age_cat, dtype: float64
Variables for z
p_hat = (stack_overflow['age_cat'] == 'Under 30').mean()
0.5356037151702786
p_0 = 0.50
n = len(stack_overflow)
2261
Calculating the z-score
z = (p̂ − p₀) / √(p₀ × (1 − p₀) / n)
import numpy as np
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0) / n)
z_score = numerator / denominator
3.385911440783663
Calculating the p-value
Two-tailed ("not equal"):

from scipy.stats import norm
p_value = norm.cdf(-z_score) + 1 - norm.cdf(z_score)
p_value = 2 * (1 - norm.cdf(z_score))

0.0007094227368100725

p_value <= alpha

True

Left-tailed ("less than"):

p_value = norm.cdf(z_score)

Right-tailed ("greater than"):

p_value = 1 - norm.cdf(z_score)
Let's practice!
Two-sample
proportion tests
Comparing two proportions
H0: The proportion of hobbyist users is the same for those under thirty as for those at least thirty
H0: p_≥30 − p_<30 = 0
HA: The proportion of hobbyist users is different for those under thirty than for those at least thirty
HA: p_≥30 − p_<30 ≠ 0

alpha = 0.05
Calculating the z-score
z-score equation for a proportion test:
z = ((p̂_≥30 − p̂_<30) − 0) / SE(p̂_≥30 − p̂_<30)

Standard error equation:

SE(p̂_≥30 − p̂_<30) = √(p̂ × (1 − p̂) / n_≥30 + p̂ × (1 − p̂) / n_<30)

p̂ → weighted mean of p̂_≥30 and p̂_<30:

p̂ = (n_≥30 × p̂_≥30 + n_<30 × p̂_<30) / (n_≥30 + n_<30)

Only p̂_≥30, p̂_<30, n_≥30, and n_<30 are needed from the sample to calculate the z-score
Getting the numbers for the z-score
p_hats = stack_overflow.groupby("age_cat")['hobbyist'].value_counts(normalize=True)
age_cat hobbyist
At least 30 Yes 0.773333
No 0.226667
Under 30 Yes 0.843105
No 0.156895
Name: hobbyist, dtype: float64
n = stack_overflow.groupby("age_cat")['hobbyist'].count()
age_cat
At least 30 1050
Under 30 1211
Name: hobbyist, dtype: int64
p_hat_at_least_30 = p_hats[("At least 30", "Yes")]
p_hat_under_30 = p_hats[("Under 30", "Yes")]
print(p_hat_at_least_30, p_hat_under_30)
0.773333 0.843105
n_at_least_30 = n["At least 30"]
n_under_30 = n["Under 30"]
print(n_at_least_30, n_under_30)
1050 1211
p_hat = (n_at_least_30 * p_hat_at_least_30 +
         n_under_30 * p_hat_under_30) / (n_at_least_30 + n_under_30)
std_error = np.sqrt(p_hat * (1-p_hat) / n_at_least_30 +
p_hat * (1-p_hat) / n_under_30)
z_score = (p_hat_at_least_30 - p_hat_under_30) / std_error
print(z_score)
-4.223718652693034
Proportion tests using proportions_ztest()
stack_overflow.groupby("age_cat")['hobbyist'].value_counts()
age_cat hobbyist
At least 30 Yes 812
No 238
Under 30 Yes 1021
No 190
Name: hobbyist, dtype: int64
n_hobbyists = np.array([812, 1021])
n_rows = np.array([812 + 238, 1021 + 190])
from statsmodels.stats.proportion import proportions_ztest
z_score, p_value = proportions_ztest(count=n_hobbyists, nobs=n_rows,
alternative="two-sided")
(-4.223691463320559, 2.403330142685068e-05)
Let's practice!
Chi-square test of
independence
Revisiting the proportion test
age_by_hobbyist = stack_overflow.groupby("age_cat")['hobbyist'].value_counts()
age_cat hobbyist
At least 30 Yes 812
No 238
Under 30 Yes 1021
No 190
Name: hobbyist, dtype: int64
from statsmodels.stats.proportion import proportions_ztest
n_hobbyists = np.array([812, 1021])
n_rows = np.array([812 + 238, 1021 + 190])
stat, p_value = proportions_ztest(count=n_hobbyists, nobs=n_rows,
alternative="two-sided")
(-4.223691463320559, 2.403330142685068e-05)
Independence of variables
Previous hypothesis test result: evidence that hobbyist and age_cat are associated
Statistical independence - proportion of successes in the response variable is the same
across all categories of the explanatory variable
Test for independence of variables
import pingouin
expected, observed, stats = pingouin.chi2_independence(data=stack_overflow, x='hobbyist',
y='age_cat', correction=False)
print(stats)
test lambda chi2 dof pval cramer power
0 pearson 1.000000 17.839570 1.0 0.000024 0.088826 0.988205
1 cressie-read 0.666667 17.818114 1.0 0.000024 0.088773 0.988126
2 log-likelihood 0.000000 17.802653 1.0 0.000025 0.088734 0.988069
3 freeman-tukey -0.500000 17.815060 1.0 0.000024 0.088765 0.988115
4 mod-log-likelihood -1.000000 17.848099 1.0 0.000024 0.088848 0.988236
5 neyman -2.000000 17.976656 1.0 0.000022 0.089167 0.988694
χ² statistic = 17.839570 = (−4.223691463320559)² = (z-score)²
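A quick check of this identity:

```python
# For a 2x2 case, the pearson chi-square statistic is the square of the
# z-score from the corresponding two-sample proportion test
z_score = -4.223691463320559
print(z_score ** 2)  # ~17.83957
```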
Job satisfaction and age category
stack_overflow['age_cat'].value_counts()

Under 30       1211
At least 30    1050
Name: age_cat, dtype: int64

stack_overflow['job_sat'].value_counts()

Very satisfied           879
Slightly satisfied       680
Slightly dissatisfied    342
Neither                  201
Very dissatisfied        159
Name: job_sat, dtype: int64
Declaring the hypotheses
H0 : Age categories are independent of job satisfaction levels
HA : Age categories are not independent of job satisfaction levels
alpha = 0.1
Test statistic denoted χ2
Assuming independence, how far away are the observed results from the expected values?
Exploratory visualization: proportional stacked bar plot
props = stack_overflow.groupby('job_sat')['age_cat'].value_counts(normalize=True)
wide_props = props.unstack()
wide_props.plot(kind="bar", stacked=True)
Chi-square independence test
import pingouin
expected, observed, stats = pingouin.chi2_independence(data=stack_overflow, x="job_sat", y="age_cat")
print(stats)
test lambda chi2 dof pval cramer power
0 pearson 1.000000 5.552373 4.0 0.235164 0.049555 0.437417
1 cressie-read 0.666667 5.554106 4.0 0.235014 0.049563 0.437545
2 log-likelihood 0.000000 5.558529 4.0 0.234632 0.049583 0.437871
3 freeman-tukey -0.500000 5.562688 4.0 0.234274 0.049601 0.438178
4 mod-log-likelihood -1.000000 5.567570 4.0 0.233854 0.049623 0.438538
5 neyman -2.000000 5.579519 4.0 0.232828 0.049676 0.439419
Degrees of freedom:
(No. of response categories − 1) × (No. of explanatory categories − 1)
(2 − 1) ∗ (5 − 1) = 4
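The degrees-of-freedom rule as a sketch (`chi2_dof` is an illustrative helper, not a library function):

```python
def chi2_dof(n_response_cats, n_explanatory_cats):
    """Degrees of freedom for a chi-square test of independence."""
    return (n_response_cats - 1) * (n_explanatory_cats - 1)

print(chi2_dof(2, 5))  # 4 -- matching the dof column above
```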
Swapping the variables?
props = stack_overflow.groupby('age_cat')['job_sat'].value_counts(normalize=True)
wide_props = props.unstack()
wide_props.plot(kind="bar", stacked=True)
chi-square both ways
expected, observed, stats = pingouin.chi2_independence(data=stack_overflow, x="age_cat", y="job_sat")
print(stats[stats['test'] == 'pearson'])
test lambda chi2 dof pval cramer power
0 pearson 1.0 5.552373 4.0 0.235164 0.049555 0.437417
Ask: Are the variables X and Y independent?
Not: Is variable X independent from variable Y?
What about direction and tails?
The χ² statistic is a sum of squared differences between observed and expected counts, so it's always non-negative
chi-square tests are almost always right-tailed 1
1Left-tailed chi-square tests are used in statistical forensics to detect if a fit is suspiciously good because the
data was fabricated. Chi-square tests of variance can be two-tailed. These are niche uses, though.
Let's practice!
Chi-square
goodness of fit tests
Purple links
How do you feel when you discover that you've already visited the top resource?
purple_link_counts = stack_overflow['purple_link'].value_counts()
purple_link_counts = purple_link_counts.rename_axis('purple_link')\
.reset_index(name='n')\
.sort_values('purple_link')
purple_link n
2 Amused 368
3 Annoyed 263
0 Hello, old friend 1225
1 Indifferent 405
Declaring the hypotheses
hypothesized = pd.DataFrame({
    'purple_link': ['Amused', 'Annoyed', 'Hello, old friend', 'Indifferent'],
    'prop': [1/6, 1/6, 1/2, 1/6]})

         purple_link      prop
0             Amused  0.166667
1            Annoyed  0.166667
2  Hello, old friend  0.500000
3        Indifferent  0.166667

H0: The sample matches the hypothesized distribution
HA: The sample does not match the hypothesized distribution

χ² measures how far observed results are from expectations in each group

alpha = 0.01
Hypothesized counts by category
n_total = len(stack_overflow)
hypothesized["n"] = hypothesized["prop"] * n_total
purple_link prop n
0 Amused 0.166667 376.833333
1 Annoyed 0.166667 376.833333
2 Hello, old friend 0.500000 1130.500000
3 Indifferent 0.166667 376.833333
Visualizing counts
import matplotlib.pyplot as plt
plt.bar(purple_link_counts['purple_link'], purple_link_counts['n'],
color='red', label='Observed')
plt.bar(hypothesized['purple_link'], hypothesized['n'], alpha=0.5,
color='blue', label='Hypothesized')
plt.legend()
plt.show()
HYPOTHESIS TESTING IN PYTHON
Visualizing counts
HYPOTHESIS TESTING IN PYTHON
chi-square goodness of fit test
print(hypothesized)
purple_link prop n
0 Amused 0.166667 376.833333
1 Annoyed 0.166667 376.833333
2 Hello, old friend 0.500000 1130.500000
3 Indifferent 0.166667 376.833333
from scipy.stats import chisquare
chisquare(f_obs=purple_link_counts['n'], f_exp=hypothesized['n'])
Power_divergenceResult(statistic=44.59840778416629, pvalue=1.1261810719413759e-09)
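As a sanity check, the statistic can be reproduced by hand from the observed and expected counts above (a pure-Python sketch of the same computation `chisquare` performs):

```python
# Observed counts sorted by category (Amused, Annoyed, Hello old friend,
# Indifferent) and the hypothesized proportions from the slides above
obs = [368, 263, 1225, 405]
props = [1/6, 1/6, 1/2, 1/6]
n = sum(obs)                  # 2261 respondents
exp = [p * n for p in props]  # hypothesized counts per category
statistic = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
print(statistic)  # ~44.598, matching the chisquare() output
```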
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Assumptions in
hypothesis testing
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Randomness
Assumption
The samples are random subsets of larger
populations
Consequence
Sample is not representative of population
How to check this
Understand how your data was collected
Speak to the data collector/domain expert
1 Sampling techniques are discussed in "Sampling in Python".
HYPOTHESIS TESTING IN PYTHON
Independence of observations
Assumption
Each observation (row) in the dataset is independent
Consequence
Increased chance of false negative/positive error
How to check this
Understand how your data was collected
HYPOTHESIS TESTING IN PYTHON
Large sample size
Assumption
The sample is big enough to mitigate uncertainty, so that the Central Limit Theorem applies
Consequence
Wider confidence intervals
Increased chance of false negative/positive errors
How to check this
It depends on the test
HYPOTHESIS TESTING IN PYTHON
Large sample size: t-test
One sample: at least 30 observations in the sample (n ≥ 30)
Two samples: at least 30 observations in each sample (n1 ≥ 30, n2 ≥ 30)
Paired samples: at least 30 pairs of observations across the samples
(number of rows in our data ≥ 30)
ANOVA: at least 30 observations in each sample (ni ≥ 30 for all values of i)
n: sample size; ni : sample size for group i
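These conditions are simple enough to encode directly; a hypothetical helper (not part of any library) covering the one-sample, two-sample, paired, and ANOVA cases:

```python
def t_test_sample_size_ok(*group_sizes, min_n=30):
    """Return True if every group meets the minimum sample size.

    Pass one size for a one-sample test, two for a two-sample test,
    and one per group for ANOVA. For paired tests, pass the number of pairs.
    """
    return all(n >= min_n for n in group_sizes)

print(t_test_sample_size_ok(120))     # one-sample: True
print(t_test_sample_size_ok(45, 28))  # two-sample: False (second group too small)
```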
HYPOTHESIS TESTING IN PYTHON
Large sample size: proportion tests
One sample:
Number of successes in sample is greater than or equal to 10: n × p̂ ≥ 10
Number of failures in sample is greater than or equal to 10: n × (1 − p̂) ≥ 10
Two samples:
Number of successes in each sample is greater than or equal to 10: n1 × p̂1 ≥ 10, n2 × p̂2 ≥ 10
Number of failures in each sample is greater than or equal to 10: n1 × (1 − p̂1) ≥ 10, n2 × (1 − p̂2) ≥ 10
n: sample size; p̂: proportion of successes in sample
HYPOTHESIS TESTING IN PYTHON
Large sample size: chi-square tests
The number of successes in each group is greater than or equal to 5
ni × p̂i ≥ 5 for all values of i
The number of failures in each group is greater than or equal to 5
ni × (1 − p̂i) ≥ 5 for all values of i
ni : sample size for group i
p^i : proportion of successes in sample group i
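The proportion-test and chi-square conditions have the same shape, differing only in the threshold (10 vs. 5); a hypothetical helper to check one group:

```python
def counts_ok(n, p_hat, threshold):
    """Check that expected successes and failures both meet the threshold."""
    return n * p_hat >= threshold and n * (1 - p_hat) >= threshold

# Proportion test (threshold 10): 100 trials at 8% gives only 8 successes
print(counts_ok(100, 0.08, threshold=10))  # False
# Chi-square test (threshold 5): the same group passes the weaker condition
print(counts_ok(100, 0.08, threshold=5))   # True
```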
HYPOTHESIS TESTING IN PYTHON
Sanity check
If the bootstrap distribution doesn't look normal, assumptions likely aren't valid
Revisit data collection to check for randomness, independence, and sample size
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Non-parametric
tests
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Parametric tests
z-test, t-test, and ANOVA are all parametric tests
Assume a normal distribution
Require sufficiently large sample sizes
HYPOTHESIS TESTING IN PYTHON
Smaller Republican votes data
print(repub_votes_small)
state county repub_percent_08 repub_percent_12
80 Texas Red River 68.507522 69.944817
84 Texas Walker 60.707197 64.971903
33 Kentucky Powell 57.059533 61.727293
81 Texas Schleicher 74.386503 77.384464
93 West Virginia Morgan 60.857614 64.068711
HYPOTHESIS TESTING IN PYTHON
Results with pingouin.ttest()
5 pairs is not enough to meet the sample size condition for the paired t-test:
At least 30 pairs of observations across the samples.
alpha = 0.01
import pingouin
pingouin.ttest(x=repub_votes_small['repub_percent_08'],
               y=repub_votes_small['repub_percent_12'],
paired=True,
alternative="less")
T dof alternative p-val CI95% cohen-d BF10 power
T-test -5.875753 4 less 0.002096 [-inf, -2.11] 0.500068 26.468 0.239034
HYPOTHESIS TESTING IN PYTHON
Non-parametric tests
Non-parametric tests avoid the parametric assumptions and conditions
Many non-parametric tests use ranks of the data
x = [1, 15, 3, 10, 6]
from scipy.stats import rankdata
rankdata(x)
array([1., 5., 2., 4., 3.])
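By default, `rankdata` assigns tied values the average of the ranks they cover; that behavior can be sketched in pure Python (an illustrative re-implementation, not the SciPy source):

```python
def average_ranks(values):
    """Rank values from 1; tied values share the average of their positions."""
    ordered = sorted(values)
    return [sum(i + 1 for i, v in enumerate(ordered) if v == x) / ordered.count(x)
            for x in values]

print(average_ranks([1, 15, 3, 10, 6]))  # [1.0, 5.0, 2.0, 4.0, 3.0]
print(average_ranks([1, 2, 2, 3]))       # ties: [1.0, 2.5, 2.5, 4.0]
```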
HYPOTHESIS TESTING IN PYTHON
Non-parametric tests
Non-parametric tests are more reliable than parametric tests for small sample sizes and
when data isn't normally distributed
HYPOTHESIS TESTING IN PYTHON
Non-parametric tests
Non-parametric tests are more reliable than parametric tests for small sample sizes and
when data isn't normally distributed
Wilcoxon signed-rank test
Developed by Frank Wilcoxon in 1945
One of the first non-parametric procedures
HYPOTHESIS TESTING IN PYTHON
Wilcoxon signed-rank test (Step 1)
Works on the ranked absolute differences between the pairs of data
repub_votes_small['diff'] = (repub_votes_small['repub_percent_08'] -
                             repub_votes_small['repub_percent_12'])
print(repub_votes_small)
state county repub_percent_08 repub_percent_12 diff
80 Texas Red River 68.507522 69.944817 -1.437295
84 Texas Walker 60.707197 64.971903 -4.264705
33 Kentucky Powell 57.059533 61.727293 -4.667760
81 Texas Schleicher 74.386503 77.384464 -2.997961
93 West Virginia Morgan 60.857614 64.068711 -3.211097
HYPOTHESIS TESTING IN PYTHON
Wilcoxon signed-rank test (Step 2)
Works on the ranked absolute differences between the pairs of data
repub_votes_small['abs_diff'] = repub_votes_small['diff'].abs()
print(repub_votes_small)
state county repub_percent_08 repub_percent_12 diff abs_diff
80 Texas Red River 68.507522 69.944817 -1.437295 1.437295
84 Texas Walker 60.707197 64.971903 -4.264705 4.264705
33 Kentucky Powell 57.059533 61.727293 -4.667760 4.667760
81 Texas Schleicher 74.386503 77.384464 -2.997961 2.997961
93 West Virginia Morgan 60.857614 64.068711 -3.211097 3.211097
HYPOTHESIS TESTING IN PYTHON
Wilcoxon signed-rank test (Step 3)
Works on the ranked absolute differences between the pairs of data
from scipy.stats import rankdata
repub_votes_small['rank_abs_diff'] = rankdata(repub_votes_small['abs_diff'])
print(repub_votes_small)
state county repub_percent_08 repub_percent_12 diff abs_diff rank_abs_diff
80 Texas Red River 68.507522 69.944817 -1.437295 1.437295 1.0
84 Texas Walker 60.707197 64.971903 -4.264705 4.264705 4.0
33 Kentucky Powell 57.059533 61.727293 -4.667760 4.667760 5.0
81 Texas Schleicher 74.386503 77.384464 -2.997961 2.997961 2.0
93 West Virginia Morgan 60.857614 64.068711 -3.211097 3.211097 3.0
HYPOTHESIS TESTING IN PYTHON
Wilcoxon signed-rank test (Step 4)
state county repub_percent_08 repub_percent_12 diff abs_diff rank_abs_diff
80 Texas Red River 68.507522 69.944817 -1.437295 1.437295 1.0
84 Texas Walker 60.707197 64.971903 -4.264705 4.264705 4.0
33 Kentucky Powell 57.059533 61.727293 -4.667760 4.667760 5.0
81 Texas Schleicher 74.386503 77.384464 -2.997961 2.997961 2.0
93 West Virginia Morgan 60.857614 64.068711 -3.211097 3.211097 3.0
Incorporate the sums of the ranks for the negative and positive differences
T_minus = 1 + 4 + 5 + 2 + 3  # all five differences are negative, so T_minus = 15
T_plus = 0
W = np.min([T_minus, T_plus])  # W = 0
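The four steps above can be condensed into a few lines of pure Python using the `diff` values from the table (a sketch of the manual calculation, not pingouin's implementation):

```python
# Differences (repub_percent_08 - repub_percent_12) from the table above
diffs = [-1.437295, -4.264705, -4.667760, -2.997961, -3.211097]

# Rank the absolute differences (no ties here, so a simple sort suffices)
ordered = sorted(abs(d) for d in diffs)
ranks = [ordered.index(abs(d)) + 1 for d in diffs]

# Sum the ranks for negative and positive differences separately
T_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)  # 15: all diffs negative
T_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)   # 0
W = min(T_minus, T_plus)
print(W)  # 0, matching the W-val from pingouin.wilcoxon()
```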
HYPOTHESIS TESTING IN PYTHON
Implementation with pingouin.wilcoxon()
alpha = 0.01
pingouin.wilcoxon(x=repub_votes_small['repub_percent_08'],
                  y=repub_votes_small['repub_percent_12'],
alternative="less")
W-val alternative p-val RBC CLES
Wilcoxon 0.0 less 0.03125 -1.0 0.72
Fail to reject H0 , since 0.03125 > 0.01
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Non-parametric
ANOVA and
unpaired t-tests
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Wilcoxon-Mann-Whitney test
Also known as the Mann-Whitney U test
A t-test on the ranks of the numeric input
Works on unpaired data
HYPOTHESIS TESTING IN PYTHON
Wilcoxon-Mann-Whitney test setup
age_vs_comp = stack_overflow[['converted_comp', 'age_first_code_cut']]
age_vs_comp_wide = age_vs_comp.pivot(columns='age_first_code_cut',
values='converted_comp')
age_first_code_cut adult child
0 77556.0 NaN
1 NaN 74970.0
2 NaN 594539.0
... ... ...
2258 NaN 97284.0
2259 NaN 72000.0
2260 NaN 180000.0
[2261 rows x 2 columns]
HYPOTHESIS TESTING IN PYTHON
Wilcoxon-Mann-Whitney test
alpha=0.01
import pingouin
pingouin.mwu(x=age_vs_comp_wide['child'],
y=age_vs_comp_wide['adult'],
alternative='greater')
U-val alternative p-val RBC CLES
MWU 744365.5 greater 1.902723e-19 -0.222516 0.611258
HYPOTHESIS TESTING IN PYTHON
Kruskal-Wallis test
Kruskal-Wallis test is to Wilcoxon-Mann-Whitney test as ANOVA is to t-test
alpha=0.01
pingouin.kruskal(data=stack_overflow,
dv='converted_comp',
between='job_sat')
Source ddof1 H p-unc
Kruskal job_sat 4 72.814939 5.772915e-15
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Congratulations!
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Course recap
Chapter 1
Workflow for testing proportions vs. a hypothesized value
False negative/false positive errors
Chapter 2
Testing differences in sample means between two groups using t-tests
Extending this to more than two groups using ANOVA and pairwise t-tests
Chapter 3
Testing differences in sample proportions between two groups using proportion tests
Using chi-square independence/goodness of fit tests
Chapter 4
Reviewing assumptions of parametric hypothesis tests
Examining non-parametric alternatives when assumptions aren't valid
HYPOTHESIS TESTING IN PYTHON
More courses
Inference
Statistics Fundamentals with Python skill track
Bayesian statistics
Bayesian Data Analysis in Python
Applications
Customer Analytics and A/B Testing in Python
HYPOTHESIS TESTING IN PYTHON
Congratulations!
HYPOTHESIS TESTING IN PYTHON