ITCS 6100: Big Data for Competitive Advantage
Data Driven Decision Making: A/B Testing
Part II
Dr. Gabriel Terejanu
Pitfalls in A/B Testing
• Confusing p-value with probability of the null hypothesis being
true. A p-value does not indicate the probability of the null hypothesis
being true or the alternative hypothesis being true. It only indicates the
probability of observing the data given that the null hypothesis is true.
• Using a p-value as the sole criterion for accepting or rejecting the
null hypothesis. A p-value should not be the only criterion used to
make a decision about the null hypothesis. Other factors such as the
strength of the evidence, the practical significance of the results, and
the potential for confounding variables should also be considered.
• Ignoring the effect size. A p-value only tells us whether a difference is
statistically significant, not whether it is practically significant. The effect
size should also be considered to understand the magnitude of the
difference. If the effect size is too small then it may not have practical
significance.
A/B testing steps
1. Develop your A/B Testing Framework
2. Generate a hypothesis & product variant
3. Define your goals and metrics
4. Determine the sample size (Power analysis)
5. Run A/B test
6. Analyze results
7. Deploy winner
8. Rinse and Repeat
(1) Develop your A/B Testing Framework
• How will you actually implement the random allocation of users
into different experiences?
• Use A/A testing as a rehearsal to work out all the mechanics of
A/B testing and make sure that your randomization is working.
A/A testing (test your analysis)
• Run lots of A/A tests (no differences between experimental and
control conditions)
• We should only observe p-values of 0.05 or less about 5% of
the time
• The p-value distribution should be uniform rather than skewed to
low or high values
Python Example A/A testing
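A minimal sketch of the A/A check described above (the metric distribution, sample sizes, and number of repetitions are illustrative assumptions, not course-provided code): draw both "variants" from the same population, run many tests, and confirm that roughly 5% of p-values fall below 0.05 and that the p-value distribution is roughly uniform.

```python
# A/A testing sketch: both groups come from the SAME population, so any
# "significant" result is a false positive by construction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_users = 1000, 500
p_values = []

for _ in range(n_tests):
    a = rng.normal(loc=10.0, scale=2.0, size=n_users)
    b = rng.normal(loc=10.0, scale=2.0, size=n_users)
    _, p = stats.ttest_ind(a, b)
    p_values.append(p)

p_values = np.array(p_values)
print(f"Share of p-values < 0.05: {np.mean(p_values < 0.05):.3f}")  # expect ~0.05
# The histogram counts should be roughly flat if the randomization/analysis is sound.
counts, _ = np.histogram(p_values, bins=10, range=(0, 1))
print("p-value histogram counts:", counts)
```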
(2) Generate hypothesis & product variant
• How will you generate A/B testing ideas?
Use observable data to generate new hypotheses.
• Example: Heatmaps and Scrollmaps allow you to decide where you
should focus your energy.
• WallMonkeys decided to run an A/B test by exchanging the stock-style
image with a more whimsical alternative that would show visitors the
opportunities they could enjoy with WallMonkeys products.
(3) Define goals & metrics
• Example bad experimentation
• Your data scientist makes an observation:
2% of queries end up with “No results.”
• Manager: must reduce.
Assigns a team to minimize “no results” metric
• Metric improves, but results for the query “brochure paper” are crap (or in this case, paper to clean crap)
• Sometimes it *is* better to show “No Results.”
This is a good example of gaming the OEC.
• “No results” is the wrong metric.
Real example from my Amazon Prime now search
https://twitter.com/ronnyk/status/713949552823263234
Too many metrics
• If you’re looking at a large number of metrics at the same time,
you’re at risk of making what statisticians call “spurious
correlations.”
• In proper test design, “you should decide on the metrics you’re
going to look at before you execute an experiment and select a
few. The more you’re measuring, the more likely that you’re
going to see random fluctuations.”
• With so many metrics, instead of asking yourself, “What’s
happening with this variable?” you’re asking, “What interesting
(and potentially insignificant) changes am I seeing?”
https://hbr.org/2017/06/a-refresher-on-ab-testing
A/B testing: type of data & statistical test
https://www.kaggle.com/code/janiezj/a-b-testing-examples-from-easy-to-advanced
(4) Determine sample size (Power analysis)
• How much data do we need to collect?
• When do we stop the experiment?
• For this we need to implement Power analysis. We need:
1. Power level: the probability of correctly rejecting the null
hypothesis (H0) when the alternative hypothesis (Ha) is true.
2. Significance level: the probability of making the error of
rejecting a true null hypothesis.
3. Effect size (Cohen’s d): the strength of a relationship or the
magnitude of a difference between two groups.
Power level
• In the case of an independent t-test, power is the probability of
correctly detecting a difference in means between the treatment
group and the control group, if such a difference exists.
• Power is an important consideration when designing a study
because it determines the sample size required to achieve a
certain level of precision.
• Higher power means that the test is more likely to detect a true
difference. Increasing the power will increase the required
sample size.
• Lower power means that the test is less likely to detect a true
difference.
• It is generally accepted that power should be 80% or greater in order to find a statistically significant difference when there is one.
Significance level
• The significance level, also known as the alpha level, is a
probability threshold to determine whether to reject or fail to
reject the null hypothesis. It is the probability of rejecting a
true null hypothesis.
• The most commonly used significance level is 0.05, which
means that there is a 5% chance of rejecting a true null
hypothesis.
• The significance level is chosen before the study is
conducted, and it reflects the level of risk that the researcher is
willing to accept for rejecting a true null hypothesis.
• Lowering the significance level, for example to 0.01, makes the
test more conservative, and thus less likely to reject a true null
hypothesis, but it also makes it less likely to detect a true
difference if it exists.
Effect size
• The effect size is a measure of the strength of a relationship or the
magnitude of a difference between two groups.
• Cohen's d: This is a standardized mean difference, which compares the
means of two groups and is calculated as the difference between the
two means divided by the pooled standard deviation. To interpret the
resulting number, most social scientists use this general guide
developed by Cohen:
• < 0.1 = trivial effect
• 0.1 - 0.3 = small effect
• 0.3 - 0.5 = moderate effect
• > 0.5 = large effect
• Because effect size can only be calculated after you collect data
from program participants, you will have to use an estimate for the
power analysis.
• Common practice is to use a value of 0.5 as it indicates a moderate to
large difference.
https://meera.seas.umich.edu/power-analysis-statistical-significance-effect-size.html
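A short sketch of computing Cohen's d from two samples using the pooled standard deviation described above (the sample values below are illustrative assumptions):

```python
# Cohen's d: difference in means divided by the pooled standard deviation.
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference between two independent samples."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, 200)
treatment = rng.normal(11.0, 2.0, 200)   # true mean difference of 1.0 with sd 2.0 -> d around 0.5
print(f"Cohen's d: {cohens_d(treatment, control):.2f}")
```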
Python Example Power analysis
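A minimal power-analysis sketch using statsmodels, assuming the conventional inputs discussed above (effect size 0.5, significance level 0.05, power 0.8); it solves for the sample size required per group.

```python
# Power analysis for an independent t-test: given effect size, alpha, and power,
# solve for the number of observations needed in each group.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                   ratio=1.0, alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 64 per group
```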
(5) Run A/B test
• Too many managers don’t let the tests run their course. Because most of the
software for running these tests lets you watch results in real time, managers
want to make decisions too quickly.
• This mistake, he says, “evolves out of impatience,” and many software vendors
have played into this overeagerness by offering a type of A/B testing called
“real-time optimization,” in which you can use algorithms to make adjustments
as results come in.
• Problems with early stopping:
• Inaccurate results: Stopping the A/B Test early can also result in inaccurate
results, because the sample size may not be large enough to be
representative of the population. This can lead to false positives (rejecting
the null hypothesis when it is true) or false negatives (failing to reject the
null hypothesis when it is false), which can bias the results of the test.
• Lack of statistical power: Stopping the A/B Test early can also result in a
lack of statistical power, which means that the test may not have a high
probability of detecting a significant difference between the control and
treatment groups, if one exists. This can make it difficult to determine whether
the changes made to the treatment group had a significant effect on the key
metrics of the test.
https://hbr.org/2017/06/a-refresher-on-ab-testing
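A small illustrative simulation (an assumed setup, not from the cited article) of why early stopping is risky: if you check the p-value repeatedly as data arrives and stop the first time it dips below 0.05, the false-positive rate climbs well above the nominal 5% even when the two variants are identical.

```python
# "Peeking" simulation: both groups are drawn from the same population (H0 true),
# yet stopping at the first significant peek produces far more than 5% false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, max_n, peek_every = 1000, 1000, 100
false_positives = 0

for _ in range(n_experiments):
    a = rng.normal(0, 1, max_n)
    b = rng.normal(0, 1, max_n)
    for n in range(peek_every, max_n + 1, peek_every):
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:                      # stop early at the first "significant" peek
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / n_experiments:.2f}")
# A single test at the planned sample size would keep this near 0.05.
```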
(6) Analyze results
• Remove outliers & account for duplicate users
• Example
• An experiment treatment with 100,000 users on Amazon, where 2%
convert with an average of $30.
• Total revenue = 100,000*2%*$30 = $60,000.
• A lift of 2% is $1,200
• Sometimes (rarely) a “user” purchases double the lift amount, or
around $2,500.
• That single user who falls into Control or Treatment is enough to
significantly skew the result.
• The discovery: libraries purchase books irregularly and order a lot
each time
• Solution: cap the attribute value of single users to the 99th percentile
of the distribution
https://exp-platform.com/2017abtestingtutorial/
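A minimal sketch of the capping idea with illustrative (simulated) revenue numbers: clip each user's metric at the 99th percentile of the observed distribution so that a single extreme purchaser, such as a library bulk order, cannot dominate the comparison.

```python
# Cap user-level revenue at the 99th percentile before computing group means.
import numpy as np

rng = np.random.default_rng(7)
revenue = rng.exponential(scale=30.0, size=100_000)   # illustrative per-user revenue
revenue[0] = 2500.0                                   # one extreme "user"

cap = np.percentile(revenue, 99)
revenue_capped = np.clip(revenue, None, cap)          # values above the cap are truncated

print(f"99th percentile cap: ${cap:.2f}")
print(f"Mean before capping: ${revenue.mean():.2f}, after: ${revenue_capped.mean():.2f}")
```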
Conversion Rate
https://www.kaggle.com/code/janiezj/a-b-testing-examples-from-easy-to-advanced
Conversion Rate
• Let’s imagine you work on the product team at a medium-sized online
e-commerce business.
• The UX designer worked really hard on a new version of the product
page, with the hope that it will lead to a higher conversion rate.
• The product manager (PM) told you that the current conversion
rate is about 13% on average throughout the year.
• The team would be happy with an increase of 2 percentage points, meaning that the
new design will be considered a success if it raises the conversion rate
to 15%.
https://towardsdatascience.com/ab-testing-with-python-e5964dd66143
Python Example Conversion Rate
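A sketch of the scenario above in Python, assuming a two-proportion z-test for the 13% baseline vs. 15% target; the conversion counts are simulated for illustration, not real data.

```python
# Size the experiment for a 13% -> 15% lift, then run a two-proportion z-test.
import numpy as np
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest
from statsmodels.stats.power import NormalIndPower

# Required sample size per group at alpha = 0.05 and power = 0.8.
effect_size = proportion_effectsize(0.15, 0.13)
n_required = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                          power=0.8, ratio=1.0)
print(f"Required sample size per group: {n_required:.0f}")

# Two-proportion z-test on simulated conversion counts for control vs. treatment.
rng = np.random.default_rng(3)
n = int(round(n_required))
conversions = np.array([rng.binomial(n, 0.13), rng.binomial(n, 0.15)])
observations = np.array([n, n])
z_stat, p_value = proportions_ztest(count=conversions, nobs=observations)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```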