ITCS 6100: Big Data for Competitive Advantage
Data Driven Decision Making: A/B Testing
Part II
Dr. Gabriel Terejanu
Pitfalls in A/B Testing
• Confusing p-value with probability of the null hypothesis being
true. A p-value does not indicate the probability of the null hypothesis
being true or the alternative hypothesis being true. It only indicates the
probability of observing the data given that the null hypothesis is true.
• Using a p-value as the sole criterion for accepting or rejecting the
null hypothesis. A p-value should not be the only criterion used to
make a decision about the null hypothesis. Other factors such as the
strength of the evidence, the practical significance of the results, and
the potential for confounding variables should also be considered.
• Ignoring the effect size. A p-value only tells us whether a difference is
statistically significant, not whether it is practically significant. The effect
size should also be considered to understand the magnitude of the
difference. If the effect size is too small then it may not have practical
significance.
A/B testing steps
1. Develop your A/B Testing Framework
2. Generate a hypothesis & product variant
3. Define your goals and metrics
4. Determine the sample size (Power analysis)
5. Run A/B test
6. Analyze results
7. Deploy winner
8. Rinse and Repeat
(1) Develop your A/B Testing Framework
• How will you actually implement the random allocation of users
into different experiences?
• Use A/A testing as a rehearsal to work out all the mechanics of
A/B testing and make sure that your randomization is working.
A/A testing (test your analysis)
• Run lots of A/A tests (no differences between experimental and
control conditions)
• We should only observe p-values of 0.05 or less about 5% of
the time
• The p-value distribution should be uniform rather than skewed to
low or high values
Python Example A/A testing
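A minimal sketch of the A/A check described above (the metric distribution, sample sizes, and number of repetitions are illustrative assumptions, not course-provided code): draw both "variants" from the same population, run many tests, and confirm that roughly 5% of p-values fall below 0.05 and that the p-value distribution is roughly uniform.

```python
# A/A testing sketch: both groups come from the SAME population, so any
# "significant" result is a false positive by construction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_users = 1000, 500
p_values = []

for _ in range(n_tests):
    a = rng.normal(loc=10.0, scale=2.0, size=n_users)
    b = rng.normal(loc=10.0, scale=2.0, size=n_users)
    _, p = stats.ttest_ind(a, b)
    p_values.append(p)

p_values = np.array(p_values)
print(f"Share of p-values < 0.05: {np.mean(p_values < 0.05):.3f}")  # expect ~0.05
# The histogram counts should be roughly flat if the randomization/analysis is sound.
counts, _ = np.histogram(p_values, bins=10, range=(0, 1))
print("p-value histogram counts:", counts)
```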
(2) Generate hypothesis & product variant
• How will you generate A/B testing ideas?
Use observable data to generate new hypotheses.
• Example: Heatmaps and Scrollmaps allow you to decide where you
should focus your energy.
• WallMonkeys decided to run an A/B test by exchanging the stock-style
image with a more whimsical alternative that would show visitors the
opportunities they could enjoy with WallMonkeys products.
(3) Define goals & metrics
• Example bad experimentation
• Your data scientist makes an observation:
2% of queries end up with “No results.”
• Manager: must reduce.
Assigns a team to minimize “no results” metric
• Metric improves, but results for the query “brochure paper” are crap (or in this case, paper to clean crap)
• Sometimes it *is* better to show “No Results.”
This is a good example of gaming the OEC.
• “No results” is the wrong metric.
Real example from my Amazon Prime now search
https://twitter.com/ronnyk/status/713949552823263234
Too many metrics
• If you’re looking at a large number of metrics at the same time,
you’re at risk of making what statisticians call “spurious
correlations.”
• In proper test design, “you should decide on the metrics you’re
going to look at before you execute an experiment and select a
few. The more you’re measuring, the more likely that you’re
going to see random fluctuations.”
• With so many metrics, instead of asking yourself, “What’s
happening with this variable?” you’re asking, “What interesting
(and potentially insignificant) changes am I seeing?”
https://hbr.org/2017/06/a-refresher-on-ab-testing
A/B testing: type of data & statistical test
https://www.kaggle.com/code/janiezj/a-b-testing-examples-from-easy-to-advanced
(4) Determine sample size (Power analysis)
• How much data do we need to collect?
• When do we stop the experiment?
• For this we need to implement Power analysis. We need:
1. Power level: the probability of correctly rejecting the null
hypothesis (H0) when the alternative hypothesis (Ha) is true.
2. Significance level: the probability of making the error of
rejecting a true null hypothesis.
3. Effect size (Cohen’s d): the strength of a relationship or the
magnitude of a difference between two groups.
Power level
• In the case of an independent t-test, power is the probability of
correctly detecting a difference in means between the treatment
group and the control group, if such a difference exists.
• Power is an important consideration when designing a study
because it determines the sample size required to achieve a
certain level of precision.
• Higher power means that the test is more likely to detect a true
difference. Increasing the power will increase the required
sample size.
• Lower power means that the test is less likely to detect a true
difference.
• It is generally accepted that power should be 80% or greater in order to find a statistically significant difference when there is one.
Significance level
• The significance level, also known as the alpha level, is a
probability threshold to determine whether to reject or fail to
reject the null hypothesis. It is the probability of rejecting a
true null hypothesis.
• The most commonly used significance level is 0.05, which
means that there is a 5% chance of rejecting a true null
hypothesis.
• The significance level is chosen before the study is
conducted, and it reflects the level of risk that the researcher is
willing to accept for rejecting a true null hypothesis.
• Lowering the significance level, for example to 0.01, makes the
test more conservative, and thus less likely to reject a true null
hypothesis, but it also makes it less likely to detect a true
difference if it exists.
Effect size
• The effect size is a measure of the strength of a relationship or the
magnitude of a difference between two groups.
• Cohen's d: This is a standardized mean difference, which compares the
means of two groups and is calculated as the difference between the
two means divided by the pooled standard deviation. To interpret the
resulting number, most social scientists use this general guide
developed by Cohen:
• < 0.1 = trivial effect
• 0.1 - 0.3 = small effect
• 0.3 - 0.5 = moderate effect
• > 0.5 = large effect
• Because effect size can only be calculated after you collect data
from program participants, you will have to use an estimate for the
power analysis.
• Common practice is to use a value of 0.5 as it indicates a moderate to
large difference.
https://meera.seas.umich.edu/power-analysis-statistical-significance-effect-size.html
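A short sketch of computing Cohen's d from two samples using the pooled standard deviation described above (the sample values below are illustrative assumptions):

```python
# Cohen's d: difference in means divided by the pooled standard deviation.
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference between two independent samples."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, 200)
treatment = rng.normal(11.0, 2.0, 200)   # true mean difference of 1.0 with sd 2.0 -> d around 0.5
print(f"Cohen's d: {cohens_d(treatment, control):.2f}")
```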
Python Example Power analysis
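A minimal power-analysis sketch using statsmodels, assuming the conventional inputs discussed above (effect size 0.5, significance level 0.05, power 0.8); it solves for the sample size required per group.

```python
# Power analysis for an independent t-test: given effect size, alpha, and power,
# solve for the number of observations needed in each group.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                   ratio=1.0, alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 64 per group
```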
(5) Run A/B test
• Too many managers don’t let the tests run their course. Because most of the
software for running these tests lets you watch results in real time, managers
want to make decisions too quickly.
• This mistake, he says, “evolves out of impatience,” and many software vendors
have played into this overeagerness by offering a type of A/B testing called
“real-time optimization,” in which you can use algorithms to make adjustments
as results come in.
• Problems with early stopping:
• Inaccurate results: Stopping the A/B Test early can also result in inaccurate
results, because the sample size may not be large enough to be
representative of the population. This can lead to false positives (rejecting
the null hypothesis when it is true) or false negatives (failing to reject the
null hypothesis when it is false), which can bias the results of the test.
• Lack of statistical power: Stopping the A/B Test early can also result in a
lack of statistical power, which means that the test may not have a high
probability of detecting a significant difference between the control and
treatment groups, if one exists. This can make it difficult to determine whether
the changes made to the treatment group had a significant effect on the key
metrics of the test.
https://hbr.org/2017/06/a-refresher-on-ab-testing
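A small illustrative simulation (an assumed setup, not from the cited article) of why early stopping is risky: if you check the p-value repeatedly as data arrives and stop the first time it dips below 0.05, the false-positive rate climbs well above the nominal 5% even when the two variants are identical.

```python
# "Peeking" simulation: both groups are drawn from the same population (H0 true),
# yet stopping at the first significant peek produces far more than 5% false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, max_n, peek_every = 1000, 1000, 100
false_positives = 0

for _ in range(n_experiments):
    a = rng.normal(0, 1, max_n)
    b = rng.normal(0, 1, max_n)
    for n in range(peek_every, max_n + 1, peek_every):
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:                      # stop early at the first "significant" peek
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / n_experiments:.2f}")
# A single test at the planned sample size would keep this near 0.05.
```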
(6) Analyze results
• Remove outliers & account for duplicate users
• Example
• An experiment treatment with 100,000 users on Amazon, where 2%
convert with an average of $30.
• Total revenue = 100,000*2%*$30 = $60,000.
• A lift of 2% is $1,200
• Sometimes (rarely) a “user” purchases double the lift amount, or
around $2,500.
• That single user who falls into Control or Treatment is enough to
significantly skew the result.
• The discovery: libraries purchase books irregularly and order a lot
each time
• Solution: cap the attribute value of single users to the 99th percentile
of the distribution
https://exp-platform.com/2017abtestingtutorial/
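A minimal sketch of the capping idea with illustrative (simulated) revenue numbers: clip each user's metric at the 99th percentile of the observed distribution so that a single extreme purchaser, such as a library bulk order, cannot dominate the comparison.

```python
# Cap user-level revenue at the 99th percentile before computing group means.
import numpy as np

rng = np.random.default_rng(7)
revenue = rng.exponential(scale=30.0, size=100_000)   # illustrative per-user revenue
revenue[0] = 2500.0                                   # one extreme "user"

cap = np.percentile(revenue, 99)
revenue_capped = np.clip(revenue, None, cap)          # values above the cap are truncated

print(f"99th percentile cap: ${cap:.2f}")
print(f"Mean before capping: ${revenue.mean():.2f}, after: ${revenue_capped.mean():.2f}")
```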
Conversion Rate
https://www.kaggle.com/code/janiezj/a-b-testing-examples-from-easy-to-advanced
Conversion Rate
• Let’s imagine you work on the product team at a medium-sized online
e-commerce business.
• The UX designer worked really hard on a new version of the product
page, with the hope that it will lead to a higher conversion rate.
• The product manager (PM) told you that the current conversion
rate is about 13% on average throughout the year.
• The team would be happy with an increase of 2 percentage points, meaning that the
new design will be considered a success if it raises the conversion rate
to 15%.
https://towardsdatascience.com/ab-testing-with-python-e5964dd66143
Python Example Conversion Rate
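A sketch of the scenario above in Python, assuming a two-proportion z-test for the 13% baseline vs. 15% target; the conversion counts are simulated for illustration, not real data.

```python
# Size the experiment for a 13% -> 15% lift, then run a two-proportion z-test.
import numpy as np
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest
from statsmodels.stats.power import NormalIndPower

# Required sample size per group at alpha = 0.05 and power = 0.8.
effect_size = proportion_effectsize(0.15, 0.13)
n_required = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                          power=0.8, ratio=1.0)
print(f"Required sample size per group: {n_required:.0f}")

# Two-proportion z-test on simulated conversion counts for control vs. treatment.
rng = np.random.default_rng(3)
n = int(round(n_required))
conversions = np.array([rng.binomial(n, 0.13), rng.binomial(n, 0.15)])
observations = np.array([n, n])
z_stat, p_value = proportions_ztest(count=conversions, nobs=observations)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```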