Probably Overthinking It
A blog by Allen Downey.

Tuesday, May 31, 2011

There is only one test!
UPDATE: Here is a more recent (and I think clearer) version of this post.
A few weeks ago I started reading reddit/r/statistics regularly. I post links to this blog there
and the readers have given me some good feedback. And I have answered a few
questions.
One of the most common questions I see is something like, "I have some data. Which test
should I use?" When I see this question, I always have two thoughts:
1) I hate the way statistics is taught, because it gives students the impression that
hypothesis testing is all about choosing the right test, and if you get it wrong the statistics
wonks will yell at you, and
2) There is only one test!
Let me explain. All tests try to answer the same question: "Is the apparent effect real, or is it due to chance?" To answer that question, we formulate two hypotheses: the null hypothesis, H0, is a model of the system if the effect is due to chance; the alternate hypothesis, HA, is a model where the effect is real.

Ideally we should compute the probability of seeing the effect (E) under both hypotheses; that is, P(E | H0) and P(E | HA). But formulating HA is not always easy, so in conventional hypothesis testing, we just compute P(E | H0), which is the p-value. If the p-value is small, we conclude that the effect is unlikely to have occurred by chance, which suggests that it is real.
That's hypothesis testing. All of the so-called tests you learn in statistics class are just ways to compute p-values efficiently. When computation was expensive, these shortcuts were important, but now that computation is virtually free, they are not.

And the shortcuts are often based on simplifying assumptions and approximations. If you violate the assumptions, the results can be misleading, which is why statistics classes are filled with dire warnings and students end up paralyzed with fear.

Fortunately, there is a simple alternative: simulation. If you can construct a model of the null hypothesis, you can estimate p-values just by counting. This figure shows the structure of the simulation:

[Figure: the structure of the simulation -- data and the null hypothesis model both feed a test statistic, and the p-value comes from counting.]
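Since the figure didn't survive here, below is a minimal sketch of that structure in Python. The names (run_model, test_stat, iters) are mine, for illustration; any function that generates one fake dataset under the null hypothesis will do for run_model.

def simulate_p_value(data, run_model, test_stat, iters=1000):
    """Estimates a p-value by counting.

    data: the observed dataset
    run_model: function that generates one fake dataset under H0
    test_stat: function that maps a dataset to a number
    iters: number of simulated datasets to generate
    """
    # delta measures the size of the apparent effect in the real data
    delta = test_stat(data)

    # count how often simulated data shows an effect at least as big
    count = sum(1 for _ in range(iters)
                if test_stat(run_model()) >= delta)
    return count / iters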
The first step is to get data from the system and compute a test statistic. The result is some measure of the size of the effect, which I call "delta". For example, if you are comparing the means of two groups, delta is the difference in the means. If you are comparing actual values with expected values, delta is a chi-squared statistic (described below) or some other measure of the distance between the observed and expected values.

The null hypothesis is a model of the system under the assumption that the apparent effect is due to chance. For example, if you observed a difference in means between two groups, the null hypothesis is that the two groups are the same.
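As a concrete illustration (mine, not from the post), a difference-in-means test statistic is just a few lines:

def DiffInMeans(group1, group2):
    """Test statistic: absolute difference in the means of two groups."""
    mean1 = sum(group1) / len(group1)
    mean2 = sum(group2) / len(group2)
    return abs(mean1 - mean2)

One way to model that null hypothesis is to pool the two groups, shuffle, and deal out two new groups of the original sizes.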
The next step is to use the model to generate simulated data that has the same sample size as the actual data. Then apply the test statistic to the simulated data.

The last step is the easiest; all you have to do is count how many times the test statistic for the simulated data exceeds delta. That's the p-value.
As an example, let's do the Casino problem from Think Stats:
Suppose you run a casino and you suspect that a customer has replaced a die provided by the casino with a "crooked die"; that is, one that has been tampered with to make one of the faces more likely to come up than the others. You apprehend the alleged cheater and confiscate the die, but now you have to prove that it is crooked. You roll the die 60 times and get the following results:

Value      1   2   3   4   5   6
Frequency  8   9  19   6   8  10

What is the probability of seeing results like this by chance?
The null hypothesis is that the die is fair. Under that assumption, the expected frequency for each value is 10, so the frequencies 8, 9 and 10 are not surprising, but 6 is a little funny and 19 is suspicious.
To compute a p-value, we have to choose a test statistic that measures how unexpected these results are. The chi-squared statistic is a reasonable choice: for each value we compare the expected frequency, exp, and the observed frequency, obs, and compute the sum of the squared relative differences:
def ChiSquared(expected, observed):
    """Computes the chi-squared statistic.

    expected and observed are Hist objects from the Think Stats
    Pmf module, mapping each value to its frequency.
    """
    total = 0.0
    for x, exp in expected.Items():
        obs = observed.Freq(x)
        total += (obs - exp)**2 / exp
    return total
Why relative? Because the variation in the observed values depends on the expected
value. Why squared? Well, squaring makes the differences positive, so they don't cancel
each other when we add them up. But other than that, there is no special reason to choose
the exponent 2. The absolute value would also be a reasonable choice.
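To make that concrete, here is a variant (my sketch, not code from the post) that uses absolute relative differences instead of squared ones:

def AbsDiff(expected, observed):
    """Alternative test statistic: sum of absolute relative differences."""
    total = 0.0
    for x, exp in expected.Items():
        obs = observed.Freq(x)
        total += abs(obs - exp) / exp
    return total

Since the choice of test statistic is somewhat arbitrary anyway, either version answers the question.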
For the observed frequencies, the chi-squared statistic is 20.6. By itself, this number doesn't
mean anything. We have to compare it to results from the null hypothesis.
Here is the code that generates simulated data:
import random

import Pmf  # Hist comes from the Think Stats Pmf module

def SimulateRolls(sides, num_rolls):
    """Generates a Hist of simulated die rolls.

    Args:
        sides: number of sides on the die
        num_rolls: number of times to roll the die

    Returns:
        Hist object
    """
    hist = Pmf.Hist()
    for i in range(num_rolls):
        roll = random.randint(1, sides)
        hist.Incr(roll)
    return hist
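The code below needs expected and observed as Hist objects. The post doesn't show how they are built; here is one way, assuming Hist.Incr accepts an increment amount, as it does in the Think Stats Pmf module:

sides = 6
num_rolls = 60

# observed frequencies from the table above
observed = Pmf.Hist()
for value, freq in zip(range(1, sides + 1), [8, 9, 19, 6, 8, 10]):
    observed.Incr(value, freq)

# under the null hypothesis, each face is expected 10 times
expected = Pmf.Hist()
for value in range(1, sides + 1):
    expected.Incr(value, num_rolls / sides)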
And here is the code that runs the simulations:
count = 0.0
num_trials = 1000
threshold = ChiSquared(expected, observed)

for _ in range(num_trials):
    simulated = SimulateRolls(sides, num_rolls)
    chi2 = ChiSquared(expected, simulated)
    if chi2 >= threshold:
        count += 1

pvalue = count / num_trials
print('p-value', pvalue)
Out of 1000 simulations, only 34 generated a chi-squared value greater than 20.6, so the
estimated p-value is 3.4%. I would characterize that as borderline significant. Based on this
result, I think the die is probably crooked, but would I have the guy whacked? No, I would
not. Maybe just roughed up a little.
For this problem, I could have computed the sampling distribution of the test statistic
analytically and computed the p-value quickly and exactly. If I had to run this test many
times for large datasets, computational efficiency might be important. But usually it's not.
And accuracy isn't very important either. Remember that the test statistic is arbitrary, and
the null hypothesis often involves arbitrary choices, too. There is no point in computing an
exact answer to an arbitrary question.
For most problems, we only care about the order of magnitude: if the p-value is smaller than 1/100, the effect is likely to be real; if it is greater than 1/10, probably not. If you think there is a difference between a p-value of 4.8% (significant!) and 5.2% (not significant!), you are taking it too seriously.
So the advantages of analysis are mostly irrelevant, but the disadvantages are not:
1) Analysis often dictates the test statistic; simulation lets you choose whatever test statistic is most appropriate. For example, if someone is cheating at craps, they will load the die to increase the frequency of 3 and/or 4, not 1 and/or 6. So in the casino problem the results are suspicious not just because one of the frequencies is high, but also because the frequent value is 3. We could construct a test statistic that captures this domain knowledge (see the sketch after this list), and the resulting p-value would be lower.
2) Analytic methods are inflexible. If you have issues like censored data, non-independence, and long-tailed distributions, you won't find an off-the-shelf test; and unless you are a mathematical statistician, you won't be able to make one. With simulation, these kinds of issues are easy.
3) When people think of analytic methods as black boxes, they often fixate on finding the
right test and figuring out how to apply it, instead of thinking carefully about the problem.
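For point 1, here is one hedged sketch of such a statistic; it is my illustration, not the post's. It counts only the excess frequency of the faces a craps cheater would load:

def CrapsStat(expected, observed):
    """Test statistic that only rewards excess 3s and 4s."""
    total = 0.0
    for x in [3, 4]:
        exp = expected.Freq(x)
        obs = observed.Freq(x)
        total += max(obs - exp, 0) / exp
    return total

Plugging CrapsStat into the simulation loop in place of ChiSquared gives a test tailored to the cheating we actually suspect.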
In summary, don't worry about finding the "right" test. There's no such thing. Put your effort
into choosing a test statistic (one that reflects the effect you are testing) and modeling the
null hypothesis. Then simulate it, and count!
Added June 24, 2011: For some additional examples, see this followup post.
Posted by Allen Downey at 6:28 PM
Wednesday, May 4, 2011
Think Stats will be published by O'Reilly in June
No statistics today, just news. I signed a contract with O'Reilly to publish Think Stats. I turn
in the manuscript next week, and it should be out in June. So if you find any errors, now is
the time to tell me!
Here's my new author page at O'Reilly.
Some people have asked if I get to choose the animal on the cover (which is a signature of
O'Reilly cover design). The short answer is no.
First, I'm actually not sure whether Think Stats will be an "animal book" or part of another
O'Reilly series. We haven't talked about it.
Also, here's what the O'Reilly author materials have to say on the topic:
Cover Design
The purpose of a book cover is to get a potential reader to pick up the book, and to persuade a bookstore to
display it.
We're confident that we have the most striking and effective covers in technical publishing. Despite our
relatively small size, it's not unusual to see window displays of our books at technical bookstores.
Our covers are all designed by Edie Freedman. She is open to suggestions, but has the final say on all cover
designs. Here's what Edie has to say about how she designs the animal covers for Nutshell Handbooks:
I ask the authors to supply me with a description of the topic of the book. What I am looking for
is adjectives that really give me an idea of the "personality" of the topic. Authors are free to
make suggestions about animals, but I prefer to deal with adjectives. Once I have the
information from the author, I spend some time thinking about it, and then I choose an animal.
Sometimes it is based on no more than what the title sounds like. (COFF, for example,
sounded like a walrus noise to me.) Sometimes it is very much linked to the name of the book
or software. (For example, vi, the "Visual Editor," suggested some beast with huge eyes). And
it is always subject to what sort of artwork I can find.
It seems to work best this way. We've tried doing animals that the authors want, and have
found that it works better if I do the selection, and then submit it for approval. I don't think
anyone has been too disappointed with the final choices we've made.
I'll add that if you do want a particular animal, Edie is more inclined to respond to "right brain" reasons than to
obvious mental associations. For example, several people on the net suggested the obvious clam for the Perl
handbook, but Edie responded instead to Larry Wall's obscure argument for the camel: ugly but serviceable,
able to go long distances without a lot of nourishment.
In any event, you'll have a chance to see and comment on Edie's cover design before we commit it to print. If
you really hate it, and can't be persuaded to feel otherwise, she'll probably try again (but no promises!).
So here's what I have for "right brain" ideas: the themes of the book are curiosity and analysis. One of the problems I see with conventional approaches to statistics is that they make people timid -- afraid to do the wrong thing. One of my goals with this approach is to help students be fearless. So I want something curious and fearless. Ideas?
Here's another article about the animal selection process.
And here's the list of animals they've already used.
And here it is:

[Cover image: a line drawing of a fish]
The line drawing looks better at higher resolution, but you get the idea. Can anyone name
that fish?
Posted by Allen Downey at 5:11 PM