Understanding Statistical Testing
Peter J. Veazie
Abstract
Statistical hypothesis testing is common in research, but a conventional understanding sometimes leads to mistaken application
and misinterpretation. The logic of hypothesis testing presented in this article provides for a clearer understanding, application,
and interpretation. Key conclusions are that (a) the magnitude of an estimate on its raw scale (i.e., not calibrated by the
standard error) is irrelevant to statistical testing; (b) which statistical hypotheses are tested cannot generally be known a
priori; (c) if an estimate falls in a hypothesized set of values, that hypothesis does not require testing; (d) if an estimate does
not fall in a hypothesized set, that hypothesis requires testing; (e) the point in a hypothesized set that produces the largest
p value is used for testing; and (f) statistically significant results constitute evidence, but insignificant results do not and must
not be interpreted as evidence for or against the hypothesis being tested.
Keywords
research methods, data processing and interpretation, hypothesis testing, estimation, inference
direct interpretation of CIs or Bayesian methods, nor is this article intended as an argument in favor of a particular method. The following sections define hypotheses and hypothesis testing, distinguish the goal of hypothesis testing from that of parameter estimation, present a logic of testing, and discuss its scope.

Hypothesis Testing Versus Parameter Estimation

By the term hypothesis, I mean a formal proposition whose truth or falsity is unknown. An empirical hypothesis is one for which empirical evidence can, in principle, bear on judgments of its truth or falsity. A statistical hypothesis is an empirical hypothesis about distribution parameters of random variables defined by a data generating process.

To properly understand Frequentist statistical hypothesis testing, it is important to understand that the relevant random variables represent the distribution of possible values that a data generating process could obtain, and not actual data. In this sense, data and corresponding estimates are realizations of the underlying random variables but are not themselves random variables. Hence, the sample mean statistic has a distribution of possible values, whereas the mean of a given sample is a number.

A statistical hypothesis should be stated in terms of distribution parameters of random variables, and not in data-specific terms. If a statement includes reference to data, then it will either not be a hypothesis or it will be uninformative. As an example, consider the claim that "there will be a significant result." In the first case, in terms of the data generating process, there is a particular probability that what is claimed will occur, and the claim is therefore neither true nor false and thereby not a hypothesis. In the second case, in terms of the resulting data, the claim will be either true or false, but it is a proposition by virtue of a necessary numeric characteristic only (of course the result will be either statistically significant or not). The same hypothesis applies to any data generating process, and knowing whether it is true or false is uninformative regarding the data generating process under investigation.

Hypothesis testing is a process by which we can inform judgments of the truth or falsity of a hypothesis. Formal statistical hypothesis testing is a method that compares the data-specific value of a statistic to the statistic's sampling distribution as implied by the hypothesized values of a statistical hypothesis. There are two largely substitutable methods in their common usage. One is to define a set of values in the statistic's range that correspond to sufficiently rare events under the hypothesis-specific distribution (often termed the rejection region); if the data-specific value of the statistic falls in this set, the data are considered evidence against the underlying hypothesis. This is a strictly categorical method of testing. The second is to calculate the probability of obtaining data at least as extreme as that actually obtained from the data generating process under the assumption that the statistical hypothesis is true (commonly termed the p value); if the p value is sufficiently small compared with an a priori set level (commonly called the significance level), the data are considered evidence against the hypothesis being tested. This is also a categorical method of testing; however, the p value can also provide a continuous measure of evidence regarding the hypothesis being tested. The most common use of formal testing is to adopt the categorical approach with its designations of results being "statistically significant" or "statistically insignificant"; I will discuss testing in these terms.
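These two formulations can be made concrete with a short sketch. The following Python code is a minimal illustration, not from the original article (the function name and numbers are hypothetical): it applies both the rejection-region and the p-value formulations to a two-sided z test of a point hypothesis, assuming a normal sampling distribution with a known standard error, and both yield the same categorical conclusion.

```python
# A minimal sketch (not from the original article): the rejection-region and
# p-value formulations of a two-sided z test of the point hypothesis H: mu = 0,
# assuming the statistic's sampling distribution is normal with a known standard error.
from scipy.stats import norm

def two_sided_z_test(estimate, std_error, alpha=0.05):
    z = estimate / std_error                  # statistic calibrated by its standard error
    # Method 1: rejection region -- values at least this extreme are
    # "sufficiently rare" events under H.
    critical_value = norm.ppf(1 - alpha / 2)
    in_rejection_region = abs(z) >= critical_value
    # Method 2: p value -- probability of a statistic at least as extreme
    # as the one observed, computed under H, then compared with alpha.
    p_value = 2 * norm.sf(abs(z))
    return z, critical_value, in_rejection_region, p_value

# Hypothetical numbers: both methods yield the same categorical conclusion.
print(two_sided_z_test(estimate=1.2, std_error=0.5))
```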
By this definition, the classic Neyman–Pearson test (NPT), in which we set our acceptable Type I and Type II error rates and proceed as if the null were true or false according to our test, is not hypothesis testing: Notwithstanding Neyman and Pearson's reference to testing, it is a decision rule, as Neyman and Pearson themselves state (Neyman & Pearson, 1933). Its goal is to decide whether or not to act as if a hypothesis is true, not to judge whether the hypothesis is true: The former is suited to model specification; the latter is suited to generating scientific understanding.

However, Fisher's (1956) approach to statistical inference, in which we use data as evidence for or against the truth of a claim, provides a basis for hypothesis testing by the definition used here. What I provide is in one sense a generalization and in another sense a restriction of Fisher's approach. It is a generalization because Fisher's approach has focused primarily on point hypotheses, whereas the logic I present applies to set hypotheses in general. It is a restriction because Fisher's approach does not explicitly state an alternative, whereas the logic I present addresses sets of hypotheses that partition a parameter space—an idea that Neyman and Pearson (1928a, 1928b) initiated with the introduction of the formal alternative hypothesis.

It is important to distinguish the goals of testing and estimation. The goal of hypothesis testing is to make a judgment regarding the truth or falsity of a hypothesis, whereas the goal of estimation is to make a judgment regarding the value of a parameter.

If I know whether a hypothesis is true or false, I have achieved the goal of hypothesis testing. Suppose I am interested in the hypothesis that "the average annual health care expenditure among men is greater than that among women": It is either true or false; it cannot be nearly true or mostly false. If an honest omniscient being were to tell me "your hypothesis is true," then my goal has been achieved. Knowing the magnitudes of the averages or their difference adds nothing more to achieving this goal.

Suppose you are testing the hypothesis that the effect of a policy intervention is zero, and the result is a substantively trivial but statistically significant difference. A criticism is that you have identified a statistically significant yet substantively insignificant effect. Such a critique is not an indictment against your hypothesis test. Your statement that the data provide evidence the hypothesis is false is not
[Figure: a parameter space partitioned into hypothesized sets labeled H0, H1, H, and ¬H, with one region marked "NOT POSSIBLE."]

... remaining hypotheses.

... or against sets of hypotheses. In such a case, the data cannot adjudicate between the hypotheses in the set of hypotheses with which the data are consistent but can rule out the hypotheses with which the data are inconsistent.
falsity of a statistical hypothesis, then hypothesis testing can be used to inform judgments about the substantive claim. This is the basis for hypothesis-driven science. For example, Kan (2007) derived and tested statistical hypotheses from the claim that time-inconsistent preferences with hyperbolic discounting explain lack of self-control among smokers. Cook, Orav, Liang, Guadagnoli, and Hicks (2011) tested the hypothesis that disparities in the placement of implantable cardioverter-defibrillators (ICD) can be explained by the underutilization of ICD implantation among clinically appropriate racial/ethnic minorities and women and the overutilization of the procedure among clinically inappropriate Whites and men. Veazie (2006) derived and tested statistical hypotheses from the claim that variation in individuals' perceptions of those with chronic medical conditions is explained by Ames's (2004a, 2004b) theory of social inferences. It is the fact that the estimate is consistent with a hypothesized set of parameter values and inconsistent with others that constitutes evidence for the hypothesis, not the raw-scale distance (i.e., magnitude) from the boundary between such sets.
Statistical hypothesis testing is also used when the goal of estimation is of interest only if the parameter is in a particular range of values. The cost–benefit example presented above is one such case. Another is when researchers determine whether a variable predicts an outcome. Stating that something "will either go up or not go up" clearly does not constitute an informative prediction, and stating that something "will either go up or down" (e.g., an inference from a significant two-tailed test) is not much better. Consequently, identifying predictors typically requires isolating a direction. In this case, it is reasonable for the researcher to first address the three hypotheses that (1) the parameter is greater than zero, (2) the parameter is less than zero, and (3) the parameter is equal to zero. Because, in a continuous parameter space, the third hypothesis is a point on the boundary between the first two, testing this set of hypotheses reduces in practice to essentially testing the disjunction of one of the first two with the third. If the estimate conforms to (1), then it is statistically tested against the disjunction of (2) and (3). If the estimate conforms to (2), then it is statistically tested against the disjunction of (1) and (3). If an adequate judgment regarding the truth or falsity of these hypotheses can be made, then the researcher continues with the estimation goal and interprets point or interval estimates accordingly.
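The following Python sketch, not from the original article (the function name and numbers are hypothetical), illustrates this directional procedure: the hypotheses are specified a priori, the sign of the estimate determines which disjunction of the remaining hypotheses is statistically tested, and the test is computed at the boundary value of that disjunction (here, zero).

```python
# A minimal sketch (not from the original article) of the directional procedure:
# hypotheses (1) theta > 0, (2) theta < 0, and (3) theta = 0 are specified a priori;
# which disjunction is statistically tested depends on the sign of the obtained estimate.
from scipy.stats import norm

def directional_test(estimate, std_error):
    z = estimate / std_error
    if estimate > 0:
        # Estimate conforms to (1); test the disjunction of (2) and (3), i.e., theta <= 0.
        tested = "theta <= 0"
        p_value = norm.sf(z)        # probability of a statistic at least this large if theta = 0
    else:
        # Estimate conforms to (2) (or sits on the boundary); test theta >= 0.
        tested = "theta >= 0"
        p_value = norm.cdf(z)       # probability of a statistic at least this small if theta = 0
    return tested, p_value

# Hypothetical numbers: a positive estimate leads to testing theta <= 0.
print(directional_test(estimate=0.8, std_error=0.3))
```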
Discussion

The objective of this article was to present a coherent Frequentist logic of testing. To do so, I distinguished the goal of hypothesis testing from that of estimation, and presented a logic for the former that does not confuse it with the latter. The key points include (a) hypotheses are expressed as a partition of the parameter space specifying the distribution of random variables associated with a data generating process, (b) which of the a priori specified hypotheses are statistically tested cannot generally be known before the parameter is estimated (the exception being when a point hypothesis is involved), (c) the parameter estimate is consistent with the hypothesis to which it conforms, and thereby this hypothesis does not require statistical testing, (d) all hypotheses to which the estimate does not conform are subject to statistical testing to rule them out as alternative explanations, (e) the element in the hypothesis' set of values that produces the largest p value is used to test the hypothesis, and (f) except in the case of a point hypothesis, an estimate can provide either evidence for or against hypotheses (or sets of hypotheses), or remain ambiguous regarding them.
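Key point (e) can be checked numerically. The sketch below is a minimal illustration, not from the original article (the grid and numbers are hypothetical): it scans hypothesized values within the composite hypothesis H: θ ≤ 0 and shows that, for a positive estimate, the p value is largest at the boundary value θ = 0, which is therefore the value used to test H.

```python
# A minimal sketch (not from the original article): for the composite hypothesis
# H: theta <= 0 and a positive estimate, the p value computed at each hypothesized
# value theta0 is largest at the boundary theta0 = 0, so the boundary is used for the test.
import numpy as np
from scipy.stats import norm

estimate, std_error = 0.8, 0.3                 # hypothetical estimate and standard error
theta0_grid = np.linspace(-2.0, 0.0, 201)      # candidate values inside H: theta <= 0
p_values = norm.sf((estimate - theta0_grid) / std_error)

best = theta0_grid[np.argmax(p_values)]
print(f"largest p value {p_values.max():.4f} occurs at theta0 = {best:.2f}")  # theta0 = 0.00
```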
When testing hypotheses, researchers should report whether there is evidence for or against the hypotheses. Moreover, ambiguous findings (i.e., insignificant findings) should not be reported as evidence from a formal test for a hypothesis. For example, the common practice of treating insignificant results of a formal two-tailed test as evidence that there is no effect should be avoided. Instead, it should be acknowledged that the data cannot distinguish the hypotheses or cannot rule out certain alternatives.

In this article, I focused on hypotheses about a single parameter. The presented logic naturally extends to hypotheses regarding multiple parameters as well (e.g., hypotheses regarding two parameters θ and γ, such as H1: [θ > 0 and γ > 0] and H2: [θ ≤ 0 or γ ≤ 0]). See the online appendix for a description of hypothesis testing with multiple parameters.
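One standard way to operationalize such a two-parameter test, consistent with the logic presented here though not necessarily the construction given in the article's appendix, is an intersection-union style test: ruling out H2: [θ ≤ 0 or γ ≤ 0] requires a one-sided test of each component, and the reported p value is the larger of the two. The Python sketch below uses hypothetical estimates and standard errors.

```python
# A minimal sketch (not from the original article): an intersection-union style test
# of H2: [theta <= 0 or gamma <= 0]. The data count as evidence for
# H1: [theta > 0 and gamma > 0] only if every component of H2 is ruled out,
# so the reported p value is the maximum of the two one-sided p values.
from scipy.stats import norm

def intersection_union_p(est_theta, se_theta, est_gamma, se_gamma):
    p_theta = norm.sf(est_theta / se_theta)   # tests theta <= 0 at its boundary
    p_gamma = norm.sf(est_gamma / se_gamma)   # tests gamma <= 0 at its boundary
    return max(p_theta, p_gamma)

# Hypothetical estimates: both components must be individually significant.
print(intersection_union_p(est_theta=0.9, se_theta=0.3, est_gamma=0.5, se_gamma=0.2))
```

As in the single-parameter case, this test is performed only when the estimates themselves conform to H1; otherwise the data cannot constitute evidence for it.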
For clarity of presentation, I adopted the standard concept of a threshold (i.e., significance level) to categorically determine whether data are consistent with a hypothesis, an approach that leads to the common use of categorical statements such as having a "significant result" or an "insignificant result." This does not preclude the determination of a set of thresholds to define multiple categories of evidence, such as weak evidence (e.g., perhaps .1 ≥ p > .05), moderate evidence (e.g., perhaps .05 ≥ p > .01), and strong evidence (e.g., perhaps p ≤ .01). Note, however, that such thresholds are arbitrary, relative to a scientist's judgment, or conventional, relative to the expectations of a community of scholars (e.g., as broad as a discipline or field of study, and as narrow as a specific journal). It should also be clear that it is not necessary to adopt formal thresholds at all in the application of the presented logic: A scientist may directly interpret the evidential value of the p value. For example, notwithstanding the conventional .05 significance level, a scientist may consider p values of .052 and .048 as essentially equivalent in their evidential bearing, perhaps judging that both indicate the data are inconsistent with the hypothesis being tested. Moreover, the logic described here can be applied by considering the p value as a continuous measure of consistency with a value contained in the hypotheses with which the data do not conform. Nonetheless, unlike Bayesian methods, the logic of formal Frequentist hypothesis testing does not imply statements of mathematical probability reflecting subjective beliefs. Consequently, the p value (i.e., the probability that a data generating process would produce a statistic value as extreme as that observed given hypothesized distributional characteristics) requires interpretation by the scientist in light of the context and scientific goals.
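The p value's definition as a property of the data generating process can also be made concrete by simulation. The sketch below, not from the original article (the generating model and numbers are hypothetical), approximates a p value by repeatedly drawing samples from a hypothesized data generating process and computing the proportion of simulated statistic values at least as extreme as the one observed.

```python
# A minimal sketch (not from the original article): a Monte Carlo approximation of a
# p value. Under the hypothesized data generating process (here, a normal distribution
# with mean 0), the p value is the probability that the process produces a sample mean
# at least as extreme as the one observed.
import numpy as np

rng = np.random.default_rng(0)
n, observed_mean = 50, 0.35                    # hypothetical sample size and observed statistic
sims = rng.normal(loc=0.0, scale=1.0, size=(100_000, n)).mean(axis=1)
p_value = (sims >= observed_mean).mean()       # one-sided: at least as large as observed
print(f"simulated p value: {p_value:.4f}")
```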
A final point of clarification may be helpful. I have mentioned the need for a priori specification of hypotheses but also the fact that one cannot determine which hypothesis (or hypotheses) will be statistically tested before observing the estimate (except when including a point hypothesis). These do not conflict. The first is an epistemic requirement for using test results as evidence. The second is a logical consequence of the first bearing on the process of establishing evidence. The first means you should not use the data to determine whether you are addressing, for example, the hypotheses H1: θ ≤ 0 and H2: θ > 0, or you are addressing the hypotheses H1: θ = 0 and H2: θ ≠ 0. This specification should be determined a priori. Notice this precludes arbitrarily doubling your power given your results. Suppose, however, I were to decide ahead of time that I will statistically test H2: θ > 0 and I subsequently obtain an estimate θ = 5. It is unclear how I would formally test H2: θ > 0 given the estimate. What value in H2 do I base the test on, and how would it be structured? There is no useful answer. How can the fact that the estimate is in the hypothesized set of values provide evidence against the hypothesized values? It remains, however, to rule out H1, but this is not my a priori specified statistical test. The steps in Table 1 avoid this issue because the hypothesis that is statistically tested, but not the hypothesis specification, depends on the obtained result and is not determined a priori.

By following the logic of formal statistical testing presented here, a researcher does not confuse the goal of testing with that of estimation and can thereby avoid the conflict inherent in interpreting results of testing and interpreting raw-scale magnitudes of estimation. However, a limitation of hypothesis testing is that it provides evidence solely for the truth or falsity of the specified hypotheses. It is the responsibility of the researcher to justify knowing this fact. For the case in which knowing the truth or falsity of a hypothesis is not important, formal hypothesis testing is not an appropriate goal: Estimation, or other informative means, without the pretense of formal testing, may be the better objective.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research and/or authorship of this article.

References

Ames, D. R. (2004a). Inside the mind reader's tool kit: Projection and stereotyping in mental state inference. Journal of Personality and Social Psychology, 87, 340-353. doi:10.1037/0022-3514.87.3.340

Ames, D. R. (2004b). Strategies for social inference: A similarity contingency model of projection and stereotyping in attribute prevalence estimates. Journal of Personality and Social Psychology, 87, 573-585. doi:10.1037/0022-3514.87.5.573

Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York, NY: Springer-Verlag.

Cook, N. L., Orav, E. J., Liang, C. L., Guadagnoli, E., & Hicks, L. S. (2011). Racial and gender disparities in implantable cardioverter-defibrillator placement: Are they due to overuse or underuse? Medical Care Research and Review, 68, 226-246. doi:10.1177/1077558710379421

Engsted, T. (2009). Statistical vs. economic significance in economics and econometrics: Further comments on McCloskey and Ziliak. Journal of Economic Methodology, 16, 393-408. doi:10.1080/13501780903337339

Fisher, R. A. (1956). Statistical methods and scientific inference. New York, NY: Hafner.

Hoover, K. D., & Siegler, M. V. (2008a). The rhetoric of "signifying nothing": A rejoinder to Ziliak and McCloskey. Journal of Economic Methodology, 15, 57-68. doi:10.1080/13501780801913546

Hoover, K. D., & Siegler, M. V. (2008b). Sound and fury: McCloskey and significance testing in economics. Journal of Economic Methodology, 15, 1-37. doi:10.1080/13501780801913298

Kan, K. (2007). Cigarette smoking and self-control. Journal of Health Economics, 26, 61-81.

Mayo, D. G. (2010). Learning from error, severe testing, and the growth of theoretical knowledge. In D. G. Mayo & A. Spanos (Eds.), Error and inference: Recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of science (pp. 28-57). Cambridge, UK: Cambridge University Press.

Mayo, D. G., & Spanos, A. (2010). Error statistics. In P. S. Bandyopadhyay & M. R. Forster (Eds.), Handbook of the philosophy of science: Philosophy of statistics (Vol. 7, pp. 153-198). New York, NY: Elsevier.

McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regressions. Journal of Economic Literature, 34, 97-114.

McCloskey, D. N., & Ziliak, S. T. (2008). Signifying nothing: Reply to Hoover and Siegler. Journal of Economic Methodology, 15, 39-55. doi:10.1080/13501780801913413

Neutens, J. J., & Rubinson, L. (2002a). Analyzing and interpreting data: Inferential analysis. In Research techniques for the health sciences (3rd ed., pp. 272-273). New York, NY: Benjamin Cummings.

Neutens, J. J., & Rubinson, L. (2002b). Research techniques for the health sciences (3rd ed.). San Francisco, CA: Benjamin Cummings.

Neyman, J. (1957). The use of the concept of power in agricultural experimentation. Journal of the Indian Society of Agricultural Statistics, IX, 9-17.

Neyman, J., & Pearson, E. S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A, 175-240.

Neyman, J., & Pearson, E. S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, 20A, 263-294.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289-337. doi:10.1098/rsta.1933.0009

Portney, L. G., & Watkins, M. P. (2000). Foundations of clinical research: Applications to practice (2nd ed.). Upper Saddle River, NJ: Prentice Hall.

Rothman, K. J., & Greenland, S. (1998). Modern epidemiology (2nd ed.). Philadelphia, PA: Lippincott-Raven.

Spanos, A. (1999a). Hypothesis testing probability theory and statistical inference: Econometric modeling with observational data. Cambridge, UK: Cambridge University Press.

Spanos, A. (1999b). Probability theory and statistical inference: Econometric modeling with observational data. Cambridge, UK: Cambridge University Press.

Spanos, A. (2008). Review of Stephen T. Ziliak and Deirdre N. McCloskey's The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: The University of Michigan Press.

Veazie, P. J. (2006). Projection, stereotyping, and the perception of chronic medical conditions. Chronic Illness, 2, 303-310.

Ziliak, S. T., & McCloskey, D. N. (2004a). Significance redux. Journal of Socio-Economics, 33, 665-675. doi:10.1016/j.socec.2004.09.038

Ziliak, S. T., & McCloskey, D. N. (2004b). Size matters: The standard error of regressions in the American Economic Review. Journal of Socio-Economics, 33, 527-546.

Ziliak, S. T., & McCloskey, D. N. (2008a). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: University of Michigan Press.

Ziliak, S. T., & McCloskey, D. N. (2008b). Science in judgment, not only calculation: A reply to Aris Spanos's review of The cult of statistical significance. Erasmus Journal for Philosophy and Economics, 1, 165-170.

Author Biography

Peter J. Veazie is Associate Professor in the Department of Public Health Sciences, Chief of the Division of Health Policy and Outcomes Research, and Director of the Health Services Research and Policy Doctoral Program at the University of Rochester. His research focuses on medical and healthcare decision making, health and quality of life outcomes, and research methods.