
Article

Understanding Statistical Testing

SAGE Open, January-March 2015: 1–9
© The Author(s) 2015
DOI: 10.1177/2158244014567685
sgo.sagepub.com
Peter J. Veazie¹

Abstract
Statistical hypothesis testing is common in research, but a conventional understanding sometimes leads to mistaken application
and misinterpretation. The logic of hypothesis testing presented in this article provides for a clearer understanding, application,
and interpretation. Key conclusions are that (a) the magnitude of an estimate on its raw scale (i.e., not calibrated by the
standard error) is irrelevant to statistical testing; (b) which statistical hypotheses are tested cannot generally be known a
priori; (c) if an estimate falls in a hypothesized set of values, that hypothesis does not require testing; (d) if an estimate does
not fall in a hypothesized set, that hypothesis requires testing; (e) the point in a hypothesized set that produces the largest
p value is used for testing; and (f) statistically significant results constitute evidence, but insignificant results do not and must
not be interpreted as evidence for or against the hypothesis being tested.

Keywords
research methods, data processing and interpretation, hypothesis testing, estimation, inference

¹ University of Rochester, NY, USA

Corresponding Author:
Peter J. Veazie, Department of Public Health Sciences, University of Rochester School of Medicine and Dentistry, 265 Crittenden Blvd., CU 420644, Rochester, NY 14642-0644, USA.
Email: [email protected]

Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http://www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (http://www.uk.sagepub.com/aboutus/openaccess.htm).
Downloaded from by guest on January 9, 2015

Introduction

Current concepts of statistical testing can lead to mistaken ideas among researchers such as (a) the raw-scale magnitude of an estimate is relevant, (b) the classic Neyman–Pearson approach constitutes formal testing, which in its misapplication can lead to mistaking statistical insignificance for evidence of no effect, (c) one-tailed tests are tied to point null hypotheses, (d) one- and two-tailed tests can be arbitrarily selected, (e) two-tailed tests are informative, and (f) power-defined intervals or data-specific intervals constitute formal test hypotheses. In this article, I challenge convention regarding hypothesis testing that leads to such mistaken ideas, and I provide a coherent conceptualization and logic for testing that avoids such mistakes.

A recent book and related works by Ziliak and McCloskey (McCloskey & Ziliak, 1996, 2008; Ziliak & McCloskey, 2004a, 2004b, 2008a, 2008b) declare statistical significance is invalid for scientific inquiry. Critics responded (Engsted, 2009; Hoover & Siegler, 2008a, 2008b; Spanos, 2008). Ziliak and McCloskey, and their critic Spanos, imply the raw-scale magnitudes of parameters are relevant to hypothesis testing. This confuses the goals of hypothesis testing with that of parameter estimation; I provide a careful distinction between the two and argue the raw-scale magnitude of an estimate is irrelevant to testing.

The Neyman and Pearson (1933) approach to addressing hypotheses is often offered as a formal logic of testing (Neutens & Rubinson, 2002b; Portney & Watkins, 2000; Rothman & Greenland, 1998; Spanos, 1999b), with the unfortunate consequence that statistically insignificant findings are interpreted as evidence for no association. I argue that the Neyman–Pearson approach is not a formal hypothesis testing strategy, and I present a generalization of Fisher's approach that is a formal strategy.

The one-tailed test is often presented as having a point null hypothesis (Neutens & Rubinson, 2002b; Portney & Watkins, 2000; Rothman & Greenland, 1998; Spanos, 1999b). Such a presentation does not constitute a general framework; I present a general characterization that allows each competing hypothesis to be a set of parameter values.

The two-tailed test is perhaps the most common test. I argue that, as a formal process, the two-tailed test (indeed, a test of any point hypothesis) is almost never informative.

Intervals associated with power, confidence intervals (CIs), and data-specific measures are sometimes offered as defining hypotheses (Mayo, 2010; Mayo & Spanos, 2010; Neyman, 1957). I argue they do not constitute formal test hypotheses.

The goal of this article is to provide a coherent understanding and approach to formal statistical hypothesis testing for the researcher who seeks to use this inference tool without confusion. This article does not discuss alternative methods for using empirical evidence in inference such as the
direct interpretation of CIs or Bayesian methods, nor is this article intended as an argument in favor of a particular method. The following sections define hypotheses and hypothesis testing, distinguish the goal of hypothesis testing from that of parameter estimation, present a logic of testing, and discuss its scope.

Hypothesis Testing Versus Parameter Estimation

By the term hypothesis, I mean a formal proposition whose truth or falsity is unknown. An empirical hypothesis is one for which empirical evidence can, in principle, bear on judgments of its truth or falsity. A statistical hypothesis is an empirical hypothesis about distribution parameters of random variables defined by a data generating process.

To properly understand Frequentist statistical hypothesis testing, it is important to understand that the relevant random variables represent the distribution of possible values that a data generating process could obtain, and not actual data. In this sense, data and corresponding estimates are realizations of the underlying random variables but are not themselves random variables. Hence, the sample mean statistic has a distribution of possible values, whereas the mean of a given sample is a number.

A statistical hypothesis should be stated in terms of distribution parameters of random variables, and not in data-specific terms. If a statement includes reference to data, then it will either not be a hypothesis or it will be uninformative. As an example, consider the claim that "there will be a significant result." In the first case, in terms of the data generating process, there is a particular probability that what is claimed will occur, and the claim is therefore neither true nor false and thereby not a hypothesis. In the second case, in terms of the resulting data, the claim will be either true or false, but it is a proposition by virtue of a necessary numeric characteristic only (of course the result will be either statistically significant or not). The same hypothesis applies to any data generating process, and knowing whether it is true or false is uninformative regarding the data generating process under investigation.

Hypothesis testing is a process by which we can inform judgments of the truth or falsity of a hypothesis. Formal statistical hypothesis testing is a method that compares the data-specific value of a statistic to the statistic's sampling distribution as implied by the hypothesized values of a statistical hypothesis. There are two largely substitutable methods, in their common usage. One is to define a set of values in the statistic's range that correspond to sufficiently rare events under the hypothesis-specific distribution (often termed the rejection region); if the data-specific value of the statistic is found to be in this set, the data are considered evidence against the underlying hypothesis. This is a strictly categorical method of testing. The second is to calculate the probability of obtaining data at least as extreme as that actually obtained from the data generating process under the assumption that the statistical hypothesis is true (commonly termed the p value); if the p value is sufficiently small compared with an a priori set level (commonly called the significance level), the data are considered evidence against the hypothesis being tested. This is also a categorical method of testing; however, the p value can also provide a continuous measure of evidence for the hypothesis being tested. The most common use of formal testing is to adopt the categorical approach with its designations of results being "statistically significant" or "statistically insignificant"; I will discuss testing in these terms.

By this definition, the classic Neyman–Pearson test (NPT), in which we set our acceptable Type I and Type II error rates and proceed as if the null were true or false according to our test, is not hypothesis testing: Notwithstanding Neyman and Pearson's reference to testing, it is a decision rule, as Neyman and Pearson themselves state (Neyman & Pearson, 1933). Its goal is to decide whether or not to act as if a hypothesis is true, not to judge whether the hypothesis is true: The former is suited to model specification; the latter is suited to generating scientific understanding.

However, Fisher's (1956) approach to statistical inference, in which we use data as evidence for or against the truth of a claim, provides a basis for hypothesis testing by the definition used here. What I provide is in one sense a generalization and in another sense a restriction of Fisher's approach. It is a generalization because Fisher's approach has focused primarily on point hypotheses, whereas the logic I present applies to set hypotheses in general. It is a restriction because Fisher's approach does not explicitly state an alternative, whereas the logic I present addresses sets of hypotheses that partition a parameter space—an idea that Neyman and Pearson (1928a, 1928b) initiated with the introduction of the formal alternative hypothesis.

It is important to distinguish the goals of testing and estimation. The goal of hypothesis testing is to make a judgment regarding the truth or falsity of a hypothesis, whereas the goal of estimation is to make a judgment regarding the value of a parameter.

If I know whether a hypothesis is true or false, I have achieved the goal of hypothesis testing. Suppose I am interested in the hypothesis that "the average annual health care expenditure among men is greater than that among women": It is either true or false; it cannot be nearly true or mostly false. If an honest omniscient being were to tell me "your hypothesis is true," then my goal has been achieved. Knowing the magnitudes of the averages or their difference adds nothing more to achieving this goal.

Suppose you are testing the hypothesis that the effect of a policy intervention is zero, and the result is a substantively trivial but statistically significant difference. A criticism is that you have identified a statistically significant yet substantively insignificant effect. Such a critique is not an indictment against your hypothesis test. Your statement that the data provide evidence the hypothesis is false is not compromised by the raw-scale size of the effect. The two statements, "10^−100 ≠ 0" and "10^100 ≠ 0," are both true: There is no degree of truth that varies with the magnitude. What then is the objection? The critic is objecting to your goal of testing the hypothesis and is instead presumably seeking evidence regarding the magnitude of the estimate; apparently to this critic the values in the CI, whether it crosses zero or not, are too small to be useful.
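The irrelevance of the raw-scale magnitude to the test itself can be made concrete with a small numerical sketch. This is my own illustration, not the article's (it assumes a normal sampling distribution for a mean difference, with a population SD of 1 and an effect size and sample sizes chosen arbitrarily):

```python
import math

def two_sided_p(estimate, se):
    """P(|Z| >= |estimate/se|) under a standard normal sampling distribution."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2))  # two-sided tail probability

# A substantively trivial raw-scale effect of 0.01 with population SD 1.0:
effect, sd = 0.01, 1.0
for n in (100, 10_000, 1_000_000):
    se = sd / math.sqrt(n)
    print(n, round(two_sided_p(effect, se), 4))
# The same raw-scale effect moves from insignificant to highly significant
# as n grows: the test responds to the SE-calibrated distance, not to the
# raw magnitude.
```

The p value runs from roughly 0.92 at n = 100 to effectively 0 at n = 1,000,000: the identical raw-scale difference of 0.01 is insignificant or significant depending only on the standard error, which is why a significant test of a substantively trivial effect is not an indictment of the test.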
Suppose you are interested in estimating an odds ratio, and your data produce a 95% CI of [0.9, 2]. A policy maker may consider the costs and benefits of the program at different values across the interval, or she may take a more rigorous approach and apply statistical decision theory (Berger, 1985). A critic, however, points out that the CI crosses 1 and states that you cannot rule out there is no effect. Such a statement is not an indictment against your estimation. What then is the objection? The critic is objecting to your goal of estimation and is instead presumably seeking to judge whether there is a difference; not being able to rule out an odds ratio of 1 disallows such a judgment.

These examples presume the researcher is interested in either hypothesis testing or estimation, but the researcher may be interested in both. Suppose we are considering the incremental cost–benefit of a program modification. But, regardless of the magnitudes of the cost and benefit differences, if there is a decrease in benefit, then the modification will not be adopted, and if there is an increase in benefit with a corresponding decrease in cost, then the modification will be adopted. In this case, it is only if there is increasing benefit and increasing costs that we need to know the values of these changes to determine whether to adopt the modification. Consequently, it is only when there is sufficient evidence supporting the last hypothesis that we need to pursue the goal of estimation.

A Logic of Statistical Testing

The logic presented here requires the a priori specification of hypotheses in terms of mutually exclusive and mutually exhaustive sets of possible values for distribution parameters of random variables reflecting a data generating process. A priori specification is an epistemological requirement for results to have evidential value; however, as discussed in this section, this requirement does not apply to the act of testing. See Figure 1 for examples regarding a parameter that can possibly take values on the extended real line (i.e., any number from negative infinity to positive infinity): Panel A depicts the set of hypotheses that underlie typical one-tailed tests; Panel B depicts the hypotheses that underlie two-tailed tests; Panel C represents how three hypotheses might be expressed. Each set of values represents a hypothesis regarding the parameter. For example, partitioning the real line into the set of negative values and the set of nonnegative values can represent a comprehensive set of hypotheses (e.g., H1 and H2) regarding a parameter µ (e.g., H1: µ < 0 and H2: µ ≥ 0). The set of hypotheses can include substantively derived hypotheses as well as a catchall negation of these hypotheses. We wish to determine in which set of values is the true parameter (i.e., which hypothesis regarding the value of the parameter is true).

[Figure 1 here: Panel A partitions the real line at a into H1: θ ≤ a and H2: θ > a; Panel B into H1: θ = a and H2: θ ≠ a; Panel C, at a and b, into H1: θ ≤ a, H2: a < θ ≤ b, and H3: θ > b.]
Figure 1. Example specifications of sets of hypotheses that are mutually exclusive but together make up the full set of possible parameter values (i.e., sets of hypotheses that partition the set of possible values).

If an estimate θ̂ (i.e., the value the estimator yields when applied to specific data) is in a given hypothesis-specified set of values, then it conforms to the corresponding hypothesis. Because H and ¬H (in which "¬" denotes logical negation and ¬H can be interpreted as "not H") represent two sets of possible values that are mutually exclusive but together make up the full set of possible values, if the estimate θ̂ conforms to one of the hypotheses, it will conform to only that hypothesis (e.g., any point on the real lines depicted in Figure 1 can only be in one of the hypothesized sets). If the set of possible parameter values is a proper subset of the estimator's range (e.g., whole numbers are a proper subset of all real numbers), then it is possible for θ̂ not to conform to any hypothesis (e.g., the estimate might be a fraction when the hypotheses are sets of whole numbers). If an estimate conforms to a hypothesis, then it is a plausible result if the hypothesis were true; indeed, if θ = θ̂, then for an unbiased estimator with a symmetric unimodal distribution, θ̂ is the most likely result, given the data.

I define an estimate θ̂, and by extension the underlying data, as being consistent with a hypothesis if the estimate conforms to the hypothesis or if there exists at least one element in the corresponding set of values that would define a data generating process that could plausibly have produced data at least as extreme as the obtained estimate. For hypothesis H, to which the data do not conform, θ̂ is consistent with H if there exists an element θ in the set of values

corresponding to H for which the p-value(θ : θ̂) is large. The term p-value(θ : θ̂) denotes the probability of the specified data generating process, with an actual parameter value of θ, producing data having at least as extreme an estimated value as θ̂ (this is the usual definition of a p value). An estimate, and therefore the data, is inconsistent with hypothesis H if the estimate is not consistent with H: which is to say, if θ̂ does not conform to H and for all elements θ in the set of values corresponding to H the p-value(θ : θ̂) is small.

Judgments regarding what constitutes a large or small p value are typically made in comparison with a threshold value termed a significance level. Notice that to be consistent with a hypothesis, the estimate need only correspond to a large p value for one value in the hypothesized set of values; to be inconsistent with a hypothesis, the estimate needs to correspond with a small p value for all values in the hypothesized set of values.

[Figure 2 here: a two-by-two grid crossing consistency with H (rows) against consistency with ¬H (columns). Panel A: consistent with both; Panel B: consistent with H but not ¬H; Panel C: consistent with ¬H but not H; Panel D: consistent with neither, marked NOT POSSIBLE.]
Figure 2. Examples of consistent and inconsistent combinations for a one-tailed hypothesis and its negation in which the parameter space is the real line.

Figure 2 presents the possibilities for the single-parameter two-hypotheses case (H and ¬H) defined on the real line. Panel A depicts the estimate being consistent with both hypotheses: θ̂ conforms to H, and there also exists at least one parameter value in the set of values corresponding to ¬H that could plausibly produce data with an estimate at least as extreme as that obtained (if we consider the area under the curve to the right of θ̂ as being large). In this case, the data cannot adjudicate between them, and we cannot rule out either hypothesis. Surely, θ̂, providing a statistically insignificant test result, still provides some evidence in favor of ¬H? No. Even though θ̂ is consistent with some values in ¬H, it is in fact in H and thereby it is even "more consistent" with values in H. Consider, for example, the true value being exactly that of the estimate θ̂, a value that is in H. If anything, the fact that θ̂ is anywhere in H provides more evidence in favor of H than ¬H regardless of being consistent with values in ¬H (i.e., regardless of the insignificant test of ¬H). Making a judgment based on this fact alone, however, would not be an exercise in formal statistical testing as it would not be properly accounting for the fact that the data generating process could have produced estimates at least as extreme as θ̂ if the true value were actually in ¬H.

Panel B depicts the estimate being consistent with H but inconsistent with ¬H: Hypothesis H could well have produced the data, but ¬H is not likely to have (if we consider the area under the curve to the right of θ̂ as being small). This constitutes evidence for H and evidence against ¬H. Panel C depicts the estimate being consistent with ¬H but inconsistent with H: Hypothesis H is not likely to have produced such an estimate (if we consider the area under the curve to the left of θ̂ as being small), but ¬H could have. This situation constitutes evidence against H and evidence for ¬H. Panel D is not possible because in this example the estimate must conform to, and thereby be consistent with, at least one of the hypotheses.

Because data are consistent with a hypothesis to which the estimate conforms, it is not necessary to statistically test such a hypothesis; however, a statistical test is required of those hypotheses to which the estimate does not conform. In this case, because the estimate will conform to only one hypothesis, if there are N hypotheses, then N − 1 hypotheses must be statistically tested. In the case where the estimate does not conform to any hypothesis, all hypotheses must be tested.

Steps in the Application of the Testing Logic

The application of this logic proceeds in four steps as stated in Table 1. The interpretation of results depends on which of two cases apply.

Table 1. Steps in the Application of the Logic of Statistical Testing.

Step 1: Determine the hypothesis-specific partition of the parameter space associated with the data generating process. How this is achieved depends on the substance and logic of the research being pursued and is not merely a question of statistics.
Step 2: Obtain the parameter estimate using data from the data generating process.
Step 3: Identify the hypothesis to which the estimate conforms.
Step 4: Test whether the estimate is consistent with the remaining hypotheses.

Case 1: Data are consistent with all hypotheses. If the data are consistent with all hypotheses, then there exists at least one plausible parameter value in each of the hypothesized sets of values. In this case, the data do not provide evidence for or against any of the hypotheses.

Case 2: Data are inconsistent with at least one hypothesis. If the data are consistent with one hypothesis but inconsistent with all others, then the data provide evidence for the hypothesis and evidence against the others. When there are more than two hypotheses, the data can provide evidence for or against sets of hypotheses. In such a case, the data cannot adjudicate between the hypotheses in the set of hypotheses with which the data are consistent but can rule out the hypotheses with which the data are inconsistent.

The One-Tailed Test

Hypotheses regarding a single parameter often take the form of directional hypotheses such as H0: θ ≤ 0 versus H1: θ > 0. Which hypothesis should we statistically test? Suppose at Step 3, we estimate θ̂ = −0.6. Being negative, θ̂ conforms to H0 and is consequently consistent with this hypothesis. We do not, therefore, need to statistically test H0. But the question remains whether it is likely that if θ were in fact positive, the data generating process would produce data with an estimate at least as extreme as −0.6. We answer this question by statistically testing H1. Notice that we do not know a priori which hypotheses will be subject to statistical testing; we must first know to which hypothesis, if any, the estimate conforms.

Because the hypotheses specify sets of possible values, they are not necessarily expressed in terms of the specific parameter value used to calculate the p value. If the hypotheses are H0: θ ≤ 0 versus H1: θ > 0, and if in Step 3 we obtain θ̂ = 0.4, which conforms to H1, then we need to statistically test H0. But which of the infinite number of values in H0 do we use to calculate a p value? To determine whether there exists a value in H0 consistent with θ̂, we only need a p value for a value in H0 with which the data must be consistent if there exists any such point. This will be the value nearest to the estimate, on the metric of statistical distance. For a discrete parameter space, this value will be in the set of values corresponding to the hypothesis being tested; for a continuous parameter space, this point will be on the boundary between the hypotheses' sets of values. For example, consider an estimate of −0.6 and a test of H1: θ > 0 for a continuous parameter space. We would use θ = 0 to calculate the p value even though this value is not in H1. The reason is that for any value in H1 near 0, there are an infinite number of values between it and 0. However, using θ = 0 to calculate a p value will produce the same quantity as a point in H1 infinitely close to 0.

Returning to our example with θ̂ = 0.4 and testing the hypothesis θ ≤ 0, notice from Figure 3 that if θ̂ is plausible given θ = 0 (i.e., has a large p value under the distribution depicted by the solid line), then we have at least one point in H0 that makes θ̂ consistent with H0—the very point we tested. However, if θ̂ is implausible given θ = 0 (i.e., has a small p value), then it must be implausible for any other point further away from θ̂ in H0 (e.g., the distribution depicted by the dashed line) and therefore there is no point in H0 that would make θ̂ consistent with H0. Consequently, the test of whether θ̂ is consistent with the hypothesis that θ ≤ 0 is achieved using the p value associated with the value θ = 0, but the hypotheses being considered remain as originally stated: H0: θ ≤ 0 and H1: θ > 0.

[Figure 3 here: two sampling distributions over the estimate, a solid curve centered at the boundary value θ = 0 and a dashed curve centered further inside the H0 subspace, with the p value shown as the upper-tail area beyond the estimate.]
Figure 3. Statistical testing of an estimate against hypothesis H0 using the limit point on the boundary between hypothesis subspaces (use of the distribution depicted by the solid line). Note. This point produces the largest p value; any point further inside of the H0 subspace produces a smaller p value (e.g., the distribution depicted by the dashed line).

Directional hypotheses are sometimes introduced as a point hypothesis and a directional hypothesis such as H0: θ = 0 versus H1: θ > 0 (Neutens & Rubinson, 2002a; Spanos, 1999a). As indicated by the preceding discussion, this specification is not a general description of directional hypotheses and would only be appropriate when the parameter space is legitimately restricted to the indicated range and the point hypothesis is part of an a priori specification of the partition.

The Two-Tailed Test

Perhaps the most common statistical test is the two-tailed test of a scalar parameter. In this case, the hypotheses include a single point and its negation. Step 1 of the preceding logic is to express the hypotheses that constitute the relevant partition of the parameter space: for example, H0: θ = 0 versus H1: θ ≠ 0. However, we know a priori that the estimate is almost certainly going to conform to H1: The chance of an estimate equaling 0 (to the precision of the computer, and certainly to the infinite decimal place) is almost certainly 0. This has two consequences: First, we know a priori that we will almost certainly be statistically testing the null hypothesis H0. Second, it is not possible to accrue evidence for the point hypothesis H0 in a continuous parameter space. To have evidence for H0, one would need an estimated value in H0 that would be unlikely under all points in H1. Even if we obtained an estimate exactly equal to 0, for any finite sample there will be some positive number c such that θ = 10^−c (which, not being 0, is clearly in H1) will make the estimate of 0 plausible (ignoring that a p value does not actually exist for continuous parameter spaces if H0 is a point on the real line). Moreover, except in ideal cases such as perfect randomized experiments, the presumption of a point hypothesis is itself implausible—the parameter is almost certainly not exactly equal to the specified value. The conclusion is that a formal point hypothesis (in a continuous parameter space) will not likely be empirically informative: It cannot be confirmed, and it seldom needs to be disconfirmed.
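That evidence cannot accrue for a point hypothesis can be seen numerically. The sketch below is my own (a normal sampling distribution and an arbitrary standard error are assumed): even an estimate of exactly 0 cannot be unlikely under all of H1, because points of the form θ = 10^−c remain plausible once c is large enough:

```python
import math

def two_sided_p(theta, estimate, se):
    """Probability, under parameter value theta, of an estimate at least as
    far from theta as the one observed (two-sided)."""
    z = abs(estimate - theta) / se
    return math.erfc(z / math.sqrt(2))

se = 0.05  # standard error implied by some finite sample
# Test an estimate of exactly 0 against points of H1: theta != 0.
for c in (1, 3, 6):
    theta = 10.0 ** -c  # not 0, hence a member of H1
    print(c, round(two_sided_p(theta, 0.0, se), 4))
# Distant alternatives (c = 1) can be ruled out, but alternatives arbitrarily
# close to 0 cannot: no finite sample favors H0: theta = 0 over all of H1.
```

The p values rise toward 1 as θ approaches 0, so not even an estimate of exactly 0 is unlikely under every element of H1; this is the numerical face of the claim that a point hypothesis cannot be confirmed.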

Other Sources of Conceptual Errors


Another mistaken question of concern regards overly power- 0 δ
ful tests. This concern stems from confusing statistical with
A
substantive goals and thereby confusing the metric of statis- Reject θ = δ

tical distance (based on the standard error, reflecting varia- Reject θ = 0

tion in the data generating process) with that of a substantive


determination (typically associated with the raw scale of the
Figure 4. Power-specified discernible effects do not define
variables). Indeed, if we had full information, we would
formal hypotheses. If an estimate is anywhere in the Reject θ =
know whether the hypothesis was actually true or false, or 0 region, one would reject H0. However, estimates in the range
we would know the actual value of the parameter. Such denoted as  would provide evidence to reject the hypothesis θ
knowledge is not to be shunned. = 0 (the solid curve) but would not reject θ = δ in a formal test
A related issue is the a posteriori interpretation of results of that value (the dashed curve).
that accounts for power. Suppose we have sufficient power to
discern a deviation δ from a point hypothesis θ0. If results are
significantly different from θ0, we might infer that θ is at 
Upper 
Limit ] and θ ∉ [ Lower 
Limit , Upper Limit ]? No.
least θ0 ± δ; if results are not significant, then we might infer Because, as argued above, such statements refer to the data
that θ is no more than θ0 ± δ (Neyman, 1957). Do these and thereby simply do not constitute informative hypotheses
results, however, constitute a formal test of the implied set of about data generating processes. Although the a posteriori
hypothesis θ ∈ [θ0 − δ, θ0 + δ] and θ ∉ [θ0 − δ, θ0 + δ]? No. The rejection region of the statistic can be between θ0 and δ, or greater than δ but not statistically significantly different from δ (see Figure 4, in which θ0 = 0). If we wish to test these new hypotheses, we would follow the steps in the testing logic, which would lead to collecting data and a test centering the sampling distribution on either θ0 − δ or θ0 + δ (whichever is closest to the new estimate). But, given the less than perfect power in the new data generating process, we will end up with another discernible deviation δ* around this point. If we again follow the a posteriori interpretation above, we would be led to the implied hypotheses θ ∈ [θ0 − δ − δ*, θ0 + δ + δ*] and θ ∉ [θ0 − δ − δ*, θ0 + δ + δ*]. As we keep pursuing these new implied hypotheses, taking into account the discernible deviation due to power, we ultimately (through infinite iterations, assuming the sequence of standard errors has a nonzero lower bound) get to the implied statement that θ ∈ (−∞, ∞), which we presume to be true a priori—We have arrived at a trivial truth rather than an informative hypothesis.

Can we consider a similar construction of implied hypotheses using data-specific concepts such as CIs or severity measures (Mayo, 2010; Mayo & Spanos, 2010)? For example, based on the CI, might we consider our results as formally testing the implied hypotheses θ ∈ [Lower Limit, Upper Limit] and θ ∉ [Lower Limit, Upper Limit]? Although consideration of power, CIs, or severity measures can inform inference, they do not deliver their epistemic value through formal hypothesis testing.

We could, however, use these data-specific values as part of a formal test. For example, regarding the CI, we could use the underlying interval estimator (L, U) as a test statistic and would thereby require its sampling distribution to calculate the probability of the statistic taking on values at least as extreme as the calculated data-specific CI. Here it is important to remember that the CI as calculated from data is not a statistic (i.e., not a random variable) upon which a formal test can be based: It is a single realization from the distribution of an underlying interval statistic (L, U). A formal test would be based on the distribution of (L, U). If such a test were constructed, the logic of its use would follow that described in this article. Nonetheless, these types of data-specific values are not commonly used in formal statistical testing as defined here.

Why Test Hypotheses?

Statistical hypothesis testing is common when a researcher wishes to determine a substantive claim. If the truth or


falsity of a statistical hypothesis, then hypothesis testing can be used to inform judgments about the substantive claim. This is the basis for hypothesis-driven science. For example, Kan (2007) derived and tested statistical hypotheses from the claim that time-inconsistent preferences with hyperbolic discounting explain lack of self-control among smokers. Cook, Orav, Liang, Guadagnoli, and Hicks (2011) tested the hypothesis that disparities in the placement of implantable cardioverter-defibrillators (ICDs) can be explained by the underutilization of ICD implantation among clinically appropriate racial/ethnic minorities and women and the overutilization of the procedure among clinically inappropriate Whites and men. Veazie (2006) derived and tested statistical hypotheses from the claim that variation in individuals' perceptions of those with chronic medical conditions is explained by Ames's (2004a, 2004b) theory of social inferences. It is the fact that the estimate is consistent with a hypothesized set of parameter values and inconsistent with others that constitutes evidence for the hypothesis, not the raw-scale distance (i.e., magnitude) from the boundary between such sets.

Statistical hypothesis testing is also used when the goal of estimation is of interest only if the parameter is in a particular range of values. The cost–benefit example presented above is one such case. Another is when researchers determine whether a variable predicts an outcome. Stating that something "will either go up or not go up" clearly does not constitute an informative prediction, and stating that something "will either go up or down" (e.g., an inference from a significant two-tailed test) is not much better. Consequently, identifying predictors typically requires isolating a direction. In this case, it is reasonable for the researcher to first address the three hypotheses that (1) the parameter is greater than zero, (2) the parameter is less than zero, and (3) the parameter is equal to zero. Because, in a continuous parameter space, the third hypothesis is a point on the boundary between the first two, testing this set of hypotheses reduces in practice to essentially testing the disjunction of one of the first two with the third. If the estimate conforms to (1), then it is statistically tested against the disjunction of (2) and (3). If the estimate conforms to (2), then it is statistically tested against the disjunction of (1) and (3). If an adequate judgment regarding the truth or falsity of these hypotheses can be made, then the researcher continues with the estimation goal and interprets point or interval estimates accordingly.

Discussion

The objective of this article was to present a coherent Frequentist logic of testing. To do so, I distinguished the goal of hypothesis testing from that of estimation, and presented a logic for the former that does not confuse it with the latter. The key points include (a) hypotheses are expressed as a partition of the parameter space specifying the distribution of random variables associated with a data generating process, (b) which of the a priori specified hypotheses are statistically tested cannot generally be known before the parameter is estimated (the exception being when a point hypothesis is involved), (c) the parameter estimate is consistent with a hypothesis to which the parameter estimate conforms and thereby this hypothesis does not require statistical testing, (d) all hypotheses to which the estimate does not conform are subject to statistical testing to rule them out as alternative explanations, (e) the element in the hypothesis' set of values that produces the largest p value is used to test the hypothesis, and (f) except in the case of a point hypothesis, an estimate can provide either evidence for or against hypotheses (or sets of hypotheses), or remain ambiguous regarding them.

When testing hypotheses, researchers should report whether there is evidence for or against hypotheses. Moreover, ambiguous findings (i.e., insignificant findings) should not be reported as evidence from a formal test for a hypothesis. For example, the common practice of treating insignificant results of a formal two-tailed test as evidence that there is no effect should be avoided. Instead, it should be acknowledged that the data cannot distinguish hypotheses or cannot rule out certain alternatives.

In this article, I focused on hypotheses about a single parameter. The presented logic naturally extends to hypotheses regarding multiple parameters as well (e.g., hypotheses regarding two parameters θ and γ such as H1: [θ > 0 and γ > 0] and H2: [θ ≤ 0 or γ ≤ 0]). See the online appendix for a description of hypothesis testing with multiple parameters.

For clarity of presentation, I adopted the standard concept of a threshold (i.e., significance level) to categorically determine whether data are consistent with a hypothesis, an approach that leads to the common use of categorical statements such as having a "significant result" or an "insignificant result." This does not preclude the determination of a set of thresholds to define multiple categories of evidence, such as weak evidence (e.g., perhaps .1 ≥ p > .05), moderate evidence (e.g., perhaps .05 ≥ p > .01), and strong evidence (e.g., perhaps p ≤ .01). Note, however, that such thresholds are arbitrary, relative to a scientist's judgment, or conventional, relative to expectations of a community of scholars (e.g., as broad as a discipline or field of study, and as narrow as a specific journal). It should also be clear that it is not necessary to adopt formal thresholds at all in the application of the presented logic: A scientist may directly interpret the evidential value of the p value. For example, notwithstanding the conventional .05 significance level, a scientist may consider p values of .052 and .048 as essentially equivalent in their evidential bearing, perhaps judging both indicate the data are inconsistent with the hypothesis being tested. Moreover, the logic described here can be applied by considering the p value as a continuous measure of consistency with a value contained in the hypotheses with which the data do not conform. Nonetheless, unlike Bayesian methods, the logic of formal Frequentist hypothesis testing does not imply statements of mathematical probability reflecting subjective


beliefs. Consequently, the p value (i.e., the probability that a data generating process would produce a statistic value as extreme as that observed given hypothesized distributional characteristics) requires interpretation by the scientist in light of the context and scientific goals.

A final point of clarification may be helpful. I have mentioned the need for a priori specification of hypotheses but also the fact that one cannot determine which hypothesis (or hypotheses) will be statistically tested before observing the estimate (except when including a point hypothesis). These do not conflict. The first is an epistemic requirement for using test results as evidence. The second is a logical consequence of the first bearing on the process of establishing evidence. The first means you should not use the data to determine whether you are addressing, for example, the hypotheses H1: θ ≤ 0 and H2: θ > 0, or you are addressing the hypotheses H1: θ = 0 and H2: θ ≠ 0. This specification should be determined a priori. Notice this precludes arbitrarily doubling your power given your results. Suppose, however, I were to decide ahead of time that I will statistically test H2: θ > 0 and I subsequently obtain an estimate θ = 5. It is unclear how I would formally test H2: θ > 0 given the estimate. What value in H2 do I base the test on, and how would it be structured? There is no useful answer. How can the fact that θ is in the hypothesized set of values provide evidence against the hypothesized values? It remains, however, to rule out H1, but this is not my a priori specified statistical test. The steps in Table 1 avoid this issue because the hypothesis that is statistically tested, but not the hypothesis specification, depends on the obtained result and is not determined a priori.

By following the logic of formal statistical testing presented here, a researcher does not confuse the goal of testing with that of estimation and can thereby avoid the conflict inherent in interpreting results of testing and interpreting raw-scale magnitudes of estimation. However, a limitation of hypothesis testing is that it provides evidence solely for the truth or falsity of the specified hypotheses. It is the responsibility of the researcher to justify knowing this fact. For the case in which knowing the truth or falsity of a hypothesis is not important, formal hypothesis testing is not an appropriate goal—Estimation, or other informative means, without the pretense of formal testing, may be the better objective.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research and/or authorship of this article.

References

Ames, D. R. (2004a). Inside the mind reader's tool kit: Projection and stereotyping in mental state inference. Journal of Personality and Social Psychology, 87, 340-353. doi:10.1037/0022-3514.87.3.340
Ames, D. R. (2004b). Strategies for social inference: A similarity contingency model of projection and stereotyping in attribute prevalence estimates. Journal of Personality and Social Psychology, 87, 573-585. doi:10.1037/0022-3514.87.5.573
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York, NY: Springer-Verlag.
Cook, N. L., Orav, E. J., Liang, C. L., Guadagnoli, E., & Hicks, L. S. (2011). Racial and gender disparities in implantable cardioverter-defibrillator placement: Are they due to overuse or underuse? Medical Care Research and Review, 68, 226-246. doi:10.1177/1077558710379421
Engsted, T. (2009). Statistical vs. economic significance in economics and econometrics: Further comments on McCloskey and Ziliak. Journal of Economic Methodology, 16, 393-408. doi:10.1080/13501780903337339
Fisher, R. A. (1956). Statistical methods and scientific inference. New York, NY: Hafner.
Hoover, K. D., & Siegler, M. V. (2008a). The rhetoric of "signifying nothing": A rejoinder to Ziliak and McCloskey. Journal of Economic Methodology, 15, 57-68. doi:10.1080/13501780801913546
Hoover, K. D., & Siegler, M. V. (2008b). Sound and fury: McCloskey and significance testing in economics. Journal of Economic Methodology, 15, 1-37. doi:10.1080/13501780801913298
Kan, K. (2007). Cigarette smoking and self-control. Journal of Health Economics, 26, 61-81.
Mayo, D. G. (2010). Learning from error, severe testing, and the growth of theoretical knowledge. In D. G. Mayo & A. Spanos (Eds.), Error and inference: Recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of science (pp. 28-57). Cambridge, UK: Cambridge University Press.
Mayo, D. G., & Spanos, A. (2010). Error statistics. In P. S. Bandyopadhyay & M. R. Forster (Eds.), Handbook of the philosophy of science: Philosophy of statistics (Vol. 7, pp. 153-198). New York, NY: Elsevier.
McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regressions. Journal of Economic Literature, 34, 97-114.
McCloskey, D. N., & Ziliak, S. T. (2008). Signifying nothing: Reply to Hoover and Siegler. Journal of Economic Methodology, 15, 39-55. doi:10.1080/13501780801913413
Neutens, J. J., & Rubinson, L. (2002a). Analyzing and interpreting data: Inferential analysis. In Research techniques for the health sciences (3rd ed., pp. 272-273). New York, NY: Benjamin Cummings.
Neutens, J. J., & Rubinson, L. (2002b). Research techniques for the health sciences (3rd ed.). San Francisco, CA: Benjamin Cummings.
Neyman, J. (1957). The use of the concept of power in agricultural experimentation. Journal of the Indian Society of Agricultural Statistics, IX, 9-17.
Neyman, J., & Pearson, E. S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A, 175-240.
Neyman, J., & Pearson, E. S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, 20A, 263-294.


Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289-337. doi:10.1098/rsta.1933.0009
Portney, L. G., & Watkins, M. P. (2000). Foundations of clinical research: Applications to practice (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Rothman, K. J., & Greenland, S. (1998). Modern epidemiology (2nd ed.). Philadelphia, PA: Lippincott-Raven.
Spanos, A. (1999a). Hypothesis testing probability theory and statistical inference: Econometric modeling with observational data. Cambridge, UK: Cambridge University Press.
Spanos, A. (1999b). Probability theory and statistical inference: Econometric modeling with observational data. Cambridge, UK: Cambridge University Press.
Spanos, A. (2008). Review of Stephen T. Ziliak and Deirdre N. McCloskey's The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: The University of Michigan Press.
Veazie, P. J. (2006). Projection, stereotyping, and the perception of chronic medical conditions. Chronic Illness, 2, 303-310.
Ziliak, S. T., & McCloskey, D. N. (2004a). Significance redux. Journal of Socio-Economics, 33, 665-675. doi:10.1016/j.socec.2004.09.038
Ziliak, S. T., & McCloskey, D. N. (2004b). Size matters: The standard error of regressions in the American Economic Review. Journal of Socio-Economics, 33, 527-546.
Ziliak, S. T., & McCloskey, D. N. (2008a). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor: University of Michigan Press.
Ziliak, S. T., & McCloskey, D. N. (2008b). Science in judgment, not only calculation: A reply to Aris Spanos's review of The cult of statistical significance. Erasmus Journal for Philosophy and Economics, 1, 165-170.

Author Biography

Peter J. Veazie is Associate Professor in the Department of Public Health Sciences, Chief of the Division of Health Policy and Outcomes Research, and Director of the Health Services Research and Policy Doctoral Program at the University of Rochester. His research focuses on medical and healthcare decision making, health and quality of life outcomes, and research methods.
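The directional three-hypothesis procedure described in the article (test the disjunction of the hypotheses to which the estimate does not conform, using the boundary point that yields the largest p value) can be sketched in code. This is an illustrative sketch only, not code from the article: the normal approximation, the function names, and the .05 threshold are assumptions for the example.

```python
import math

def upper_tail_p(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def directional_test(estimate, se, alpha=0.05):
    """Sketch of the three-hypothesis procedure for theta:
    (1) theta > 0, (2) theta < 0, (3) theta = 0. The estimate conforms to
    one directional hypothesis; the disjunction of the other two is tested
    at the boundary point theta = 0, the element of that set producing the
    largest p value. Normal approximation assumed; names are illustrative."""
    z = estimate / se
    if estimate > 0:
        tested = "theta <= 0"   # disjunction of (2) theta < 0 and (3) theta = 0
        p = upper_tail_p(z)     # largest p over theta <= 0 occurs at theta = 0
    else:
        tested = "theta >= 0"   # disjunction of (1) theta > 0 and (3) theta = 0
        p = upper_tail_p(-z)
    if p <= alpha:
        verdict = "evidence against " + tested
    else:
        # insignificant results are ambiguous, not evidence of no effect
        verdict = "ambiguous: data cannot rule out " + tested
    return tested, p, verdict
```

For an estimate of 2.0 with standard error 1.0, the estimate conforms to θ > 0 and the disjunction θ ≤ 0 is tested at θ = 0, giving p ≈ .023: evidence against θ ≤ 0. An estimate of −0.5 with the same standard error leaves the result ambiguous, consistent with the article's point that insignificant results must not be read as evidence of no effect.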
