A Machine Learning Approach To Modeling Scope Preferences
Derrick Higgins∗ and Jerrold M. Sadock†
University of Chicago
1. Overview
This article addresses the issue of determining the most accessible quantifier scope
reading for a sentence. Quantifiers are elements of natural and logical languages (such
as each, no, and some in English and ∀ and ∃ in predicate calculus) that have certain
semantic properties. Loosely speaking, they express that a proposition holds for some
proportion of a set of individuals. One peculiarity of these expressions is that there
can be semantic differences that depend on the order in which the quantifiers are
interpreted. These are known as scope differences.
∗ Department of Linguistics, University of Chicago, 1010 East 59th Street, Chicago, IL 60637. E-mail:
[email protected].
† Department of Linguistics, University of Chicago, 1010 East 59th Street, Chicago, IL 60637. E-mail:
[email protected].
© 2003 Association for Computational Linguistics
Linguistic treatments of quantification have traditionally concentrated on scope generation, that is, on determining what scope readings for a sentence are possible, without regard to their relative likelihood or naturalness.
Recently, however, linguists such as Kuno, Takami, and Wu (1999) have begun to turn
their attention to scope prediction, or determining the relative accessibility of different
scope readings.
In computational linguistics, more attention has been paid to the factors that de-
termine scope preferences. Systems such as the SRI Core Language Engine (Moran
1988; Moran and Pereira 1992), LUNAR (Woods 1986), and TEAM (Martin, Appelt,
and Pereira 1986) have employed scope critics that use heuristics to decide between
alternative scopings. However, the rules that these systems use in making quantifier
scope decisions are motivated only by the researchers’ intuitions, and no empirical
results have been published regarding their accuracy.
In this article, we use the tools of machine learning to construct a data-driven
model of quantifier scope preferences. For theoretical linguistics, this model serves as
an illustration that Kuno, Takami, and Wu’s approach can capture some of the clear-
est generalizations about quantifier scoping. For computational linguistics, this article
provides a baseline result on the task of scope prediction, with which other scope
critics can be compared. In addition, it is the most extensive empirical investigation
of which we are aware that collects data of any kind regarding the relative frequency
of different quantifier scope readings in English text.1
Section 2 briefly discusses treatments of scoping issues in theoretical linguistics,
and Section 3 reviews the computational work that has been done on natural language
quantifier scope. In Section 4 we introduce the models that we use to predict quantifier
scoping, as well as the data on which they are trained and tested. Section 5 combines
the scope model of the previous section with a probabilistic context-free grammar
(PCFG) model of syntax and addresses the issue of whether these two modules of
grammar ought to be combined in serial, with information from the syntax feeding the
quantifier scope module, or in parallel, with each module constraining the structures
provided by the other.
Most, if not all, linguistic treatments of quantifier scope have closely integrated it with
the way in which the syntactic structure of a sentence is built up. Montague (1973) used
a syntactic rule to introduce a quantified expression into a derivation at the point where
it was to take scope, whereas generative semantic analyses such as McCawley (1998)
represented the scope of quantification at deep structure, transformationally lowering
quantifiers into their surface positions during the course of the derivation. More recent
work in the interpretive paradigm takes the opposite approach, extracting quantifiers
from their surface positions to their scope positions by means of a quantifier-raising
(QR) transformation (May 1985; Aoun and Li 1993; Hornstein 1995). Another popular
technique is to percolate scope information up through the syntactic tree using Cooper
storage (Cooper 1983; Hobbs and Shieber 1987; Pollard 1989; Nerbonne 1993; Park 1995;
Pollard and Yoo 1998).
The QR approach to dealing with scope in linguistics consists in the claim that
there is a covert transformation applying to syntactic structures that moves quantified
elements out of the position in which they are found on the surface and raises them to
a higher position that reflects their scope. The various incarnations of the strategy that
1 See Carden (1976), however, for a questionnaire-based approach to gathering data on the accessibility
of different quantifier scope readings.
Figure 1
Simple illustration of the QR approach to quantifier scope generation.
follows from this claim differ in the precise characterization of this QR transforma-
tion, what conditions are placed upon it, and what tree-configurational relationship
is required for one operator to take scope over another. The general idea of QR is
represented in Figure 1, a schematic analysis of the reading of the sentence Someone
saw everyone in which someone takes wide scope (i.e., ‘there is some person x such that
for all persons y, x saw y’).
In the Cooper storage approach, quantifiers are gathered into a store and passed
upward through a syntactic tree. At certain nodes along the way, quantifiers may be
retrieved from the store and take scope. The relative scope of quantifiers is determined
by where each quantifier is retrieved from the store, with quantifiers higher in the tree
taking wide scope over lower ones. As with QR, different authors implement this
scheme in slightly different ways, but the simplest case is represented in Figure 2, the
Cooper storage analog of Figure 1.
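To make the mechanism concrete, the following is a minimal sketch of Cooper-storage-style retrieval (our own illustration, not code from any of the systems cited): quantifiers contributed by the NPs are held in a store, and the order in which they are discharged at the S node fixes their relative scope.

from itertools import permutations

# Minimal Cooper-storage sketch (illustrative only): each NP contributes a
# quantifier to the store, and retrieving the stored quantifiers at the S node
# in different orders yields the different scope readings.
def scopings(core, store):
    """Enumerate readings by retrieving stored quantifiers in every order."""
    readings = []
    for order in permutations(store):
        formula = core
        for quant, var in order:          # the quantifier retrieved last is outermost
            formula = f"{quant} {var}.({formula})"
        readings.append(formula)
    return readings

# "Someone saw everyone": core predication saw(x, y) with two stored quantifiers.
store = [("exists", "x"), ("forall", "y")]    # someone -> x, everyone -> y
for reading in scopings("saw(x, y)", store):
    print(reading)
# forall y.(exists x.(saw(x, y)))  -- 'everyone' takes wide scope
# exists x.(forall y.(saw(x, y)))  -- 'someone' takes wide scope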
These structural approaches, QR and Cooper storage, have in common that they
allow syntactic factors to have an effect only on the scope readings that are available for
a given sentence. They are also similar in addressing only the issue of scope generation,
or identifying all and only the accessible readings for each sentence. That is to say,
they do not address the issue of the relative salience of these readings.
Kuno, Takami, and Wu (1999, 2001) propose to model the scope of quantified
elements with a set of interacting expert systems that basically consists of a weighted
vote taken of the various factors that may influence scope readings. This model is
meant to account not only for scope generation, but also for “the relative strengths of
the potential scope interpretations of a given sentence” (1999, page 63). They illustrate
the plausibility of this approach in their paper by presenting a number of examples that the approach handles fairly well even when only an unweighted vote of the factors is taken.
So, for example, for Kuno, Takami, and Wu's (1999) example (49b), repeated here as (2), the
correct prediction is made: that the sentence is unambiguous, with the first quantified
noun phrase (NP) taking wide scope over the second (the reading in which we don’t
all have to hate the same people). Table 1 illustrates how the votes of each of Kuno,
Takami, and Wu’s “experts” contribute to this outcome. Since the expression many of
us/you receives more votes, and the numbers for the two competing quantified expres-
sions are quite far apart, the first one is predicted to take wide scope unambiguously.
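As a rough sketch of how such a vote can be tallied (our own illustration; the expert names and votes below are hypothetical placeholders, not Kuno, Takami, and Wu's actual experts or their votes on this example):

# Unweighted "expert vote" in the spirit of Kuno, Takami, and Wu (1999).
def tally(votes_q1, votes_q2, margin=2):
    """Predict the scoping of two quantified expressions from expert votes."""
    if votes_q1 - votes_q2 >= margin:
        return "first quantifier takes wide scope (unambiguous)"
    if votes_q2 - votes_q1 >= margin:
        return "second quantifier takes wide scope (unambiguous)"
    return "ambiguous / no clear preference"

# Hypothetical experts voting on a sentence like example (2):
experts = {
    "subject quantifier":  "Q1",   # subjects tend to outscope objects
    "leftmost quantifier": "Q1",   # linear order
    "lexical preference":  "Q1",   # e.g., 'many of us/you' over the second NP
    "discourse factor":    None,   # expert abstains
}
q1 = sum(1 for v in experts.values() if v == "Q1")
q2 = sum(1 for v in experts.values() if v == "Q2")
print(tally(q1, q2))               # -> first quantifier takes wide scope (unambiguous)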
Figure 2
Simple illustration of the Cooper storage approach to quantifier scope generation.
Table 1
Voting to determine optimal scope readings for quantifiers, according to Kuno, Takami, and
Wu (1999).
Some adherents of the structural approaches also seem to acknowledge the ne-
cessity of eventually coming to terms with the factors that play a role in determining
scope preferences in language. Aoun and Li (2000) claim that the lexical scope pref-
erences of quantifiers “are not ruled out under a structural account” (page 140). It is
clear from the surrounding discussion, though, that they intend such lexical require-
ments to be taken care of in some nonsyntactic component of grammar. Although
Kuno, Takami, and Wu’s dialogue with Aoun and Li in Language has been portrayed
by both sides as a debate over the correct way of modeling quantifier scope, they are
not really modeling the same things. Whereas Aoun and Li (1993) provide an account
of scope generation, Kuno, Takami, and Wu (1999) intend to model both scope gen-
eration and scope prediction. The model of scope preferences provided in this article
is an empirically based refinement of the approach taken by Kuno, Takami, and Wu,
but in principle it is consistent with a structural account of scope generation.
Many studies, such as Pereira (1990) and Park (1995), have dealt with the issue of
scope generation from a computational perspective. Attempts have also been made
in computational work to extend a pure Cooper storage approach to handle scope
prediction. Hobbs and Shieber (1987) discuss the possibility of incorporating some sort
of ordering heuristics into the SRI scope generation system, in the hopes of producing
a ranked list of possible scope readings, but ultimately are forced to acknowledge that
“[t]he modifications turn out to be quite complicated if we wish to order quantifiers
according to lexical heuristics, such as having each out-scope some. Because of the
recursive nature of the algorithm, there are limits to the amount of ordering that can
be done in this manner” (page 55). The stepwise nature of these scope mechanisms
makes it hard to state the factors that influence the preference for one quantifier to
take scope over another.
Those natural language processing (NLP) systems that have managed to provide
some sort of account of quantifier scope preferences have done so by using a separate
system of heuristics (or scope critics) that apply postsyntactically to determine the most
likely scoping. LUNAR (Woods 1986), TEAM (Martin, Appelt, and Pereira 1986), and
the SRI Core Language Engine as described by Moran (1988; Moran and Pereira 1992)
all employ scope rules of this sort. By and large, these rules are of an ad hoc nature,
implementing a linguist’s intuitive idea of what factors determine scope possibilities,
and no results have been published regarding the accuracy of these methods. For
example, Moran (1988) incorporates rules from other NLP systems and from VanLehn
(1978), such as a preference for a logically weaker interpretation, the tendency for each
to take wide scope, and a ban on raising a quantifier across multiple major clause
boundaries. The testing of Moran’s system is “limited to checking conformance to
the stated rules” (pages 40–41). In addition, these systems are generally incapable of
handling unrestricted text such as that found in the Wall Street Journal corpus in a
robust way, because they need to do a full semantic analysis of a sentence in order
to make scope predictions. The statistical basis of the model presented in this article
offers increased robustness and the possibility of more serious evaluation on the basis
of corpus data.
4.2 Data
The data on which the quantifier scope classifiers are trained and tested is an extract
from the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) that we have
tagged to indicate the most salient scope interpretation of each sentence in context.
Figure 3 shows an example of a training sentence with the scope reading indicated.
The quantifier lower in the tree bears the tag “Q1,” and the higher quantifier bears the
tag “Q2,” so this sentence is interpreted such that the lower quantifier has wide scope.
Reversing the tags would have meant that the higher quantifier takes wide scope,
while if both quantifiers had been marked "Q1," this would have indicated that there
is no scope interaction between them (as when they are logically independent or take
scope in different conjuncts of a conjoined phrase).2
The sentences tagged were chosen from the Wall Street Journal (WSJ) section of
the Penn Treebank to have a certain set of attributes that simplify the task of design-
ing the quantifier scope module of the grammar. First, in order to simplify the coding
process, each sentence has exactly two scope-taking elements of the sort considered
for this project.3 These include most NPs that begin with a determiner, predeterminer,
or quantifier phrase (QP)4 but exclude NPs in which the determiner is a, an, or the.
2 This “no interaction” class is a sort of “elsewhere” category that results from phrasing the classification
question as “Which quantifier takes wider scope in the preferred reading?” Where there is no scope
interaction, the answer is “neither.” This includes cases in which the relative scope of operators does
not correspond to a difference in meaning, as in One woman bought one horse, or when they take scope
in different propositional domains, such as in Mary bought two horses and sold three sheep. The human
coders used in this study were instructed to choose class 0 whenever there was not a clear preference
for one of the two scope readings.
3 This restriction that each sentence contain only two quantified elements does not actually exclude
many sentences from consideration. We identified only 61 sentences with three quantifiers of the sort
we consider and 12 sentences with four. In addition, our review of these sentences revealed that many
of them simply involve lists in which the quantifiers do not interact in terms of scope (as in, for
example, “We ask that you turn off all cell phones, extinguish all cigarettes, and open any candy before
the performance begins”). Thus, the class of sentences with more than two quantifiers is small and
seems to involve even simpler quantifier interactions than those found in our corpus.
4 These categories are intended to be understood as they are used in the tagging and parsing of the Penn
Treebank. See Santorini (1990) and Bies et al. (1995) for details; the Appendix lists selected codes used
( (S
(NP-SBJ
(NP (DT Those) )
(SBAR
(WHNP-1 (WP who) )
(S
(NP-SBJ-2 (-NONE- *T*-1) )
(ADVP (RB still) )
(VP (VBP want)
(S
(NP-SBJ (-NONE- *-2) )
(VP (TO to)
(VP (VB do)
(NP (PRP it) ))))))))
(‘‘ ‘‘)
(VP (MD will)
(ADVP (RB just) )
(VP (VB find)
(NP
(NP (DT-Q2 some) (NN way) )
(SBAR
(WHADVP-3 (-NONE- 0) )
(S
(NP-SBJ (-NONE- *) )
(VP (TO to)
(VP (VB get)
(PP (IN around) (’’ ’’)
(NP (DT-Q1 any) (NN attempt)
(S
(NP-SBJ (-NONE- *) )
(VP (TO to)
(VP (VB curb)
(NP (PRP it) ))))))
(ADVP-MNR (-NONE- *T*-3) ))))))))
(. .) ))
Figure 3
Tagged Wall Street Journal text from the Penn Treebank. The lower quantifier takes wide
scope, indicated by its tag “Q1.”
Excluding these determiners from consideration largely avoids the problem of generics
and the complexities of assigning scope readings to definite descriptions. In addi-
tion, only sentences that had the root node S were considered. This serves to exclude
sentence fragments and interrogative sentence types. Our data set therefore differs
systematically from the full WSJ corpus, but we believe it is sufficient to allow many
generalizations about English quantification to be induced. Given these restrictions on
the input data, the task of the scope classifier is a choice among three alternatives:5
class 1 (the first quantified expression, in linear order, takes wide scope), class 2 (the
second takes wide scope), or class 0 (no scope interaction between the two).
for annotating the Penn Treebank corpus. The category QP is particularly unintuitive in that it does not
correspond to a quantified noun phrase, but to a measure expression, such as more than half.
5 Some linguists may find it strange that we have chosen to treat the choice of preferred scoping for two
quantified elements as a tripartite decision, since the possibility of independence is seldom treated in
the linguistic literature. As we are dealing with corpus data in this experiment, we cannot afford to
ignore this possibility.
The result is a set of 893 sentences,6 annotated with Penn Treebank II parse trees and
hand-tagged for the primary scope reading.
To assess the reliability of the hand-tagged data used in this project, the data were
coded a second time by an independent coder, in addition to the reference coding.
The independent codings agreed with the reference coding on 76.3% of sentences. The
kappa statistic (Cohen 1960) for agreement was .52, with a 95% confidence interval
between .40 and .64. Krippendorff (1980) has been widely cited as advocating the
view that kappa values greater than .8 should be taken as indicating good reliability,
with values between .67 and .8 indicating tentative reliability, but we are satisfied
with the level of intercoder agreement on this task. As Carletta (1996) notes, many
tasks in computational linguistics are simply more difficult than the content analysis
classifications addressed by Krippendorff, and according to Fleiss (1981), kappa values
between .4 and .75 indicate fair to good agreement anyhow.
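For concreteness, Cohen's kappa corrects raw agreement for the agreement expected by chance given each coder's marginal class frequencies; a minimal computation over the three scope classes (with invented codings, not our actual data) is sketched below.

from collections import Counter

# Cohen's kappa for two coders over the scope classes 0, 1, and 2.
def cohen_kappa(coder_a, coder_b):
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: probability that both coders pick the same class at random.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(coder_a) | set(coder_b))
    return (observed - expected) / (1 - expected)

# Invented codings for ten sentences, for illustration only.
coder_a = [0, 0, 1, 1, 2, 0, 1, 0, 2, 1]
coder_b = [0, 1, 1, 1, 2, 0, 0, 0, 2, 2]
print(round(cohen_kappa(coder_a, coder_b), 2))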
Discussion between the coders revealed that there was no single cause for their dif-
ferences in judgments when such differences existed. Many cases of disagreement stem
from different assumptions regarding the lexical quantifiers involved. For example, the
coders sometimes differed on whether a given instance of the word any corresponds
to a narrow-scope existential, as we conventionally treat it when it is in the scope of
negation, or the “free-choice” version of any. To take another example, two universal
quantifiers are independent in predicate calculus (∀x∀y[φ] ⇐⇒ ∀y∀x[φ]), but in creat-
ing our scope-tagged corpus, it was often difficult to decide whether two universal-like
English quantifiers (such as each, any, every, and all) were actually independent in a
given sentence. Some differences in coding stemmed from coder disagreements about
whether a quantifier within a fixed expression (e.g., all the hoopla) truly interacts with
other operators in the sentence. Of course, another major factor contributing to inter-
coder variation is the fact that our data sentences, taken from Wall Street Journal text,
are sometimes quite long and complex in structure, involving multiple scope-taking
operators in addition to the quantified NPs. In such cases, the coders sometimes had
difficulty clearly distinguishing the readings in question.
Because of the relatively small amount of data we had, we used the technique of
tenfold cross-validation in evaluating our classifiers, in each case choosing 89 of the
893 total sentences as a test set and training on the remaining 804.
We preprocessed the data in order to extract the information from each sentence that
we would be treating as relevant to the prediction of quantifier scoping in this project.
(Although the initial coding of the preferred scope reading for each sentence was done
manually, this preprocessing of the data was done automatically.) At the end of this
preprocessing, each sentence was represented as a record containing the following
information (see the Appendix for a list of annotation codes for Penn Treebank):
6 These data have been made publicly available to all licensees of the Penn Treebank by means of a
patch file that may be retrieved from http://humanities.uchicago.edu/linguistics/students/dchiggin/
qscope-data.tgz. This file also includes the coding guidelines used for this project.
class: 2
first cat: DT
first head: some
second cat: DT
second head: any
join cat: NP
first c-commands: YES
second c-commands: NO
nodes intervening: 6
VP intervenes: YES
ADVP intervenes: NO
...
S intervenes: YES
conj intervenes: NO
, intervenes: NO
: intervenes: NO
...
” intervenes: YES
Figure 4
Example record corresponding to the sentence shown in Figure 3.
Figure 4 illustrates how these features would be used to encode the example in Fig-
ure 3.
The items of information included in the record, as listed above, are not the exact
factors that Kuno, Takami, and Wu (1999) suggest be taken into consideration in mak-
ing scope predictions, and they are certainly not sufficient to determine the proper
scope reading for all sentences completely. Surely pragmatic factors and real-world
knowledge influence our interpretations as well, although these are not represented
here. This list does, however, provide information that could potentially be useful in
predicting the best scope reading for a particular sentence. For example, information
7 We take a node α to intervene between two other nodes β and γ in a tree if and only if, where δ is
the lowest node dominating both β and γ, either δ dominates α or δ = α, and α dominates either β or γ.
Table 2
Baseline performance, summed over all ten test sets.
about whether one quantified NP in a given sentence c-commands the other corre-
sponds to Kuno, Takami, and Wu’s observation that subject quantifiers tend to take
wide scope over object quantifiers and topicalized quantifiers tend to outscope ev-
erything. The identity of each lexical quantifier clearly should allow our classifiers to
make the generalization that each tends to take wide scope, if this word is found in
the data, and perhaps even learn the regularity underlying Kuno, Takami, and Wu’s
observation that universal quantifiers tend to outscope existentials.
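To make the structural features concrete, here is a schematic sketch (ours, not the preprocessing code actually used for the project) of how c-command and footnote 7's notion of an intervening node can be computed when tree nodes are identified by paths of child indices from the root:

# Schematic extraction of two structural features. A node is identified by its
# path from the root, e.g. the subject NP of an S -> NP VP tree is (0,) and the
# object NP inside VP -> V NP is (1, 1). This is an illustrative sketch only.
def dominates(a, b, proper=False):
    """Path a dominates path b iff a is a prefix of b."""
    return b[:len(a)] == a and (not proper or len(a) < len(b))

def c_commands(a, b):
    """a c-commands b iff a's mother dominates b but a itself does not dominate b."""
    return dominates(a[:-1], b) and not dominates(a, b)

def intervenes(alpha, beta, gamma):
    """Footnote 7: alpha intervenes between beta and gamma iff the lowest node
    dominating both beta and gamma dominates alpha (or is alpha), and alpha
    properly dominates beta or gamma."""
    i = 0
    while i < min(len(beta), len(gamma)) and beta[i] == gamma[i]:
        i += 1
    delta = beta[:i]                     # lowest common ancestor of beta and gamma
    return dominates(delta, alpha) and (dominates(alpha, beta, proper=True)
                                        or dominates(alpha, gamma, proper=True))

subj, vp, obj = (0,), (1,), (1, 1)       # toy S -> NP VP, VP -> V NP configuration
print(c_commands(subj, obj))             # True: the subject c-commands the object
print(c_commands(obj, subj))             # False
print(intervenes(vp, subj, obj))         # True: VP intervenes between the two NPs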
8 The implementations of these classifiers are publicly available as Perl modules at http://humanities.
uchicago.edu/linguistics/students/dchiggin/classifiers.tgz.
Table 3
Performance of the naive Bayes classifier, summed over all 10 test runs.
4.3.1 Naive Bayes Classifier. Our data D will consist of a vector of features (d0 · · · dn )
that represent aspects of the sentence under examination, such as whether one quan-
tified expression c-commands the other, as described in Section 4.2. The fundamental
simplifying assumption that we make in designing a naive Bayes classifier is that
these features are independent of one another and therefore can be aggregated as in-
dependent sources of evidence about which class c∗ a given sentence belongs to. This
independence assumption is formalized in equations (1) and (2).
c∗ = argmax_c P(c | d0 · · · dn)                                    (1)
   ≈ argmax_c P(c) ∏_{k=0}^{n} P(dk | c)                            (2)
9 We include the term P(f ) in the product in order to prevent sparsely instantiated features from
showing up as highly-ranked.
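A minimal sketch of such a classifier over categorical features of the kind listed in Figure 4 (our own illustration; the feature names, the toy data, and the add-alpha smoothing are placeholders rather than details of the actual system):

from collections import Counter, defaultdict
from math import log

# Naive Bayes over categorical sentence features, following equation (2):
# choose the class c maximizing P(c) * prod_k P(d_k | c).
def train(records, labels):
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)            # (class, feature) -> value counts
    for record, c in zip(records, labels):
        for feature, value in record.items():
            feature_counts[(c, feature)][value] += 1
    return class_counts, feature_counts

def classify(record, class_counts, feature_counts, alpha=0.5):
    n = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, count in class_counts.items():
        score = log(count / n)                        # log P(c)
        for feature, value in record.items():
            counts = feature_counts[(c, feature)]
            # Add-alpha smoothing so unseen feature values do not zero out a class.
            score += log((counts[value] + alpha) /
                         (sum(counts.values()) + alpha * (len(counts) + 1)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical training records in the style of Figure 4 (class 1 = first wide scope,
# class 0 = no scope interaction).
records = [{"first head": "each", "first c-commands": "YES"},
           {"first head": "some", "first c-commands": "NO"},
           {"first head": "each", "first c-commands": "YES"}]
labels = [1, 0, 1]
model = train(records, labels)
print(classify({"first head": "each", "first c-commands": "YES"}, *model))   # -> 1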
Table 4
Most active features from naive Bayes classifier.
whereas feature 6 indicates a preference for class 0 under the same conditions. Pre-
sumably, this reflects a dispreference for the second quantifier to take wide scope
when there is a clause boundary intervening between it and the first quantifier. The
fourth-ranked feature in Table 4 indicates that, if the first quantified NP does not
c-command the second, it is less likely to take wide scope. This is not surprising,
given the importance that c-command relations have had in theoretical discussions
of quantifier scope. The fifth-ranked feature expresses a preference for quantified ex-
pressions of category QP to take narrow scope, if they are the second of the two
quantifiers under consideration. This may simply be reflective of the fact that class
1 is more common than class 2, and the measure expressions found in QP phrases
in the Penn Treebank (such as more than three or about half ) tend not to be logically
independent of other quantifiers. Finally, the feature 15 in Table 4 indicates a high
correlation between the second quantified expression’s c-commanding the first and
the second quantifier’s taking wide scope. We can easily see this as a translation into
our feature set of Kuno, Takami, and Wu’s claim that subjects tend to outscope ob-
jects and obliques and topicalized elements tend to take wide scope. Some of these
top-ranked features have to do with information found only in the written medium,
but on the whole, the features induced by the naive Bayes classifier seem consis-
tent with those suggested by Kuno, Takami, and Wu, although they are distinct by
necessity.
This classifier superficially resembles in form the naive Bayes classifier in equation (2),
but it differs from that classifier in that the way in which values for each α are chosen
does not assume that the features in the data are independent. For each of the 10
training sets, we used the generalized iterative scaling algorithm to train this classifier
on 654 training examples, using 150 examples for validation to choose the best set of
10 Z in Equation 3 is simply a normalizing constant that ensures that we end up with a probability
distribution.
Table 5
Performance of the maximum-entropy classifier, summed over all 10 test runs.
Table 6
Most active features from maximum-entropy classifier.
values for the αs.11 Test data could then be classified by choosing the class for the data
that maximizes the joint probability in equation (3).
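Equation (3) itself is not reproduced in this extract. Given the references to feature weights α and to the normalizing constant Z in footnote 10, it presumably has the standard log-linear form sketched below; this is a hedged reconstruction rather than the authors' exact formulation.

\[
P(c,\, d_0 \cdots d_n) \;=\; \frac{1}{Z}\,\prod_{j} \alpha_j^{\,f_j(c,\; d_0 \cdots d_n)}
\]

where each f_j is a binary feature function over a class–feature combination, α_j is the weight assigned to it by generalized iterative scaling, and Z normalizes the distribution.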
The results of training with the maximum-entropy classifier are shown in Table 5.
The classifier showed slightly higher performance than the naive Bayes classifier, with
the lowest error rate on the class of sentences having no scope interaction.
To determine exactly which features of the data the maximum-entropy classifier
sees as relevant to the classification problem, we can simply look at the α values (from
equation (3)) for each feature. Those features with higher values for α are weighted
more heavily in determining the proper scoping. Some of the features with the highest
values for α are listed in Table 6. Because of the way the classifier is built, predictor
features for class 2 need to have higher loadings to overcome the lower prior probabil-
ity of the class. Therefore, we actually rank the features in Table 6 according to α · P̂(c)^k
(which we denote as α_{c,k}). P̂(c) represents the empirical prior probability of a class c,
and k is simply a constant (.25 in this case) chosen to try to get a mix of features for
different classes at the top of the list.
The features ranked first and fifth in Table 6 express lexical preferences for certain
quantifiers to take wide scope, even when they are the second of the two quantifiers
according to linear order in the string of words. The tendency for each to take wide
scope is stronger than for the other quantifier, which is in line with Kuno, Takami,
and Wu’s decision to list it as the only quantifier with a lexical preference for scoping.
Feature 2 makes the “no scope interaction” class more likely if a comma intervenes, and
11 Overtraining is not a problem with the pure version of the generalized iterative scaling algorithm. For
efficiency reasons, however, we chose to take the training corpus as representative of the event space,
rather than enumerating the space exhaustively (see Jelinek [1998] for details). For this reason, it was
necessary to employ validation in training.
Table 7
Performance of the single-layer perceptron, summed over all 10 test runs.
feature 25 makes a wide-scope reading for the first quantifier more likely if there is no
intervening comma. The third-ranked feature expresses the tendency mentioned above
for quantifiers in conjoined clauses not to interact. Features 4 and 12 indicate that if the
first quantified expression c-commands the second, it is likely to take wide scope, and
that if this is not the case, there is likely to be no scope interaction. Finally, the sixth-
and seventh-ranked features in the table show that an intervening quotation mark or
colon will make the classifier tend toward class 0, “no scope interaction,” which is easy
to understand. Quotations are often opaque to quantifier scope interactions. The top
features found by the maximum-entropy classifier largely coincide with those found
by the naive Bayes model, which indicates that these generalizations are robust and
objectively present in the data.
4.3.3 Single-Layer Perceptron. For our neural network classifier, we employed a feed-
forward single-layer perceptron, with the softmax function used to determine the acti-
vation of nodes at the output layer, because this is a one-of-n classification task (Bridle
1990). The data to be classified are presented as a vector of features at the input layer,
and the output layer has three nodes, representing the three possible classes for the
data: “first has wide scope,” “second has wide scope,” and “no scope interaction.”
The output node with the highest activation is interpreted as the class of the datum
presented at the input layer.
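A minimal sketch of such a network (our own illustration using numpy; the feature dimensionality, learning rate, and toy example are placeholders rather than details of the actual experiments):

import numpy as np

# Single-layer feed-forward network with softmax outputs for the three scope
# classes: 0 = no interaction, 1 = first wide scope, 2 = second wide scope.
def softmax(z):
    z = z - z.max()                        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, W, b):
    """Map a binary feature vector x to a probability distribution over classes."""
    return softmax(W @ x + b)

def train_step(x, y, W, b, lr=0.1):
    """One step of error backpropagation (softmax + cross-entropy gradient)."""
    p = forward(x, W, b)
    target = np.eye(3)[y]
    return W - lr * np.outer(p - target, x), b - lr * (p - target)

rng = np.random.default_rng(0)
n_features = 4                             # e.g., first-c-commands, comma-intervenes, ...
W = rng.normal(scale=0.01, size=(3, n_features))
b = np.zeros(3)
x, y = np.array([1.0, 0.0, 1.0, 0.0]), 1   # toy feature vector with class 1
for _ in range(50):
    W, b = train_step(x, y, W, b)
print(forward(x, W, b).argmax())           # -> 1 after a few updates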
For each of the 10 test sets of 89 examples, we trained the connection weights
of the network using error backpropagation on 654 training sentences, reserving 150
sentences for validation in order to choose the weights from the training epoch with the
highest classification performance. In Table 7 we present the results of the single-layer
neural network in classifying our test sentences. As the table shows, the single-layer
perceptron has much better classification performance than the naive Bayes classifier
and maximum-entropy model, possibly because the training of the network aims to
minimize error in the activation of the classification output nodes, which is directly
related to the classification task at hand, whereas the other models do not directly
make use of the notion of “classification error.” The perceptron also uses a sort of
weighted voting and could be interpreted as an implementation of Kuno, Takami,
and Wu’s proposal for scope determination. This clearly illustrates that the tenability
of their proposal hinges on the exact details of its implementation, since all of our
classifier models are reasonable interpretations of their approach, but they have very
different performance results on our scope determination task.
To determine exactly which features of the data the network sees as relevant to
the classification problem, we can simply look at the connection weights for each
feature-class pair. Higher connection weights indicate a greater correlation between
input features and output classes. For one of the 10 networks we trained, some of
the features with the highest connection weights are listed in Table 8. Since class 0 is
Table 8
Most active features from single-layer perceptron.
simply more frequent in the training data than the other two classes, the weights for
this class tend to be higher. Therefore, we also list some of the best predictor features
for classes 1 and 2 in the table.
The first- and third-ranked features in Table 8 show that an intervening comma or
colon will make the classifier tend toward class 0, “no scope interaction.” This finding
by the classifier is similar to the maximum-entropy classifier’s finding an intervening
quotation mark relevant and can be taken as an indication that quantifiers in distant
syntactic subdomains are unlikely to interact. Similarly, the fourth-ranked feature indi-
cates that quantifiers in separate conjuncts are unlikely to interact. The second-ranked
feature in the table expresses a tendency for there to be no scope interaction between
two quantifiers if the second of them is headed by all. This may be related to the
independence of universal quantifiers (∀x∀y[φ] ⇐⇒ ∀y∀x[φ]). Feature 17 in Table 8
indicates a high correlation between the first quantified expression’s c-commanding
the second and the first quantifier’s taking wide scope, which again supports Kuno,
Takami, and Wu’s claim that scope preferences are related to syntactic superiority re-
lations. Feature 18 expresses a preference for a quantified expression headed by most
to take wide scope, even if it is the second of the two quantifiers (since most is the
only quantifier in the corpus that bears the tag RBS). Feature 19 indicates that the
first quantifier is more likely to take wide scope if there is a clause boundary in-
tervening between the two quantifiers, which supports the notion that the syntactic
distance between the quantifiers is relevant to scope preferences. Finally, feature 20
expresses the well-known tendency for quantified expressions headed by each to take
wide scope.
Table 9
Summary of classifier results.
As a further check on intercoder variation, we trained the single-layer network on the
data on which both coders agreed and tested it on the remaining sentences. This
classifier agreed with the reference coding (the coding of the first coder) 51.4% of the
time and with the additional independent coder 35.8% of the time. The first coder con-
structed the annotation guidelines for this project and may have been more successful
in applying them consistently. Alternatively, it is possible that different individuals use
different strategies in determining scope preferences, and the strategy of the second
coder may simply have been less similar than the strategy of the first coder to that of
the single-layer network.
These three classifiers directly implement a sort of weighted voting, the method
of aggregating evidence proposed by Kuno, Takami, and Wu (although the classifiers’
implementation is slightly more sophisticated than the unweighted voting that is ac-
tually used in Kuno, Takami, and Wu’s paper). Of course, since we do not use exactly
the set of features suggested by Kuno, Takami, and Wu, our model should not be
seen as a straightforward implementation of the theory outlined in their 1999 paper.
Nevertheless, the results in Table 9 suggest that Kuno, Takami, and Wu’s suggested
design can be used with some success in modeling scope preferences. Moreover, the
project undertaken here provides an answer to some of the objections that Aoun and
Li (2000) raise to Kuno, Takami, and Wu. Aoun and Li claim that Kuno, Takami, and
Wu’s choice of experts is seemingly arbitrary and that it is unclear how the voting
weights of each expert are to be set, but the machine learning approach we employ
in this article is capable of addressing both of these potential problems. Supervised
training of our classifiers is a straightforward approach to setting the weights and
also constitutes our approach to selecting features (or “experts” in Kuno, Takami, and
Wu’s terminology). In the training process, any feature that is irrelevant to scoping
preferences should receive weights that make its effect negligible.
In this section, we show how the classifier models of quantifier scope determination
introduced in Section 4 may be integrated with a PCFG model of syntax. We com-
pare two different ways in which the two components may be combined, which may
loosely be termed serial and parallel, and argue for the latter on the basis of empirical
results.
of syntactically provided features, such as the number of nodes of a certain type that
intervene between two quantifiers in a phrase structure tree.
Thus, the combined language model that we define in this article assigns probabil-
ities according to the pairs of structures that may be assigned to a sentence by the Q-
structure and phrase structure syntax modules. The probability of a word string w1−n
is therefore defined as in equation (4), where Q ranges over all possible Q-structures
in the set Q and S ranges over all possible syntactic structures in the set S.
P(w1−n) = Σ_{S∈S, Q∈Q} P(S, Q | w1−n)                               (4)

        = Σ_{S∈S, Q∈Q} P(S | w1−n) P(Q | S, w1−n)                   (5)
Equation (5) shows how we can use the definition of conditional probability to
break our calculation of the language model probability into two parts. The first of
these parts, P(S | w1−n ), which we may abbreviate as simply P(S), is the probability
of a particular syntactic tree structure’s being assigned to a particular word string. We
model this probability using a probabilistic phrase structure grammar (cf. Charniak
[1993, 1996]). The second distribution on the right side of equation (5) is the conditional
probability of a particular quantifier scope structure’s being assigned to a particular
word string, given the syntactic structure of that string. This probability is written as
P(Q | S, w1−n ), or simply P(Q | S), and represents the quantity we estimated above
in constructing classifiers to predict the scopal representation of a sentence based on
aspects of its syntactic structure.
Thus, given a PCFG model of syntactic structure and a probabilistically defined
classifier of the sort introduced in Section 4, it is simple to determine the probability
of any pairing of two particular structures from each domain for a given sentence.
We simply multiply the values of P(S) and P(Q | S) to obtain the joint probability
P(Q, S). In the current section, we examine two different models of combination for
these components: one in which scope determination is applied to the optimal syn-
tactic structure (the Viterbi parse), and one in which optimization is performed in the
space of both modules to find the optimal pairing of syntactic and quantifier scope
structures.
Figure 5
A simple phrase structure tree.
Table 10
A simple probabilistic phrase structure grammar.
Rule Probability
S → NP VP .7
S → VP .2
S → V NP VP .1
VP → V VP .3
VP → ADV VP .1
VP → V .1
VP → V NP .3
VP → V NP NP .2
NP → Susan .3
NP → you .4
NP → Yves .3
V → might .2
V → believe .3
V → show .3
V → stay .2
ADV → not .5
ADV → always .5
The actual grammar rules and associated probabilities that we use in defining our
syntactic module are derived from the WSJ corpus of the Penn Treebank by maximum-
likelihood estimation. That is, for each rule N → φ used in the treebank, we add the
rule to the grammar and set its probability to C(N → φ) / Σ_ψ C(N → ψ), where C(·) denotes the
"count" of a rule (i.e., the number of times it is used in the corpus). A grammar
composed in this manner is referred to as a treebank grammar, because its rules are
directly derived from those in a treebank corpus.
We used sections 00–20 of the WSJ corpus of the Penn Treebank for collecting the
rules and associated probabilities of our PCFG, which is implemented as a bottom-up
chart parser. Before constructing the grammar, the treebank was preprocessed using
known procedures (cf. Krotov et al. [1998]; Belz [2001]) to facilitate the construction of
a rule list. Functional and anaphoric annotations (basically anything following a “-”
in a node label; cf. Santorini [1990]; Bies et al. [1995]) were removed from nonterminal
labels. Nodes that dominate only “empty categories” such as traces were removed.
In addition, unary-branching constructions were removed by replacing the mother
category in such a structure with the daughter node. (For example, given an instance
of the rule X → YZ, if the daughter category Y were expanded by the unary rule
Y → W, our algorithm would induce the single rule X → WZ.) Finally, we discarded
all rules that had more than 10 symbols on the right-hand side (an arbitrary limit of
our parser implementation). This resulted in a total of 18,820 rules, of which 11,156
were discarded as hapax legomena, leaving 7,664 rules in our treebank grammar.
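A schematic version of this extraction procedure (our own sketch: trees are assumed to be nested tuples of the form ("label", child, ...), and the removal of empty-category nodes described above is omitted for brevity):

from collections import Counter, defaultdict

# Sketch of treebank-grammar extraction by maximum-likelihood estimation.
def resolved_label(tree):
    """Strip functional annotations; for unary branches, use the daughter's label."""
    children = [c for c in tree[1:] if isinstance(c, tuple)]
    if len(children) == 1:
        return resolved_label(children[0])
    return tree[0].split("-")[0]

def rules_from_tree(tree):
    children = [c for c in tree[1:] if isinstance(c, tuple)]
    if len(children) == 1:                     # unary branch: rule induced lower down
        yield from rules_from_tree(children[0])
        return
    if children and len(children) <= 10:       # drop overly long right-hand sides
        yield (tree[0].split("-")[0], tuple(resolved_label(c) for c in children))
    for child in children:
        yield from rules_from_tree(child)

def treebank_grammar(trees, min_count=2):
    """Relative-frequency rule probabilities, discarding hapax legomena."""
    counts = Counter(rule for tree in trees for rule in rules_from_tree(tree))
    lhs_totals = defaultdict(int)
    for (lhs, rhs), c in counts.items():
        lhs_totals[lhs] += c
    return {(lhs, rhs): c / lhs_totals[lhs]
            for (lhs, rhs), c in counts.items() if c >= min_count}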
Table 11 shows some of the rules in our grammar with the highest and lowest corpus
counts.
Table 11
Rules derived from sections 00–20 of the Penn Treebank WSJ corpus. “TOP” is a special “start”
symbol that may expand to any of the symbols found at the root of a tree in the corpus.
The question, then, is whether the optimal syntactic structure should be chosen first and a scope
reading then selected for it, or whether parse and scope reading should be chosen jointly so as
to maximize the probability of the pairing. That is, are syntax and
quantifier scope mutually dependent components of grammar, or can scope relations
be “read off of” syntax? The serial model suggests that the optimal syntactic structure
τ ∗ should be chosen on the basis of the syntactic module only, as in equation (9),
and the optimal quantifier scope structure χ∗ then chosen on the basis of τ ∗ , as in
equation (10). The parallel model, on the other hand, suggests that the most likely
pairing of structures must be chosen in the joint probability space of both components,
as in equation (11).
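Equations (9)–(11) are not reproduced in this extract; given the surrounding description and the module probabilities P_S and P_Q referred to below, they presumably take the following form (a hedged reconstruction, not necessarily the authors' exact notation):

\begin{align*}
\tau^{*} &= \arg\max_{\tau}\; P_S(\tau \mid w_{1-n}) && \text{(9: serial, parse chosen first)}\\
\chi^{*} &= \arg\max_{\chi}\; P_Q(\chi \mid \tau^{*}, w_{1-n}) && \text{(10: serial, scope given that parse)}\\
(\tau^{*}, \chi^{*}) &= \arg\max_{(\tau,\, \chi)}\; P_S(\tau \mid w_{1-n})\, P_Q(\chi \mid \tau, w_{1-n}) && \text{(11: parallel, joint choice)}
\end{align*}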
5.3.1 Experimental Design. For this experiment, we implement the scoping compo-
nent as a single-layer feed-forward network, because the single-layer perceptron clas-
sifier had the best prediction rate among the three classifiers tested in Section 4. The
softmax activation function we use for the output nodes of the classifier guarantees
that the activations of all of the output nodes sum to one and can be interpreted as
class probabilities. The syntactic component, of course, is determined by the treebank
PCFG grammar described above.
Given these two models, which respectively define P_Q(χ | τ, w1−n) and P_S(τ | w1−n)
from equation (11), it remains only to specify how to search the space of pairings
(τ , χ) in performing this optimization to find χ∗ . Unfortunately, it is not feasible to
examine all values τ ∈ S, since our PCFG will generally admit a huge number of
Table 12
Performance of models on the unlabeled scope prediction task, summed over all 10 test runs.
trees for a sentence (especially given a mean sentence length of over 20 words in
the WSJ corpus).12 Our solution to this search problem is to make the simplifying as-
sumption that the syntactic tree that is used in the optimal set of structures (τ ∗ , χ∗ )
will always be among the top few trees τ for which P_S(τ | w1−n) is the greatest.
That is, although we suppose that quantifier scope information is relevant to pars-
ing, we do not suppose that it is so strong a determinant as to completely over-
ride syntactic factors. In practice, this means that our parser will return the top 10
parses for each sentence, along with the probabilities assigned to them, and these
are the only parses that are considered in looking for the optimal set of linguistic
structures.
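Concretely, the two search procedures compared below can be sketched as follows (the helper names top_parses and scope_probs are placeholders standing in for the PCFG parser and the trained perceptron, respectively):

# Parallel combination: choose the (parse, scope) pair maximizing
# P_S(tau | w) * P_Q(chi | tau, w), searching only the top-k parses.
def predict_scope_parallel(sentence, top_parses, scope_probs, k=10):
    """top_parses(sentence, k) -> [(tree, P_S)]; scope_probs(tree) -> {0, 1, 2: P_Q}."""
    best_pair, best_prob = None, float("-inf")
    for tree, p_syntax in top_parses(sentence, k):
        for scope_class, p_scope in scope_probs(tree).items():
            if p_syntax * p_scope > best_prob:
                best_pair, best_prob = (tree, scope_class), p_syntax * p_scope
    return best_pair

# Serial combination: fix the Viterbi parse, then choose the best scope class for it.
def predict_scope_serial(sentence, top_parses, scope_probs):
    viterbi_tree, _ = top_parses(sentence, 1)[0]
    probs = scope_probs(viterbi_tree)
    return viterbi_tree, max(probs, key=probs.get)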
We again used 10-fold cross-validation in evaluating the competing models, di-
viding the scope-tagged corpus into 10 test sections of 89 sentences each, and we
used the same version of the treebank grammar for our PCFG. The first model re-
trieved the top 10 syntactic parses (τ0 · · · τ9 ) for each sentence and computed the
probability P(τ, χ) for each τ ∈ {τ0, . . . , τ9} and χ ∈ {0, 1, 2}, choosing that scopal represen-
tation χ that was found in the maximum-probability pairing. We call this the par-
allel model, because the properties of each probabilistic model may influence the
optimal structure chosen by the other. The second model retrieved only the Viterbi
parse τ0 from the PCFG and chose the scopal representation χ for which the pair-
ing (τ0 , χ) took on the highest probability. We call this the serial model, because it
represents syntactic phrase structure as independent of other components of gram-
mar (in this case, quantifier scope), though other components are dependent
upon it.
5.3.2 Results. There was an appreciable difference in performance between these two
models on the quantifier scope test sets. As shown in Table 12, the parallel model
narrowly outperformed the serial model, by 1.2%. A 10-fold paired t-test on the test
sections of the scope-tagged corpus shows that the parallel model is significantly better
(p < .05).
12 Since we are allowing χ to range only over the three scope readings (0, 1, 2), however, it is possible to
enumerate all values of χ to be paired with a given syntactic tree τ .
6. Conclusion
Appendix: Selected Codes Used to Annotate Syntactic Categories in the Penn Tree-
bank, from Marcus et al. (1993) and Bies et al. (1995)
Part-of-speech tags
CC    Conjunction
CD    Cardinal number
DT    Determiner
IN    Preposition
JJ    Adjective
JJR   Comparative adjective
JJS   Superlative adjective
NN    Singular or mass noun
NNS   Plural noun
NNP   Singular proper noun
NNPS  Plural proper noun
PDT   Predeterminer
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
RBR   Comparative adverb
RBS   Superlative adverb
TO    "to"
UH    Interjection
VB    Verb in base form
VBD   Past-tense verb
VBG   Gerundive verb
VBN   Past participial verb
VBP   Non-3sg, present-tense verb
VBZ   3sg, present-tense verb
WP    WH pronoun
WP$   Possessive WH pronoun
Phrasal categories