COS 484: Natural Language Processing
Constituency Parsing
Fall 2019
(Some slides adapted from Chris Manning, Mike Collins)
Overview
• Constituency structure vs dependency structure
• Context-free grammar (CFG)
• Probabilistic context-free grammar (PCFG)
• The CKY algorithm
• Evaluation
• Lexicalized PCFGs
Syntactic structure: constituency and dependency
Two views of linguistic structure
• Constituency
• = phrase structure grammar
• = context-free grammars (CFGs)
• Dependency
Constituency structure
• Phrase structure organizes words into nested constituents
• Starting units: words are given a category: part-of-speech tags
the, cuddly, cat, by, the, door
Det, Adj, N, P, Det, N
• Words combine into phrases with categories
the cuddly cat, by the door
NP → Det Adj N      PP → P NP
• Phrases can combine into bigger phrases recursively
the cuddly cat by the door
NP→ NP PP
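To make the nesting concrete, here is a minimal sketch (a hypothetical nested-tuple encoding, not notation from the slides) of "the cuddly cat by the door" as nested constituents:

```python
# Each constituent is (label, child_1, ..., child_k); pre-terminals are (tag, word).
tree = ("NP",
        ("NP", ("Det", "the"), ("Adj", "cuddly"), ("N", "cat")),
        ("PP", ("P", "by"),
               ("NP", ("Det", "the"), ("N", "door"))))

def yield_of(t):
    """Read the words back off the tree, left to right."""
    if isinstance(t[1], str):                    # pre-terminal: (tag, word)
        return [t[1]]
    return [w for child in t[1:] for w in yield_of(child)]

print(yield_of(tree))  # ['the', 'cuddly', 'cat', 'by', 'the', 'door']
```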
(Dependency structure: this Thursday)
Dependency structure
• Dependency structure shows which words depend on (modify or
are arguments of) which other words.
[Figure: dependency arcs (nsubj, dobj, nmod, case) over "Satellites spot whales from space"; a second analysis with a different attachment of "from space" is marked ❌ as incorrect.]
Why do we need sentence structure?
• We need to understand sentence structure in order to be able to
interpret language correctly
• Humans communicate complex ideas by composing words together
into bigger units
• We need to know what is connected to what
Syntactic parsing
• Syntactic parsing is the task of recognizing a sentence and
assigning a structure to it.
Input: Output:
Boeing is located in Seattle.
Syntactic parsing
• Used as intermediate representation for downstream applications
English word order: subject — verb — object
Japanese word order: subject — object — verb
Image credit: http://vas3k.com/blog/machine_translation/
Syntactic parsing
• Used as intermediate representation for downstream applications
Image credit: (Zhang et al, 2018)
Context-free grammars
• The most widely used formal system for modeling
constituency structure in English and other natural languages
• A context-free grammar G = (N, Σ, R, S) where
• N is a set of non-terminal symbols
• Σ is a set of terminal symbols
• R is a set of rules of the form X → Y1Y2…Yn for n ≥ 1,
X ∈ N, Yi ∈ (N ∪ Σ)
• S ∈ N is a distinguished start symbol
A Context-Free Grammar for English
[Table: grammar rules (left) and lexicon (right)]
S: sentence, VP: verb phrase, NP: noun phrase, PP: prepositional phrase,
DT: determiner, Vi: intransitive verb, Vt: transitive verb, NN: noun, IN: preposition
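As a sketch of the 4-tuple definition G = (N, Σ, R, S), a toy fragment along these lines could be written as plain Python data (the particular rules are illustrative, not the full grammar on the slide):

```python
# A toy CFG G = (N, Sigma, R, S) in the sense of the definition above.
N = {"S", "NP", "VP", "PP", "DT", "NN", "Vi", "Vt", "IN"}
Sigma = {"the", "man", "dog", "telescope", "sleeps", "saw", "with"}
R = [
    ("S",  ("NP", "VP")),
    ("NP", ("DT", "NN")), ("NP", ("NP", "PP")),
    ("VP", ("Vi",)), ("VP", ("Vt", "NP")), ("VP", ("VP", "PP")),
    ("PP", ("IN", "NP")),
    ("DT", ("the",)),
    ("NN", ("man",)), ("NN", ("dog",)), ("NN", ("telescope",)),
    ("Vi", ("sleeps",)), ("Vt", ("saw",)), ("IN", ("with",)),
]
S = "S"
```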
(Left-most) Derivations
• Given a CFG G, a left-most derivation is a sequence of strings
s1, s2, …, sn, where
• s1 = S
• sn ∈ Σ*: all possible strings made up of words from Σ
• Each si for i = 2,…, n is derived from si−1 by picking the left-most
non-terminal X in si−1 and replacing it by some β where X → β ∈ R
• sn: yield of the derivation
(Left-most) Derivations
• s1 = S
• s2 = NP VP
• s3 = DT NN VP
• s4 = the NN VP
• s5 = the man VP
• s6 = the man Vi
• s7 = the man sleeps
A derivation can be represented as a parse tree!
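As a small sketch, the derivation above can be reproduced mechanically by always expanding the left-most non-terminal (the toy grammar below hard-codes one rule per non-terminal, so the choice of β is unambiguous):

```python
# Left-most derivation of "the man sleeps" under a toy grammar.
rules = {
    "S":  ("NP", "VP"),
    "NP": ("DT", "NN"),
    "DT": ("the",),
    "NN": ("man",),
    "VP": ("Vi",),
    "Vi": ("sleeps",),
}
nonterminals = set(rules)

s = ["S"]
print(" ".join(s))
while any(sym in nonterminals for sym in s):
    i = next(i for i, sym in enumerate(s) if sym in nonterminals)  # left-most non-terminal
    s = s[:i] + list(rules[s[i]]) + s[i + 1:]                      # replace it with the rule's RHS
    print(" ".join(s))
# Prints: S, NP VP, DT NN VP, the NN VP, the man VP, the man Vi, the man sleeps
```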
• A string s ∈ Σ* is in the language defined by the CFG if
there is at least one derivation whose yield is s
• The set of possible derivations may be finite or infinite
Ambiguity
• Some strings may have more than one derivation (i.e., more
than one parse tree!).
“Classical” NLP Parsing
• In fact, sentences can have a very large number of possible parses
The board approved [its acquisition] [by Royal Trustco Ltd.] [of
Toronto] [for $27 a share] [at its monthly meeting].
((ab)c)d   (a(bc))d   (ab)(cd)   a((bc)d)   a(b(cd))
The number of binary bracketings grows as the Catalan number: Cn = (1/(n+1)) · (2n choose n)
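A quick sanity check of that growth (a sketch using Python's exact integer binomial):

```python
from math import comb

def catalan(n):
    # C_n = (1 / (n + 1)) * C(2n, n): the number of binary bracketings of n + 1 items
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(1, 11)])
# [1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796]
```

For the a b c d example above (four items), C3 = 5 matches the five bracketings listed.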
• It is also difficult to construct a grammar with enough coverage
• A less constrained grammar can parse more sentences, but
results in more parses for even simple sentences
• There is no way to choose the right parse!
Statistical parsing
• Learning from data: treebanks
• Adding probabilities to the rules: probabilistic CFGs (PCFGs)
Treebanks: a collection of sentences paired with their parse trees
The Penn Treebank Project (Marcus et al, 1993)
Treebanks
• Standard setup (WSJ portion of Penn Treebank):
• 40,000 sentences for training
• 1,700 for development
• 2,400 for testing
• Why build a treebank instead of a grammar?
• Broad coverage
• Frequencies and distributional information
• A way to evaluate systems
Probabilistic context-free grammars (PCFGs)
• A probabilistic context-free grammar (PCFG) consists of:
• A context-free grammar: G = (N, Σ, R, S)
• For each rule α → β ∈ R, there is a parameter q(α → β) ≥ 0. For any X ∈ N,
∑_{α → β ∈ R : α = X} q(α → β) = 1
Probabilistic context-free grammars (PCFGs)
For any derivation (parse tree) containing the rules α1 → β1, α2 → β2, …, αl → βl, the probability of the parse is:
P(t) = ∏_{i=1}^{l} q(αi → βi)
P(t) = q(S → NP VP) × q(NP → DT NN) × q(DT → the)
× q(NN → man) × q(VP → Vi) × q(Vi → sleeps)
= 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0 = 0.084
Why do we want ∑_{α → β : α = X} q(α → β) = 1?
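As a sketch, the parse probability above is just the product of the rule parameters read off the tree (values taken from the example PCFG):

```python
from math import prod

# q(alpha -> beta) values from the example
q = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.3,
    ("DT", ("the",)):     1.0,
    ("NN", ("man",)):     0.7,
    ("VP", ("Vi",)):      0.4,
    ("Vi", ("sleeps",)):  1.0,
}

# The rules used in the parse tree of "the man sleeps" (each appears once here)
rules_in_tree = list(q.keys())

p_t = prod(q[r] for r in rules_in_tree)
print(round(p_t, 3))   # 0.084
```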
Deriving a PCFG from a treebank
• Training data: a set of parse trees t1, t2, …, tm
• A PCFG (N, Σ, S, R, q):
• N is the set of all non-terminals seen in the trees
• Σ is the set of all words seen in the trees
• S is taken to be S.
• R is taken to be the set of all rules α → β seen in the trees
• The maximum-likelihood parameter estimates are:
qML(α → β) = Count(α → β) / Count(α)
If we have seen the rule VP → Vt NP 105 times, and the non-terminal
VP 1000 times, then qML(VP → Vt NP) = 105/1000 = 0.105
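A minimal sketch of the maximum-likelihood estimate: count every rule occurrence in the treebank trees and divide by the count of its left-hand-side non-terminal (the counts below are the VP example from the slide):

```python
from collections import Counter

rule_count = Counter()   # Count(alpha -> beta)
lhs_count = Counter()    # Count(alpha)

# Pretend these counts were read off the treebank:
rule_count[("VP", ("Vt", "NP"))] = 105
lhs_count["VP"] = 1000

def q_ml(lhs, rhs):
    return rule_count[(lhs, rhs)] / lhs_count[lhs]

print(q_ml("VP", ("Vt", "NP")))   # 0.105
```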
Parsing with PCFGs
• Given a sentence s and a PCFG, how to find the highest scoring
parse tree for s?
arg max_{t ∈ 𝒯(s)} P(t)
• The CKY algorithm: applies to a PCFG in Chomsky normal
form (CNF)
• Chomsky Normal Form (CNF): all the rules take one
of the two following forms:
• X → Y1Y2 where X ∈ N, Y1 ∈ N, Y2 ∈ N
• X → Y where X ∈ N, Y ∈ Σ
• It is possible to convert any PCFG into an equivalent grammar in CNF!
• However, the trees will look different; it is possible to do a “reverse
transformation” back afterwards
Converting PCFGs into a CNF grammar
• n-ary rules (n > 2): NP → DT NNP VBG NN
• Unary rules: VP → Vi, Vi → sleeps
• Eliminate all the unary rules recursively by adding VP → sleeps
• We will come back to this later!
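A sketch of one common way to binarize an n-ary rule by introducing intermediate symbols (the @-style naming is just one convention; it is what makes the “reverse transformation” back to the original trees possible):

```python
def binarize(lhs, rhs):
    """Turn X -> Y1 Y2 ... Yn (n > 2) into an equivalent chain of binary rules."""
    rules, current = [], lhs
    for i in range(len(rhs) - 2):
        new_sym = "@%s_%s" % (lhs, "_".join(rhs[:i + 1]))   # intermediate symbol
        rules.append((current, (rhs[i], new_sym)))
        current = new_sym
    rules.append((current, tuple(rhs[-2:])))
    return rules

print(binarize("NP", ["DT", "NNP", "VBG", "NN"]))
# [('NP', ('DT', '@NP_DT')), ('@NP_DT', ('NNP', '@NP_DT_NNP')), ('@NP_DT_NNP', ('VBG', 'NN'))]
```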
The CKY algorithm
• Dynamic programming
• Given a sentence x1, x2, …, xn, denote π(i, j, X) as the highest score
for any parse tree that dominates words xi, …, xj and has non-terminal X ∈ N as its root.
• Output: π(1,n, S)
• Initially, for i = 1, 2, …, n:
π(i, i, X) = q(X → xi) if X → xi ∈ R, and 0 otherwise
The CKY algorithm
• For all (i, j) such that 1 ≤ i < j ≤ n and all X ∈ N:
π(i, j, X) = max_{X → Y Z ∈ R, i ≤ k < j} q(X → Y Z) × π(i, k, Y) × π(k + 1, j, Z)
We also store backpointers, which allow us to recover the parse tree.
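Putting the initialization and the recursion together, here is a compact sketch of CKY with backpointers (assuming a CNF grammar given as dictionaries; all names are illustrative):

```python
def cky(words, binary_rules, lexical_rules):
    """
    words: tokens x_1 .. x_n
    binary_rules: dict mapping (Y, Z) -> list of (X, q) for rules X -> Y Z
    lexical_rules: dict mapping word -> list of (X, q) for rules X -> word
    Returns the pi table and backpointers, keyed by (i, j, X) with 1-based spans.
    """
    n = len(words)
    pi, bp = {}, {}

    # Initialization: pi(i, i, X) = q(X -> x_i) if the rule exists, 0 otherwise
    for i in range(1, n + 1):
        for X, q in lexical_rules.get(words[i - 1], []):
            pi[(i, i, X)] = q
            bp[(i, i, X)] = words[i - 1]

    # Recursion over spans of increasing length
    for length in range(1, n):
        for i in range(1, n - length + 1):
            j = i + length
            for k in range(i, j):                                   # split point
                for (Y, Z), parents in binary_rules.items():
                    left = pi.get((i, k, Y), 0.0)
                    right = pi.get((k + 1, j, Z), 0.0)
                    if left == 0.0 or right == 0.0:
                        continue
                    for X, q in parents:
                        score = q * left * right
                        if score > pi.get((i, j, X), 0.0):
                            pi[(i, j, X)] = score
                            bp[(i, j, X)] = (k, Y, Z)
    return pi, bp
```

The best parse is then read off by following the backpointers down from (1, n, S); each span considers every split point and every rule, which is where the O(n³ · |R|) running time on the next slide comes from.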
The CKY algorithm
Running time?
O(n³ · |R|)
CKY with unary rules
• In practice, we also allow unary rules X → Y where X, Y ∈ N,
so that conversion to/from the normal form is easier
How does this change CKY?
π(i, j, X) = max_{X → Y ∈ R} q(X → Y) × π(i, j, Y)
• Compute unary closure: if there is a rule chain
X → Y1, Y1 → Y2, …, Yk → Y, add
q(X → Y ) = q(X → Y1) × ⋯ × q(Yk → Y )
• Apply the unary rules once in each cell, after the binary rules (see the sketch below)
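A sketch of what that unary pass could look like for one CKY cell, using the closed q(X → Y) values (names are illustrative):

```python
def apply_unaries(cell, unary_closure):
    """
    cell: dict Y -> pi(i, j, Y) for a single span (i, j), already filled by binary rules
    unary_closure: dict (X, Y) -> closed probability of the best chain X -> ... -> Y
    Updates the cell in place with the best unary rewrites.
    """
    for (X, Y), q in unary_closure.items():
        if Y in cell and q * cell[Y] > cell.get(X, 0.0):
            cell[X] = q * cell[Y]
```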
Evaluating constituency parsing
Evaluating constituency parsing
• Recall: (# correct constituents in candidate) / (# constituents in
gold tree)
• Precision: (# correct constituents in candidate) / (# constituents in
candidate)
• Labeled precision/recall require getting the non-terminal label
correct
• F1 = (2 * precision * recall) / (precision + recall)
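Treating each parse as a set of labeled spans (label, start, end), these metrics are a few lines of code (a sketch of the standard comparison; real evaluation scripts handle details such as punctuation and unary chains):

```python
def labeled_prf(candidate, gold):
    """candidate, gold: sets of labeled constituents (label, start, end)."""
    correct = len(candidate & gold)
    precision = correct / len(candidate)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# With 3 correct constituents, 7 in the candidate and 8 in the gold tree,
# this gives P = 3/7, R = 3/8, F1 = 0.4 -- the example on the next slide.
```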
Evaluating constituency parsing
• Precision: 3/7 = 42.9%
• Recall: 3/8 = 37.5%
• F1 = 40.0%
• Tagging accuracy: 100%
Weaknesses of PCFGs
• Lack of sensitivity to lexical information (words)
The only difference between these two parses:
q(VP → VP PP) vs q(NP → NP PP)
… without looking at the words!
Weaknesses of PCFGs
• Lack of sensitivity to lexical information (words)
Exactly the same set of context-free rules!
Lexicalized PCFGs
• Key idea: add headwords to trees
• Each context-free rule has one special child that is the
head of the rule (a core idea in syntax)
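A sketch of the idea: a small (hypothetical, highly simplified) table says which child heads each rule, and the head word then percolates up the tree; real head-finding rules, such as those used by Collins, are much richer.

```python
# Which child of each rule is the head (index into the RHS) -- toy head rules.
head_child = {
    ("S",  ("NP", "VP")): 1,   # the VP heads the sentence
    ("VP", ("Vt", "NP")): 0,   # the verb heads the VP
    ("NP", ("DT", "NN")): 1,   # the noun heads the NP
}

def lexicalize(tree):
    """tree: (tag, word) at pre-terminals, (label, child, ...) elsewhere.
    Returns the tree with a head word attached to every internal node."""
    if isinstance(tree[1], str):                       # pre-terminal
        return tree
    children = [lexicalize(c) for c in tree[1:]]
    rhs = tuple(c[0] for c in children)
    head_word = children[head_child[(tree[0], rhs)]][1]
    return (tree[0], head_word, *children)

t = ("S", ("NP", ("DT", "the"), ("NN", "man")),
          ("VP", ("Vt", "saw"), ("NP", ("DT", "the"), ("NN", "dog"))))
print(lexicalize(t))
# ('S', 'saw', ('NP', 'man', ...), ('VP', 'saw', ...))
```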
Lexicalized PCFGs
• Further reading: Michael Collins. 2003. Head-Driven
Statistical Models for Natural Language Parsing.
• Results for a PCFG: 70.6% recall, 74.8% precision
• Results for a lexicalized PCFG: 88.1% recall, 88.3% precision
http://nlpprogress.com/english/constituency_parsing.html