
NLP UNIT II (PART 2)


Feature and Augmented Grammars
Feature Systems and Augmented Grammars, Morphological Analysis and the Lexicon, Parsing with Features, Augmented Transition Networks, Bayes' Rule, Shannon Game, Entropy and Cross Entropy.

Feature Systems and Augmented Grammars


• In natural languages there are often agreement restrictions between words and phrases.

• For example, the NP "a men" is not correct in English because the article a indicates a single object
while the noun "men" indicates a plural object; the noun phrase does not satisfy the number agreement
restriction of English.
• There are many other forms of agreement, including subject-verb agreement, gender agreement for
pronouns, restrictions between the head of a phrase and the form of its complement, and so on.
• To handle such phenomena conveniently, the grammatical formalism is extended to allow constituents
to have features.
• For example, we might define a feature NUMBER that may take a value of either s (for singular) or p
(for plural), and we then might write an augmented CFG rule with an agreement restriction, such as
NP -> ART N only when NUMBER1 agrees with NUMBER2
• This rule says that a legal noun phrase consists of an article followed by a noun, but only when the
number feature of the first word agrees with the number feature of the second.
• This one rule is equivalent to two CFG rules that would use different terminal symbols for encoding
singular and plural forms of all noun phrases, such as
NP-SING -> ART-SING N-SING

NP-PLURAL -> ART-PLURAL N-PLURAL


• Using features, the size of the augmented grammar remains the same as the original one. To accomplish
this, a constituent is defined as a feature structure - a mapping from features to values that
defines the relevant properties of the constituent.

• In the examples, feature names in formulas will be written in boldface.

• For example, a feature structure for a constituent ART1 that represents a particular use of the word a
might be written as follows:
ART1: (CAT ART
ROOT a
NUMBER s)

• This says it is a constituent in the category ART that has as its root the word a and is singular.
• Usually an abbreviation is used that gives the CAT value more prominence and provides an intuitive
tie back to simple context-free grammars.
• In this abbreviated form, constituent ART1 would be written as

ART1: (ART ROOT a NUMBER s)

• Feature structures can also be used to represent larger constituents; in that case, feature structures can themselves occur as values.
• Special features based on the integers - 1, 2, 3, and so on - will stand for the first
subconstituent, second subconstituent, and so on, as needed.
• With this, the representation of the NP constituent for the phrase "a fish" could be
NP1: (NP NUMBER s
      1 (ART ROOT a
         NUMBER s)
      2 (N ROOT fish
         NUMBER s))

• Note that this can also be viewed as a representation of a parse tree shown in Figure 4.1, where the
subconstituent features 1 and 2 correspond to the subconstituent links in the tree.

Figure 4.1 Viewing a feature structure as an extended parse tree

• The rules in an augmented grammar are stated in terms of feature structures rather than simple
categories.
• Variables are allowed as feature values so that a rule can apply to a wide range of situations.
• For example, a rule for simple noun phrases would be as follows:

(NP NUMBER ?n) -> (ART NUMBER ?n) (N NUMBER ?n)

• This says that an NP constituent can consist of two subconstituents, the first being an ART and the second
being an N, in which the NUMBER feature in all three constituents is identical.
• According to this rule, constituent NP1 given previously is a legal constituent.

• On the other hand, the constituent (NP 1 (ART NUMBER s)
2 (N NUMBER s))
is not allowed by this rule because there is no NUMBER feature in the NP, and the constituent
(NP NUMBER s
1 (ART NUMBER s)
2 (N NUMBER p))

is not allowed because the NUMBER feature of the N constituent is not identical to the other two
NUMBER features.
• Variables are also useful in specifying ambiguity in a constituent. For instance, the word fish is
ambiguous between a singular and a plural reading.
• Thus the word might have two entries in the lexicon that differ only by the value of the NUMBER
feature.
• Alternatively, we could define a single entry that uses a variable as the value of the NUMBER feature,
that is, (N ROOT fish NUMBER ?n)

• This works because any value of the NUMBER feature is allowed for the word fish.

• In many cases, however, not just any value would work, but a range of values is possible.

• To handle these cases, we introduce constrained variables, which are variables that can only take a
value out of a specified list.
• For example, the variable ?n{s p} would be a variable that can take the value s or the value p.
• Typically, when we write such variables, we will drop the variable name altogether and just list the
possible values.
• Given this, the word fish might be represented by the constituent (N ROOT fish NUMBER ?n{s p})
or more simply as (N ROOT fish NUMBER {s p})
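To make this concrete, here is a minimal Python sketch (not from the text) that encodes feature structures as dictionaries and constrained variables such as ?n{s p} as sets of allowed values; the helper names agree and build_np and the sample entries are illustrative assumptions.

# Minimal sketch: feature structures as dictionaries, constrained variables as sets.
# The helper names and sample entries are illustrative assumptions, not from the text.

def as_set(v):
    # A plain value becomes a one-element set; a constrained variable such as
    # ?n{s p} is represented directly as the set of its allowed values.
    return v if isinstance(v, set) else {v}

def agree(*vals):
    # Intersect the allowed NUMBER values; a non-empty result means agreement holds.
    common = as_set(vals[0])
    for v in vals[1:]:
        common = common & as_set(v)
    return common

# Lexicon entries, e.g. (ART ROOT a NUMBER s) and (N ROOT fish NUMBER {s p})
art_a  = {"CAT": "ART", "ROOT": "a",    "NUMBER": "s"}
n_fish = {"CAT": "N",   "ROOT": "fish", "NUMBER": {"s", "p"}}
n_men  = {"CAT": "N",   "ROOT": "man",  "NUMBER": "p"}

def build_np(art, n):
    # Apply (NP NUMBER ?n) -> (ART NUMBER ?n) (N NUMBER ?n)
    common = agree(art["NUMBER"], n["NUMBER"])
    if not common:
        return None                                  # "a men" is rejected here
    number = common.pop() if len(common) == 1 else common
    return {"CAT": "NP", "NUMBER": number, "1": art, "2": n}

print(build_np(art_a, n_fish))   # an NP with NUMBER s, as in "a fish"
print(build_np(art_a, n_men))    # None: the number agreement restriction fails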

BOX 4.1


• There is an interesting issue of whether an augmented context-free grammar can describe languages
that cannot be described by a simple context-free grammar.
• If the set of feature values is finite, then it would always be possible to create new
constituent categories for every combination of features. Thus it is expressively equivalent to a
context-free grammar.
• If the set of feature values is unconstrained, then such grammars have arbitrary computational
power. In practice, the standard parsing algorithms can be used on grammars that include features.

Some Basic Feature Systems for English


This section describes some basic feature systems that are commonly used in grammars of English and develops the particular set
of features used here. Specifically, it considers number and person agreement, verb form features, and features
required to handle subcategorization constraints.
Person and Number Features
Words may be classified according to whether they describe a single object or multiple objects. While number agreement
restrictions occur in several different places in English, they are most importantly found in subject-verb
agreement. The possible values are
First Person (1): The noun phrase refers to the speaker, or a group of people including the
speaker (for example, I, we, you and I).
Second Person (2): The noun phrase refers to the listener, or a group including the listener but
not including the speaker (for example, you, all of you).
Third Person (3): The noun phrase refers to one or more objects, not including the speaker or
hearer.
Since number and person features always co-occur, it is convenient to combine the two into a single feature,
AGR, that has six possible values: first person singular (1s), second person singular (2s), third person
singular (3s), and first, second and third person plural (1p, 2p, and 3p, respectively).
Verb-Form Features and Verb Subcategorization
Another very important feature system in English involves the form of the verb. This feature is used in
many situations, such as the analysis of auxiliaries and generally in the subcategorization restrictions of
many head words. There are five basic forms of verbs. In particular, we will use the following feature values
for the feature VFORM:
base - base form (for example, go, be, say, decide)
pres - simple present tense (for example, go, goes, am, is, say, says, decide)
past - simple past tense (for example, went, was, said, decided)
fin - finite (that is, a tensed form, equivalent to {pres past})
ing - present participle (for example, going, being, saying, deciding)
pastprt - past participle (for example, gone, been, said, decided)
inf - a special feature value that is used for infinitive forms with the word to

Figure 4.2 The SUBCAT values for NP and VP combinations


To handle the interactions between words and their complements, an additional feature, SUBCAT,
is used. There are some common verb subcategorization possibilities; each one corresponds to a different
value of the SUBCAT feature. Figure 4.2 shows some SUBCAT values for complements consisting of
combinations of NPs and VPs. If the category is restricted by a feature value, then the feature value
follows the constituent separated by a colon. Thus the value _np_vp:inf will be used to indicate a
complement that consists of an NP followed by a VP with VFORM value inf. For instance, the rule for

verbs with a SUBCAT value of _np_vp:inf would be

(VP) -> (V SUBCAT _np_vp:inf) (NP) (VP VFORM inf)
This says that a VP can consist of a V with SUBCAT value _np_vp:inf, followed by an NP, followed by a
VP with VFORM value inf. Clearly, this rule could be rewritten using any other unique symbol instead of
_np_vp:inf, as long as the lexicon is changed to use this new value.
Many verbs have complement structures that require a prepositional phrase with a particular preposition,
or one that plays a particular role. For example, the verb give allows a complement consisting of an NP
followed by a PP using the preposition to, as in "Jack gave the money to the bank". Other verbs, such as
"put", require a prepositional phrase that describes a location, using prepositions such as "in", "inside",
"on", and "by". To express this within the feature system, we introduce a feature PFORM on prepositional
phrases.
A prepositional phrase with a PFORM value such as TO must have the preposition to as its head, and so
on. A prepositional phrase with a PFORM value LOC must describe a location.

Figure 4.3 Some values of the PFORM feature for prepositional phrases
Figure 4.4 Additional SUBCAT values


Another useful PFORM value is MOT, used with verbs such as walk, which may take a prepositional
phrase that describes some aspect of a path, as in We walked to the store. Prepositions that can create such
phrases include to, from, and along. The LOC and MOT values might seem hard to distinguish, as certain
prepositions might describe either a location or a path, but they are distinct. For example, while Jack put
the box {in on by} the corner is fine, *Jack put the box {to from along} the corner is ill-formed. Figure 4.3
summarizes the PFORM feature.

This feature can be used to restrict the complement forms for various verbs. Using the naming convention
discussed previously, the SUBCAT value of a verb such as put would be _np_pp:loc, and the appropriate
rule in the grammar would be
(VP) -> (V SUBCAT _np_pp:loc) (NP) (PP PFORM LOC)
Binary Features
Certain features are binary in that a constituent either has or doesn’t have the feature. In our formalization
a binary feature is simply a feature whose value is restricted to be either + or -. For example, the INV
feature is a binary feature that indicates whether or not an S structure has an inverted subject (as in a yes/no
question). The S structure for the sentence Jack laughed will have the INV value -, whereas the S structure
for the sentence Did Jack laugh? will have the INV value +. Often, the value is used as a prefix, and we
would say that a structure has the feature +INV or -INV. Other binary features will be introduced as
necessary throughout the development of the grammars.

Morphological Analysis and the Lexicon

• Before you can specify a grammar, you must define the lexicon. This section explores some issues in
lexicon design and the need for a morphological analysis component.
• The lexicon must contain information about all the different words that can be used, including all the
relevant feature value restrictions.
• When a word is ambiguous, it may be described by multiple entries in the lexicon, one for each
different use.
• Because words tend to follow regular morphological patterns, however, many forms of words need
not be explicitly included in the lexicon.
• Most English verbs, for example, use the same set of suffixes to indicate different forms: -s is added
for third person singular present tense, -ed for past tense, -ing for the present participle, and so on.

• Without any morphological analysis, the lexicon would have to contain every one of these forms.
• For the verb want this would require six entries, for want (both in base and present form), wants,
wanting, and wanted (both in past and past participle form).
• In contrast, by using the methods described in Section 3.7 (search tree for two parse strategies) to strip
suffixes, there needs to be only one entry for want.
• The idea is to store the base form of the verb in the lexicon and use context-free rules to combine
verbs with suffixes to derive the other entries.
• Consider the following rule for present tense verbs:
(V ROOT ?r SUBCAT ?s VFORM pres AGR 3s) -> (V ROOT ?r SUBCAT ?s VFORM base) (+S)
where +S is a new lexical category that contains only the suffix morpheme -s.
• This rule, coupled with the lexicon entry
want: (V ROOT want
      SUBCAT {_np_vp:inf _vp:inf}
      VFORM base)

would produce the following constituent given the input string want +s:
want: (V ROOT want
      SUBCAT {_np_vp:inf _vp:inf}
      VFORM pres
      AGR 3s)
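As a hedged illustration of this derivation, the sketch below applies the +S rule to the base-form entry for want; the dictionary encoding and the function name apply_pres_s_rule are assumptions made for illustration only, not part of the text's formalism.

# Sketch of applying the lexical rule
#   (V ROOT ?r SUBCAT ?s VFORM pres AGR 3s) -> (V ROOT ?r SUBCAT ?s VFORM base) (+S)
# Entries are dictionaries; the function name and encoding are illustrative assumptions.

def apply_pres_s_rule(verb_entry):
    # The rule only applies to base-form verbs (the right-hand side requires VFORM base).
    if verb_entry.get("CAT") != "V" or verb_entry.get("VFORM") != "base":
        return None
    derived = dict(verb_entry)        # ROOT and SUBCAT carry over via ?r and ?s
    derived["VFORM"] = "pres"
    derived["AGR"] = "3s"
    return derived

want = {"CAT": "V", "ROOT": "want",
        "SUBCAT": {"_np_vp:inf", "_vp:inf"}, "VFORM": "base"}

print(apply_pres_s_rule(want))
# {'CAT': 'V', 'ROOT': 'want', 'SUBCAT': {...}, 'VFORM': 'pres', 'AGR': '3s'}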

• Another rule would generate the constituents for the present tense form not in third person
singular, which for most verbs is identical to the root form:
(V ROOT ?r SUBCAT ?s VFORM pres AGR {1s 2s 1p 2p 3p}) ->
(V ROOT ?r SUBCAT ?s VFORM base)
• But this rule needs to be modified in order to avoid generating erroneous interpretations.
• Currently, it can transform any base form verb into a present tense form, which is clearly wrong for
some irregular verbs.
• For instance, the base form be cannot be used as a present form (for example, *We be at the store).
• To cover these cases, a feature is introduced to identify irregular forms.
• Specifically, verbs with the binary feature +IRREG-PRES have irregular present tense forms.
• Now the rule above can be stated correctly:
(V ROOT ?r SUBCAT ?s VFORM pres AGR {1s 2s 1p 2p 3p}) ->
(V ROOT ?r SUBCAT ?s VFORM base IRREG-PRES -)

• Because of the default mechanism, the IRREG-PRES feature need only be specified on the
irregular verbs.
• The regular verbs default to -, as desired.
• Similar binary features would be needed to flag irregular past forms (IRREG-PAST, such as saw),
and to distinguish -en past participles from -ed past participles (EN-PASTPRT).
• These features restrict the application of the standard lexical rules, and the irregular forms are added
explicitly to the lexicon.
• Grammar 4.5 gives a set of rules for deriving different verb and noun forms using these features.

Grammar 4.5 Some lexical rules for common suffixes on verbs and nouns

• Given a large set of features, the task of writing lexical entries appears very difficult. Most
frameworks allow some mechanisms that help alleviate these problems.

• The first technique is allowing default values for features. With this capability, if an entry takes the
default value for a given feature, then it need not be explicitly stated.
• Another commonly used technique is to allow the lexicon writer to define clusters of features,
and then indicate a cluster with a single symbol rather than listing them all.
• Later, additional techniques will be discussed that allow the inheritance of features in a feature hierarchy.

• Figure 4.6 contains a small lexicon.

Figure 4.6 A lexicon

• It contains many of the words to be used in the examples that follow.


• It contains three entries for the word "saw" - as a noun, as a regular verb, and as the irregular past
tense form of the verb "see" - as illustrated in the sentences

The saw was broken.

Jack wanted me to saw the board in half.

I saw Jack eat the pizza.


• With an algorithm for stripping the suffixes and regularizing the spelling, the derived entries can be
generated using any of the basic parsing algorithms on Grammar 4.5.
• With the lexicon in Figure 4.6 and Grammar 4.5, correct constituents for the following words can be
derived: been, being, cries, cried, crying, dogs, saws (two interpretations), sawed, sawing, seen,
seeing, seeds, wants, wanting, and wanted.
• For example, the word cries would be transformed into the sequence cry +s, and then rule 1 would
produce the present tense entry from the base form in the lexicon. Often a word will have multiple
interpretations that use different entries and different lexical rules.
• The word saws, for instance, transformed into the sequence saw +s, can be a plural noun (via rule 7
and the first entry for saw), or the third person present form of the verb saw (via rule 1 and the second
entry for saw).
• Note that rule 1 cannot apply to the third entry, as its VFORM is not base. The success of this approach
depends on being able to prohibit erroneous derivations, such as analyzing seed as the past tense of the
verb "see".
• This analysis will never be considered if the FST (finite state transducer) that strips suffixes is correctly
designed. Specifically, the word see will not allow a transition to the states that allow the -ed suffix. But
even if this were produced for some reason, the IRREG-PAST value + in the entry for see would prohibit
rule 3 from applying.
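The following minimal sketch illustrates the suffix-stripping idea only (it is not the FST from Section 3.7): it proposes candidate base-form/suffix splits such as cries -> cry +s and saws -> saw +s, which a real analyzer would then check against the lexicon and the lexical rules of Grammar 4.5. The spelling rules and names here are illustrative assumptions.

# Illustrative suffix stripping: propose candidate (base, suffix) analyses for a word.
# A real morphological analyzer would filter these against the lexicon and Grammar 4.5.

def strip_suffix(word):
    candidates = [(word, None)]                    # the word may itself be a base form
    if word.endswith("ies"):                       # cries -> cry +s
        candidates.append((word[:-3] + "y", "+S"))
    elif word.endswith("s") and not word.endswith("ss"):
        candidates.append((word[:-1], "+S"))       # saws -> saw +s, dogs -> dog +s
    if word.endswith("ed"):
        candidates.append((word[:-2], "+ED"))      # sawed -> saw +ed
    if word.endswith("ing"):
        candidates.append((word[:-3], "+ING"))     # seeing -> see +ing
    return candidates

for w in ["cries", "saws", "wanted", "seeing"]:
    print(w, strip_suffix(w))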
A Simple Grammar Using Features
This section presents a simple grammar using the feature systems and lexicon developed in the earlier sections. It will
handle sentences such as the following:
The man cries.
The men cry.
The man saw the dogs.
He wants the dog.
He wants to be happy.
He wants the man to see the dog.
He is happy to be a dog.
It does not find the following acceptable:

* The men cries.
* The man cry.
* The man saw to be happy.
* He wants.
* He wants the man saw the dog.
BOX 4.2 Systemic Grammar
An important influence on the development of computational feature-based systems was systemic
grammar (Halliday, 1985). This theory emphasizes the functional role of linguistic constructs as they
affect communication. The grammar is organized as a set of choices about discourse function that
determine the structure of the sentence. The choices are organized into hierarchical structures
called systems.

This structure indicates that once certain choices are made, others become relevant. For instance, if
you decide that a sentence is in the declarative mood, then the choice between bound and relative
becomes relevant. The choice between yes/no and wh, on the other hand, is not relevant to a declarative
sentence.
For example, a VP rule with the head features VFORM and AGR stated explicitly would be
(VP VFORM ?v AGR ?a) -> (V VFORM ?v AGR ?a SUBCAT _np_vp:inf) (NP) (VP VFORM inf)
If the head features can be declared separately from the rules, the system can automatically add these
features to the rules as needed. With VFORM and AGR declared as head features, the previous VP
rule can be abbreviated as
VP -> (V SUBCAT _np_vp:inf) NP (VP VFORM inf)

Grammar 4.7 A simple grammar in the abbreviated form

Grammar 4.8 The expanded grammar showing all features

Parsing Algorithms with Features


• The parsing algorithms for context-free grammars can be extended to handle
augmented context-free grammars.
• This involves generalizing the algorithm for matching rules to constituents.
• For instance, the chart-parsing algorithms all used an operation for extending active arcs with a
new constituent.
• A constituent X could extend an arc of the form C -> C1 ... Ci o X ... Cn to produce a new arc of
the form C -> C1 ... Ci X o ... Cn

• A similar operation can be used for grammars with features, but the parser may have to instantiate
variables in the original arc before it can be extended by X.
• The key to defining this matching operation precisely is to remember the definition of grammar
rules with features.
• A rule such as
1. (NP AGR ?a) -> o (ART AGR ?a) (N AGR ?a)

says that an NP can be constructed out of an ART and an N if all three agree on the AGR feature.
• It does not place any restrictions on any other features that the NP, ART, or N may have. Thus, when
matching constituents against this rule, the only thing that matters is the AGR feature. All other features
in the constituent can be ignored.

• For instance, consider extending arc 1 with the constituent


2. (ART ROOT A AGR 3s)

• To make arc 1 applicable, the variable ?a must be instantiated to 3s, producing


3. (NP AGR 3s) -> o (ART AGR 3s) (N AGR 3s)

• This arc can now be extended because every feature in the rule is in constituent 2:
4. (NP AGR 3s) -> (ART AGR 3s) o (N AGR 3s)

• Now, consider extending this arc with the constituent for the word dog:
5. (N ROOT DOG1 AGR 3s)

• This can be done because the AGR features agree. This completes the arc
6. (NP AGR 3s) -> (ART AGR 3s) (N AGR 3s)
• This means the parser has found a constituent of the form (NP AGR 3s).
• This algorithm can be specified more precisely as follows:
• Given an arc A, where the constituent following the dot is called NEXT, and a new constituent
X, which is being used to extend the arc,
a. Find an instantiation of the variables such that all the features specified in NEXT are found in X.
b. Create a new arc A', which is a copy of A except for the instantiations of the variables determined
in step (a).
c. Update A' as usual in a chart parser.

• For instance, let A be arc 1, and X be the ART constituent 2. Then NEXT will be (ART AGR ?a).
• In step a, NEXT is matched against X, and you find that ?a must be instantiated to 3s. In step b, a new
copy of A is made, which is shown as arc 3. In step c, the arc is updated to produce the new arc shown
as arc 4.
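Here is a minimal sketch of steps (a)-(c), assuming arcs and constituents are encoded as Python dictionaries and variables are strings beginning with "?"; the names instantiate and extend_arc are illustrative assumptions, and constrained variables (whose allowed values would be intersected, as discussed next) are not handled.

# Sketch of the arc-extension operation. An arc is a dict with "lhs", "rhs" and "dot";
# NEXT is the constituent after the dot; ?variables are strings starting with "?".
# All names and the representation are illustrative assumptions, not from the text.

def instantiate(arc, X):
    # Step (a): find variable bindings so every feature in NEXT is found in X.
    next_constituent = arc["rhs"][arc["dot"]]
    bindings = {}
    for feature, value in next_constituent.items():
        if isinstance(value, str) and value.startswith("?"):
            if feature not in X:
                return None                 # nothing to bind the variable to
            bindings[value] = X[feature]    # (a full version would check consistency)
        elif X.get(feature) != value:
            return None                     # constant features must match exactly
    return bindings

def extend_arc(arc, X):
    # Steps (b) and (c): copy the arc with variables instantiated, then move the dot.
    bindings = instantiate(arc, X)
    if bindings is None:
        return None
    sub = lambda v: bindings.get(v, v)
    return {"lhs": {f: sub(v) for f, v in arc["lhs"].items()},
            "rhs": [{f: sub(v) for f, v in c.items()} for c in arc["rhs"]],
            "dot": arc["dot"] + 1}

# Arc 1:  (NP AGR ?a) -> o (ART AGR ?a) (N AGR ?a)
arc1 = {"lhs": {"CAT": "NP", "AGR": "?a"},
        "rhs": [{"CAT": "ART", "AGR": "?a"}, {"CAT": "N", "AGR": "?a"}],
        "dot": 0}
art = {"CAT": "ART", "ROOT": "a", "AGR": "3s"}    # constituent 2; ROOT is ignored
print(extend_arc(arc1, art))                       # arc 4: ?a instantiated to 3s, dot moved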
• When constrained variables, such as ?a{3s 3p}, are involved, the matching proceeds in the same manner,
but the variable binding must be one of the listed values.
• If a variable is used in a constituent, then one of its possible values must match the requirement in the
rule.
• If both the rule and the constituent contain variables, the result is a variable ranging over the intersection
of their allowed values.
• For instance, consider extending arc 1 with the constituent

(ART ROOT the AGR ?v{3s 3p}), that is, the word "the".

• To apply, the variable ?a would have to be instantiated to ?v{3s 3p}, producing the rule

(NP AGR ?v{3s 3p}) -> (ART AGR ?v{3s 3p}) o (N AGR ?v{3s 3p})
• This arc could be extended by (N ROOT dog AGR 3s), because ?v{3s 3p} could be instantiated
by the value 3s.
• The resulting arc would be identical to arc 6.

• The entry in the chart for the is not changed by this operation.

• It still has the value ?v{3s 3p}.

• The AGR feature is restricted to 3s only in the arc.

• Another extension is useful for recording the structure of the parse.

• Subconstituent features (1, 2, and so on, depending on which subconstituent is being added) are
automatically inserted by the parser each time an arc is extended.
• The values of these features name subconstituents already in the chart.

• With this treatment, and assuming that the chart already contains two constituents, ART1 and N1, for the
words the and dog, the constituent added to the chart for the phrase the dog would be
(NP AGR 3s
 1 ART1
 2 N1)
where ART1 = (ART ROOT the AGR {3s 3p}) and N1 = (N ROOT dog AGR {3s}).
• Note that the AGR feature of ART1 was not changed.

• Thus it could be used with other interpretations that require the value 3p if they are possible.
• Any of the chart-parsing algorithms can now be used with an augmented grammar by using these
extensions to extend arcs and build constituents.
• Consider an example. Figure 4.10 contains the final chart produced from parsing the sentence
"He wants to cry" using Grammar 4.8.

Figure 4.10 The chart for "He wants to cry".

• Constituent NP1 was constructed by rule 3, repeated here for convenience:
3. (NP AGR ?a) -> (PRO AGR ?a)
• To match the constituent PRO1, the variable ?a must be instantiated to 3s.
• Thus the new constituent built is

NP1: (CAT NP
AGR 3s
1 PRO1)

• Next consider constructing constituent VP1 using rule 4, namely

4. (VP AGR ?a VFORM ?v) -> (V SUBCAT _none AGR ?a VFORM ?v)

• For the right-hand side to match constituent V2, the variable ?v must be instantiated to base.

• The AGR feature of V2 is not defined, so it defaults to -.

• The new constituent is
VP1: (CAT VP
      AGR -
      VFORM base
      1 V2)
• Generally, default values are not shown in the chart.
• In a similar way, constituent VP2 is built from TO1 and VP1 using rule 9, VP3 is built from V1 and
VP2 using rule 6, and S1 is built from NP1 and VP3 using rule 1.
Augmented Transition Networks
• Features can also be added to a Recursive transition network to produce an Augmented transition
network (ATN).
• Features in an ATN are traditionally called registers.
• Constituent structures are created by allowing each network to have a set of registers. Each time a
new network is pushed, a new set of registers is created.
• As the network is traversed, these registers are set to values by actions associated with each arc.
• When the network is popped, the registers are assembled to form a constituent structure, with the
CAT slot being the network name.
• Grammar 4.11 is a simple NP network.

• The actions are listed in the table.

• ATNs use a special mechanism to extract the result of following an arc. When a lexical arc, such as arc 1,
is followed, the constituent built from the word in the input is put into a special variable named
"*".
• The action DET := * then assigns this constituent to the DET register.
• The second action on this arc, AGR := AGR* assigns the AGR register of the network to the value of the
AGR register of the new word (the constituent in "*").
• Agreement checks are specified in the tests.
• A test is an expression that succeeds if it returns a nonempty value and fails if it returns the empty
set or nil. If a test fails, its arc is not traversed.
• The test on arc 2 indicates that the arc can be followed only if the AGR feature of the network has a
non-null intersection with the AGR register of the new word (the noun constituent in "*").
• Features on push arcs are treated similarly.
• The constituent built by traversing the NP network is returned as the value "*".
• Thus in Grammar 4.12, the action on the arc from S to S1,
SUBJ := * would assign the constituent returned by the NP network to the register SUBJ.

Grammar 4.12 A simple S network


• The test on arc 2 will succeed only if the AGR register of the constituent in the SUBJ register has a non-
null intersection with the AGR register of the new constituent (the verb).
• This test enforces subject-verb agreement.
• With the lexicon, the ATN accepts the following sentences:
The dog cried.
The dogs saw Jack.
Jack saw the dogs.
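To illustrate the register mechanism just described, here is a compact Python sketch in the spirit of Grammars 4.11 and 4.12: traversing a tiny NP network sets DET, HEAD and AGR registers, and a test on the noun arc enforces number agreement. The mini-lexicon, the arc structure and all names are illustrative assumptions, not the actual ATN grammars from the text.

# Compact sketch of the ATN register mechanism. The lexicon, arcs and names below
# are illustrative assumptions; they only mimic the style of Grammars 4.11 and 4.12.

LEXICON = {
    "the":  {"CAT": "ART", "ROOT": "the", "AGR": {"3s", "3p"}},
    "a":    {"CAT": "ART", "ROOT": "a",   "AGR": {"3s"}},
    "dog":  {"CAT": "N",   "ROOT": "dog", "AGR": {"3s"}},
    "dogs": {"CAT": "N",   "ROOT": "dog", "AGR": {"3p"}},
}

def np_network(words, pos):
    # Traverse the NP network from word position pos; a new register set is created
    # each time the network is entered. Returns (constituent, new_pos) or None.
    registers = {}

    # arc NP/1: category ART; actions DET := *, AGR := AGR*
    star = LEXICON.get(words[pos]) if pos < len(words) else None
    if star is None or star["CAT"] != "ART":
        return None
    registers["DET"] = star
    registers["AGR"] = star["AGR"]
    pos += 1

    # arc NP1/2: category N; test: AGR has a non-null intersection with AGR*;
    # actions HEAD := *, AGR := AGR intersected with AGR*
    star = LEXICON.get(words[pos]) if pos < len(words) else None
    if star is None or star["CAT"] != "N":
        return None
    common = registers["AGR"] & star["AGR"]
    if not common:                     # the test fails, so the arc is not traversed
        return None
    registers["HEAD"] = star
    registers["AGR"] = common
    pos += 1

    # pop arc: assemble the registers into a constituent, with CAT = network name
    return ({"CAT": "NP", **registers}, pos)

print(np_network(["the", "dogs"], 0))   # accepted; AGR narrows to {'3p'}
print(np_network(["a", "dogs"], 0))     # None: the agreement test fails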
• Consider an example. A trace of a parse of the sentence "The dogs saw Jack" is shown in Figure 4.13.


Figure 4.13 Trace tests and actions used with "1 The 2 dogs 3 saw 4 Jack 5"
• It indicates the current node in the network, the current word position, the arc that is followed from
the node, and the register manipulations that are performed for the successful parse.
• It starts in the S network but moves immediately to the NP network from the call on arc 4.
• The NP network checks for number agreement as it accepts the word sequence The dogs.
• It constructs a noun phrase with the AGR feature plural.
• When the pop arc is followed, it completes arc 4 in the S network.
• The NP is assigned to the SUBJ register and then checked for agreement with the verb when arc 3 is
followed.
• The NP "Jack" is accepted in another call to the NP network.An ATN Grammar for Simple Declarative
Sentences
• Here is a more comprehensive example of the use of an ATN to describe some declarative sentences.
• The allowed sentence structure is an initial NP followed by a main verb, which may then be followed
by a maximum of two NPs and many PPs, depending on the verb.
• Using the feature system extensively, you can create a grammar that accepts any of the preceding
complement forms, leaving the actual verb-complement agreement to the feature restrictions.
• Grammar 4.14 shows the S network.
• Arcs are numbered using the conventions.


• For instance, the arc S3/1 is the arc labeled 1 leaving node S3.

• The NP network in Grammar 4.15 allows simple names, bare plural nouns, pronouns, and a simple
sequence of a determiner followed by an adjective and a head noun.

Grammar 4.15 The NP network

• Allowable noun complements include an optional number of prepositional phrases.

• The prepositional phrase network in Grammar 4.16 is straightforward.

• Examples of parsing sentences with this grammar are left for the exercises.


Grammar 4.16 The PP network

Presetting Registers
• One further extension to the feature-manipulation facilities in ATNs involves the ability to preset
registers in a network as that network is being called, much like parameter passing in a
programming language.
• This facility, called the SENDR action in the original ATN systems, is useful to pass information to the
network that aids in analyzing the new constituent.
• Consider the class of verbs, including want and pray, that accept complements using the infinitive forms
of verbs, which are introduced by the word to.
• According to the classification, this includes the following:

_vp:inf Mary wants to have a party.


_np_vp:inf Mary wants John to have a party.
• In the context-free grammar developed earlier, such complements were treated as VPs with the
VFORM value inf.
• To capture this same analysis in an ATN, you would need to be able to call a network
corresponding to VPs but preset the VFORM register in that network to inf.
• Another common analysis of these constructs is to view the complements as a special form of
sentence with an understood subject.
• In the first case it is Mary who would be the understood subject (that is, the host), while in the other
case it is John.

• To capture this analysis, many ATN grammars preset the SUBJ register in the new S network
when it is called.

Bayes’ Theorem

• Bayes’ Theorem is named after Reverend Thomas Bayes.

• It is a very important theorem in mathematics that is used to find the probability of an event, based
on prior knowledge of conditions that might be related to that event.
• It is a further application of conditional probability.
• It is used where the probability of occurrence of a particular event is calculated based on other
conditions; this is also called conditional probability.
What is Bayes’ Theorem?

• Bayes theorem is also known as the Bayes Rule or Bayes Law.

• It is used to determine the conditional probability of event A when event B has already happened.
• The general statement of Bayes’ theorem is: “The conditional probability of an event A, given the
occurrence of another event B, is equal to the product of the probability of B given A and the probability of
A, divided by the probability of event B.” That is,
P(A|B) = P(B|A)P(A) / P(B)

where,

P(A) and P(B) are the probabilities of events A and B
P(A|B) is the probability of event A when event B happens
P(B|A) is the probability of event B when A happens
Bayes Theorem Statement

• Bayes’ Theorem for a set of n events is defined as follows. Let E1, E2,…, En be a set of events associated with the
sample space S, in which all the events E1, E2,…, En have a non-zero probability of occurrence.
• All the events E1, E2,…, En form a partition of S.

• Let A be an event from space S for which we have to find probability, then according to Bayes’
theorem,
P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)

for k = 1, 2, 3, …., n
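A small Python sketch of this statement (the function name and the toy numbers are illustrative assumptions):

# Sketch of Bayes' theorem for hypotheses E1..En:
#   P(Ei|A) = P(Ei) P(A|Ei) / sum over k of P(Ek) P(A|Ek)

def bayes_posteriors(priors, likelihoods):
    # priors[i] = P(Ei), likelihoods[i] = P(A|Ei); returns the list of P(Ei|A).
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(Ei) * P(A|Ei)
    evidence = sum(joint)                                  # P(A), by total probability
    return [j / evidence for j in joint]

# Toy check: two equally likely hypotheses; A is four times as likely under E1.
print(bayes_posteriors([0.5, 0.5], [0.8, 0.2]))            # [0.8, 0.2]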

Terms Related to Bayes Theorem

Having studied Bayes theorem in detail, let us understand the meanings of a few terms related
to the concept which are used in the Bayes theorem formula and derivation:
Conditional Probability

• The probability of an event A based on the occurrence of another event B is termed conditional
Probability. It is denoted as P(A|B) and represents the probability of A when event B has already
happened.
Joint Probability

• When the probability of two or more events occurring together at the same time is measured, it is
termed the Joint Probability. For two events A and B, the joint probability is denoted
as P(A∩B).
Random Variables

• Real-valued variables whose possible values are determined by random experiments are called random
variables. The probability of finding such variables is the experimental probability.
Bayes Theorem Formula

• For any two events A and B, the formula for the Bayes theorem is given by:

P(A|B) = P(B|A)P(A) / P(B)

where,

P(A) and P(B) are the probabilities of events A and B; also, P(B) is never equal to zero.
P(A|B) is the probability of event A when event B happens

P(B|A) is the probability of event B when A happens


Theorem of Total Probability

• Let E1, E2, …, En be mutually exclusive and exhaustive events associated with a
random experiment, and let E be an event that occurs with some Ei.

• Then, prove that

P(E) = ∑ P(E|Ei) . P(Ei), summed over i = 1 to n

Proof:

Let S be the sample space. Then,

S = E1 ∪ E2 ∪ E3 ∪ … ∪ En and Ei ∩ Ej = ∅ for i ≠ j.

E = E ∩ S

  = E ∩ (E1 ∪ E2 ∪ E3 ∪ … ∪ En)

  = (E ∩ E1) ∪ (E ∩ E2) ∪ …… ∪ (E ∩ En)

P(E) = P{(E ∩ E1) ∪ (E ∩ E2) ∪ …… ∪ (E ∩ En)}
P(E) = P(E ∩ E1) + P(E ∩ E2) + …… + P(E ∩ En)

      [since (E ∩ E1), (E ∩ E2), ……, (E ∩ En) are pairwise disjoint]

P(E) = P(E|E1) . P(E1) + P(E|E2) . P(E2) + …… + P(E|En) . P(En)   [by the multiplication theorem]
P(E) = ∑ P(E|Ei) . P(Ei), summed over i = 1 to n
Bayes Theorem Derivation

• The proof of Bayes’ Theorem is as follows. According to the conditional probability formula,
P(Ei|A) = P(Ei∩A) / P(A) …..(i)

Then, by using the multiplication rule of probability, we get

P(Ei∩A) = P(Ei)P(A|Ei) ……(ii)

Now, by the total probability theorem,

P(A) = ∑ P(Ek)P(A|Ek) …..(iii)

Substituting the values of P(Ei∩A) and P(A) from eq (ii) and eq (iii) in eq (i), we get

P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)

Note:

Various terms used in Bayes theorem are explained below in this article,

• Hypotheses: The events E1, E2,…, En in the sample space are called the hypotheses.

• Priori Probability: P(Ei) is known as the priori probability of hypothesis Ei.

• Posteriori Probability: Probability P(Ei|A) is considered as the posterior probability of hypothesis Ei


• Bayes’ theorem is also known as the formula for the probability of “causes”.

• As we know, the Ei‘s are a partition of the sample space S, and at any given time only one of
the events Ei occurs.
• Thus we conclude that the Bayes’ theorem formula gives the probability of a particular Ei, given the
event A has occurred.
Difference Between Conditional Probability and Bayes Theorem

• The difference between Conditional Probability and Bayes Theorem can be understood with the help
of the table given below,

Bayes’ Theorem: It is derived using the definition of conditional probability and is used
to find the reverse probability. Formula: P(A|B) = [P(B|A)P(A)] / P(B)

Conditional Probability: It is the probability of event A when event B has already occurred.
Formula: P(A|B) = P(A∩B) / P(B)

Examples of Bayes’ Theorem

• Bayesian inference is very important and has found application in various activities, including medicine,
science, philosophy, engineering, sports, and law, and it is directly derived from Bayes’
theorem.
• Example: Bayes’ theorem defines the accuracy of the medical test by taking into account how likely
a person is to have a disease and what is the overall accuracy of the test.

Numerical Example of Bayes’ Theorem

Example 1: A person has undertaken a job. The probabilities of completion of the job on time with
and without rain are 0.44 and 0.95 respectively. If the probability that it will rain is 0.45, then
determine the probability that the job will be completed on time.
Solution:

Let A be the event that the job will be completed on time,

B1 be the event that it rains, and
B2 be the event that it does not rain. We have,
P(B1) = 0.45,

P(B2) = 1 − P(B1) = 1 − 0.45 = 0.55

And P(A/B1) = 0.44


P(A/B2) = 0.95

Since events B1 and B2 form a partition of the sample space S, by the total probability theorem, we
have
P(A) = P(A/ B1) P(B1) + P(A/ B2) P(B2)

= 0.44 × 0.45 + 0.95 × 0.55

= 0.198 + 0.5225 = 0.7205

So, the probability that the job will be completed on time is 0.7205.

Example 2: There are three urns containing 3 white and 2 black balls; 2 white and 3 black
balls; and 1 black and 4 white balls respectively. There is an equal probability of each urn being
chosen. One ball is chosen at random from the selected urn. What is the probability that a white ball is
drawn?
Solution:

Let E1, E2, and E3 be the events of choosing the first, second, and third urn respectively.
Then,
P(E1) = P(E2) = P(E3) =1/3

Let E be the event that a white ball is drawn. Then,

P(E/E1) = 3/5, P(E/E2) = 2/5, P(E/E3) = 4/5

By theorem of total probability, we have

P(E) = P(E/E1) . P(E1) + P(E/E2) . P(E2) + P(E/E3) . P(E3)

= (3/5 × 1/3) + (2/5 × 1/3) + (4/5 × 1/3)

= 9/15 = 3/5

Example 3: A card from a pack of 52 cards is lost. From the remaining cards of the pack, two cards
are drawn and are found to be both hearts. Find the probability of the lost card being a heart.
Solution:
Let E1, E2, E3, and E4 be the events of losing a card of hearts, clubs, spades, and diamonds
respectively.
Then P(E1) = P(E2) = P(E3) = P(E4) = 13/52 = 1/4.

Let E be the event of drawing 2 hearts from the remaining 51 cards. Then, P(E|E1) = probability of
drawing 2 hearts, given that a card of hearts is missing
= 12C2 / 51C2 = (12 × 11)/2! × 2!/(51 × 50) = 22/425

P(E|E2) = probability of drawing 2 hearts, given that a card of clubs is missing

= 13C2 / 51C2 = (13 × 12)/2! × 2!/(51 × 50) = 26/425

P(E|E3) = probability of drawing 2 hearts, given that a card of spades is missing

= 13C2 / 51C2 = 26/425

P(E|E4) = probability of drawing 2 hearts, given that a card of diamonds is missing

= 13C2 / 51C2 = 26/425

Therefore,

P(E1|E) = probability that the lost card is a heart, given that 2 hearts are drawn from the remaining 51
cards
= P(E1) . P(E|E1) / [P(E1) . P(E|E1) + P(E2) . P(E|E2) + P(E3) . P(E|E3) + P(E4) . P(E|E4)]

= (1/4 × 22/425) / {(1/4 × 22/425) + (1/4 × 26/425) + (1/4 × 26/425) + (1/4 × 26/425)}
= 22/100 = 0.22

Hence, The required probability is 0.22.

Example 4: Suppose 15 men out of 300 men and 25 women out of 1000 women are good orators. An orator
is chosen at random. Find the probability that the chosen orator is a man. Assume that there
are equal numbers of men and women.
Solution:

Let there be 1000 men and 1000 women.

Let E1 and E2 be the events of choosing a man and a woman respectively. Then, P(E1) = 1000/2000 = 1/2
, and P(E2) = 1000/2000 = 1/2
Let E be the event of choosing an orator. Then,
P(E|E1) = 50/1000 = 1/20, and P(E|E2) = 25/1000 = 1/40

Probability of selecting a man, given that the person selected is a good orator:
P(E1|E) = P(E|E1) × P(E1) / [P(E|E1) × P(E1) + P(E|E2) × P(E2)]
= (1/2 × 1/20) / {(1/2 × 1/20) + (1/2 × 1/40)}

= 2/3

Hence the required probability is 2/3.

Example 5: A man is known to lie 1 out of 4 times. He throws a die and reports that it is
a six. Find the probability that it is actually a six.
Solution:

In a throw of a die, let

E1 = event of getting a six,

E2 = event of not getting a six and

E = event that the man reports that it is a six. Then, P(E1) = 1/6, and P(E2) = (1 – 1/6) = 5/6
P(E|E1) = probability that the man reports that six occurs when six has actually occurred
= probability that the man speaks the truth

= 3/4

P(E|E2) = probability that the man reports that six occurs when six has not actually occurred
= probability that the man does not speak the truth

= (1 – 3/4) = 1/4

Probability of getting a six, given that the man reports it to be a six:

P(E1|E) = P(E|E1) × P(E1) / [P(E|E1) × P(E1) + P(E|E2) × P(E2)]   [by Bayes’ theorem]

= (3/4 × 1/6) / {(3/4 × 1/6) + (1/4 × 5/6)}

= (1/8) / (1/3) = 3/8

Hence the probability required is 3/8.
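As a quick numeric cross-check of Examples 1, 3 and 5, the short sketch below redoes the arithmetic with a helper in the same spirit as the bayes_posteriors sketch given after the theorem statement (an assumed helper, not part of the text):

# Numeric cross-check of Examples 1, 3 and 5 (helper name is an illustrative assumption).

def bayes_posteriors(priors, likelihoods):
    joint = [p * l for p, l in zip(priors, likelihoods)]
    return [j / sum(joint) for j in joint]

# Example 1 (total probability): P(A) = 0.44*0.45 + 0.95*0.55
print(0.44 * 0.45 + 0.95 * 0.55)                              # 0.7205

# Example 3 (lost card): priors 1/4 each; likelihoods 22/425, 26/425, 26/425, 26/425
print(bayes_posteriors([0.25] * 4,
                       [22/425, 26/425, 26/425, 26/425])[0])  # 0.22

# Example 5 (reported six): priors 1/6 and 5/6; likelihoods 3/4 and 1/4
print(bayes_posteriors([1/6, 5/6], [3/4, 1/4])[0])            # 0.375 = 3/8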

FAQs on Bayes’ Theorem

What is Bayes’ theorem?


Bayes’ theorem, as the name suggests, is a mathematical theorem that is used to find the conditional
probability of an event. Conditional probability is the probability of an event occurring,
calculated based on the previous outcomes of other events.
When to use Bayes’ theorem?

Bayes’ theorem is applicable when the conditional probability of an event is given; it is used to find the
reverse probability of the event.
How is Bayes’ theorem different from conditional probability?

Conditional probability gives the probability of an event based on the prior occurrence of another event,
whereas Bayes’ Theorem uses conditional probability to find the reverse probability of the event.
What is the formula for Bayes’ theorem?

The Bayes theorem formula is explained below:
P(A|B) = [P(B|A) P(A)] / P(B)
where,

P(A) and P(B) are the probabilities of events A and B
P(A|B) is the probability of event A when event B happens
P(B|A) is the probability of event B when A happens.

Shannon Game, Entropy & Cross-Entropy:


In NLP, a "Shannon game" is a thought experiment that illustrates the concept of Shannon entropy,
where you try to guess the next word in a sequence based on the probability distribution of words, and
the "cross-entropy" is calculated by comparing the predicted probability distribution of your model
to the true distribution of the language, essentially measuring how well your model is capturing the
"uncertainty" of the next word based on the context; here's a simple example to understand the
Calculation.
Example Scenario:
• Text: "The quick brown fox jumps over the ___"

• Possible next words: "fence", "dog", "cat"

• True probability distribution:


o "fence": 0.6

o "dog": 0.2

o "cat": 0.2
Calculating Entropy (of the true distribution):
Formula: Entropy (H) = - Σ (p(x) * log2(p(x))) where p(x) is the probability of event x.

Calculation:
H = - (0.6 * log2(0.6) + 0.2 * log2(0.2) + 0.2 * log2(0.2))

H ≈ 1.37 bits
Model Prediction (example):
Predicted probability distribution:
o "fence": 0.4

o "dog": 0.3

o "cat": 0.3
Calculating Cross-Entropy:
Formula: Cross-Entropy (H(P, Q)) = - Σ (p(x) * log2(q(x)))

where p(x) is the true probability and q(x) is the predicted probability.

Calculation:
H(P, Q) = - (0.6 * log2(0.4) + 0.2 * log2(0.3) + 0.2 * log2(0.3))

H(P, Q) ≈ 1.49 bits


Interpretation:
• Entropy: The true entropy of about 1.37 bits indicates that there is a moderate level of uncertainty
about the next word, as no single option is overwhelmingly likely.

• Cross-Entropy: The calculated cross-entropy of about 1.49 bits is higher than the true entropy,
meaning the model's predicted probability distribution is not perfectly aligned with the true
distribution, indicating a small loss in information due to the prediction.
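The calculation above can be reproduced with a short Python sketch (the function names and rounding are illustrative assumptions):

import math

def entropy(p):
    # H(P) = - sum of p(x) * log2(p(x))
    return -sum(px * math.log2(px) for px in p.values())

def cross_entropy(p, q):
    # H(P, Q) = - sum of p(x) * log2(q(x)); p is the true distribution, q the model's.
    return -sum(px * math.log2(q[x]) for x, px in p.items())

true_dist  = {"fence": 0.6, "dog": 0.2, "cat": 0.2}
model_dist = {"fence": 0.4, "dog": 0.3, "cat": 0.3}

print(round(entropy(true_dist), 2))                     # ~1.37 bits
print(round(cross_entropy(true_dist, model_dist), 2))   # ~1.49 bits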
Key points:
• Higher entropy: A higher entropy value signifies more uncertainty in the language, meaning
there are many possible words that could come next.

• Lower cross-entropy: A lower cross-entropy value indicates that the model's predictions are
closer to the true probability distribution, suggesting better prediction performance.

