Commit 2f03807

antmarakis authored and norvig committed

NLP: Chart Parsing (aimacode#612)

* Update nlp.py
* add chart parsing test
* add chart parsing section

1 parent d84c3bf commit 2f03807

File tree

3 files changed: +258 -8 lines changed

nlp.ipynb

Lines changed: 250 additions & 3 deletions
@@ -22,7 +22,7 @@
 "import nlp\n",
 "from nlp import Page, HITS\n",
 "from nlp import Lexicon, Rules, Grammar, ProbLexicon, ProbRules, ProbGrammar\n",
-"from nlp import CYK_parse"
+"from nlp import CYK_parse, Chart"
 ]
 },
 {
@@ -36,7 +36,9 @@
 "* Overview\n",
 "* Languages\n",
 "* HITS\n",
-"* Question Answering"
+"* Question Answering\n",
+"* CYK Parse\n",
+"* Chart Parsing"
 ]
 },
 {
@@ -45,7 +47,11 @@
 "source": [
 "## OVERVIEW\n",
 "\n",
-"`TODO...`"
+"**Natural Language Processing (NLP)** is a field of AI concerned with understanding, analyzing, and using natural languages. It is a difficult yet intriguing field of study, since it is closely connected to how humans and their languages work.\n",
+"\n",
+"Applications of the field include translation, speech recognition, topic segmentation, information extraction and retrieval, and much more.\n",
+"\n",
+"Below we take a look at some algorithms in the field. First, though, we introduce a very useful class of languages: **context-free** languages. Even though they are somewhat restrictive, they have been used extensively in natural language processing research."
 ]
 },
 {
@@ -908,6 +914,247 @@
 "\n",
 "Notice how the probability for the whole string (given by the key `('S', 0, 4)`) is 0.015. This means the most probable parsing of the sentence has a probability of 0.015."
 ]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## CHART PARSING\n",
+"\n",
+"### Overview\n",
+"\n",
+"Let's now take a look at a more general chart parsing algorithm. Given a non-probabilistic grammar and a sentence, this algorithm builds a parse tree in a top-down manner, with the words of the sentence as the leaves. It works with a dynamic programming approach, building a chart to store parses for substrings so that it doesn't have to analyze them again (just like the CYK algorithm). Each non-terminal, starting from S, gets replaced by its right-hand side rules in the chart, until we end up with the correct parses.\n",
+"\n",
+"### Implementation\n",
+"\n",
+"A parse is in the form `[start, end, non-terminal, sub-tree, expected-transformation]`, where `sub-tree` is a tree with the corresponding `non-terminal` as its root and `expected-transformation` is a right-hand side rule of the `non-terminal`.\n",
+"\n",
+"The chart parsing is implemented in a class, `Chart`. It is initialized with a grammar and can return the list of all the parses of a sentence with the `parses` function.\n",
+"\n",
+"The chart is a list of lists. The lists correspond to the lengths of substrings (including the empty string), from start to finish. When we say 'a point in the chart', we refer to a list of a certain length.\n",
+"\n",
+"A quick rundown of the class functions:"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"collapsed": true
+},
+"source": [
+"* `parses`: Returns a list of parses for a given sentence. If the sentence can't be parsed, it will return an empty list. Initializes the process by calling `parse` from the starting symbol.\n",
+"\n",
+"\n",
+"* `parse`: Parses the list of words and builds the chart.\n",
+"\n",
+"\n",
+"* `add_edge`: Adds another edge to the chart at a given point. Also examines whether the edge extends or predicts another edge. If the edge itself is not expecting a transformation, it extends other edges; otherwise, it predicts new edges.\n",
+"\n",
+"\n",
+"* `scanner`: Given a word and a point in the chart, it extends edges that were expecting a transformation that can result in the given word. For example, if the word 'the' is an 'Article' and we are examining two edges at a chart's point, one expecting an 'Article' and the other a 'Verb', the first one will be extended while the second one will not.\n",
+"\n",
+"\n",
+"* `predictor`: If an edge can't extend other edges (because it is expecting a transformation itself), we add to the chart rules/transformations that can help extend the edge. The new edges come from the right-hand side of the expected transformation's rules. For example, if an edge is expecting the transformation 'Adjective Noun', we will add to the chart an edge for each right-hand side rule of the non-terminal 'Adjective'.\n",
+"\n",
+"\n",
+"* `extender`: Extends edges given an edge (called `E`). If `E`'s non-terminal is the same as the expected transformation of another edge (let's call it `A`), add to the chart a new edge with the non-terminal of `A` and the transformations of `A` minus the non-terminal that matched with `E`'s non-terminal. For example, suppose an edge `E` has 'Article' as its non-terminal and is expecting no transformation, and we need to see which edges it can extend. Let's examine the edge `N`. This expects a transformation of 'Noun Verb'. 'Noun' does not match with 'Article', so we move on. Another edge, `A`, expects a transformation of 'Article Noun' and has a non-terminal of 'NP'. We have a match! A new edge will be added with 'NP' as its non-terminal (the non-terminal of `A`) and 'Noun' as the expected transformation (the rest of the expected transformation of `A`)."
+]
+},
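As an aside, the interplay of `predictor`, `scanner`, and `extender` described above can be sketched as a minimal Earley-style parser. Everything below (the toy grammar, the lexicon, and the `parses` helper) is an illustrative, hypothetical sketch, not the repository's actual `Chart` implementation:

```python
# A minimal Earley-style chart parser, sketched for illustration only.
# An edge is [start, end, lhs, found, expects]: `found` holds the completed
# children so far, `expects` the symbols still needed to finish the rule.

GRAMMAR = {                                      # hypothetical toy grammar
    'S':  [['NP', 'VP']],
    'NP': [['Article', 'Noun']],
    'VP': [['Verb']],
}
LEXICON = {'the': 'Article', 'dog': 'Noun', 'barks': 'Verb'}

def parses(words):
    chart = [[] for _ in range(len(words) + 1)]  # chart[i]: edges ending at i

    def add_edge(edge):
        start, end, lhs, found, expects = edge
        if edge in chart[end]:                   # skip duplicates (also stops
            return                               # runaway prediction loops)
        chart[end].append(edge)
        if not expects:
            # Extender: a complete edge extends edges that were waiting for lhs.
            for s, e, l, f, x in list(chart[start]):
                if x and x[0] == lhs:
                    add_edge([s, end, l, f + [edge], x[1:]])
        else:
            # Predictor: expand the first expected symbol with its rules.
            for rhs in GRAMMAR.get(expects[0], []):
                add_edge([end, end, expects[0], [], list(rhs)])

    add_edge([0, 0, 'S_', [], ['S']])            # dummy start edge
    for i, word in enumerate(words):
        # Scanner: extend edges expecting this word's lexical category.
        category = LEXICON[word]
        for s, e, l, f, x in list(chart[i]):
            if x and x[0] == category:
                add_edge([s, i + 1, l, f + [(category, word)], x[1:]])
    return [e for e in chart[len(words)]
            if e[0] == 0 and e[2] == 'S' and not e[4]]

print(len(parses(['the', 'dog', 'barks'])))      # one full parse
```

The duplicate check in `add_edge` doubles as the memoization step: a parse for a substring is recorded once and reused by every edge that needs it.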
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Example\n",
+"\n",
+"We will use the grammar `E0` to parse the sentence \"the stench is in 2 2\".\n",
+"\n",
+"First we need to build a `Chart` object:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 2,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"chart = Chart(nlp.E0)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"And then we simply call the `parses` function:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 3,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"[[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]]\n"
+]
+}
+],
+"source": [
+"print(chart.parses('the stench is in 2 2'))"
+]
+},
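The nested parse lists in the output above can be hard to read at a glance. A small hypothetical helper (not part of this commit) can flatten a parse node, `[start, end, non-terminal, children, expects]`, into a bracketed tree string:

```python
# Hypothetical pretty-printer for the parse structure shown above; not part
# of this commit. A parse node is [start, end, non-terminal, children, expects]
# and lexical leaves are (category, word) tuples.

def tree_string(node):
    if isinstance(node, tuple):                  # leaf: (category, word)
        return '({} {})'.format(*node)
    _, _, nonterminal, children, _ = node
    return '({} {})'.format(nonterminal,
                            ' '.join(tree_string(c) for c in children))

np = [0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]
print(tree_string(np))  # (NP (Article the) (Noun stench))
```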
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"You can see which edges get added by setting the optional initialization argument `trace` to `True`."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 4,
+"metadata": {
+"collapsed": true
+},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"Chart: added [0, 0, 'S_', [], ['S']]\n",
+"Chart: added [0, 0, 'S', [], ['NP', 'VP']]\n",
+"Chart: added [0, 0, 'NP', [], ['Pronoun']]\n",
+"Chart: added [0, 0, 'NP', [], ['Name']]\n",
+"Chart: added [0, 0, 'NP', [], ['Noun']]\n",
+"Chart: added [0, 0, 'NP', [], ['Article', 'Noun']]\n",
+"Chart: added [0, 0, 'NP', [], ['Digit', 'Digit']]\n",
+"Chart: added [0, 0, 'NP', [], ['NP', 'PP']]\n",
+"Chart: added [0, 0, 'NP', [], ['NP', 'RelClause']]\n",
+"Chart: added [0, 0, 'S', [], ['S', 'Conjunction', 'S']]\n",
+"Chart: added [0, 1, 'NP', [('Article', 'the')], ['Noun']]\n",
+"Chart: added [0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]\n",
+"Chart: added [0, 2, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['VP']]\n",
+"Chart: added [2, 2, 'VP', [], ['Verb']]\n",
+"Chart: added [2, 2, 'VP', [], ['VP', 'NP']]\n",
+"Chart: added [2, 2, 'VP', [], ['VP', 'Adjective']]\n",
+"Chart: added [2, 2, 'VP', [], ['VP', 'PP']]\n",
+"Chart: added [2, 2, 'VP', [], ['VP', 'Adverb']]\n",
+"Chart: added [0, 2, 'NP', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['PP']]\n",
+"Chart: added [2, 2, 'PP', [], ['Preposition', 'NP']]\n",
+"Chart: added [0, 2, 'NP', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['RelClause']]\n",
+"Chart: added [2, 2, 'RelClause', [], ['That', 'VP']]\n",
+"Chart: added [2, 3, 'VP', [('Verb', 'is')], []]\n",
+"Chart: added [0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]\n",
+"Chart: added [0, 3, 'S_', [[0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]], []]\n",
+"Chart: added [0, 3, 'S', [[0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]], ['Conjunction', 'S']]\n",
+"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['NP']]\n",
+"Chart: added [3, 3, 'NP', [], ['Pronoun']]\n",
+"Chart: added [3, 3, 'NP', [], ['Name']]\n",
+"Chart: added [3, 3, 'NP', [], ['Noun']]\n",
+"Chart: added [3, 3, 'NP', [], ['Article', 'Noun']]\n",
+"Chart: added [3, 3, 'NP', [], ['Digit', 'Digit']]\n",
+"Chart: added [3, 3, 'NP', [], ['NP', 'PP']]\n",
+"Chart: added [3, 3, 'NP', [], ['NP', 'RelClause']]\n",
+"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['Adjective']]\n",
+"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['PP']]\n",
+"Chart: added [3, 3, 'PP', [], ['Preposition', 'NP']]\n",
+"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['Adverb']]\n",
+"Chart: added [3, 4, 'PP', [('Preposition', 'in')], ['NP']]\n",
+"Chart: added [4, 4, 'NP', [], ['Pronoun']]\n",
+"Chart: added [4, 4, 'NP', [], ['Name']]\n",
+"Chart: added [4, 4, 'NP', [], ['Noun']]\n",
+"Chart: added [4, 4, 'NP', [], ['Article', 'Noun']]\n",
+"Chart: added [4, 4, 'NP', [], ['Digit', 'Digit']]\n",
+"Chart: added [4, 4, 'NP', [], ['NP', 'PP']]\n",
+"Chart: added [4, 4, 'NP', [], ['NP', 'RelClause']]\n",
+"Chart: added [4, 5, 'NP', [('Digit', '2')], ['Digit']]\n",
+"Chart: added [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]\n",
+"Chart: added [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]\n",
+"Chart: added [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]\n",
+"Chart: added [0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]\n",
+"Chart: added [0, 6, 'S_', [[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]], []]\n",
+"Chart: added [0, 6, 'S', [[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]], ['Conjunction', 'S']]\n",
+"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['NP']]\n",
+"Chart: added [6, 6, 'NP', [], ['Pronoun']]\n",
+"Chart: added [6, 6, 'NP', [], ['Name']]\n",
+"Chart: added [6, 6, 'NP', [], ['Noun']]\n",
+"Chart: added [6, 6, 'NP', [], ['Article', 'Noun']]\n",
+"Chart: added [6, 6, 'NP', [], ['Digit', 'Digit']]\n",
+"Chart: added [6, 6, 'NP', [], ['NP', 'PP']]\n",
+"Chart: added [6, 6, 'NP', [], ['NP', 'RelClause']]\n",
+"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['Adjective']]\n",
+"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['PP']]\n",
+"Chart: added [6, 6, 'PP', [], ['Preposition', 'NP']]\n",
+"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['Adverb']]\n",
+"Chart: added [4, 6, 'NP', [[4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], ['PP']]\n",
+"Chart: added [4, 6, 'NP', [[4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], ['RelClause']]\n",
+"Chart: added [6, 6, 'RelClause', [], ['That', 'VP']]\n"
+]
+},
+{
+"data": {
+"text/plain": [
+"[[0,\n",
+" 6,\n",
+" 'S',\n",
+" [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []],\n",
+" [2,\n",
+" 6,\n",
+" 'VP',\n",
+" [[2, 3, 'VP', [('Verb', 'is')], []],\n",
+" [3,\n",
+" 6,\n",
+" 'PP',\n",
+" [('Preposition', 'in'),\n",
+" [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]],\n",
+" []]],\n",
+" []]],\n",
+" []]]"
+]
+},
+"execution_count": 4,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"chart_trace = Chart(nlp.E0, trace=True)\n",
+"chart_trace.parses('the stench is in 2 2')"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let's try and parse a sentence that is not recognized by the grammar:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 5,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"[]\n"
+]
+}
+],
+"source": [
+"print(chart.parses('the stench 2 2'))"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"An empty list was returned."
+]
 }
 ],
 "metadata": {

nlp.py

Lines changed: 1 addition & 4 deletions

@@ -1,8 +1,5 @@
 """Natural Language Processing; Chart Parsing and PageRanking (Chapter 22-23)"""
 
-# (Written for the second edition of AIMA; expect some discrepanciecs
-# from the third edition until this gets reviewed.)
-
 from collections import defaultdict
 from utils import weighted_choice
 import urllib.request
@@ -274,7 +271,7 @@ def __repr__(self):
 
 class Chart:
 
-    """Class for parsing sentences using a chart data structure. [Figure 22.7]
+    """Class for parsing sentences using a chart data structure.
     >>> chart = Chart(E0);
     >>> len(chart.parses('the stench is in 2 2'))
     1

tests/test_nlp.py

Lines changed: 7 additions & 1 deletion

@@ -5,7 +5,7 @@
 from nlp import expand_pages, relevant_pages, normalize, ConvergenceDetector, getInlinks
 from nlp import getOutlinks, Page, determineInlinks, HITS
 from nlp import Rules, Lexicon, Grammar, ProbRules, ProbLexicon, ProbGrammar
-from nlp import CYK_parse
+from nlp import Chart, CYK_parse
 # Clumsy imports because we want to access certain nlp.py globals explicitly, because
 # they are accessed by functions within nlp.py
 
@@ -101,6 +101,12 @@ def test_prob_generation():
     assert len(sentence) == 2
 
 
+def test_chart_parsing():
+    chart = Chart(nlp.E0)
+    parses = chart.parses('the stench is in 2 2')
+    assert len(parses) == 1
+
+
 def test_CYK_parse():
     grammar = nlp.E_Prob_Chomsky
     words = ['the', 'robot', 'is', 'good']
