|
22 | 22 | "import nlp\n",
|
23 | 23 | "from nlp import Page, HITS\n",
|
24 | 24 | "from nlp import Lexicon, Rules, Grammar, ProbLexicon, ProbRules, ProbGrammar\n",
|
25 |
| - "from nlp import CYK_parse" |
| 25 | + "from nlp import CYK_parse, Chart" |
26 | 26 | ]
|
27 | 27 | },
|
28 | 28 | {
|
|
36 | 36 | "* Overview\n",
|
37 | 37 | "* Languages\n",
|
38 | 38 | "* HITS\n",
|
39 |
| - "* Question Answering" |
| 39 | + "* Question Answering\n", |
| 40 | + "* CYK Parse\n", |
| 41 | + "* Chart Parsing" |
40 | 42 | ]
|
41 | 43 | },
|
42 | 44 | {
|
|
45 | 47 | "source": [
|
46 | 48 | "## OVERVIEW\n",
|
47 | 49 | "\n",
|
48 |
| - "`TODO...`" |
| 50 | + "**Natural Language Processing (NLP)** is the subfield of AI concerned with understanding, analyzing, and generating natural language. It is considered a difficult yet intriguing area of study, since it is closely tied to how humans and their languages work.\n", |
| 51 | + "\n", |
| 52 | + "Applications of the field include machine translation, speech recognition, topic segmentation, information extraction and retrieval, and much more.\n", |
| 53 | + "\n", |
| 54 | + "Below we take a look at some algorithms in the field. Before diving in, though, we will look at a very useful class of languages: **context-free** languages. Even though they are somewhat restrictive, they have been used extensively in natural language processing research." |
49 | 55 | ]
|
50 | 56 | },
|
51 | 57 | {
|
|
908 | 914 | "\n",
|
909 | 915 | "Notice how the probability for the whole string (given by the key `('S', 0, 4)`) is 0.015. This means the most probable parsing of the sentence has a probability of 0.015."
|
910 | 916 | ]
|
| 917 | + }, |
| 918 | + { |
| 919 | + "cell_type": "markdown", |
| 920 | + "metadata": {}, |
| 921 | + "source": [ |
| 922 | + "## CHART PARSING\n", |
| 923 | + "\n", |
| 924 | + "### Overview\n", |
| 925 | + "\n", |
| 926 | + "Let's now take a look at a more general chart-parsing algorithm. Given a non-probabilistic grammar and a sentence, this algorithm builds a parse tree in a top-down manner, with the words of the sentence as the leaves. It takes a dynamic programming approach, building a chart to store parses of substrings so that it doesn't have to analyze them again (just like the CYK algorithm). Each non-terminal, starting from S, is expanded via its right-hand-side rules in the chart, until we end up with the complete parses.\n", |
| 927 | + "\n", |
| 928 | + "### Implementation\n", |
| 929 | + "\n", |
| 930 | + "A parse is in the form `[start, end, non-terminal, sub-tree, expected-transformation]`, where `sub-tree` is a tree with the corresponding `non-terminal` as its root and `expected-transformation` is a right-hand side rule of the `non-terminal`.\n", |
| 931 | + "\n", |
| 932 | + "The chart parsing is implemented in a class, `Chart`. It is initialized with a grammar and can return the list of all the parses of a sentence with the `parses` function.\n", |
| 933 | + "\n", |
| 934 | + "The chart is a list of lists. The lists correspond to the lengths of substrings (including the empty string), from start to finish. When we say 'a point in the chart', we refer to a list of a certain length.\n", |
| 935 | + "\n", |
| 936 | + "A quick rundown of the class functions:" |
| 937 | + ] |
| 938 | + }, |
| 939 | + { |
| 940 | + "cell_type": "markdown", |
| 941 | + "metadata": { |
| 942 | + "collapsed": true |
| 943 | + }, |
| 944 | + "source": [ |
| 945 | + "* `parses`: Returns a list of parses for a given sentence. If the sentence can't be parsed, it will return an empty list. Initializes the process by calling `parse` from the starting symbol.\n", |
| 946 | + "\n", |
| 947 | + "\n", |
| 948 | + "* `parse`: Parses the list of words and builds the chart.\n", |
| 949 | + "\n", |
| 950 | + "\n", |
| 951 | + "* `add_edge`: Adds another edge to the chart at a given point, and examines whether the edge extends or predicts other edges: if the edge itself is not expecting a transformation, it extends other edges; otherwise, it predicts new edges.\n", |
| 952 | + "\n", |
| 953 | + "\n", |
| 954 | + "* `scanner`: Given a word and a point in the chart, it extends edges that were expecting a transformation that can result in the given word. For example, if the word 'the' is an 'Article' and we are examining two edges at a chart's point, with one expecting an 'Article' and the other a 'Verb', the first one will be extended while the second one will not.\n", |
| 955 | + "\n", |
| 956 | + "\n", |
| 957 | + "* `predictor`: If an edge can't extend other edges (because it is expecting a transformation itself), we will add to the chart rules/transformations that can help extend the edge. The new edges come from the right-hand side of the expected transformation's rules. For example, if an edge is expecting the transformation 'Adjective Noun', we will add to the chart an edge for each right-hand side rule of the non-terminal 'Adjective'.\n", |
| 958 | + "\n", |
| 959 | + "\n", |
| 960 | + "* `extender`: Extends edges given an edge (called `E`). If `E`'s non-terminal is the same as the expected transformation of another edge (let's call it `A`), add to the chart a new edge with the non-terminal of `A` and the transformations of `A` minus the non-terminal that matched with `E`'s non-terminal. For example, if an edge `E` has 'Article' as its non-terminal and is expecting no transformation, we need to see what edges it can extend. Let's examine the edge `N`. This expects a transformation of 'Noun Verb'. 'Noun' does not match with 'Article', so we move on. Another edge, `A`, expects a transformation of 'Article Noun' and has a non-terminal of 'NP'. We have a match! A new edge will be added with 'NP' as its non-terminal (the non-terminal of `A`) and 'Noun' as the expected transformation (the rest of the expected transformation of `A`)." |
| 961 | + ] |
| 962 | + }, |
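The scanner/predictor/extender loop described above can be sketched as a minimal Earley-style parser. This is an illustrative standalone sketch, not the aima-python `Chart` implementation: the toy `GRAMMAR`, `LEXICON`, and the `earley` helper are hypothetical names introduced for this example, and edges are simplified to dotted rules rather than the `[start, end, non-terminal, sub-tree, expected-transformation]` form used by `Chart`.

```python
# Minimal Earley-style chart parser illustrating the scanner/predictor/
# extender loop described above. Not the aima-python Chart class: edges
# here are simplified dotted rules of the form (lhs, rhs, dot, origin).

# Hypothetical toy grammar and lexicon, introduced only for this sketch.
GRAMMAR = {
    'S':  [('NP', 'VP')],
    'NP': [('Article', 'Noun')],
    'VP': [('Verb',)],
}
LEXICON = {'the': 'Article', 'dog': 'Noun', 'barks': 'Verb'}

def earley(words, grammar, lexicon, start='S'):
    """Return True if the word list is derivable from the start symbol."""
    n = len(words)
    # chart[k] holds the edges that end at word position k
    chart = [set() for _ in range(n + 1)]
    chart[0].add(('S_', (start,), 0, 0))           # dummy start edge
    for k in range(n + 1):
        agenda = list(chart[k])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                expected = rhs[dot]
                if expected in grammar:
                    # predictor: add edges for each right-hand side
                    # of the expected non-terminal
                    for production in grammar[expected]:
                        edge = (expected, production, 0, k)
                        if edge not in chart[k]:
                            chart[k].add(edge)
                            agenda.append(edge)
                elif k < n and lexicon.get(words[k]) == expected:
                    # scanner: the next word has the expected category,
                    # so the edge advances past it
                    chart[k + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # extender: a finished edge for lhs extends every edge
                # at its origin that was expecting lhs next
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        edge = (l2, r2, d2 + 1, o2)
                        if edge not in chart[k]:
                            chart[k].add(edge)
                            agenda.append(edge)
    return ('S_', (start,), 1, 0) in chart[n]
```

Under this toy grammar, `earley('the dog barks'.split(), GRAMMAR, LEXICON)` returns `True`, while an incomplete sentence like `'the dog'` returns `False`.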
| 963 | + { |
| 964 | + "cell_type": "markdown", |
| 965 | + "metadata": {}, |
| 966 | + "source": [ |
| 967 | + "### Example\n", |
| 968 | + "\n", |
| 969 | + "We will use the grammar `E0` to parse the sentence \"the stench is in 2 2\".\n", |
| 970 | + "\n", |
| 971 | + "First we need to build a `Chart` object:" |
| 972 | + ] |
| 973 | + }, |
| 974 | + { |
| 975 | + "cell_type": "code", |
| 976 | + "execution_count": 2, |
| 977 | + "metadata": { |
| 978 | + "collapsed": true |
| 979 | + }, |
| 980 | + "outputs": [], |
| 981 | + "source": [ |
| 982 | + "chart = Chart(nlp.E0)" |
| 983 | + ] |
| 984 | + }, |
| 985 | + { |
| 986 | + "cell_type": "markdown", |
| 987 | + "metadata": {}, |
| 988 | + "source": [ |
| 989 | + "And then we simply call the `parses` function:" |
| 990 | + ] |
| 991 | + }, |
| 992 | + { |
| 993 | + "cell_type": "code", |
| 994 | + "execution_count": 3, |
| 995 | + "metadata": {}, |
| 996 | + "outputs": [ |
| 997 | + { |
| 998 | + "name": "stdout", |
| 999 | + "output_type": "stream", |
| 1000 | + "text": [ |
| 1001 | + "[[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]]\n" |
| 1002 | + ] |
| 1003 | + } |
| 1004 | + ], |
| 1005 | + "source": [ |
| 1006 | + "print(chart.parses('the stench is in 2 2'))" |
| 1007 | + ] |
| 1008 | + }, |
| 1009 | + { |
| 1010 | + "cell_type": "markdown", |
| 1011 | + "metadata": {}, |
| 1012 | + "source": [ |
| 1013 | + "You can see which edges get added by setting the optional initialization argument `trace` to `True`." |
| 1014 | + ] |
| 1015 | + }, |
| 1016 | + { |
| 1017 | + "cell_type": "code", |
| 1018 | + "execution_count": 4, |
| 1019 | + "metadata": { |
| 1020 | + "collapsed": true |
| 1021 | + }, |
| 1022 | + "outputs": [ |
| 1023 | + { |
| 1024 | + "name": "stdout", |
| 1025 | + "output_type": "stream", |
| 1026 | + "text": [ |
| 1027 | + "Chart: added [0, 0, 'S_', [], ['S']]\n", |
| 1028 | + "Chart: added [0, 0, 'S', [], ['NP', 'VP']]\n", |
| 1029 | + "Chart: added [0, 0, 'NP', [], ['Pronoun']]\n", |
| 1030 | + "Chart: added [0, 0, 'NP', [], ['Name']]\n", |
| 1031 | + "Chart: added [0, 0, 'NP', [], ['Noun']]\n", |
| 1032 | + "Chart: added [0, 0, 'NP', [], ['Article', 'Noun']]\n", |
| 1033 | + "Chart: added [0, 0, 'NP', [], ['Digit', 'Digit']]\n", |
| 1034 | + "Chart: added [0, 0, 'NP', [], ['NP', 'PP']]\n", |
| 1035 | + "Chart: added [0, 0, 'NP', [], ['NP', 'RelClause']]\n", |
| 1036 | + "Chart: added [0, 0, 'S', [], ['S', 'Conjunction', 'S']]\n", |
| 1037 | + "Chart: added [0, 1, 'NP', [('Article', 'the')], ['Noun']]\n", |
| 1038 | + "Chart: added [0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]\n", |
| 1039 | + "Chart: added [0, 2, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['VP']]\n", |
| 1040 | + "Chart: added [2, 2, 'VP', [], ['Verb']]\n", |
| 1041 | + "Chart: added [2, 2, 'VP', [], ['VP', 'NP']]\n", |
| 1042 | + "Chart: added [2, 2, 'VP', [], ['VP', 'Adjective']]\n", |
| 1043 | + "Chart: added [2, 2, 'VP', [], ['VP', 'PP']]\n", |
| 1044 | + "Chart: added [2, 2, 'VP', [], ['VP', 'Adverb']]\n", |
| 1045 | + "Chart: added [0, 2, 'NP', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['PP']]\n", |
| 1046 | + "Chart: added [2, 2, 'PP', [], ['Preposition', 'NP']]\n", |
| 1047 | + "Chart: added [0, 2, 'NP', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['RelClause']]\n", |
| 1048 | + "Chart: added [2, 2, 'RelClause', [], ['That', 'VP']]\n", |
| 1049 | + "Chart: added [2, 3, 'VP', [('Verb', 'is')], []]\n", |
| 1050 | + "Chart: added [0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]\n", |
| 1051 | + "Chart: added [0, 3, 'S_', [[0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]], []]\n", |
| 1052 | + "Chart: added [0, 3, 'S', [[0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]], ['Conjunction', 'S']]\n", |
| 1053 | + "Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['NP']]\n", |
| 1054 | + "Chart: added [3, 3, 'NP', [], ['Pronoun']]\n", |
| 1055 | + "Chart: added [3, 3, 'NP', [], ['Name']]\n", |
| 1056 | + "Chart: added [3, 3, 'NP', [], ['Noun']]\n", |
| 1057 | + "Chart: added [3, 3, 'NP', [], ['Article', 'Noun']]\n", |
| 1058 | + "Chart: added [3, 3, 'NP', [], ['Digit', 'Digit']]\n", |
| 1059 | + "Chart: added [3, 3, 'NP', [], ['NP', 'PP']]\n", |
| 1060 | + "Chart: added [3, 3, 'NP', [], ['NP', 'RelClause']]\n", |
| 1061 | + "Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['Adjective']]\n", |
| 1062 | + "Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['PP']]\n", |
| 1063 | + "Chart: added [3, 3, 'PP', [], ['Preposition', 'NP']]\n", |
| 1064 | + "Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['Adverb']]\n", |
| 1065 | + "Chart: added [3, 4, 'PP', [('Preposition', 'in')], ['NP']]\n", |
| 1066 | + "Chart: added [4, 4, 'NP', [], ['Pronoun']]\n", |
| 1067 | + "Chart: added [4, 4, 'NP', [], ['Name']]\n", |
| 1068 | + "Chart: added [4, 4, 'NP', [], ['Noun']]\n", |
| 1069 | + "Chart: added [4, 4, 'NP', [], ['Article', 'Noun']]\n", |
| 1070 | + "Chart: added [4, 4, 'NP', [], ['Digit', 'Digit']]\n", |
| 1071 | + "Chart: added [4, 4, 'NP', [], ['NP', 'PP']]\n", |
| 1072 | + "Chart: added [4, 4, 'NP', [], ['NP', 'RelClause']]\n", |
| 1073 | + "Chart: added [4, 5, 'NP', [('Digit', '2')], ['Digit']]\n", |
| 1074 | + "Chart: added [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]\n", |
| 1075 | + "Chart: added [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]\n", |
| 1076 | + "Chart: added [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]\n", |
| 1077 | + "Chart: added [0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]\n", |
| 1078 | + "Chart: added [0, 6, 'S_', [[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]], []]\n", |
| 1079 | + "Chart: added [0, 6, 'S', [[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]], ['Conjunction', 'S']]\n", |
| 1080 | + "Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['NP']]\n", |
| 1081 | + "Chart: added [6, 6, 'NP', [], ['Pronoun']]\n", |
| 1082 | + "Chart: added [6, 6, 'NP', [], ['Name']]\n", |
| 1083 | + "Chart: added [6, 6, 'NP', [], ['Noun']]\n", |
| 1084 | + "Chart: added [6, 6, 'NP', [], ['Article', 'Noun']]\n", |
| 1085 | + "Chart: added [6, 6, 'NP', [], ['Digit', 'Digit']]\n", |
| 1086 | + "Chart: added [6, 6, 'NP', [], ['NP', 'PP']]\n", |
| 1087 | + "Chart: added [6, 6, 'NP', [], ['NP', 'RelClause']]\n", |
| 1088 | + "Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['Adjective']]\n", |
| 1089 | + "Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['PP']]\n", |
| 1090 | + "Chart: added [6, 6, 'PP', [], ['Preposition', 'NP']]\n", |
| 1091 | + "Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['Adverb']]\n", |
| 1092 | + "Chart: added [4, 6, 'NP', [[4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], ['PP']]\n", |
| 1093 | + "Chart: added [4, 6, 'NP', [[4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], ['RelClause']]\n", |
| 1094 | + "Chart: added [6, 6, 'RelClause', [], ['That', 'VP']]\n" |
| 1095 | + ] |
| 1096 | + }, |
| 1097 | + { |
| 1098 | + "data": { |
| 1099 | + "text/plain": [ |
| 1100 | + "[[0,\n", |
| 1101 | + " 6,\n", |
| 1102 | + " 'S',\n", |
| 1103 | + " [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []],\n", |
| 1104 | + " [2,\n", |
| 1105 | + " 6,\n", |
| 1106 | + " 'VP',\n", |
| 1107 | + " [[2, 3, 'VP', [('Verb', 'is')], []],\n", |
| 1108 | + " [3,\n", |
| 1109 | + " 6,\n", |
| 1110 | + " 'PP',\n", |
| 1111 | + " [('Preposition', 'in'),\n", |
| 1112 | + " [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]],\n", |
| 1113 | + " []]],\n", |
| 1114 | + " []]],\n", |
| 1115 | + " []]]" |
| 1116 | + ] |
| 1117 | + }, |
| 1118 | + "execution_count": 4, |
| 1119 | + "metadata": {}, |
| 1120 | + "output_type": "execute_result" |
| 1121 | + } |
| 1122 | + ], |
| 1123 | + "source": [ |
| 1124 | + "chart_trace = Chart(nlp.E0, trace=True)\n", |
| 1125 | + "chart_trace.parses('the stench is in 2 2')" |
| 1126 | + ] |
| 1127 | + }, |
| 1128 | + { |
| 1129 | + "cell_type": "markdown", |
| 1130 | + "metadata": {}, |
| 1131 | + "source": [ |
| 1132 | + "Let's try to parse a sentence that is not recognized by the grammar:" |
| 1133 | + ] |
| 1134 | + }, |
| 1135 | + { |
| 1136 | + "cell_type": "code", |
| 1137 | + "execution_count": 5, |
| 1138 | + "metadata": {}, |
| 1139 | + "outputs": [ |
| 1140 | + { |
| 1141 | + "name": "stdout", |
| 1142 | + "output_type": "stream", |
| 1143 | + "text": [ |
| 1144 | + "[]\n" |
| 1145 | + ] |
| 1146 | + } |
| 1147 | + ], |
| 1148 | + "source": [ |
| 1149 | + "print(chart.parses('the stench 2 2'))" |
| 1150 | + ] |
| 1151 | + }, |
| 1152 | + { |
| 1153 | + "cell_type": "markdown", |
| 1154 | + "metadata": {}, |
| 1155 | + "source": [ |
| 1156 | + "An empty list was returned." |
| 1157 | + ] |
911 | 1158 | }
|
912 | 1159 | ],
|
913 | 1160 | "metadata": {
|
|