|
21 | 21 | "source": [
|
22 | 22 | "import nlp\n",
|
23 | 23 | "from nlp import Page, HITS\n",
|
24 |
| - "from nlp import Lexicon, Rules, Grammar, ProbLexicon, ProbRules, ProbGrammar" |
| 24 | + "from nlp import Lexicon, Rules, Grammar, ProbLexicon, ProbRules, ProbGrammar\n", |
| 25 | + "from nlp import CYK_parse" |
25 | 26 | ]
|
26 | 27 | },
|
27 | 28 | {
|
|
60 | 61 | "A lot of natural and programming languages can be represented by a **Context-Free Grammar (CFG)**. A CFG is a grammar that has a single non-terminal symbol on the left-hand side. That means a non-terminal can be replaced by the right-hand side of the rule regardless of context. An example of a CFG:\n",
|
61 | 62 | "\n",
|
62 | 63 | "```\n",
|
63 |
| - "S -> aSb | e\n", |
| 64 | + "S -> aSb | ε\n", |
64 | 65 | "```\n",
|
65 | 66 | "\n",
|
66 |
| - "That means `S` can be replaced by either `aSb` or `e` (with `e` we denote the empty string). The lexicon of the language is comprised of the terminals `a` and `b`, while with `S` we denote the non-terminal symbol. In general, non-terminals are capitalized while terminals are not, and we usually name the starting non-terminal `S`. The language generated by the above grammar is the language a<sup>n</sup>b<sup>n</sup> for n greater or equal than 1." |
| 67 | + "That means `S` can be replaced by either `aSb` or `ε` (with `ε` we denote the empty string). The lexicon of the language is comprised of the terminals `a` and `b`, while with `S` we denote the non-terminal symbol. In general, non-terminals are capitalized while terminals are not, and we usually name the starting non-terminal `S`. The language generated by the above grammar is the language a<sup>n</sup>b<sup>n</sup> for n greater or equal than 1." |
67 | 68 | ]
|
68 | 69 | },
|
69 | 70 | {
|
|
72 | 73 | "source": [
|
73 | 74 | "### Probabilistic Context-Free Grammar\n",
|
74 | 75 | "\n",
|
75 |
| - "While a simple CFG can be very useful, we might want to know the chance of each rule occuring. Above, we do not know if `S` is more likely to be replaced by `aSb` or `e`. **Probabilistic Context-Free Grammars (PCFG)** are built to fill exactly that need. Each rule has a probability, given in brackets, and the probabilities of a rule sum up to 1:\n", |
| 76 | + "While a simple CFG can be very useful, we might want to know the chance of each rule occuring. Above, we do not know if `S` is more likely to be replaced by `aSb` or `ε`. **Probabilistic Context-Free Grammars (PCFG)** are built to fill exactly that need. Each rule has a probability, given in brackets, and the probabilities of a rule sum up to 1:\n", |
76 | 77 | "\n",
|
77 | 78 | "```\n",
|
78 |
| - "S -> aSb [0.7] | e [0.3]\n", |
| 79 | + "S -> aSb [0.7] | ε [0.3]\n", |
79 | 80 | "```\n",
|
80 | 81 | "\n",
|
81 |
| - "Now we know it is more likely for `S` to be replaced by `aSb` than by `e`." |
| 82 | + "Now we know it is more likely for `S` to be replaced by `aSb` than by `e`.\n", |
| 83 | + "\n", |
| 84 | + "An issue with *PCFGs* is how we will assign the various probabilities to the rules. We could use our knowledge as humans to assign the probabilities, but that is a laborious and prone to error task. Instead, we can *learn* the probabilities from data. Data is categorized as labeled (with correctly parsed sentences, usually called a **treebank**) or unlabeled (given only lexical and syntactic category names).\n", |
| 85 | + "\n", |
| 86 | + "With labeled data, we can simply count the occurences. For the above grammar, if we have 100 `S` rules and 30 of them are of the form `S -> ε`, we assign a probability of 0.3 to the transformation.\n", |
| 87 | + "\n", |
| 88 | + "With unlabeled data we have to learn both the grammar rules and the probability of each rule. We can go with many approaches, one of them the **inside-outside** algorithm. It uses a dynamic programming approach, that first finds the probability of a substring being generated by each rule, and then estimates the probability of each rule." |
82 | 89 | ]
|
83 | 90 | },
|
84 | 91 | {
|
|
755 | 762 | "\n",
|
756 | 763 | "Finally, the different results are weighted by the generality of the queries. The result from the general boolean query [George Washington OR second in command] weighs less that the more specific query [George Washington's second in command was \\*]. As an answer we return the most highly-ranked n-gram."
|
757 | 764 | ]
|
| 765 | + }, |
| 766 | + { |
| 767 | + "cell_type": "markdown", |
| 768 | + "metadata": {}, |
| 769 | + "source": [ |
| 770 | + "## CYK PARSE\n", |
| 771 | + "\n", |
| 772 | + "### Overview\n", |
| 773 | + "\n", |
| 774 | + "Syntactic analysis (or **parsing**) of a sentence is the process of uncovering the phrase structure of the sentence according to the rules of a grammar. There are two main approaches to parsing. *Top-down*, start with the starting symbol and build a parse tree with the given words as its leaves, and *bottom-up*, where we start from the given words and build a tree that has the starting symbol as its root. Both approaches involve \"guessing\" ahead, so it is very possible it will take long to parse a sentence (wrong guess mean a lot of backtracking). Thankfully, a lot of effort is spent in analyzing already analyzed substrings, so we can follow a dynamic programming approach to store and reuse these parses instead of recomputing them. The *CYK Parsing Algorithm* (named after its inventors, Cocke, Younger and Kasami) utilizes this technique to parse sentences of a grammar in *Chomsky Normal Form*.\n", |
| 775 | + "\n", |
| 776 | + "The CYK algorithm returns an *M x N x N* array (named *P*), where *N* is the number of words in the sentence and *M* the number of non-terminal symbols in the grammar. Each element in this array shows the probability of a substring being transformed from a particular non-terminal. To find the most probable parse of the sentence, a search in the resulting array is required. Search heuristic algorithms work well in this space, and we can derive the heuristics from the properties of the grammar.\n", |
| 777 | + "\n", |
| 778 | + "The algorithm in short works like this: There is an external loop that determines the length of the substring. Then the algorithm loops through the words in the sentence. For each word, it again loops through all the words to its right up to the first-loop length. The substring it will work on in this iteration is the words from the second-loop word with first-loop length. Finally, it loops through all the rules in the grammar and updates the substring's probability for each right-hand side non-terminal." |
| 779 | + ] |
| 780 | + }, |
| 781 | + { |
| 782 | + "cell_type": "markdown", |
| 783 | + "metadata": {}, |
| 784 | + "source": [ |
| 785 | + "### Implementation\n", |
| 786 | + "\n", |
| 787 | + "The implementation takes as input a list of words and a probabilistic grammar (from the `ProbGrammar` class detailed above) in CNF and returns the table/dictionary *P*. An item's key in *P* is a tuple in the form `(Non-terminal, start of substring, length of substring)`, and the value is a probability. For example, for the sentence \"the monkey is dancing\" and the substring \"the monkey\" an item can be `('NP', 0, 2): 0.5`, which means the first two words (the substring from index 0 and length 2) have a 0.5 probablity of coming from the `NP` terminal.\n", |
| 788 | + "\n", |
| 789 | + "Before we continue, you can take a look at the source code by running the cell below:" |
| 790 | + ] |
| 791 | + }, |
| 792 | + { |
| 793 | + "cell_type": "code", |
| 794 | + "execution_count": 2, |
| 795 | + "metadata": { |
| 796 | + "collapsed": true |
| 797 | + }, |
| 798 | + "outputs": [], |
| 799 | + "source": [ |
| 800 | + "%psource CYK_parse" |
| 801 | + ] |
| 802 | + }, |
| 803 | + { |
| 804 | + "cell_type": "markdown", |
| 805 | + "metadata": {}, |
| 806 | + "source": [ |
| 807 | + "When updating the probability of a substring, we pick the max of its current one and the probability of the substring broken into two parts: one from the second-loop word with third-loop length, and the other from the first part's end to the remainer of the first-loop length." |
| 808 | + ] |
| 809 | + }, |
| 810 | + { |
| 811 | + "cell_type": "markdown", |
| 812 | + "metadata": {}, |
| 813 | + "source": [ |
| 814 | + "### Example\n", |
| 815 | + "\n", |
| 816 | + "Let's build a probabilistic grammar in CNF:" |
| 817 | + ] |
| 818 | + }, |
| 819 | + { |
| 820 | + "cell_type": "code", |
| 821 | + "execution_count": 3, |
| 822 | + "metadata": { |
| 823 | + "collapsed": true |
| 824 | + }, |
| 825 | + "outputs": [], |
| 826 | + "source": [ |
| 827 | + "E_Prob_Chomsky = ProbGrammar('E_Prob_Chomsky', # A Probabilistic Grammar in CNF\n", |
| 828 | + " ProbRules(\n", |
| 829 | + " S='NP VP [1]',\n", |
| 830 | + " NP='Article Noun [0.6] | Adjective Noun [0.4]',\n", |
| 831 | + " VP='Verb NP [0.5] | Verb Adjective [0.5]',\n", |
| 832 | + " ),\n", |
| 833 | + " ProbLexicon(\n", |
| 834 | + " Article='the [0.5] | a [0.25] | an [0.25]',\n", |
| 835 | + " Noun='robot [0.4] | sheep [0.4] | fence [0.2]',\n", |
| 836 | + " Adjective='good [0.5] | new [0.2] | sad [0.3]',\n", |
| 837 | + " Verb='is [0.5] | say [0.3] | are [0.2]'\n", |
| 838 | + " ))" |
| 839 | + ] |
| 840 | + }, |
| 841 | + { |
| 842 | + "cell_type": "markdown", |
| 843 | + "metadata": {}, |
| 844 | + "source": [ |
| 845 | + "Now let's see the probabilities table for the sentence \"the robot is good\":" |
| 846 | + ] |
| 847 | + }, |
| 848 | + { |
| 849 | + "cell_type": "code", |
| 850 | + "execution_count": 4, |
| 851 | + "metadata": {}, |
| 852 | + "outputs": [ |
| 853 | + { |
| 854 | + "name": "stdout", |
| 855 | + "output_type": "stream", |
| 856 | + "text": [ |
| 857 | + "defaultdict(<class 'float'>, {('Noun', 3, 1): 0.0, ('VP', 0, 3): 0.0, ('Article', 1, 1): 0.0, ('Adjective', 2, 1): 0.0, ('NP', 2, 2): 0.0, ('Adjective', 1, 3): 0.0, ('S', 0, 4): 0.015, ('NP', 1, 3): 0.0, ('VP', 1, 3): 0.0, ('VP', 3, 1): 0.0, ('Verb', 1, 1): 0.0, ('Adjective', 2, 2): 0.0, ('NP', 1, 1): 0.0, ('NP', 2, 1): 0.0, ('NP', 1, 2): 0.0, ('Adjective', 0, 3): 0.0, ('Noun', 2, 1): 0.0, ('Verb', 2, 1): 0.5, ('S', 2, 2): 0.0, ('Adjective', 0, 2): 0.0, ('Noun', 2, 2): 0.0, ('Adjective', 0, 1): 0.0, ('Adjective', 3, 1): 0.5, ('Article', 0, 3): 0.0, ('Article', 0, 1): 0.5, ('VP', 0, 2): 0.0, ('Article', 0, 2): 0.0, ('Noun', 1, 1): 0.4, ('VP', 1, 2): 0.0, ('VP', 0, 4): 0.0, ('Article', 1, 2): 0.0, ('S', 1, 3): 0.0, ('NP', 0, 1): 0.0, ('Verb', 0, 3): 0.0, ('Noun', 1, 3): 0.0, ('VP', 2, 2): 0.125, ('S', 1, 2): 0.0, ('NP', 0, 2): 0.12, ('Verb', 0, 2): 0.0, ('Noun', 1, 2): 0.0, ('VP', 2, 1): 0.0, ('NP', 0, 3): 0.0, ('Verb', 0, 1): 0.0, ('S', 0, 2): 0.0, ('VP', 1, 1): 0.0, ('NP', 0, 4): 0.0, ('Article', 2, 1): 0.0, ('NP', 3, 1): 0.0, ('Adjective', 1, 1): 0.0, ('S', 0, 3): 0.0, ('Adjective', 1, 2): 0.0, ('Verb', 1, 2): 0.0})\n" |
| 858 | + ] |
| 859 | + } |
| 860 | + ], |
| 861 | + "source": [ |
| 862 | + "words = ['the', 'robot', 'is', 'good']\n", |
| 863 | + "grammar = E_Prob_Chomsky\n", |
| 864 | + "\n", |
| 865 | + "P = CYK_parse(words, grammar)\n", |
| 866 | + "print(P)" |
| 867 | + ] |
| 868 | + }, |
| 869 | + { |
| 870 | + "cell_type": "markdown", |
| 871 | + "metadata": {}, |
| 872 | + "source": [ |
| 873 | + "A `defaultdict` object is returned (`defaultdict` is basically a dictionary but with a default value/type). Keys are tuples in the form mentioned above and the values are the corresponding probabilities. Most of the items/parses have a probability of 0. Let's filter those out to take a better look at the parses that matter." |
| 874 | + ] |
| 875 | + }, |
| 876 | + { |
| 877 | + "cell_type": "code", |
| 878 | + "execution_count": 7, |
| 879 | + "metadata": {}, |
| 880 | + "outputs": [ |
| 881 | + { |
| 882 | + "name": "stdout", |
| 883 | + "output_type": "stream", |
| 884 | + "text": [ |
| 885 | + "{('NP', 0, 2): 0.12, ('Adjective', 3, 1): 0.5, ('S', 0, 4): 0.015, ('Verb', 2, 1): 0.5, ('Article', 0, 1): 0.5, ('VP', 2, 2): 0.125, ('Noun', 1, 1): 0.4}\n" |
| 886 | + ] |
| 887 | + } |
| 888 | + ], |
| 889 | + "source": [ |
| 890 | + "parses = {k: p for k, p in P.items() if p >0}\n", |
| 891 | + "\n", |
| 892 | + "print(parses)" |
| 893 | + ] |
| 894 | + }, |
| 895 | + { |
| 896 | + "cell_type": "markdown", |
| 897 | + "metadata": {}, |
| 898 | + "source": [ |
| 899 | + "The item `('Article', 0, 1): 0.5` means that the first item came from the `Article` non-terminal with a chance of 0.5. A more complicated item, one with two words, is `('NP', 0, 2): 0.12` which covers the first two words. The probability of the substring \"the robot\" coming from the `NP` non-terminal is 0.12. Let's try and follow the transformations from `NP` to the given words (top-down) to make sure this is indeed the case:\n", |
| 900 | + "\n", |
| 901 | + "1. The probability of `NP` transforming to `Article Noun` is 0.6.\n", |
| 902 | + "\n", |
| 903 | + "2. The probability of `Article` transforming to \"the\" is 0.5 (total probability = 0.6*0.5 = 0.3).\n", |
| 904 | + "\n", |
| 905 | + "3. The probability of `Noun` transforming to \"robot\" is 0.4 (total = 0.3*0.4 = 0.12).\n", |
| 906 | + "\n", |
| 907 | + "Thus, the total probability of the transformation is 0.12.\n", |
| 908 | + "\n", |
| 909 | + "Notice how the probability for the whole string (given by the key `('S', 0, 4)`) is 0.015. This means the most probable parsing of the sentence has a probability of 0.015." |
| 910 | + ] |
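| + },
| + {
| + "cell_type": "markdown",
| + "metadata": {},
| + "source": [
| + "We can quickly confirm this arithmetic against the table computed above (a small check reusing the `parses` dictionary from the earlier cell; the expected values come from the printed output):\n",
| + "\n",
| + "```python\n",
| + "print(0.6 * 0.5 * 0.4)  # ~0.12, matching parses[('NP', 0, 2)]\n",
| + "print(1 * parses[('NP', 0, 2)] * parses[('VP', 2, 2)])  # ~0.015, matching parses[('S', 0, 4)]\n",
| + "```"
| + ]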
758 | 911 | }
|
759 | 912 | ],
|
760 | 913 | "metadata": {
|
|
0 commit comments