NLP: Chart Parsing #612

Merged · 3 commits · Aug 11, 2017
253 changes: 250 additions & 3 deletions nlp.ipynb
@@ -22,7 +22,7 @@
"import nlp\n",
"from nlp import Page, HITS\n",
"from nlp import Lexicon, Rules, Grammar, ProbLexicon, ProbRules, ProbGrammar\n",
"from nlp import CYK_parse"
"from nlp import CYK_parse, Chart"
]
},
{
@@ -36,7 +36,9 @@
"* Overview\n",
"* Languages\n",
"* HITS\n",
"* Question Answering"
"* Question Answering\n",
"* CYK Parse\n",
"* Chart Parsing"
]
},
{
@@ -45,7 +47,11 @@
"source": [
"## OVERVIEW\n",
"\n",
"`TODO...`"
"**Natural Language Processing (NLP)** is a field of AI concerned with understanding, analyzing and using natural languages. This field is considered a difficult yet intriguing field of study, since it is connected to how humans and their languages work.\n",
"\n",
"Applications of the field include translation, speech recognition, topic segmentation, information extraction and retrieval, and a lot more.\n",
"\n",
"Below we take a look at some algorithms in the field. Before we get right into it though, we will take a look at a very useful form of language, **context-free** languages. Even though they are a bit restrictive, they have been used a lot in research in natural language processing."
]
},
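Since everything that follows relies on context-free grammars, here is a brief sketch of how a toy grammar could be assembled with the `Rules`, `Lexicon` and `Grammar` helpers imported above. The rule strings below are purely illustrative and assume the helpers accept the same pipe-separated format used for the module's built-in grammars:

```python
from nlp import Rules, Lexicon, Grammar

# A toy context-free grammar (illustrative; not one of the module's grammars).
toy_rules = Rules(
    S='NP VP',                 # a sentence is a noun phrase plus a verb phrase
    NP='Article Noun',
    VP='Verb')
toy_lexicon = Lexicon(
    Article='the | a',
    Noun='stench | wumpus',
    Verb='is | smells')
toy_grammar = Grammar('Toy', toy_rules, toy_lexicon)
```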
{
@@ -908,6 +914,247 @@
"\n",
"Notice how the probability for the whole string (given by the key `('S', 0, 4)`) is 0.015. This means the most probable parsing of the sentence has a probability of 0.015."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CHART PARSING\n",
"\n",
"### Overview\n",
"\n",
"Let's now take a look at a more general chart parsing algorithm. Given a non-probabilistic grammar and a sentence, this algorithm builds a parse tree in a top-down manner, with the words of the sentence as the leaves. It works with a dynamic programming approach, building a chart to store parses for substrings so that it doesn't have to analyze them again (just like the CYK algorithm). Each non-terminal, starting from S, gets replaced by its right-hand side rules in the chart, until we end up with the correct parses.\n",
"\n",
"### Implementation\n",
"\n",
"A parse is in the form `[start, end, non-terminal, sub-tree, expected-transformation]`, where `sub-tree` is a tree with the corresponding `non-terminal` as its root and `expected-transformation` is a right-hand side rule of the `non-terminal`.\n",
"\n",
"The chart parsing is implemented in a class, `Chart`. It is initialized with a grammar and can return the list of all the parses of a sentence with the `parses` function.\n",
"\n",
"The chart is a list of lists. The lists correspond to the lengths of substrings (including the empty string), from start to finish. When we say 'a point in the chart', we refer to a list of a certain length.\n",
"\n",
"A quick rundown of the class functions:"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"* `parses`: Returns a list of parses for a given sentence. If the sentence can't be parsed, it will return an empty list. Initializes the process by calling `parse` from the starting symbol.\n",
"\n",
"\n",
"* `parse`: Parses the list of words and builds the chart.\n",
"\n",
"\n",
"* `add_edge`: Adds another edge to the chart at a given point. Also, examines whether the edge extends or predicts another edge. If the edge itself is not expecting a transformation, it will extend other edges and it will predict edges otherwise.\n",
"\n",
"\n",
"* `scanner`: Given a word and a point in the chart, it extends edges that were expecting a transformation that can result in the given word. For example, if the word 'the' is an 'Article' and we are examining two edges at a chart's point, with one expecting an 'Article' and the other a 'Verb', the first one will be extended while the second one will not.\n",
"\n",
"\n",
"* `predictor`: If an edge can't extend other edges (because it is expecting a transformation itself), we will add to the chart rules/transformations that can help extend the edge. The new edges come from the right-hand side of the expected transformation's rules. For example, if an edge is expecting the transformation 'Adjective Noun', we will add to the chart an edge for each right-hand side rule of the non-terminal 'Adjective'.\n",
"\n",
"\n",
"* `extender`: Extends edges given an edge (called `E`). If `E`'s non-terminal is the same as the expected transformation of another edge (let's call it `A`), add to the chart a new edge with the non-terminal of `A` and the transformations of `A` minus the non-terminal that matched with `E`'s non-terminal. For example, if an edge `E` has 'Article' as its non-terminal and is expecting no transformation, we need to see what edges it can extend. Let's examine the edge `N`. This expects a transformation of 'Noun Verb'. 'Noun' does not match with 'Article', so we move on. Another edge, `A`, expects a transformation of 'Article Noun' and has a non-terminal of 'NP'. We have a match! A new edge will be added with 'NP' as its non-terminal (the non-terminal of `A`) and 'Noun' as the expected transformation (the rest of the expected transformation of `A`)."
]
},
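To make the interplay between `add_edge`, `scanner`, `predictor` and `extender` more concrete, here is a minimal, self-contained sketch of the same idea. It is not the `nlp.py` implementation: the toy grammar (`TOY_RULES`, `TOY_LEXICON`) and all helper names are made up for illustration, but each edge follows the `[start, end, non-terminal, sub-tree, expected-transformation]` format described above.

```python
# Minimal chart-parsing sketch (illustrative only, not the nlp.py code).
TOY_RULES = {                       # non-terminal -> list of right-hand sides
    'S':  [['NP', 'VP']],
    'NP': [['Article', 'Noun']],
    'VP': [['Verb']],
}
TOY_LEXICON = {                     # word -> categories it can belong to
    'the': ['Article'], 'stench': ['Noun'], 'is': ['Verb'],
}

def toy_parses(words, S='S'):
    chart = [[] for _ in range(len(words) + 1)]   # one edge list per position

    def add_edge(edge):
        start, end, lhs, found, expects = edge
        if edge in chart[end]:
            return
        chart[end].append(edge)
        if not expects:             # complete edge: try to extend waiting edges
            for waiting in list(chart[start]):
                extender(waiting, edge)
        else:                       # incomplete edge: predict ways to grow it
            predictor(edge)

    def predictor(edge):
        _, end, _, _, expects = edge
        B = expects[0]
        for rhs in TOY_RULES.get(B, []):          # one new edge per rule of B
            add_edge([end, end, B, [], list(rhs)])

    def scanner(j, word):
        # extend edges ending at j that expect a category this word can have
        for start, end, lhs, found, expects in list(chart[j]):
            if expects and expects[0] in TOY_LEXICON.get(word, []):
                add_edge([start, j + 1, lhs,
                          found + [(expects[0], word)], expects[1:]])

    def extender(waiting, complete):
        start, end, lhs, found, expects = waiting
        cstart, cend, clhs = complete[0], complete[1], complete[2]
        if expects and expects[0] == clhs and end == cstart:
            add_edge([start, cend, lhs, found + [complete], expects[1:]])

    add_edge([0, 0, 'S_', [], [S]])               # dummy start edge
    for j, word in enumerate(words):
        scanner(j, word)
    return [e for e in chart[len(words)]
            if e[0] == 0 and e[2] == S and not e[4]]

print(toy_parses('the stench is'.split()))
```

Running this prints a single complete `S` edge for `'the stench is'`, mirroring the structure of the parses returned by `Chart.parses` below.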
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example\n",
"\n",
"We will use the grammar `E0` to parse the sentence \"the stench is in 2 2\".\n",
"\n",
"First we need to build a `Chart` object:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"chart = Chart(nlp.E0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then we simply call the `parses` function:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]]\n"
]
}
],
"source": [
"print(chart.parses('the stench is in 2 2'))"
]
},
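The nested edge lists returned by `parses` can be hard to read at a glance. The helper below is a hypothetical addition (it is not part of `nlp.py`); it walks the `[start, end, non-terminal, sub-tree, expected-transformation]` structure and prints it as an indented tree, assuming the leaves of the sub-tree are `(category, word)` tuples as in the output above.

```python
def print_parse(edge, depth=0):
    """Pretty-print an edge [start, end, lhs, subtree, expects] as a tree."""
    start, end, lhs, subtree, _ = edge
    print('  ' * depth + '{} [{}, {}]'.format(lhs, start, end))
    for child in subtree:
        if isinstance(child, tuple):       # lexical leaf: (category, word)
            category, word = child
            print('  ' * (depth + 1) + '{}: {}'.format(category, word))
        else:                              # nested edge
            print_parse(child, depth + 1)

for parse in chart.parses('the stench is in 2 2'):
    print_parse(parse)
```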
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see which edges get added by setting the optional initialization argument `trace` to true."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Chart: added [0, 0, 'S_', [], ['S']]\n",
"Chart: added [0, 0, 'S', [], ['NP', 'VP']]\n",
"Chart: added [0, 0, 'NP', [], ['Pronoun']]\n",
"Chart: added [0, 0, 'NP', [], ['Name']]\n",
"Chart: added [0, 0, 'NP', [], ['Noun']]\n",
"Chart: added [0, 0, 'NP', [], ['Article', 'Noun']]\n",
"Chart: added [0, 0, 'NP', [], ['Digit', 'Digit']]\n",
"Chart: added [0, 0, 'NP', [], ['NP', 'PP']]\n",
"Chart: added [0, 0, 'NP', [], ['NP', 'RelClause']]\n",
"Chart: added [0, 0, 'S', [], ['S', 'Conjunction', 'S']]\n",
"Chart: added [0, 1, 'NP', [('Article', 'the')], ['Noun']]\n",
"Chart: added [0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]\n",
"Chart: added [0, 2, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['VP']]\n",
"Chart: added [2, 2, 'VP', [], ['Verb']]\n",
"Chart: added [2, 2, 'VP', [], ['VP', 'NP']]\n",
"Chart: added [2, 2, 'VP', [], ['VP', 'Adjective']]\n",
"Chart: added [2, 2, 'VP', [], ['VP', 'PP']]\n",
"Chart: added [2, 2, 'VP', [], ['VP', 'Adverb']]\n",
"Chart: added [0, 2, 'NP', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['PP']]\n",
"Chart: added [2, 2, 'PP', [], ['Preposition', 'NP']]\n",
"Chart: added [0, 2, 'NP', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['RelClause']]\n",
"Chart: added [2, 2, 'RelClause', [], ['That', 'VP']]\n",
"Chart: added [2, 3, 'VP', [('Verb', 'is')], []]\n",
"Chart: added [0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]\n",
"Chart: added [0, 3, 'S_', [[0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]], []]\n",
"Chart: added [0, 3, 'S', [[0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]], ['Conjunction', 'S']]\n",
"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['NP']]\n",
"Chart: added [3, 3, 'NP', [], ['Pronoun']]\n",
"Chart: added [3, 3, 'NP', [], ['Name']]\n",
"Chart: added [3, 3, 'NP', [], ['Noun']]\n",
"Chart: added [3, 3, 'NP', [], ['Article', 'Noun']]\n",
"Chart: added [3, 3, 'NP', [], ['Digit', 'Digit']]\n",
"Chart: added [3, 3, 'NP', [], ['NP', 'PP']]\n",
"Chart: added [3, 3, 'NP', [], ['NP', 'RelClause']]\n",
"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['Adjective']]\n",
"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['PP']]\n",
"Chart: added [3, 3, 'PP', [], ['Preposition', 'NP']]\n",
"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['Adverb']]\n",
"Chart: added [3, 4, 'PP', [('Preposition', 'in')], ['NP']]\n",
"Chart: added [4, 4, 'NP', [], ['Pronoun']]\n",
"Chart: added [4, 4, 'NP', [], ['Name']]\n",
"Chart: added [4, 4, 'NP', [], ['Noun']]\n",
"Chart: added [4, 4, 'NP', [], ['Article', 'Noun']]\n",
"Chart: added [4, 4, 'NP', [], ['Digit', 'Digit']]\n",
"Chart: added [4, 4, 'NP', [], ['NP', 'PP']]\n",
"Chart: added [4, 4, 'NP', [], ['NP', 'RelClause']]\n",
"Chart: added [4, 5, 'NP', [('Digit', '2')], ['Digit']]\n",
"Chart: added [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]\n",
"Chart: added [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]\n",
"Chart: added [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]\n",
"Chart: added [0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]\n",
"Chart: added [0, 6, 'S_', [[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]], []]\n",
"Chart: added [0, 6, 'S', [[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]], ['Conjunction', 'S']]\n",
"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['NP']]\n",
"Chart: added [6, 6, 'NP', [], ['Pronoun']]\n",
"Chart: added [6, 6, 'NP', [], ['Name']]\n",
"Chart: added [6, 6, 'NP', [], ['Noun']]\n",
"Chart: added [6, 6, 'NP', [], ['Article', 'Noun']]\n",
"Chart: added [6, 6, 'NP', [], ['Digit', 'Digit']]\n",
"Chart: added [6, 6, 'NP', [], ['NP', 'PP']]\n",
"Chart: added [6, 6, 'NP', [], ['NP', 'RelClause']]\n",
"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['Adjective']]\n",
"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['PP']]\n",
"Chart: added [6, 6, 'PP', [], ['Preposition', 'NP']]\n",
"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['Adverb']]\n",
"Chart: added [4, 6, 'NP', [[4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], ['PP']]\n",
"Chart: added [4, 6, 'NP', [[4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], ['RelClause']]\n",
"Chart: added [6, 6, 'RelClause', [], ['That', 'VP']]\n"
]
},
{
"data": {
"text/plain": [
"[[0,\n",
" 6,\n",
" 'S',\n",
" [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []],\n",
" [2,\n",
" 6,\n",
" 'VP',\n",
" [[2, 3, 'VP', [('Verb', 'is')], []],\n",
" [3,\n",
" 6,\n",
" 'PP',\n",
" [('Preposition', 'in'),\n",
" [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]],\n",
" []]],\n",
" []]],\n",
" []]]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chart_trace = Chart(nlp.E0, trace=True)\n",
"chart_trace.parses('the stench is in 2 2')"
]
},
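As a side note, if you want to see how the edges are spread over the chart after tracing, and assuming the `Chart` object keeps its per-position edge lists in a `chart` attribute (an implementation detail inferred from the description above, not a documented API), you could count how many edges end at each position:

```python
# Hypothetical inspection; assumes `chart_trace.chart` holds one list of
# edges per sentence position, as described in the Implementation section.
for end, edges in enumerate(chart_trace.chart):
    print('position {}: {} edges'.format(end, len(edges)))
```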
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try and parse a sentence that is not recognized by the grammar:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[]\n"
]
}
],
"source": [
"print(chart.parses('the stench 2 2'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An empty list was returned."
]
}
],
"metadata": {
5 changes: 1 addition & 4 deletions nlp.py
@@ -1,8 +1,5 @@
"""Natural Language Processing; Chart Parsing and PageRanking (Chapter 22-23)"""

# (Written for the second edition of AIMA; expect some discrepanciecs
# from the third edition until this gets reviewed.)

from collections import defaultdict
from utils import weighted_choice
import urllib.request
@@ -274,7 +271,7 @@ def __repr__(self):

class Chart:

"""Class for parsing sentences using a chart data structure. [Figure 22.7]
"""Class for parsing sentences using a chart data structure.
>>> chart = Chart(E0);
>>> len(chart.parses('the stench is in 2 2'))
1
8 changes: 7 additions & 1 deletion tests/test_nlp.py
@@ -5,7 +5,7 @@
from nlp import expand_pages, relevant_pages, normalize, ConvergenceDetector, getInlinks
from nlp import getOutlinks, Page, determineInlinks, HITS
from nlp import Rules, Lexicon, Grammar, ProbRules, ProbLexicon, ProbGrammar
from nlp import CYK_parse
from nlp import Chart, CYK_parse
# Clumsy imports because we want to access certain nlp.py globals explicitly, because
# they are accessed by functions within nlp.py

@@ -101,6 +101,12 @@ def test_prob_generation():
assert len(sentence) == 2


def test_chart_parsing():
chart = Chart(nlp.E0)
parses = chart.parses('the stench is in 2 2')
assert len(parses) == 1


def test_CYK_parse():
grammar = nlp.E_Prob_Chomsky
words = ['the', 'robot', 'is', 'good']