Commit 35ef22c

antmarakisnorvig authored and committed
Updated text.py Notebook (#352)
* Update text.ipynb * Update text.ipynb
1 parent d941781 commit 35ef22c

File tree

1 file changed: +205 additions, −8 deletions

text.ipynb

Lines changed: 205 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,24 +1,221 @@
11
{
22
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {
6+
"collapsed": false,
7+
"deletable": true,
8+
"editable": true
9+
},
10+
"source": [
11+
"# Text\n",
12+
"\n",
13+
"This notebook serves as supporting material for topics covered in **Chapter 22 - Natural Language Processing** from the book *Artificial Intelligence: A Modern Approach*. This notebook uses implementations from [text.py](https://github.com/aimacode/aima-python/blob/master/text.py)."
14+
]
15+
},
16+
{
17+
"cell_type": "markdown",
18+
"metadata": {
19+
"deletable": true,
20+
"editable": true
21+
},
22+
"source": [
23+
"## Contents\n",
24+
"\n",
25+
"* Text Models\n",
26+
"* Viterbi Text Segmentation\n",
27+
" * Overview\n",
28+
" * Implementation\n",
29+
" * Example"
30+
]
31+
},
32+
{
33+
"cell_type": "markdown",
34+
"metadata": {
35+
"deletable": true,
36+
"editable": true
37+
},
38+
"source": [
39+
"## Text Models\n",
40+
"\n",
41+
"Before we can run text processing algorithms, we need to build some word models. These models serve as look-up tables for word probabilities. The text module implements two such models, `UnigramTextModel` and `NgramTextModel`, both of which inherit from `CountingProbDist` in `learning.py`. We supply them with a text and they record the frequency of each word.\n",
42+
"\n",
43+
"The main difference between the two models is that the first returns the probability of a single word (e.g. the probability of the word 'the' appearing), while the second returns the probability of a *sequence* of words (e.g. the probability of the sequence 'of the' appearing).\n",
44+
"\n",
45+
"Both models can also generate random words and random sequences respectively, sampled according to the model's probabilities.\n",
46+
"\n",
47+
"Below we build the two models. The text we will use is *Flatland* by Edwin A. Abbott, which we load from [here](https://github.com/aimacode/aima-data/blob/a21fc108f52ad551344e947b0eb97df82f8d2b2b/EN-text/flatland.txt)."
48+
]
49+
},
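Conceptually, the frequency counting these models perform can be sketched with the standard library's `collections.Counter`; the toy corpus below is an assumption made purely for illustration, standing in for the *Flatland* text:

```python
# A minimal sketch of unigram and bigram counting with collections.Counter,
# standing in for UnigramTextModel / NgramTextModel (toy corpus, assumed).
from collections import Counter
import re

corpus = "the voice of the people is the voice of the law"
wordseq = re.findall(r"[a-z']+", corpus.lower())

unigrams = Counter(wordseq)                    # single-word frequencies
bigrams = Counter(zip(wordseq, wordseq[1:]))   # two-word sequence frequencies

print(unigrams.most_common(2))   # most frequent single words
print(bigrams.most_common(2))    # most frequent two-word sequences
```

The real models add probability smoothing and sampling on top of these counts, but the underlying look-up tables are frequency counters of exactly this shape.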
350
{
451
"cell_type": "code",
5-
"execution_count": null,
52+
"execution_count": 4,
653
"metadata": {
7-
"collapsed": false
54+
"collapsed": false,
55+
"deletable": true,
56+
"editable": true
57+
},
58+
"outputs": [
59+
{
60+
"name": "stdout",
61+
"output_type": "stream",
62+
"text": [
63+
"[(2081, 'the'), (1479, 'of'), (1021, 'and'), (1008, 'to'), (850, 'a')]\n",
64+
"[(368, ('of', 'the')), (152, ('to', 'the')), (152, ('in', 'the')), (86, ('of', 'a')), (80, ('it', 'is'))]\n"
65+
]
66+
}
67+
],
68+
"source": [
69+
"from text import UnigramTextModel, NgramTextModel, words\n",
70+
"from utils import DataFile\n",
71+
"\n",
72+
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
73+
"wordseq = words(flatland)\n",
74+
"\n",
75+
"P1 = UnigramTextModel(wordseq)\n",
76+
"P2 = NgramTextModel(2, wordseq)\n",
77+
"\n",
78+
"print(P1.top(5))\n",
79+
"print(P2.top(5))"
80+
]
81+
},
82+
{
83+
"cell_type": "markdown",
84+
"metadata": {
85+
"deletable": true,
86+
"editable": true
87+
},
88+
"source": [
89+
"We see that the most common word in *Flatland* is 'the', with 2081 occurrences, while the most common sequence is 'of the', with 368 occurrences."
90+
]
91+
},
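The random generation mentioned earlier ("random according to the model") amounts to weighted sampling over these frequencies. A sketch using the standard library's `random.choices`, with the toy counts below assumed for illustration:

```python
# Sample words in proportion to their observed frequencies, a sketch of
# what "random according to the model" means (toy counts, assumed).
import random

counts = {'the': 2081, 'of': 1479, 'and': 1021, 'to': 1008, 'a': 850}
vocab, weights = zip(*counts.items())

random.seed(0)  # fixed seed so the sketch is reproducible
sample = random.choices(vocab, weights=weights, k=8)
print(' '.join(sample))
```

More frequent words like 'the' dominate the sample, just as they dominate the model's output.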
92+
{
93+
"cell_type": "markdown",
94+
"metadata": {
95+
"deletable": true,
96+
"editable": true
97+
},
98+
"source": [
99+
"## Viterbi Text Segmentation\n",
100+
"\n",
101+
"### Overview\n",
102+
"\n",
103+
"We are given a string containing the words of a sentence, but with all the spaces removed! The result is very hard to read, so we would like to recover the separate words. We can accomplish this with the Viterbi segmentation algorithm: it takes as input the string to segment and a text model, and returns a list of the separate words.\n",
104+
"\n",
105+
"The algorithm uses dynamic programming. It starts from the beginning of the string and iteratively builds the best solution from previous solutions. It does so by segmenting the string into \"windows\", each window representing a word (real or gibberish). It then calculates the probability of the sequence up to and including that window/word occurring, and updates its solution. When it is done, it traces back from the final word to recover the complete sequence of words."
106+
]
107+
},
108+
{
109+
"cell_type": "markdown",
110+
"metadata": {
111+
"deletable": true,
112+
"editable": true
8113
},
9-
"outputs": [],
10114
"source": [
11-
"import text"
115+
"### Implementation"
12116
]
13117
},
14118
{
15119
"cell_type": "code",
16-
"execution_count": null,
120+
"execution_count": 1,
17121
"metadata": {
18-
"collapsed": true
122+
"collapsed": true,
123+
"deletable": true,
124+
"editable": true
19125
},
20126
"outputs": [],
21-
"source": []
127+
"source": [
128+
"def viterbi_segment(text, P):\n",
129+
" \"\"\"Find the best segmentation of the string of characters, given the\n",
130+
" UnigramTextModel P.\"\"\"\n",
131+
" # best[i] = best probability for text[0:i]\n",
132+
" # words[i] = best word ending at position i\n",
133+
" n = len(text)\n",
134+
" words = [''] + list(text)\n",
135+
" best = [1.0] + [0.0] * n\n",
136+
"    # Fill in the vectors best and words via dynamic programming\n",
137+
" for i in range(n+1):\n",
138+
" for j in range(0, i):\n",
139+
" w = text[j:i]\n",
140+
" newbest = P[w] * best[i - len(w)]\n",
141+
" if newbest >= best[i]:\n",
142+
" best[i] = newbest\n",
143+
" words[i] = w\n",
144+
" # Now recover the sequence of best words\n",
145+
" sequence = []\n",
146+
" i = len(words) - 1\n",
147+
" while i > 0:\n",
148+
" sequence[0:0] = [words[i]]\n",
149+
" i = i - len(words[i])\n",
150+
" # Return sequence of best words and overall probability\n",
151+
" return sequence, best[-1]"
152+
]
153+
},
154+
{
155+
"cell_type": "markdown",
156+
"metadata": {
157+
"deletable": true,
158+
"editable": true
159+
},
160+
"source": [
161+
"The function takes as input a string and a text model, and returns the most probable sequence of words, together with the probability of that sequence.\n",
162+
"\n",
163+
"The \"window\" is `w`, containing the characters from position *j* up to *i*. We use it to \"build\" the following sequence: from the start up to *j*, and then `w`. We have already calculated the probability of the segment from the start up to *j*, so we multiply that probability by `P[w]` to get the probability of the whole sequence. If that probability is greater than the best probability we have calculated so far for the sequence from the start up to *i* (`best[i]`), we update it."
164+
]
165+
},
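The update can be traced on a toy example. The sketch below is a self-contained copy of the same dynamic program, with a hand-made word-probability table (the probabilities and the smoothing value for unseen strings are assumptions for illustration, not the smoothed values a real `UnigramTextModel` would give):

```python
# Standalone sketch of the Viterbi segmentation dynamic program,
# using a toy probability table in place of a UnigramTextModel.
probs = {'it': 0.02, 'is': 0.02, 'easy': 0.01}

def P(w):
    # Probability of word w; unseen strings get a tiny smoothed value.
    return probs.get(w, 1e-10)

def viterbi_segment(text):
    n = len(text)
    best = [1.0] + [0.0] * n          # best[i]: best probability for text[0:i]
    words = [''] + list(text)         # words[i]: best word ending at position i
    for i in range(n + 1):
        for j in range(i):
            w = text[j:i]             # the "window" from j up to i
            newbest = P(w) * best[j]  # extend the best split ending at j with w
            if newbest >= best[i]:
                best[i], words[i] = newbest, w
    # Trace back through words to recover the segmentation
    sequence, i = [], n
    while i > 0:
        sequence.insert(0, words[i])
        i -= len(words[i])
    return sequence, best[-1]

print(viterbi_segment('itiseasy'))  # → (['it', 'is', 'easy'], ~4e-06)
```

For instance, `best[4]` (the prefix 'itis') is maximised by the window `w = 'is'` over positions 2 to 4, giving `P('is') * best[2]`; the traceback then reads the winning windows off `words` from the end of the string backwards.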
166+
{
167+
"cell_type": "markdown",
168+
"metadata": {
169+
"deletable": true,
170+
"editable": true
171+
},
172+
"source": [
173+
"### Example\n",
174+
"\n",
175+
"The model the algorithm uses is the `UnigramTextModel`. First we will build the model from the *Flatland* text, and then we will try to separate a space-devoid sentence."
176+
]
177+
},
178+
{
179+
"cell_type": "code",
180+
"execution_count": 6,
181+
"metadata": {
182+
"collapsed": false,
183+
"deletable": true,
184+
"editable": true
185+
},
186+
"outputs": [
187+
{
188+
"name": "stdout",
189+
"output_type": "stream",
190+
"text": [
191+
"Sequence of words is: ['it', 'is', 'easy', 'to', 'read', 'words', 'without', 'spaces']\n",
192+
"Probability of sequence is: 2.273672843573388e-24\n"
193+
]
194+
}
195+
],
196+
"source": [
197+
"from text import UnigramTextModel, words, viterbi_segment\n",
198+
"from utils import DataFile\n",
199+
"\n",
200+
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
201+
"wordseq = words(flatland)\n",
202+
"P = UnigramTextModel(wordseq)\n",
203+
"text = \"itiseasytoreadwordswithoutspaces\"\n",
204+
"\n",
205+
"s, p = viterbi_segment(text, P)\n",
206+
"print(\"Sequence of words is:\", s)\n",
207+
"print(\"Probability of sequence is:\", p)"
208+
]
209+
},
210+
{
211+
"cell_type": "markdown",
212+
"metadata": {
213+
"deletable": true,
214+
"editable": true
215+
},
216+
"source": [
217+
"The algorithm correctly retrieved the words from the string. It also gave us the probability of the sequence, which is small, but still the highest among all possible segmentations of the string."
218+
]
22219
}
23220
],
24221
"metadata": {
@@ -37,7 +234,7 @@
37234
"name": "python",
38235
"nbconvert_exporter": "python",
39236
"pygments_lexer": "ipython3",
40-
"version": "3.5.1"
237+
"version": "3.5.2"
41238
}
42239
},
43240
"nbformat": 4,

0 commit comments
