
Updated text.py Notebook #352


Merged 2 commits on Mar 18, 2017
213 changes: 205 additions & 8 deletions text.ipynb
@@ -1,24 +1,221 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"source": [
"# Text\n",
"\n",
"This notebook serves as supporting material for topics covered in **Chapter 22 - Natural Language Processing** from the book *Artificial Intelligence: A Modern Approach*. This notebook uses implementations from [text.py](https://github.com/aimacode/aima-python/blob/master/text.py)."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Contents\n",
"\n",
"* Text Models\n",
"* Viterbi Text Segmentation\n",
" * Overview\n",
" * Implementation\n",
" * Example"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Text Models\n",
"\n",
"Before we start performing text processing algorithms, we will need to build some word models. Those models serve as a look-up table for word probabilities. In the text module we have implemented two such models, which inherit from the `CountingProbDist` from `learning.py`. `UnigramTextModel` and `NgramTextModel`. We supply them with a text file and they show the frequency of the different words.\n",
"\n",
"The main difference between the two models is that the first returns the probability of one single word (eg. the probability of the word 'the' appearing), while the second one can show us the probability of a *sequence* of words (eg. the probability of the sequence 'of the' appearing).\n",
"\n",
"Also, both functions can generate random words and sequences respectively, random according to the model.\n",
"\n",
"Below we build the two models. The text file we will use to build them is the *Flatland*, by Edwin A. Abbott. We will load it from [here](https://github.com/aimacode/aima-data/blob/a21fc108f52ad551344e947b0eb97df82f8d2b2b/EN-text/flatland.txt)."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(2081, 'the'), (1479, 'of'), (1021, 'and'), (1008, 'to'), (850, 'a')]\n",
"[(368, ('of', 'the')), (152, ('to', 'the')), (152, ('in', 'the')), (86, ('of', 'a')), (80, ('it', 'is'))]\n"
]
}
],
"source": [
"from text import UnigramTextModel, NgramTextModel, words\n",
"from utils import DataFile\n",
"\n",
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
"wordseq = words(flatland)\n",
"\n",
"P1 = UnigramTextModel(wordseq)\n",
"P2 = NgramTextModel(2, wordseq)\n",
"\n",
"print(P1.top(5))\n",
"print(P2.top(5))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see that the most used word in *Flatland* is 'the', with 2081 occurences, while the most used sequence is 'of the' with 368 occurences."
]
},
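{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As mentioned earlier, both models can also generate random text. The cell below is a minimal sketch of that capability, assuming the `samples(n)` helper that `text.py` defines for both models; its output varies from run to run, so none is shown here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Sketch: draw 10 random words from the unigram model (each word\n",
"# sampled independently) and 10 from the bigram model (each word\n",
"# conditioned on its predecessor), according to the models' counts.\n",
"print(P1.samples(10))\n",
"print(P2.samples(10))"
]
},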
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Viterbi Text Segmentation\n",
"\n",
"### Overview\n",
"\n",
"We are given a string containing words of a sentence, but all the spaces are gone! It is very hard to read and we would like to separate the words in the string. We can accomplish this by employing the `Viterbi Segmentation` algorithm. It takes as input the string to segment and a text model, and it returns a list of the separate words.\n",
"\n",
"The algorithm operates in a dynamic programming approach. It starts from the beginning of the string and iteratively builds the best solution using previous solutions. It accomplishes that by segmentating the string into \"windows\", each window representing a word (real or gibberish). It then calculates the probability of the sequence up that window/word occuring and updates its solution. When it is done, it traces back from the final word and finds the complete sequence of words."
]
},
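{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"In equation form (a sketch of the recurrence the implementation below realizes, using the names `best` and `P` from the code): if $best[i]$ is the probability of the best segmentation of the first $i$ characters, then\n",
"\n",
"$$best[i] = \\max_{0 \\le j < i} P(text[j:i]) \\cdot best[j], \\qquad best[0] = 1.$$"
]
},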
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"import text"
"### Implementation"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
"source": [
"def viterbi_segment(text, P):\n",
" \"\"\"Find the best segmentation of the string of characters, given the\n",
" UnigramTextModel P.\"\"\"\n",
" # best[i] = best probability for text[0:i]\n",
" # words[i] = best word ending at position i\n",
" n = len(text)\n",
" words = [''] + list(text)\n",
" best = [1.0] + [0.0] * n\n",
" # Fill in the vectors best words via dynamic programming\n",
" for i in range(n+1):\n",
" for j in range(0, i):\n",
" w = text[j:i]\n",
" newbest = P[w] * best[i - len(w)]\n",
" if newbest >= best[i]:\n",
" best[i] = newbest\n",
" words[i] = w\n",
" # Now recover the sequence of best words\n",
" sequence = []\n",
" i = len(words) - 1\n",
" while i > 0:\n",
" sequence[0:0] = [words[i]]\n",
" i = i - len(words[i])\n",
" # Return sequence of best words and overall probability\n",
" return sequence, best[-1]"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The function takes as input a string and a text model, and returns the most probable sequence of words, together with the probability of that sequence.\n",
"\n",
"The \"window\" is `w` and it includes the characters from *j* to *i*. We use it to \"build\" the following sequence: from the start to *j* and then `w`. We have previously calculated the probability from the start to *j*, so now we multiply that probability by `P[w]` to get the probability of the whole sequence. If that probability is greater than the probability we have calculated so far for the sequence from the start to *i* (`best[i]`), we update it."
]
},
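{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As a quick sanity check of this update rule, we can run the function on a hand-made toy model. Note that `toy_P` and its probabilities below are made up purely for illustration: a `defaultdict` stands in for the `UnigramTextModel`, returning probability 0 for any unseen substring."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"\n",
"# Hypothetical toy model: only these four \"words\" have nonzero probability.\n",
"toy_P = defaultdict(float, {'a': 0.2, 'big': 0.1, 'cat': 0.1, 'ab': 0.01})\n",
"\n",
"# Expected: (['a', 'big', 'cat'], p) with p = 0.2 * 0.1 * 0.1, about 0.002.\n",
"print(viterbi_segment('abigcat', toy_P))"
]
},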
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Example\n",
"\n",
"The model the algorithm uses is the `UnigramTextModel`. First we will build the model using the *Flatland* text and then we will try and separate a space-devoid sentence."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sequence of words is: ['it', 'is', 'easy', 'to', 'read', 'words', 'without', 'spaces']\n",
"Probability of sequence is: 2.273672843573388e-24\n"
]
}
],
"source": [
"from text import UnigramTextModel, words, viterbi_segment\n",
"from utils import DataFile\n",
"\n",
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
"wordseq = words(flatland)\n",
"P = UnigramTextModel(wordseq)\n",
"text = \"itiseasytoreadwordswithoutspaces\"\n",
"\n",
"s, p = viterbi_segment(text,P)\n",
"print(\"Sequence of words is:\",s)\n",
"print(\"Probability of sequence is:\",p)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The algorithm correctly retrieved the words from the string. It also gave us the probability of this sequence, which is small, but still the most probable segmentation of the string."
]
}
],
"metadata": {
@@ -37,7 +234,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
"version": "3.5.2"
}
},
"nbformat": 4,