Commit 35ef22c

antmarakisnorvig authored and committed
Updated text.py Notebook (#352)
* Update text.ipynb * Update text.ipynb
1 parent d941781 commit 35ef22c

File tree

1 file changed: +205 additions, −8 deletions

text.ipynb

Lines changed: 205 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,24 +1,221 @@
11
{
22
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {
6+
"collapsed": false,
7+
"deletable": true,
8+
"editable": true
9+
},
10+
"source": [
11+
"# Text\n",
12+
"\n",
13+
"This notebook serves as supporting material for topics covered in **Chapter 22 - Natural Language Processing** from the book *Artificial Intelligence: A Modern Approach*. This notebook uses implementations from [text.py](https://github.com/aimacode/aima-python/blob/master/text.py)."
14+
]
15+
},
16+
{
17+
"cell_type": "markdown",
18+
"metadata": {
19+
"deletable": true,
20+
"editable": true
21+
},
22+
"source": [
23+
"## Contents\n",
24+
"\n",
25+
"* Text Models\n",
26+
"* Viterbi Text Segmentation\n",
27+
" * Overview\n",
28+
" * Implementation\n",
29+
" * Example"
30+
]
31+
},
32+
{
33+
"cell_type": "markdown",
34+
"metadata": {
35+
"deletable": true,
36+
"editable": true
37+
},
38+
"source": [
39+
"## Text Models\n",
40+
"\n",
41+
"Before we can run text processing algorithms, we need to build some word models. These models serve as look-up tables for word probabilities. The text module implements two such models, `UnigramTextModel` and `NgramTextModel`, both of which inherit from `CountingProbDist` in `learning.py`. We supply them with a text and they record the frequency of each word.\n",
42+
"\n",
43+
"The main difference between the two models is that the first returns the probability of a single word (e.g. the probability of the word 'the' appearing), while the second returns the probability of a *sequence* of words (e.g. the probability of the sequence 'of the' appearing).\n",
44+
"\n",
45+
"Both models can also generate random words and random sequences respectively, sampled according to the model's probabilities.\n",
46+
"\n",
47+
"Below we build the two models. The text we will use is *Flatland* by Edwin A. Abbott, which we load from [here](https://github.com/aimacode/aima-data/blob/a21fc108f52ad551344e947b0eb97df82f8d2b2b/EN-text/flatland.txt)."
48+
]
49+
},
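Conceptually, the frequency counting these models perform can be sketched with the standard library's `collections.Counter`; the toy corpus below is an assumption made purely for illustration, standing in for the *Flatland* text:

```python
# A minimal sketch of unigram and bigram counting with collections.Counter,
# standing in for UnigramTextModel / NgramTextModel (toy corpus, assumed).
from collections import Counter
import re

corpus = "the voice of the people is the voice of the law"
wordseq = re.findall(r"[a-z']+", corpus.lower())

unigrams = Counter(wordseq)                    # single-word frequencies
bigrams = Counter(zip(wordseq, wordseq[1:]))   # two-word sequence frequencies

print(unigrams.most_common(2))   # most frequent single words
print(bigrams.most_common(2))    # most frequent two-word sequences
```

The real models add probability smoothing and sampling on top of these counts, but the underlying look-up tables are frequency counters of exactly this shape.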
350
{
451
"cell_type": "code",
5-
"execution_count": null,
52+
"execution_count": 4,
653
"metadata": {
7-
"collapsed": false
54+
"collapsed": false,
55+
"deletable": true,
56+
"editable": true
57+
},
58+
"outputs": [
59+
{
60+
"name": "stdout",
61+
"output_type": "stream",
62+
"text": [
63+
"[(2081, 'the'), (1479, 'of'), (1021, 'and'), (1008, 'to'), (850, 'a')]\n",
64+
"[(368, ('of', 'the')), (152, ('to', 'the')), (152, ('in', 'the')), (86, ('of', 'a')), (80, ('it', 'is'))]\n"
65+
]
66+
}
67+
],
68+
"source": [
69+
"from text import UnigramTextModel, NgramTextModel, words\n",
70+
"from utils import DataFile\n",
71+
"\n",
72+
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
73+
"wordseq = words(flatland)\n",
74+
"\n",
75+
"P1 = UnigramTextModel(wordseq)\n",
76+
"P2 = NgramTextModel(2, wordseq)\n",
77+
"\n",
78+
"print(P1.top(5))\n",
79+
"print(P2.top(5))"
80+
]
81+
},
82+
{
83+
"cell_type": "markdown",
84+
"metadata": {
85+
"deletable": true,
86+
"editable": true
87+
},
88+
"source": [
89+
"We see that the most common word in *Flatland* is 'the', with 2081 occurrences, while the most common sequence is 'of the', with 368 occurrences."
90+
]
91+
},
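The random generation mentioned earlier ("random according to the model") amounts to weighted sampling over these frequencies. A sketch using the standard library's `random.choices`, with the toy counts below assumed for illustration:

```python
# Sample words in proportion to their observed frequencies, a sketch of
# what "random according to the model" means (toy counts, assumed).
import random

counts = {'the': 2081, 'of': 1479, 'and': 1021, 'to': 1008, 'a': 850}
vocab, weights = zip(*counts.items())

random.seed(0)  # fixed seed so the sketch is reproducible
sample = random.choices(vocab, weights=weights, k=8)
print(' '.join(sample))
```

More frequent words like 'the' dominate the sample, just as they dominate the model's output.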
92+
{
93+
"cell_type": "markdown",
94+
"metadata": {
95+
"deletable": true,
96+
"editable": true
97+
},
98+
"source": [
99+
"## Viterbi Text Segmentation\n",
100+
"\n",
101+
"### Overview\n",
102+
"\n",
103+
"We are given a string containing the words of a sentence, but with all the spaces removed! The result is very hard to read, so we would like to recover the separate words. We can accomplish this with the Viterbi segmentation algorithm: it takes as input the string to segment and a text model, and returns a list of the separate words.\n",
104+
"\n",
105+
"The algorithm uses dynamic programming. It starts from the beginning of the string and iteratively builds the best solution from previous solutions. It does so by segmenting the string into \"windows\", each window representing a word (real or gibberish). It then calculates the probability of the sequence up to and including that window/word occurring, and updates its solution. When it is done, it traces back from the final word to recover the complete sequence of words."
106+
]
107+
},
108+
{
109+
"cell_type": "markdown",
110+
"metadata": {
111+
"deletable": true,
112+
"editable": true
8113
},
9-
"outputs": [],
10114
"source": [
11-
"import text"
115+
"### Implementation"
12116
]
13117
},
14118
{
15119
"cell_type": "code",
16-
"execution_count": null,
120+
"execution_count": 1,
17121
"metadata": {
18-
"collapsed": true
122+
"collapsed": true,
123+
"deletable": true,
124+
"editable": true
19125
},
20126
"outputs": [],
21-
"source": []
127+
"source": [
128+
"def viterbi_segment(text, P):\n",
129+
" \"\"\"Find the best segmentation of the string of characters, given the\n",
130+
" UnigramTextModel P.\"\"\"\n",
131+
" # best[i] = best probability for text[0:i]\n",
132+
" # words[i] = best word ending at position i\n",
133+
" n = len(text)\n",
134+
" words = [''] + list(text)\n",
135+
" best = [1.0] + [0.0] * n\n",
136+
"    # Fill in the vectors best and words via dynamic programming\n",
137+
" for i in range(n+1):\n",
138+
" for j in range(0, i):\n",
139+
" w = text[j:i]\n",
140+
" newbest = P[w] * best[i - len(w)]\n",
141+
" if newbest >= best[i]:\n",
142+
" best[i] = newbest\n",
143+
" words[i] = w\n",
144+
" # Now recover the sequence of best words\n",
145+
" sequence = []\n",
146+
" i = len(words) - 1\n",
147+
" while i > 0:\n",
148+
" sequence[0:0] = [words[i]]\n",
149+
" i = i - len(words[i])\n",
150+
" # Return sequence of best words and overall probability\n",
151+
" return sequence, best[-1]"
152+
]
153+
},
154+
{
155+
"cell_type": "markdown",
156+
"metadata": {
157+
"deletable": true,
158+
"editable": true
159+
},
160+
"source": [
161+
"The function takes as input a string and a text model, and returns the most probable sequence of words, together with the probability of that sequence.\n",
162+
"\n",
163+
"The \"window\" is `w`, containing the characters from position *j* up to *i*. We use it to \"build\" the following sequence: from the start up to *j*, and then `w`. We have already calculated the probability of the segment from the start up to *j*, so we multiply that probability by `P[w]` to get the probability of the whole sequence. If that probability is greater than the best probability we have calculated so far for the sequence from the start up to *i* (`best[i]`), we update it."
164+
]
165+
},
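The update can be traced on a toy example. The sketch below is a self-contained copy of the same dynamic program, with a hand-made word-probability table (the probabilities and the smoothing value for unseen strings are assumptions for illustration, not the smoothed values a real `UnigramTextModel` would give):

```python
# Standalone sketch of the Viterbi segmentation dynamic program,
# using a toy probability table in place of a UnigramTextModel.
probs = {'it': 0.02, 'is': 0.02, 'easy': 0.01}

def P(w):
    # Probability of word w; unseen strings get a tiny smoothed value.
    return probs.get(w, 1e-10)

def viterbi_segment(text):
    n = len(text)
    best = [1.0] + [0.0] * n          # best[i]: best probability for text[0:i]
    words = [''] + list(text)         # words[i]: best word ending at position i
    for i in range(n + 1):
        for j in range(i):
            w = text[j:i]             # the "window" from j up to i
            newbest = P(w) * best[j]  # extend the best split ending at j with w
            if newbest >= best[i]:
                best[i], words[i] = newbest, w
    # Trace back through words to recover the segmentation
    sequence, i = [], n
    while i > 0:
        sequence.insert(0, words[i])
        i -= len(words[i])
    return sequence, best[-1]

print(viterbi_segment('itiseasy'))  # → (['it', 'is', 'easy'], ~4e-06)
```

For instance, `best[4]` (the prefix 'itis') is maximised by the window `w = 'is'` over positions 2 to 4, giving `P('is') * best[2]`; the traceback then reads the winning windows off `words` from the end of the string backwards.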
166+
{
167+
"cell_type": "markdown",
168+
"metadata": {
169+
"deletable": true,
170+
"editable": true
171+
},
172+
"source": [
173+
"### Example\n",
174+
"\n",
175+
"The model the algorithm uses is the `UnigramTextModel`. First we will build the model from the *Flatland* text, and then we will try to separate a space-devoid sentence."
176+
]
177+
},
178+
{
179+
"cell_type": "code",
180+
"execution_count": 6,
181+
"metadata": {
182+
"collapsed": false,
183+
"deletable": true,
184+
"editable": true
185+
},
186+
"outputs": [
187+
{
188+
"name": "stdout",
189+
"output_type": "stream",
190+
"text": [
191+
"Sequence of words is: ['it', 'is', 'easy', 'to', 'read', 'words', 'without', 'spaces']\n",
192+
"Probability of sequence is: 2.273672843573388e-24\n"
193+
]
194+
}
195+
],
196+
"source": [
197+
"from text import UnigramTextModel, words, viterbi_segment\n",
198+
"from utils import DataFile\n",
199+
"\n",
200+
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
201+
"wordseq = words(flatland)\n",
202+
"P = UnigramTextModel(wordseq)\n",
203+
"text = \"itiseasytoreadwordswithoutspaces\"\n",
204+
"\n",
205+
"s, p = viterbi_segment(text, P)\n",
206+
"print(\"Sequence of words is:\", s)\n",
207+
"print(\"Probability of sequence is:\", p)"
208+
]
209+
},
210+
{
211+
"cell_type": "markdown",
212+
"metadata": {
213+
"deletable": true,
214+
"editable": true
215+
},
216+
"source": [
217+
"The algorithm correctly retrieved the words from the string. It also gave us the probability of the sequence, which is small, but still the highest among all possible segmentations of the string."
218+
]
22219
}
23220
],
24221
"metadata": {
@@ -37,7 +234,7 @@
37234
"name": "python",
38235
"nbconvert_exporter": "python",
39236
"pygments_lexer": "ipython3",
40-
"version": "3.5.1"
237+
"version": "3.5.2"
41238
}
42239
},
43240
"nbformat": 4,

0 commit comments
