Added PermutationDecoder to notebook (aimacode#507)

Chipe1 · norvig · commit ff8fc03843a1 · 2017-05-23T22:16:16.000-07:00
diff --git a/text.ipynb b/text.ipynb
@@ -364,6 +364,64 @@
     "decoded_message = decoder.decode(ciphertext)\n",
     "print('The decoded message is', '\"' + decoded_message + '\"')"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Permutation Decoder\n",
+    "Now let us try to decode messages encrypted by a general monoalphabetic substitution cipher. The letters in the alphabet can be replaced by any permutation of letters. For example if the alpahbet consisted of `{A B C}` then it can be replaced by `{A C B}`, `{B A C}`, `{B C A}`, `{C A B}`, `{C B A}` or even `{A B C}` itself. Suppose we choose the permutation `{C B A}`, then the plain text `\"CAB BA AAC\"` would become `\"ACB BC CCA\"`. We can see that Caesar cipher is also a form of permutation cipher where the permutation is a cyclic permutation. Unlike the Caesar cipher, it is infeasible to try all possible permutations. The number of possible permutations in Latin alphabet is `26!` which is of the order $10^{26}$. We use graph search algorithms to search for a 'good' permutation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "%psource PermutationDecoder"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Each state/node in the graph is represented as a letter-to-letter map. If there no mapping for a letter it means the letter is unchanged in the permutation. These maps are stored as dictionaries. Each dictionary is a 'potential' permutation.  We use the word 'potential' because every dictionary doesn't necessarily represent a valid permutation since a permutation cannot have repeating elements. For example the dictionary `{'A': 'B', 'C': 'X'}` is invalid because `'A'` is replaced by `'B'`, but so is `'B'` because the dictionary doesn't have a mapping for `'B'`. Two dictionaries can also represent the same permutation e.g. `{'A': 'C', 'C': 'A'}` and `{'A': 'C', 'B': 'B', 'C': 'A'}` represent the same permutation where `'A'` and `'C'` are interchanged and all other letters remain unaltered. To ensure we get a valid permutation a goal state must map all letters in the alphabet. We also prevent repetions in the permutation by allowing only those actions which go to new state/node in which the newly added letter to the dictionary maps to previously unmapped letter. These two rules togeter ensure that the dictionary of a goal state will represent a valid permutation.\n",
+    "The score of a state is determined using word scores, unigram scores, and bigram scores. Experiment with different weightages for word, unigram and bigram scores and see how they affect the decoding."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\"ahed world\" decodes to \"shed could\"\n",
+      "\"ahed woxld\" decodes to \"shew atiow\"\n"
+     ]
+    }
+   ],
+   "source": [
+    "ciphertexts = ['ahed world', 'ahed woxld']\n",
+    "\n",
+    "pd = PermutationDecoder(canonicalize(flatland))\n",
+    "for ctext in ciphertexts:\n",
+    "    print('\"{}\" decodes to \"{}\"'.format(ctext, pd.decode(ctext)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As evident from the above example, permutation decoding using best first search is sensitive to initial text. This is because not only the final dictionary, with substitutions for all letters, must have good score but so must the intermediate dictionaries. You could think of it as performing a local search by finding substitutons for each letter one by one. We could get very different results by changing even a single letter because that letter could be a deciding factor for selecting substitution in early stages which snowballs and affects the later stages. To make the search better we can use different definition of score in different stages and optimize on which letter to substitute first."
+   ]
   }
  ],
  "metadata": {