diff --git a/nlp_apps.ipynb b/nlp_apps.ipynb
index 94a91bb36..2c9a1ddda 100644
--- a/nlp_apps.ipynb
+++ b/nlp_apps.ipynb
@@ -16,7 +16,8 @@
 "## CONTENTS\n",
 "\n",
 "* Language Recognition\n",
- "* Author Recognition"
+ "* Author Recognition\n",
+ "* The Federalist Papers"
 ]
 },
 {
@@ -371,6 +372,410 @@
 "\n",
 "You can try more sentences on your own. Unfortunately though, since the datasets are pretty small, chances are the guesses will not always be correct."
 ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## THE FEDERALIST PAPERS\n",
+ "\n",
+ "Let's now take a look at a harder problem: classifying the authors of the [Federalist Papers](https://en.wikipedia.org/wiki/The_Federalist_Papers). The *Federalist Papers* are a series of papers written by Alexander Hamilton, James Madison and John Jay in support of the ratification of the United States Constitution.\n",
+ "\n",
+ "What is interesting about these papers is that they were all written under a pseudonym, \"Publius\", to keep the identity of the authors a secret. Only after Hamilton's death, when a list written by him detailing the authorship of the papers was found, did the rest of the world learn which papers each of the authors wrote. After the list was published, Madison chimed in to make a couple of corrections: Hamilton, Madison said, had hastily written down the list and assigned some papers to the wrong author!\n",
+ "\n",
+ "Here we will try to find out who really wrote these mysterious papers.\n",
+ "\n",
+ "To do that, we will learn from the undisputed papers and use the resulting models to predict the authors of the disputed ones. First, let's read the texts from the file:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from utils import open_data\n",
+ "from text import *\n",
+ "\n",
+ "federalist = open_data(\"EN-text/federalist.txt\").read()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's see how the text looks. We will print the first 500 characters:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'The Project Gutenberg EBook of The Federalist Papers, by \\nAlexander Hamilton and John Jay and James Madison\\n\\nThis eBook is for the use of anyone anywhere at no cost and with\\nalmost no restrictions whatsoever. You may copy it, give it away or\\nre-use it under the terms of the Project Gutenberg License included\\nwith this eBook or online at www.gutenberg.net\\n\\n\\nTitle: The Federalist Papers\\n\\nAuthor: Alexander Hamilton\\n John Jay\\n James Madison\\n\\nPosting Date: December 12, 2011 [EBook #18]'"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "federalist[:500]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "It seems that the text file opens with Project Gutenberg licensing boilerplate, hardly useful in our case. This opening matter spans the first 114 words of the file, while a second licensing agreement at the end of the file spans the last 3098 words. We need to remove both. To do so, we will first convert the text into words, to make our lives easier."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wordseq = words(federalist)\n",
+ "wordseq = wordseq[114:-3098]"
+ ]
+ },
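+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The offsets 114 and 3098 were found by inspecting this particular file by hand. As a rougher but less brittle alternative, a sketch like the following could locate the standard Project Gutenberg delimiter lines instead. This is a hypothetical variant: it assumes the file contains the usual \"*** START OF ...\" and \"*** END OF ...\" marker lines, which not every Gutenberg release has; if they are missing, it simply keeps the hard-coded slice from above:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Hypothetical sketch: strip the Gutenberg boilerplate by locating its\n",
+ "# delimiter lines instead of hard-coding word offsets. Falls back to\n",
+ "# the hard-coded slice above if the markers are absent.\n",
+ "start = federalist.find('*** START')\n",
+ "end = federalist.find('*** END')\n",
+ "if start != -1 and end != -1:\n",
+ "    # Keep only the text between the two marker lines.\n",
+ "    body = federalist[federalist.index('\\n', start) + 1:end]\n",
+ "    wordseq = words(body)"
+ ]
+ },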
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "wordseq = words(federalist)\n", + "wordseq = wordseq[114:-3098]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's now take a look at the first 100 words:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'federalist no 1 general introduction for the independent journal hamilton to the people of the state of new york after an unequivocal experience of the inefficacy of the subsisting federal government you are called upon to deliberate on a new constitution for the united states of america the subject speaks its own importance comprehending in its consequences nothing less than the existence of the union the safety and welfare of the parts of which it is composed the fate of an empire in many respects the most interesting in the world it has been frequently remarked that it seems to'" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "' '.join(wordseq[:100])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Much better.\n", + "\n", + "As with any Natural Language Processing problem, it is prudent to do some text pre-processing and clean our data before we start building our model. Remember that all the papers are signed as 'Publius', so we can safely remove that word, since it doesn't give us any information as to the real author.\n", + "\n", + "NOTE: Since we are only removing a single word from each paper, this step can be skipped. We add it here to show that processing the data in our hands is something we should always be considering. Oftentimes pre-processing the data in just the right way is the difference between a robust model and a flimsy one." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "wordseq = [w for w in wordseq if w != 'publius']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we have to separate the text from a block of words into papers and assign them to their authors. We can see that each paper starts with the word 'federalist', so we will split the text on that word.\n", + "\n", + "The disputed papers are the papers from 49 to 58, from 18 to 20 and paper 64. We want to leave these papers unassigned. Also, note that there are two versions of paper 70; both from Hamilton.\n", + "\n", + "Finally, to keep the implementation intuitive, we add a `None` object at the start of the `papers` list to make the list index match up with the paper numbering (for example, `papers[5]` now corresponds to paper no. 5 instead of the paper no.6 in the 0-indexed Python)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(4, 16, 52)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import re\n", + "\n", + "papers = re.split(r'federalist\\s', ' '.join(wordseq))\n", + "papers = [p for p in papers if p not in ['', ' ']]\n", + "papers = [None] + papers\n", + "\n", + "disputed = list(range(49, 58+1)) + [18, 19, 20, 64]\n", + "jay, madison, hamilton = [], [], []\n", + "for i, p in enumerate(papers):\n", + " if i in disputed or i == 0:\n", + " continue\n", + " \n", + " if 'jay' in p:\n", + " jay.append(p)\n", + " elif 'madison' in p:\n", + " madison.append(p)\n", + " else:\n", + " hamilton.append(p)\n", + "\n", + "len(jay), len(madison), len(hamilton)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we can see, from the undisputed papers Jay wrote 4, Madison 17 and Hamilton 51 (+1 duplicate). Let's now build our word models. The Unigram Word Model again will come in handy." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "hamilton = ''.join(hamilton)\n", + "hamilton_words = words(hamilton)\n", + "P_hamilton = UnigramWordModel(hamilton_words, default=1)\n", + "\n", + "madison = ''.join(madison)\n", + "madison_words = words(madison)\n", + "P_madison = UnigramWordModel(madison_words, default=1)\n", + "\n", + "jay = ''.join(jay)\n", + "jay_words = words(jay)\n", + "P_jay = UnigramWordModel(jay_words, default=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now it is time to build our new Naive Bayes Learner. It is very similar to the one found in `learning.py`, but with an important difference: it doesn't classify an example, but instead returns the probability of the example belonging to each class. This will allow us to not only see to whom a paper belongs to, but also the probability of authorship as well." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "from utils import product\n", + "\n", + "\n", + "def NaiveBayesLearner(dist):\n", + " \"\"\"A simple naive bayes classifier that takes as input a dictionary of\n", + " Counter distributions and can then be used to find the probability\n", + " of a given item belonging to each class.\n", + " The input dictionary is in the following form:\n", + " ClassName: Counter\"\"\"\n", + " attr_dist = {c_name: count_prob for c_name, count_prob in dist.items()}\n", + "\n", + " def predict(example):\n", + " \"\"\"Predict the probabilities for each class.\"\"\"\n", + " def class_prob(target, e):\n", + " attr = attr_dist[target]\n", + " return product([attr[a] for a in e])\n", + "\n", + " pred = {t: class_prob(t, example) for t in dist.keys()}\n", + "\n", + " total = sum(pred.values())\n", + " if total == 0:\n", + " # Since there are a lot of multiplications of very small numbers,\n", + " # we end up with values equal to 0. To combat that, we keep\n", + " # dividing the example until the sum of the values is not 0.\n", + " random_words_count = max([int(3*len(example)/4), 100])\n", + " pred = predict(random.sample(example, random_words_count))\n", + " else:\n", + " for k, v in pred.items():\n", + " pred[k] = v / total\n", + "\n", + " return pred\n", + "\n", + " return predict" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we will build our Learner. 
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next we will build our Learner. Note that even though Hamilton wrote the most papers, we do not take that as evidence that he is more likely to have written the disputed ones, so we give each class an equal prior probability. We could change the priors if we had some external knowledge, which for this tutorial we do not have."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dist = {('Madison', 1): P_madison, ('Hamilton', 1): P_hamilton, ('Jay', 1): P_jay}\n",
+ "nBS = NaiveBayesLearner(dist)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As usual, the `recognize` function takes a string as input and, after lowercasing it and splitting it into words, feeds it into the Naive Bayes Classifier. Since the classifier has a random element (it re-samples words from the example whenever the probabilities underflow), it is better to run the experiment many times and average the results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def avg_preds(preds):\n",
+ "    \"\"\"Average a list of prediction dictionaries key by key.\"\"\"\n",
+ "    d = {}\n",
+ "    for k in preds[0].keys():\n",
+ "        d[k] = 0\n",
+ "        for p in preds:\n",
+ "            d[k] += p[k]\n",
+ "    \n",
+ "    return {k: d[k] / len(preds)\n",
+ "            for k in preds[0].keys()}\n",
+ "\n",
+ "\n",
+ "def recognize(sentence, nBS):\n",
+ "    sentence = sentence.lower()\n",
+ "    sentence_words = words(sentence)\n",
+ "    \n",
+ "    return avg_preds([nBS(sentence_words) for _ in range(25)])"
+ ]
+ },
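+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick sanity check before tackling the disputed papers (a hypothetical snippet added for illustration; the exact numbers differ from run to run because of the random re-sampling), we can feed the learner an undisputed paper. Paper no. 1 was written by Hamilton, so his class should receive the highest probability:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Hypothetical check on an undisputed paper: no. 1 is by Hamilton,\n",
+ "# so ('Hamilton', 1) should come out on top.\n",
+ "recognize(papers[1], nBS)"
+ ]
+ },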
19\n", + "Hamilton: 0.011570316420346522\n", + "Madison : 0.5281730401297515\n", + "Jay : 0.4602566434499019\n", + "----------------------\n", + "Paper No. 20\n", + "Hamilton: 0.14651509965391551\n", + "Madison : 0.5342142523806944\n", + "Jay : 0.31927064796538995\n", + "----------------------\n", + "Paper No. 64\n", + "Hamilton: 0.5756065218890194\n", + "Madison : 0.3648418106830272\n", + "Jay : 0.059551667427953384\n", + "----------------------\n" + ] + } + ], + "source": [ + "for d in disputed:\n", + " print(\"Paper No. {}\".format(d))\n", + " probs = recognize(papers[d], nBS)\n", + " h = probs[('Hamilton', 1)]\n", + " m = probs[('Madison', 1)]\n", + " j = probs[('Jay', 1)]\n", + " print(\"Hamilton: {}\".format(h))\n", + " print(\"Madison : {}\".format(m))\n", + " print(\"Jay : {}\".format(j))\n", + " print(\"----------------------\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "NOTE: Since the algorithm has an element of random, it will show different results on each run. Generally, the more the experiments, the stabler the results.\n", + "\n", + "This is a simple approach to the problem and thankfully researchers are fairly certain that papers 49-58 were all written by Madison, while 18-20 were written in collaboration between Hamilton and Madison, with Madison being credited for most of the work. Our classifier is not that far off. It should correctly classify all (or most of) the papers by Madison, even though on some occasions the classifier is not that sure. For the collaboration papers between Hamilton and Madison the classifier shows some peculiar results: most of the time it correctly implies that Madison did a lot of the work but instead of Hamilton helping him, it usually shows Jay. This might be because the collaboration between Madison and Hamilton produced some results uncharacteristic to either of them. Without further investigation it is hard to pinpoint the issue.\n", + "\n", + "Unfortunately, it misses paper 64. Consensus is that the paper was written by John Jay, while our classifier believes it was written by Hamilton. The classifier went wrong there because it did not have much information on Jay's writing; only 4 papers. This is one of the problems with using unbalanced datasets such as this one, where information on some classes is sparser than information on the rest. To avoid this, we can add more writings for Jay and Madison to end up with an equal amount of data for each author." + ] } ], "metadata": {