

Learning Notebook: Simple Naive Bayes #628


Merged: 1 commit, Aug 24, 2017
163 changes: 144 additions & 19 deletions learning.ipynb
@@ -110,7 +110,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {
"collapsed": true
},
@@ -817,13 +817,13 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%psource PluralityLearner"
"psource(PluralityLearner)"
]
},
{
@@ -909,13 +909,13 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%psource NearestNeighborLearner"
"psource(NearestNeighborLearner)"
]
},
{
@@ -991,19 +991,39 @@
"\n",
"Information Gain is difference between entropy of the parent and weighted sum of entropy of children. The feature used for splitting is the one which provides the most information gain.\n",
"\n",
"#### Pseudocode\n",
"\n",
"You can view the pseudocode by running the cell below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pseudocode(\"Decision Tree Learning\")"
]
},
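{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration (using a toy split invented here, not one of the notebook's datasets), the cell below computes entropy and information gain by hand:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def entropy(labels):\n",
"    \"\"\"Shannon entropy (in bits) of a list of class labels.\"\"\"\n",
"    total = len(labels)\n",
"    return -sum(labels.count(c) / total * math.log2(labels.count(c) / total)\n",
"                for c in set(labels))\n",
"\n",
"# A parent node with 9 'yes' and 5 'no' examples, split into two children\n",
"parent = ['yes'] * 9 + ['no'] * 5\n",
"children = [['yes'] * 6 + ['no'] * 1, ['yes'] * 3 + ['no'] * 4]\n",
"\n",
"weighted = sum(len(c) / len(parent) * entropy(c) for c in children)\n",
"print(entropy(parent) - weighted)  # information gain, roughly 0.152 bits"
]
},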
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implementation\n",
"The nodes of the tree constructed by our learning algorithm are stored using either `DecisionFork` or `DecisionLeaf` based on whether they are a parent node or a leaf node respectively."
]
},
{
"cell_type": "code",
"execution_count": 27,
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%psource DecisionFork"
"psource(DecisionFork)"
]
},
{
@@ -1015,13 +1035,13 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%psource DecisionLeaf"
"psource(DecisionLeaf)"
]
},
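{
"cell_type": "markdown",
"metadata": {},
"source": [
"Condensed, the two node types look roughly like this (a paraphrased sketch for quick reference, not the module's exact source; run the cells above for the authoritative code):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class DecisionFork:\n",
"    \"\"\"Internal node: tests one attribute and dispatches to a branch.\"\"\"\n",
"    def __init__(self, attr, branches=None):\n",
"        self.attr = attr                # index of the attribute to test\n",
"        self.branches = branches or {}  # attribute value -> child node\n",
"\n",
"    def __call__(self, example):\n",
"        return self.branches[example[self.attr]](example)\n",
"\n",
"\n",
"class DecisionLeaf:\n",
"    \"\"\"Leaf node: always returns its stored class.\"\"\"\n",
"    def __init__(self, result):\n",
"        self.result = result\n",
"\n",
"    def __call__(self, example):\n",
"        return self.result\n",
"\n",
"\n",
"# A tiny hand-built tree: test attribute 0 and map its values to leaves\n",
"tree = DecisionFork(0, {'sunny': DecisionLeaf('no'), 'rainy': DecisionLeaf('yes')})\n",
"print(tree(['sunny']))  # -> no"
]
},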
{
@@ -1033,13 +1053,13 @@
},
{
"cell_type": "code",
"execution_count": 29,
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%psource DecisionTreeLearner"
"psource(DecisionTreeLearner)"
]
},
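{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of how the learner is used (assuming the `DataSet` helper and the bundled *iris* dataset, as elsewhere in this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"iris = DataSet(name=\"iris\")\n",
"DTL = DecisionTreeLearner(iris)\n",
"print(DTL([5.1, 3.0, 1.1, 0.1]))  # small petal measurements; expect 'setosa'"
]
},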
{
@@ -1142,7 +1162,7 @@
"source": [
"### Implementation\n",
"\n",
"The implementation of the Naive Bayes Classifier is split in two; Discrete and Continuous. The user can choose between them with the argument `continuous`."
"The implementation of the Naive Bayes Classifier is split in two; *Learning* and *Simple*. The *learning* classifier takes as input a dataset and learns the needed distributions from that. It is itself split into two, for discrete and continuous features. The *simple* classifier takes as input not a dataset, but already calculated distributions (a dictionary of `CountingProbDist` objects)."
]
},
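{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the constructor calls (assuming the bundled *iris* dataset; the keyword names are those described above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"iris = DataSet(name=\"iris\")\n",
"nBD = NaiveBayesLearner(iris, continuous=False)  # learning classifier, discrete features\n",
"nBC = NaiveBayesLearner(iris, continuous=True)   # learning classifier, continuous features\n",
"# The simple form, covered below, takes precomputed distributions instead of a dataset:\n",
"# nBS = NaiveBayesLearner(dist, simple=True)"
]
},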
{
@@ -1237,13 +1257,13 @@
},
{
"cell_type": "code",
"execution_count": 32,
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%psource NaiveBayesDiscrete"
"psource(NaiveBayesDiscrete)"
]
},
{
@@ -1327,13 +1347,42 @@
},
{
"cell_type": "code",
"execution_count": 35,
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%psource NaiveBayesContinuous"
"psource(NaiveBayesContinuous)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Simple\n",
"\n",
"The simple classifier (chosen with the argument `simple`) does not learn from a dataset, instead it takes as input a dictionary of already calculated `CountingProbDist` objects and returns a predictor function. The dictionary is in the following form: `(Class Name, Class Probability): CountingProbDist Object`.\n",
"\n",
"Each class has its own probability distribution. The classifier given a list of features calculates the probability of the input for each class and returns the max. The only pre-processing work is to create dictionaries for the distribution of classes (named `targets`) and attributes/features.\n",
"\n",
"The complete code for the simple classifier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"psource(NaiveBayesSimple)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This classifier is useful when you already have calculated the distributions and you need to predict future items."
]
},
{
@@ -1385,7 +1434,83 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how the Discrete Classifier misclassified the second item, while the Continuous one had no problem."
"Notice how the Discrete Classifier misclassified the second item, while the Continuous one had no problem.\n",
"\n",
"Let's now take a look at the simple classifier. First we will come up with a sample problem to solve. Say we are given three bags. Each bag contains three letters ('a', 'b' and 'c') of different quantities. We are given a string of letters and we are tasked with finding from which bag the string of letters came.\n",
"\n",
"Since we know the probability distribution of the letters for each bag, we can use the naive bayes classifier to make our prediction."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"bag1 = 'a'*50 + 'b'*30 + 'c'*15\n",
"dist1 = CountingProbDist(bag1)\n",
"bag2 = 'a'*30 + 'b'*45 + 'c'*20\n",
"dist2 = CountingProbDist(bag2)\n",
"bag3 = 'a'*20 + 'b'*20 + 'c'*35\n",
"dist3 = CountingProbDist(bag3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have the `CountingProbDist` objects for each bag/class, we will create the dictionary. We assume that it is equally probable that we will pick from any bag."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"dist = {('First', 0.5): dist1, ('Second', 0.3): dist2, ('Third', 0.2): dist3}\n",
"nBS = NaiveBayesLearner(dist, simple=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can start making predictions:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First\n",
"Second\n",
"Third\n"
]
}
],
"source": [
"print(nBS('aab')) # We can handle strings\n",
"print(nBS(['b', 'b'])) # And lists!\n",
"print(nBS('ccbcc'))"
]
},
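{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the first prediction comes out as 'First', we can redo the arithmetic by hand. The unnormalized posterior of a class is its prior probability times the product of the letter frequencies in its bag (assuming the classifier uses the raw relative frequencies, with no smoothing):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bags = {'First': (0.5, bag1), 'Second': (0.3, bag2), 'Third': (0.2, bag3)}\n",
"\n",
"for name, (prior, bag) in bags.items():\n",
"    p = prior\n",
"    for letter in 'aab':\n",
"        p *= bag.count(letter) / len(bag)  # frequency of the letter in this bag\n",
"    print(name, round(p, 4))  # 'First' gets the largest value, as predicted"
]
},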
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results make intuitive sence. The first bag has a high amount of 'a's, the second has a high amount of 'b's and the third has a high amount of 'c's. The classifier seems to confirm this intuition.\n",
"\n",
"Note that the simple classifier doesn't distinguish between discrete and continuous values. It just takes whatever it is given. Also, the `simple` option on the `NaiveBayesLearner` overrides the `continuous` argument. `NaiveBayesLearner(d, simple=True, continuous=False)` just creates a simple classifier."
]
},
{
@@ -1423,13 +1548,13 @@
},
{
"cell_type": "code",
"execution_count": 37,
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%psource PerceptronLearner"
"psource(PerceptronLearner)"
]
},
{