|
110 | 110 | },
|
111 | 111 | {
|
112 | 112 | "cell_type": "code",
|
113 |
| - "execution_count": 2, |
| 113 | + "execution_count": null, |
114 | 114 | "metadata": {
|
115 | 115 | "collapsed": true
|
116 | 116 | },
|
|
817 | 817 | },
|
818 | 818 | {
|
819 | 819 | "cell_type": "code",
|
820 |
| - "execution_count": 23, |
| 820 | + "execution_count": null, |
821 | 821 | "metadata": {
|
822 | 822 | "collapsed": true
|
823 | 823 | },
|
824 | 824 | "outputs": [],
|
825 | 825 | "source": [
|
826 |
| - "%psource PluralityLearner" |
| 826 | + "psource(PluralityLearner)" |
827 | 827 | ]
|
828 | 828 | },
|
829 | 829 | {
|
|
909 | 909 | },
|
910 | 910 | {
|
911 | 911 | "cell_type": "code",
|
912 |
| - "execution_count": 25, |
| 912 | + "execution_count": null, |
913 | 913 | "metadata": {
|
914 | 914 | "collapsed": true
|
915 | 915 | },
|
916 | 916 | "outputs": [],
|
917 | 917 | "source": [
|
918 |
| - "%psource NearestNeighborLearner" |
| 918 | + "psource(NearestNeighborLearner)" |
919 | 919 | ]
|
920 | 920 | },
|
921 | 921 | {
|
|
991 | 991 | "\n",
|
992 | 992 | "Information Gain is the difference between the entropy of the parent and the weighted sum of the entropies of the children. The feature used for splitting is the one which provides the most information gain.\n",
|
993 | 993 | "\n",
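| + "As a quick reference, information gain can be written as follows (a standard formulation, not taken from the implementation; $S_k$ denotes the subset of examples reaching the $k$-th child after splitting on attribute $A$):\n", |
| + "\n", |
| + "$$Gain(A) = Entropy(parent) - \\sum_{k} \\frac{|S_k|}{|S|} \\, Entropy(S_k)$$\n", |
| + "\n", |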
|
| 994 | + "#### Pseudocode\n", |
| 995 | + "\n", |
| 996 | + "You can view the pseudocode by running the cell below:" |
| 997 | + ] |
| 998 | + }, |
| 999 | + { |
| 1000 | + "cell_type": "code", |
| 1001 | + "execution_count": null, |
| 1002 | + "metadata": { |
| 1003 | + "collapsed": true |
| 1004 | + }, |
| 1005 | + "outputs": [], |
| 1006 | + "source": [ |
| 1007 | + "pseudocode(\"Decision Tree Learning\")" |
| 1008 | + ] |
| 1009 | + }, |
| 1010 | + { |
| 1011 | + "cell_type": "markdown", |
| 1012 | + "metadata": {}, |
| 1013 | + "source": [ |
994 | 1014 | "### Implementation\n",
|
995 | 1015 | "The nodes of the tree constructed by our learning algorithm are stored using either `DecisionFork` or `DecisionLeaf` based on whether they are a parent node or a leaf node respectively."
|
996 | 1016 | ]
|
997 | 1017 | },
|
998 | 1018 | {
|
999 | 1019 | "cell_type": "code",
|
1000 |
| - "execution_count": 27, |
| 1020 | + "execution_count": null, |
1001 | 1021 | "metadata": {
|
1002 | 1022 | "collapsed": true
|
1003 | 1023 | },
|
1004 | 1024 | "outputs": [],
|
1005 | 1025 | "source": [
|
1006 |
| - "%psource DecisionFork" |
| 1026 | + "psource(DecisionFork)" |
1007 | 1027 | ]
|
1008 | 1028 | },
|
1009 | 1029 | {
|
|
1015 | 1035 | },
|
1016 | 1036 | {
|
1017 | 1037 | "cell_type": "code",
|
1018 |
| - "execution_count": 28, |
| 1038 | + "execution_count": null, |
1019 | 1039 | "metadata": {
|
1020 | 1040 | "collapsed": true
|
1021 | 1041 | },
|
1022 | 1042 | "outputs": [],
|
1023 | 1043 | "source": [
|
1024 |
| - "%psource DecisionLeaf" |
| 1044 | + "psource(DecisionLeaf)" |
1025 | 1045 | ]
|
1026 | 1046 | },
|
1027 | 1047 | {
|
|
1033 | 1053 | },
|
1034 | 1054 | {
|
1035 | 1055 | "cell_type": "code",
|
1036 |
| - "execution_count": 29, |
| 1056 | + "execution_count": null, |
1037 | 1057 | "metadata": {
|
1038 | 1058 | "collapsed": true
|
1039 | 1059 | },
|
1040 | 1060 | "outputs": [],
|
1041 | 1061 | "source": [
|
1042 |
| - "%psource DecisionTreeLearner" |
| 1062 | + "psource(DecisionTreeLearner)" |
1043 | 1063 | ]
|
1044 | 1064 | },
|
1045 | 1065 | {
|
|
1142 | 1162 | "source": [
|
1143 | 1163 | "### Implementation\n",
|
1144 | 1164 | "\n",
|
1145 |
| - "The implementation of the Naive Bayes Classifier is split in two; Discrete and Continuous. The user can choose between them with the argument `continuous`." |
| 1165 | + "The implementation of the Naive Bayes Classifier is split into two parts: *Learning* and *Simple*. The *learning* classifier takes a dataset as input and learns the needed distributions from it. It is itself split into two versions, one for discrete and one for continuous features. The *simple* classifier takes as input not a dataset but already-calculated distributions (a dictionary of `CountingProbDist` objects)." |
1146 | 1166 | ]
|
1147 | 1167 | },
|
1148 | 1168 | {
|
|
1237 | 1257 | },
|
1238 | 1258 | {
|
1239 | 1259 | "cell_type": "code",
|
1240 |
| - "execution_count": 32, |
| 1260 | + "execution_count": null, |
1241 | 1261 | "metadata": {
|
1242 | 1262 | "collapsed": true
|
1243 | 1263 | },
|
1244 | 1264 | "outputs": [],
|
1245 | 1265 | "source": [
|
1246 |
| - "%psource NaiveBayesDiscrete" |
| 1266 | + "psource(NaiveBayesDiscrete)" |
1247 | 1267 | ]
|
1248 | 1268 | },
|
1249 | 1269 | {
|
|
1327 | 1347 | },
|
1328 | 1348 | {
|
1329 | 1349 | "cell_type": "code",
|
1330 |
| - "execution_count": 35, |
| 1350 | + "execution_count": null, |
1331 | 1351 | "metadata": {
|
1332 | 1352 | "collapsed": true
|
1333 | 1353 | },
|
1334 | 1354 | "outputs": [],
|
1335 | 1355 | "source": [
|
1336 |
| - "%psource NaiveBayesContinuous" |
| 1356 | + "psource(NaiveBayesContinuous)" |
| 1357 | + ] |
| 1358 | + }, |
| 1359 | + { |
| 1360 | + "cell_type": "markdown", |
| 1361 | + "metadata": {}, |
| 1362 | + "source": [ |
| 1363 | + "#### Simple\n", |
| 1364 | + "\n", |
| 1365 | + "The simple classifier (chosen with the argument `simple`) does not learn from a dataset; instead, it takes as input a dictionary of already-calculated `CountingProbDist` objects and returns a predictor function. The dictionary has the following form: `(Class Name, Class Probability): CountingProbDist Object`.\n", |
| 1366 | + "\n", |
| 1367 | + "Each class has its own probability distribution. Given a list of features, the classifier calculates the probability of the input for each class and returns the class with the maximum probability. The only pre-processing needed is to create dictionaries for the distribution of classes (named `targets`) and of the attributes/features.\n", |
| 1368 | + "\n", |
| 1369 | + "The complete code for the simple classifier:" |
| 1370 | + ] |
| 1371 | + }, |
| 1372 | + { |
| 1373 | + "cell_type": "code", |
| 1374 | + "execution_count": null, |
| 1375 | + "metadata": {}, |
| 1376 | + "outputs": [], |
| 1377 | + "source": [ |
| 1378 | + "psource(NaiveBayesSimple)" |
| 1379 | + ] |
| 1380 | + }, |
| 1381 | + { |
| 1382 | + "cell_type": "markdown", |
| 1383 | + "metadata": {}, |
| 1384 | + "source": [ |
| 1385 | + "This classifier is useful when you have already calculated the distributions and need to predict future items." |
1337 | 1386 | ]
|
1338 | 1387 | },
|
1339 | 1388 | {
|
|
1385 | 1434 | "cell_type": "markdown",
|
1386 | 1435 | "metadata": {},
|
1387 | 1436 | "source": [
|
1388 |
| - "Notice how the Discrete Classifier misclassified the second item, while the Continuous one had no problem." |
| 1437 | + "Notice how the Discrete Classifier misclassified the second item, while the Continuous one had no problem.\n", |
| 1438 | + "\n", |
| 1439 | + "Let's now take a look at the simple classifier. First we will come up with a sample problem to solve. Say we are given three bags, each containing the three letters 'a', 'b' and 'c' in different quantities. We are then given a string of letters and are tasked with finding which bag the string came from.\n", |
| 1440 | + "\n", |
| 1441 | + "Since we know the probability distribution of the letters for each bag, we can use the Naive Bayes classifier to make our prediction." |
| 1442 | + ] |
| 1443 | + }, |
| 1444 | + { |
| 1445 | + "cell_type": "code", |
| 1446 | + "execution_count": 2, |
| 1447 | + "metadata": { |
| 1448 | + "collapsed": true |
| 1449 | + }, |
| 1450 | + "outputs": [], |
| 1451 | + "source": [ |
| 1452 | + "bag1 = 'a'*50 + 'b'*30 + 'c'*15\n", |
| 1453 | + "dist1 = CountingProbDist(bag1)\n", |
| 1454 | + "bag2 = 'a'*30 + 'b'*45 + 'c'*20\n", |
| 1455 | + "dist2 = CountingProbDist(bag2)\n", |
| 1456 | + "bag3 = 'a'*20 + 'b'*20 + 'c'*35\n", |
| 1457 | + "dist3 = CountingProbDist(bag3)" |
| 1458 | + ] |
| 1459 | + }, |
| 1460 | + { |
| 1461 | + "cell_type": "markdown", |
| 1462 | + "metadata": {}, |
| 1463 | + "source": [ |
| 1464 | + "Now that we have the `CountingProbDist` objects for each bag/class, we will create the dictionary. We assume the prior probabilities of picking from the first, second and third bag are 0.5, 0.3 and 0.2 respectively." |
| 1465 | + ] |
| 1466 | + }, |
| 1467 | + { |
| 1468 | + "cell_type": "code", |
| 1469 | + "execution_count": 3, |
| 1470 | + "metadata": { |
| 1471 | + "collapsed": true |
| 1472 | + }, |
| 1473 | + "outputs": [], |
| 1474 | + "source": [ |
| 1475 | + "dist = {('First', 0.5): dist1, ('Second', 0.3): dist2, ('Third', 0.2): dist3}\n", |
| 1476 | + "nBS = NaiveBayesLearner(dist, simple=True)" |
| 1477 | + ] |
| 1478 | + }, |
| 1479 | + { |
| 1480 | + "cell_type": "markdown", |
| 1481 | + "metadata": {}, |
| 1482 | + "source": [ |
| 1483 | + "Now we can start making predictions:" |
| 1484 | + ] |
| 1485 | + }, |
| 1486 | + { |
| 1487 | + "cell_type": "code", |
| 1488 | + "execution_count": 4, |
| 1489 | + "metadata": {}, |
| 1490 | + "outputs": [ |
| 1491 | + { |
| 1492 | + "name": "stdout", |
| 1493 | + "output_type": "stream", |
| 1494 | + "text": [ |
| 1495 | + "First\n", |
| 1496 | + "Second\n", |
| 1497 | + "Third\n" |
| 1498 | + ] |
| 1499 | + } |
| 1500 | + ], |
| 1501 | + "source": [ |
| 1502 | + "print(nBS('aab')) # We can handle strings\n", |
| 1503 | + "print(nBS(['b', 'b'])) # And lists!\n", |
| 1504 | + "print(nBS('ccbcc'))" |
| 1505 | + ] |
| 1506 | + }, |
| 1507 | + { |
| 1508 | + "cell_type": "markdown", |
| 1509 | + "metadata": {}, |
| 1510 | + "source": [ |
| 1511 | + "The results make intuitive sense. The first bag has a high proportion of 'a's, the second a high proportion of 'b's and the third a high proportion of 'c's, and the classifier's predictions confirm this intuition.\n", |
| 1512 | + "\n", |
| 1513 | + "Note that the simple classifier doesn't distinguish between discrete and continuous values; it just takes whatever it is given. Also, the `simple` option of `NaiveBayesLearner` overrides the `continuous` argument: `NaiveBayesLearner(d, simple=True, continuous=False)` still creates a simple classifier." |
1389 | 1514 | ]
|
1390 | 1515 | },
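| + { |
| + "cell_type": "markdown", |
| + "metadata": {}, |
| + "source": [ |
| + "As a sanity check, we can sketch the computation by hand. This assumes `CountingProbDist` supports item access for letter probabilities (e.g. `dist1['a']`); the unnormalized posterior of each bag for the string 'aab' is the bag's prior times the product of its letter probabilities, and the bag with the highest score should match the classifier's answer:" |
| + ] |
| + }, |
| + { |
| + "cell_type": "code", |
| + "execution_count": null, |
| + "metadata": {}, |
| + "outputs": [], |
| + "source": [ |
| + "# Unnormalized posterior for 'aab' under each bag: prior * product of letter probabilities\n", |
| + "scores = {'First': 0.5 * dist1['a'] * dist1['a'] * dist1['b'],\n", |
| + "          'Second': 0.3 * dist2['a'] * dist2['a'] * dist2['b'],\n", |
| + "          'Third': 0.2 * dist3['a'] * dist3['a'] * dist3['b']}\n", |
| + "print(max(scores, key=scores.get))  # should agree with nBS('aab')" |
| + ] |
| + }, |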
|
1391 | 1516 | {
|
|
1423 | 1548 | },
|
1424 | 1549 | {
|
1425 | 1550 | "cell_type": "code",
|
1426 |
| - "execution_count": 37, |
| 1551 | + "execution_count": null, |
1427 | 1552 | "metadata": {
|
1428 | 1553 | "collapsed": true
|
1429 | 1554 | },
|
1430 | 1555 | "outputs": [],
|
1431 | 1556 | "source": [
|
1432 |
| - "%psource PerceptronLearner" |
| 1557 | + "psource(PerceptronLearner)" |
1433 | 1558 | ]
|
1434 | 1559 | },
|
1435 | 1560 | {
|
|