diff --git a/learning.ipynb b/learning.ipynb index 9f2d91add..78ff4f0e3 100644 --- a/learning.ipynb +++ b/learning.ipynb @@ -16,7 +16,9 @@ "cell_type": "code", "execution_count": 1, "metadata": { - "collapsed": true + "collapsed": true, + "deletable": true, + "editable": true }, "outputs": [], "source": [ @@ -32,26 +34,51 @@ "source": [ "## Contents\n", "\n", - "* Datasets\n", "* Machine Learning Overview\n", - "* Plurality Learner Classifier\n", - " * Overview\n", - " * Implementation\n", - " * Example\n", - "* k-Nearest Neighbours Classifier\n", - " * Overview\n", - " * Implementation\n", - " * Example\n", - "* Perceptron Classifier\n", - " * Overview\n", - " * Implementation\n", - " * Example\n", - "* MNIST Handwritten Digits Classification\n", + "* Datasets\n", + "* Plurality Learner\n", + "* k-Nearest Neighbours\n", + "* Perceptron\n", + "* MNIST Handwritten Digits\n", " * Loading and Visualising\n", " * Testing\n", " * kNN Classifier" ] }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "## Machine Learning Overview\n", + "\n", + "In this notebook, we learn about agents that can improve their behavior through diligent study of their own experiences.\n", + "\n", + "An agent is **learning** if it improves its performance on future tasks after making observations about the world.\n", + "\n", + "There are three types of feedback that determine the three main types of learning:\n", + "\n", + "* **Supervised Learning**:\n", + "\n", + "In Supervised Learning the agent observes some example input-output pairs and learns a function that maps from input to output.\n", + "\n", + "**Example**: Let's think of an agent to classify images containing cats or dogs. If we provide an image containing a cat or a dog, this agent should output a string \"cat\" or \"dog\" for that particular image. To teach this agent, we will give a lot of input-output pairs like {cat image-\"cat\"}, {dog image-\"dog\"} to the agent. The agent then learns a function that maps from an input image to one of those strings.\n", + "\n", + "* **Unsupervised Learning**:\n", + "\n", + "In Unsupervised Learning the agent learns patterns in the input even though no explicit feedback is supplied. The most common type is **clustering**: detecting potential useful clusters of input examples.\n", + "\n", + "**Example**: A taxi agent would develop a concept of *good traffic days* and *bad traffic days* without ever being given labeled examples.\n", + "\n", + "* **Reinforcement Learning**:\n", + "\n", + "In Reinforcement Learning the agent learns from a series of reinforcements—rewards or punishments.\n", + "\n", + "**Example**: Let's talk about an agent to play the popular Atari game—[Pong](http://www.ponggame.org). We will reward a point for every correct move and deduct a point for every wrong move from the agent. Eventually, the agent will figure out its actions prior to reinforcement were most responsible for it." + ] + }, { "cell_type": "markdown", "metadata": { @@ -63,9 +90,9 @@ "\n", "For the following tutorials we will use a range of datasets, to better showcase the strengths and weaknesses of the algorithms. The datasests are the following:\n", "\n", - "* [Fisher's Iris](https://github.com/aimacode/aima-data/blob/a21fc108f52ad551344e947b0eb97df82f8d2b2b/iris.csv). Each item represents a flower, with four measurements: the length and the width of the sepals and petals. 
Each item/flower is categorized into one of three species: Setosa, Versicolor and Virginica.\n", + "* [Fisher's Iris](https://github.com/aimacode/aima-data/blob/a21fc108f52ad551344e947b0eb97df82f8d2b2b/iris.csv): Each item represents a flower, with four measurements: the length and the width of the sepals and petals. Each item/flower is categorized into one of three species: Setosa, Versicolor and Virginica.\n", "\n", - "* [Zoo](https://github.com/aimacode/aima-data/blob/a21fc108f52ad551344e947b0eb97df82f8d2b2b/zoo.csv). The dataset holds different animals and their classification as \"mammal\", \"fish\", etc. The new animal we want to classify has the following measurements: 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1 (don't concern yourself with what the measurements mean)." + "* [Zoo](https://github.com/aimacode/aima-data/blob/a21fc108f52ad551344e947b0eb97df82f8d2b2b/zoo.csv): The dataset holds different animals and their classification as \"mammal\", \"fish\", etc. The new animal we want to classify has the following measurements: 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1 (don't concern yourself with what the measurements mean)." ] }, { @@ -75,31 +102,480 @@ "editable": true }, "source": [ - "## Machine Learning Overview\n", + "To make using the datasets easier, we have written a class, `DataSet`, in `learning.py`. The tutorials found here make use of this class.\n", "\n", - "In this notebook, we learn about agents that can improve their behavior through diligent study of their own experiences.\n", + "Let's have a look at how it works before we get started with the algorithms." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "### Intro\n", "\n", - "An agent is **learning** if it improves its performance on future tasks after making observations about the world.\n", + "A lot of the datasets we will work with are .csv files (although other formats are supported too). We have a collection of sample datasets ready to use [on aima-data](https://github.com/aimacode/aima-data/tree/a21fc108f52ad551344e947b0eb97df82f8d2b2b). Two examples are the datasets mentioned above (*iris.csv* and *zoo.csv*). You can find plenty datasets online, and a good repository of such datasets is [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html).\n", "\n", - "There are three types of feedback that determine the three main types of learning:\n", + "In such files, each line corresponds to one item/measurement. Each individual value in a line represents a *feature* and usually there is a value denoting the *class* of the item.\n", "\n", - "* **Supervised Learning**:\n", + "You can find the code for the dataset here:" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "collapsed": true, + "deletable": true, + "editable": true + }, + "outputs": [], + "source": [ + "%psource DataSet" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "### Class Attributes\n", "\n", - "In Supervised Learning the agent observes some example input-output pairs and learns a function that maps from input to output.\n", + "* **examples**: Holds the items of the dataset. Each item is a list of values.\n", "\n", - "**Example**: Let's think of an agent to classify images containing cats or dogs. If we provide an image containing a cat or a dog, this agent should output a string \"cat\" or \"dog\" for that particular image. 
To teach this agent, we will give a lot of input-output pairs like {cat image-\"cat\"}, {dog image-\"dog\"} to the agent. The agent then learns a function that maps from an input image to one of those strings.\n", + "* **attrs**: The indexes of the features (by default in the range of [0,f), where *f* is the number of features. For example, `item[i]` returns the feature at index *i* of *item*.\n", "\n", - "* **Unsupervised Learning**:\n", + "* **attrnames**: An optional list with attribute names. For example, `item[s]`, where *s* is a feature name, returns the feature of name *s* in *item*.\n", "\n", - "In Unsupervised Learning the agent learns patterns in the input even though no explicit feedback is supplied. The most common type is **clustering**: detecting potential useful clusters of input examples.\n", + "* **target**: The attribute a learning algorithm will try to predict. By default the last attribute.\n", "\n", - "**Example**: A taxi agent would develop a concept of *good traffic days* and *bad traffic days* without ever being given labeled examples.\n", + "* **inputs**: This is the list of attributes without the target.\n", "\n", - "* **Reinforcement Learning**:\n", + "* **values**: A list of lists which holds the set of possible values for the corresponding attribute/feature. If initially `None`, it gets computed (by the function `setproblem`) from the examples.\n", "\n", - "In Reinforcement Learning the agent learns from a series of reinforcements—rewards or punishments.\n", + "* **distance**: The distance function used in the learner to calculate the distance between two items. By default `mean_boolean_error`.\n", "\n", - "**Example**: Let's talk about an agent to play the popular Atari game—[Pong](http://www.ponggame.org). We will reward a point for every correct move and deduct a point for every wrong move from the agent. Eventually, the agent will figure out its actions prior to reinforcement were most responsible for it." + "* **name**: Name of the dataset.\n", + "\n", + "* **source**: The source of the dataset (url or other). Not used in the code.\n", + "\n", + "* **exclude**: A list of indexes to exclude from `inputs`. The list can include either attribute indexes (attrs) or names (attrnames)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "### Class Helper Functions\n", + "\n", + "These functions help modify a `DataSet` object to your needs.\n", + "\n", + "* **sanitize**: Takes as input an example and returns it with non-input (target) attributes replaced by `None`. Useful for testing. Keep in mind that the example given is not itself sanitized, but instead a sanitized copy is returned.\n", + "\n", + "* **classes_to_numbers**: Maps the class names of a dataset to numbers. If the class names are not given, they are computed from the dataset values. Useful for classifiers that return a numerical value instead of a string.\n", + "\n", + "* **remove_examples**: Removes examples containing a given value. Useful for removing examples with missing values, or for removing classes (needed for binary classifiers)." 
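+   ,
+   "\n",
+   "As a quick illustration of how these attributes and helpers fit together, here is a minimal sketch (not part of the original tutorial). It assumes, based on the attribute list above, that the `DataSet` constructor also accepts in-memory `examples` and `attrnames` as keyword arguments; the toy data is made up:\n",
+   "\n",
+   "```python\n",
+   "# Hypothetical toy dataset: two features plus a class label in the last column.\n",
+   "tiny = DataSet(examples=[[1, 0, 'yes'], [0, 1, 'no'], [1, 1, 'yes']],\n",
+   "               attrnames=['a', 'b', 'class'], name='tiny')\n",
+   "print(tiny.target)                      # index of the class attribute (last by default)\n",
+   "print(tiny.inputs)                      # the remaining attribute indexes\n",
+   "print(tiny.sanitize(tiny.examples[0]))  # a copy of the first example with the class hidden\n",
+   "```"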
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "### Importing a Dataset\n", + "\n", + "#### Importing from aima-data\n", + "\n", + "Datasets uploaded on aima-data can be imported with the following line:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [], + "source": [ + "iris = DataSet(name=\"iris\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "To check that we imported the correct dataset, we can do the following:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[5.1, 3.5, 1.4, 0.2, 'setosa']\n", + "[0, 1, 2, 3]\n" + ] + } + ], + "source": [ + "print(iris.examples[0])\n", + "print(iris.inputs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "Which correctly prints the first line in the csv file and the list of attribute indexes." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "When importing a dataset, we can specify to exclude an attribute (for example, at index 1) by setting the parameter `exclude` to the attribute index or name." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0, 2, 3]\n" + ] + } + ], + "source": [ + "iris2 = DataSet(name=\"iris\",exclude=[1])\n", + "print(iris2.inputs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "### Attributes\n", + "\n", + "Here we showcase the attributes.\n", + "\n", + "First we will print the first three items/examples in the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[5.1, 3.5, 1.4, 0.2, 'setosa'], [4.9, 3.0, 1.4, 0.2, 'setosa'], [4.7, 3.2, 1.3, 0.2, 'setosa']]\n" + ] + } + ], + "source": [ + "print(iris.examples[:3])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "Then we will print `attrs`, `attrnames`, `target`, `input`. Notice how `attrs` holds values in [0,4], but since the fourth attribute is the target, `inputs` holds values in [0,3]." 
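+   ,
+   "\n",
+   "The relationship between `attrs`, `target` and `inputs` can be summed up in one line of plain Python. This is just an illustration of the idea, not necessarily the exact code in `learning.py`:\n",
+   "\n",
+   "```python\n",
+   "attrs = [0, 1, 2, 3, 4]\n",
+   "target = 4                                  # the class attribute\n",
+   "inputs = [a for a in attrs if a != target]  # [0, 1, 2, 3]\n",
+   "```"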
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "attrs: [0, 1, 2, 3, 4]\n", + "attrnames (by default same as attrs): [0, 1, 2, 3, 4]\n", + "target: 4\n", + "inputs: [0, 1, 2, 3]\n" + ] + } + ], + "source": [ + "print(\"attrs:\", iris.attrs)\n", + "print(\"attrnames (by default same as attrs):\", iris.attrnames)\n", + "print(\"target:\", iris.target)\n", + "print(\"inputs:\", iris.inputs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "Now we will print all the possible values for the first feature/attribute." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[4.7, 5.5, 6.3, 5.0, 4.9, 5.1, 4.6, 5.4, 4.4, 4.8, 5.8, 7.0, 7.1, 4.5, 5.9, 5.6, 6.9, 6.6, 6.5, 6.4, 6.0, 6.1, 7.6, 7.4, 7.9, 4.3, 5.7, 5.3, 5.2, 6.7, 6.2, 6.8, 7.3, 7.2, 7.7]\n" + ] + } + ], + "source": [ + "print(iris.values[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "Finally we will print the dataset's name and source. Keep in mind that we have not set a source for the dataset, so in this case it is empty." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "name: iris\n", + "source: \n" + ] + } + ], + "source": [ + "print(\"name:\", iris.name)\n", + "print(\"source:\", iris.source)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "A useful combination of the above is `dataset.values[dataset.target]` which returns the possible values of the target. For classification problems, this will return all the possible classes. Let's try it:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['setosa', 'virginica', 'versicolor']\n" + ] + } + ], + "source": [ + "print(iris.values[iris.target])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "### Helper Functions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "We will now take a look at the auxiliary functions found in the class.\n", + "\n", + "First we will take a look at the `sanitize` function, which sets the non-input values of the given example to `None`.\n", + "\n", + "In this case we want to hide the class of the first example, so we will sanitize it.\n", + "\n", + "Note that the function doesn't actually change the given example; it returns a sanitized *copy* of it." 
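+   ,
+   "\n",
+   "Roughly speaking, `sanitize` behaves like the sketch below (an illustration of the described behaviour; see `learning.py` for the actual implementation):\n",
+   "\n",
+   "```python\n",
+   "def sanitize_sketch(dataset, example):\n",
+   "    # Keep the input attributes, replace everything else (the target) with None.\n",
+   "    return [v if i in dataset.inputs else None for i, v in enumerate(example)]\n",
+   "```"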
+ ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Sanitized: [5.1, 3.5, 1.4, 0.2, None]\n", + "Original: [5.1, 3.5, 1.4, 0.2, 'setosa']\n" + ] + } + ], + "source": [ + "print(\"Sanitized:\",iris.sanitize(iris.examples[0]))\n", + "print(\"Original:\",iris.examples[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "Currently the `iris` dataset has three classes, setosa, virginica and versicolor. We want though to convert it to a binary class dataset (a dataset with two classes). The class we want to remove is \"virginica\". To accomplish that we will utilize the helper function `remove_examples`." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['setosa', 'versicolor']\n" + ] + } + ], + "source": [ + "iris.remove_examples(\"virginica\")\n", + "print(iris.values[iris.target])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "Finally we take a look at `classes_to_numbers`. For a lot of the classifiers in the module (like the Neural Network), classes should have numerical values. With this function we map string class names to numbers." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "collapsed": false, + "deletable": true, + "editable": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Class of first example: setosa\n", + "Class of first example: 0\n" + ] + } + ], + "source": [ + "print(\"Class of first example:\",iris.examples[0][iris.target])\n", + "iris.classes_to_numbers()\n", + "print(\"Class of first example:\",iris.examples[0][iris.target])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": true, + "editable": true + }, + "source": [ + "As you can see \"setosa\" was mapped to 0." ] }, {