diff --git a/learning.ipynb b/learning.ipynb index e83fb5b57..88e70be98 100644 --- a/learning.ipynb +++ b/learning.ipynb @@ -36,7 +36,6 @@ "* Decision Tree Learner\n", "* Naive Bayes Learner\n", "* Perceptron\n", - "* Neural Network\n", "* Learner Evaluation\n", "* MNIST Handwritten Digits\n", " * Loading and Visualising\n", @@ -969,20 +968,29 @@ "metadata": {}, "source": [ "## DECISION TREE LEARNER\n", + "\n", "### Overview\n", + "\n", "#### Decision Trees\n", - "A decision tree is a flowchart that uses a tree of decisions and their possible consequences for classification. At each non-leaf node of the tree an attribute of the input is tested, based on which the corresponding branch leading to a child-node is selected. At the leaf node the input is classified based on the class label of this leaf node. The paths from root to leaf represent classification rules based on which leaf nodes are assigned class labels.\n", + "A decision tree is a flowchart that uses a tree of decisions and their possible consequences for classification. At each non-leaf node of the tree an attribute of the input is tested, based on which the corresponding branch leading to a child-node is selected. At the leaf node the input is classified based on the class label of this leaf node. The paths from root to leaves represent classification rules based on which leaf nodes are assigned class labels.\n", "![decision tree](images/decisiontree_fruit.jpg)\n", "#### Decision Tree Learning\n", "Decision tree learning is the construction of a decision tree from class-labeled training data. The data is expected to be a tuple in which each record of the tuple is an attribute used for classification. The decision tree is built top-down, by choosing a variable at each step that best splits the set of items. There are different metrics for measuring the \"best split\". These generally measure the homogeneity of the target variable within the subsets.\n", + "\n", "#### Gini Impurity\n", "Gini impurity of a set is the probability of a randomly chosen element being incorrectly labeled if it were labeled randomly according to the distribution of labels in the set.\n", + "\n", "$$I_G(p) = \\sum{p_i(1 - p_i)} = 1 - \\sum{p_i^2}$$\n", + "\n", "We select the split which minimizes the Gini impurity in the child nodes.\n", + "\n", "#### Information Gain\n", "Information gain is based on the concept of entropy from information theory. Entropy is defined as:\n", + "\n", "$$H(p) = -\\sum{p_i \\log_2{p_i}}$$\n", + "\n", "Information Gain is the difference between the entropy of the parent and the weighted sum of the entropies of the children. The feature used for splitting is the one which provides the most information gain.\n", + "\n", "### Implementation\n", "The nodes of the tree constructed by our learning algorithm are stored using either `DecisionFork` or `DecisionLeaf` based on whether they are a parent node or a leaf node respectively." ] }
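To make these two impurity measures concrete, here is a minimal sketch in plain Python (an illustration only, not the code from [learning.py](https://github.com/aimacode/aima-python/blob/master/learning.py)) that computes the Gini impurity, the entropy and the information gain of a candidate split directly from lists of class labels:

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class distribution."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)) over the class distribution."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropies of the children."""
    total = len(parent)
    return entropy(parent) - sum(len(child) / total * entropy(child)
                                 for child in children)


labels = ['apple'] * 5 + ['orange'] * 5
print(gini(labels))                                               # 0.5
print(information_gain(labels, [['apple'] * 5, ['orange'] * 5]))  # 1.0
```

A split that separates the classes perfectly drives the children's impurity to zero, which is why it maximizes the information gain.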
@@ -1041,9 +1049,9 @@ "The implementation of `DecisionTreeLearner` provided in [learning.py](https://github.com/aimacode/aima-python/blob/master/learning.py) uses information gain as the metric for selecting which attribute to test for splitting. The function builds the tree top-down in a recursive manner. Based on the input it makes one of the four choices:\n", "    \n", "
  1. If the input at the current step has no training data we return the mode of classes of input data received in the parent step (previous level of recursion).
  2. \n", - "
  3. If all values in training data belongs to the same class it returns a `DecisionLeaf` whose class label is the class which all the data belongs to.
  4. \n", - "
  5. If the data has no attributes that can be tested we return prurality value of class of training data.
  6. \n", - "
  7. We choose the attribute which gives highest amount of entropy gain and return a `DecisionFork` which splits based of this attribute. Each branch recursively calls `decision_tree_learning` to constructs the sub-tree.
  8. \n", + "
  9. If all the values in the training data belong to the same class, we return a `DecisionLeaf` whose class label is the class to which all the data belongs.
  10. \n", + "
  11. If the data has no attributes that can be tested we return the plurality value (the most common class) of the training data.
  12. \n", + "
  13. We choose the attribute which gives the highest information gain and return a `DecisionFork` which splits based on this attribute. Each branch recursively calls `decision_tree_learning` to construct the sub-tree.
  14. \n", "
" ] }, @@ -1470,133 +1478,6 @@ "The correct output is 0, which means the item belongs in the first class, \"setosa\". Note that the Perceptron algorithm is not perfect and may produce false classifications." ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## NEURAL NETWORK\n", - "\n", - "### Overview\n", - "\n", - "Although the Perceptron may seem like a good way to make classifications, it is a linear classifier (which, roughly, means it can only draw straight lines to divide spaces) and therefore it can be stumped by more complex problems. We can extend Perceptron to solve this issue, by employing multiple layers of its functionality. The construct we are left with is called a Neural Network, or a Multi-Layer Perceptron, and it is a non-linear classifier. It achieves that by combining the results of linear functions on each layer of the network.\n", - "\n", - "Similar to the Perceptron, this network also has an input and output layer. However it can also have a number of hidden layers. These hidden layers are responsible for the non-linearity of the network. The layers are comprised of nodes. Each node in a layer (excluding the input one), holds some values, called *weights*, and takes as input the output values of the previous layer. The node then calculates the dot product of its inputs and its weights and then activates it with an *activation function* (sometimes a sigmoid). Its output is fed to the nodes of the next layer. Note that sometimes the output layer does not use an activation function, or uses a different one from the rest of the network. The process of passing the outputs down the layer is called *feed-forward*.\n", - "\n", - "After the input values are fed-forward into the network, the resulting output can be used for classification. The problem at hand now is how to train the network (ie. adjust the weights in the nodes). To accomplish that we utilize the *Backpropagation* algorithm. In short, it does the opposite of what we were doing up to now. Instead of feeding the input forward, it will feed the error backwards. So, after we make a classification, we check whether it is correct or not, and how far off we were. We then take this error and propagate it backwards in the network, adjusting the weights of the nodes accordingly. We will run the algorithm on the given input/dataset for a fixed amount of time, or until we are satisfied with the results. The number of times we will iterate over the dataset is called *epochs*. In a later section we take a detailed look at how this algorithm works.\n", - "\n", - "NOTE: Sometimes we add to the input of each layer another node, called *bias*. This is a constant value that will be fed to the next layer, usually set to 1. The bias generally helps us \"shift\" the computed function to the left or right." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![neural_net](images/neural_net.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Implementation\n", - "\n", - "The `NeuralNetLearner` function takes as input a dataset to train upon, the learning rate (in (0, 1]), the number of epochs and finally the size of the hidden layers. This last argument is a list, with each element corresponding to one hidden layer.\n", - "\n", - "After that we will create our neural network in the `network` function. This function will make the necessary connections between the input layer, hidden layer and output layer. 
With the network ready, we will use the `BackPropagationLearner` to train the weights of our network for the examples provided in the dataset.\n", - "\n", - "The NeuralNetLearner returns the `predict` function, which can receive an example and feed-forward it into our network to generate a prediction." - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": { - "collapsed": true - }, - "outputs": [], - "source": [ - "%psource NeuralNetLearner" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Backpropagation\n", - "\n", - "In both the Perceptron and the Neural Network, we are using the Backpropagation algorithm to train our weights. Basically it achieves that by propagating the errors from our last layer into our first layer, this is why it is called Backpropagation. In order to use Backpropagation, we need a cost function. This function is responsible for indicating how good our neural network is for a given example. One common cost function is the *Mean Squared Error* (MSE). This cost function has the following format:\n", - "\n", - "$$MSE=\\frac{1}{2} \\sum_{i=1}^{n}(y - \\hat{y})^{2}$$\n", - "\n", - "Where `n` is the number of training examples, $\\hat{y}$ is our prediction and $y$ is the correct prediction for the example.\n", - "\n", - "The algorithm combines the concept of partial derivatives and the chain rule to generate the gradient for each weight in the network based on the cost function.\n", - "\n", - "For example, if we are using a Neural Network with three layers, the sigmoid function as our activation function and the MSE cost function, we want to find the gradient for the a given weight $w_{j}$, we can compute it like this:\n", - "\n", - "$$\\frac{\\partial MSE(\\hat{y}, y)}{\\partial w_{j}} = \\frac{\\partial MSE(\\hat{y}, y)}{\\partial \\hat{y}}\\times\\frac{\\partial\\hat{y}(in_{j})}{\\partial in_{j}}\\times\\frac{\\partial in_{j}}{\\partial w_{j}}$$\n", - "\n", - "Solving this equation, we have:\n", - "\n", - "$$\\frac{\\partial MSE(\\hat{y}, y)}{\\partial w_{j}} = (\\hat{y} - y)\\times{\\hat{y}}'(in_{j})\\times a_{j}$$\n", - "\n", - "Remember that $\\hat{y}$ is the activation function applied to a neuron in our hidden layer, therefore $$\\hat{y} = sigmoid(\\sum_{i=1}^{num\\_neurons}w_{i}\\times a_{i})$$\n", - "\n", - "Also $a$ is the input generated by feeding the input layer variables into the hidden layer.\n", - "\n", - "We can use the same technique for the weights in the input layer as well. After we have the gradients for both weights, we use gradient descent to update the weights of the network." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Implementation\n", - "\n", - "First, we feed-forward the examples in our neural network. After that, we calculate the gradient for each layer weights. Once that is complete, we update all the weights using gradient descent. After running these for a given number of epochs, the function returns the trained Neural Network." 
- ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": { - "collapsed": true - }, - "outputs": [], - "source": [ - "%psource BackPropagationLearner" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "0\n" - ] - } - ], - "source": [ - "iris = DataSet(name=\"iris\")\n", - "iris.classes_to_numbers()\n", - "\n", - "nNL = NeuralNetLearner(iris)\n", - "print(nNL([5, 3, 1, 0.1]))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The output should be 0, which means the item should get classified in the first class, \"setosa\". Note that since the algorithm is non-deterministic (because of the random initial weights) the classification might be wrong. Usually though it should be correct.\n", - "\n", - "To increase accuracy, you can (most of the time) add more layers and nodes. Unfortunately the more layers and nodes you have, the greater the computation cost." - ] - }, { "cell_type": "markdown", "metadata": {}, "source": [ @@ -2238,7 +2119,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.1" + "version": "3.5.3" } }, "nbformat": 4, diff --git a/neural_nets.ipynb b/neural_nets.ipynb new file mode 100644 index 000000000..e7e085107 --- /dev/null +++ b/neural_nets.ipynb @@ -0,0 +1,174 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# NEURAL NETWORKS\n", + "\n", + "This notebook covers the neural network algorithms from chapter 18 of the book *Artificial Intelligence: A Modern Approach*, by Stuart Russell and Peter Norvig. The code in the notebook can be found in [learning.py](https://github.com/aimacode/aima-python/blob/master/learning.py).\n", + "\n", + "Execute the cell below to get started:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from learning import *" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## NEURAL NETWORK ALGORITHM\n", + "\n", + "### Overview\n", + "\n", + "Although the Perceptron may seem like a good way to make classifications, it is a linear classifier (which, roughly, means it can only draw straight lines to divide spaces) and therefore it can be stumped by more complex problems. We can extend the Perceptron to solve this issue by employing multiple layers of its functionality. The construct we are left with is called a Neural Network, or a Multi-Layer Perceptron, and it is a non-linear classifier. It achieves that by combining the results of linear functions on each layer of the network.\n", + "\n", + "Similar to the Perceptron, this network also has an input and output layer. However, it can also have a number of hidden layers. These hidden layers are responsible for the non-linearity of the network. The layers are comprised of nodes. Each node in a layer (excluding the input one) holds some values, called *weights*, and takes as input the output values of the previous layer. The node then calculates the dot product of its inputs and its weights and passes the result through an *activation function* (sometimes a sigmoid). Its output is fed to the nodes of the next layer. Note that sometimes the output layer does not use an activation function, or uses a different one from the rest of the network. The process of passing the outputs down the layers is called *feed-forward*.\n",
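To make the feed-forward step concrete, here is a minimal sketch of a fully connected network (an illustration only; the actual network representation is built by the `network` function in learning.py), where every node applies the sigmoid to the dot product of its weights with the previous layer's outputs:

```python
from math import exp


def sigmoid(x):
    return 1 / (1 + exp(-x))


def feed_forward(layers, inputs):
    """layers: one list of weight vectors per layer, one weight vector per node."""
    activations = inputs
    for layer in layers:
        activations = [sigmoid(sum(w * a for w, a in zip(weights, activations)))
                       for weights in layer]
    return activations


# A toy network with 2 inputs, one hidden layer of 2 nodes and 1 output node.
# The weights here are arbitrary; training adjusts them via Backpropagation.
hidden_layer = [[0.5, -0.6], [0.1, 0.8]]
output_layer = [[1.2, -0.4]]
print(feed_forward([hidden_layer, output_layer], [1.0, 0.0]))
```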
+ "\n", + "After the input values are fed-forward into the network, the resulting output can be used for classification. The problem at hand now is how to train the network (i.e. adjust the weights in the nodes). To accomplish that we utilize the *Backpropagation* algorithm. In short, it does the opposite of what we were doing up to now. Instead of feeding the input forward, it will feed the error backwards. So, after we make a classification, we check whether it is correct or not, and how far off we were. We then take this error and propagate it backwards in the network, adjusting the weights of the nodes accordingly. We will run the algorithm on the given input/dataset for a fixed amount of time, or until we are satisfied with the results. The number of times we will iterate over the dataset is called *epochs*. In a later section we take a detailed look at how this algorithm works.\n", + "\n", + "NOTE: Sometimes we add to the input of each layer another node, called *bias*. This is a constant value that will be fed to the next layer, usually set to 1. The bias generally helps us \"shift\" the computed function to the left or right." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![neural_net](images/neural_net.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Implementation\n", + "\n", + "The `NeuralNetLearner` function takes as input a dataset to train upon, the learning rate (in (0, 1]), the number of epochs and finally the size of the hidden layers. This last argument is a list, with each element corresponding to one hidden layer.\n", + "\n", + "After that we will create our neural network in the `network` function. This function will make the necessary connections between the input layer, hidden layer and output layer. With the network ready, we will use the `BackPropagationLearner` to train the weights of our network for the examples provided in the dataset.\n", + "\n", + "`NeuralNetLearner` returns the `predict` function, which can receive an example and feed-forward it into our network to generate a prediction." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%psource NeuralNetLearner" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## BACKPROPAGATION\n", + "\n", + "### Overview\n", + "\n", + "In both the Perceptron and the Neural Network, we are using the Backpropagation algorithm to train our weights. Basically, it achieves that by propagating the errors from the last layer back into the first layer; this is why it is called Backpropagation. In order to use Backpropagation, we need a cost function. This function is responsible for indicating how good our neural network is for a given example. One common cost function is the *Mean Squared Error* (MSE). This cost function has the following format:\n", + "\n", + "$$MSE=\\frac{1}{2} \\sum_{i=1}^{n}(y - \\hat{y})^{2}$$\n", + "\n", + "Where `n` is the number of training examples, $\\hat{y}$ is our prediction and $y$ is the correct output for the example.\n", + "\n", + "The algorithm combines the concept of partial derivatives and the chain rule to generate the gradient for each weight in the network based on the cost function.\n", + "\n", + "For example, if we are using a Neural Network with three layers, the sigmoid function as our activation function and the MSE cost function, and we want to find the gradient for a given weight $w_{j}$, we can compute it like this:\n", + "\n", + "$$\\frac{\\partial MSE(\\hat{y}, y)}{\\partial w_{j}} = \\frac{\\partial MSE(\\hat{y}, y)}{\\partial \\hat{y}}\\times\\frac{\\partial\\hat{y}(in_{j})}{\\partial in_{j}}\\times\\frac{\\partial in_{j}}{\\partial w_{j}}$$\n", + "\n", + "Solving this equation, we have:\n", + "\n", + "$$\\frac{\\partial MSE(\\hat{y}, y)}{\\partial w_{j}} = (\\hat{y} - y)\\times{\\hat{y}}'(in_{j})\\times a_{j}$$\n", + "\n", + "Remember that $\\hat{y}$ is the activation function applied to a neuron in our hidden layer, therefore $$\\hat{y} = sigmoid(\\sum_{i=1}^{num\\_neurons}w_{i}\\times a_{i})$$\n", + "\n", + "Also, $a$ is the input generated by feeding the input layer variables into the hidden layer.\n", + "\n", + "We can use the same technique for the weights in the input layer as well. After we have the gradients for both layers' weights, we use gradient descent to update the weights of the network." + ] + },
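The derivation above translates almost directly into code. The sketch below is a simplified illustration (it is not `BackPropagationLearner` itself): it computes the gradient $(\hat{y} - y)\times{\hat{y}}'(in_{j})\times a_{j}$ for the weights of a single sigmoid node and applies one gradient descent step:

```python
from math import exp


def sigmoid(x):
    return 1 / (1 + exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)


a = [1.0, 0.5, -0.2]      # activations feeding into the node
w = [0.4, -0.1, 0.3]      # current weights of the node
y = 1.0                   # correct output for this example
learning_rate = 0.5

in_j = sum(w_i * a_i for w_i, a_i in zip(w, a))   # weighted sum of the inputs
y_hat = sigmoid(in_j)                             # the node's prediction

# Gradient of the squared error with respect to each weight, as derived above.
gradients = [(y_hat - y) * sigmoid_derivative(in_j) * a_j for a_j in a]

# One gradient descent update: move each weight against its gradient.
w = [w_j - learning_rate * g_j for w_j, g_j in zip(w, gradients)]
print(w)
```

Repeating updates of this kind for every weight, over a number of epochs, is in essence what `BackPropagationLearner` does.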
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Implementation\n", + "\n", + "First, we feed-forward the examples in our neural network. After that, we calculate the gradient for each layer's weights. Once that is complete, we update all the weights using gradient descent. After running these steps for a given number of epochs, the function returns the trained Neural Network." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "%psource BackPropagationLearner" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0\n" + ] + } + ], + "source": [ + "iris = DataSet(name=\"iris\")\n", + "iris.classes_to_numbers()\n", + "\n", + "nNL = NeuralNetLearner(iris)\n", + "print(nNL([5, 3, 1, 0.1]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output should be 0, which means the item should get classified in the first class, \"setosa\". Note that since the algorithm is non-deterministic (because of the random initial weights) the classification might be wrong. Usually though it should be correct.\n", + "\n", + "To increase accuracy, you can (most of the time) add more layers and nodes. Unfortunately, the more layers and nodes you have, the greater the computation cost." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}
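As a follow-up to the notebook's closing remark on accuracy, a quick sanity check is to score the learner over the dataset's own examples. The sketch below assumes the `DataSet` attributes `examples`, `inputs` and `target` from learning.py; the resulting ratio will vary between runs because of the random initial weights:

```python
from learning import DataSet, NeuralNetLearner

iris = DataSet(name="iris")
iris.classes_to_numbers()
nNL = NeuralNetLearner(iris)

# Fraction of the training examples the trained network classifies correctly.
correct = 0
for example in iris.examples:
    features = [example[i] for i in iris.inputs]   # the input attribute values
    if nNL(features) == example[iris.target]:      # compare with the class label
        correct += 1

print(correct / len(iris.examples))
```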