Data Mining
Lecture Notes for Chapter 4
Artificial Neural Networks
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
Artificial Neural Networks (ANN)
Basic Idea: A complex non-linear function can be
learned as a composition of simple processing units
ANN is a collection of simple processing units
(nodes) that are connected by directed links (edges)
– Every node receives signals from incoming edges,
performs computations, and transmits signals to
outgoing edges
– Analogous to human brain where nodes are neurons
and signals are electrical impulses
– Weight of an edge determines the strength of
connection between the nodes
– Simplest ANN: Perceptron (single neuron)
Basic Architecture of Perceptron
[Figure: perceptron with input nodes, weighted links, and an output node that applies an activation function]
The perceptron computes ŷ = sign( Σ_j w_j x_j + b ), so it learns linear decision boundaries
Similar to logistic regression (activation function is sign
instead of sigmoid)
Perceptron Example
X1 X2 X3 Y
1 0 0 -1
1 0 1 1
1 1 0 1
1 1 1 1
0 0 1 -1
0 1 0 -1
0 1 1 1
0 0 0 -1
Output Y is 1 if at least two of the three inputs are equal to 1.
Perceptron Example
X1 X2 X3 Y
1 0 0 -1
1 0 1 1
1 1 0 1
1 1 1 1
0 0 1 -1
0 1 0 -1
0 1 1 1
0 0 0 -1
Y = sign( 0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 )
where sign(x) = +1 if x ≥ 0
               −1 if x < 0
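As a quick check, a minimal sketch (not from the slides) that evaluates this perceptron on the eight rows of the example table:

```python
import numpy as np

# Evaluate y = sign(0.3*x1 + 0.3*x2 + 0.3*x3 - 0.4) on the example data.
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],
              [0,0,1],[0,1,0],[0,1,1],[0,0,0]])
y = np.array([-1, 1, 1, 1, -1, -1, 1, -1])

scores = X @ np.array([0.3, 0.3, 0.3]) - 0.4
y_pred = np.where(scores >= 0, 1, -1)    # sign, with sign(0) = +1
print(y_pred)                            # matches y on every row
assert np.array_equal(y_pred, y)
```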
Perceptron Learning Rule
Initialize the weights (w0, w1, …, wd)
Repeat
– For each training example (x_i, y_i):
  Compute the predicted output:  ŷ_i^(k) = sign( Σ_j w_j^(k) x_ij )
  Update the weights:  w_j^(k+1) = w_j^(k) + λ ( y_i − ŷ_i^(k) ) x_ij
Until stopping condition is met
k: iteration number; λ: learning rate
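A minimal sketch of this learning rule in Python (the function name and the sign(0) = +1 convention are assumptions of the sketch, not from the slides):

```python
import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=10):
    """Perceptron learning rule sketch.
    X: (n, d) inputs, y: (n,) labels in {-1, +1}.
    Returns weights of length d+1, where w[0] is the bias weight."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1 for the bias
    w = np.zeros(Xb.shape[1])                   # initialize weights to 0
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            y_hat = 1 if w @ xi >= 0 else -1    # sign(w . x), with sign(0) = +1
            w += lr * (yi - y_hat) * xi         # update only when prediction is wrong
    return w
```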
Perceptron Learning Rule
Weight update formula:  w_j^(k+1) = w_j^(k) + λ ( y_i − ŷ_i^(k) ) x_ij
Intuition:
– Update weight based on error: e = y − ŷ
– If y = ŷ, e = 0: no update needed
– If y > ŷ, e = 2: weight must be increased so
that w · x will increase
– If y < ŷ, e = −2: weight must be decreased so
that w · x will decrease
Example of Perceptron Learning
λ = 0.1

Training data:
X1  X2  X3   Y
1   0   0   -1
1   0   1    1
1   1   0    1
1   1   1    1
0   0   1   -1
0   1   0   -1
0   1   1    1
0   0   0   -1

Weight updates over first epoch (one update per training instance):
Iteration   w0     w1     w2     w3
0            0      0      0      0
1          -0.2   -0.2     0      0
2            0      0      0     0.2
3            0      0      0     0.2
4            0      0      0     0.2
5          -0.2     0      0      0
6          -0.2     0      0      0
7            0      0     0.2    0.2
8          -0.2     0     0.2    0.2

Weight updates over all epochs (weights at the end of each epoch):
Epoch       w0     w1     w2     w3
0            0      0      0      0
1          -0.2     0     0.2    0.2
2          -0.2     0     0.4    0.2
3          -0.4     0     0.4    0.2
4          -0.4    0.2    0.4    0.4
5          -0.6    0.2    0.4    0.2
6          -0.6    0.4    0.4    0.2
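As a check, reusing the perceptron_train and the X, y arrays from the sketches above (the epoch argument is the only new parameter), one pass over the data should reproduce the first-epoch weights in the table:

```python
# One epoch over the example data with learning rate 0.1.
w = perceptron_train(X, y, lr=0.1, epochs=1)
print(np.round(w, 2))   # expected: (w0, w1, w2, w3) = (-0.2, 0, 0.2, 0.2)
```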
Perceptron Learning
Since ŷ is based on a linear
combination of the input
variables, the decision
boundary is linear
For nonlinearly separable problems, the perceptron
learning algorithm will fail because no linear
hyperplane can separate the data perfectly
Nonlinearly Separable Data
XOR Data
y = x1 ⊕ x2 (XOR)

x1  x2   y
0   0   -1
1   0    1
0   1    1
1   1   -1
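Reusing the perceptron_train sketch from above (the epoch count here is arbitrary), training on the XOR data illustrates the failure: no weight vector can classify all four points correctly.

```python
import numpy as np

X_xor = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y_xor = np.array([-1, 1, 1, -1])

w = perceptron_train(X_xor, y_xor, lr=0.1, epochs=100)
scores = np.hstack([np.ones((4, 1)), X_xor]) @ w
y_pred = np.where(scores >= 0, 1, -1)
print(y_pred)   # at least one of the four points is always misclassified,
                # because no linear hyperplane separates the XOR classes
```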
Multi-layer Neural Network
More than one hidden layer of
computing nodes
Every node in a hidden layer
operates on activations from
preceding layer and transmits
activations forward to nodes of
next layer
Also referred to as
“feedforward neural networks”
Multi-layer Neural Network
Multi-layer neural networks with at least one
hidden layer can solve any type of classification
task involving nonlinear decision surfaces
[Figure: XOR data, which a multi-layer network can classify correctly]
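As a concrete illustration (a minimal sketch with hand-chosen weights, not from the slides), a network with one hidden layer of two units can represent XOR:

```python
import numpy as np

def step(z):
    """Threshold activation: +1 if z >= 0, else -1."""
    return np.where(z >= 0, 1, -1)

def xor_net(x1, x2):
    # Hidden layer: h1 fires when x1 OR x2, h2 fires when x1 AND x2
    h1 = step(x1 + x2 - 0.5)          # OR
    h2 = step(x1 + x2 - 1.5)          # AND
    # Output fires when OR is true but AND is false, i.e. XOR
    return step(h1 - h2 - 0.5)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))   # -> -1, 1, 1, -1
```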
Why Multiple Hidden Layers?
Activations at hidden layers can be viewed as features
extracted as functions of inputs
Every hidden layer represents a level of abstraction
– Complex features are compositions of simpler features
Number of layers is known as depth of ANN
– Deeper networks express complex hierarchy of features
Multi-Layer Network Architecture
Activation value at node i at layer l:  a_i^l = f( z_i^l )
Linear predictor:  z_i^l = Σ_j w_ij^l · a_j^(l−1) + b_i^l
( f: activation function applied to the linear predictor )
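A minimal forward-pass sketch in this notation (the layer sizes, random weights, and the choice of sigmoid here are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Compute activations layer by layer: z^l = W^l a^(l-1) + b^l, a^l = f(z^l)."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b     # linear predictor at layer l
        a = f(z)          # activation values at layer l
    return a

# Illustrative 2-3-1 network with random weights (shapes only, not learned values)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases  = [rng.normal(size=3), rng.normal(size=1)]
print(forward(np.array([0.5, -1.0]), weights, biases))
```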
Activation Functions
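The slide's figure (plots of common activation functions) does not survive in the text; as a sketch, the usual choices can be written as:

```python
import numpy as np

def sign(z):     return np.where(z >= 0, 1.0, -1.0)   # perceptron threshold
def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))      # squashes to (0, 1)
def tanh(z):     return np.tanh(z)                     # squashes to (-1, 1)
def relu(z):     return np.maximum(0.0, z)             # rectified linear unit
```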
Learning Multi-layer Neural Network
Can we apply perceptron learning rule to each
node, including hidden nodes?
– Perceptron learning rule computes error term
e = y − ŷ and updates weights accordingly
Problem: how to determine the true value of y for
hidden nodes?
– Approximate error in hidden nodes by error in
the output nodes
Problem:
– Not clear how adjustments in the hidden nodes affect the overall
error
– No guarantee of convergence to optimal solution
Gradient Descent
Loss function to measure errors across all training points:
E(w) = Σ_k Loss( y_k , ŷ_k )
Squared loss:  Loss( y_k , ŷ_k ) = ( y_k − ŷ_k )²
Gradient descent: update parameters in the direction of
"maximum descent" in the loss function across all points:
w_j ← w_j − λ ∂E(w)/∂w_j
λ: learning rate
Stochastic gradient descent (SGD): update the weights for every
instance (mini-batch SGD: update over mini-batches of instances)
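A minimal sketch of these updates for a linear predictor with squared loss (the model choice and function names are assumptions made for illustration; the slides' setting is a general network):

```python
import numpy as np

def squared_loss_grad(w, X, y):
    """Gradient of E(w) = sum_k (y_k - w.x_k)^2 for a linear predictor."""
    y_hat = X @ w
    return -2.0 * X.T @ (y - y_hat)

def gradient_descent(X, y, lr=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * squared_loss_grad(w, X, y)      # one update over all points
    return w

def sgd(X, y, lr=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xk, yk in zip(X, y):                  # one update per instance
            w -= lr * (-2.0) * (yk - w @ xk) * xk
    return w
```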
Computing Gradients
Loss on a single training instance:  Loss = ( y − ŷ )² = ( y − a^L )²
Using the chain rule of differentiation (on a single instance):
∂Loss/∂w_ij^l = ∂Loss/∂a_i^l × ∂a_i^l/∂z_i^l × ∂z_i^l/∂w_ij^l = δ_i^l · f′(z_i^l) · a_j^(l−1),
where δ_i^l = ∂Loss/∂a_i^l
For the sigmoid activation function:  f′(z_i^l) = a_i^l ( 1 − a_i^l )
How can we compute δ_i^l for every layer?
Backpropagation Algorithm
At the output layer L:
δ^L = ∂Loss/∂a^L = −2 ( y − a^L )
At a hidden layer l (using the chain rule):
δ_j^l = ∂Loss/∂a_j^l = Σ_i δ_i^(l+1) · f′(z_i^(l+1)) · w_ij^(l+1)
– Gradients at layer l can be computed using gradients at layer l + 1
– Start from layer L and "backpropagate" gradients to all previous
layers
Use gradient descent to update weights at every epoch
For the next epoch, use the updated weights to compute the loss function and its gradients
Iterate until convergence (loss does not change)
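A minimal backpropagation sketch for one hidden layer, sigmoid activations, and squared loss (the function name, layer sizes, and target encoding are illustrative assumptions, not the slides' notation made precise):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
    """One gradient-descent update on a single instance.
    y: target(s) for the sigmoid output unit (values in [0, 1] in this sketch)."""
    # Forward pass: a^1 = f(z^1), a^2 = f(z^2)
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

    # Output layer: delta^L = dLoss/da^L = -2 (y - a^L)
    delta2 = -2.0 * (y - a2)
    # Hidden layer: delta_j^l = sum_i delta_i^(l+1) f'(z_i^(l+1)) w_ij^(l+1)
    delta1 = W2.T @ (delta2 * a2 * (1 - a2))

    # Gradients: dLoss/dw_ij^l = delta_i^l * f'(z_i^l) * a_j^(l-1)
    gW2 = np.outer(delta2 * a2 * (1 - a2), a1)
    gb2 = delta2 * a2 * (1 - a2)
    gW1 = np.outer(delta1 * a1 * (1 - a1), x)
    gb1 = delta1 * a1 * (1 - a1)

    # Gradient-descent updates
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2
```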
Design Issues in ANN
Number of nodes in input layer
– One input node per binary/continuous attribute
– k or log2 k nodes for each categorical attribute with k
values
Number of nodes in output layer
– One output for binary class problem
– k or log2 k nodes for k-class problem
Number of hidden layers and nodes per layer
Initial weights and biases
Learning rate, max. number of epochs, mini-batch size for
mini-batch SGD, …
Characteristics of ANN
Multilayer ANNs are universal approximators but can
suffer from overfitting if the network is too large
Gradient descent may converge to local minimum
Model building can be very time consuming, but testing
can be very fast
Can handle redundant and irrelevant attributes because
weights are automatically learnt for all attributes
Sensitive to noise in training data
Difficult to handle missing attributes
Deep Learning Trends
Training deep neural networks (more than 5-10 layers)
has become possible only in recent times, with:
– Faster computing resources (GPU)
– Larger labeled training sets
– Algorithmic Improvements in Deep Learning
Recent Trends:
– Specialized ANN Architectures:
Convolutional Neural Networks (for image data)
Recurrent Neural Networks (for sequence data)
Residual Networks (with skip connections)
– Unsupervised Models: Autoencoders
– Generative Models: Generative Adversarial Networks
Vanishing Gradient Problem
Sigmoid activation function easily saturates (near-zero gradient
with respect to z) when z is too large or too small
This leads to small (or zero) gradients of the squared loss with respect to the weights,
especially at the hidden layers, leading to slow (or no) learning
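A quick numeric illustration (a sketch, not from the slides): the sigmoid derivative f'(z) = f(z)(1 - f(z)) collapses toward zero as |z| grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(z)
    print(f"z = {z:5.1f}   sigmoid'(z) = {s * (1 - s):.6f}")
# z =   0.0   sigmoid'(z) = 0.250000
# z =   2.0   sigmoid'(z) = 0.104994
# z =   5.0   sigmoid'(z) = 0.006648
# z =  10.0   sigmoid'(z) = 0.000045
```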
Handling Vanishing Gradient Problem
Use of cross-entropy loss function:
Loss( y , ŷ ) = − y log( ŷ ) − ( 1 − y ) log( 1 − ŷ )   (for y ∈ {0, 1})
Use of Rectified Linear Unit (ReLU) activations:
ReLU( z ) = max( 0 , z )
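A short sketch contrasting the two activations (the test values are illustrative): ReLU's gradient stays at 1 for positive inputs, so it does not saturate the way the sigmoid does.

```python
def relu(z):
    return max(0.0, z)              # ReLU(z) = max(0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0    # derivative: 1 for z > 0, 0 for z < 0

for z in [0.5, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}   relu'(z) = {relu_grad(z):.1f}")
# the gradient stays at 1.0 no matter how large z gets,
# unlike sigmoid'(z) in the previous sketch
```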