Course: Artificial Intelligence (COMP6065)
Unofficial Slides
Learning from Examples II
Session 19
Revised by Williem, S. Kom., Ph.D.
Learning Outcomes
At the end of this session, students will be able to:
• LO 5: Apply various techniques to an agent when acting under
uncertainty
• LO 6: Apply techniques to process natural language and other
perceptual signals so that an agent can interact
intelligently with the world
Outline
1. The Theory of Learning
2. Regression and Classification with Linear Models
3. Artificial Neural Networks
4. Practical Machine Learning
5. Summary
The Theory of Learning
• How can we be sure that our learning algorithm has produced a
hypothesis that will predict the correct value for previously
unseen inputs?
• In formal terms, how do we know that the hypothesis h is close to
the target function f if we don't know what f is?
• How many examples do we need to get a good h?
• What hypothesis space should we use?
• If the hypothesis space is very complex, can we even find the best h?
The Theory of Learning
• How many examples are needed for learning?
– Addressed by computational learning theory
• Any hypothesis that is seriously wrong will almost certainly be
“found out” with high probability after a small number of
examples, because it will make an incorrect prediction.
• Thus, any hypothesis that is consistent with a sufficiently large
set of training examples is unlikely to be seriously wrong: that is,
it must be probably approximately correct (PAC)
– Any learning algorithm that returns probably approximately
correct hypotheses is called a PAC learning algorithm
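To make this concrete, the standard sample-complexity bound (from AIMA §18.5) states that a hypothesis consistent with

    m ≥ (1/ε) · (ln(1/δ) + ln |H|)

examples is, with probability at least 1 − δ, approximately correct (error at most ε); here |H| is the number of hypotheses in the hypothesis space.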
The Theory of Learning
• X = set of all possible examples
• D = probability distribution from which examples are drawn,
assumed to be the same for the training and test sets
• H = set of possible hypotheses
• m = number of training examples
error(h) = P(h(x) ≠ f(x) | x drawn from D)
• A hypothesis h is approximately correct if error(h) ≤ ε, for some
small constant ε
[Figure: the hypothesis space H, with the target function f inside
the ball of approximately correct hypotheses and the seriously
wrong hypotheses H_bad outside it]
The Theory of Learning
• PAC learning example: learning decision lists
– A decision list consists of a series of tests, each of which is
a conjunction of literals
– Decision lists resemble decision trees, but their structure is
simpler: they branch in only one direction
[Figure: a decision list for the restaurant problem.
Patrons(x, Some)? → yes; else Patrons(x, Full) and Fri/Sat(x)? → yes;
else no]
We can measure the number of examples needed for PAC learning!
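As a minimal sketch (not from the slides), the decision list in the figure translates directly into code; the function and attribute names here are hypothetical:

```python
def will_wait(x):
    """Evaluate the decision list from the figure, test by test."""
    if x["Patrons"] == "Some":                   # first test
        return True
    if x["Patrons"] == "Full" and x["FriSat"]:   # second test
        return True
    return False                                 # default outcome

# A full restaurant on a Friday/Saturday night passes the second test:
print(will_wait({"Patrons": "Full", "FriSat": True}))  # True
```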
Regression and Classification
with Linear Models
• Learning a linear model: Fitting a straight line
– Has been used for hundreds of years
• Cases:
– Univariate linear regression
– Linear classifiers with a threshold
Regression and Classification
with Linear Models
• Univariate linear regression
– We want to estimate a linear model that fits the examples:
h_w(x) = w1·x + w0
• w0 and w1 are the weights (coefficients)
• h_w(x) is the estimated output
– What we have to do is minimize the empirical loss
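Concretely, with the squared-error loss that AIMA uses for regression, the empirical loss over the N training examples (x_j, y_j) is

    Loss(h_w) = Σ_{j=1..N} (y_j − h_w(x_j))²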
Regression and Classification
with Linear Models
• Univariate linear regression
– How do we minimize it?
• Closed form: set the partial derivatives of the loss to
zero and solve
• Iteratively: gradient descent
Regression and Classification
with Linear Models
• Univariate linear regression
– Gradient descent
• α is the learning rate (step size)
• Convergence: stop when the derivatives fall below a threshold
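The generic gradient descent update, applied to each weight until convergence, is

    w_i ← w_i − α · ∂Loss(w) / ∂w_i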
Regression and Classification
with Linear Models
• Univariate linear regression
– Gradient descent
• The derivatives give the final update rules (shown below)
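Working out the partial derivatives of the squared loss gives the batch update rules from AIMA:

    w0 ← w0 + α · Σ_j (y_j − h_w(x_j))
    w1 ← w1 + α · Σ_j (y_j − h_w(x_j)) · x_j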
Study Case
• Test 2 Grade = w0 + w1 · (Test 1 Grade)
• From the data:
– Estimate w0
– Estimate w1
(a code sketch follows the table)

Student | Test 1 | Test 2
   1    |   50   |   32
   2    |   51   |   33
   3    |   52   |   34
   4    |   53   |   35
   5    |   54   |   36
   6    |   55   |   37
   7    |   56   |   39
   8    |   57   |   40
   9    |   58   |   41
  10    |   59   |   42
  11    |   60   |   43
  12    |   61   |   44
  13    |   62   |   46
  14    |   63   |   47
  15    |   64   |   48
  16    |   65   |   49
  17    |   66   |   50
  18    |   67   |   51
  19    |   68   |   53
  20    |   69   |   54
  21    |   70   |   55
  22    |   71   |   56
  23    |   72   |   57
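A minimal sketch of the estimation (not from the slides), using the closed-form least-squares solution, i.e. the "set the partial derivatives to zero" approach from the earlier slide:

```python
# Data from the table above.
xs = list(range(50, 73))                           # Test 1 grades, students 1-23
ys = [32, 33, 34, 35, 36, 37, 39, 40, 41, 42, 43, 44,
      46, 47, 48, 49, 50, 51, 53, 54, 55, 56, 57]  # Test 2 grades

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Closed-form least-squares estimates for univariate linear regression.
w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
w0 = (sy - w1 * sx) / n
print(f"Test 2 Grade = {w0:.2f} + {w1:.3f} * Test 1 Grade")
# -> Test 2 Grade = -26.33 + 1.160 * Test 1 Grade
```

Gradient descent with the update rules above converges to the same weights for a sufficiently small learning rate.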
Regression and Classification
with Linear Models
• Linear classifiers with a threshold
– We can use a linear function to do classification
– A linear decision boundary is a straight line that separates
two classes
[Figure: plot of two seismic data parameters;
x1: body wave magnitude, x2: surface wave magnitude;
white circles: earthquakes, black circles: nuclear explosions]
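For reference, AIMA's hard-threshold linear classifier and its perceptron learning rule are

    h_w(x) = 1 if w · x ≥ 0, else 0
    w_i ← w_i + α · (y − h_w(x)) · x_i

which converges to a separating line whenever the data are linearly separable.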
Artificial Neural Network
Artificial Neural Network
• An artificial network that imitates the neurons in our brain
– There is a hypothesis that mental activity consists of
electrochemical activity in networks of brain cells
• Each node in an ANN fires when a linear combination of its inputs
exceeds some threshold (i.e., it implements a linear classifier)
Artificial Neural Network
• Neural networks are composed of nodes or units connected
by directed links
• A link from unit i to unit j serves to propagate the activation ai
from i to j
• Each link has a numeric weight wi,j associated with it
• Each unit then applies an activation function g to the weighted
sum of its inputs to derive its output (activation) aj
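Putting these pieces together, each unit j computes

    in_j = Σ_i w_{i,j} · a_i        a_j = g(in_j)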
Artificial Neural Network
• The activation function g is typically
– A hard threshold (perceptron) or
– A soft threshold (sigmoid perceptron)
• There are two fundamentally distinct ways to connect the
nodes
– Feed-forward network: connections run in only one direction
• Units are arranged in layers (with hidden units if there
is more than 1 layer)
– Recurrent network: feeds its outputs back into its own inputs
Artificial Neural Network
• Neural network for Quake
– Inputs: Enemy, Dead, Sound, Low Health
• Four-input perceptron layer
– One input for each condition
• Four-perceptron hidden layer
– Fully connected
• Five-output perceptron layer
– One output for each action: Attack, Retreat, Wander, Chase,
Spawn
– Choose the action with the highest output
– Or use probabilistic action selection
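A minimal sketch of a forward pass through this 4-4-5 architecture; the sigmoid activation and the random weights are illustrative assumptions, since the slide does not specify them:

```python
import math, random

ACTIONS = ["Attack", "Retreat", "Wander", "Chase", "Spawn"]

def layer(inputs, weights):
    """One fully connected layer: a weighted sum per unit, then a sigmoid."""
    return [1 / (1 + math.exp(-sum(w * a for w, a in zip(ws, inputs))))
            for ws in weights]

random.seed(0)
w_hidden = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)]  # 4 -> 4
w_out    = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # 4 -> 5

x = [1, 0, 1, 0]   # condition inputs: Enemy, Dead, Sound, Low Health
scores = layer(layer(x, w_hidden), w_out)
print(ACTIONS[scores.index(max(scores))])   # pick the action with the highest output
```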
Artificial Neural Network
• Feed-forward neural networks
– Single-layer network (perceptron network)
• Works only on linearly separable problems
– Multi-layer network
• Works on non-linear problems too
Artificial Neural Network
• See the cases of AND, OR, and XOR for a single-layer perceptron
– AND and OR are linearly separable, but XOR is not. How can
we solve XOR? (See the sketch below.)
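One classic solution, shown as an illustrative sketch: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)), so a two-layer network of threshold units can compute it. The weights below are hand-picked rather than learned:

```python
def step(x):
    return 1 if x >= 0 else 0

def xor_net(x1, x2):
    """A two-layer threshold network: the hidden units compute OR and NAND."""
    h_or   = step(x1 + x2 - 0.5)          # fires if at least one input is 1
    h_nand = step(1.5 - x1 - x2)          # fires unless both inputs are 1
    return step(h_or + h_nand - 1.5)      # output unit ANDs the hidden units

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # prints 0, 1, 1, 0 in turn
```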
Artificial Neural Network
• Multi-layer network
– Contains hidden units (e.g., units 3 and 4 in AIMA's example
network)
– The network's overall activation function is a complex
nonlinear composition
• How can we optimize the weights?
– Use back-propagation to perform gradient descent!
• We propagate the derivative values backward through
the network
Artificial Neural Network
• The back-propagation process can be summarized as follows
(a code sketch follows the list)
– Compute the derivative values for the output units, using
the observed error
– Starting with the output layer, repeat the following for each
layer until the earliest hidden layer is reached
• Propagate the derivative values back to the previous
layer
• Update the weights between the two layers
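A minimal sketch of these steps for a one-hidden-layer sigmoid network with squared-error loss; the layer sizes, learning rate, and XOR training set are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

W1, b1 = rng.uniform(-1, 1, (2, 4)), np.zeros(4)       # input  -> hidden
W2, b2 = rng.uniform(-1, 1, (4, 1)), np.zeros(1)       # hidden -> output
alpha = 0.5                                            # learning rate

def g(z):                                              # sigmoid activation
    return 1 / (1 + np.exp(-z))

for _ in range(20000):
    a1 = g(X @ W1 + b1)                                # forward pass
    a2 = g(a1 @ W2 + b2)
    # 1. derivative values for the output units, from the observed error
    d2 = (a2 - Y) * a2 * (1 - a2)
    # 2. propagate the derivative values back to the previous layer
    d1 = (d2 @ W2.T) * a1 * (1 - a1)
    # 3. update the weights (and biases) between the two layers
    W2 -= alpha * (a1.T @ d2);  b2 -= alpha * d2.sum(axis=0)
    W1 -= alpha * (X.T @ d1);   b1 -= alpha * d1.sum(axis=0)

print(np.round(a2.ravel(), 2))   # should approach [0, 1, 1, 0]
```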
Practical Machine Learning
• Handwritten digit recognition
– We demonstrate an application of a multilayer feed-forward
network for printed character recognition.
– For simplicity, we can limit our task to the recognition of
digits from 0 to 9. Each digit is represented by a 5 x 9
bitmap.
Practical Machine Learning
[Figure: the 5 x 9 bitmap for one digit; its 45 pixels are numbered
1 to 45, row by row from top-left to bottom-right]
Practical Machine Learning
• The number of neurons in the input layer is determined by the
number of pixels in the bitmap. The bitmap in our example
consists of 45 pixels, and thus we need 45 input neurons.
• The output layer has 10 neurons – one neuron for each digit
to be recognized.
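As a small illustrative sketch of that layout (the one-hot output encoding is an assumption):

```python
import numpy as np

n_inputs, n_outputs = 5 * 9, 10        # 45 pixel inputs, one output per digit
bitmap = np.zeros((9, 5), dtype=int)   # a 5 x 9 digit bitmap (9 rows, 5 columns)
x = bitmap.ravel()                     # pixels 1..45 flattened into the input vector
targets = np.eye(n_outputs)            # row d is the one-hot target for digit d
```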
Practical Machine Learning
[Figure: the recognition network, with 45 binary inputs (one per
pixel), a hidden layer, and 10 output neurons; the pixel values of
a digit are fed in, and the output neuron for the recognized digit
responds with 1 while all others respond with 0]
Summary
• Computational learning theory analyzes the sample
complexity and computational complexity of inductive learning
• Linear regression is a widely used model; it can be solved by
gradient descent search
• A linear classifier with a hard threshold can be trained to fit
data that are linearly separable
• Neural networks represent complex nonlinear functions with
a network of linear-threshold units
References
• Stuart Russell and Peter Norvig. 2010. Artificial Intelligence: A
Modern Approach. Pearson Education, New Jersey.
ISBN: 9780132071482
• http://aima.cs.berkeley.edu