03 NEURAL NETWORKS I
Spring 2020 CS791/CS159 Machine Learning
Credits
1. B1: Machine learning: an algorithmic perspective. 2nd Edition, Marsland, Stephen. CRC press,
2015
2. B2: Principles of Soft Computing. 3rd Edition. S. N. Sivanandam, S. N. Deepa. Wiley,
2018.
3. www.d.umn.edu/~alam0026/NeuralNetwork.ppt
4. www.ohio.edu/people/starzykj/network/Class/ee690/.../NeuralNets%20overview.ppt
5. https://www.staff.ncl.ac.uk/peter.andras/annintro.ppt
6. https://tmohammed.files.wordpress.com/2012/03/w1-01-introtonn.ppt
7. http://aass.oru.se/~lilien/ml/seminars/2007_02_01b-Janecek-Perceptron.pdf
8. http://www.cems.uvm.edu/~rsnapp/teaching/cs295ml/notes/perceptron.pdf
9. http://www.atmos.washington.edu/~dennis/MatrixCalculus.pdf
10. https://en.wikipedia.org/wiki/Matrix_calculus
11. https://data-flair.training/blogs/learning-rules-in-neural-network/
Assignment
Read:
B1: Chapter 3.
B2: Chapter 2, 3.
Problems:
B1: 3.1, 3.2, 3.3
B2: Chapter 2, 3: Solved Problems
Neural Networks
Inspired by how the human brain processes information
Neuron – the processing unit of the human brain
Neuron collects signals from others through a host of fine structures called
dendrites.
Neuron sends out spikes of electrical activity through a long, thin strand
known as an axon, which splits into thousands of branches.
At the end of each branch, a structure called a synapse converts the
activity from the axon into electrical effects that inhibit or excite activity in
the connected neurons.
An estimated 10¹¹ neurons are present in a human brain.
Each neuron is connected to thousands of other neurons.
About 10¹⁴ synapses exist in a human brain.
Input signals collected through dendrites affect the electrical potential
inside the neuron body – called membrane potential.
Spiking of neuron happens when this membrane potential crosses a certain
threshold value.
After firing, the neuron must wait for some time to recover its energy (the
refractory period) before it can fire again.
Each neuron can be seen as a separate processor doing a simple task:
whether to fire or not.
Brain is a massively parallel supercomputer with 10¹¹ processing elements
and dense interconnections.
Learning in the brain happens on the principle of plasticity:
Modifying the strength of synaptic connections between neurons, and creating
new connections.
McCulloch and Pitts Neuron Model
Set of weighted inputs 𝒙𝒊 , 𝒘𝒊 that correspond to the synapses
Adder that sums the input signals (equivalent to the membrane of the cell
that collects electrical charge)
Activation function (initially a threshold function) that decides whether
the neuron fires (‘spikes’) for the current inputs
Analogy
𝑥𝑖 = 1 if the connected input neuron fired, = 0 if it did not; an
intermediate value (e.g., 0.5) can be taken as something in between.
𝑤𝑖 denotes the strength of synaptic connection
Input signal is proportional to the strength of the synaptic weight, so we compute
ℎ = Σ_{i=1}^{m} 𝑤𝑖 𝑥𝑖
Analogy
𝜃 is the threshold (“membrane threshold”)
A simple model, which has limitations
Incapable of emulating all the behaviors of real biological neurons
A network of such neurons (Neural Network) can model whatever a computer
can do
Neurons will be updated sequentially (based on a clock)
Weights can be positive (excitatory connections) or negative (inhibitory
connections)
Inputs can also be negative or positive
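The model above — weighted inputs, an adder, and a threshold activation — can be written in a few lines. This is a minimal sketch of our own; the function name and example values are not from B1/B2:

```python
# Minimal sketch of a McCulloch-Pitts neuron (names and values are our own).
def mcp_neuron(x, w, theta):
    """Fire (return 1) if the weighted input sum crosses the threshold theta."""
    h = sum(wi * xi for wi, xi in zip(w, x))  # adder: "membrane potential"
    return 1 if h > theta else 0              # threshold activation

# Excitatory weights; the neuron fires only when enough inputs are active.
print(mcp_neuron([1, 1], [0.6, 0.6], 1.0))   # 1.2 > 1.0 -> fires: 1
print(mcp_neuron([1, 0], [0.6, 0.6], 1.0))   # 0.6 <= 1.0 -> stays quiet: 0
```

Negative (inhibitory) weights and negative inputs work unchanged, since the adder is a plain dot product.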
How does Neuron learn?
Inputs cannot change
Only weights and threshold function can change
Learning in a neural network:
How to change weights and threshold functions of the neurons so that the
neural network gives correct output
The Perceptron
A set of McCulloch and Pitts Neurons joined by Weighted Connections
The Perceptron
Adder not explicitly shown
There can be m inputs and n outputs
𝑚 ≠ 𝑛 or 𝑚 = 𝑛
𝑤𝑖𝑗 represents the weight on the signal from the i-th input to the j-th neuron,
1 ≤ 𝑖 ≤ 𝑚 and 1 ≤ 𝑗 ≤ 𝑛
Learning Rules in Neural Networks
Perceptron learning rule
Hebbian Learning Rule
Delta learning rule or Widrow-Hoff rule
Perceptron Learning Rule
Supervised Learning Approach
The modification in the synaptic weight of a node is equal to the
product of the error and the input:
𝑤𝑖𝑗 ← 𝑤𝑖𝑗 + 𝜂 (𝑡𝑗 − 𝑦𝑗) ∙ 𝑥𝑖
or, 𝑤𝑖𝑗 ← 𝑤𝑖𝑗 − 𝜂 (𝑦𝑗 − 𝑡𝑗) ∙ 𝑥𝑖
where
𝑦𝑗 : actual output at the j-th neuron
𝑡𝑗 : target output corresponding to the j-th neuron
𝜂: learning rate
Input 𝑥𝑖 , target 𝑡𝑗 and output 𝑦𝑗 are beyond our control
𝑤𝑖𝑗 and 𝜂 are what we can change
High value of 𝜂: learning is too fast (dramatic) and the system may
never stabilize
Low value of 𝜂: learning is too slow – the system has to see the input
many times before it learns, but it is more resistant to noise
Ideally 0.1 < 𝜂 < 0.4
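The update rule above can be transcribed directly. This is our own minimal sketch, with 𝑤 stored as an m × n list of lists (one row per input, one column per neuron):

```python
# Direct transcription of w_ij <- w_ij + eta*(t_j - y_j)*x_i (our own sketch).
def perceptron_update(w, x, t, y, eta=0.25):
    m, n = len(x), len(t)
    return [[w[i][j] + eta * (t[j] - y[j]) * x[i] for j in range(n)]
            for i in range(m)]

w = [[0.0], [0.2]]   # m=2 inputs, n=1 neuron
# Neuron should have fired (t=1) but did not (y=0): weight on the active
# input x_0=1 increases by eta; the inactive input x_1=0 is untouched.
print(perceptron_update(w, x=[1, 0], t=[1], y=[0]))  # [[0.25], [0.2]]
```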
Bias Input
What if all inputs are zero and we want one or more neurons to fire?
Solution:
Introduce a non-zero (say −1) “bias” input indexed at 0
Introduce weights 𝑤0𝑗 : the weight of the bias input to the j-th neuron.
Perceptron Algorithm
Algorithmic complexity?
Ο(𝑇𝑚𝑛𝑘)
T: #iterations
m: #inputs
n: #outputs
k: #samples
Simulating OR output
Bias
Inputs
Take 𝑤0 = −0.05, 𝑤1 = −0.02, 𝑤2 = 0.02, 𝜂 = 0.25
Let us iterate
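The iteration can be sketched as follows, using the weights and learning rate above on the OR data; the bias input is taken as −1 and the code layout is our own:

```python
# Perceptron training on OR with the slide's starting weights (our sketch).
def train_perceptron(X, T, w, eta=0.25, iters=10):
    # Each row of X already includes the bias input -1 as its first element.
    for _ in range(iters):
        for x, t in zip(X, T):
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]
    return w

X = [[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]]  # bias -1 + two inputs
T = [0, 1, 1, 1]                                      # OR targets
w = train_perceptron(X, T, [-0.05, -0.02, 0.02])
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0 for x in X]
print(preds)  # [0, 1, 1, 1] once the weights have converged
```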
B1 vs. B2:
B1: 𝑤𝑛𝑒𝑤 ← 𝑤𝑜𝑙𝑑 − 𝜂 (𝑦 − 𝑡) 𝑥
(assuming binary data, using the Perceptron Rule)
B2: 𝑤𝑛𝑒𝑤 ← 𝑤𝑜𝑙𝑑 + 𝛼𝑡𝑥
(assuming bipolar data, using the Perceptron Rule)
Hebb’s Rule
Donald Hebb in 1949
Changes in the strength of synaptic connections are proportional to the
correlation in the firing of the two connecting neurons.
If two neighboring neurons activate and deactivate at the same time, then the
weight connecting these neurons should increase.
For neurons operating in opposite phases, the weight between them should
decrease.
If there is no signal correlation, the weight should not change / the connection
should die away.
Δ𝑤𝑖𝑗 ← 𝑥𝑖 × 𝑦𝑗 ;  𝑥𝑖 , 𝑦𝑗 ∈ {−1, 1}
Generally, the activation function is the linear identity function, 𝑡𝑗 = 𝑓(𝑦𝑗 ) = 𝑦𝑗
At the start, values of all weights are set to zero
Unsupervised learning rule
Target values are not used
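A small sketch of the rule under these assumptions — bipolar units, weights starting at zero, no targets; all names are our own:

```python
# Hebb's rule sketch: dw_ij = x_i * y_j for bipolar values (our own names).
def hebb_train(samples, n_in, n_out):
    w = [[0.0] * n_out for _ in range(n_in)]   # weights start at zero
    for x, y in samples:                       # x, y components in {-1, +1}
        for i in range(n_in):
            for j in range(n_out):
                w[i][j] += x[i] * y[j]         # correlated firing strengthens
    return w

# Two correlated firings raise the weight; one anti-correlated firing lowers it.
w = hebb_train([([1], [1]), ([-1], [-1]), ([1], [-1])], n_in=1, n_out=1)
print(w)  # [[1.0]]  (+1 +1 -1)
```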
Delta learning rule
Similar to perceptron rule, but
Based on minimization or LMS (Least Mean Square) error using Gradient
Descent Technique
Works for differentiable activation functions (e.g., linear) vs. the step
function in perceptron rule
Perceptron rule is guaranteed to converge if the data is linearly separable,
but the gradient-descent approach continues forever, converging only
asymptotically to the solution (tries to minimize error in case of inseparable
data).
We will stick to perceptron rule for now on, will discuss Gradient
Descent later
𝑤 is an (m: #dimensions of input, n: #neurons or #dimensions of output)
matrix
𝑦 and 𝑡 are each a (1, n: #neurons) matrix
𝑥 is a (1, m: #dimensions of input) matrix
𝑤 ← 𝑤 − 𝜂 𝑥^T (𝑦 − 𝑡)
Element-wise:
𝑤𝑖𝑗 ← 𝑤𝑖𝑗 − 𝜂 Σ_{k=1}^{1} 𝑥^T_{ik} (𝑦_{kj} − 𝑡_{kj}) = 𝑤𝑖𝑗 − 𝜂 Σ_{k=1}^{1} 𝑥_{ki} (𝑦_{kj} − 𝑡_{kj})
𝑥_{ki}: value of the i-th (dimension of) input
(𝑦_{kj} − 𝑡_{kj}): difference (predicted − target) at the j-th neuron
Batch Mode Learning
Let input dataset have 𝑠 samples, each sample have 𝑚 inputs (i.e., 𝑚
dimensions) and let there be 𝑛 neurons
Input dataset 𝑥 is an (s, m: #dimensions of input) matrix
𝑦 and 𝑡 each is an (s, n: #neurons) matrix
𝑤 is an (m: #dimensions of input, n: #neurons) matrix
Algorithm
For 𝑃 iterations do:
Predict 𝑦 for all 𝑠 input samples
Update 𝑤 for the combined effect of all 𝑠 input samples, i.e.,
𝑤𝑖𝑗 ← 𝑤𝑖𝑗 − 𝜂 Σ_{k=1}^{s} 𝑥_{ki} (𝑦_{kj} − 𝑡_{kj}) = 𝑤𝑖𝑗 − 𝜂 Σ_{k=1}^{s} 𝑥^T_{ik} (𝑦_{kj} − 𝑡_{kj})
i.e., 𝑤 ← 𝑤 − 𝜂 𝑥^T (𝑦 − 𝑡)
Batch mode often works better (than updating after every sample)
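The batch update can be sketched with NumPy; the OR data and starting weights from the earlier example are reused, and the array shapes follow the matrices defined above:

```python
import numpy as np

# Batch-mode sketch: accumulate the error over all s samples, then apply
# w <- w - eta * x^T (y - t) once per iteration (layout is our own).
def batch_update(w, x, t, eta=0.25):
    y = (x @ w > 0).astype(float)     # predict for all s samples at once
    return w - eta * x.T @ (y - t)    # combined update, shape (m, n)

x = np.array([[-1., 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]])  # bias col -1
t = np.array([[0.], [1], [1], [1]])                              # OR targets
w = np.array([[-0.05], [-0.02], [0.02]])
for _ in range(20):
    w = batch_update(w, x, t)
print((x @ w > 0).astype(int).ravel())   # [0 1 1 1] after training
```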
Bias
Inputs
Take 𝑤0 = −0.05, 𝑤1 = −0.02, 𝑤2 = 0.02, 𝜂 = 0.25
Let us iterate in batch mode.
Decision Boundary for OR function
Perceptron tries to find a straight line (in 2D, a
plane in 3D, and a hyperplane in higher
dimensions) – called decision boundary.
What is the decision boundary and how is it a line? (for 2D case)
Activation Value (say):
Σ_{i=0}^{m} 𝑤𝑖𝑗 𝑥𝑖 = 𝑥 ⋅ 𝑤𝑗
where 𝑤𝑗 is the column vector corresponding to the j-th neuron.
The j-th neuron fires if 𝑥 ⋅ 𝑤𝑗 > 0 and does not fire otherwise
So, the j-th neuron acts as a two-class classifier:
Class I: 𝑥 ⋅ 𝑤𝑗 > 0
Class II: 𝑥 ⋅ 𝑤𝑗 ≤ 0
𝑥 ⋅ 𝑤𝑗 = 0 can be considered as the decision boundary for the j-th neuron
For 2-D OR case with 1 neuron, this becomes:
𝑥0 𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 = 0
−𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 = 0 (the −𝑤0 term comes from the bias input 𝑥0 = −1)
The above is the equation for a straight line.
The line −𝑤0 + 𝑥1 𝑤1 + 𝑥2 𝑤2 = 0 lies at distance 𝑤0 / √(𝑤1² + 𝑤2²) from the origin.
Another Perspective
Let 𝑥^(1) = (−1, 𝑥1^(1), 𝑥2^(1)) and 𝑥^(2) = (−1, 𝑥1^(2), 𝑥2^(2)) be two points on the
decision boundary. Then
𝑥^(1) ⋅ 𝑤𝑗 = 0 and 𝑥^(2) ⋅ 𝑤𝑗 = 0, i.e.,
(𝑥^(1) − 𝑥^(2)) ⋅ 𝑤𝑗 = 0
That is, the vector 𝑤𝑗 is perpendicular to the line 𝑥^(1) − 𝑥^(2), and this holds
for any two points 𝑥^(1) and 𝑥^(2) on the decision boundary.
Hence, decision boundary is a line and 𝑤𝑗 is a vector perpendicular to it.
Decision boundary is a line in 2D case, plane in 3D case and hyperplane
in higher dimensions.
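The perpendicularity argument can be checked numerically. The weights below are illustrative values of our own (bias input −1, as in the OR example), not from the text:

```python
import numpy as np

# For one neuron with bias input -1 and weights (w0, w1, w2), the boundary
# -w0 + x1*w1 + x2*w2 = 0 is the line x2 = (w0 - w1*x1) / w2 (our sketch).
w0, w1, w2 = 0.2, 0.23, 0.27          # illustrative weights (our own)

def boundary_x2(x1):
    return (w0 - w1 * x1) / w2

# Any two boundary points p, q satisfy (p - q) . (w1, w2) = 0, i.e. the
# weight vector is perpendicular to the decision boundary.
p = np.array([0.0, boundary_x2(0.0)])
q = np.array([1.0, boundary_x2(1.0)])
print(np.dot(p - q, [w1, w2]))        # ~0: perpendicularity check
```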
Convergence Theorem
If the data is linearly separable, the fixed-increment perceptron
algorithm terminates after a finite number of weight updates.
Proof taken from the slides by Prof. Robert Snapp, Department of
Computer Science, University of Vermont, Vermont, USA as part of his
course CS 295: Machine Learning
Proof of Convergence Theorem
Consider a single neuron.
Let 𝑤 represent the weight vector.
Let 𝑥𝑖 represent the i-th sample vector
Let 𝑡𝑖 represent the target label of the i-th sample, 𝑡𝑖 ∈ {0, 1}
Let the activation function be:
𝑦𝑖 = 1 if 𝑥𝑖 𝑤 > 0
𝑦𝑖 = 0 if 𝑥𝑖 𝑤 ≤ 0
Update rule
𝑤 = 𝑤 − 𝜂𝑥𝑖𝑇 (𝑦𝑖 − 𝑡𝑖 )
Let
𝑙𝑖 = −1 if 𝑡𝑖 = 0
𝑙𝑖 = +1 if 𝑡𝑖 = 1
Then, on a misclassified sample, the update rule
𝑤 = 𝑤 − 𝜂 𝑥𝑖^T (𝑦𝑖 − 𝑡𝑖)
becomes
𝑤 += 𝜂 𝑥𝑖^T 𝑙𝑖
Let 𝑤* represent a solution that separates the given data.
Let 𝑥̃𝑖 = 𝑥𝑖 𝑙𝑖
Then,
𝑥̃𝑖 𝑤* > 0, ∀𝑖
And, the weight update becomes
𝑤 += 𝜂 𝑥̃𝑖^T
Let 𝑤(𝑘) represent the weight vector after the k-th update.
Let 𝑥̃(𝑘) represent the input sample that triggered the k-th update.
Thus,
𝑤(1) = 𝑤(0) + 𝜂 𝑥̃^T(1)
𝑤(2) = 𝑤(1) + 𝜂 𝑥̃^T(2)
⋮
𝑤(𝑘) = 𝑤(𝑘−1) + 𝜂 𝑥̃^T(𝑘)
We shall prove
𝐴𝑘² ≤ ‖𝑤(𝑘) − 𝑤(0)‖² ≤ 𝐵𝑘
for constants A and B.
Thus, the network must converge after no more than 𝑘max = 𝐵/𝐴
updates.
Cauchy-Schwarz Inequality
Let 𝑎, 𝑏 ∈ ℝⁿ. Then
‖𝑎‖² ‖𝑏‖² ≥ (𝑎^T 𝑏)²
𝑤(1) = 𝑤(0) + 𝜂 𝑥̃^T(1)
𝑤(2) = 𝑤(1) + 𝜂 𝑥̃^T(2)
⋮
𝑤(𝑘) = 𝑤(𝑘−1) + 𝜂 𝑥̃^T(𝑘)
Adding the above 𝑘 equations yields
𝑤(𝑘) = 𝑤(0) + 𝜂 (𝑥̃^T(1) + 𝑥̃^T(2) + ⋯ + 𝑥̃^T(𝑘))
𝑤(𝑘) − 𝑤(0) = 𝜂 (𝑥̃^T(1) + 𝑥̃^T(2) + ⋯ + 𝑥̃^T(𝑘))
Multiplying both sides by the solution 𝑤*^T:
𝑤*^T (𝑤(𝑘) − 𝑤(0)) = 𝜂 𝑤*^T (𝑥̃^T(1) + 𝑥̃^T(2) + ⋯ + 𝑥̃^T(𝑘))
Let
𝑎 = min_𝑥̃ 𝑤*^T 𝑥̃^T > 0
Thus,
𝑤*^T (𝑤(𝑘) − 𝑤(0)) ≥ 𝜂𝑎𝑘 > 0
Squaring both sides and applying the Cauchy-Schwarz inequality yields
‖𝑤*‖² ‖𝑤(𝑘) − 𝑤(0)‖² ≥ (𝑤*^T (𝑤(𝑘) − 𝑤(0)))² ≥ (𝜂𝑎𝑘)²
Thus,
‖𝑤(𝑘) − 𝑤(0)‖² ≥ (𝜂𝑎 / ‖𝑤*‖)² 𝑘²
This gives the lower bound.
Proof: Upper Bound
𝑤(1) = 𝑤(0) + 𝜂 𝑥̃^T(1)
𝑤(2) = 𝑤(1) + 𝜂 𝑥̃^T(2)
⋮
𝑤(𝑘) = 𝑤(𝑘−1) + 𝜂 𝑥̃^T(𝑘)
Subtracting 𝑤(0) from both sides yields
𝑤(1) − 𝑤(0) = 𝜂 𝑥̃^T(1)
𝑤(2) − 𝑤(0) = (𝑤(1) − 𝑤(0)) + 𝜂 𝑥̃^T(2)
⋮
𝑤(𝑘) − 𝑤(0) = (𝑤(𝑘−1) − 𝑤(0)) + 𝜂 𝑥̃^T(𝑘)
Squaring both sides yields
‖𝑤(1) − 𝑤(0)‖² = 𝜂² ‖𝑥̃^T(1)‖²
‖𝑤(2) − 𝑤(0)‖² = ‖𝑤(1) − 𝑤(0)‖² + 2𝜂 (𝑤(1) − 𝑤(0))^T 𝑥̃^T(2) + 𝜂² ‖𝑥̃^T(2)‖²
⋮
‖𝑤(𝑘) − 𝑤(0)‖² = ‖𝑤(𝑘−1) − 𝑤(0)‖² + 2𝜂 (𝑤(𝑘−1) − 𝑤(0))^T 𝑥̃^T(𝑘) + 𝜂² ‖𝑥̃^T(𝑘)‖²
Since 𝑥̃^T(1) triggers an update, it must have been misclassified by the weight
vector 𝑤(0), i.e., 𝑤(0)^T 𝑥̃^T(1) < 0
Similarly,
𝑤(𝑗−1)^T 𝑥̃^T(𝑗) < 0, for 𝑗 = 1, 2, …, 𝑘
Hence, since (𝑤(𝑗−1) − 𝑤(0))^T 𝑥̃^T(𝑗) = 𝑤(𝑗−1)^T 𝑥̃^T(𝑗) − 𝑤(0)^T 𝑥̃^T(𝑗) ≤ −𝑤(0)^T 𝑥̃^T(𝑗):
‖𝑤(1) − 𝑤(0)‖² = 𝜂² ‖𝑥̃^T(1)‖²
‖𝑤(2) − 𝑤(0)‖² ≤ ‖𝑤(1) − 𝑤(0)‖² − 2𝜂 𝑤(0)^T 𝑥̃^T(2) + 𝜂² ‖𝑥̃^T(2)‖²
⋮
‖𝑤(𝑘) − 𝑤(0)‖² ≤ ‖𝑤(𝑘−1) − 𝑤(0)‖² − 2𝜂 𝑤(0)^T 𝑥̃^T(𝑘) + 𝜂² ‖𝑥̃^T(𝑘)‖²
Summing the 𝑘 inequalities yields
‖𝑤(𝑘) − 𝑤(0)‖² ≤ 𝜂² (‖𝑥̃^T(1)‖² + ‖𝑥̃^T(2)‖² + ⋯ + ‖𝑥̃^T(𝑘)‖²) − 2𝜂 𝑤(0)^T (𝑥̃^T(2) + ⋯ + 𝑥̃^T(𝑘))
Define
𝑀 = max_𝑥̃ ‖𝑥̃^T‖²
𝜇 = 2 min_𝑥̃ 𝑤(0)^T 𝑥̃^T < 0 (over the misclassifications)
The inequality above then becomes
‖𝑤(𝑘) − 𝑤(0)‖² ≤ (𝜂²𝑀 − 𝜂𝜇) 𝑘
Hence, we have shown
𝐴𝑘² ≤ ‖𝑤(𝑘) − 𝑤(0)‖² ≤ 𝐵𝑘
with
𝐴 = (𝜂𝑎 / ‖𝑤*‖)² and 𝐵 = 𝜂²𝑀 − 𝜂𝜇
Thus,
𝑘max = 𝐵/𝐴 = (𝜂𝑀 − 𝜇) ‖𝑤*‖² / (𝜂𝑎²)
LINEAR SEPARABILITY
A straight line decision boundary may not always exist
Linearly separable cases – when a straight (linear) decision boundary
is possible
Multiple Neurons May Help!
XOR Function – Linearly Inseparable
XOR – separable in 3D
Added Dimension
It is always possible to separate out two classes with a linear function,
provided that you project the data into the correct set of dimensions.
Kernel classifiers – basis of Support Vector Machines
Data Normalization/Standardization
Scaling input data to lie in (−1, +1)
Additionally, with zero mean and unit variance – a little better, as it does not
allow outliers to dominate as much:
𝑥 = (𝑥 − 𝜇)/𝜎
Discretization: partitioning data by range into integral values
Choosing a subset of features can improve accuracy
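The zero-mean, unit-variance standardization above can be sketched as follows (function name and sample data are our own):

```python
import numpy as np

# Standardization sketch: x = (x - mu) / sigma, applied per feature column.
def standardize(X):
    mu = X.mean(axis=0)        # per-feature mean
    sigma = X.std(axis=0)      # per-feature standard deviation
    return (X - mu) / sigma

# The second feature has a much larger scale; after standardization both
# columns have zero mean and unit variance.
X = np.array([[1., 200.], [2., 300.], [3., 400.]])
Z = standardize(X)
print(Z.mean(axis=0))   # ~[0. 0.]
print(Z.std(axis=0))    # [1. 1.]
```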
LINEAR REGRESSION
Classification: find a line that separates out the classes
Regression: fit a line to data
Classification as instance of Regression
1. Fit a line to the target data
2. Do regression for each class separately, i.e., fit a line to the data points of
each class separately
In regression, we are computing lines (in 2D) that can predict target
values closely, i.e., 𝑦 = 𝛽1 𝑥 + 𝛽0
General form:
𝑦 = Σ_{i=0}^{M} 𝛽𝑖 𝑥𝑖
where 𝑀 is the #dimensions of an input vector
𝛽 = (𝛽0 , 𝛽1 … , 𝛽𝑀 ) defines a line in 2-D, plane in 3-D and hyperplane
in higher dimensions.
Linear regression in two and three dimensions
How do we define the line/plane/hyperplane that best fits the data?
Minimize the distance between the line and the data points.
Least-squares Optimization
Minimize the sum of squared errors over the data: Σ_{k=1}^{N} (𝑡𝑘 − 𝑥𝑘 𝛽)²
where
N: #data points
M: #dimensions of input vector
𝑥𝑘 : the k-th input vector
In matrix form, the above can be written as
𝑡 − 𝑋𝛽 𝑇 (𝑡 − 𝑋𝛽)
Where,
𝑡 is an (𝑁 × 1) vector containing target values
𝑋 is an (𝑁 × 𝑀) matrix denoting input values (including bias)
𝑋𝑖𝑗 : denotes the value of the j-th dimension of the i-th input vector
𝛽 is an (𝑀 × 1) vector defining the hyperplane.
To minimize least-squares error:
𝑑((𝑡 − 𝑋𝛽)^T (𝑡 − 𝑋𝛽)) / 𝑑𝛽 = 0
𝑑((𝑡^T − 𝛽^T 𝑋^T)(𝑡 − 𝑋𝛽)) / 𝑑𝛽 = 0
𝑑(𝑡^T 𝑡)/𝑑𝛽 − 𝑑(𝑡^T 𝑋𝛽)/𝑑𝛽 − 𝑑(𝛽^T 𝑋^T 𝑡)/𝑑𝛽 + 𝑑(𝛽^T 𝑋^T 𝑋𝛽)/𝑑𝛽 = 0
0 − 𝑡^T 𝑋 − 𝑡^T 𝑋 + 𝛽^T (𝑋^T 𝑋 + (𝑋^T 𝑋)^T) = −2𝑡^T 𝑋 + 2𝛽^T 𝑋^T 𝑋 = 0
𝛽^T 𝑋^T 𝑋 − 𝑡^T 𝑋 = 0
𝑋^T (𝑋𝛽 − 𝑡) = 0
Hence, 𝛽 = (𝑋^T 𝑋)^{−1} 𝑋^T 𝑡 [assuming (𝑋^T 𝑋)^{−1} exists]
The following links may be helpful in finding matrix calculus identities
used in the previous proof:
https://en.wikipedia.org/wiki/Matrix_calculus
https://en.wikipedia.org/wiki/Matrix_calculus#Vector-by-vector
http://www.math.nyu.edu/~neylon/linalgfall04/project1/dj/proptranspose.htm
Fill in the details in the proof (left as homework assignment)
Linear Regression for AND, OR and XOR
Inputs   AND data   OR data   XOR data
[0,0]    -0.25       0.25      0.5
[0,1]     0.25       0.75      0.5
[1,0]     0.25       0.75      0.5
[1,1]     0.75       1.25      0.5
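The table can be reproduced with the closed form 𝛽 = (𝑋^T 𝑋)^{−1} 𝑋^T 𝑡 derived above; taking the bias input as 1 is an assumption of this sketch:

```python
import numpy as np

# Reproducing the AND/OR/XOR regression table via the normal equations.
# The bias input is taken as 1 here (our assumption for this sketch).
X = np.array([[1., 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
targets = {"AND": [0, 0, 0, 1], "OR": [0, 1, 1, 1], "XOR": [0, 1, 1, 0]}
preds = {}
for name, t in targets.items():
    t = np.array(t, dtype=float)
    beta = np.linalg.inv(X.T @ X) @ X.T @ t   # beta = (X^T X)^{-1} X^T t
    preds[name] = X @ beta                    # fitted values for each input
    print(name, preds[name])
# AND -> [-0.25 0.25 0.25 0.75], OR -> [0.25 0.75 0.75 1.25], XOR -> all 0.5
```

Note the XOR column: the best linear fit predicts 0.5 everywhere, which is another way of seeing that XOR is linearly inseparable.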
Miscellaneous Topics
Adaline: Adaptive Linear Neuron
A single linear unit that uses the input to the activation function (the activation
potential) for calculating the error, rather than the output of the activation function
Update Rule
𝑤𝑖 ← 𝑤𝑖 − 𝜂 (𝑦𝑖𝑛 − 𝑡) ∙ 𝑥𝑖
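A sketch of the Adaline rule (names, data, and learning rate are our own): the error is computed from the activation potential itself, not from a thresholded output:

```python
# Adaline sketch: error uses the activation potential y_in = w . x directly.
def adaline_step(w, x, t, eta=0.1):
    y_in = sum(wi * xi for wi, xi in zip(w, x))      # activation potential
    return [wi - eta * (y_in - t) * xi for wi, xi in zip(w, x)]

# Learn t = x1 from two samples; the first component is a bias input of 1.
w = [0.0, 0.0]
for _ in range(500):
    for x, t in [([1, 0], 0), ([1, 1], 1)]:
        w = adaline_step(w, x, t)
print(w)   # converges to approximately [0.0, 1.0]
```

Because the error is real-valued, the weights keep shrinking the residual smoothly instead of jumping in fixed increments as the perceptron rule does.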
Madaline: Multiple adaptive linear neurons
Many Adalines in parallel with a single output unit
Output is based on selection rule (e.g., max, AND)
𝑣𝑖 s are fixed, positive, and possess a common value
Training is like Adaline:
1. Let 𝑧𝑗 = 𝑓(𝑧in𝑗 ) denote the output of 𝑗 th Adaline unit
2. If the final output does not match the target:
𝑤𝑖𝑗 ← 𝑤𝑖𝑗 − 𝜂 (𝑧𝑗 − 𝑡) ∙ 𝑥𝑖
ANNs Based on Connections
Single-layer feed-forward network
Multilayer feed-forward network
Single node with its own feedback
Single-layer recurrent network
Multilayer recurrent network.
Single Layer Feed-Forward Network
Multi Layer Feed-Forward Network
It may or may not be fully connected
Single Node with Own Feedback
Lateral Feedback: feedback to the same layer
Recurrent Networks: feedback networks with closed loop
Single Layer Recurrent Neural Network
Multi Layer Recurrent Neural Network