Karthika
Kamath
TYBTech Electronics
Semester 6
Veermata Jijabai Technological Institute
Professor: Rizwan Ahmed
Neural Networks & Fuzzy Logic
Lecture Summary
___
Lecture 1
Introduction to the course
What is a signal? What is a system? What is an LTI system? What is convolution?
Signal: conveys information. System: categorised as per its input and output signals.
An LTI (Linear Time Invariant) system is unbiased and its output does not depend on the time at
which it is excited. Everything we studied earlier, like transfer functions and convolution, applies to
LTI systems. Convolution gives the output for any general input; remember "Flip Shift Multiply
Add". It is like a fashion show: the participants are lined up and the judges sit in a row on the panel.
The first participant comes to the first judge and gets evaluated, then shifts to the second judge while
the next participant comes to the first judge; after this evaluation there is again a shift, with the third
participant joining the first judge, and the process continues till the end, with the final result given
cumulatively by all the judges. This is nothing but a demonstration of the "Flip Shift Multiply Add"
that convolution follows. But are real-world systems LTI?
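A minimal sketch of "Flip Shift Multiply Add" for discrete convolution; the signals x and h below are made-up examples, not from the lecture.

```python
import numpy as np

def convolve(x, h):
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):                 # each output sample
        for k in range(len(x)):             # shift x across the flipped h
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]     # multiply and accumulate
    return y

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.5, 0.5])                    # a simple averaging system
print(convolve(x, h))                       # matches np.convolve(x, h)
```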
What is biomimicry?
Imitate your own brain. And this is not programming; this is "learning", observing and
adapting. Machines are better at sequential processing than brains: you provide a sequence of
instructions and the machine executes it diligently. But that is not the case with our brain; it
functions in parallel, and this "parallelism" needs to be transferred to machines. How do we
learn and remember things? A brain is basically a culmination of a lot of learning theories.
"Machines should be good at things brains are good at (like vision and sensing) and bad at
things brains are bad at (like huge mathematical calculations)." What this course deals with is
parallel computation inspired by neurons.
A demonstration!
When two people say "Good Morning" they sound different, and thus we can identify them. Though
the range of frequencies in saying "Good Morning" is the same, what differs is that they have
different fundamental frequencies (analogy: the modulating signal is the same but the carrier
changes). As a matter of fact, males have the lowest fundamental frequencies, females higher, and
infants the highest; that is why male voices are less screeching. Another fact: a telephone system
cannot accommodate all the different fundamental frequencies, and thus two people can often sound
similar over the phone. So, to conclude, these fundamental frequencies can become the
distinguishing features for speaker or speech recognition.
___
Lecture 2
Modelling of Neurons
What is analysis? What is synthesis?
Analysis is when the input and the system are given and the output is to be found; this output is
unique. Synthesis is when the input and output are given and the system needs to be designed, which
is not necessarily unique. For an LTI system, the Transfer Function (TF), which gives the relation
between output and input, could be one way of modelling, with the poles and zeros being the
parameters of the TF. Another example of modelling is when we consider the heights of students in
the class: these would follow a normal distribution with the average height at the peak of the curve.
This would be a probabilistic model rather than a deterministic model. Probabilistic models are
better to work with because they give the answer in terms of the probability of it being correct or
wrong (like Google image search results) rather than declaring the answer as either totally correct or
totally wrong (0 or 1 logic).
What are Features and Classes?
Features (the distinguishing factors) serve as the inputs to the system and classes are like the
outputs. So, given a feature, the system should tell which class it belongs to. Neuron modelling is
done without considering the constraints of an LTI system. The neuron is the hardware, with input
from the dendrites, the nucleus being the system and the output given through the axon, and this
needs to be modelled.
"It is often worth understanding models that are known to be wrong."
Some simple models of neurons are listed below (a small sketch of these activation functions follows the list)…
● Linear Neurons
○ y = b + Σᵢ xᵢwᵢ (y: class, b: bias, wᵢ: weight, xᵢ: feature, i: index of the i-th input)
○ b and w are the parameters of the model, which are computed as the neuron learns.
○ The concept of hyperplanes was introduced; for example, a line is the hyperplane in a
2D feature space.
● Binary Threshold Neuron
○ Below a threshold the output is taken as 0 and above it as 1.
● Rectified Linear Neuron
○ Below a threshold the output is 0, and above it the input is passed through as a linear
function.
○ Basically a combination of the previous two.
● Sigmoid Neuron
○ It is a smooth, general version of the binary threshold.
○ The output lies between 0 and 1, and is thus "probabilistic".
○ It is also differentiable.
○ This model is better than the previous three.
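A minimal sketch of the four neuron models above; the feature, weight and bias values are made up for illustration.

```python
import numpy as np

def linear(x, w, b):                 # y = b + sum_i x_i * w_i
    return b + np.dot(w, x)

def binary_threshold(x, w, b):       # 0 below the threshold, 1 above
    return 1.0 if linear(x, w, b) >= 0 else 0.0

def relu(x, w, b):                   # rectified linear: 0 below, linear above
    return max(0.0, linear(x, w, b))

def sigmoid(x, w, b):                # smooth, differentiable, output in (0, 1)
    return 1.0 / (1.0 + np.exp(-linear(x, w, b)))

x = np.array([0.5, -1.0])            # example features
w = np.array([2.0, 1.0])
b = 0.5
print(linear(x, w, b), binary_threshold(x, w, b), relu(x, w, b), sigmoid(x, w, b))
```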
___
Lecture 3
Stochastic Binary Neurons
z = wᵀx (with the bias absorbed as b = w₀x₀, where x₀ = 1); this z is given to the sigmoidal
activation function and g(z) is the output.
Features as well as the corresponding classes are given as inputs, and from these the system
parameters, i.e. the weights, are computed.
The examples of the AND and OR logic gates were taken and their weights were computed from the
truth table, which has two features and four samples. Each can be viewed as one neuron with three
inputs, i.e. two features and one bias, giving an output. The weights were computed easily because,
when observed in the feature plot, the two classes (0 and 1) are linearly separable.
Next the classic example of XOR was taken, where the calculation of weights using one neuron fails
because, when viewed in the feature plot, the classes cannot be separated linearly. It can be solved
only when another layer of neurons is introduced, thus forming a multilayer perceptron network.
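A small sketch (my own, not from the lecture): a single binary-threshold neuron trained with the perceptron rule learns AND, but fails on XOR because XOR is not linearly separable; a hidden layer (multilayer perceptron) is needed for that.

```python
import numpy as np

def train_single_neuron(X, y, epochs=50, eta=0.1):
    X = np.hstack([np.ones((len(X), 1)), X])      # bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, ti in zip(X, y):
            out = 1.0 if np.dot(w, xi) >= 0 else 0.0
            w += eta * (ti - out) * xi            # perceptron weight update
    preds = (X @ w >= 0).astype(float)
    return w, preds

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(train_single_neuron(X, np.array([0, 0, 0, 1]))[1])   # AND: learned correctly
print(train_single_neuron(X, np.array([0, 1, 1, 0]))[1])   # XOR: cannot be learned by one neuron
```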
___
Lecture 4 & 5
Errors and Lebesgue Functions
Error, as we all know, is the desired output minus the actual output. But it can be computed either as
an absolute error, by taking the modulus of the difference, or as a mean square error, by taking the
square of the difference. So which one should we prefer, the absolute error or the rms/ms error?
RMS is nothing but the distance formula giving the shortest path, while the absolute distance is
similar to what is called the Manhattan (city-block) distance.
If the sum of a signal converges it can be called summable, e.g. x(n) = (½)ⁿ for n ≥ 0, for which
Σₙ x(n) converges.
If a signal's energy can be found in some domain then it is square summable (e.g. sinc is not
absolutely summable, but its Fourier transform is rectangular and its energy can be found, so it is
square summable). If the signal is neither an energy nor a power signal then it is not square
summable.
So which error to consider depends on the function: if it is summable then the absolute error is
considered, and if it is square summable then the mean square error is considered.
The family of functions that are absolutely summable belongs to L¹(R) and the ones that are square
summable belong to L²(R). So the absolute error is the L1 norm and the rms error is the L2 norm.
The concept of the dot product was discussed; it is basically used to find similarities or
dissimilarities, or say the common energy. The entire 2D vector space here is L²(R), a Hilbert space.
Example: for the set of numbers 1, 2, 1, 0, 1000, 3, 1, a very high-order norm, say the 100th root of
the sum of the 100th powers of the terms, is approximately 1000, i.e. the maximum of the sequence;
in the limit this is the L∞ norm. This is for a discrete sequence; for a function we instead get the
supremum, which forms the upper envelope of an amplitude-modulated wave, and similarly the
infimum (the L−∞ norm) forms the lower envelope.
___
Lecture 6
Back Propagation Algorithm
There is an input layer, one or more hidden layers and an output layer. Using a signal-flow diagram,
the mathematical significance of the back-propagation algorithm was described. The idea is that the
weights are assigned randomly to start with and the inputs are provided to the network; the output
obtained is compared with the desired output, and this error is differentiated with respect to the
weights and multiplied by the learning parameter (eta), and thus the weights are adjusted going in
reverse from the output layer to the input layer.
The concept of local and global gradients was introduced. The local gradient is for the intermediate
or hidden layers and the global gradient is for the last layer.
The delta rule of learning, i.e. a trial-and-error method supported by differential calculus, was
discussed; it gives the weight correction for the hidden layers as (learning parameter) × (local
gradient) × (output of the previous layer), as sketched after the list below.
Thus the error is propagated from output to input, and the network is then checked again with the
new weights; the procedure is repeated till the optimum condition. The stopping criteria for this
training algorithm could be:
● Converge when the error, i.e. the L2 norm (Euclidean norm) of the gradient vector, reaches
some small threshold.
● Converge when the absolute rate of change of the average squared error per epoch (when all
input combinations are given) is very small.
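A minimal sketch (assumptions mine) of the delta-rule update Δw = η × (local gradient) × (output of the previous layer), applied to a tiny 2-2-1 sigmoid network on XOR. It only illustrates the update, not a full, tuned training run.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))    # hidden-layer weights (last column = bias)
W2 = rng.normal(size=(1, 3))    # output-layer weights (last column = bias)
eta = 0.5                       # learning parameter

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 1, 1, 0], dtype=float)     # XOR targets

for epoch in range(5000):
    for x, t in zip(X, d):
        x1 = np.append(x, 1.0)              # input plus bias term
        h = sigmoid(W1 @ x1)                # hidden activations
        h1 = np.append(h, 1.0)
        y = sigmoid(W2 @ h1)[0]             # network output

        # local gradients (delta terms)
        delta_out = (t - y) * y * (1 - y)                    # output layer
        delta_hid = h * (1 - h) * (W2[0, :2] * delta_out)    # hidden layer

        # delta rule: eta * local gradient * output of previous layer
        W2 += eta * delta_out * h1
        W1 += eta * np.outer(delta_hid, x1)

def forward(x):
    h1 = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)
    return sigmoid(W2 @ h1)[0]

print(np.round([forward(x) for x in X], 2))  # ideally close to the XOR targets [0, 1, 1, 0]
```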
___
Lecture 7, 8 & 9
Radial Basis Function
The first class went into a discussion of the four key moments: the 1st moment is the dc value
(mean), the 2nd moment is the variance, the 3rd moment is the skewness (degree of asymmetry) and
the 4th moment is the kurtosis (the peakedness of the distribution; it was said to give accurate results
with few samples).
Previously we studied backpropagation; its limited ability to learn and to generalise poses a problem.
A brief discussion of curve fitting was done: with a set of given data points, a fit can be viewed as
overfitting, underfitting or a fair fit. In overfitting the error is zero but the network cannot learn, so it
cannot generalise. In underfitting it can learn but the error is larger. A fair fit lies between the two
and is acceptable. By error we mean here the training error, and it can be reduced by increasing the
order of the curve. For the testing error, a certain fraction of the total samples is held out in order to
test the network and see whether the outputs match. We notice that a compromise needs to be made
between decreasing the error and increasing the generalisation. Till now in backpropagation we
discussed only how to reduce the training error, therefore it has very little power to generalise.
The Gaussian function is a nonlinear function and can be used as a kernel. Basically, when a linear
separation of the data is not possible in the input space, the data can be made linearly separable in
the feature space by increasing its dimension using a kernel. This architecture has only three layers,
one input, one hidden and one output layer, such that a nonlinear function (the Gaussian) is used
going from input to hidden and a linear function from hidden to output.
The XOR problem was used again, where the inputs x1 and x2 were mapped into features ϕ1 and ϕ2
with t1 and t2 as parameters. ϕ acts as the basis and t acts as the centre, with the distance from t
acting as the radius, hence the name radial basis function. Once the data is transformed into the
feature space, a linear classifier can be used to solve the problem, as done when forming a
feed-forward net.
The input space has a physical as well as a mathematical interpretation and significance, while the
feature space has only mathematical significance. But the final question lies in the choice of the
nonlinear kernel function and of its parameters. The parameters are generally taken from the dataset
itself, typically the means of the data.
Towards the end, Cover's theorem and its corollary were discussed.
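A sketch of the classic RBF solution to XOR. The Gaussian centres t1 and t2 follow the usual textbook choice; the lecture's exact values may differ.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 1, 1, 0], dtype=float)               # XOR targets

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])   # Gaussian centres

def phi(x):
    # nonlinear map from input space to feature space using Gaussian kernels
    return np.array([np.exp(-np.sum((x - t1) ** 2)),
                     np.exp(-np.sum((x - t2) ** 2)),
                     1.0])                             # bias term

Phi = np.array([phi(x) for x in X])                    # 4 x 3 design matrix
w, *_ = np.linalg.lstsq(Phi, d, rcond=None)            # linear classifier in feature space
print(np.round(Phi @ w, 2))                            # close to [0, 1, 1, 0]
```

In the feature space the two XOR classes become linearly separable, so an ordinary linear fit suffices.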
___
Lecture 10
System of linear equations
Let’s take the following set of equations:
2x + 2y = 2; x + 2y = 2; x + 3y = 2. These equations (of the form Ax = b) cannot all be satisfied
exactly, since the three lines do not pass through a single point (the system is overdetermined). Here
the columns represent the features (x and y) and the rows represent the samples. So this is an
example of 2 features and 3 samples, which means the error will probably not be zero, or the answer
depends on which two samples out of the three are chosen. Only when all the lines intersect at one
single point does a unique solution exist, and then the error is also zero. What is meant by error here
is:
Here e is the error and p tells us what part of b lies in the plane (column space) of A. Hence we find
x̂ = (AᵀA)⁻¹Aᵀb, and this gives the best possible solution with the least error. This was further proved
by differentiating the error expression with respect to x and y and solving those equations to get the
same answer.
Thus, for such a system of equations, we solve it this way and give the answer along with its error. If
b were perpendicular to a in the figure above, it would have been the case of the trivial zero solution
(the projection p would be zero).
If the equation is w₀ + w₁x₁ + w₂x₁², then setting x₂ = x₁² makes it linear; in this way an nth-order
model can be brought into n features and into a linear system.
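A numeric sketch of the least-squares solution x̂ = (AᵀA)⁻¹Aᵀb for the overdetermined system above.

```python
import numpy as np

A = np.array([[2.0, 2.0],
              [1.0, 2.0],
              [1.0, 3.0]])                   # 3 samples, 2 features
b = np.array([2.0, 2.0, 2.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)    # normal equations
e = b - A @ x_hat                            # residual error (not zero here)
print(x_hat, np.round(e, 3))
```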
___
Lecture 11
Eigen Vectors and Eigen Values
Ax = b with b = λx, i.e. Ax = λx. Here A is the matrix, λ is a scalar and x is the solution (the
eigenvector). Finding λ from |A − λI| = 0 gives the eigenvalues; substituting each λ back into
Ax = λx gives the solutions that form the eigenvectors.
A symmetric matrix gives perpendicular eigenvectors and its eigenvalues λ are real, while an
antisymmetric matrix has purely imaginary eigenvalues.
We can consider A as the system, b as the output and x as the input. So when b = λx we can say that
the output is a scalar multiple of the input, as in the case of a symmetric system. But in the
antisymmetric case you are asking the system to change the direction, so no real scalar λ exists.
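A quick check of the claims above with example matrices of my own: a symmetric matrix has real eigenvalues and perpendicular eigenvectors, while an antisymmetric one has purely imaginary eigenvalues.

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # symmetric
K = np.array([[0.0, -1.0],
              [1.0,  0.0]])           # antisymmetric (skew-symmetric)

vals_s, vecs_s = np.linalg.eig(S)
vals_k, _ = np.linalg.eig(K)

print(vals_s)                         # real eigenvalues
print(vecs_s[:, 0] @ vecs_s[:, 1])    # ~0: the eigenvectors are perpendicular
print(vals_k)                         # purely imaginary: +1j and -1j
```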
___
Lecture 12
Principal component analysis (PCA)
If you are given a set of data points plotted on the x1 vs x2 axes and you need to choose only one
axis to represent them, how do you choose it? This is done by finding the axis along which the
variance is maximum and discarding the one with minimum variance.
The covariance matrix (C) gives us the covariance between x1 and x2 and is always symmetric. We
need to find the best possible vector (v) to represent the whole set of data points. Like we did
previously with Ax = λx, here Cv = λv is solved, where v gives the eigenvectors and λ the
eigenvalues.
If C is an identity matrix then λ is 1, hence there is no change and the axes remain the same. When
two eigenvectors are present, the eigenvector corresponding to the higher λ is chosen.
The procedure followed in PCA is: the data matrix X is given, its mean is determined, and the
covariance matrix is computed using both. The eigenvalues and eigenvectors of this covariance
matrix are found. The eigenvectors form the transformation matrix and its rows are called the
principal components. When this transformation matrix is applied to the given data matrix we get a
resultant matrix, and if the inverse of the transformation matrix is applied to this resultant matrix we
can retrieve the original data matrix.
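A minimal PCA sketch following the procedure above, with random correlated 2-D data as a stand-in for the lecture's example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data

mu = X.mean(axis=0)
Xc = X - mu                               # mean-centred data
C = np.cov(Xc, rowvar=False)              # covariance matrix (symmetric)

eigvals, eigvecs = np.linalg.eigh(C)      # eigh is meant for symmetric matrices
order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
T = eigvecs[:, order].T                   # rows are the principal components

Y = Xc @ T.T                              # transform into the eigen plane
X_back = Y @ T + mu                       # inverse transform recovers the data
print(np.allclose(X_back, X))             # True

Y1 = Xc @ T[0]                            # keep only the top component: compression
```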
The significance lies in "dimensionality reduction", i.e. selecting the eigenvector with the maximum
λ and applying it to the input data matrix compresses the data. Thus, if the data contains any sort of
redundancy, it is better to go from the data plane to the eigen plane.
Notice that in RBF the dimensionality was increased while in PCA it is decreased; which is useful
depends on the application.
___
Lecture 13
K-clustering algorithm
What is "supervised" and what is "unsupervised" learning?
Supervised learning is one in which the classes are already made and provided to the machine based
on some historical data, while in unsupervised learning there is no historical data present and the
classes are therefore made by the machine itself using some algorithm like K-means.
According to this algorithm, two data points are randomly taken as cluster centres from all the data
points provided to start with. For each of the remaining data points, its distance from these two
cluster centres is calculated and it is labelled with the centre to which the distance is smaller. So we
get two classes; now the mean of each of these two classes is computed and these become the new
cluster centres. Again the data points are compared with the new cluster centres and grouped into
two new classes, then again the means of these two classes are found, and the process continues till
the means stop changing appreciably. K = 2 here because initially two data points were chosen.
Depending on the value of K, the data can be grouped into that many different classes. The success
of the algorithm lies in finding K.
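A small K-means sketch (K = 2) following the procedure described above; the data points are made up for illustration.

```python
import numpy as np

def kmeans(X, k=2, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]    # random initial centres
    for _ in range(iters):
        # assign each point to the nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # recompute each centre as the mean of its class
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):                 # means stopped changing
            break
        centres = new_centres
    return centres, labels

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(4, 0.5, (20, 2))])
centres, labels = kmeans(X)
print(np.round(centres, 2))     # roughly (0, 0) and (4, 4)
```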
This same process can be carried out in reverse by initially taking a large value of K and having K
classes; now if the distance between two class means is less than a threshold, those classes are
merged. This way the number of classes is reduced to the optimum. Here the success lies in deciding
the threshold, for example by looking at the variance along the two axes.
In the RBF problem, the parameters t1 and t2 could be decided using this algorithm.
___
Lecture 14
Regression vs Classification
The gradient descent algorithm, which finds a local minimum given the cost function, was
discussed.
Regression can be univariate or multivariate, given by the equation h_w(x) = wᵀx. This is passed
through a sigmoidal function to convert a regression problem into a classification problem. Looking
at the broader picture, regression and classification are the same problem, but for regression the
outcome comes from an infinite (or say unknown) set while for classification it comes from a finite
set. (Analogy: regression is analog, classification is discrete.)
The thing that differentiates or classifies is called a dichotomy or hyperplane. wᵀx = 0 gives the
decision boundary and acts as a hyperplane separating the classes when plotted in the feature plane.
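A tiny sketch of turning the regression output h_w(x) = wᵀx into a classification by passing it through a sigmoid; the weights here are made up, not learned.

```python
import numpy as np

def h(w, x):
    return w @ x                          # linear regression output (any real number)

def classify(w, x):
    p = 1.0 / (1.0 + np.exp(-h(w, x)))    # sigmoid squashes it into (0, 1)
    return p, int(p >= 0.5)               # probability and the class label

w = np.array([1.0, -2.0, 0.5])            # example weights (bias as w[0], with x[0] = 1)
x = np.array([1.0, 0.3, 2.0])
print(classify(w, x))                     # the decision boundary is w^T x = 0
```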
Consider an example: there is a patient and the doctor predicts his health. The four cases are: the
doctor predicts the patient will survive and the patient actually survives; the doctor predicts the
patient will die and the patient actually dies; the doctor predicts the patient will die but the patient
actually survives; and, the last case, the doctor predicts the patient will survive but the patient
actually dies. The first two cases have zero error and the last two have infinite loss, but the question
is which case (or rather which error, though both deal with infinite loss) is more dangerous of the
two. That depends on the situation; here the last case is the worst. A case where infinite loss occurs
signifies that the cost function needs to be changed.
We normally find the error as (desired − actual), but we can also consider log(desired − actual); then
it is in the log domain. A classification problem is also called a logistic regression problem (or say
probabilistic, when the classes are associated with some probability). In a regression problem the
error is quantifiable, whereas in classification it is not.
___
Lecture 15
Support Vector Machine
As we observe, multiple classifiers are possible, so the question lies in deciding which is the best
classifier. The method followed is to first choose two circles (points of one class) and form a
marginal hyperplane through them, then move parallelly forward till the first cross (a point of the
other class) is encountered and form another marginal plane through it, parallel to the previous one
(the marginal planes were shown in red in the figure). The hyperplane that bisects the distance
between the two marginal planes becomes the best classifier. Every data point in the figure is a
vector, and the three data points that define the marginal planes are called the "support vectors".
The same concept was explained mathematically, concluding that minimisation of an objective
function which is quadratic, under a linear constraint, is required to optimise the solution. This is the
primal form of the SVM; there is another, more preferred form called the dual form of the SVM,
wherein a maximisation is carried out instead. Then there is the kernel trick, in which linear
classification can be achieved by increasing the dimension.
An example of a soft margin was also shown, which classifies while allowing some exceptions.
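A short sketch using scikit-learn (assumed available; the lecture did not prescribe any library) to fit a linear, soft-margin SVM on a made-up 2-D dataset and inspect the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.7, size=(20, 2)),
               rng.normal(loc=[3, 3], scale=0.7, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)        # C controls how soft the margin is
clf.fit(X, y)

print(clf.support_vectors_)              # the points that define the marginal planes
print(clf.coef_, clf.intercept_)         # w and b of the separating hyperplane
```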
___
Lecture 16 & 17
Deep NN and Case study
Till now what we studied were shallow networks; deep neural networks give a far better picture.
Convolutional neural networks (CNNs) are effective for image recognition and classification
problems. The four operations carried out are convolution, nonlinearity (ReLU), pooling or
subsampling, and finally classification (fully connected layers).
An image can be represented as a matrix of pixel values. In convolution, a mask (also called a filter,
kernel or feature-detector matrix) is applied to the original image matrix to obtain an output matrix
called the feature map. Depending on the filter matrix applied, different operations like edge
detection, sharpening or blurring can be performed. (E.g. edges are the high-frequency components,
so when the image is passed through a low-pass filter the edges get blurred.)
Convolution applies to LTI systems and is therefore linear, but real-world data is nonlinear, so to
introduce this nonlinearity, element-wise thresholding is done in the next step, called the rectified
linear unit (ReLU).
Then, for the pooling task, which involves dimensionality reduction, spatial pooling is done by
applying max, average, etc. operations over a spatial neighbourhood.
These three operations can be applied more than once on the image, and the result is finally given to
the fully connected layer. This stage is like the normal multilayer perceptron method, where the
features obtained are given to the network for training.
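A compact numpy sketch (mine, not the lecture's) of the three CNN operations: convolution with a small kernel, ReLU, and 2×2 max pooling, on a toy image.

```python
import numpy as np

def conv2d(img, kernel):
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)   # feature-map entry
    return out

def relu(x):
    return np.maximum(x, 0)                       # element-wise thresholding

def max_pool(x, size=2):
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.zeros((6, 6))
img[:, 3:] = 1.0                                  # toy image: dark left half, bright right half
edge_kernel = np.array([[-1.0, 1.0]])             # crude vertical-edge detector

fmap = max_pool(relu(conv2d(img, edge_kernel)))
print(fmap)                                       # non-zero only where the edge is
```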
A case study of remote sensing was demonstrated in order to make us realise how to apply the
concepts in different fields. This case study was based on studying satellite images and classifying
different regions. Interpretation of the electromagnetic energy reflected and emitted from the surface
of the earth was studied in order to derive the necessary features for classification. Thus it is
important to study the field of application in order to select appropriate features.
___
Lecture 18 & 19
Fourier or Wavelets
The utility of Fourier analysis lies in finding the frequency components and their contribution to the
signal, in order to find distinguishing features in the frequency domain. But Fourier would fail if we
have to determine at what time a particular frequency, say 2 kHz, is present, or, given a time, which
frequency is present at that time.
Time resolution: the distance between two time intervals at a given frequency.
Frequency resolution: at a given time, the distance between two frequency components.
So coarse and fine frequency/time resolution were explained.
Two functions called ϕ (the approximation function, (1, 1)) and ψ (the detailing function, (1, −1))
were introduced. Basically they represent an LPF and an HPF respectively, which are
complementary to each other; together they form a filter bank. The significance of the
approximation and detailing functions was beautifully demonstrated with the help of a cabbage: if
the first layer is peeled off, the approximation is what is left and the detail is what is peeled.
Features are not taken at one resolution but at multiple resolutions, i.e. Multi Resolution Analysis
(MRA). Fourier looks at a signal as a full wave, but why not look at it as a segment? That segment
view is called a wavelet. Thus wavelet analysis is done using ϕ (the father wavelet, or scaling
function) and ψ (the mother wavelet, or wavelet function). The Haar transform uses these wavelets.
The two functions need to be magnitude and power complementary.
Designing a wavelet is not difficult, but designing a useful wavelet is challenging.
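A one-level Haar decomposition sketch using the (1, 1) approximation and (1, −1) detailing functions described above; normalisation by √2 is my assumption, as conventions vary.

```python
import numpy as np

def haar_level(x):
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)    # phi: low-pass, "what is left"
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)    # psi: high-pass, "what is peeled"
    return approx, detail

def haar_inverse(approx, detail):
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0])
a, d = haar_level(x)
print(a, d)
print(np.allclose(haar_inverse(a, d), x))        # perfect reconstruction: True
```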
___
Lecture 20
Fuzziness
This theory is yet to evolve completely. It deals with vagueness, and the degree of imprecision is
captured. Functions called membership functions are used in the fuzzy model. It is similar to
probability, but it is not really a probabilistic model. You can say the washing machine controller is
a fuzzy controller.
A fuzzy set also differs from classical set theory in that an element of a fuzzy set can be a member
of two sets (with different degrees of membership).
Operations on fuzzy sets were discussed. Union and intersection are done slightly differently here,
using the supremum (⋁) and infimum (⋀) respectively, as sketched below.
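A tiny sketch of fuzzy union and intersection via element-wise max (supremum) and min (infimum); the membership values are made up.

```python
import numpy as np

# membership degrees of the same elements in two fuzzy sets A and B
mu_A = np.array([0.2, 0.7, 1.0, 0.4])
mu_B = np.array([0.5, 0.3, 0.8, 0.9])

union = np.maximum(mu_A, mu_B)        # A ∨ B: supremum of the memberships
inter = np.minimum(mu_A, mu_B)        # A ∧ B: infimum of the memberships

print(union, inter)
```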
This concluding lecture on fuzzy logic gave a fuzzy idea, in order to provide food for thought. :P
___
Scrolling from top to bottom, I realise it has been an amazing journey of
learning :)