Notes on Data Analytics Unit - II
[By Dr K. S. Mishra, MCA Dept, MIET, Meerut]
Regression is a method to mathematically formulate
relationships between variables that in due course can be
used to estimate, interpolate and extrapolate. Suppose we
want to estimate the weight of individuals, which is
influenced by height, diet, workout, etc. Here, Weight is
the predicted variable; Height, Diet, and Workout are the
predictor variables.
The predicted variable is a dependent variable in the
sense that it depends on predictors. Predictors are also
called independent variables. Regression reveals to what
extent the predicted variable is affected by the predictors.
In other words, what amount of variation in predictors will
result in variations of the predicted variable. The predicted
variable is mathematically represented as Y, and the predictor
variables are represented as X.
This mathematical relationship is often called the
regression model.
Regression is a branch of statistics. There are many types
of regression. Regression is commonly used for prediction
and forecasting.
Discussion
● What's a typical process for performing
regression analysis?
● First select a suitable predicted variable with
acceptable measurement qualities such as reliability
and validity. Likewise, select the predictors. When
there's a single predictor, we call it bivariate analysis;
anything more, we call it multivariate analysis.
● Collect sufficient number of data points. Use a
suitable estimation technique to arrive at the
mathematical formula between predicted and
predictor variables. No model is perfect. Hence, give
error bounds.
● Finally, assess the model's stability by applying it to
different samples of the same population. When
predictor variables are given for a new data point,
estimate the predicted variable. If stable, the model's
accuracy should not decrease. This process is called
model cross-validation.
● I've heard of Least Squares. What's this and how
is it related to regression?
● Least Squares is a term that signifies that the sum of squared errors is at a
minimum. The error is defined as the difference between the observed value and the
predicted value. The objective of regression estimation is to produce the least
squared error. When the training error approaches zero, the model may be
overfitting.
Least Squares Method provides linear equations with unknowns that can be
solved for any given data. The unknowns are regression parameters. The
linear equations are called Normal Equations. The normal equations are
derived using calculus to minimize squared errors.
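As an illustration (the standard textbook form, assuming a straight-line model Y = a + bX fitted to n data points), minimizing the sum of squared errors with respect to a and b gives these two normal equations:
Σ yi = n·a + b·Σ xi
Σ xi·yi = a·Σ xi + b·Σ xi²
Solving these two linear equations simultaneously yields the least-squares estimates of a and b.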
What is correlation? How is it related to regression?
Correlation coefficient r has the following formula:
r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² × Σ(yi − ȳ)² ), where x̄ and ȳ are the
means of x and y. It is called the Pearson Product Moment Correlation (PPMC).
r takes values in [−1, 1].
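A minimal Python sketch of this computation, using a small made-up dataset (the numbers are purely illustrative):
import numpy as np

# hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson product moment correlation from the formula above
xd = x - x.mean()
yd = y - y.mean()
r = (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())
print(r)                         # close to +1 for this increasing data
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in value agrees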
Types of correlation: r > 0 indicates positive correlation, r < 0 indicates
negative correlation, and r ≈ 0 indicates no linear correlation.
How can we do data analysis when relationships are non-linear?
(x1, y1), (x2, y2), (x3, y3), (x4, y4) is a data set.
Let the regression line be Y = ax + b (fitted using the least squares method).
Then (y1 − (a·x1 + b))² + (y2 − (a·x2 + b))² + (y3 − (a·x3 + b))² + (y4 − (a·x4 + b))²
should be a minimum.
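A minimal Python sketch of this least-squares fit for a hypothetical four-point dataset; np.polyfit minimizes exactly this sum of squared errors for the line Y = ax + b:
import numpy as np

# hypothetical data set (x1,y1) ... (x4,y4)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

# least squares fit of Y = a*x + b (minimizes the sum of squared residuals)
a, b = np.polyfit(x, y, deg=1)
residuals = y - (a * x + b)
print(a, b, (residuals ** 2).sum())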
What is the causal relationship in regression?
Causality or causation refers to the idea that variation in a predictor
X causes variation in the predicted variable Y. This is distinct from regression,
which is about prediction: a strong regression relationship does not by itself
establish causation.
Multivariate analysis (MVA) is a statistical procedure for the analysis of
data involving more than one type of measurement or observation. It
may also mean solving problems where more than one dependent
variable is analyzed simultaneously with other variables.
Advantages and Disadvantages of Multivariate
Analysis
Advantages
● The main advantage of multivariate analysis is that since it considers
more than one factor of independent variables that influence the
variability of dependent variables, the conclusion drawn is more accurate.
● The conclusions are more realistic and nearer to the real-life situation.
Disadvantages
● The main disadvantage of MVA is that it requires rather complex
computations to arrive at a satisfactory conclusion.
● Many observations for a large number of variables need to be collected
and tabulated; it is a rather time-consuming process.
Classification Chart of Multivariate Techniques
Selection of the appropriate multivariate technique depends upon-
a) Are the variables divided into independent and dependent classification?
b) If Yes, how many variables are treated as dependents in a single analysis?
c) How are the variables, both dependent and independent measured?
Multivariate analysis techniques can be classified into two broad categories.
This classification depends upon the question: are the involved variables dependent
on each other or not?
If the answer is yes: We have Dependence methods.
If the answer is no: We have Interdependence methods.
Dependence technique: Dependence techniques are types of multivariate
analysis techniques that are used when one or more of the variables can be
identified as dependent variables and the remaining variables can be identified
as independent.
Interdependence Technique
Interdependence techniques are used when the variables cannot be classified as
either dependent or independent.
It aims to unravel relationships between variables and/or subjects without explicitly assuming
specific distributions for the variables. The idea is to describe the patterns in the data without
making (very) strong assumptions about the variables.
Multiple Regression
Multiple Regression Analysis– Multiple regression is an extension of simple linear regression.
It is used when we want to predict the value of a variable based on the value of two or more
other variables. The variable we want to predict is called the dependent variable (or sometimes,
the outcome, target, or criterion variable). Multiple regression uses several "x"
variables; each observation consists of values for all the predictors together with
the outcome, e.g. ((x1)1, (x2)1, (x3)1, Y1) for the first observation.
y = a·x1 + b·x2 + c·x3 + d is an example of a multiple regression equation.
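A minimal Python sketch of fitting such an equation by ordinary least squares; the observations below are made up purely for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical observations: three predictors x1, x2, x3 per row
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.0],
              [4.0, 3.0, 2.5],
              [5.0, 5.0, 3.0]])
y = np.array([5.0, 7.5, 14.0, 16.5, 21.0])

model = LinearRegression().fit(X, y)      # fits y = a*x1 + b*x2 + c*x3 + d
print(model.coef_)                        # estimated a, b, c
print(model.intercept_)                   # estimated d
print(model.predict([[2.5, 2.5, 1.0]]))   # prediction for a new data point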
Bayesian statistics is a theory in the field of statistics based on the
Bayesian interpretation of probability where probability expresses a degree of
belief in an event. The degree of belief may be based on prior knowledge
about the event, such as the results of previous experiments, or on personal
beliefs about the event. This differs from a number of other interpretations of
probability, such as the frequentist interpretation that views probability as the
limit of the relative frequency of an event after many trials.
Bayesian statistical methods use Bayes' theorem to compute and update
probabilities after obtaining new data. Bayes' theorem describes the
conditional probability of an event based on data as well as prior information
or beliefs about the event or conditions related to the event. For example, in
Bayesian inference, Bayes' theorem can be used to estimate the parameters
of a probability distribution or statistical model. Since Bayesian statistics treats
probability as a degree of belief, Bayes' theorem can directly assign a
probability distribution that quantifies the belief to the parameter or set of
parameters.Bayesian statistics is named after Thomas Bayes, who formulated
a specific case of Bayes' theorem in a paper published in 1763. In several
papers spanning from the late 18th to the early 19th centuries, Pierre-Simon
Laplace developed the Bayesian interpretation of probability. Laplace used
methods that would now be considered Bayesian to solve a number of
statistical problems. Many Bayesian methods were developed by later
authors, but the term was not commonly used to describe such methods until
the 1950s. During much of the 20th century, Bayesian methods were viewed
unfavorably by many statisticians due to philosophical and practical
considerations. Many Bayesian methods required much computation to
complete, and most methods that were widely used during the century were
based on the frequentist interpretation. However, with the advent of powerful
computers and new algorithms like the Markov chain Monte Carlo, Bayesian
methods have seen increasing use within statistics in the 21st century.
Bayesian inference
Bayesian inference refers to statistical inference where uncertainty in
inferences is quantified using probability. In classical frequentist
inference, model parameters and hypotheses are considered to be fixed.
Probabilities are not assigned to parameters or hypotheses in frequentist
inference. For example, it would not make sense in frequentist inference
to directly assign a probability to an event that can only happen once,
such as the result of the next flip of a fair coin. However, it would make
sense to state that the proportion of heads approaches one-half as the
number of coin flips increases.
Statistical models specify a set of statistical assumptions and processes
that represent how the sample data are generated. Statistical models
have a number of parameters that can be modified. For example, a coin
can be represented as samples from a Bernoulli distribution, which
models two possible outcomes. The Bernoulli distribution has a single
parameter equal to the probability of one outcome, which in most cases
is the probability of landing on heads. Devising a good model for the
data is central in Bayesian inference. In most cases, models only
approximate the true process, and may not take into account certain
factors influencing the data. In Bayesian inference, probabilities can be
assigned to model parameters. Parameters can be represented as
random variables. Bayesian inference uses Bayes' theorem to update
probabilities after more evidence is obtained or known.
Statistical modeling
The formulation of statistical models using Bayesian statistics has the
identifying feature of requiring the specification of prior distributions for
any unknown parameters. Indeed, parameters of prior distributions may
themselves have prior distributions, leading to Bayesian hierarchical
modeling, or may be interrelated, leading to Bayesian networks.
Exploratory analysis of Bayesian models
Exploratory analysis of Bayesian models is an adaptation or extension of
the exploratory data analysis approach to the needs and peculiarities of
Bayesian modeling. In the words of Persi Diaconis: "Exploratory data
analysis seeks to reveal structure, or simple descriptions in data. We
look at numbers or graphs and try to find patterns. We pursue leads
suggested by background information, imagination, patterns perceived,
and experience with other data analyses."
The inference process generates a posterior distribution, which has a
central role in Bayesian statistics, together with other distributions like
the posterior predictive distribution and the prior predictive distribution.
The correct visualization, analysis, and interpretation of these
distributions is key to properly answer the questions that motivate the
inference process.
When working with Bayesian models there are a series of related tasks
that need to be addressed besides inference itself:
● Diagnosis of the quality of the inference; this is needed when
using numerical methods such as Markov chain Monte Carlo
techniques
● Model criticism, including evaluations of both model
assumptions and model predictions
● Comparison of models, including model selection or model
averaging
● Preparation of the results for a particular audience
All these tasks are part of the Exploratory analysis of Bayesian models
approach and successfully performing them is central to the iterative and
interactive modeling process. These tasks require both numerical and
visual summaries.
Design of experiments
The Bayesian design of experiments includes a concept called 'influence of
prior beliefs'. This approach uses sequential analysis techniques to include
the outcome of earlier experiments in the design of the next experiment. This
is achieved by updating 'beliefs' through the use of prior and posterior
distribution. This allows the design of experiments to make good use of
resources of all types. An example of this is the multi-armed bandit problem.
Overview
● The drawbacks of frequentist statistics lead to the need for Bayesian
Statistics
● Discover Bayesian Statistics and Bayesian Inference
Introduction
Bayesian Statistics remains poorly understood by many analysts. Amazed by the
incredible power of machine learning, a lot of us have drifted away from statistics
and narrowed our focus to exploring machine learning. Isn't it true?
We fail to understand that machine learning is not the only way to solve real
world problems. In several situations, it does not help us solve business
problems, even though there is data involved in these problems. To say the
least, knowledge of statistics will allow you to work on complex analytical
problems, irrespective of the size of data.
Thomas Bayes' theorem was published in 1763. Even centuries later,
the importance of 'Bayesian Statistics' hasn't faded away. In
fact, today this topic is being taught in great depths in some of the world’s
leading universities.
By the end of this topic, you will have a concrete understanding of Bayesian
Statistics and its associated concepts.
Bayesian Statistics
“Bayesian statistics is a mathematical procedure that applies probabilities to
statistical problems. It provides people the tools to update their beliefs in the
evidence of new data.”
Conditional Probability
It is defined as follows: the probability of an event A given B equals the
probability of A and B happening together divided by the probability of B.
For example: assume two partially intersecting sets A and B. Set A represents
one set of events and Set B represents another. We wish to calculate the
probability of A given that B has already happened. Since B has happened, the
part which now matters for A is the intersection of A and B. So, the probability
of A given B turns out to be:
P(A|B) = P(A ∩ B) / P(B)
Similarly, we can write the formula for event B given that A has already occurred:
P(B|A) = P(A ∩ B) / P(A), or P(A ∩ B) = P(B|A) × P(A)
Substituting the second equation into the first, we get:
P(A|B) = P(B|A) × P(A) / P(B)
This is known as Conditional Probability.
Let’s try to answer a betting problem with this technique.
Suppose B is the event that James Hunt wins a race and A is the event that it
rains. From four past race days:
1. P(A) = 1/2, since it rained on two of the four days.
2. P(B) = 1/4, since James won only one race out of four.
3. P(A|B) = 1, since it rained every time James won.
Substituting the values in the conditional probability formula,
P(B|A) = P(A|B) × P(B) / P(A) = 1 × 0.25 / 0.5 = 0.5,
so the probability is about 50%, double the 25% we had when rain was not taken
into account. This new evidence (rain) strengthened our belief that James will
win. You must be wondering that this formula bears a close resemblance to
something you might have heard a lot about.
Probably, you guessed it right. It looks like Bayes Theorem.
Bayes theorem is built on top of conditional probability and lies at the heart of
Bayesian Inference. Let's understand it in detail now.
Bayes Theorem
Bayes Theorem comes into effect when multiple events A1, A2, ..., An form an
exhaustive set (a partition) together with another event B. This could be
understood with the help of the diagram below.
Now, B can be written as
B = (B ∩ A1) ∪ (B ∩ A2) ∪ ... ∪ (B ∩ An)
So, the probability of B can be written as
P(B) = P(B ∩ A1) + P(B ∩ A2) + ... + P(B ∩ An)
But P(B ∩ Ai) = P(B|Ai) × P(Ai)
So, replacing P(B) in the equation of conditional probability, we get
P(Ai|B) = P(B|Ai) × P(Ai) / Σj P(B|Aj) × P(Aj)
This is the equation of Bayes Theorem.
Bayesian Inference
Rather than diving deeper into the theoretical aspects, let's see how it
works! Let's take an example of coin tossing to understand the idea behind
Bayesian inference.
An important part of Bayesian inference is the establishment of parameters
and models.
Models are the mathematical formulation of the observed events. Parameters
are the factors in the models affecting the observed data. For example, in
tossing a coin, fairness of coin may be defined as the parameter of coin
denoted by θ. The outcome of the events may be denoted by D.
Answer this now. What is the probability of 4 heads out of 9 tosses (D) given
the fairness of the coin (θ), i.e. P(D|θ)?
Wait, did I ask the right question? No.
We should be more interested in knowing: given an outcome (D), what is the
probability of the coin being fair (θ = 0.5)?
Lets represent it using Bayes Theorem:
P(θ|D) = (P(D|θ) × P(θ)) / P(D)
Here, P(θ) is the prior i.e the strength of our belief in the fairness of the coin
before the toss. It is perfectly okay to believe that a coin can have any degree
of fairness between 0 and 1.
P(D|θ) is the likelihood of observing our result given our distribution for θ. If
we knew that coin was fair, this gives the probability of observing the number
of heads in a particular number of flips.
P(D) is the evidence. This is the probability of data as determined by summing
(or integrating) across all possible values of θ, weighted by how strongly we
believe in those particular values of θ.
If we had multiple views of what the fairness of the coin is (but didn’t know for
sure), then this tells us the probability of seeing a certain sequence of flips for
all possibilities of our belief in the coin’s fairness.
P(θ|D) is the posterior belief about our parameters after observing the evidence,
i.e. the number of heads.
From here, we’ll dive deeper into the mathematical implications of this
concept. Don’t worry. Once you understand them, getting to its mathematics is
pretty easy.
To define our model correctly, we need two mathematical models beforehand:
one to represent the likelihood function P(D|θ) and the other to represent the
distribution of prior beliefs. The product of these two gives the posterior
belief P(θ|D) distribution.
Since the prior and posterior are both beliefs about the distribution of the
fairness of the coin, intuition tells us that both should have the same
mathematical form. Keep this in mind. We will come back to it again.
So, there are standard probability distributions that support this kind of
Bayesian updating; for a coin's fairness, the Beta distribution is a common
choice of prior.
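For the coin example, a common conjugate choice is a Beta prior for θ combined with a Binomial likelihood, because the posterior is then again a Beta distribution. A minimal Python sketch under that assumption, for 4 heads out of 9 tosses (the prior Beta(2, 2) is an illustrative choice that mildly favors fairness):
from scipy import stats

heads, tosses = 4, 9

# prior belief about fairness theta: Beta(alpha, beta)
alpha_prior, beta_prior = 2, 2

# with a Binomial likelihood, the posterior is Beta(alpha + heads, beta + tails)
alpha_post = alpha_prior + heads
beta_post = beta_prior + (tosses - heads)
posterior = stats.beta(alpha_post, beta_post)

print(posterior.mean())          # updated belief about theta after seeing the data
print(posterior.interval(0.95))  # 95% credible interval for theta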
Bayesian Belief Network
A Bayesian belief network is a key technique for dealing with
probabilistic events and for solving problems that involve uncertainty. We can
define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set
of variables and their conditional dependencies using a directed acyclic
graph."
It is also called a Bayes network, belief network, decision network, or
Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a
probability distribution, and also use probability theory for prediction and
anomaly detection.
Real world applications are probabilistic in nature, and to represent the
relationship between multiple events, we need a Bayesian network. It can also
be used in various tasks including prediction, anomaly detection, diagnostics,
automated insight, reasoning, time series prediction, and decision making
under uncertainty.
A Bayesian network can be used for building models from data and experts'
opinions, and it consists of two parts:
● Directed Acyclic Graph
● Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision
problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links),
where:
● Each node corresponds to a random variable, which can be
continuous or discrete.
● Arcs or directed arrows represent the causal relationships or conditional
probabilities between random variables. These directed links or arrows connect
pairs of nodes in the graph.
These links represent that one node directly influences the other; if there is no
directed link between two nodes, they are independent of each other.
○ In the above diagram, A, B, C, and D are random variables represented by
the nodes of the network graph.
○ If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.
○ Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycles. Hence, it is known as
a directed acyclic graph or DAG.
The Bayesian network has mainly two components:
● Causal Component
● Actual numbers
Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
The Bayesian network is based on Joint probability distribution and conditional
probability. So let's first understand the joint probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, ..., xn, then the probability of every different
combination of x1, x2, x3, ..., xn is known as the joint probability distribution.
P[x1, x2, x3, ..., xn] can be written in the following way using the chain rule:
= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] .... P[xn-1 | xn] P[xn]
In general, for each variable Xi in the network we can write:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Explanation of Bayesian network:
Let's understand the Bayesian network through an example by creating a directed
acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
reliably responds to a burglary but also responds to minor earthquakes. Harry
has two neighbors, David and Sophia, who have taken responsibility for informing Harry
at work when they hear the alarm. David always calls Harry when he hears the alarm, but
sometimes he gets confused by the phone ringing and calls then too. On the
other hand, Sophia likes to listen to loud music, so sometimes she misses hearing the
alarm. Here we would like to compute probabilities for this burglary-alarm domain.
Problem:
Calculate the probability that the alarm has sounded, but there is neither a burglary,
nor an earthquake, and David and Sophia both called Harry.
Solution:
● The Bayesian network for the above problem is given below. The network
structure shows that Burglary and Earthquake are the parent nodes of Alarm
and directly affect the probability of the alarm going off, whereas David's and
Sophia's calls depend on the alarm.
● The network represents the assumptions that the neighbors do not directly
perceive the burglary, do not notice the minor earthquake, and do not confer
with each other before calling.
● The conditional distribution for each node is given as a conditional probability
table or CPT.
● Each row in the CPT must sum to 1 because the entries in a row
represent an exhaustive set of cases for the variable.
● In a CPT, a boolean variable with k boolean parents has 2^k rows of probabilities.
Hence, if there are two parents, then the CPT will contain 4 probability values.
List of all events occurring in this network:
● Burglary (B)
● Earthquake(E)
● Alarm(A)
● David Calls(D)
● Sophia calls(S)
We can write the events of the problem statement as the probability P[D, S, A, B, E],
and rewrite it using the joint probability distribution (chain rule):
P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]
=P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]
= P [D| A]. P [ S| A, B, E]. P[ A, B, E]
= P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]
= P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]
Let's take the observed probability for the Burglary and earthquake component:
P(B= True) = 0.002, which is the probability of burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake.
P(E= False)= 0.999, which is the probability that an earthquake did not occur.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm A:
The Conditional probability of Alarm A depends on Burglar and earthquake:
B        E        P(A= True)    P(A= False)
True     True     0.94          0.06
True     False    0.95          0.05
False    True     0.31          0.69
False    False    0.001         0.999
Conditional probability table for David Calls:
The conditional probability that David will call depends on the state of the Alarm.
A P(D= True) P(D= False)
True 0.91 0.09
False 0.05 0.95
Conditional probability table for Sophia Calls:
The conditional probability that Sophia calls depends on its parent node,
"Alarm."
A P(S= True) P(S= False)
True 0.75 0.25
False 0.02 0.98
From the formula of joint distribution, we can write the problem statement in the form of
probability distribution:
P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).
= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
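A minimal Python sketch that encodes the CPTs above as dictionaries and evaluates the factored joint probability; the numbers follow the tables of this example:
# Priors
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}

# CPTs: P(A | B, E), P(D | A), P(S | A), each giving the probability of "True"
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}
P_D = {True: 0.91, False: 0.05}
P_S = {True: 0.75, False: 0.02}

def joint(d, s, a, b, e):
    """P(D=d, S=s, A=a, B=b, E=e) via the chain-rule factorization of the network."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_d = P_D[a] if d else 1 - P_D[a]
    p_s = P_S[a] if s else 1 - P_S[a]
    return p_d * p_s * p_a * P_B[b] * P_E[e]

# Alarm sounded, both neighbors called, but no burglary and no earthquake
print(joint(d=True, s=True, a=True, b=False, e=False))  # about 0.00068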
The semantics of Bayesian Network:
There are two ways to understand the semantics of a Bayesian network, which are
given below:
1. To understand the network as the representation of the Joint probability
distribution.
It is helpful to understand how to construct the network.
2. To understand the network as an encoding of a collection of conditional
independence statements.
It is helpful in designing inference procedures.
Support-Vectors
Support vectors are the data points that are nearest to the hyper-plane and
affect the position and orientation of the hyper-plane. We have to select a
hyperplane for which the margin, i.e. the distance between the support vectors
and the hyper-plane, is maximum. Even a little change in the position of these
support vectors can change the hyper-plane.
How does SVM work?
As we have a clear idea of the terminologies related to SVM, let’s now see
how the algorithm works. For example, we have a classification problem
where we have to separate the red data points from the blue ones.
Since it is a two-dimensional problem, our decision boundary will be a line; for
a 3-dimensional problem we would have to use a plane, and similarly, the
complexity of the solution will increase with the rising number of features.
As shown in the above image, we have multiple lines separating the data
points successfully. But our objective is to look for the best solution.
There are a few rules that can help us to identify the best line.
The first is maximum classification, i.e. the selected line must be able to
successfully segregate all the data points into their respective classes. In our
example, we can clearly see that lines E and D misclassify a red data point.
Hence, for this problem, lines A, B, and C are better than E and D, so we will
drop E and D.
The second rule is best separation, which means we must choose a line
such that it is perfectly able to separate the points.
In our example, if we get a new red data point closer to line A, as shown in the
image below, line A will misclassify that point. Similarly, if we get a new blue
instance closer to line B, then lines A and C will classify the data successfully,
whereas line B will misclassify this new point.
The point to notice here is that in both cases line C successfully classifies
all the data points. Why? To understand this, let's take the lines one by one.
Why not Line A?
First, consider line A. If I move line A towards the left, we can see it has very
little chance of misclassifying the blue points. On the other hand, if I shift line A
towards the right side, it will very easily misclassify the red points. The
reason is that on the left side the margin, i.e. the distance between the nearest
data point and the line, is large, whereas on the right side the margin is very
small.
Why Not Line B?
Similarly, in the case of line B, if we shift line B towards the right side, it has a
sufficient margin on the right side, whereas it will wrongly classify the instances
on the left side because the margin towards the left is very small. Hence, B is not
our perfect solution.
Why Line C?
In the case of line C, it has a sufficient margin on the left as well as the right
side. This maximum margin makes line C more robust to new data points
that might appear in the future. Hence, C is the best fit in this case: it
successfully classifies all the data points with the maximum margin.
This is what SVM looks for: it aims for the maximum margin and creates a line
that is equidistant from both classes, which is line C in our case. So we can say
C represents the SVM classifier with the maximum margin.
Now let's look at the data below. As we can see, this data is not linearly separable,
so a linear SVM will not work in this situation. If we try anyway to classify this
data with a line, the result will not be promising.
So, is there any way that SVM can classify this kind of data? For this problem,
we have to create a decision boundary that looks something like this.
The question is: is it possible to create such a decision boundary using SVM?
Well, the answer is yes. SVM does this by projecting the data into a higher
dimension, as shown in the following image. In the first case, the data is not
linearly separable, hence we project it into a higher dimension.
If we have more complex data then SVM will continue to project the data in a
higher dimension till it becomes linearly separable. Once the data become
linearly separable, we can use SVM to classify just like the previous problems.
Projection into Higher Dimension
Now let's understand how SVM projects the data into a higher dimension.
Take this data: it is a circular, non-linearly separable dataset.
To project the data into a higher dimension, we are going to create another
dimension z, where z = x² + y².
Now we will plot this feature Z with respect to x, which will give us linearly
separable data that looks like this.
Here, we have created a mapping Z using the base features X and Y; this
process is known as a kernel transformation. Precisely, a kernel takes the
features as input and creates linearly separable data in a higher
dimension.
Now the question is: do we have to perform this transformation manually? The
answer is no. SVM handles this process itself; we just have to choose the
kernel type.
Let’s quickly go through the different types of kernels available.
Linear Kernel
To start with, in the linear kernel, the decision boundary is a straight line.
Unfortunately, most real-world data is not linearly separable; this is the
reason the linear kernel is not widely used in SVM.
Gaussian / RBF kernel
It is the most commonly used kernel. It projects the data into a Gaussian
distribution, where the red points become the peak of the Gaussian surface
and the green data points become the base of the surface, making the data
linearly separable.
But this kernel is prone to overfitting and it captures the noise.
Polynomial kernel
At last, we have a polynomial kernel, which is non-uniform in nature due to the
polynomial combination of the base features. It often gives good results.
But the problem with the polynomial kernel is that the number of higher-dimension
features increases exponentially. As a result, it is computationally more
expensive than the RBF or linear kernel.
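A minimal Python sketch with scikit-learn, using a synthetic circular dataset (make_circles) to mimic the example above and comparing an RBF kernel with a linear one; the parameters are illustrative defaults:
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# circular, non linearly separable data
X, y = make_circles(n_samples=300, factor=0.3, noise=0.08, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the kernel performs the implicit projection into a higher dimension
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(clf.score(X_test, y_test))      # accuracy on held-out points
print(clf.support_vectors_.shape)     # the support vectors found

# a linear kernel struggles on the same data
print(SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test))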
Linear Systems Analysis:--In systems theory, a linear system is a mathematical
model of a system based on the use of a linear operator. Linear systems typically
exhibit features and properties that are much simpler than the nonlinear case. As a
mathematical abstraction or idealization, linear systems find important applications
in automatic control theory, signal processing, and telecommunications. For
example, the propagation medium for wireless communication systems can often
be modeled by linear systems.
Mathematical models are usually composed of relationships and variables.
Relationships can be described by operators, such as algebraic operators, functions,
differential operators, etc. Variables are abstractions of system parameters of
interest, that can be quantified. Several classification criteria can be used for
mathematical models according to their structure:
Linear vs. nonlinear: If all the operators in a mathematical model exhibit linearity,
the resulting mathematical model is defined as linear. A model is considered to be
nonlinear otherwise. The definition of linearity and nonlinearity is dependent on
context, and linear models may have nonlinear expressions in them. For example,
in a statistical linear model, it is assumed that a relationship is linear in the
parameters, but it may be nonlinear in the predictor variables. Similarly, a
differential equation is said to be linear if it can be written with linear differential
operators, but it can still have nonlinear expressions in it. In a mathematical
programming model, if the objective functions and constraints are represented
entirely by linear equations, then the model is regarded as a linear model. If one or
more of the objective functions or constraints are represented with a nonlinear
equation, then the model is known as a nonlinear model.
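To illustrate "linear in the parameters but nonlinear in the predictor variables", here is a small hypothetical Python sketch: the model y = b0 + b1·x + b2·x² is nonlinear in x yet is still fitted as a linear model because it is linear in the coefficients b0, b1, b2.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)  # synthetic data

# design matrix [1, x, x^2]: nonlinear in x, but linear in the coefficients
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
print(coef)  # recovers roughly [1.0, 2.0, 0.5]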
Linear structure implies that a problem can be decomposed into simpler parts that
can be treated independently and/or analyzed at a different scale and the results
obtained will remain valid for the initial problem when recomposed and rescaled.
Nonlinearity, even in fairly simple systems, is often associated with phenomena
such as chaos and irreversibility. Although there are exceptions, nonlinear systems
and models tend to be more difficult to study than linear ones. A common approach
to nonlinear problems is linearization, but this can be problematic if one is trying to
study aspects such as irreversibility, which are strongly tied to nonlinearity.
Static vs. dynamic: A dynamic model accounts for time-dependent changes in the
state of the system, while a static (or steady-state) model calculates the system in
equilibrium, and thus is time-invariant. Dynamic models typically are represented
by differential equations or difference equations.
Explicit vs. implicit: If all of the input parameters of the overall model are known,
and the output parameters can be calculated by a finite series of computations, the
model is said to be explicit. But sometimes it is the output parameters which are
known, and the corresponding inputs must be solved for by an iterative procedure,
such as Newton's method or Broyden's method. In such a case the model is said to
be implicit. For example, a jet engine's physical properties such as turbine and
nozzle throat areas can be explicitly calculated given a design thermodynamic
cycle (air and fuel flow rates, pressures, and temperatures) at a specific flight
condition and power setting, but the engine's operating cycles at other flight
conditions and power settings cannot be explicitly calculated from the constant
physical properties.
Discrete vs. continuous: A discrete model treats objects as discrete, such as the
particles in a molecular model or the states in a statistical model; while a
continuous model represents the objects in a continuous manner, such as the
velocity field of fluid in pipe flows, temperatures and stresses in a solid, and
electric field that applies continuously over the entire model due to a point charge.
Deterministic vs. probabilistic (stochastic): A deterministic model is one in which
every set of variable states is uniquely determined by parameters in the model and
by sets of previous states of these variables; therefore, a deterministic model
always performs the same way for a given set of initial conditions. Conversely, in a
stochastic model—usually called a "statistical model"—randomness is present, and
variable states are not described by unique values, but rather by probability
distributions.
Nonlinear Dynamics:- Dynamics is a branch of
mathematics that studies how systems change over time. Up until the
18th century, people believed that the future could be perfectly
predicted given that one knows “all forces that set nature in motion,
and all positions of all items of which nature is composed” (such a
being is referred to as Laplace's Demon).
Now, provided that we believe that the world is fully deterministic,
then the statement makes sense. The problem is that in reality,
measured values (e.g. of forces and positions) are often approximated.
What we didn’t realise back then is that even if one knows of all
variables’ values, a slightly inaccurate approximation could result in a
totally different future. This is rather disheartening as it suggests that
in reality, even if we can build a supercomputer that resembles a
Laplace Demon, in order for it to make any meaningful prediction of
the future, all variables’ values must be perfectly obtained; not even a
tiny deviation is allowed. This phenomenon where “small differences
in initial conditions produce very great ones in the final phenomena”
(Poincare) is studied by the theory of chaos.
In the next section, we will explore a simple dynamic system and show
how it could produce chaos.
Population Model
Assume that the population growth of a place can be modelled by the
following equation. The population n(t+1) at time t+1 equals the
current population n(t) at time t minus the deaths, represented by
n(t)²/k (how overpopulated the place is with respect to the maximum
allowable population k), with the whole expression multiplied by the
birth rate R:
n(t+1) = R × (n(t) − n(t)²/k)
With some algebraic manipulation (substituting x(t) = n(t)/k), we can
convert this to an equation in a single scaled variable:
x(t+1) = R × x(t) × (1 − x(t))
The last equation relating x(t) and x(t+1) is a typical example of an
iteration equation. To get more intuition about what this equation
entails, we shall analyse it with three charts. All of these simulations
are done with NetLogo; the code is provided in the course as well.
x(t) vs x(t+1)
This chart (or map) plots the current state x(t) against the next state
x(t+1). We can see a one-humped map on the chart (this is basically a
plot of R·x(t)·(1 − x(t))). As we trace the point's journey over time,
we can see it oscillates for a while before settling down at about
x = 0.64.
This behavior, however, changes if we set the value of the parameter R to
something else, say 3.2. The point then perpetually oscillates between two
locations. It still settles down, albeit not to a single static location but
to two possible values.
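The course charts were produced with NetLogo, but the same behaviour can be reproduced with a minimal Python sketch of the iteration x(t+1) = R·x(t)·(1 − x(t)); the R values follow the two cases discussed here:
def iterate_logistic(R, x0=0.3, steps=200):
    """Iterate the logistic map x(t+1) = R * x(t) * (1 - x(t))."""
    x = x0
    trajectory = []
    for _ in range(steps):
        x = R * x * (1 - x)
        trajectory.append(x)
    return trajectory

# R = 2.82: the trajectory settles to a single value (one attractor, about 0.645)
print(iterate_logistic(2.82)[-4:])

# R = 3.2: the trajectory ends up oscillating between two values (about 0.513 and 0.799)
print(iterate_logistic(3.2)[-4:])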
t vs x(t)
For each of the above logistic maps, we can also see how the point's
position varies over time, that is, by plotting time t against x(t).
R = 2.82
For the first logistic map (R = 2.82), we see that the value of x
eventually settles down to a single value. This is why we saw that the
blue dot in the previous chart stops moving after a while.
R = 3.2
Meanwhile for the second logistic map (R = 3.2), we see that x
eventually oscillates between two values, which explains why the blue
dot in the previous chart perpetually moved between two locations.
R vs x
Remember from the x(t) vs x(t+1) chart that the value of R determines
how the dynamics converge over time. In this section, we will plot R vs
the number of stable points that the system settles down to (a.k.a.
attractors).
As we can see, for R = 2.82, there’s only 1 attractor (around 0.64 to
0.65) while for R =3.2, two attractors are observed. Notice that there’s
a tendency for the number of attractors to increase as R grows. At
some point (when R is about 3.569946…), the number of attractors
reaches infinity (see the darkly shaded region in the diagram). At this
point, the system oscillates very widely, displaying almost no pattern.
This point is also known as the onset of chaos.
ANN(Artificial Neural Networks):-
An ANN usually involves a large number of processors
operating in parallel and arranged in tiers. The first tier
receives the raw input information -- analogous to optic
nerves in human visual processing. Each successive tier
receives the output from the tier preceding it, rather than
the raw input -- in the same way neurons further from the
optic nerve receive signals from those closer to it. The last
tier produces the output of the system.
Each processing node has its own small sphere of
knowledge, including what it has seen and any rules it was
originally programmed with or developed for itself. The
tiers are highly interconnected, which means each node in
tier n will be connected to many nodes in tier n-1 -- its
inputs -- and in tier n+1, which provides input data for
those nodes. There may be one or multiple nodes in the
output layer, from which the answer it produces can be
read.
Artificial neural networks are notable for being adaptive,
which means they modify themselves as they learn from
initial training and subsequent runs provide more
information about the world. The most basic learning
model is centered on weighting the input streams, which is
how each node weights the importance of input data from
each of its predecessors. Inputs that contribute to getting
right answers are weighted higher.
Architectures of Neural Network:
An ANN is a computational system consisting of many interconnected units called
artificial neurons. The connections between artificial neurons can transmit a
signal from one neuron to another. So, there are multiple possibilities for
connecting the neurons, and the architecture we adopt depends on the
specific solution. Some permutations and combinations are as follows:
● There may be just two layers of neuron in the network – the input and
output layer.
● There can be one or more intermediate ‘hidden’ layers of a neuron.
● The neurons may be connected with all neurons in the next layer and
so on …..
So let’s start talking about the various possible architectures:
A. Single-layer Feed Forward Network:
It is the simplest and most basic architecture of ANNs. It consists of only two
layers: the input layer and the output layer. The input layer consists of 'm' input
neurons connected to each of the 'n' output neurons. The connections carry
weights w11 and so on. The input layer neurons don't conduct any
processing; they pass the input signals on to the output neurons. The computations
are performed in the output layer. So, though it has 2 layers of neurons, only one
layer is performing the computation. This is the reason why the network is
known as a SINGLE layer network. Also, the signals always flow from the input
layer to the output layer. Hence, the network is known as FEED FORWARD.
The net signal input to the jth output neuron is given by the weighted sum
y_in_j = Σi xi · wij over all input neurons i.
The signal output from each output neuron will depend on the activation function
used.
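A minimal Python sketch of this computation for a hypothetical network with m = 3 inputs and n = 2 output neurons; the weight values and the sigmoid activation are illustrative choices only:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])           # m = 3 input signals
W = np.array([[0.2, -0.4],               # W[i, j]: weight from input i to output neuron j
              [0.7,  0.1],
              [-0.3, 0.5]])

net_input = x @ W                        # y_in_j = sum_i x_i * w_ij for each output neuron
output = sigmoid(net_input)              # output depends on the chosen activation function
print(net_input, output)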
B. Multi-layer Feed Forward Network:
Multi-Layer Feed Forward Network
The multi-layer feed-forward network is quite similar to the single-layer
feed-forward network, except for the fact that there are one or more intermediate
layers of neurons between the input and output layer. Hence, the network is
termed as multi-layer. Each of the layers may have a varying number of
neurons. For example, the one shown in the above diagram has ‘m’ neurons in
the input layer and ‘r’ neurons in the output layer and there is only one hidden
layer with ‘n’ neurons.
The net signal input to the kth hidden layer neuron is the weighted sum
y_in_k = Σi xi · wik, and the net signal input to the jth neuron in the output
layer is given by y_in_j = Σk zk · w'kj, where zk is the output of the kth
hidden neuron.
C. Competitive Network:
It is the same as the single-layer feed-forward network in structure. The only
difference is that the output neurons are connected with each other (either
partially or fully). Below is the diagram for this type of network.
Competitive Network
According to the diagram, it is clear that some of the output neurons are
interconnected with each other. For a given input, the output neurons compete
among themselves to represent the input. This represents a form of
unsupervised learning in ANNs that is suitable for finding clusters in a
data set.
D. Recurrent Network:
Recurrent Network
In feed-forward networks, the signal always flows from the input layer towards
the output layer (in one direction only). In the case of recurrent neural networks,
there is a feedback loop (from the neurons in the output layer to the input layer
neurons). There can be self-loops too.
Learning Process In ANN:
Learning process in ANN mainly depends on four factors, they are:
1. The number of layers in the network (Single-layered or
multi-layered)
2. Direction of signal flow (Feedforward or recurrent)
3. Number of nodes in layers: The number of nodes in the input layer is
equal to the number of features of the input data set. The number of
output nodes will depend on the possible outcomes, i.e. the number of
classes in the case of supervised learning. But the number of nodes in the
hidden layer is to be chosen by the user. A larger number of nodes in
the hidden layer generally gives higher performance, but too many nodes may
result in overfitting as well as increased computational expense.
4. Weight of Interconnected Nodes: Deciding the values of the weights
attached to each interconnection between neurons so that a
specific learning problem is solved correctly is quite a difficult
problem by itself. Take the example of a multi-layered feed-forward
network: we have to train an ANN model using some data so that it can
classify a new data point, say p_5 = (3, -2). Say we have deduced that
p_1 = (5, 2) and p_2 = (-1, 12) belong to class C1 while p_3 = (3, -5) and
p_4 = (-2, -1) belong to class C2. We assume the values of the synaptic
weights w_0, w_1, w_2 to be -2, 1/2 and 1/4 respectively. But we will NOT
get such working weight values for every learning problem. For solving a
learning problem with an ANN, we can start with a set of values for the
synaptic weights and keep changing those over multiple iterations (see the
sketch after this list). The stopping criterion may be a rate of
misclassification < 1%, or the maximum number of iterations reaching a
threshold value such as 25. A further problem is that the rate of
misclassification may not reduce progressively.
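As referenced in point 4 above, here is a minimal Python sketch of this iterative idea using a simple perceptron-style update on the four hypothetical points p_1 ... p_4 (a single-layer stand-in for illustration; a true multi-layer network would be trained with backpropagation instead):
import numpy as np

# hypothetical training points and classes (C1 -> +1, C2 -> -1)
X = np.array([[5, 2], [-1, 12], [3, -5], [-2, -1]], dtype=float)
t = np.array([1, 1, -1, -1])

w = np.array([-2.0, 0.5, 0.25])       # initial weights w_0 (bias), w_1, w_2
Xb = np.hstack([np.ones((4, 1)), X])  # prepend 1 for the bias term

for epoch in range(25):               # stop after at most 25 iterations
    errors = 0
    for xi, ti in zip(Xb, t):
        if np.sign(xi @ w) != ti:     # misclassified: nudge weights toward the target
            w += 0.1 * ti * xi
            errors += 1
    if errors == 0:                   # stopping criterion: no misclassifications
        break

print(w, np.sign(Xb @ w))             # learned weights and predictions
print(np.sign(np.array([1, 3, -2]) @ w))  # classify the new point p_5 = (3, -2)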
Types of neural networks
Specific types of artificial neural networks include:
● Feed-forward neural networks: one of the simplest variants of neural
networks. They pass information in one direction, through various
input nodes, until it makes it to the output node. The network may or
may not have hidden node layers, making their functioning more
interpretable. It is prepared to process large amounts of noise. This
type of ANN computational model is used in technologies such as
facial recognition and computer vision.
● Recurrent neural networks: more complex. They save the output of
processing nodes and feed the result back into the model. This is
how the model is said to learn to predict the outcome of a layer. Each
node in the RNN model acts as a memory cell, continuing the
computation and implementation of operations. This neural network
starts with the same front propagation as a feed-forward network, but
then goes on to remember all processed information in order to
reuse it in the future. If the network's prediction is incorrect, then the
system self-learns and continues working towards the correct
prediction during backpropagation. This type of ANN is frequently
used in text-to-speech conversions.
● Convolutional neural networks: one of the most popular models used
today. This neural network computational model uses a variation of
multilayer perceptrons and contains one or more convolutional layers
that can be either entirely connected or pooled. These convolutional
layers create feature maps that record a region of the image, which is
ultimately broken into rectangles and sent out for nonlinear processing. The CNN
model is particularly popular in the realm of image recognition; it has
been used in many of the most advanced applications of AI,
including facial recognition, text digitization and natural language
processing. Other uses include paraphrase detection, signal
processing and image classification.
● Deconvolutional neural networks: utilize a reversed CNN model
process. They aim to find lost features or signals that may have
originally been considered unimportant to the CNN system's task.
This network model can be used in image synthesis and analysis.
● Modular neural networks: contain multiple neural networks working
separately from one another. The networks do not communicate or
interfere with each other's activities during the computation process.
Consequently, complex or big computational processes can be
performed more efficiently.
Advantages of artificial neural networks
Advantages of artificial neural networks include:
● Parallel processing abilities mean the network can perform more
than one job at a time.
● Information is stored on an entire network, not just a database.
● The ability to learn and model nonlinear, complex relationships helps
model the real-life relationships between input and output.
● Fault tolerance means the corruption of one or more cells of the ANN
will not stop the generation of output.
● Gradual corruption means the network will slowly degrade over time,
instead of a problem destroying the network instantly.
● The ability to produce output even with incomplete knowledge, with the
loss of performance depending on how important the missing
information is.
● No restrictions are placed on the input variables, such as how they
should be distributed.
● Machine learning means the ANN can learn from events and make
decisions based on the observations.
● The ability to learn hidden relationships in the data without
commanding any fixed relationship means an ANN can better model
highly volatile data and non-constant variance.
● The ability to generalize and infer unseen relationships on unseen
data means ANNs can predict the output of unseen data.
Disadvantages of artificial neural networks
The disadvantages of ANNs include:
● The lack of rules for determining the proper network structure means
the appropriate artificial neural network architecture can only be
found through trial and error and experience.
● The requirement of processors with parallel processing abilities
makes neural networks hardware-dependent.
● The network works with numerical information, therefore all problems
must be translated into numerical values before they can be
presented to the ANN.
● The lack of explanation behind the solutions it produces is one of the
biggest disadvantages of ANNs. The inability to explain the why or
how behind a solution generates a lack of trust in the network.
Applications of artificial neural networks
Image recognition was one of the first areas to which neural networks were
successfully applied, but the technology uses have expanded to many more
areas, including:
● Chatbots
● Natural language processing, translation and language generation
● Stock market prediction
● Delivery driver route planning and optimization
● Drug discovery and development
These are just a few specific areas to which neural networks are being applied
today. Prime uses involve any process that operates according to strict rules
or patterns and has large amounts of data. If the data involved is too large for
a human to make sense of in a reasonable amount of time, the process is
likely a prime candidate for automation through artificial neural networks.
Generalization
In machine learning, generalization describes how well a trained model
classifies or forecasts unseen data. Training a generalized machine
learning model means, broadly, that it works for any subset of unseen
data. An example is training a model to classify between dogs and cats.
If the model is provided with a dog-image dataset containing only two
breeds, it may obtain good performance there, but it will likely get a low
classification score when tested on other breeds of dogs. The model
may then even classify an actual dog image from the unseen dataset as
a cat. Data diversity is therefore a very important factor in making good
predictions.
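As a rough illustration of how generalization is usually measured, the sketch below trains a toy model on one part of the data and scores it on a held-out part it has never seen. The random data and the nearest-centroid "model" are placeholders chosen only for brevity, not taken from the text above.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy labels

split = int(0.75 * len(X))                    # hold out 25% as "unseen" data
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# nearest-centroid "model": one centroid per class, learned from training data only
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X_test[:, None, :] - centroids) ** 2).sum(axis=2), axis=1)

print("accuracy on unseen data:", (pred == y_test).mean())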
COMPETITIVE LEARNING:-
The basic architecture of a competitive learning system is shown below. It consists of a set of
hierarchically layered units in which each layer connects, via excitatory connections, with the
layer immediately above it, and has inhibitory connections to units in its own layer. In the most
general case, each unit in a layer receives an input from each unit in the layer immediately
below it and projects to each unit in the layer immediately above it. Moreover, within a layer, the
units are broken into a set of inhibitory clusters in which all elements within a cluster inhibit all
other elements in the cluster. Thus the elements within a cluster at one level compete with one
another to respond to the pattern appearing on the layer below. The more strongly any particular
unit responds to an incoming stimulus, the more it shuts down the other members of its cluster.
Figure: The architecture of the competitive learning mechanism.
● The units in a given layer are broken into several sets of non overlapping clusters. Each
unit within a cluster inhibits every other unit within a cluster. Within each cluster, the unit
receiving the largest input achieves its maximum value while all other units in the cluster
are pushed to their minimum value. We have arbitrarily set the maximum value to 1 and
the minimum value to 0.
● Every unit in every cluster receives inputs from all members of the same set of input
units.
● A unit learns if and only if it wins the competition with other units in its cluster.
● A stimulus pattern Sj consists of a binary pattern in which each element of the pattern is
either active or inactive. An active element is assigned the value 1 and an inactive
element is assigned the value 0.
● Each unit has a fixed amount of weight (all weights are positive) that is distributed
among its input lines. The weight on the line connecting to unit i on the upper layer from unit j on
the lower layer is designated w_ij, and the fixed total amount of weight for unit i is ∑_j w_ij = 1.
A unit learns by shifting weight from its inactive to its active input lines. If a unit does not
respond to a particular pattern, no learning takes place in that unit. If a unit wins the competition,
then each of its input lines gives up some portion ϵ of its weight and that weight is then
distributed equally among the active input lines. Mathematically, this learning rule can be stated
as: Δw_ij = 0 if unit i loses on stimulus k, and Δw_ij = ϵ·c_jk/n_k − ϵ·w_ij if unit i wins on
stimulus k, where c_jk is 1 if input line j is active in stimulus k (and 0 otherwise) and n_k is the
number of active input lines in that stimulus (a numpy sketch of this update follows).
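A minimal numpy sketch of one such learning step; the layer sizes, the stimulus pattern and ϵ = 0.1 are illustrative values, not taken from the text above.

import numpy as np

rng = np.random.default_rng(1)
n_inputs, n_units, eps = 6, 3, 0.1

# each upper-layer unit starts with positive weights that sum to 1
W = rng.random((n_units, n_inputs))
W /= W.sum(axis=1, keepdims=True)

def competitive_step(W, s, eps):
    # one learning step for a binary stimulus pattern s
    winner = np.argmax(W @ s)                 # unit with the largest input wins
    n_active = s.sum()                        # number of active input lines
    W[winner] += -eps * W[winner] + eps * s / n_active
    return winner

s = np.array([1, 0, 1, 1, 0, 0])              # binary stimulus pattern
won = competitive_step(W, s, eps)
print("winner:", won, "weights still sum to 1:", round(W[won].sum(), 6))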
What Is Principal Component Analysis?
Principal Component Analysis, or PCA, is a
dimensionality-reduction method that is often used to reduce the
dimensionality of large data sets, by transforming a large set of
variables into a smaller one that still contains most of the
information in the large set.
Reducing the number of variables of a data set naturally comes at
the expense of accuracy, but the trick in dimensionality reduction
is to trade a little accuracy for simplicity, because smaller data sets
are easier to explore and visualize, and machine learning algorithms
can analyze the data much faster without extraneous variables to
process.
So to sum up, the idea of PCA is simple — reduce the number of
variables of a data set, while preserving as much information as
possible.
Step by Step Explanation of PCA
STEP 1: STANDARDIZATION
The aim of this step is to standardize the range of the continuous
initial variables so that each one of them contributes equally to the
analysis.
More specifically, the reason why it is critical to perform
standardization prior to PCA, is that the latter is quite sensitive
regarding the variances of the initial variables. That is, if there are
large differences between the ranges of initial variables, those
variables with larger ranges will dominate over those with small
ranges (For example, a variable that ranges between 0 and 100 will
dominate over a variable that ranges between 0 and 1), which will
lead to biased results. So, transforming the data to comparable
scales can prevent this problem.
Mathematically, this can be done by subtracting the mean and
dividing by the standard deviation for each value of each variable.
Once the standardization is done, all the variables will be
transformed to the same scale.
STEP 2: COVARIANCE MATRIX COMPUTATION
The aim of this step is to understand how the variables of the input
data set are varying from the mean with respect to each other, or
in other words, to see if there is any relationship between them.
Because sometimes, variables are highly correlated in such a way
that they contain redundant information. So, in order to identify
these correlations, we compute the covariance matrix.
The covariance matrix is a p × p symmetric matrix (where p is the
number of dimensions) that has as entries the covariances
associated with all possible pairs of the initial variables. For
example, for a 3-dimensional data set with 3 variables x, y and z,
the covariance matrix is a 3×3 matrix of this form:
Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)
Since the covariance of a variable with itself is its variance
(Cov(a,a)=Var(a)), in the main diagonal (Top left to bottom right) we
actually have the variances of each initial variable. And since the
covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the
covariance matrix are symmetric with respect to the main
diagonal, which means that the upper and the lower triangular
portions are equal.
What do the covariances that we have as entries of the matrix
tell us about the correlations between the variables?
It’s actually the sign of the covariance that matters:
● if positive then : the two variables increase or decrease together
(correlated)
● if negative then : One increases when the other decreases
(Inversely correlated)
Now that we know that the covariance matrix is nothing more than a
table that summarizes the correlations between all the possible
pairs of variables, let’s move to the next step.
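A minimal numpy sketch of this step, using randomly generated standardized data as a stand-in for a real data set:

import numpy as np

rng = np.random.default_rng(2)
X_std = rng.normal(size=(50, 3))               # stand-in for standardized data (x, y, z)

cov_matrix = np.cov(X_std, rowvar=False)       # 3 x 3 sample covariance matrix
print(np.allclose(cov_matrix, cov_matrix.T))   # True: symmetric about the main diagonal
print(np.diag(cov_matrix))                     # variances of x, y and z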
STEP 3: COMPUTE THE EIGENVECTORS AND
EIGENVALUES OF THE COVARIANCE MATRIX TO
IDENTIFY THE PRINCIPAL COMPONENTS
Eigenvectors and eigenvalues are the linear algebra concepts that
we need to compute from the covariance matrix in order to
determine the principal components of the data. Before getting to
the explanation of these concepts, let’s first understand what do
we mean by principal components.
Principal components are new variables that are constructed as
linear combinations or mixtures of the initial variables. These
combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the
information within the initial variables is squeezed or compressed
into the first components. So, the idea is that 10-dimensional data
gives you 10 principal components, but PCA tries to put the maximum
possible information in the first component, then the maximum
remaining information in the second, and so on, until having
something like what is shown in the scree plot below.
Figure: Percentage of variance (information) carried by each principal component (scree plot).
Organizing information in principal components this way allows you
to reduce dimensionality without losing much information, by
discarding the components with low information and treating the
remaining components as your new variables.
An important thing to realize here is that the principal
components are less interpretable and don’t have any real meaning,
since they are constructed as linear combinations of the initial
variables.
Geometrically speaking, principal components represent the
directions of the data that explain a maximal amount of variance,
that is to say, the lines that capture most information of the data.
The relationship between variance and information here is that the
larger the variance carried by a line, the larger the dispersion
of the data points along it, and the larger the dispersion along a
line, the more information it carries. To put all this simply, just
think of principal components as new axes that provide the best
angle to see and evaluate the data, so that the differences between
the observations are better visible.
How PCA Constructs the Principal
Components
As there are as many principal components as there are variables
in the data, principal components are constructed in such a
manner that the first principal component accounts for the largest
possible variance in the data set. For example, let’s assume that
the scatter plot of our data set is as shown below: can we guess the
first principal component? Yes, it’s approximately the line that
matches the purple marks, because it goes through the origin and
it’s the line along which the projection of the points (red dots) is the
most spread out. Or, mathematically speaking, it’s the line that
maximizes the variance (the average of the squared distances from
the projected points (red dots) to the origin).
The second principal component is calculated in the same way,
with the condition that it is uncorrelated with (i.e., perpendicular
to) the first principal component and that it accounts for the next
highest variance.
This continues until a total of p principal components have been
calculated, equal to the original number of variables.
Now that we understand what we mean by principal components,
let’s go back to eigenvectors and eigenvalues. The first thing to
know about them is that they always come in pairs, so every
eigenvector has an eigenvalue. Their number is equal to the
number of dimensions of the data. For example, for a
3-dimensional data set there are 3 variables, and therefore 3
eigenvectors with 3 corresponding eigenvalues.
Without further ado, it is the eigenvectors and eigenvalues that are
behind all the magic explained above: the eigenvectors of the
covariance matrix are the directions of the axes along which there
is the most variance (the most information), and these are what we
call the principal components. The eigenvalues are simply the
coefficients attached to the eigenvectors, and they give the amount
of variance carried in each principal component.
By ranking your eigenvectors in order of their eigenvalues, highest
to lowest, you get the principal components in order of
significance.
Example:
Let’s suppose that our data set is 2-dimensional with 2 variables
x,y and that the eigenvectors and eigenvalues of the covariance
matrix are as follows:
If we rank the eigenvalues in descending order, we get λ1 > λ2, which
means that the eigenvector that corresponds to the first principal
component (PC1) is v1 and the one that corresponds to the second
component (PC2) is v2.
After having the principal components, to compute the percentage
of variance (information) accounted for by each component, we
divide the eigenvalue of each component by the sum of
eigenvalues. If we apply this on the example above, we find that
PC1 and PC2 carry respectively 96% and 4% of the variance of the
data.
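A minimal numpy sketch of Step 3, again on stand-in data: np.linalg.eigh returns the eigenvalues in ascending order, so they are re-sorted from largest to smallest before computing each component's share of the variance.

import numpy as np

rng = np.random.default_rng(3)
X_std = rng.normal(size=(100, 3))              # stand-in for standardized data
cov_matrix = np.cov(X_std, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)   # eigh: for symmetric matrices
order = np.argsort(eigenvalues)[::-1]                    # largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]                    # columns are the eigenvectors

explained = eigenvalues / eigenvalues.sum()              # share of variance per component
print("variance explained by each PC:", np.round(explained, 3))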
STEP 4: FEATURE VECTOR
As we saw in the previous step, computing the eigenvectors and
ordering them by their eigenvalues in descending order allows us
to find the principal components in order of significance. In this
step, what we do is choose whether to keep all these components
or discard those of lesser significance (those with low eigenvalues),
and form with the remaining ones a matrix of vectors that we call
the feature vector.
So, the feature vector is simply a matrix that has as columns the
eigenvectors of the components that we decide to keep. This
makes it the first step towards dimensionality reduction, because
if we choose to keep only k eigenvectors (components) out of the
original p, the final data set will have only k dimensions.
Example:
Continuing with the example from the previous step, we can either
form a feature vector with both of the eigenvectors v1 and v2:
Or discard the eigenvector v2, which is the one of lesser
significance, and form a feature vector with v1 only:
Discarding the eigenvector v2 will reduce the dimensionality by 1
and will consequently cause a loss of information in the final data
set. But given that v2 was carrying only 4% of the information, the
loss is not important, and we will still have the 96% of the
information that is carried by v1.
So, as we saw in the example, it’s up to you to choose whether to
keep all the components or discard the ones of lesser significance,
depending on what you are looking for.
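Putting Steps 1 to 4 together, the sketch below standardizes toy data, builds the covariance matrix, keeps the top k eigenvectors as the feature vector and projects the data onto them. The generated data and the choice k = 2 are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated toy data

X_std = (X - X.mean(axis=0)) / X.std(axis=0)              # Step 1: standardization
cov_matrix = np.cov(X_std, rowvar=False)                  # Step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov_matrix)             # Step 3: eigen-decomposition
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                                     # Step 4: keep 2 components
feature_vector = eigvecs[:, :k]                           # matrix of kept eigenvectors
X_reduced = X_std @ feature_vector                        # project data onto them
print(X_reduced.shape)                                    # (100, 2)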
Fuzzy Logic:- The term fuzzy refers to things that are not clear or are vague. In
the real world we often encounter situations in which we cannot determine
whether a state is true or false; in such cases, fuzzy logic provides very valuable
flexibility for reasoning, allowing us to account for the inaccuracies and
uncertainties of the situation.
In a Boolean system, the truth value 1.0 represents absolute truth and 0.0
represents absolute falsity. In fuzzy logic there is no restriction to absolute truth
and absolute falsity: intermediate values are also allowed, representing
statements that are partially true and partially false.
ARCHITECTURE
Its architecture contains four parts (a small worked sketch follows this list):
● RULE BASE: It contains the set of rules and the IF-THEN conditions
provided by the experts to govern the decision-making system, on the
basis of linguistic information. Recent developments in fuzzy theory
offer several effective methods for the design and tuning of fuzzy
controllers. Most of these developments reduce the number of fuzzy
rules.
● FUZZIFICATION: It is used to convert inputs, i.e. crisp numbers, into
fuzzy sets. Crisp inputs are basically the exact inputs measured by
sensors and passed into the control system for processing, such as
temperature, pressure, rpm, etc.
● INFERENCE ENGINE: It determines the matching degree of the current
fuzzy input with respect to each rule and decides which rules are to be
fired according to the input field. Next, the fired rules are combined to
form the control actions.
● DEFUZZIFICATION: It is used to convert the fuzzy sets obtained by the
inference engine into a crisp value. There are several defuzzification
methods available and the best-suited one is used with a specific expert
system to reduce the error.
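The sketch below runs these four stages on a single made-up example: a crisp temperature is fuzzified with triangular membership functions, two illustrative IF-THEN rules are fired, and a weighted-average defuzzification produces a crisp fan speed. The rules, membership functions and numbers are invented for illustration only.

def tri(x, a, b, c):
    # triangular membership function that peaks at b and is 0 outside [a, c]
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

temperature = 28.0                                   # crisp sensor input

# Fuzzification: degrees of membership in "warm" and "hot"
mu_warm = tri(temperature, 15, 25, 35)
mu_hot = tri(temperature, 25, 40, 55)

# Rule base / inference engine:
#   IF temperature is warm THEN fan speed = 40
#   IF temperature is hot  THEN fan speed = 90
rules = [(mu_warm, 40.0), (mu_hot, 90.0)]

# Defuzzification: weighted average of the fired rules' outputs
fired = [(mu, out) for mu, out in rules if mu > 0]
fan_speed = sum(mu * out for mu, out in fired) / sum(mu for mu, _ in fired)
print(round(fan_speed, 1))                           # a crisp control value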
Fuzzy decision trees:--