
Notes of Data Analytics Unit - II

[By Dr K. S. Mishra, MCA Dept, MIET, Meerut]


Regression is a method to mathematically formulate
relationships between variables that in due course can be
used to estimate, interpolate and extrapolate. Suppose we
want to estimate the weight of individuals, which is
influenced by height, diet, workout, etc. Here, Weight is
the predicted variable. Height, Diet, Workout are
predictor variables.
The predicted variable is a dependent variable in the
sense that it depends on predictors. Predictors are also
called independent variables. Regression reveals to what
extent the predicted variable is affected by the predictors.
In other words, what amount of variation in predictors will
result in variations of the predicted variable. The predicted
variable is mathematically represented as Y. The predictor variables are
represented as X. This mathematical relationship is often called the
regression model.
Regression is a branch of statistics. There are many types
of regression. Regression is commonly used for prediction
and forecasting.

Discussion

●​ What's a typical process for performing regression analysis?

●​ First select a suitable predicted variable with acceptable measurement
qualities such as reliability and validity. Likewise, select the predictors. When
there's a single predictor, we call it bivariate analysis; anything more, we call it
multivariate analysis.

●​ Collect a sufficient number of data points. Use a suitable estimation
technique to arrive at the mathematical formula between predicted and
predictor variables. No model is perfect. Hence, give error bounds.

●​ Finally, assess the model's stability by applying it to different samples of the
same population. When predictor variables are given for a new data point,
estimate the predicted variable. If stable, the model's accuracy should not
decrease. This process is called model cross-validation.

●​ I've heard of Least Squares. What's this and how is it related to regression?

Least Squares is a term that signifies that the sum of squared errors is at a
minimum. The error is defined as the difference between the observed value
and the predicted value. The objective of regression estimation is to produce
the least squared errors. When the training error approaches zero, the model
may be overfitting.

The Least Squares Method provides linear equations with unknowns that can
be solved for any given data. The unknowns are the regression parameters.
The linear equations are called Normal Equations. The normal equations are
derived using calculus to minimize the squared errors.
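
As a minimal sketch of the Least Squares idea (not part of the original notes), the following Python snippet fits a line y = a·x + b by solving the normal equations with NumPy; the data values and variable names are illustrative only.

```python
import numpy as np

# Illustrative data: predictor x and predicted variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Design matrix with a column of ones for the intercept b
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: (X^T X) [a, b]^T = X^T y
a, b = np.linalg.solve(X.T @ X, X.T @ y)

predictions = a * x + b
sum_squared_error = np.sum((y - predictions) ** 2)
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {sum_squared_error:.4f}")
```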
What is correlation? How is it related to regression?

The correlation coefficient r has the following formula:

r = ∑(xi − x̄)(yi − ȳ) / √( ∑(xi − x̄)² ∑(yi − ȳ)² ), where x̄ and ȳ are the averages of x and y.
It is called the Pearson Product Moment Correlation (PPMC). r takes values in
[-1, 1].
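
A short, hedged illustration (not from the notes) of computing r with the formula above in Python; the sample data is made up.

```python
import numpy as np

def pearson_r(x, y):
    # r = sum((xi - xbar)(yi - ybar)) / sqrt(sum((xi - xbar)^2) * sum((yi - ybar)^2))
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0 (perfect positive correlation)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0 (perfect negative correlation)
```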

Types of correlation

How can we do data analysis when relationships are non-linear?


Suppose (x1,y1), (x2,y2), (x3,y3), (x4,y4) is a data set.
Let the regression line be Y = ax + b (using the least squares method).
Then (y1 − (a·x1 + b))² + (y2 − (a·x2 + b))² + (y3 − (a·x3 + b))² + (y4 − (a·x4 + b))² should be a minimum.

What is the causal relationship in regression?


Causality or causation refers to the idea that variation in a predictor
X causes variation in the predicted variable Y. This is distinct from regression,
which is more about prediction: regression quantifies the association between
X and Y but does not by itself establish that X causes Y.

Multivariate analysis (MVA) is a statistical procedure for the analysis of
data involving more than one type of measurement or observation. It
may also mean solving problems where more than one dependent
variable is analyzed simultaneously with other variables.
Advantages and Disadvantages of Multivariate
Analysis

Advantages

●​ The main advantage of multivariate analysis is that, since it considers
more than one independent variable influencing the
variability of the dependent variables, the conclusions drawn are more accurate.
●​ The conclusions are more realistic and nearer to the real-life situation.

Disadvantages

●​ The main disadvantage of MVA is that it requires rather complex
computations to arrive at a satisfactory conclusion.
●​ Many observations for a large number of variables need to be collected
and tabulated; it is a rather time-consuming process.

Classification Chart of Multivariate Techniques

Selection of the appropriate multivariate technique depends upon-

a) Are the variables divided into independent and dependent classification?

b) If Yes, how many variables are treated as dependents in a single analysis?

c) How are the variables, both dependent and independent measured?

Multivariate analysis techniques can be classified into two broad categories.
This classification depends upon the question: are the involved variables dependent
on each other or not?

If the answer is yes: We have Dependence methods.


If the answer is no: We have Interdependence methods.

Dependence technique: Dependence techniques are types of multivariate
analysis techniques that are used when one or more of the variables can be
identified as dependent variables and the remaining variables can be identified as
independent.

Interdependence Technique
Interdependence techniques are used when the variables cannot be classified as either
dependent or independent.

It aims to unravel relationships between variables and/or subjects without explicitly assuming
specific distributions for the variables. The idea is to describe the patterns in the data without
making (very) strong assumptions about the variables.

Multiple Regression
Multiple Regression Analysis– Multiple regression is an extension of simple linear regression.
It is used when we want to predict the value of a variable based on the value of two or more
other variables. The variable we want to predict is called the dependent variable (or sometimes,
the outcome, target, or criterion variable). Multiple regression uses multiple “x” variables,
one for each independent variable; each observation takes the form ((x1)1, (x2)1, (x3)1, Y1).

y=ax1+bx2+cx3+d is an example of multiple regression.
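
As an illustrative sketch (the data values below are invented, not taken from the notes), the multiple regression y = a·x1 + b·x2 + c·x3 + d can be fitted with NumPy's least-squares solver:

```python
import numpy as np

# Made-up data: three predictors and one predicted variable
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.0],
              [4.0, 3.0, 2.5],
              [5.0, 5.0, 3.0]])
y = np.array([7.0, 9.5, 16.0, 18.5, 23.0])

# Append a column of ones so the intercept d is estimated as well
X_design = np.column_stack([X, np.ones(len(X))])

# Solve min ||X_design @ coef - y||^2 (least squares)
coef, residuals, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)
a, b, c, d = coef
print(f"y = {a:.2f}*x1 + {b:.2f}*x2 + {c:.2f}*x3 + {d:.2f}")
```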

Bayesian statistics:- is a theory in the field of statistics based on the
Bayesian interpretation of probability, where probability expresses a degree of
belief in an event. The degree of belief may be based on prior knowledge
about the event, such as the results of previous experiments, or on personal
beliefs about the event. This differs from a number of other interpretations of
probability, such as the frequentist interpretation that views probability as the
limit of the relative frequency of an event after many trials.
Bayesian statistical methods use Bayes' theorem to compute and update
probabilities after obtaining new data. Bayes' theorem describes the
conditional probability of an event based on data as well as prior information
or beliefs about the event or conditions related to the event. For example, in
Bayesian inference, Bayes' theorem can be used to estimate the parameters
of a probability distribution or statistical model. Since Bayesian statistics treats
probability as a degree of belief, Bayes' theorem can directly assign a
probability distribution that quantifies the belief to the parameter or set of
parameters. Bayesian statistics is named after Thomas Bayes, who formulated
a specific case of Bayes' theorem in a paper published in 1763. In several
papers spanning from the late 18th to the early 19th centuries, Pierre-Simon
Laplace developed the Bayesian interpretation of probability. Laplace used
methods that would now be considered Bayesian to solve a number of
statistical problems. Many Bayesian methods were developed by later
authors, but the term was not commonly used to describe such methods until
the 1950s. During much of the 20th century, Bayesian methods were viewed
unfavorably by many statisticians due to philosophical and practical
considerations. Many Bayesian methods required much computation to
complete, and most methods that were widely used during the century were
based on the frequentist interpretation. However, with the advent of powerful
computers and new algorithms like the Markov chain Monte Carlo, Bayesian
methods have seen increasing use within statistics in the 21st century.

Bayesian inference

Bayesian inference refers to statistical inference where uncertainty in
inferences is quantified using probability. In classical frequentist
inference, model parameters and hypotheses are considered to be fixed.
Probabilities are not assigned to parameters or hypotheses in frequentist
inference. For example, it would not make sense in frequentist inference
to directly assign a probability to an event that can only happen once,
such as the result of the next flip of a fair coin. However, it would make
sense to state that the proportion of heads approaches one-half as the
number of coin flips increases.
Statistical models specify a set of statistical assumptions and processes
that represent how the sample data are generated. Statistical models
have a number of parameters that can be modified. For example, a coin
can be represented as samples from a Bernoulli distribution, which
models two possible outcomes. The Bernoulli distribution has a single
parameter equal to the probability of one outcome, which in most cases
is the probability of landing on heads. Devising a good model for the
data is central in Bayesian inference. In most cases, models only
approximate the true process, and may not take into account certain
factors influencing the data. In Bayesian inference, probabilities can be
assigned to model parameters. Parameters can be represented as
random variables. Bayesian inference uses Bayes' theorem to update
probabilities after more evidence is obtained or known.

Statistical modeling
The formulation of statistical models using Bayesian statistics has the
identifying feature of requiring the specification of prior distributions for
any unknown parameters. Indeed, parameters of prior distributions may
themselves have prior distributions, leading to Bayesian hierarchical
modeling, or may be interrelated, leading to Bayesian networks.

Exploratory analysis of Bayesian models


Exploratory analysis of Bayesian models is an adaptation or extension of
the exploratory data analysis approach to the needs and peculiarities of
Bayesian modeling. In the words of Persi Diaconis: "Exploratory data
analysis seeks to reveal structure, or simple descriptions in data. We
look at numbers or graphs and try to find patterns. We pursue leads
suggested by background information, imagination, patterns perceived,
and experience with other data analyses."
The inference process generates a posterior distribution, which has a
central role in Bayesian statistics, together with other distributions like
the posterior predictive distribution and the prior predictive distribution.
The correct visualization, analysis, and interpretation of these
distributions is key to properly answer the questions that motivate the
inference process.
When working with Bayesian models there are a series of related tasks
that need to be addressed besides inference itself:
●​ Diagnoses of the quality of the inference; this is needed when
using numerical methods such as Markov chain Monte Carlo
techniques
●​ Model criticism, including evaluations of both model
assumptions and model predictions
●​ Comparison of models, including model selection or model
averaging
●​ Preparation of the results for a particular audience
All these tasks are part of the Exploratory analysis of Bayesian models
approach and successfully performing them is central to the iterative and
interactive modeling process. These tasks require both numerical and
visual summaries.


Design of experiments
The Bayesian design of experiments includes a concept called 'influence of
prior beliefs'. This approach uses sequential analysis techniques to include
the outcome of earlier experiments in the design of the next experiment. This
is achieved by updating 'beliefs' through the use of prior and posterior
distribution. This allows the design of experiments to make good use of
resources of all types. An example of this is the multi-armed bandit problem.

Overview
●​ The drawbacks of frequentist statistics lead to the need for Bayesian
Statistics
●​ Discover Bayesian Statistics and Bayesian Inference
Introduction
Bayesian Statistics continues to remain incomprehensible in the ignited minds

of many analysts. Being amazed by the incredible power of machine learning,

a lot of us have become unfaithful to statistics. Our focus has narrowed down

to exploring machine learning. Isn’t it true?

We fail to understand that machine learning is not the only way to solve real

world problems. In several situations, it does not help us solve business


problems, even though there is data involved in these problems. To say the

least, knowledge of statistics will allow you to work on complex analytical

problems, irrespective of the size of data.

Thomas Bayes’ ‘Bayes Theorem’ was published in 1763. Even

centuries later, the importance of ‘Bayesian Statistics’ hasn’t faded away. In

fact, today this topic is being taught in great depths in some of the world’s

leading universities.

By the end of this topic, you will have a concrete understanding of Bayesian

Statistics and its associated concepts.


Bayesian Statistics
“Bayesian statistics is a mathematical procedure that applies probabilities to

statistical problems. It provides people the tools to update their beliefs in the

evidence of new data.”

Conditional Probability
It is defined as: “the probability of an event A given B equals the probability of

B and A happening together divided by the probability of B.”

For example: Assume two partially intersecting sets A and B as shown below.

Set A represents one set of events and Set B represents another. We wish to

calculate the probability of A given B has already happened. Lets represent

the happening of event B by shading it with red.


Now since B has happened, the part which now matters for A is the part

shaded in blue, which is the intersection A ∩ B. So, the probability of A given B

turns out to be:

P(A|B) = P(A ∩ B) / P(B)

Therefore, we can write the formula for event B given A has already occurred

by:

P(B|A) = P(A ∩ B) / P(A), or equivalently P(A ∩ B) = P(B|A) × P(A)

Now, the second equation can be rewritten as:

P(A|B) = P(B|A) × P(A) / P(B)

This is known as Conditional Probability.

Let’s try to answer a betting problem with this technique.

Suppose, B be the event of winning of James Hunt. A be the event of raining.

Therefore,
1.​ P(A) =1/2, since it rained twice out of four days.
2.​ P(B) is 1/4, since James won only one race out of four.
3.​ P(A|B)=1, since it rained every time James won.

Substituting the values in the conditional probability formula, we get the

probability to be around 50%, which is about double the 25% when rain

was not taken into account. This further strengthened our belief of James

winning in the light of new evidence, i.e. rain. You must be wondering that this

formula bears close resemblance to something you might have heard a lot

about.

Probably, you guessed it right. It looks like Bayes Theorem.

Bayes theorem is built on top of conditional probability and lies in the heart of

Bayesian Inference. Let’s understand it in detail now.

Bayes Theorem

Bayes Theorem comes into effect when multiple events form an exhaustive

set with another event B. This could be understood with the help of the below

diagram.
Now, if the exhaustive events are A1, A2, ..., An, then B can be written as

B = (B ∩ A1) ∪ (B ∩ A2) ∪ ... ∪ (B ∩ An)

So, the probability of B can be written as

P(B) = P(B ∩ A1) + P(B ∩ A2) + ... + P(B ∩ An)

But P(B ∩ Ai) = P(B|Ai) × P(Ai), so P(B) = Σj P(B|Aj) × P(Aj)

So, replacing P(B) in the equation of conditional probability, we get

P(Ai|B) = P(B|Ai) × P(Ai) / Σj P(B|Aj) × P(Aj)

This is the equation of Bayes Theorem.
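
A minimal numeric check (not from the notes) of the betting example above, using the conditional probability and Bayes theorem formulas just derived; the numbers are the ones listed in the example.

```python
# Betting example: A = it rains, B = James Hunt wins
P_A = 1 / 2        # it rained on 2 of 4 days
P_B = 1 / 4        # James won 1 of 4 races
P_A_given_B = 1.0  # it rained every time James won

# Bayes theorem: P(B|A) = P(A|B) * P(B) / P(A)
P_B_given_A = P_A_given_B * P_B / P_A
print(f"P(win | rain) = {P_B_given_A:.2f}")   # 0.50, double the prior of 0.25
```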


4. Bayesian Inference
There is no point in diving into the theoretical aspect of it. So, we’ll learn how it

works! Let’s take an example of coin tossing to understand the idea behind

Bayesian inference.

An important part of Bayesian inference is the establishment of parameters

and models.

Models are the mathematical formulation of the observed events. Parameters

are the factors in the models affecting the observed data. For example, in

tossing a coin, fairness of coin may be defined as the parameter of coin

denoted by θ. The outcome of the events may be denoted by D.

Answer this now. What is the probability of 4 heads out of 9 tosses (D) given

the fairness of the coin (θ)? i.e. P(D|θ)

Wait, did I ask the right question? No.

We should be more interested in knowing: given an outcome (D), what is the

probability of the coin being fair (θ = 0.5)?

Lets represent it using Bayes Theorem:

P(θ|D)=(P(D|θ) X P(θ))/P(D)
Here, P(θ) is the prior i.e the strength of our belief in the fairness of the coin

before the toss. It is perfectly okay to believe that a coin can have any degree

of fairness between 0 and 1.

P(D|θ) is the likelihood of observing our result given our distribution for θ. If

we knew that coin was fair, this gives the probability of observing the number

of heads in a particular number of flips.

P(D) is the evidence. This is the probability of data as determined by summing

(or integrating) across all possible values of θ, weighted by how strongly we

believe in those particular values of θ.

If we had multiple views of what the fairness of the coin is (but didn’t know for

sure), then this tells us the probability of seeing a certain sequence of flips for

all possibilities of our belief in the coin’s fairness.

P(θ|D) is the posterior belief of our parameters after observing the evidence,

i.e. the number of heads.

From here, we’ll dive deeper into the mathematical implications of this

concept. Don’t worry. Once you understand them, getting to its mathematics is

pretty easy.

To define our model correctly , we need two mathematical models beforehand.

One to represent the likelihood function P(D|θ) and the other for representing
the distribution of prior beliefs . The product of these two gives the posterior

belief P(θ|D) distribution.

Since prior and posterior are both beliefs about the distribution of fairness of

coin, intuition tells us that both should have the same mathematical form.

Keep this in mind. We will come back to it again.

So, there are several functions which support the existence of Bayes' theorem.
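
As a hedged illustration of the idea that prior and posterior share the same mathematical form, the sketch below computes the posterior for the coin example (4 heads in 9 tosses) on a grid of θ values; the uniform prior, grid size, and grid-based normalisation are assumptions made for demonstration, not something specified in the notes.

```python
import numpy as np

heads, tosses = 4, 9                  # observed data D
theta = np.linspace(0, 1, 1001)       # grid of candidate fairness values θ

# Assumed uniform prior P(θ); binomial likelihood P(D|θ) ∝ θ^heads * (1-θ)^(tosses-heads)
prior = np.ones_like(theta)
likelihood = theta**heads * (1 - theta)**(tosses - heads)

# Posterior P(θ|D) ∝ P(D|θ) P(θ); the evidence P(D) is the normalising constant
unnormalised = likelihood * prior
d_theta = theta[1] - theta[0]
posterior = unnormalised / (unnormalised.sum() * d_theta)

posterior_mean = (theta * posterior).sum() * d_theta
print(f"Posterior mean of θ: {posterior_mean:.3f}")   # ≈ 0.455 under a uniform prior
```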

Bayesian Belief Network

A Bayesian belief network is a key technology for dealing with

probabilistic events and for solving problems that involve uncertainty. We can

define a Bayesian network as:


"A Bayesian network is a probabilistic graphical model which represents a set

of variables and their conditional dependencies using a directed acyclic

graph."

It is also called a Bayes network, belief network, decision network, or

Bayesian model.

Bayesian networks are probabilistic, because these networks are built from a

probability distribution, and also use probability theory for prediction and

anomaly detection.

Real world applications are probabilistic in nature, and to represent the

relationship between multiple events, we need a Bayesian network. It can also

be used in various tasks including prediction, anomaly detection, diagnostics,

automated insight, reasoning, time series prediction, and decision making

under uncertainty.

A Bayesian Network can be used for building models from data and experts'

opinions, and it consists of two parts:

●​ Directed Acyclic Graph

●​ Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision

problems under uncertain knowledge is known as an Influence Diagram.


A Bayesian network graph is made up of nodes and Arcs (directed links),

where:

●​ Each node corresponds to the random variables, and a variable can be


continuous or discrete.

●​ Arc or directed arrows represent the causal relationship or conditional


probabilities between random variables. These directed links or arrows connect
the pair of nodes in the graph.​
These links represent that one node directly influences the other node; if there
is no directed link, the nodes are independent of each other.

○​ In the above diagram, A, B, C, and D are random variables represented by


the nodes of the network graph.
○​ If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.

○​ Node C is independent of node A.

Note: The Bayesian network graph does not contain any cyclic graph. Hence, it is known as

a directed acyclic graph or DAG.

The Bayesian network has mainly two components:

●​ Causal Component

●​ Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi

| Parents(Xi)), which determines the effect of the parents on that node.

The Bayesian network is based on Joint probability distribution and conditional

probability. So let's first understand the joint probability distribution:

Joint probability distribution:


If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of

x1, x2, x3, ..., xn are known as the joint probability distribution.

P[x1, x2, x3, ..., xn] can be written in the following way in terms of conditional

probabilities (the chain rule):
P[x1, x2, x3, ..., xn] = P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]

= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] .... P[xn-1 | xn] P[xn]

In general, for each variable Xi in the network, we can write the equation as:

P(Xi | Xi-1, ....., X1) = P(Xi | Parents(Xi))

Explanation of Bayesian network:


Let's understand the Bayesian network through an example by creating a directed

acyclic graph:

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm

reliably responds to a burglary but also responds to minor earthquakes. Harry

has two neighbors, David and Sophia, who have taken the responsibility to inform Harry at

work when they hear the alarm. David always calls Harry when he hears the alarm, but

sometimes he gets confused with the phone ringing and calls at that time too. On the

other hand, Sophia likes to listen to loud music, so sometimes she misses hearing the

alarm. Here we would like to compute the probability of the Burglary Alarm.

Problem:

Calculate the probability that the alarm has sounded, but there is neither a burglary,

nor an earthquake, and David and Sophia both called Harry.


Solution:

●​ The Bayesian network for the above problem is given below. The network
structure shows that Burglary and Earthquake are the parent nodes of Alarm
and directly affect the probability of the alarm going off, while David's and
Sophia's calls depend on the alarm probability.

●​ The network represents the assumption that David and Sophia do not directly
perceive the burglary, do not notice the minor earthquake, and do not confer
before calling.

●​ The conditional distributions for each node are given as a conditional probability
table or CPT.

●​ Each row in the CPT must sum to 1 because all the entries in the table
represent an exhaustive set of cases for the variable.

●​ In a CPT, a boolean variable with k boolean parents contains 2^k rows of

probabilities. Hence, if there are two parents, the CPT will contain 4 probability values.

List of all events occurring in this network:

●​ Burglary (B)

●​ Earthquake(E)

●​ Alarm(A)

●​ David Calls(D)

●​ Sophia calls(S)

We can write the events of the problem statement in the form of the probability P[D, S, A, B, E],

and can rewrite this probability statement using the joint probability distribution:
P[D, S, A, B, E] = P[D | S, A, B, E] · P[S, A, B, E]

= P[D | S, A, B, E] · P[S | A, B, E] · P[A, B, E]

= P[D | A] · P[S | A, B, E] · P[A, B, E]

= P[D | A] · P[S | A] · P[A | B, E] · P[B, E]

= P[D | A] · P[S | A] · P[A | B, E] · P[B | E] · P[E]

Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of burglary.

P(B= False)= 0.998, which is the probability of no burglary.


P(E= True)= 0.001, which is the probability of a minor earthquake

P(E= False)= 0.999, which is the probability that an earthquake has not occurred.

We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The conditional probability of Alarm A depends on Burglary and Earthquake:

B       E       P(A= True)   P(A= False)
True    True    0.94         0.06
True    False   0.95         0.05
False   True    0.31         0.69
False   False   0.001        0.999

Conditional probability table for David Calls:

The conditional probability that David will call depends on the probability of the Alarm.

A       P(D= True)   P(D= False)
True    0.91         0.09
False   0.05         0.95

Conditional probability table for Sophia Calls:

The conditional probability that Sophia calls depends on her parent node "Alarm."

A       P(S= True)   P(S= False)
True    0.75         0.25
False   0.02         0.98

From the formula of joint distribution, we can write the problem statement in the form of

probability distribution:

P(S, D, A, ¬B, ¬E) = P(S|A) × P(D|A) × P(A|¬B ∧ ¬E) × P(¬B) × P(¬E)

= 0.75 × 0.91 × 0.001 × 0.998 × 0.999

= 0.00068045.

Hence, a Bayesian network can answer any query about the domain by using Joint

distribution.
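
A small sketch (not part of the original notes) that encodes the CPTs above as plain Python dictionaries and reproduces the joint probability just computed; the dictionary-based representation is an illustrative choice, not a standard library API.

```python
# Prior probabilities and CPTs from the tables above (keyed by True/False)
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}   # P(A=True | B, E)
P_D = {True: 0.91, False: 0.05}                       # P(D=True | A)
P_S = {True: 0.75, False: 0.02}                       # P(S=True | A)

def joint(d, s, a, b, e):
    """P(D=d, S=s, A=a, B=b, E=e) via the chain-rule factorisation of the network."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_d = P_D[a] if d else 1 - P_D[a]
    p_s = P_S[a] if s else 1 - P_S[a]
    return p_d * p_s * p_a * P_B[b] * P_E[e]

# P(D, S, A, ¬B, ¬E) from the worked example
print(joint(d=True, s=True, a=True, b=False, e=False))   # ≈ 0.00068045
```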

The semantics of Bayesian Network:

There are two ways to understand the semantics of a Bayesian network, which are

given below:

1. To understand the network as the representation of the Joint probability

distribution.

It is helpful to understand how to construct the network.

2. To understand the network as an encoding of a collection of conditional

independence statements.

It is helpful in designing inference procedures.

Support-Vectors
Support vectors are the data points that are nearest to the hyper-plane and

affect the position and orientation of the hyper-plane. We have to select a

hyperplane for which the margin, i.e. the distance between the support vectors
and the hyper-plane, is maximum. Even a little interference in the position of these

support vectors can change the hyper-plane.

How does SVM work?


As we have a clear idea of the terminologies related to SVM, let’s now see

how the algorithm works. For example, we have a classification problem

where we have to separate the red data points from the blue ones.
Since it is a two-dimensional problem, our decision boundary will be a line; for

a 3-dimensional problem we would have to use a plane, and, similarly, the

complexity of the solution will increase with the rising number of features.
As shown in the above image, we have multiple lines separating the data

points successfully. But our objective is to look for the best solution.

There are a few rules that can help us to identify the best line.

The first rule is Maximum Classification, i.e. the selected line must be able to successfully

segregate all the data points into the respective classes. In our example, we

can clearly see that lines E and D are misclassifying a red data point. Hence, for

this problem lines A, B, and C are better than E and D, so we will drop E and D.


The second rule is Best Separation, which means, we must choose a line

such that it is perfectly able to separate the points.

In our example, if we get a new red data point closer to line A, as

shown in the image below, line A will misclassify that point. Similarly, if we

get a new blue instance closer to line B, then lines A and C will classify the

data successfully, whereas line B will misclassify this new point.
The point to be noticed here is that in both cases line C is successfully

classifying all the data points. Why? To understand this, let’s take all the lines

one by one.

Why not Line A?


First, consider line A. If I move line A towards the left, we can see it has very

little chance of misclassifying the blue points. On the other hand, if I shift line A

towards the right side, it will very easily misclassify the red points. The

reason is that on the left side the margin, i.e. the distance between the nearest

data point and the line, is large, whereas on the right side the margin is very

low.
Why Not Line B?
The case of line B is similar. If we shift line B towards the right side, it has a

sufficient margin on the right side whereas it will wrongly classify the instances
on the left side as the margin towards the left is very low. Hence, B is not our

perfect solution.

Why Line C?
In the case of line C, it has sufficient margin on the left as well as the right

side. This maximum margin makes line C more robust for the new data points

that might appear in the future. Hence, C is the best fit in that case that

successfully classifies all the data points with the maximum margin.
This is what SVM looks for: it aims for the maximum margin and creates a line

that is equidistant from both sides, which is line C in our case. So we can say

C represents the SVM classifier with the maximum margin.

Now let’s look at the data below. As we can see, this is not linearly separable

data, so a linear SVM will not work in this situation. If we try to classify this

data with a line anyway, the result will not be promising.


So, is there any way that SVM can classify this kind of data? For this problem,

we have to create a decision boundary that looks something like this.

The question is, is it possible to create such a decision boundary using SVM?

Well, the answer is yes. SVM does this by projecting the data into a higher
dimension, as shown in the following image. In the first case, the data is not

linearly separable; hence, we project it into a higher dimension.

If we have more complex data then SVM will continue to project the data in a

higher dimension till it becomes linearly separable. Once the data become

linearly separable, we can use SVM to classify just like the previous problems.

Projection into Higher Dimension


Now let’s understand how SVM projects the data into a higher dimension.

Take this data; it is a circular, non-linearly separable dataset.


To project the data into a higher dimension, we are going to create another

dimension z, where z = x² + y².

Now we will plot this feature Z with respect to x, which will give us linearly

separable data that looks like this.


Here, we have created a mapping Z using the base features X and Y; this

process is known as a kernel transformation. Precisely, a kernel takes the

features as input and creates the linearly separable data in a higher

dimension.

Now the question is, do we have to perform this transformation manually? The

answer is no. SVM handles this process itself; we just have to choose the

kernel type.

Let’s quickly go through the different types of kernels available.

Linear Kernel
To start with, in the linear kernel, the decision boundary is a straight line.

Unfortunately, most real-world data is not linearly separable; this is the

reason the linear kernel is not widely used in SVM.


Gaussian / RBF kernel
It is the most commonly used kernel. It projects the data into a Gaussian

distribution, where the red points become the peak of the Gaussian surface

and the green data points become the base of the surface, making the data

linearly separable.
But this kernel is prone to overfitting and it captures the noise.

Polynomial kernel
At last, we have a polynomial kernel, which is non-uniform in nature due to the

polynomial combination of the base features. It often gives good results.

But the problem with the polynomial kernel is that the number of higher-dimensional

features increases exponentially. As a result, this is computationally more

expensive than RBF or linear kernel.
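
As a hedged sketch (not from the notes), scikit-learn's SVC can be used to compare the kernels discussed above on a circular dataset like the one described earlier; the dataset parameters and the C value are illustrative assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Circular, non-linearly separable data similar to the example above
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try the kernels discussed above; SVM applies the kernel trick internally
for kernel in ["linear", "rbf", "poly"]:
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    print(f"{kernel:6s} kernel accuracy: {clf.score(X_test, y_test):.2f}")
```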


Linear Systems Analysis:--In systems theory, a linear system is a mathematical

model of a system based on the use of a linear operator. Linear systems typically

exhibit features and properties that are much simpler than the nonlinear case. As a

mathematical abstraction or idealization, linear systems find important applications

in automatic control theory, signal processing, and telecommunications. For

example, the propagation medium for wireless communication systems can often

be modeled by linear systems.

Mathematical models are usually composed of relationships and variables.

Relationships can be described by operators, such as algebraic operators, functions,

differential operators, etc. Variables are abstractions of system parameters of

interest, that can be quantified. Several classification criteria can be used for

mathematical models according to their structure:

Linear vs. nonlinear: If all the operators in a mathematical model exhibit linearity,

the resulting mathematical model is defined as linear. A model is considered to be

nonlinear otherwise. The definition of linearity and nonlinearity is dependent on

context, and linear models may have nonlinear expressions in them. For example,

in a statistical linear model, it is assumed that a relationship is linear in the

parameters, but it may be nonlinear in the predictor variables. Similarly, a

differential equation is said to be linear if it can be written with linear differential


operators, but it can still have nonlinear expressions in it. In a mathematical

programming model, if the objective functions and constraints are represented

entirely by linear equations, then the model is regarded as a linear model. If one or

more of the objective functions or constraints are represented with a nonlinear

equation, then the model is known as a nonlinear model.​
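
To make the "linear in the parameters but nonlinear in the predictors" point concrete, here is a small illustrative sketch (not from the notes): the quadratic curve y = a + b·x + c·x² is nonlinear in x but linear in a, b, c, so it can still be fitted as a linear model by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

# Design matrix [1, x, x^2]: nonlinear in the predictor x, linear in the parameters
X = np.column_stack([np.ones_like(x), x, x**2])
a, b, c = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"fitted: y = {a:.2f} + {b:.2f}*x + {c:.2f}*x^2")
```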

Linear structure implies that a problem can be decomposed into simpler parts that

can be treated independently and/or analyzed at a different scale and the results

obtained will remain valid for the initial problem when recomposed and rescaled.​

Nonlinearity, even in fairly simple systems, is often associated with phenomena

such as chaos and irreversibility. Although there are exceptions, nonlinear systems

and models tend to be more difficult to study than linear ones. A common approach

to nonlinear problems is linearization, but this can be problematic if one is trying to

study aspects such as irreversibility, which are strongly tied to nonlinearity.

Static vs. dynamic: A dynamic model accounts for time-dependent changes in the

state of the system, while a static (or steady-state) model calculates the system in

equilibrium, and thus is time-invariant. Dynamic models typically are represented

by differential equations or difference equations.

Explicit vs. implicit: If all of the input parameters of the overall model are known,

and the output parameters can be calculated by a finite series of computations, the

model is said to be explicit. But sometimes it is the output parameters which are
known, and the corresponding inputs must be solved for by an iterative procedure,

such as Newton's method or Broyden's method. In such a case the model is said to

be implicit. For example, a jet engine's physical properties such as turbine and

nozzle throat areas can be explicitly calculated given a design thermodynamic

cycle (air and fuel flow rates, pressures, and temperatures) at a specific flight

condition and power setting, but the engine's operating cycles at other flight

conditions and power settings cannot be explicitly calculated from the constant

physical properties.

Discrete vs. continuous: A discrete model treats objects as discrete, such as the

particles in a molecular model or the states in a statistical model; while a

continuous model represents the objects in a continuous manner, such as the

velocity field of fluid in pipe flows, temperatures and stresses in a solid, and

electric field that applies continuously over the entire model due to a point charge.

Deterministic vs. probabilistic (stochastic): A deterministic model is one in which

every set of variable states is uniquely determined by parameters in the model and

by sets of previous states of these variables; therefore, a deterministic model

always performs the same way for a given set of initial conditions. Conversely, in a

stochastic model—usually called a "statistical model"—randomness is present, and

variable states are not described by unique values, but rather by probability

distributions.
Nonlinear Dynamics:- Dynamics is a branch of
mathematics that studies how systems change over time. Up until the
18th century, people believed that the future could be perfectly
predicted provided that one knows “all forces that set nature in motion,
and all positions of all items of which nature is composed” (such a
being is referred to as Laplace's Demon).

Now, provided that we believe that the world is fully deterministic,

then the statement makes sense. The problem is that in reality,

measured values (e.g. of forces and positions) are often approximated.

What we didn’t realise back then is that even if one knows of all

variables’ values, a slightly inaccurate approximation could result in a

totally different future. This is rather disheartening as it suggests that

in reality, even if we can build a supercomputer that resembles a

Laplace Demon, in order for it to make any meaningful prediction of

the future, all variables’ values must be perfectly obtained; not even a

tiny deviation is allowed. This phenomenon where “small differences

in initial conditions produce very great ones in the final phenomena”

(Poincare) is studied by the theory of chaos.


In the next section, we will explore a simple dynamic system and show

how it could produce chaos.

Population Model
Assume that the population growth of a place can be modelled by the

following equation. The population n(t+1) at time t+1 is equal to the

current population n(t) at time t minus deaths, which are

represented by n(t)²/k (how overpopulated the place is with respect to

the maximum allowable population k), all multiplied by the birth

rate R:

n(t+1) = R · ( n(t) − n(t)²/k )

With some algebraic manipulations (substituting x(t) = n(t)/k), we can convert this to an

equation of three variables, x(t+1), x(t), and R:

x(t+1) = R · x(t) · (1 − x(t))

The last equation relating x(t) and x(t+1) is a typical example of an

iteration equation. To get more intuitions on what this equation

entails, we shall analyse it with three charts. All of these simulations

are done with netlogo. The codes are provided in the course as well.

x(t) vs x(t+1)
This chart (or map) plots the current state x(t) against the next state

x(t+1). We can see that a one-humped map is shown on the chart (this

is basically a plot of R*x{t}*(1-x{t})). As we trace the point’s journey

over time, we can see it oscillates for a while before settling down at

about x = 0.64.

This behavior however, changes if we set the value of parameter R to

something else, say 3. The point perpetually oscillates between two


locations. It still settles down, albeit not on a static location but rather

two possible values.

t vs x(t)

For each of the above logistic map, we can also see how the point’s

position vary over time, that is by plotting time t against x(t).


R = 2.82

For the first logistic map (R = 2.82), we see that the value of x

eventually settles down to a single value. This is why we saw that the

blue dot in the previous chart stops moving after a while.

R = 3.2
Meanwhile for the second logistic map (R = 3.2), we see that x

eventually oscillates between two values, which explains why the blue

dot in the previous chart perpetually moved between two locations.

R vs x

Remember from the x(t) vs x(t+1) chart that the value of R determines

how the dynamics converge over time. In this section, we will plot R vs

the number of stable points that the system settles down to (a.k.a.

attractors).
As we can see, for R = 2.82, there’s only 1 attractor (around 0.64 to

0.65) while for R =3.2, two attractors are observed. Notice that there’s

a tendency for the number of attractors to increase as R grows. At

some point (when R is about 3.569946…), the number of attractors

reaches infinity (see the darkly shaded region in the diagram). At this
point, the system oscillates very widely, displaying almost no pattern.

This point is also known as the onset of chaos.
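
A short sketch (not from the notes) that iterates the logistic map x(t+1) = R·x(t)·(1 − x(t)) for the values of R discussed above; the initial value 0.2 and the number of steps are arbitrary choices.

```python
def logistic_trajectory(R, x0=0.2, steps=100):
    """Iterate x(t+1) = R * x(t) * (1 - x(t)) and return the trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(R * xs[-1] * (1 - xs[-1]))
    return xs

for R in (2.82, 3.2, 3.7):
    tail = logistic_trajectory(R)[-4:]
    print(f"R = {R}: last values {['%.3f' % v for v in tail]}")
# R = 2.82 settles to a single attractor (~0.645), R = 3.2 oscillates
# between two values, and R = 3.7 shows no simple pattern (chaos).
```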

ANN(Artificial Neural Networks):-

An ANN usually involves a large number of processors


operating in parallel and arranged in tiers. The first tier
receives the raw input information -- analogous to optic
nerves in human visual processing. Each successive tier
receives the output from the tier preceding it, rather than
the raw input -- in the same way neurons further from the
optic nerve receive signals from those closer to it. The last
tier produces the output of the system.

Each processing node has its own small sphere of


knowledge, including what it has seen and any rules it was
originally programmed with or developed for itself. The
tiers are highly interconnected, which means each node in
tier n will be connected to many nodes in tier n-1 -- its
inputs -- and in tier n+1, which provides input data for
those nodes. There may be one or multiple nodes in the
output layer, from which the answer it produces can be
read.

Artificial neural networks are notable for being adaptive,


which means they modify themselves as they learn from
initial training and subsequent runs provide more
information about the world. The most basic learning
model is centered on weighting the input streams, which is
how each node weights the importance of input data from
each of its predecessors. Inputs that contribute to getting
right answers are weighted higher.

How neural networks learn


Architectures of Neural Network:
ANN is a computational system consisting of many interconnected units called
artificial neurons. The connection between artificial neurons can transmit a
signal from one neuron to another. So, there are multiple possibilities for
connecting the neurons, and these determine the architecture we are going to adopt
for a specific solution. Some permutations and combinations are as follows:
●​ There may be just two layers of neurons in the network – the input and
output layer.
●​ There can be one or more intermediate ‘hidden’ layers of a neuron.
●​ The neurons may be connected with all neurons in the next layer and
so on …..

So let’s start talking about the various possible architectures:


A. Single-layer Feed Forward Network:

It is the simplest and most basic architecture of ANN’s. It consists of only two
layers- the input layer and the output layer. The input layer consists of ‘m’ input
neurons connected to each of the ‘n’ output neurons. The connections carry
weights w11 and so on. The input layer of the neurons doesn’t conduct any
processing – they pass the i/p signals to the o/p neurons. The computations are
performed in the output layer. So, though it has 2 layers of neurons, only one
layer is performing the computation. This is the reason why the network is
known as SINGLE layer. Also, the signals always flow from the input layer to the
output layer. Hence, the network is known as FEED FORWARD.
The net signal input to the jth output neuron is given by the weighted sum
y_inj = x1·w1j + x2·w2j + ... + xm·wmj.

The signal output from each output neuron will depend on the activation function
used.
B. Multi-layer Feed Forward Network:

Multi-Layer Feed Forward Network

The multi-layer feed-forward network is quite similar to the single-layer


feed-forward network, except for the fact that there are one or more intermediate
layers of neurons between the input and output layer. Hence, the network is
termed as multi-layer. Each of the layers may have a varying number of
neurons. For example, the one shown in the above diagram has ‘m’ neurons in
the input layer and ‘r’ neurons in the output layer and there is only one hidden
layer with ‘n’ neurons.
The net signal input to the kth hidden layer neuron is the weighted sum of the inputs,
y_ink = x1·w1k + ... + xm·wmk, and the net signal input to a neuron in the output
layer is, likewise, the weighted sum of the outputs of the hidden layer neurons.
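
As a hedged sketch (not part of the notes), the forward pass of a small multi-layer feed-forward network can be written with NumPy; the layer sizes, random weights, and sigmoid activation are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, r = 3, 4, 2                     # input, hidden, and output layer sizes (illustrative)
W_hidden = rng.normal(size=(m, n))    # weights input -> hidden
W_output = rng.normal(size=(n, r))    # weights hidden -> output

x = np.array([0.5, -1.0, 2.0])        # one input pattern with m features

hidden_net = x @ W_hidden             # net signal input to each hidden neuron
hidden_out = sigmoid(hidden_net)      # activation function applied
output_net = hidden_out @ W_output    # net signal input to each output neuron
output = sigmoid(output_net)
print("network output:", output)
```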

C. Competitive Network:
It is the same as the single-layer feed-forward network in structure. The only
difference is that the output neurons are connected with each other (either
partially or fully). Below is the diagram for this type of network.

Competitive Network

According to the diagram, it is clear that some of the output neurons are
interconnected with each other. For a given input, the output neurons compete
against each other to represent the input. It represents a form of an
unsupervised learning algorithm in ANN that is suitable to find the clusters in a
data set.
D. Recurrent Network:

Recurrent Network

In feed-forward networks, the signal always flows from the input layer towards
the output layer (in one direction only). In the case of recurrent neural networks,
there is a feedback loop (from the neurons in the output layer to the input layer
neurons). There can be self-loops too.
Learning Process In ANN:
Learning process in ANN mainly depends on four factors, they are:
1.​ The number of layers in the network (Single-layered or
multi-layered)
2.​ Direction of signal flow (Feedforward or recurrent)
3.​ Number of nodes in layers: The number of nodes in the input layer is
equal to the number of features of the input data set. The number of
output nodes will depend on the possible outcomes, i.e. the number of
classes in the case of supervised learning. But the number of nodes in the
hidden layer(s) is to be chosen by the user. A larger number of nodes in
the hidden layer generally gives higher performance, but too many nodes may
result in overfitting as well as increased computational expense.
4.​ Weight of Interconnected Nodes: Deciding the value of the weights
attached to each interconnection between neurons, so that a
specific learning problem can be solved correctly, is quite a difficult
problem by itself. Take an example to understand the problem. In the
example of a Multi-layered Feed-Forward Network, we have to train
an ANN model using some data so that it can classify a new data point,
say p_5 = (3,-2). Say we have deduced that p_1 = (5,2) and p_2 = (-1,12)
belong to class C1 while p_3 = (3,-5) and p_4 = (-2,-1) belong to
class C2. We assume the values of the synaptic weights w_0, w_1, w_2 to be
-2, 1/2 and 1/4 respectively. But we will NOT get these weight values for
every learning problem. For solving a learning problem with an ANN, we
can start with a set of values for the synaptic weights and keep changing
them over multiple iterations. The stopping criterion may be a rate of
misclassification < 1% or a maximum number of iterations
less than 25 (a threshold value). There may be another problem: the
rate of misclassification may not reduce progressively. A minimal
sketch of this iterative weight-update idea is given after this list.
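
The sketch below is an illustration of the iterative weight-update idea, not the exact procedure from the notes: it trains a simple perceptron-style unit on the four labelled points above and then classifies p_5 = (3, -2); the learning rate, class encoding, and perceptron update rule are assumptions for demonstration.

```python
import numpy as np

# Training data from the example: class C1 -> +1, class C2 -> -1
X = np.array([[5, 2], [-1, 12], [3, -5], [-2, -1]], dtype=float)
y = np.array([1, 1, -1, -1])

w = np.array([-2.0, 0.5, 0.25])   # initial weights w_0 (bias), w_1, w_2
lr, max_iters = 0.1, 25

for iteration in range(max_iters):
    errors = 0
    for xi, target in zip(X, y):
        pred = 1 if w[0] + w[1:] @ xi > 0 else -1
        if pred != target:                  # update weights only on misclassification
            w[0] += lr * target
            w[1:] += lr * target * xi
            errors += 1
    if errors == 0:                         # stopping criterion: no misclassifications
        break

p_5 = np.array([3.0, -2.0])
print("p_5 is classified as", "C1" if w[0] + w[1:] @ p_5 > 0 else "C2")
```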

Types of neural networks


Specific types of artificial neural networks include:

●​ Feed-forward neural networks: one of the simplest variants of neural


networks. They pass information in one direction, through various
input nodes, until it makes it to the output node. The network may or
may not have hidden node layers, making their functioning more
interpretable. It is prepared to process large amounts of noise. This
type of ANN computational model is used in technologies such as
facial recognition and computer vision.
●​ Recurrent neural networks: more complex. They save the output of
processing nodes and feed the result back into the model. This is
how the model is said to learn to predict the outcome of a layer. Each
node in the RNN model acts as a memory cell, continuing the
computation and implementation of operations. This neural network
starts with the same front propagation as a feed-forward network, but
then goes on to remember all processed information in order to
reuse it in the future. If the network's prediction is incorrect, then the
system self-learns and continues working towards the correct
prediction during backpropagation. This type of ANN is frequently
used in text-to-speech conversions.
●​ Convolutional neural networks: one of the most popular models used
today. This neural network computational model uses a variation of
multilayer perceptrons and contains one or more convolutional layers
that can be either entirely connected or pooled. These convolutional
layers create feature maps that record a region of the image, which is
ultimately broken into rectangles and sent out for nonlinear processing. The CNN
model is particularly popular in the realm of image recognition; it has
been used in many of the most advanced applications of AI,
including facial recognition, text digitization and natural language
processing. Other uses include paraphrase detection, signal
processing and image classification.
●​ Deconvolutional neural networks: utilize a reversed CNN model
process. They aim to find lost features or signals that may have
originally been considered unimportant to the CNN system's task.
This network model can be used in image synthesis and analysis.
●​ Modular neural networks: contain multiple neural networks working
separately from one another. The networks do not communicate or
interfere with each other's activities during the computation process.
Consequently, complex or big computational processes can be
performed more efficiently.
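As a minimal illustration of the feed-forward idea mentioned in the first bullet above, the sketch below pushes an input through one hidden layer to the output; the layer sizes, random weights and sigmoid activation are all illustrative assumptions, and no training is shown.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 4 input features, 3 hidden nodes, 2 output nodes.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

def forward(x):
    # Information flows in one direction only: input -> hidden -> output.
    hidden = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ hidden + b2)

print(forward(np.array([0.2, -1.0, 0.5, 3.0])))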

Advantages of artificial neural networks

Advantages of artificial neural networks include:

●​ Parallel processing abilities mean the network can perform more
than one job at a time.
●​ Information is stored on an entire network, not just a database.
●​ The ability to learn and model nonlinear, complex relationships helps
model the real-life relationships between input and output.
●​ Fault tolerance means the corruption of one or more cells of the ANN
will not stop the generation of output.
●​ Gradual corruption means the network will slowly degrade over time,
instead of a problem destroying the network instantly.
●​ The ability to produce output even with incomplete knowledge, with the
loss of performance depending on how important the missing
information is.
●​ No restrictions are placed on the input variables, such as how they
should be distributed.
●​ Machine learning means the ANN can learn from events and make
decisions based on the observations.
●​ The ability to learn hidden relationships in the data without
imposing any fixed form of relationship means an ANN can better model
highly volatile data and non-constant variance.
●​ The ability to generalize and infer unseen relationships on unseen
data means ANNs can predict the output of unseen data.

Disadvantages of artificial neural networks

The disadvantages of ANNs include:


●​ The lack of rules for determining the proper network structure means
the appropriate artificial neural network architecture can only be
found through trial and error and experience.
●​ The requirement of processors with parallel processing abilities
makes neural networks hardware-dependent.
●​ The network works with numerical information, therefore all problems
must be translated into numerical values before they can be
presented to the ANN.
●​ The lack of explanation behind the solutions it produces is one of the
biggest disadvantages of ANNs. The inability to explain the why or
how behind a solution generates a lack of trust in the network.

Applications of artificial neural networks

Image recognition was one of the first areas to which neural networks were
successfully applied, but the technology's uses have expanded to many more
areas, including:

●​ Chatbots
●​ Natural language processing, translation and language generation
●​ Stock market prediction
●​ Delivery driver route planning and optimization
●​ Drug discovery and development
These are just a few specific areas to which neural networks are being applied
today. Prime uses involve any process that operates according to strict rules
or patterns and has large amounts of data. If the data involved is too large for
a human to make sense of in a reasonable amount of time, the process is
likely a prime candidate for automation through artificial neural networks.

Generalization
In machine learning, generalization describes how well a trained model
classifies or forecasts unseen data. Training a generalized machine
learning model means, in general, that it works for any subset of unseen
data. An example is when we train a model to classify between dogs and
cats. If the model is provided with a dog-image dataset containing only
two breeds, it may obtain good performance on those. But it may get a low
classification score when it is tested on other breeds of dogs, and this
issue can lead to an actual dog image from the unseen dataset being
classified as a cat. Data diversity is therefore a very important factor
in making a good prediction. A common way to check generalization is to
hold out test data, as sketched below.
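A minimal sketch of this check, assuming scikit-learn is available: hold out data the model never sees during training and compare the scores on the two sets. The synthetic data and the logistic-regression model are placeholders standing in for, say, image features of dogs and cats.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for a real feature set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 30% of the data; the model never sees it during training.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# A large gap between the two scores suggests poor generalization (overfitting).
print("train accuracy:", accuracy_score(y_tr, model.predict(X_tr)))
print("test accuracy :", accuracy_score(y_te, model.predict(X_te)))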

COMPETITIVE LEARNING:-
The basic architecture of a competitive learning system is shown below. It consists of a set of
hierarchically layered units in which each layer connects, via excitatory connections, with the
layer immediately above it, and has inhibitory connections to units in its own layer. In the most
general case, each unit in a layer receives an input from each unit in the layer immediately
below it and projects to each unit in the layer immediately above it. Moreover, within a layer, the
units are broken into a set of inhibitory clusters in which all elements within a cluster inhibit all
other elements in the cluster. Thus the elements within a cluster at one level compete with one
another to respond to the pattern appearing on the layer below. The more strongly any particular
unit responds to an incoming stimulus, the more it shuts down the other members of its cluster.

Figure: The architecture of the competitive learning mechanism.


●​ The units in a given layer are broken into several sets of non-overlapping clusters. Each
unit within a cluster inhibits every other unit within the cluster. Within each cluster, the unit
receiving the largest input achieves its maximum value while all other units in the cluster
are pushed to their minimum value. We have arbitrarily set the maximum value to 1 and
the minimum value to 0.
●​ Every unit in every cluster receives inputs from all members of the same set of input
units.
●​ A unit learns if and only if it wins the competition with other units in its cluster.
●​ A stimulus pattern Sj consists of a binary pattern in which each element of the pattern is
either active or inactive. An active element is assigned the value 1 and an inactive
element is assigned the value 0.
●​ Each unit has a fixed amount of weight (all weights are positive) that is distributed
among its input lines. The weight on the line connecting to unit i on the upper layer from unit j on
the lower layer is designated w_ij. The fixed total amount of weight for unit i is expressed by the
constraint ∑_j w_ij = 1. A unit learns by shifting weight from its inactive to its active input lines. If a
unit does not respond to a particular pattern, no learning takes place in that unit. If a unit wins the
competition, then each of its input lines gives up some portion ϵ of its weight, and that weight is then
distributed equally among the active input lines. Mathematically, this learning rule can be stated as follows.
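One standard way to write it (following the Rumelhart and Zipser formulation; the symbols c_jk and n_k are introduced here only for this statement) is:

\[
\Delta w_{ij} =
\begin{cases}
0 & \text{if unit } i \text{ loses on stimulus } S_k,\\[4pt]
\epsilon \dfrac{c_{jk}}{n_k} - \epsilon\, w_{ij} & \text{if unit } i \text{ wins on stimulus } S_k,
\end{cases}
\]

where c_jk = 1 if element j of stimulus S_k is active (and 0 otherwise), and n_k = ∑_j c_jk is the number of active elements in S_k. Because the total weight a winning unit gives up (ϵ) equals the total weight it redistributes over its active input lines, the constraint ∑_j w_ij = 1 is preserved.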

What Is Principal Component Analysis?

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity, because smaller data sets are easier to explore and visualize, and they make analyzing data much easier and faster for machine learning algorithms without extraneous variables to process.

So to sum up, the idea of PCA is simple: reduce the number of variables of a data set, while preserving as much information as possible.

Step by Step Explanation of PCA

STEP 1: STANDARDIZATION

The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis.

More specifically, the reason why it is critical to perform standardization prior to PCA is that the latter is quite sensitive to the variances of the initial variables. That is, if there are large differences between the ranges of the initial variables, those variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.

Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable. Once the standardization is done, all the variables will be transformed to the same scale.
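A minimal numpy sketch of this standardization step, assuming the data are held in an n-samples by p-variables array X (the numbers below are purely illustrative):

import numpy as np

# Illustrative data: 4 samples (rows) and 3 variables (columns).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 2.2],
              [2.2, 2.9, 1.9],
              [1.9, 2.2, 3.1]])

# Subtract each variable's mean and divide by its standard deviation,
# so every column ends up on the same scale (mean 0, standard deviation 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z)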


STEP 2: COVARIANCE MATRIX COMPUTATION

The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them. Sometimes variables are highly correlated in such a way that they contain redundant information, so in order to identify these correlations we compute the covariance matrix.

The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:

Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)

Since the covariance of a variable with itself is its variance (Cov(a,a) = Var(a)), in the main diagonal (top left to bottom right) we actually have the variances of each initial variable. And since the covariance is commutative (Cov(a,b) = Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.

What do the covariances that we have as entries of the matrix tell us about the correlations between the variables? It's actually the sign of the covariance that matters:

●​ if positive: the two variables increase or decrease together (correlated)
●​ if negative: one increases when the other decreases (inversely correlated)

Now that we know that the covariance matrix is nothing more than a table that summarizes the correlations between all the possible pairs of variables, let's move to the next step.
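Continuing the sketch above, the covariance matrix of the standardized data can be computed with numpy as follows; Z is an illustrative stand-in for the standardized data from Step 1, and rowvar=False tells numpy that the variables are the columns.

import numpy as np

# Illustrative stand-in for standardized data: 100 samples, 3 variables.
Z = np.random.default_rng(0).normal(size=(100, 3))

# Columns are the variables, so rowvar=False yields the 3 x 3 covariance matrix.
C = np.cov(Z, rowvar=False)

# The main diagonal holds each variable's variance, and the matrix is
# symmetric because Cov(a, b) = Cov(b, a).
print(C)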


STEP 3: COMPUTE THE EIGENVECTORS AND
EIGENVALUES OF THE COVARIANCE MATRIX TO
IDENTIFY THE PRINCIPAL COMPONENTS

Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data. Before getting to the explanation of these concepts, let's first understand what we mean by principal components.

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components. So the idea is: 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until having something like shown in the scree plot below.


Figure: Percentage of variance (information) explained by each principal component.

Organizing information in principal components this way will allow you to reduce dimensionality without losing much information, and this by discarding the components with low information and considering the remaining components as your new variables.

An important thing to realize here is that the principal components are less interpretable and don't have any real meaning, since they are constructed as linear combinations of the initial variables.

Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture most information of the data. The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it has. To put all this simply, just think of principal components as new axes that provide the best angle to see and evaluate the data, so that the differences between the observations are better visible.

How PCA Constructs the Principal Components

As there are as many principal components as there are variables in the data, principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set. For example, let's assume that the scatter plot of our data set is as shown below: can we guess the first principal component? Yes, it's approximately the line that matches the purple marks, because it goes through the origin and it's the line in which the projection of the points (red dots) is the most spread out. Or, mathematically speaking, it's the line that maximizes the variance (the average of the squared distances from the projected points (red dots) to the origin).

The second principal component is calculated in the same way, with the condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance. This continues until a total of p principal components have been calculated, equal to the original number of variables.

Now that we understand what we mean by principal components, let's go back to eigenvectors and eigenvalues. What you first need to know about them is that they always come in pairs, so that every eigenvector has an eigenvalue. And their number is equal to the number of dimensions of the data. For example, for a 3-dimensional data set, there are 3 variables, therefore there are 3 eigenvectors with 3 corresponding eigenvalues.

Without further ado, it is eigenvectors and eigenvalues that are behind all the magic explained above, because the eigenvectors of the covariance matrix are actually the directions of the axes where there is the most variance (most information), and these are what we call principal components. And eigenvalues are simply the coefficients attached to eigenvectors, which give the amount of variance carried in each principal component.

By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the principal components in order of significance.

Example:

Let's suppose that our data set is 2-dimensional with 2 variables x, y and that the eigenvectors and eigenvalues of the covariance matrix are as follows:

If we rank the eigenvalues in descending order, we get λ1 > λ2, which means that the eigenvector that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the second component (PC2) is v2.

After having the principal components, to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of eigenvalues. If we apply this on the example above, we find that PC1 and PC2 carry respectively 96% and 4% of the variance of the data.
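A minimal numpy sketch of this step, assuming C is the covariance matrix from Step 2 (an illustrative stand-in is used below); np.linalg.eigh is the appropriate routine because covariance matrices are symmetric.

import numpy as np

# Illustrative covariance matrix of some standardized data.
C = np.cov(np.random.default_rng(0).normal(size=(100, 3)), rowvar=False)

# eigh returns eigenvalues in ascending order with eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(C)

# Rank the components from largest to smallest eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Percentage of variance carried by each principal component.
explained = eigvals / eigvals.sum()
print(explained)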

STEP 4: FEATURE VECTOR

As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance. In this step, what we do is choose whether to keep all these components or discard those of lesser significance (of low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.

So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.

Example:

Continuing with the example from the previous step, we can either form a feature vector with both of the eigenvectors v1 and v2, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only.

Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a loss of information in the final data set. But given that v2 was carrying only 4% of the information, the loss will therefore not be important and we will still have 96% of the information that is carried by v1.

So, as we saw in the example, it's up to you to choose whether to keep all the components or discard the ones of lesser significance, depending on what you are looking for.
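A minimal numpy sketch of forming the feature vector and projecting the data onto it; Z, eigvals and eigvecs are illustrative stand-ins recreated below so the sketch runs on its own, and keeping k = 2 of 3 components mirrors the idea of discarding the least significant one.

import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 3))                    # standardized data (stand-in)

eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Feature vector: keep only the eigenvectors of the top k components.
k = 2
feature_vector = eigvecs[:, :k]                  # shape (3, k)

# Project the standardized data onto the retained principal components.
Z_reduced = Z @ feature_vector                   # shape (100, k)
print(Z_reduced.shape)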


Fuzzy Logic:- The term fuzzy refers to things that are not clear or are vague. In the real world we often encounter situations where we cannot determine whether a state is true or false; there, fuzzy logic provides very valuable flexibility for reasoning. In this way, we can consider the inaccuracies and uncertainties of any situation.

In the Boolean system, the truth value 1.0 represents absolute truth and 0.0 represents absolute falsity. Fuzzy logic, however, is not limited to these two extremes: intermediate values are also present, representing states that are partially true and partially false.

ARCHITECTURE

Its architecture contains four parts:


●​ RULE BASE: It contains the set of rules and the IF-THEN conditions
provided by the experts to govern the decision-making system, on the
basis of linguistic information. Recent developments in fuzzy theory
offer several effective methods for the design and tuning of fuzzy
controllers. Most of these developments reduce the number of fuzzy
rules.
●​ FUZZIFICATION: It is used to convert inputs i.e. crisp numbers into
fuzzy sets. Crisp inputs are basically the exact inputs measured by
sensors and passed into the control system for processing, such as
temperature, pressure, rpm’s, etc.
●​ INFERENCE ENGINE: It determines the matching degree of the current
fuzzy input with respect to each rule and decides which rules are to be
fired according to the input field. Next, the fired rules are combined to
form the control actions.
●​ DEFUZZIFICATION: It is used to convert the fuzzy sets obtained by the
inference engine into a crisp value. There are several defuzzification
methods available and the best-suited one is used with a specific expert
system to reduce the error.
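A minimal sketch tying the four parts above together for a single-input fan-speed controller; the triangular membership functions, the two rules and the numeric ranges are illustrative assumptions, not part of the notes.

import numpy as np

def tri(x, a, b, c):
    # Triangular membership function rising from a to b and falling from b to c.
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

temp = 22.0                                   # crisp sensor reading

# FUZZIFICATION: degrees of membership of the crisp input in "cold" and "hot".
cold = tri(temp, 0, 10, 25)
hot = tri(temp, 20, 35, 45)

# RULE BASE + INFERENCE ENGINE: IF cold THEN fan slow, IF hot THEN fan fast;
# each fired rule clips its output set at the strength of its antecedent.
speed = np.linspace(0, 100, 201)              # fan-speed universe of discourse
slow = np.minimum(tri(speed, 0, 20, 50), cold)
fast = np.minimum(tri(speed, 50, 80, 100), hot)
aggregated = np.maximum(slow, fast)

# DEFUZZIFICATION: centroid of the aggregated set gives a crisp fan speed.
crisp_speed = (speed * aggregated).sum() / aggregated.sum()
print(round(crisp_speed, 1))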
Fuzzy decision trees:--
