ARTIFICIAL INTELLIGENCE & MACHINE LEARNING (BDS602)
MODULE 4
Understanding Data: 2
2.6 BIVARIATE DATA AND MULTIVARIATE DATA
Bivariate data involves two variables. Bivariate analysis deals with causes and relationships between them; the aim is to find relationships in the data. Consider the following Table 2.3, with data of the temperature in a shop and the sales of sweaters.
Here, the aim of bivariate analysis is to find relationships among variables. The relationships
can then be used in comparisons, finding causes, and in further explorations. To do that,
graphical display of the data is necessary. One such graph method is called scatter plot. Scatter
plot is used to visualize bivariate data. It is useful to plot two variables with or without nominal
variables, to illustrate the trends, and also to show differences. It is a plot between explanatory
and response variables. It is a 2D graph showing the relationship between two variables.
Line graphs are similar to scatter plots. The Line Chart for sales data is shown in Figure 2.12.
2.6.1 BIVARIATE STATISTICS
Covariance and Correlation are examples of bivariate statistics. Covariance is a measure of the joint variability of two random variables, say X and Y. Generally, random variables are represented by capital letters. It is written as covariance(X, Y) or COV(X, Y) and is used to measure how the two dimensions vary together. The formula for finding the covariance from specific observations x and y is:
COV(X, Y) = (1/N) Σᵢ (xᵢ − E[X]) (yᵢ − E[Y])
where E[X] and E[Y] are the means of X and Y, and N is the number of observations.
The covariance between X and Y is 12. It can be normalized to a value between −1 and +1. This is done by dividing it by the product of the standard deviations of X and Y; the result is called the Pearson correlation coefficient.
Sometimes, N − 1 is used instead of N (the sample covariance). In that case, the covariance is 60/4 = 15.
Correlation
The Pearson correlation coefficient is the most common test for determining any association
between two phenomena. It measures the strength and direction of a linear relationship between
the x and y variables.
1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one-dimension increases, the other dimension
decreases.
3. If the value is zero, then it indicates that the two dimensions are uncorrelated, i.e., they have no linear relationship.
If two dimensions are highly correlated, then it is better to remove one of them, as it is redundant. If the given attributes are X = (x1, x2, … , xN) and Y = (y1, y2, … , yN), then the Pearson correlation coefficient, denoted r, is given as:
r = COV(X, Y) / (σ_X σ_Y) = Σᵢ (xᵢ − X̄)(yᵢ − Ȳ) / √( Σᵢ (xᵢ − X̄)² · Σᵢ (yᵢ − Ȳ)² )
EXAMPLE PROBLEM ON COV and CORR
Given two datasets:
X = [2, 4, 6, 8], Y = [1, 3, 2, 5]. Find the covariance between X and Y.
Solution:
Step 1: Compute the Means
The mean of X is:
• X̄ = (2 + 4 + 6 + 8) / 4 = 20 / 4 = 5
The mean of Y is:
• Ȳ = (1 + 3 + 2 + 5) / 4 = 11 / 4 = 2.75
Step 2: Compute the Covariance
COV(X, Y) = (1/N) Σᵢ (xᵢ − X̄)(yᵢ − Ȳ)
= [(2 − 5)(1 − 2.75) + (4 − 5)(3 − 2.75) + (6 − 5)(2 − 2.75) + (8 − 5)(5 − 2.75)] / 4
= (5.25 − 0.25 − 0.75 + 6.75) / 4 = 11 / 4 = 2.75
This positive covariance indicates that as X increases, Y tends to increase as well.
Step 3: Compute the Correlation
σ_X = √( ((−3)² + (−1)² + 1² + 3²) / 4 ) = √5 ≈ 2.236
σ_Y = √( ((−1.75)² + 0.25² + (−0.75)² + 2.25²) / 4 ) = √2.1875 ≈ 1.479
r = COV(X, Y) / (σ_X σ_Y) = 2.75 / (2.236 × 1.479) ≈ 0.832
Since r is close to 1, this indicates a strong positive correlation between X and Y. As X
increases, Y tends to increase as well.
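These calculations can be cross-checked with a short sketch, assuming NumPy is available; it uses the population form of covariance (dividing by N), which matches the working above.

```python
import numpy as np

X = np.array([2, 4, 6, 8])
Y = np.array([1, 3, 2, 5])

# Population covariance: mean of the products of deviations (divide by N)
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))   # 2.75

# Pearson correlation: covariance divided by the product of standard deviations
r = cov_xy / (X.std() * Y.std())                     # about 0.832

print(cov_xy, r)
```

Note that NumPy's built-in np.cov divides by N − 1 by default, which corresponds to the sample covariance mentioned earlier.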
2.7 MULTIVARIATE STATISTICS
In machine learning, almost all datasets are multivariate. Multivariate analysis deals with more than two observable variables, and often thousands of measurements need to be collected for one or more subjects.
Multivariate data has three or more variables. The aims of multivariate analysis are broader; typical techniques include regression analysis, factor analysis, and multivariate analysis of variance (MANOVA).
Heatmap
A heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours it. The darker colours indicate very large values and the lighter colours indicate smaller values.
The advantage of this method is that humans perceive colours well, so by colour shading, larger values can be spotted quickly.
For example, in vehicle traffic data, heavy-traffic regions can be differentiated from low-traffic regions through a heatmap. In Figure 2.13, patient data highlighting weight and health status is plotted. Here, the X-axis shows weights and the Y-axis shows patient counts. The dark-coloured regions highlight patients' weights versus patient counts for each health status.
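A heatmap can be produced with a few lines of Matplotlib. The sketch below is only illustrative: it colours a small random matrix, since the actual patient data of Figure 2.13 is not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 2D matrix (e.g., counts per weight bin and status); values are random
rng = np.random.default_rng(0)
matrix = rng.integers(0, 50, size=(6, 8))

plt.imshow(matrix, cmap="hot")     # colour intensity encodes the magnitude of each cell
plt.colorbar(label="count")
plt.xlabel("weight bin")
plt.ylabel("health status")
plt.title("Heatmap of a 2D matrix")
plt.show()
```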
Pairplot
A pairplot or scatter matrix is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data. A random matrix of three columns is chosen and the relationships between the columns are plotted as a pairplot (or scatter matrix), as shown below in Figure 2.14.
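A sketch of how such a scatter matrix could be produced, assuming seaborn and pandas are available, is shown below; the three-column random matrix is hypothetical and does not reproduce the data of Figure 2.14.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical multivariate data: a random matrix with three columns
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["col1", "col2", "col3"])

sns.pairplot(df)   # pair-wise scatter plots of every pair of columns
plt.show()
```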
2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA
Machine learning involves many mathematical concepts from the domains of linear algebra, statistics, probability, and information theory. The subsequent sections discuss important aspects of linear algebra and probability.
2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data
A linear system of equations is a group of equations with unknown variables. Let Ax = y; then the solution x is given as:
x = A⁻¹ y
This is true provided A is not singular (its determinant is not zero). The logic can be extended to a set of N equations with n unknown variables: if A is the n × n coefficient matrix and y = (y1, y2, … , yn)ᵀ, then the unknown vector x can be found as x = A⁻¹ y.
If there is a unique solution, then the system is called consistent independent. If there are infinitely many solutions, then the system is called consistent dependent. If there are no solutions and the equations are contradictory, then the system is called inconsistent. For solving a large system of equations, Gaussian elimination can be used. The procedure for applying
Gaussian elimination is given as follows:
1. Write the given matrix.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a11 as the pivot and eliminate the entry below it in the second row using the row operation R2 → R2 − (a21/a11) R1. The same logic can be used to remove the first-column entries in all the other rows.
4. Repeat the same logic and reduce the matrix to (row) echelon form. Then, the last unknown variable is obtained directly from the last row as:
xₙ = bₙ / aₙₙ
5. Then, the remaining unknown variables can be found by back-substitution as:
xᵢ = ( bᵢ − Σ_{j>i} aᵢⱼ xⱼ ) / aᵢᵢ
This part is called backward substitution. To facilitate the application of Gaussian elimination
method, the following row operations are applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it.
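As an illustration of these steps, the following sketch (assuming NumPy, and assuming every pivot is non-zero so that no row swaps are needed) performs forward elimination followed by backward substitution on the system solved in the worked example that follows.

```python
import numpy as np

def gaussian_elimination(A, y):
    """Solve Ax = y by forward elimination followed by backward substitution.
    Assumes every pivot is non-zero, so no row swaps are performed."""
    aug = np.hstack([A.astype(float), y.reshape(-1, 1).astype(float)])  # [A | y]
    n = len(y)

    # Forward elimination: scale each pivot row, then zero out entries below the pivot
    for i in range(n):
        aug[i] = aug[i] / aug[i, i]
        for j in range(i + 1, n):
            aug[j] = aug[j] - aug[j, i] * aug[i]

    # Backward substitution (pivots are already 1)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = aug[i, -1] - aug[i, i + 1:n] @ x[i + 1:n]
    return x

A = np.array([[2, 3, -1], [4, 1, 2], [-2, 5, 1]])
y = np.array([5, 6, 4])
print(gaussian_elimination(A, y))   # approximately [1.0156, 1.125, 0.40625]
```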
Example Problems:
GAUSSIAN ELIMINATION
1. Given the following system of equations:
2x + 3y − z = 5
4x + y + 2z = 6
−2x + 5y + z = 4
Find x, y, z.
Step 1: Convert the Equations into an Augmented Matrix
We write only the coefficients and constants in matrix form:
[  2   3  −1 |  5 ]
[  4   1   2 |  6 ]
[ −2   5   1 |  4 ]
Step 2: Make the First Element (Pivot) 1
To make the first element 1, we divide the first row by 2: R1 → R1 / 2 = [1  1.5  −0.5 | 2.5]
Step 3: Make the First Column Below the Pivot Zero
We want to make the numbers below the 1st pivot (1 in row 1, column 1) zero.
R2 → R2 − 4R1, which gives [0  −5  4 | −4]
R3 → R3 + 2R1, which gives [0  8  0 | 9]
Step 4: Make the Second Pivot 1
To make the second pivot 1, divide Row 2 by −5: R2 → R2 / (−5) = [0  1  −0.8 | 0.8]
Step 5: Make the Second Column Below the Pivot Zero
To make the entry below the second pivot 0, apply R3 → R3 − 8R2, which gives [0  0  6.4 | 2.6].
Step 6: Make the Third Pivot 1
To make the third pivot 1, divide Row 3 by 6.4: R3 → R3 / 6.4 = [0  0  1 | 0.40625]
Now, we have an upper triangular matrix, and we can solve for z, y, and x using back-
substitution.
Step 7: Back-Substitution
We now solve for z, y, and x one by one.
From Row 3: z = 0.40625
From Row 2: y − 0.8(0.40625) = 0.8, so y = 0.8 + 0.325 = 1.125
From Row 1: x + 1.5(1.125) − 0.5(0.40625) = 2.5, so x = 2.5 − 1.6875 + 0.203125 = 1.015625
Final Answer:
x ≈ 1.0156, y = 1.125, z = 0.40625
2. Given the following
X + Y = 5, 2X − Y = 1. Find X, Y.
Step 1: Convert to an augmented matrix:
[ 1   1 | 5 ]
[ 2  −1 | 1 ]
Step 2: Make the first pivot element 1. The first pivot is already 1.
Step 3: Make the element below the pivot 0:
R2 → R2 − 2R1, which gives [0  −3 | −9]
Step 4: Make the second pivot 1:
Divide Row 2 by −3, which gives [0  1 | 3]
Step 5: Make the element above the second pivot 0:
R1 → R1 − R2, which gives [1  0 | 2]
Step 6: Read the solution. From the final matrix:
X = 2, Y = 3
These concepts are illustrated in Example 2.8.
2.8.2 Matrix Decomposition
It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations
can be performed. For a symmetric matrix A, the eigen decomposition expresses it as:
A = Q Λ Qᵀ
where Q is the matrix of eigen vectors, Λ is the diagonal matrix of eigen values, and Qᵀ is the transpose of matrix Q.
LU Decomposition: One of the simplest matrix decompositions is LU decomposition, where the matrix A can be decomposed into two matrices:
A = LU
Here, L is the lower triangular matrix and U is the upper triangular matrix.
The decomposition can be done using Gaussian elimination method as discussed in the
previous section. First, an identity matrix is augmented to the given matrix.
Then, row operations and Gaussian elimination are applied to reduce the given matrix and obtain the matrices L and U.
Example 2.9 illustrates the application of Gaussian elimination to get LU.
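A quick way to check an LU factorization is SciPy's lu routine. Note that it applies partial pivoting, so it returns a permutation matrix P along with L and U; the matrix values below are only illustrative.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[ 2.0, 3.0, -1.0],
              [ 4.0, 1.0,  2.0],
              [-2.0, 5.0,  1.0]])

P, L, U = lu(A)                     # P permutation, L lower triangular, U upper triangular
print(np.allclose(P @ L @ U, A))    # True: A has been factored into triangular parts
```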
2.8.3 Machine Learning and Importance of Probability and Statistics
Machine learning is linked with statistics and probability. Like linear algebra, statistics is the
heart of machine learning. The importance of statistics needs to be stressed, as without statistics, data cannot be analysed or interpreted meaningfully.
Probability Distributions
A probability distribution of a variable, say X, summarizes the probability associated with X’s
events. Distribution is a parameterized mathematical function. In other words, distribution is a
function that describes the relationship between the observations in a sample space.
Consider a set of data. The data is said to follow a distribution if it obeys a mathematical
function that characterizes that distribution. The function can be used to calculate the
probability of individual observations.
Probability distributions are of two types:
1. Discrete probability distribution
2. Continuous probability distribution
The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution.
1. Continuous Probability Distributions – Normal, Rectangular, and Exponential distributions fall under this category.
I. Normal Distribution – Normal distribution is a continuous probability distribution.
This is also known as the Gaussian distribution or bell-shaped curve distribution. It is the
most common distribution function. The shape of this distribution is a typical bell-
shaped curve. In normal distribution, data tends to be around a central value with no
bias on left or right. The heights of the students, blood pressure of a population, and
marks scored in a class can be approximated using normal distribution. PDF of the
normal distribution is given as:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Here, μ is the mean and σ is the standard deviation. The normal distribution is characterized by two parameters – mean and variance. One important concept associated with the normal distribution is the z-score. It can be computed as:
z = (x − μ) / σ
This is useful to normalize the data.
II. Rectangular Distribution – This is also known as uniform distribution. It has equal
probabilities for all values in the range [a, b]. The uniform distribution is given as follows:
f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise.
III. Exponential Distribution – This is a continuous probability distribution used to describe the time between events in a Poisson process. The exponential distribution is a special case of the Gamma distribution with its shape parameter fixed at 1. This distribution is helpful in modelling the time until an event occurs. The PDF is given as follows:
f(x; λ) = λ · e^(−λx) for x ≥ 0, and 0 otherwise, where λ > 0 is the rate parameter.
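The three continuous distributions above can be evaluated with scipy.stats, assuming SciPy is available; the parameter values below are arbitrary examples.

```python
from scipy import stats

# Normal distribution with mean 0 and standard deviation 1
print(stats.norm.pdf(0, loc=0, scale=1))        # peak of the bell curve, about 0.399

# Uniform (rectangular) distribution on [a, b] = [2, 6]; loc=a, scale=b-a
print(stats.uniform.pdf(3, loc=2, scale=4))     # 1 / (b - a) = 0.25

# Exponential distribution with rate lambda = 2; SciPy uses scale = 1/lambda
print(stats.expon.pdf(1, scale=0.5))            # lambda * exp(-lambda * x)
```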
2. Discrete Distribution
Binomial, Poisson, and Bernoulli distributions fall under this category.
I. Binomial Distribution – The binomial distribution is another distribution that is often encountered in machine learning. Each trial has only two outcomes, success or failure; such a trial is also called a Bernoulli trial. The objective of this distribution is to find the probability of getting k successes out of n trials. The number of ways of getting k successes out of n trials is given as:
C(n, k) = n! / (k!(n − k)!)
If p is the probability of success, the probability of failure is (1 − p), and the probability of a particular sequence of k successes in n trials is p^k (1 − p)^(n−k). Combining both, one gets the PMF of the binomial distribution as:
P(X = k) = C(n, k) · p^k · (1 − p)^(n−k)
Here, p is the probability of success, k is the number of successes, and n is the total number of trials. The mean of the binomial distribution is np and its variance is np(1 − p).
II. Poisson Distribution – It is another important distribution that is quite useful. Given an interval of time, this distribution is used to model the probability of a given number of events k occurring in that interval. The mean rate λ is the average number of events per interval. Some examples of the Poisson distribution are the number of emails received, the number of customers visiting a shop, and the number of phone calls received by an office. The PMF of the Poisson distribution is given as follows:
P(X = k) = λ^k · e^(−λ) / k!, for k = 0, 1, 2, …
III. Bernoulli Distribution – This distribution models an experiment whose outcome is
binary. The outcome is positive (x = 1) with probability p and negative (x = 0) with probability 1 − p. The PMF of this distribution is given as:
P(X = x) = p^x · (1 − p)^(1−x), x ∈ {0, 1}
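Similarly, the discrete distributions can be evaluated with scipy.stats, assuming SciPy is available; the parameter values used here are arbitrary examples.

```python
from scipy import stats

# Binomial: probability of k = 3 successes in n = 10 trials with p = 0.5
print(stats.binom.pmf(3, n=10, p=0.5))        # C(10, 3) * 0.5**3 * 0.5**7

# Poisson: probability of k = 2 events when the mean rate is lambda = 4
print(stats.poisson.pmf(2, mu=4))             # 4**2 * exp(-4) / 2!

# Bernoulli: probability of a positive outcome (x = 1) with p = 0.3
print(stats.bernoulli.pmf(1, p=0.3))          # 0.3
```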
IV. Density Estimation
Let there be a set of observed values x1, x2, … , xn from a larger set of data whose distribution
is not known. Density estimation is the problem of estimating the density function from the observed data.
There are two types of density estimation methods, namely parametric density estimation and
non-parametric density estimation.
V. Parametric Density Estimation
It assumes that the data is drawn from a known probability distribution and can be modelled as p(x | Θ), where Θ denotes the distribution parameters estimated from the sample. The maximum likelihood function is a parametric estimation method.
VI. Maximum Likelihood Estimation
For a sample of observations, one can estimate the probability distribution. This is called
density estimation. Maximum Likelihood Estimation (MLE) is a probabilistic framework that
can be used for density estimation.
This involves formulating a function called likelihood function which is the conditional
probability of observing the observed samples and distribution function with its parameters.
For example, if the observations are X = {x1, x2, … , xn}, then density estimation is the
problem of choosing a PDF with suitable parameters to describe the data.
MLE treats this problem as a search or optimization problem in which the joint probability of the observations X under the parameters θ is maximized.
If one assumes that the regression problem can be framed as predicting output y given input x, then for p(y | x) the MLE framework can be applied as:
maximize Σᵢ log p(yᵢ | xᵢ; h)
Here, h is the linear regression model. If a Gaussian distribution is assumed, since much real-world data is approximately Gaussian, then maximizing the likelihood reduces to:
minimize Σᵢ (yᵢ − h(xᵢ; β))²
Here, β is the vector of regression coefficients and xᵢ is the given sample. One can maximize the likelihood, or equivalently minimize the negative log-likelihood, to obtain a solution for the linear regression problem. Equation (2.37) yields the same answer as the least-squares approach.
Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm :
Generally, the data may come from several unspecified distributions, each with a different set of parameters. The EM algorithm has two stages:
1. Expectation (E) Stage – In this stage, the expected PDF and its parameters are estimated
for each latent variable.
2. Maximization (M) stage – In this, the parameters are optimized using the MLE function.
This process is iterative, and the iteration is continued till all the latent variables are fitted
by probability distributions effectively along with the parameters.
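A minimal sketch of this idea using scikit-learn's GaussianMixture, which runs the EM algorithm internally, is given below; the two-component data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical one-dimensional data drawn from two overlapping groups
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

# GaussianMixture alternates the E and M steps internally until convergence
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_.ravel())    # estimated component means, roughly 0 and 5
print(gmm.weights_)          # estimated mixing proportions
```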
KNN Estimation: KNN estimation is another non-parametric density estimation method. The parameter k is fixed first, and based on it, the k nearest neighbours of a query point are determined. The probability density estimate at that point is then computed from the values returned by these neighbours.
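A rough sketch of a one-dimensional k-NN density estimate is given below; it uses the standard k / (n × volume) formulation, where the volume is the neighbourhood containing the k nearest neighbours, and the data is synthetic.

```python
import numpy as np

def knn_density_1d(x, sample, k=5):
    """k-NN density estimate in 1-D: p(x) ~ k / (n * volume),
    where the volume is twice the distance to the k-th nearest neighbour."""
    distances = np.sort(np.abs(sample - x))
    r_k = distances[k - 1]
    return k / (len(sample) * 2 * r_k)

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 1000)
print(knn_density_1d(0.0, sample, k=20))   # roughly 0.40 for the standard normal at 0
```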
2.9 OVERVIEW OF HYPOTHESIS
Data collection alone is not enough. Data must be interpreted to give a conclusion. The
conclusion should be a structured outcome. This assumption of the outcome is called a
hypothesis.
• Statistical methods are used to confirm or reject the hypothesis.
• The assumption of the statistical test is called the null hypothesis. It is also called hypothesis zero (H0).
• In other words, the null hypothesis is the existing belief. The violation of this hypothesis is called the first hypothesis (H1) or hypothesis one. This is the hypothesis the researcher is trying to establish.
There are two types of hypothesis tests, parametric and non-parametric.
• Parametric tests are based on parameters such as mean and standard deviation.
• Non-parametric tests depend on characteristics such as independence of events or whether the data follows a certain distribution.
The general steps in hypothesis testing are:
1. Define the null and alternate hypotheses
2. Describe the hypothesis using parameters
3. Identify the statistical test and statistics
4. Decide the criterion called the significance value α
5. Compute p-value (probability value)
1. Define Null and Alternate Hypothesis:
• Null Hypothesis (H0): It represents the statement of no effect, no difference, or status
quo. It assumes that any observed differences are due to random chance.
• Alternative Hypothesis (Hα): It represents a statement indicating the presence of an
effect, a difference, or a relationship between variables.
2. Describe the Hypotheses Using Parameters: Hypotheses are expressed using population parameters (e.g., mean μ, proportion p, standard deviation σ). Example: H0: μ = μ0 versus H1: μ ≠ μ0.
3. Identify the Statistical Test and Statistic:
• The choice of statistical test depends on the type of data and hypothesis:
• t-test: Compares means of two groups.
• z-test: Used when population variance is known.
• Chi-square test: Used for categorical data.
• ANOVA: Compares means of more than two groups.
• Regression analysis: Tests relationships between variables.
Test Statistic: A numerical value computed from the sample data to determine whether to
reject H0.
4. Decide the Significance Level (α):
• The significance level (α) represents the probability of rejecting H0 when it is actually
true (Type I error).
5. Compute p-value (Probability Value):
• The p-value represents the probability of obtaining a result at least as extreme as the
observed one, assuming H0 is true.
• It is compared to α:
• If p ≤ α, reject H0 (significant result).
• If p > α, fail to reject H0 (not significant).
• Take the final decision of accepting or rejecting the hypothesis based on this comparison.
Two kinds of errors are involved, that are Type I and Type II.
• Type I error is the incorrect rejection of a true null hypothesis and is called false
positive.
Type II error is the failure to reject a false null hypothesis and is called a false negative.
Hypothesis Testing
Two important errors are involved, called the sample error and the true (or actual) error.
1. Sample Error (Sampling Error) – Error Due to Taking a Sample
What is it?
• When we take a sample from a population, it may not perfectly represent the whole
population.
• This difference between the sample result and the actual population result is called
sample error.
Why does it happen?
• Because we are studying only a part (sample) of the population, not the entire
population.
• The sample might accidentally have more extreme values or fewer average values.
Ex: Imagine you want to know the average height of all students in your school.
• The actual average height of all students (population) is 170 cm.
• But you don’t have time to measure everyone, so you randomly select 50 students
(sample).
• You find the average height of these 50 students is 168 cm.
• Sampling Error = 168 cm - 170 cm = -2 cm.
2. Actual Error (True Error) – Mistake in Measurement or Process
What is it?
• Sometimes, errors happen because of incorrect methods or faulty tools.
• This is a real error that should not happen if everything was done correctly.
Why does it happen?
• Systematic errors (like using a broken ruler that adds 2 cm to every height).
• Human mistakes (like reading a scale wrong).
Ex: Suppose your height measuring tool is faulty and adds 2 cm extra to everyone’s height.
• The real average height of students is 170 cm.
• But because of the faulty tool, your measurement shows 172 cm for the whole
population.
• This is Actual Error = 172 cm - 170 cm = 2 cm.
• Even if you take a perfect sample, your results will still be wrong because of the
measurement mistake.
Let us assume that D is the unknown distribution, the target function is f: X → {0, 1}, x is an instance, h(x) is the hypothesis, and S is the sample set of instances drawn from X. Then, the actual (true) error is the probability that h misclassifies an instance drawn at random according to D:
error_D(h) = Pr_{x ~ D} [ f(x) ≠ h(x) ]
• The other error is called the sample error or estimator. The sample error is with respect to the sample S; it is the fraction of S that is misclassified:
error_S(h) = (1/n) Σ_{x ∈ S} δ( f(x) ≠ h(x) ), where δ(·) is 1 if its argument is true and 0 otherwise.
p-value
• Statistical tests can be performed to either accept or reject the null hypothesis. This is done using a value called the p-value or probability value. It indicates the probability of obtaining the observed result if the null hypothesis were true. The p-value is used to interpret or quantify the test.
• For example, a statistical test result may give a value of 0.03. Then, one can compare it
with the level 0.05. As 0.03 <0.05, the result is assumed to be significant. This means that
the variables tested are not independent. Here, 0.05 is called the significance level.
• In general, the significance level is called alpha (α) and the p-value is compared with it. If p-value ≤ α, the null hypothesis H0 is rejected (in favour of H1); if p-value > α, H0 is not rejected.
• 0.05 (5%): Standard significance level.
• 0.01 (1%): More strict, used in highly sensitive studies.
• 0.10 (10%): More lenient, used in exploratory research.
Ex: A company claims their new medicine lowers blood pressure more than the old one. We
conduct a test and calculate a p-value.
• If p-value = 0.02, it means there’s only a 2% chance that we would see this result if the
new medicine was not better.
• Since 0.02 is less than 0.05, we reject the null hypothesis and conclude the new medicine
is likely more effective.
Confidence Interval :A confidence interval helps us estimate where the true mean (average)
of a population lies based on a sample. It gives us a range of values that we are fairly confident
contains the true mean.
Confidence Level Formula: Confidence Level = 1 − Significance Level
• If the significance level is 0.05 (5%), then the confidence level is 95%.
• This means we are 95% sure that the true mean lies within our calculated range.
Key Terms:
• Mean (x̄) – The average of the sample data.
• Standard Deviation (s) – Tells how spread out the data is.
• Sample Size (N) – The number of observations in our sample.
• Z-score (z) – A value from statistical tables that corresponds to the confidence level
(e.g., 1.96 for 95% confidence).
• Margin of Error – The amount we add and subtract from the mean to get the confidence interval: margin of error = z · s / √N, so the interval is x̄ ± z · s / √N.
4. Interpreting the Confidence Interval:
• If we get a 95% confidence interval of (50, 60), we can say:
→ "We are 95% confident that the true mean is between 50 and 60."
• If our hypothesis states the mean should be 55, and 55 is inside this range → We
accept the hypothesis.
If 55 is outside the range → We reject the hypothesis.
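A hedged sketch of computing such a confidence interval, assuming NumPy and SciPy are available and using hypothetical sample values, is shown below.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; 95% confidence interval for the population mean
sample = np.array([52, 55, 58, 54, 57, 53, 56, 55])
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean

z = stats.norm.ppf(0.975)                         # about 1.96 for a 95% confidence level
lower, upper = mean - z * sem, mean + z * sem
print((lower, upper))   # if the hypothesised mean lies inside, it is not rejected
```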
Comparing Learning Methods
1. Z-test: A statistical test that is conducted on data that approximately follows a normal distribution. The z-test can be performed on one sample, two samples, or on proportions for hypothesis testing. It checks whether the means of two large samples are different when the population variance is known.
2. t-test: A hypothesis test that checks whether the difference between two sample means is real or due to chance. The t-test follows the t-distribution under the null hypothesis and is typically used when the number of samples is less than 30.
One sample Test:
Mean of one group is checked against the set average that can be either theoretical value or
population mean.
Select a group
Compute average
Compare it with the theoretical value and compute the t-statistic: t = (x̄ − μ) / (s / √n), where s is the sample standard deviation and n is the sample size.
Given Sample data: X={1,2,3,4,5}
• Population mean μ = 12
• Population variance σ² = 2
Steps to Compute the Z-score:
The formula for the Z-test statistic is:
Z = (X̄ − μ) / (σ / √n)
where:
• X̄ = sample mean
• μ = population mean
• σ = population standard deviation
• n = sample size
Here X̄ = (1 + 2 + 3 + 4 + 5) / 5 = 3, σ = √2 ≈ 1.414 and n = 5, so Z = (3 − 12) / (1.414 / √5) = −9 / 0.632 ≈ −14.23.
The computed Z-score is approximately −14.23. This indicates that the sample mean (3) is extremely far from the population mean (12) in terms of standard deviations, suggesting a highly significant difference.
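The same calculation can be reproduced in a couple of lines, assuming NumPy is available.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
mu, sigma = 12, np.sqrt(2)          # population mean and standard deviation
n = len(X)

z = (X.mean() - mu) / (sigma / np.sqrt(n))
print(round(z, 2))                   # approximately -14.23
```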
Independent Two-sample t-test: the t-statistic for two groups A and B is computed as t = (x̄_A − x̄_B) / (s_p √(1/n_A + 1/n_B)), where s_p is the pooled standard deviation of the two groups.
3. Paired t-test: Used to evaluate a hypothesis about measurements taken before and after an intervention. The key fact is that these samples are not independent.
4. Chi-Square test: It is a non-parametric test. The goodness-of-fit test statistic follows the Chi-Square distribution under the null hypothesis and measures the statistical significance of the difference between observed and expected frequencies, assuming each observation is independent of the others:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ and Eᵢ are the observed and expected frequencies.
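A sketch of a goodness-of-fit test with scipy.stats.chisquare, using hypothetical die-roll frequencies, is shown below.

```python
from scipy import stats

# Goodness-of-fit: observed vs. expected frequencies for a hypothetical fair die
observed = [18, 22, 16, 14, 12, 18]
expected = [100 / 6] * 6                # equal frequencies under the null hypothesis

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(chi2, p_value)   # reject H0 only if p_value <= the chosen significance level
```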
2.10 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION
TECHNIQUES
• Features are attributes. Feature engineering is about determining the subset of features that
form an important part of the input that improves the performance of the model, be it
classification or any other model in machine learning.
• Feature engineering deals with two problems – Feature Transformation and Feature
Selection.
• The features can be removed based on two aspects:
1. Feature relevancy – Some features contribute more for classification than other features.
For example, a mole on the face can help more in face detection than common features like the nose. In
simple words, the features should be relevant.
2. Feature redundancy – Some features are redundant. For example, when a database table
has a field called Date of Birth, then an Age field is redundant, as age can be computed easily from the date of birth. So, the procedure is:
1. Generate all possible subsets
2. Evaluate the subsets and model performance
3. Evaluate the results for optimal feature selection
2.10.1 Stepwise Forward Selection
This procedure starts with an empty set of attributes. At each step, the attribute that gives the best statistical significance (quality) is added to the reduced set.
This process is continued till a good reduced set of attributes is obtained.
2.10.2 Stepwise Backward Elimination
This procedure starts with a complete set of attributes. At every stage, the procedure removes the
worst attribute from the set, leading to the reduced set.
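Both procedures are available in scikit-learn as SequentialFeatureSelector; the sketch below uses the built-in Iris data purely as an illustration and selects two features in the forward direction (direction="backward" would perform stepwise elimination instead).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# direction="forward" starts from an empty set and adds the best attribute each step;
# direction="backward" starts from the full set and removes the worst attribute.
selector = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward")
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected features
```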
2.10.3 Principal Component Analysis
The idea of the principal component analysis (PCA) or KL transform is to transform a given set of
measurements to a new set of features so that the features exhibit high information packing
properties.
This leads to a reduced and compact set of features. Consider a group of random vectors of the form x = (x1, x2, … , xn)ᵀ.
The PCA algorithm is as follows:
1. The target dataset x is obtained
2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is X – m.
The objective of this process is to transform the dataset with zero mean.
3. The covariance of dataset x is obtained. Let it be C.
4. Eigen values and eigen vectors of the covariance matrix are calculated.
5. The eigen vector of the highest eigen value is the principal component of the dataset. The eigen
values are arranged in a descending order. The feature vector is formed with these eigen vectors in
its columns. Feature vector = {eigen vector1, eigen vector2, … , eigen vectorn}
6. Obtain the transpose of feature vector. Let it be A.
7. The PCA transform is y = A × (x − m), where x is the input dataset, m is the mean, and A is the transpose of the feature vector. Since A is orthogonal, the original data can be retrieved using the formula given below:
x = Aᵀ × y + m
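The algorithm above can be traced with a short NumPy sketch on a small hypothetical dataset; the numbers are illustrative only.

```python
import numpy as np

# Hypothetical two-dimensional dataset; each row is one observation
x = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

m = x.mean(axis=0)                       # step 2: mean
X_adj = x - m                            # zero-mean data
C = np.cov(X_adj, rowvar=False)          # step 3: covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(C)   # step 4: eigen values / eigen vectors

order = np.argsort(eig_vals)[::-1]       # step 5: sort eigen values, descending
A = eig_vecs[:, order].T                 # step 6: transpose of the feature vector

y = A @ X_adj.T                          # step 7: PCA transform y = A (x - m)
x_back = (A.T @ y).T + m                 # inverse transform recovers the data
print(np.allclose(x_back, x))            # True
```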
2.10.4 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is also a feature reduction technique like PCA. The focus of
LDA is to project higher dimension data to a line (lower dimension data). LDA is also used to
classify the data. Let there be two classes, c1 and c2. Let m1 and m2 be the means of the patterns of the two classes. The means of classes c1 and c2 can be computed as:
m1 = (1/N1) Σ_{x ∈ c1} x and m2 = (1/N2) Σ_{x ∈ c2} x,
where N1 and N2 are the numbers of samples in classes c1 and c2.
2.10.5 Singular Value Decomposition (SVD) :
It is another useful decomposition technique. Let A be the matrix; then A can be decomposed as:
A = U S Vᵀ
Here, A is the given matrix of dimension m × n, U is the orthogonal matrix whose dimension
is m × n, S is the diagonal matrix of dimension n × n, and V is the orthogonal matrix. The
procedure for finding decomposition matrix is given as follows:
1. For a given matrix, find AA^T.
2. Find eigen values of AA^T
3. Sort the eigen values in a descending order. Pack the corresponding eigen vectors as a matrix U.
4. Arrange the square roots of the eigen values (the singular values) along the diagonal. This diagonal matrix is S.
5. Find the eigen values and eigen vectors of AᵀA and pack the eigen vectors as a matrix called V.
Thus, A = U S Vᵀ. Here, U and V are orthogonal matrices, and the columns of U and V are the left and right singular vectors, respectively. SVD is useful in compression, as one can decide to retain only the first k components instead of the original matrix A:
A ≈ Σ_{i=1}^{k} sᵢ uᵢ vᵢᵀ
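The decomposition and the rank-k truncation can be checked with NumPy; the matrix below is a hypothetical example.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 3.0], [1.0, 1.0]])   # hypothetical 3 x 2 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U S V^T
S = np.diag(s)
print(np.allclose(U @ S @ Vt, A))                    # True: exact reconstruction

# Compression: keep only the largest singular value (rank-1 approximation)
k = 1
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A_approx)
```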
Problems on PCA
Problems on SVD