ARTIFICIAL INTELLIGENCE & MACHINE LEARNING (BDS602)
MODULE 4
Understanding Data: 2
2.6 BIVARIATE DATA AND MULTIVARIATE DATA
Bivariate data involves two variables. Bivariate analysis deals with causes and relationships between them; the aim is to find relationships in the data. Consider the following Table 2.3, with data of the temperature in a shop and the sales of sweaters.
Here, the aim of bivariate analysis is to find relationships among variables. The relationships
can then be used in comparisons, finding causes, and in further explorations. To do that,
graphical display of the data is necessary. One such graph method is called scatter plot. Scatter
plot is used to visualize bivariate data. It is useful to plot two variables with or without nominal
variables, to illustrate the trends, and also to show differences. It is a plot between explanatory
and response variables. It is a 2D graph showing the relationship between two variables.
Line graphs are similar to scatter plots. The Line Chart for sales data is shown in Figure 2.12.
2.6.1 BIVARIATE STATISTICS
Covariance and Correlation are examples of bivariate statistics. Covariance is a measure of the joint variability of two random variables, say X and Y. Generally, random variables are represented by capital letters. It is written as covariance(X, Y) or COV(X, Y) and is used to measure how the two dimensions vary together. The formula for finding the covariance from specific observations x and y is:
COV(X, Y) = (1/N) Σᵢ (xᵢ − E[X]) (yᵢ − E[Y])
where E[X] and E[Y] are the means of X and Y, and N is the number of observations.
The covariance between X and Y is 12. It can be normalized to a value between −1 and +1. This is done by dividing it by the product of the standard deviations of X and Y; the result is called the Pearson correlation coefficient.
Sometimes, N − 1 is used instead of N (the sample covariance). In that case, the covariance is 60/4 = 15.
Correlation
The Pearson correlation coefficient is the most common test for determining any association
between two phenomena. It measures the strength and direction of a linear relationship between
the x and y variables.
1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one-dimension increases, the other dimension
decreases.
3. If the value is zero, then it indicates that the two dimensions are uncorrelated, i.e., they have no linear relationship.
If two dimensions are highly correlated, then it is better to remove one of them, as it is redundant. If the given attributes are X = (x1, x2, … , xN) and Y = (y1, y2, … , yN), then the Pearson correlation coefficient, denoted r, is given as:
r = COV(X, Y) / (σ_X σ_Y) = Σᵢ (xᵢ − X̄)(yᵢ − Ȳ) / √( Σᵢ (xᵢ − X̄)² · Σᵢ (yᵢ − Ȳ)² )
EXAMPLE PROBLEM ON COV and CORR
Given two datasets:
X = [2, 4, 6, 8], Y = [1, 3, 2, 5]. Find the covariance between X and Y.
Solution:
Step 1: Compute the Means
The mean of X is:
• X̄ = (2 + 4 + 6 + 8) / 4 = 20 / 4 = 5
The mean of Y is:
• Ȳ = (1 + 3 + 2 + 5) / 4 = 11 / 4 = 2.75
Step 2: Compute the Covariance
COV(X, Y) = (1/N) Σᵢ (xᵢ − X̄)(yᵢ − Ȳ)
= [(2 − 5)(1 − 2.75) + (4 − 5)(3 − 2.75) + (6 − 5)(2 − 2.75) + (8 − 5)(5 − 2.75)] / 4
= (5.25 − 0.25 − 0.75 + 6.75) / 4 = 11 / 4 = 2.75
This positive covariance indicates that as X increases, Y tends to increase as well.
Step 3: Compute the Correlation
σ_X = √( ((−3)² + (−1)² + 1² + 3²) / 4 ) = √5 ≈ 2.236
σ_Y = √( ((−1.75)² + 0.25² + (−0.75)² + 2.25²) / 4 ) = √2.1875 ≈ 1.479
r = COV(X, Y) / (σ_X σ_Y) = 2.75 / (2.236 × 1.479) ≈ 0.832
Since r is close to 1, this indicates a strong positive correlation between X and Y. As X
increases, Y tends to increase as well.
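These calculations can be cross-checked with a short sketch, assuming NumPy is available; it uses the population form of covariance (dividing by N), which matches the working above.

```python
import numpy as np

X = np.array([2, 4, 6, 8])
Y = np.array([1, 3, 2, 5])

# Population covariance: mean of the products of deviations (divide by N)
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))   # 2.75

# Pearson correlation: covariance divided by the product of standard deviations
r = cov_xy / (X.std() * Y.std())                     # about 0.832

print(cov_xy, r)
```

Note that NumPy's built-in np.cov divides by N − 1 by default, which corresponds to the sample covariance mentioned earlier.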
2.7 MULTIVARIATE STATISTICS
In machine learning, almost all datasets are multivariate. Multivariate analysis deals with more than two observable variables, and often thousands of measurements need to be collected for one or more subjects.
Multivariate data has three or more variables. The aims of multivariate analysis are broader; typical techniques include regression analysis, factor analysis, and multivariate analysis of variance (MANOVA).
Heatmap
A heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours it. The darker colours indicate very large values and the lighter colours indicate smaller values.
The advantage of this method is that humans perceive colours well, so by colour shading, larger values can be spotted quickly.
For example, in vehicle traffic data, heavy-traffic regions can be differentiated from low-traffic regions through a heatmap. In Figure 2.13, patient data highlighting weight and health status is plotted. Here, the X-axis shows weights and the Y-axis shows patient counts. The dark-coloured regions highlight patients' weights versus patient counts for each health status.
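A heatmap can be produced with a few lines of Matplotlib. The sketch below is only illustrative: it colours a small random matrix, since the actual patient data of Figure 2.13 is not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 2D matrix (e.g., counts per weight bin and status); values are random
rng = np.random.default_rng(0)
matrix = rng.integers(0, 50, size=(6, 8))

plt.imshow(matrix, cmap="hot")     # colour intensity encodes the magnitude of each cell
plt.colorbar(label="count")
plt.xlabel("weight bin")
plt.ylabel("health status")
plt.title("Heatmap of a 2D matrix")
plt.show()
```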
Pairplot
A pairplot or scatter matrix is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data. A random matrix of three columns is chosen and the relationships between the columns are plotted as a pairplot (or scatter matrix), as shown below in Figure 2.14.
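A sketch of how such a scatter matrix could be produced, assuming seaborn and pandas are available, is shown below; the three-column random matrix is hypothetical and does not reproduce the data of Figure 2.14.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical multivariate data: a random matrix with three columns
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["col1", "col2", "col3"])

sns.pairplot(df)   # pair-wise scatter plots of every pair of columns
plt.show()
```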
2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA
Machine learning involves many mathematical concepts from the domains of linear algebra, statistics, probability, and information theory. The subsequent sections discuss important aspects of linear algebra and probability.
2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data
A linear system of equations is a group of equations with unknown variables. Let Ax = y; then the solution x is given as:
x = A⁻¹ y
This is true provided A is not singular (its determinant is not zero). The logic can be extended to a set of N equations with n unknown variables: if A is the n × n coefficient matrix and y = (y1, y2, … , yn)ᵀ, then the unknown vector x can be found as x = A⁻¹ y.
If there is a unique solution, then the system is called consistent independent. If there are infinitely many solutions, then the system is called consistent dependent. If there are no solutions and the equations are contradictory, then the system is called inconsistent. For solving a large system of equations, Gaussian elimination can be used. The procedure for applying
Gaussian elimination is given as follows:
1. Write the given matrix.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a11 as the pivot and eliminate the entry below it in the second row using the row operation R2 → R2 − (a21/a11) R1. The same logic can be used to remove the first-column entries in all the other rows.
4. Repeat the same logic and reduce the matrix to (row) echelon form. Then, the last unknown variable is obtained directly from the last row as:
xₙ = bₙ / aₙₙ
5. Then, the remaining unknown variables can be found by back-substitution as:
xᵢ = ( bᵢ − Σ_{j>i} aᵢⱼ xⱼ ) / aᵢᵢ
This part is called backward substitution. To facilitate the application of Gaussian elimination
method, the following row operations are applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it.
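As an illustration of these steps, the following sketch (assuming NumPy, and assuming every pivot is non-zero so that no row swaps are needed) performs forward elimination followed by backward substitution on the system solved in the worked example that follows.

```python
import numpy as np

def gaussian_elimination(A, y):
    """Solve Ax = y by forward elimination followed by backward substitution.
    Assumes every pivot is non-zero, so no row swaps are performed."""
    aug = np.hstack([A.astype(float), y.reshape(-1, 1).astype(float)])  # [A | y]
    n = len(y)

    # Forward elimination: scale each pivot row, then zero out entries below the pivot
    for i in range(n):
        aug[i] = aug[i] / aug[i, i]
        for j in range(i + 1, n):
            aug[j] = aug[j] - aug[j, i] * aug[i]

    # Backward substitution (pivots are already 1)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = aug[i, -1] - aug[i, i + 1:n] @ x[i + 1:n]
    return x

A = np.array([[2, 3, -1], [4, 1, 2], [-2, 5, 1]])
y = np.array([5, 6, 4])
print(gaussian_elimination(A, y))   # approximately [1.0156, 1.125, 0.40625]
```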
Example Problems:
GAUSSIAN ELIMINATION
1. Given the following system of equations:
2x + 3y − z = 5
4x + y + 2z = 6
−2x + 5y + z = 4
Find x, y, z.
Step 1: Convert the Equations into an Augmented Matrix
We write only the coefficients and constants in matrix form:
[  2   3  −1 |  5 ]
[  4   1   2 |  6 ]
[ −2   5   1 |  4 ]
Step 2: Make the First Element (Pivot) 1
To make the first element 1, we divide the first row by 2: R1 → R1 / 2 = [1  1.5  −0.5 | 2.5]
Step 3: Make the First Column Below the Pivot Zero
We want to make the numbers below the 1st pivot (1 in row 1, column 1) zero.
R2 → R2 − 4R1, which gives [0  −5  4 | −4]
R3 → R3 + 2R1, which gives [0  8  0 | 9]
Step 4: Make the Second Pivot 1
To make the second pivot 1, divide Row 2 by −5: R2 → R2 / (−5) = [0  1  −0.8 | 0.8]
Step 5: Make the Second Column Below the Pivot Zero
To make the entry below the second pivot 0, apply R3 → R3 − 8R2, which gives [0  0  6.4 | 2.6].
Step 6: Make the Third Pivot 1
To make the third pivot 1, divide Row 3 by 6.4: R3 → R3 / 6.4 = [0  0  1 | 0.40625]
Now, we have an upper triangular matrix, and we can solve for z, y, and x using back-
substitution.
Step 7: Back-Substitution
We now solve for z, y, and x one by one.
From Row 3: z = 0.40625
From Row 2: y − 0.8(0.40625) = 0.8, so y = 0.8 + 0.325 = 1.125
From Row 1: x + 1.5(1.125) − 0.5(0.40625) = 2.5, so x = 2.5 − 1.6875 + 0.203125 = 1.015625
Final Answer:
x ≈ 1.0156, y = 1.125, z = 0.40625
2. Given the following
X + Y = 5, 2X − Y = 1. Find X, Y.
Step 1: Convert to an augmented matrix:
[ 1   1 | 5 ]
[ 2  −1 | 1 ]
Step 2: Make the first pivot element 1. The first pivot is already 1.
Step 3: Make the element below the pivot 0:
R2 → R2 − 2R1, which gives [0  −3 | −9]
Step 4: Make the second pivot 1:
Divide Row 2 by −3, which gives [0  1 | 3]
Step 5: Make the element above the second pivot 0:
R1 → R1 − R2, which gives [1  0 | 2]
Step 6: Read the solution. From the final matrix:
X = 2, Y = 3
These concepts are illustrated in Example 2.8.
2.8.2 Matrix Decomposition
It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations
can be performed. For a symmetric matrix A, the eigen decomposition expresses it as:
A = Q Λ Qᵀ
where Q is the matrix of eigen vectors, Λ is the diagonal matrix of eigen values, and Qᵀ is the transpose of matrix Q.
LU Decomposition: One of the simplest matrix decompositions is LU decomposition, where the matrix A can be decomposed into two matrices:
A = LU
Here, L is the lower triangular matrix and U is the upper triangular matrix.
The decomposition can be done using Gaussian elimination method as discussed in the
previous section. First, an identity matrix is augmented to the given matrix.
Then, row operations and Gaussian elimination are applied to reduce the given matrix and obtain the matrices L and U.
Example 2.9 illustrates the application of Gaussian elimination to get LU.
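A quick way to check an LU factorization is SciPy's lu routine. Note that it applies partial pivoting, so it returns a permutation matrix P along with L and U; the matrix values below are only illustrative.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[ 2.0, 3.0, -1.0],
              [ 4.0, 1.0,  2.0],
              [-2.0, 5.0,  1.0]])

P, L, U = lu(A)                     # P permutation, L lower triangular, U upper triangular
print(np.allclose(P @ L @ U, A))    # True: A has been factored into triangular parts
```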
2.8.3 Machine Learning and Importance of Probability and Statistics
Machine learning is linked with statistics and probability. Like linear algebra, statistics is the
heart of machine learning. The importance of statistics needs to be stressed, as without statistics, data cannot be analysed or interpreted meaningfully.
Probability Distributions
A probability distribution of a variable, say X, summarizes the probability associated with X’s
events. Distribution is a parameterized mathematical function. In other words, distribution is a
function that describes the relationship between the observations in a sample space.
Consider a set of data. The data is said to follow a distribution if it obeys a mathematical
function that characterizes that distribution. The function can be used to calculate the
probability of individual observations.
Probability distributions are of two types:
1. Discrete probability distribution
2. Continuous probability distribution
The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution.
1. Continuous Probability Distributions – Normal, Rectangular, and Exponential distributions fall under this category.
I. Normal Distribution – Normal distribution is a continuous probability distribution.
This is also known as the Gaussian distribution or bell-shaped curve distribution. It is the
most common distribution function. The shape of this distribution is a typical bell-
shaped curve. In normal distribution, data tends to be around a central value with no
bias on left or right. The heights of the students, blood pressure of a population, and
marks scored in a class can be approximated using normal distribution. PDF of the
normal distribution is given as:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Here, μ is the mean and σ is the standard deviation. The normal distribution is characterized by two parameters – mean and variance. One important concept associated with the normal distribution is the z-score. It can be computed as:
z = (x − μ) / σ
This is useful to normalize the data.
II. Rectangular Distribution – This is also known as uniform distribution. It has equal
probabilities for all values in the range [a, b]. The uniform distribution is given as follows:
f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise.
III. Exponential Distribution – This is a continuous probability distribution used to describe the time between events in a Poisson process. The exponential distribution is a special case of the Gamma distribution with its shape parameter fixed at 1. This distribution is helpful in modelling the time until an event occurs. The PDF is given as follows:
f(x; λ) = λ · e^(−λx) for x ≥ 0, and 0 otherwise, where λ > 0 is the rate parameter.
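The three continuous distributions above can be evaluated with scipy.stats, assuming SciPy is available; the parameter values below are arbitrary examples.

```python
from scipy import stats

# Normal distribution with mean 0 and standard deviation 1
print(stats.norm.pdf(0, loc=0, scale=1))        # peak of the bell curve, about 0.399

# Uniform (rectangular) distribution on [a, b] = [2, 6]; loc=a, scale=b-a
print(stats.uniform.pdf(3, loc=2, scale=4))     # 1 / (b - a) = 0.25

# Exponential distribution with rate lambda = 2; SciPy uses scale = 1/lambda
print(stats.expon.pdf(1, scale=0.5))            # lambda * exp(-lambda * x)
```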
2. Discrete Distribution
Binomial, Poisson, and Bernoulli distributions fall under this category.
I. Binomial Distribution – The binomial distribution is another distribution that is often encountered in machine learning. Each trial has only two outcomes, success or failure; such a trial is also called a Bernoulli trial. The objective of this distribution is to find the probability of getting k successes out of n trials. The number of ways of getting k successes out of n trials is given as:
C(n, k) = n! / (k!(n − k)!)
If p is the probability of success, the probability of failure is (1 − p), and the probability of a particular sequence of k successes in n trials is p^k (1 − p)^(n−k). Combining both, one gets the PMF of the binomial distribution as:
P(X = k) = C(n, k) · p^k · (1 − p)^(n−k)
Here, p is the probability of success, k is the number of successes, and n is the total number of trials. The mean of the binomial distribution is np and its variance is np(1 − p).
II. Poisson Distribution – It is another important distribution that is quite useful. Given an interval of time, this distribution is used to model the probability of a given number of events k occurring in that interval. The mean rate λ is the average number of events per interval. Some examples of the Poisson distribution are the number of emails received, the number of customers visiting a shop, and the number of phone calls received by an office. The PMF of the Poisson distribution is given as follows:
P(X = k) = λ^k · e^(−λ) / k!, for k = 0, 1, 2, …
III. Bernoulli Distribution – This distribution models an experiment whose outcome is
binary. The outcome is positive (x = 1) with probability p and negative (x = 0) with probability 1 − p. The PMF of this distribution is given as:
P(X = x) = p^x · (1 − p)^(1−x), x ∈ {0, 1}
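Similarly, the discrete distributions can be evaluated with scipy.stats, assuming SciPy is available; the parameter values used here are arbitrary examples.

```python
from scipy import stats

# Binomial: probability of k = 3 successes in n = 10 trials with p = 0.5
print(stats.binom.pmf(3, n=10, p=0.5))        # C(10, 3) * 0.5**3 * 0.5**7

# Poisson: probability of k = 2 events when the mean rate is lambda = 4
print(stats.poisson.pmf(2, mu=4))             # 4**2 * exp(-4) / 2!

# Bernoulli: probability of a positive outcome (x = 1) with p = 0.3
print(stats.bernoulli.pmf(1, p=0.3))          # 0.3
```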
IV. Density Estimation
Let there be a set of observed values x1, x2, … , xn from a larger set of data whose distribution
is not known. Density estimation is the problem of estimating the density function from the observed data.
There are two types of density estimation methods, namely parametric density estimation and
non-parametric density estimation.
V. Parametric Density Estimation
It assumes that the data is drawn from a known probability distribution and can be modelled as p(x | Θ), where Θ denotes the distribution parameters estimated from the sample. The maximum likelihood function is a parametric estimation method.
VI. Maximum Likelihood Estimation
For a sample of observations, one can estimate the probability distribution. This is called
density estimation. Maximum Likelihood Estimation (MLE) is a probabilistic framework that
can be used for density estimation.
This involves formulating a function called likelihood function which is the conditional
probability of observing the observed samples and distribution function with its parameters.
For example, if the observations are X = {x1, x2, … , xn}, then density estimation is the
problem of choosing a PDF with suitable parameters to describe the data.
MLE treats this problem as a search or optimization problem in which the joint probability of the observations X under the parameters θ is maximized.
If one assumes that the regression problem can be framed as predicting output y given input x, then for p(y | x) the MLE framework can be applied as:
maximize Σᵢ log p(yᵢ | xᵢ; h)
Here, h is the linear regression model. If a Gaussian distribution is assumed, since much real-world data is approximately Gaussian, then maximizing the likelihood reduces to:
minimize Σᵢ (yᵢ − h(xᵢ; β))²
Here, β is the vector of regression coefficients and xᵢ is the given sample. One can maximize the likelihood, or equivalently minimize the negative log-likelihood, to obtain a solution for the linear regression problem. Equation (2.37) yields the same answer as the least-squares approach.
Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm :
Generally, the data may come from several unspecified distributions, each with a different set of parameters. The EM algorithm has two stages:
1. Expectation (E) Stage – In this stage, the expected PDF and its parameters are estimated
for each latent variable.
2. Maximization (M) stage – In this, the parameters are optimized using the MLE function.
This process is iterative, and the iteration is continued till all the latent variables are fitted
by probability distributions effectively along with the parameters.
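A minimal sketch of this idea using scikit-learn's GaussianMixture, which runs the EM algorithm internally, is given below; the two-component data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical one-dimensional data drawn from two overlapping groups
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

# GaussianMixture alternates the E and M steps internally until convergence
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_.ravel())    # estimated component means, roughly 0 and 5
print(gmm.weights_)          # estimated mixing proportions
```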
KNN Estimation: KNN estimation is another non-parametric density estimation method. The parameter k is fixed first, and based on it, the k nearest neighbours of a query point are determined. The probability density estimate at that point is then computed from the values returned by these neighbours.
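A rough sketch of a one-dimensional k-NN density estimate is given below; it uses the standard k / (n × volume) formulation, where the volume is the neighbourhood containing the k nearest neighbours, and the data is synthetic.

```python
import numpy as np

def knn_density_1d(x, sample, k=5):
    """k-NN density estimate in 1-D: p(x) ~ k / (n * volume),
    where the volume is twice the distance to the k-th nearest neighbour."""
    distances = np.sort(np.abs(sample - x))
    r_k = distances[k - 1]
    return k / (len(sample) * 2 * r_k)

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 1000)
print(knn_density_1d(0.0, sample, k=20))   # roughly 0.40 for the standard normal at 0
```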
2.9 OVERVIEW OF HYPOTHESIS
Data collection alone is not enough. Data must be interpreted to give a conclusion. The
conclusion should be a structured outcome. This assumption of the outcome is called a
hypothesis.
• Statistical methods are used to confirm or reject the hypothesis.
• The assumption of the statistical test is called the null hypothesis. It is also called hypothesis zero (H0).
• In other words, the null hypothesis is the existing belief. The violation of this hypothesis is called the first hypothesis (H1) or hypothesis one. This is the hypothesis the researcher is trying to establish.
There are two types of hypothesis tests, parametric and non-parametric.
• Parametric tests are based on parameters such as mean and standard deviation.
• Non-parametric tests depend on characteristics such as independence of events or whether the data follows a certain distribution.
The general steps in hypothesis testing are:
1. Define the null and alternate hypotheses
2. Describe the hypothesis using parameters
3. Identify the statistical test and statistics
4. Decide the criterion called the significance value α
5. Compute p-value (probability value)
1. Define Null and Alternate Hypothesis:
• Null Hypothesis (H0): It represents the statement of no effect, no difference, or status
quo. It assumes that any observed differences are due to random chance.
• Alternative Hypothesis (Hα): It represents a statement indicating the presence of an
effect, a difference, or a relationship between variables.
2. Describe the Hypotheses Using Parameters: Hypotheses are expressed using population parameters (e.g., mean μ, proportion p, standard deviation σ). Example: H0: μ = μ0 versus H1: μ ≠ μ0.
3. Identify the Statistical Test and Statistic:
• The choice of statistical test depends on the type of data and hypothesis:
• t-test: Compares means of two groups.
• z-test: Used when population variance is known.
• Chi-square test: Used for categorical data.
• ANOVA: Compares means of more than two groups.
• Regression analysis: Tests relationships between variables.
Test Statistic: A numerical value computed from the sample data to determine whether to
reject H0.
4. Decide the Significance Level (α):
• The significance level (α) represents the probability of rejecting H0 when it is actually
true (Type I error).
5. Compute p-value (Probability Value):
• The p-value represents the probability of obtaining a result at least as extreme as the
observed one, assuming H0 is true.
• It is compared to α:
• If p ≤ α, reject H0 (significant result).
• If p > α, fail to reject H0 (not significant).
• Take the final decision of accepting or rejecting the hypothesis based on this comparison.
Two kinds of errors are involved, that are Type I and Type II.
• Type I error is the incorrect rejection of a true null hypothesis and is called false
positive.
Type II error is the failure to reject a false null hypothesis and is called a false negative.
Hypothesis Testing
Two important errors are involved, called the sample error and the true (or actual) error.
1. Sample Error (Sampling Error) – Error Due to Taking a Sample
What is it?
• When we take a sample from a population, it may not perfectly represent the whole
population.
• This difference between the sample result and the actual population result is called
sample error.
Why does it happen?
• Because we are studying only a part (sample) of the population, not the entire
population.
• The sample might accidentally have more extreme values or fewer average values.
Ex: Imagine you want to know the average height of all students in your school.
• The actual average height of all students (population) is 170 cm.
• But you don’t have time to measure everyone, so you randomly select 50 students
(sample).
• You find the average height of these 50 students is 168 cm.
• Sampling Error = 168 cm - 170 cm = -2 cm.
2. Actual Error (True Error) – Mistake in Measurement or Process
What is it?
• Sometimes, errors happen because of incorrect methods or faulty tools.
• This is a real error that should not happen if everything was done correctly.
Why does it happen?
• Systematic errors (like using a broken ruler that adds 2 cm to every height).
• Human mistakes (like reading a scale wrong).
Ex: Suppose your height measuring tool is faulty and adds 2 cm extra to everyone’s height.
• The real average height of students is 170 cm.
• But because of the faulty tool, your measurement shows 172 cm for the whole
population.
• This is Actual Error = 172 cm - 170 cm = 2 cm.
• Even if you take a perfect sample, your results will still be wrong because of the
measurement mistake.
Let us assume that D is the unknown distribution, the target function is f: X → {0, 1}, x is an instance, h(x) is the hypothesis, and S is the sample set of instances drawn from X. Then, the actual (true) error is the probability that h misclassifies an instance drawn at random according to D:
error_D(h) = Pr_{x ~ D} [ f(x) ≠ h(x) ]
• The other error is called the sample error or estimator. The sample error is with respect to the sample S; it is the fraction of S that is misclassified:
error_S(h) = (1/n) Σ_{x ∈ S} δ( f(x) ≠ h(x) ), where δ(·) is 1 if its argument is true and 0 otherwise.
p-value
• Statistical tests can be performed to either accept or reject the null hypothesis. This is done using a value called the p-value or probability value. It indicates the probability of obtaining the observed result if the null hypothesis were true. The p-value is used to interpret or quantify the test.
• For example, a statistical test result may give a value of 0.03. Then, one can compare it
with the level 0.05. As 0.03 <0.05, the result is assumed to be significant. This means that
the variables tested are not independent. Here, 0.05 is called the significance level.
• In general, the significance level is called alpha (α) and the p-value is compared with it. If p-value ≤ α, the null hypothesis H0 is rejected (in favour of H1); if p-value > α, H0 is not rejected.
• 0.05 (5%): Standard significance level.
• 0.01 (1%): More strict, used in highly sensitive studies.
• 0.10 (10%): More lenient, used in exploratory research.
Ex: A company claims their new medicine lowers blood pressure more than the old one. We
conduct a test and calculate a p-value.
• If p-value = 0.02, it means there’s only a 2% chance that we would see this result if the
new medicine was not better.
• Since 0.02 is less than 0.05, we reject the null hypothesis and conclude the new medicine
is likely more effective.
Confidence Interval :A confidence interval helps us estimate where the true mean (average)
of a population lies based on a sample. It gives us a range of values that we are fairly confident
contains the true mean.
Confidence Level Formula: Confidence Level = 1 − Significance Level
• If the significance level is 0.05 (5%), then the confidence level is 95%.
• This means we are 95% sure that the true mean lies within our calculated range.
Key Terms:
• Mean (x̄) – The average of the sample data.
• Standard Deviation (s) – Tells how spread out the data is.
• Sample Size (N) – The number of observations in our sample.
• Z-score (z) – A value from statistical tables that corresponds to the confidence level
(e.g., 1.96 for 95% confidence).
• Margin of Error – The amount we add and subtract from the mean to get the confidence interval: margin of error = z · s / √N, so the interval is x̄ ± z · s / √N.
4. Interpreting the Confidence Interval:
• If we get a 95% confidence interval of (50, 60), we can say:
→ "We are 95% confident that the true mean is between 50 and 60."
• If our hypothesis states the mean should be 55, and 55 is inside this range → We
accept the hypothesis.
If 55 is outside the range → We reject the hypothesis.
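A hedged sketch of computing such a confidence interval, assuming NumPy and SciPy are available and using hypothetical sample values, is shown below.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; 95% confidence interval for the population mean
sample = np.array([52, 55, 58, 54, 57, 53, 56, 55])
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean

z = stats.norm.ppf(0.975)                         # about 1.96 for a 95% confidence level
lower, upper = mean - z * sem, mean + z * sem
print((lower, upper))   # if the hypothesised mean lies inside, it is not rejected
```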
Comparing Learning Methods
1. Z-test: A statistical test that is conducted on data that approximately follows a normal distribution. The z-test can be performed on one sample, two samples, or on proportions for hypothesis testing. It checks whether the means of two large samples are different when the population variance is known.
2. t-test: A hypothesis test that checks whether the difference between two sample means is real or due to chance. The t-test follows the t-distribution under the null hypothesis and is typically used when the number of samples is less than 30.
One sample Test:
Mean of one group is checked against the set average that can be either theoretical value or
population mean.
Select a group
Compute average
Compare it with the theoretical value and compute the t-statistic: t = (x̄ − μ) / (s / √n), where s is the sample standard deviation and n is the sample size.
Given Sample data: X={1,2,3,4,5}
• Population mean μ = 12
• Population variance σ² = 2
Steps to Compute the Z-score:
The formula for the Z-test statistic is:
Z = (X̄ − μ) / (σ / √n)
where:
• X̄ = sample mean
• μ = population mean
• σ = population standard deviation
• n = sample size
Here X̄ = (1 + 2 + 3 + 4 + 5) / 5 = 3, σ = √2 ≈ 1.414 and n = 5, so Z = (3 − 12) / (1.414 / √5) = −9 / 0.632 ≈ −14.23.
The computed Z-score is approximately −14.23. This indicates that the sample mean (3) is extremely far from the population mean (12) in terms of standard deviations, suggesting a highly significant difference.
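The same calculation can be reproduced in a couple of lines, assuming NumPy is available.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
mu, sigma = 12, np.sqrt(2)          # population mean and standard deviation
n = len(X)

z = (X.mean() - mu) / (sigma / np.sqrt(n))
print(round(z, 2))                   # approximately -14.23
```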
Independent Two-sample t-test: the t-statistic for two groups A and B is computed as t = (x̄_A − x̄_B) / (s_p √(1/n_A + 1/n_B)), where s_p is the pooled standard deviation of the two groups.
3. Paired t-test: Used to evaluate a hypothesis about measurements taken before and after an intervention. The key fact is that these samples are not independent.
4. Chi-Square test: It is a non-parametric test. The goodness-of-fit test statistic follows the Chi-Square distribution under the null hypothesis and measures the statistical significance of the difference between observed and expected frequencies, assuming each observation is independent of the others:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ and Eᵢ are the observed and expected frequencies.
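A sketch of a goodness-of-fit test with scipy.stats.chisquare, using hypothetical die-roll frequencies, is shown below.

```python
from scipy import stats

# Goodness-of-fit: observed vs. expected frequencies for a hypothetical fair die
observed = [18, 22, 16, 14, 12, 18]
expected = [100 / 6] * 6                # equal frequencies under the null hypothesis

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(chi2, p_value)   # reject H0 only if p_value <= the chosen significance level
```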
2.10 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION
TECHNIQUES
• Features are attributes. Feature engineering is about determining the subset of features that
form an important part of the input that improves the performance of the model, be it
classification or any other model in machine learning.
• Feature engineering deals with two problems – Feature Transformation and Feature
Selection.
• The features can be removed based on two aspects:
1. Feature relevancy – Some features contribute more for classification than other features.
For example, a mole on the face can help more in face detection than common features like the nose. In
simple words, the features should be relevant.
2. Feature redundancy – Some features are redundant. For example, when a database table
has a field called Date of Birth, then an Age field is redundant, as age can be computed easily from the date of birth. So, the procedure is:
1. Generate all possible subsets
2. Evaluate the subsets and model performance
3. Evaluate the results for optimal feature selection
2.10.1 Stepwise Forward Selection
This procedure starts with an empty set of attributes. At each step, the attribute that gives the best statistical significance (quality) is added to the reduced set.
This process is continued till a good reduced set of attributes is obtained.
2.10.2 Stepwise Backward Elimination
This procedure starts with a complete set of attributes. At every stage, the procedure removes the
worst attribute from the set, leading to the reduced set.
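Both procedures are available in scikit-learn as SequentialFeatureSelector; the sketch below uses the built-in Iris data purely as an illustration and selects two features in the forward direction (direction="backward" would perform stepwise elimination instead).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# direction="forward" starts from an empty set and adds the best attribute each step;
# direction="backward" starts from the full set and removes the worst attribute.
selector = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward")
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected features
```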
2.10.3 Principal Component Analysis
The idea of the principal component analysis (PCA) or KL transform is to transform a given set of
measurements to a new set of features so that the features exhibit high information packing
properties.
This leads to a reduced and compact set of features. Consider a group of random vectors of the form x = (x1, x2, … , xn)ᵀ.
The PCA algorithm is as follows:
1. The target dataset x is obtained
2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is X – m.
The objective of this process is to transform the dataset with zero mean.
3. The covariance of dataset x is obtained. Let it be C.
4. Eigen values and eigen vectors of the covariance matrix are calculated.
5. The eigen vector of the highest eigen value is the principal component of the dataset. The eigen
values are arranged in a descending order. The feature vector is formed with these eigen vectors in
its columns. Feature vector = {eigen vector1, eigen vector2, … , eigen vectorn}
6. Obtain the transpose of feature vector. Let it be A.
7. The PCA transform is y = A × (x − m), where x is the input dataset, m is the mean, and A is the transpose of the feature vector. Since A is orthogonal, the original data can be retrieved using the formula given below:
x = Aᵀ × y + m
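The algorithm above can be traced with a short NumPy sketch on a small hypothetical dataset; the numbers are illustrative only.

```python
import numpy as np

# Hypothetical two-dimensional dataset; each row is one observation
x = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

m = x.mean(axis=0)                       # step 2: mean
X_adj = x - m                            # zero-mean data
C = np.cov(X_adj, rowvar=False)          # step 3: covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(C)   # step 4: eigen values / eigen vectors

order = np.argsort(eig_vals)[::-1]       # step 5: sort eigen values, descending
A = eig_vecs[:, order].T                 # step 6: transpose of the feature vector

y = A @ X_adj.T                          # step 7: PCA transform y = A (x - m)
x_back = (A.T @ y).T + m                 # inverse transform recovers the data
print(np.allclose(x_back, x))            # True
```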
2.10.4 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is also a feature reduction technique like PCA. The focus of
LDA is to project higher dimension data to a line (lower dimension data). LDA is also used to
classify the data. Let there be two classes, c1 and c2. Let m1 and m2 be the means of the patterns of the two classes. The means of classes c1 and c2 can be computed as:
m1 = (1/N1) Σ_{x ∈ c1} x and m2 = (1/N2) Σ_{x ∈ c2} x,
where N1 and N2 are the numbers of samples in classes c1 and c2.
2.10.5 Singular Value Decomposition (SVD) :
It is another useful decomposition technique. Let A be the matrix; then A can be decomposed as:
A = U S Vᵀ
Here, A is the given matrix of dimension m × n, U is the orthogonal matrix whose dimension
is m × n, S is the diagonal matrix of dimension n × n, and V is the orthogonal matrix. The
procedure for finding decomposition matrix is given as follows:
1. For a given matrix, find AA^T.
2. Find eigen values of AA^T
3. Sort the eigen values in a descending order. Pack the corresponding eigen vectors as a matrix U.
4. Arrange the square roots of the eigen values (the singular values) along the diagonal. This diagonal matrix is S.
5. Find the eigen values and eigen vectors of AᵀA and pack the eigen vectors as a matrix called V.
Thus, A = U S Vᵀ. Here, U and V are orthogonal matrices, and the columns of U and V are the left and right singular vectors, respectively. SVD is useful in compression, as one can decide to retain only the first k components instead of the original matrix A:
A ≈ Σ_{i=1}^{k} sᵢ uᵢ vᵢᵀ
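The decomposition and the rank-k truncation can be checked with NumPy; the matrix below is a hypothetical example.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 3.0], [1.0, 1.0]])   # hypothetical 3 x 2 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U S V^T
S = np.diag(s)
print(np.allclose(U @ S @ Vt, A))                    # True: exact reconstruction

# Compression: keep only the largest singular value (rank-1 approximation)
k = 1
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A_approx)
```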
Problems on PCA
Problems on SVD