Machine
Learning
S. Sridhar and M. Vijayalakshmi
Chapter 2
Understanding of Data
Bivariate Data
INVOLVES TWO VARIABLES
Bivariate Data Visualization
Scatter Plot, Line Plot
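A minimal matplotlib sketch of both plot types, using made-up (x, y) values for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic bivariate data: two related variables (illustrative values)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([35, 42, 50, 55, 63, 70, 74, 82])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y)            # scatter plot: shows the relationship point by point
ax1.set_title("Scatter Plot")
ax2.plot(x, y, marker="o")   # line plot: shows the trend across ordered x values
ax2.set_title("Line Plot")
plt.show()
```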
Bivariate Data – Covariance
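For reference, the covariance of X and Y over N paired observations is defined as

COV(X, Y) = (1/N) Σ (xi − E[X])(yi − E[Y])

A positive covariance means the two variables tend to increase together; a negative one means they move in opposite directions.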
Bivariate Data – Correlation
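The Pearson correlation coefficient normalizes covariance by the standard deviations:

r(X, Y) = COV(X, Y) / (σX σY), with −1 ≤ r ≤ +1

A quick numpy check (np.cov and np.corrcoef compute these directly); the data values are illustrative:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

cov_xy = np.cov(x, y, bias=True)[0, 1]   # population covariance
r_xy = np.corrcoef(x, y)[0, 1]           # Pearson correlation
print(cov_xy, r_xy)
assert np.isclose(r_xy, cov_xy / (x.std() * y.std()))   # r = cov / (sx * sy)
```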
Multivariate Data Visualization
HeatMap
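A common multivariate heatmap shows the pairwise correlation matrix; a minimal sketch assuming seaborn and synthetic data:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic multivariate data with 4 variables
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["A", "B", "C", "D"])
df["B"] = df["A"] * 0.8 + df["B"] * 0.2   # make B correlated with A

# Heatmap of the pairwise correlation matrix
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap")
plt.show()
```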
Multivariate Essential Mathematics
1. LINEAR SYSTEM and GAUSSIAN ELIMINATION
In mathematics, the Gaussian elimination method, also known as row reduction, is an algorithm for solving systems of linear equations.
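A minimal numpy sketch of row reduction with partial pivoting; the function name and the 3 × 3 example system are illustrative, not from the slides:

```python
import numpy as np

def gaussian_elimination(A, b):
    """Solve Ax = b by row reduction with partial pivoting."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    # Forward elimination: reduce A to upper-triangular form
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))        # pivot row (partial pivoting)
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    # Back substitution on the triangular system
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]])
b = np.array([8.0, -11.0, -3.0])
print(gaussian_elimination(A, b))   # [ 2.  3. -1.], matches np.linalg.solve(A, b)
```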
Multivariate Essential Mathematics
2. MATRIX DECOMPOSITION
Matrix decomposition expresses a matrix as a product of simpler factor matrices, e.g., the eigen decomposition A = Q Λ Q^(−1), where the columns of Q are the eigenvectors of A and Λ is the diagonal matrix of its eigenvalues.
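A minimal numpy sketch of the eigen decomposition named above (the 2 × 2 matrix is illustrative):

```python
import numpy as np

# Eigen decomposition: A = Q @ diag(vals) @ Q^-1
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
vals, Q = np.linalg.eig(A)                       # eigenvalues; eigenvectors as columns of Q
A_rebuilt = Q @ np.diag(vals) @ np.linalg.inv(Q)
print(np.allclose(A, A_rebuilt))                 # True: the factors reproduce A
```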
Multivariate Essential Mathematics
3. PROBABILITY DISTRIBUTIONS
Any data is assumed to be generated by a probability distribution
Probability distributions are of two types:
1. Discrete Probability Distribution
2. Continuous Probability Distribution
Discrete Probability Distribution:
• Binomial Distribution
• Poisson Distribution
• Bernoulli Distribution
Continuous Probability Distribution:
• Normal Distribution
• Rectangular Distribution
• Exponential Distribution
Multivariate Essential Mathematics
The relationship between the events for a continuous random variable and their
probabilities is called a continuous probability distribution.
• It is summarized by a Probability Density Function (PDF)
• The PDF gives the density (relative likelihood) of observing an instance; probabilities are obtained by integrating it over an interval
• The plot of the PDF shows the shape of the distribution
1. NORMAL DISTRIBUTION
Also known as the Bell Curve or Gaussian Distribution
Multivariate Essential Mathematics
Most babies are likely to weigh around 7.5 pounds, with a few weighing less than 7 pounds and a few weighing more than 8 pounds.
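A sketch of this example with scipy.stats; the mean of 7.5 pounds comes from the slide, while σ = 0.5 pounds is an assumed value for illustration:

```python
from scipy.stats import norm

mu, sigma = 7.5, 0.5          # sigma is assumed for illustration
# Probability that a baby weighs between 7 and 8 pounds
p = norm.cdf(8, mu, sigma) - norm.cdf(7, mu, sigma)
print(round(p, 3))            # ~0.683: most weights fall near 7.5 pounds
```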
Multivariate Essential Mathematics
2. RECTANGULAR DISTRIBUTION
It has equal probability for all values in the interval [a, b]: f(x) = 1/(b − a) for a ≤ x ≤ b.
Also known as the Uniform (Continuous Uniform) Distribution
3. EXPONENTIAL DISTRIBUTION
This distribution is helpful in modeling the time until an event occurs: f(x) = λ e^(−λx) for x ≥ 0.
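A minimal scipy.stats sketch of both PDFs; the interval [2, 6] and the rate λ = 0.5 are assumed for illustration:

```python
from scipy.stats import uniform, expon

# Rectangular (uniform) on [a, b]: scipy parameterizes it as loc=a, scale=b-a
a, b = 2.0, 6.0
print(uniform.pdf(3.0, loc=a, scale=b - a))   # 1/(b-a) = 0.25 anywhere in [2, 6]

# Exponential with rate lam: scipy parameterizes it as scale = 1/lam
lam = 0.5                                      # assumed rate of events
print(expon.pdf(1.0, scale=1 / lam))           # lam * exp(-lam * x) at x = 1
```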
Multivariate Essential Mathematics
1. BINOMIAL DISTRIBUTION
The objective of this distribution is to find the probability of getting k successes out of n trials. Each trial has only 2 outcomes, success or failure, and
P(X = k) = C(n, k) p^k (1 − p)^(n − k), where p is the probability of success.
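A scipy.stats check of this PMF, with assumed n = 10 trials and p = 0.5:

```python
from scipy.stats import binom

n, p = 10, 0.5                 # assumed: 10 trials, fair success probability
print(binom.pmf(3, n, p))      # P(exactly 3 successes) = C(10,3)/2**10 ≈ 0.117
```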
Multivariate Essential Mathematics
2. POISSON DISTRIBUTION
Given an interval of time, this distribution is used to model the probability of a given number of events k:
P(X = k) = λ^k e^(−λ) / k!, where λ is the average number of events in the interval.
Ex: number of e-mails received, number of customers visiting a shop.
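A scipy.stats sketch with an assumed rate of λ = 4 e-mails per hour:

```python
from scipy.stats import poisson

lam = 4                        # assumed: 4 e-mails per hour on average
print(poisson.pmf(6, lam))     # P(exactly 6 e-mails in an hour) ≈ 0.104
```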
Multivariate Essential Mathematics
3. BERNOULLI DISTRIBUTION
A single trial with only two outcomes: P(X = 1) = p (success) and P(X = 0) = 1 − p (failure). It is the binomial distribution with n = 1.
Density Estimation
• Let there be a set of observed values x1, x2, …, xn from a larger set of data whose distribution is not known.
• Density estimation is the problem of estimating the density function from the observed (sample) data.
• The estimated density function, denoted p(x), can be evaluated directly for any unknown point xt as p(xt).
• If p(xt) is at or above a certain threshold, then xt is not an outlier or anomaly; otherwise it is categorized as anomaly data.
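A minimal sketch of this thresholding rule, assuming a simple Gaussian density estimate and an illustrative threshold value:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(5.0, 1.0, size=200)      # observed sample

mu, sigma = x.mean(), x.std()           # simple density estimate p(x)
threshold = 0.01                        # assumed; tuned in practice

for xt in [5.2, 9.5]:
    label = "normal" if norm.pdf(xt, mu, sigma) >= threshold else "anomaly"
    print(xt, label)                    # 5.2 -> normal, 9.5 -> anomaly
```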
Density Estimation
Two types of Density Estimation
Parametric Density Estimation
• Maximum Likelihood Estimate
• Gaussian Mixture Model and Expectation Maximization (EM) Algorithm
Non-Parametric Density Estimation
• Parzen Window
• KNN Estimation
Density Estimation
• Maximum likelihood estimation is a method that determines values for the
parameters of a model.
• The parameter values are found such that they maximise the likelihood that the
process described by the model produced the data that were actually observed.
• Let's suppose we have observed 10 data points from some process.
• For these data we'll assume that the data-generation process can be adequately described by a Gaussian (normal) distribution. A Gaussian is plausible here because most of the 10 points are clustered in the middle, with a few points scattered to the left and the right.
Density Estimation
• A Gaussian distribution has 2 parameters: the mean, μ, and the standard deviation, σ.
• Different values of these parameters result in different curves, so we want to know which curve was most likely responsible for creating the data points that we observed.
• Maximum likelihood estimation finds the values of μ and σ that result in the curve that best fits the data. (In this example, the best-fitting curve has mean = 10 and standard deviation = 2.25.)
The probability density of observing a single data point x generated from a Gaussian distribution is given by:
P(x; μ, σ) = (1 / (σ √(2π))) exp(−(x − μ)^2 / (2σ^2))
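For a Gaussian, the MLE has a closed form: μ̂ is the sample mean and σ̂ the population-form sample standard deviation. A numpy sketch with synthetic data on the scale of the example above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(10.0, 2.25, size=10)     # 10 points; true mu=10, sigma=2.25

mu_hat = x.mean()                       # MLE of mu
sigma_hat = x.std(ddof=0)               # MLE of sigma (population form)

# Log-likelihood of the observed data under the fitted curve
log_lik = norm.logpdf(x, mu_hat, sigma_hat).sum()
print(mu_hat, sigma_hat, log_lik)
```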
Density Estimation
Gaussian Mixture Model and EM Algorithm
• In machine learning, clustering is an important task.
• The MLE framework is quite useful for designing model-based methods for clustering data.
• A model here is a statistical assumption: the data are assumed to be generated by a distribution with its parameters.
• There may be many distributions involved, hence the name mixture model.
• Gaussian components are normally assumed, hence the name Gaussian Mixture Model (GMM).
Density Estimation
The Expectation Maximization (EM) algorithm is commonly used to compute the maximum likelihood estimates of the Gaussian Mixture Model parameters.
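A minimal sketch using scikit-learn's GaussianMixture, which fits the GMM parameters with EM; the two synthetic clusters are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# GaussianMixture fits component means/covariances/weights via EM
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)           # estimated component means, near (0,0) and (5,5)
print(gmm.predict(X[:5]))   # cluster assignments for the first few points
```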
Density Estimation
Parzen Window
The Parzen-window estimate of the density at a point x from n samples is
p(x) = (1/n) Σ (1/h^d) φ((x − xi)/h)
where φ is a window (kernel) function, h is the window width, and d is the dimension.
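A minimal numpy sketch of a one-dimensional Parzen estimate with a Gaussian window; the width h = 0.3 is an assumed choice:

```python
import numpy as np

def parzen_density(x, samples, h):
    """1-D Parzen-window estimate with a Gaussian kernel of width h."""
    u = (x - samples) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian window function
    return k.sum() / (len(samples) * h)            # (1/n) * sum of (1/h) * phi(u)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=500)
print(parzen_density(0.0, samples, h=0.3))   # ~0.40, near the true N(0,1) peak
```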
Density Estimation
KNN Estimation
The density at x is estimated as p(x) = k / (nV), where V is the volume of the smallest region around x that contains its k nearest neighbors.
Feature Engineering
• Feature Engineering deals with 2 problems:
• FEATURE TRANSFORMATION
• FEATURE SELECTION
Feature Engineering
• FEATURE TRANSFORMATION
- Extraction of features and creation of new features that may be helpful in increasing performance.
• Ex: Height and Weight may give a new attribute, BMI (see the sketch below)
• FEATURE SELECTION
- Feature subset selection by removing irrelevant features
- "Curse of Dimensionality": as the number of dimensions increases, time complexity increases
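A minimal pandas sketch of the Height/Weight-to-BMI transformation (illustrative values, heights in metres):

```python
import pandas as pd

# Feature transformation: derive BMI from Height (m) and Weight (kg)
df = pd.DataFrame({"Height": [1.60, 1.75, 1.82], "Weight": [58, 72, 95]})
df["BMI"] = df["Weight"] / df["Height"] ** 2   # new, more informative feature
print(df)
```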
Characteristics of Good Features
• Features are removed based on relevancy
• Features are removed based on redundancy
Procedure:
1. Generate all possible subsets
2. Evaluate the subsets and model performance
3. Evaluate the results for optimal feature selection
Feature Selection
Filter-Based Selection Methods
- No learning algorithm is used
- Use statistical measures such as correlation, mutual information, entropy, etc., for feature selection
Wrapper-Based Methods
- Use classifiers to identify the best features
- Features are selected and evaluated by the learning algorithm
- Computationally intensive, but superior performance
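A minimal filter-method sketch using scikit-learn's SelectKBest with mutual information; the Iris data and k = 2 are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Filter method: rank features by mutual information, keep the best 2
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print(selector.get_support())        # boolean mask of the selected features
X_reduced = selector.transform(X)    # dataset with only the chosen columns
```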
FEATURE SELECTION
FORWARD SELECTION: start with an empty feature set and greedily add the feature that most improves model performance.
BACKWARD SELECTION: start with all features and repeatedly remove the feature whose removal hurts performance least.
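A minimal wrapper-method sketch using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24+); the classifier and the target of 2 features are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Wrapper method: greedily add the feature that most improves the classifier
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2,
                                direction="forward")   # or "backward"
sfs.fit(X, y)
print(sfs.get_support())   # mask of the features chosen by forward selection
```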
Principal Component Analysis
• Principal Component Analysis (PCA) is a dimensionality reduction technique
• Transforms a given set of features to a new set of features so that the new features exhibit high information-packing properties
• This leads to a reduced and compact set of features
• PCA extracts the most important information. This in turn leads to compression, since the less important information is discarded.
Principal Component Analysis
Center the data by subtracting the mean vector m, then compute the covariance matrix as
C = (1/N) Σ (xi − m)(xi − m)^T
Compute the eigenvalues and eigenvectors of C, and form the matrix A whose rows are the eigenvectors ordered by decreasing eigenvalue.
Compute the PCA transform as
y = A(x − m)
The original data can be recovered as
x = A^T y + m
PCA Algorithm
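A minimal numpy sketch of the steps above; the synthetic data and the choice of k = 2 components are illustrative:

```python
import numpy as np

def pca(X, k):
    """PCA per the steps above; rows of X are samples."""
    m = X.mean(axis=0)                            # mean vector
    C = np.cov(X - m, rowvar=False, bias=True)    # covariance matrix
    vals, vecs = np.linalg.eigh(C)                # eigen decomposition (C symmetric)
    A = vecs[:, np.argsort(vals)[::-1][:k]].T     # top-k eigenvectors as rows
    Y = (X - m) @ A.T                             # transform: y = A(x - m)
    return Y, A, m

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])  # one near-flat direction
Y, A, m = pca(X, k=2)
X_rec = Y @ A + m                                 # recover: x ≈ A^T y + m
print(np.abs(X - X_rec).max())                    # small reconstruction error
```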
Linear Discriminant Analysis
• Linear discriminant analysis (LDA), also known as Fisher's linear discriminant, is a method used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes.
• LDA is a supervised learning method that seeks to find a linear combination of
features, forming a decision boundary that effectively separates two or more
classes in a dataset.
• It involves two primary steps: dimensionality reduction and linear classification.
• By projecting high-dimensional data onto a lower-dimensional space, LDA
maximizes the separation between classes while minimizing variance within each
class.
Linear Discriminant Analysis
• LDA is also closely related to principal component analysis (PCA)
• They both look for linear combinations of variables which best explain the
data.
• LDA explicitly attempts to model the difference between the classes of data.
• PCA, in contrast, does not take into account any difference in class, but only
variance in the data
• The axes created by LDA (LD1, LD2, etc.) prioritize class separation, whereas
PCA’s axes (PC1, PC2, etc.) prioritize variance.
Linear Discriminant Analysis
LDA Algorithm
LDA finds the projection V that maximizes the Fisher criterion
J(V) = (V^T σb V) / (V^T σw V)
where V is the linear projection and σb and σw are the between-class scatter matrix and the within-class scatter matrix, respectively. For a two-class problem, these matrices are given as
σb = (m1 − m2)(m1 − m2)^T
σw = Σ over class 1 of (x − m1)(x − m1)^T + Σ over class 2 of (x − m2)(x − m2)^T
and the optimal projection is V = σw^(−1) (m1 − m2).
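A minimal numpy sketch of the two-class formulas above; the synthetic clusters are illustrative:

```python
import numpy as np

# Two synthetic classes in 2-D
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, (50, 2))     # class 1 samples
X2 = rng.normal([3, 3], 1.0, (50, 2))     # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
V = np.linalg.solve(Sw, m1 - m2)          # optimal projection V = Sw^-1 (m1 - m2)

# Projected class means are well separated along V
print((X1 @ V).mean(), (X2 @ V).mean())
```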
Singular Value Decomposition
Any m × n matrix A can be factorized as A = U S V^T, where the columns of U (left singular vectors) are the eigenvectors of AA^T, the columns of V (right singular vectors) are the eigenvectors of A^T A, and S is a diagonal matrix of singular values, the square roots of the eigenvalues of A^T A.
SVD Algorithm
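A minimal numpy sketch using np.linalg.svd, including a rank-1 approximation; the matrix A is illustrative:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True: factors reproduce A

# Rank-1 approximation keeps only the largest singular value
A1 = s[0] * np.outer(U[:, 0], Vt[0])
print(A1)                                          # best rank-1 reconstruction of A
```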
Summary
This chapter covered bivariate and multivariate data and their visualization; essential mathematics for multivariate data (linear systems and Gaussian elimination, matrix decomposition, probability distributions); density estimation, both parametric (MLE, GMM with EM) and non-parametric (Parzen window, KNN); feature engineering and feature selection; and the dimensionality reduction techniques PCA, LDA, and SVD.