
UNIT-III UNSUPERVISED LEARNING

Clustering – K-means – EM – Mixtures of Gaussians – The EM Algorithm in General –
Model selection for latent variable models – High-dimensional spaces – The Curse of
Dimensionality – Dimensionality Reduction – Factor Analysis – Principal Component
Analysis – Probabilistic PCA – Independent Component Analysis.

Clustering

Clustering, or cluster analysis, is a machine learning technique that groups an
unlabelled dataset. It can be defined as "a way of grouping the data points into
different clusters consisting of similar data points. Objects with possible
similarities remain in a group that has few or no similarities with any other
group."

 It works by finding similar patterns in the unlabelled dataset, such as
shape, size, color, behavior, etc., and divides the data points according to the
presence or absence of those patterns.
 It is an unsupervised learning method, hence no supervision is provided to the
algorithm, and it deals with the unlabeled dataset.
 After applying this clustering technique, each cluster or group is given a
cluster-ID. The ML system can use this ID to simplify the processing of large and
complex datasets.
 The clustering technique is commonly used for statistical data analysis.

Example: Let's understand the clustering technique with a real-world example of a
shopping mall. When we visit a shopping mall, we can observe that items with similar
usage are grouped together: t-shirts are grouped in one section, trousers are in
another section, and in the fruit and vegetable section, apples, bananas, mangoes,
etc., are grouped separately, so that we can easily find things. The clustering
technique works in the same way. Another example of clustering is grouping documents
according to their topic.

The clustering technique can be widely used in various tasks. Some most common
uses of this technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its recommendation
system to provide recommendations based on a user's past product
searches. Netflix also uses this technique to recommend movies and web series to
its users based on their watch history.

The below diagram explains the working of the clustering algorithm. We can see the
different fruits are divided into several
groups with similar properties.

Types of Clustering Methods

The clustering methods are broadly divided into Hard Clustering (each data point
belongs to only one group) and Soft Clustering (a data point can belong to more than
one group). Various other approaches to clustering also exist. Below are the main
clustering methods used in machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

1. Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of K groups, where K defines the
number of pre-defined groups. The cluster centres are chosen in such a way that each
data point is closer to its own cluster centroid than to any other cluster centroid.

2. Density-Based Clustering

The density-based clustering method connects highly dense areas into clusters, so
arbitrarily shaped clusters can be formed as long as the dense regions can be
connected. It works by identifying dense regions in the dataset and joining
neighbouring areas of high density into clusters. The dense areas in data space are
separated from each other by sparser areas.

These algorithms can struggle to cluster the data points if the dataset has
varying densities or a high number of dimensions.

3. Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is
done by assuming that the data follow certain distributions, commonly the Gaussian
distribution.

The typical example of this type is the Expectation-Maximization Clustering
algorithm, which uses Gaussian Mixture Models (GMM).
4. Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitioning clustering,
as there is no requirement to pre-specify the number of clusters to be created. In
this technique, the dataset is divided into clusters to create a tree-like structure,
which is also called a dendrogram. The observations, or any desired number of
clusters, can be selected by cutting the tree at the appropriate level. The most
common example of this method is the Agglomerative Hierarchical algorithm.
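As a brief illustration of hierarchical clustering and the dendrogram it produces, the following sketch uses scikit-learn and SciPy on a small synthetic dataset; the dataset, the choice of four groups, and the Ward linkage are illustrative assumptions, not part of these notes.

# A minimal hierarchical-clustering sketch (assumes scikit-learn and SciPy are installed).
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic 2-D data with four loose groups (illustrative only).
X, _ = make_blobs(n_samples=60, centers=4, random_state=0)

# Bottom-up (agglomerative) clustering into 4 clusters using Ward linkage.
labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)
print("cluster sizes:", [list(labels).count(k) for k in range(4)])

# The same merge process drawn as a dendrogram; cutting the tree at a chosen
# height selects the number of clusters, as described above.
Z = linkage(X, method="ward")
dendrogram(Z, no_labels=True)
plt.title("Agglomerative clustering dendrogram (Ward linkage)")
plt.show()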

5. Fuzzy Clustering

Fuzzy clustering is a type of soft clustering in which a data object may belong to
more than one group or cluster. Each data point has a set of membership coefficients,
which describe its degree of membership in each cluster. The Fuzzy C-means
algorithm is the typical example of this type of clustering; it is sometimes also
known as the Fuzzy k-means algorithm.

Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. Many
different clustering algorithms have been published, but only a few are commonly
used. The choice of clustering algorithm depends on the kind of data we are using.
For example, some algorithms need the number of clusters in the given dataset to be
specified, whereas others need to find the minimum distance between observations of
the dataset.

1. K-Means algorithm: The k-means algorithm is one of the most popular
clustering algorithms. It clusters the dataset by dividing the samples into
different groups of roughly equal variance. The number of clusters must be specified
in this algorithm. It is fast, requiring relatively few computations, with linear
complexity O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a
smooth density of data points. It is an example of a centroid-based model that
works by updating candidate centroids to be the mean of the points within a given
region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model, similar to
mean-shift but with some remarkable advantages. In this algorithm, areas of high
density are separated by areas of low density, so clusters can be found in any
arbitrary shape (a small sketch follows this list).
4. Expectation-Maximization Clustering using GMM: This algorithm can be
used as an alternative to the k-means algorithm, or for cases where K-means
may fail. In GMM, it is assumed that the data points are Gaussian
distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical
algorithm performs the bottom-up hierarchical clustering. In this, each data
point is treated as a single cluster at the outset and then successively merged.
The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does
not require the number of clusters to be specified. In this algorithm, data points
exchange messages between pairs of points until convergence. It has O(N²T) time
complexity, which is the main drawback of this algorithm.
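The DBSCAN sketch below, referenced in item 3 above, is a minimal illustration using scikit-learn; the two-moons dataset and the eps/min_samples values are assumptions chosen only to show non-spherical clusters being recovered.

# A minimal DBSCAN sketch (assumes scikit-learn is installed).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: dense regions with an arbitrary (non-spherical) shape.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighbourhood radius, min_samples the density threshold (illustrative values).
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                      # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", int(np.sum(labels == -1)))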

Applications of Clustering
Below are some commonly known applications of clustering technique in Machine
Learning:

o In Identification of Cancer Cells: Clustering algorithms are widely used
for the identification of cancerous cells. They divide cancerous and non-
cancerous data into different groups.
o In Search Engines: Search engines also work on the clustering technique.
Search results appear based on the objects closest to the search query. This is
done by grouping similar data objects into one group that is far from the
dissimilar objects. The accuracy of a query's results depends on the quality of the
clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of
plants and animals using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar
land use in a GIS database. This is very useful for determining the purpose for
which a particular piece of land is most suitable.

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve
clustering problems in machine learning and data science. In this topic, we will
learn what the K-means clustering algorithm is and how it works, along with a
Python implementation of k-means clustering.

What is K-Means Algorithm?


K-Means Clustering is an Unsupervised Learning algorithm, which groups the
unlabeled dataset into different clusters. Here K defines the number of pre-defined
clusters that need to be created in the process, as if K=2, there will be two clusters,
and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group, and points in the same group
have similar properties.

It allows us to cluster the data into different groups and provides a convenient way to
discover the categories of groups in an unlabeled dataset on its own, without the need
for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data point
and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters. The value of
k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best values for the K center points, or centroids, by an iterative
process.
o Assigns each data point to its closest k-center. The data points nearest to a
particular k-center form a cluster.

Hence each cluster contains data points with some commonalities and is far away from
the other clusters.

In a typical diagram of K-means, data points are repeatedly assigned to the nearest of
the K centroids, and the centroids are then updated to the mean of their assigned
points until the assignments stop changing.
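As a minimal sketch of the two tasks described above (choosing centroids iteratively and assigning points to the nearest centre), the following example uses scikit-learn's KMeans on synthetic data; the dataset and the choice K=3 are illustrative assumptions.

# A minimal K-means sketch (assumes scikit-learn is installed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled synthetic data with three natural groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)

# K must be chosen in advance; the algorithm alternates assignment and centroid updates.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

print("centroids:\n", kmeans.cluster_centers_)
print("first ten labels:", kmeans.labels_[:10])
print("inertia (sum of squared distances to centroids):", kmeans.inertia_)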

EM algorithm

The Expectation-Maximization (EM) algorithm underlies various unsupervised
machine learning algorithms and is used to determine local maximum likelihood
estimates (MLE) or maximum a posteriori (MAP) estimates for unobservable
variables in statistical models. It is a technique for finding maximum likelihood
estimates when latent variables are present, and it is commonly discussed in the
context of latent variable models.

A latent variable model consists of both observable and unobservable variables:
observable variables can be measured directly, while unobserved variables are inferred
from the observed ones. These unobservable variables are known as latent variables.

Key Points:

o It is used with latent variable models to determine MLE and MAP estimates of
parameters involving latent variables.
o It is used to estimate parameter values in instances where data is missing or
unobservable for learning, and this is repeated until the values converge.

EM Algorithm

The EM algorithm is related to various unsupervised ML algorithms, such as
the k-means clustering algorithm. Being an iterative approach, it consists of two
modes. In the first mode, we estimate the missing or latent variables; hence it is
referred to as the Expectation/estimation step (E-step). The other mode is used to
optimize the parameters of the model so that it can explain the data more clearly;
it is known as the maximization step (M-step).

o Expectation step (E - step): It involves the estimation (guess) of all missing


values in the dataset so that after completing this step, there should not be any
missing value.
o Maximization step (M - step): This step involves the use of estimated data in
the E-step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.

The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to
update the values of the parameters in the M-step.

What is Convergence in the EM algorithm?

Intuitively, convergence means that successive values become almost identical;
for example, if two random variables have a very small difference in their
probability, they are said to have converged. In other words, whenever the values of
the given variables stop changing significantly between iterations, convergence has
occurred.

Steps in EM Algorithm

The EM algorithm is completed mainly in 4 steps, which are the Initialization Step,
Expectation Step, Maximization Step, and Convergence Step. These steps are
explained as follows:

o 1st Step: The very first step is to initialize the parameter values. Further, the
system is provided with incomplete observed data with the assumption that data
is obtained from a specific model.

o 2nd Step: This step is known as Expectation or E-Step, which is used to


estimate or guess the values of the missing or incomplete data using the
observed data. Further, E-step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use
complete data obtained from the 2nd step to update the parameter values.
Further, M-step primarily updates the hypothesis.

o 4th Step: The last step is to check whether the values of the latent variables are
converging. If they are, stop the process; otherwise, repeat the process from step
2 until convergence occurs.
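A minimal numerical sketch of these four steps, for a one-dimensional mixture of two Gaussians, is given below; the synthetic data, the two-component choice, and the initial guesses are all illustrative assumptions.

# A minimal EM sketch for a 1-D two-component Gaussian mixture (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians (the "incomplete" observed data).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])

# Step 1: initialize the parameters (means, variances, mixing weights).
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

prev_ll = -np.inf
for _ in range(200):
    # Step 2 (E-step): estimate the responsibility of each component for each point.
    weighted = pi * gauss(x[:, None], mu, var)        # shape (n, 2)
    resp = weighted / weighted.sum(axis=1, keepdims=True)

    # Step 3 (M-step): update the parameters using the estimated responsibilities.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)

    # Step 4: check convergence of the log-likelihood; otherwise repeat from step 2.
    ll = np.log(weighted.sum(axis=1)).sum()
    if abs(ll - prev_ll) < 1e-6:
        break
    prev_ll = ll

print("means:", mu, "variances:", var, "weights:", pi)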

Applications of EM algorithm

The primary aim of the EM algorithm is to estimate the missing data in the latent
variables through observed data in datasets. The EM algorithm or latent variable
model has a broad range of real-life applications in machine learning. These are as
follows:

o The EM algorithm is applicable in data clustering in machine learning.


o It is often used in computer vision and NLP (Natural language processing).
o It is used to estimate the parameter values in mixture models such as
the Gaussian Mixture Model, and in quantitative genetics.
o It is also used in psychometrics for estimating item parameters and latent
abilities of item response theory models.
o It is also applicable in the medical and healthcare industry, for example in image
reconstruction, as well as in structural engineering.
o It is used to estimate the parameters of a Gaussian density function.

Advantages of EM algorithm
o The first two basic steps of the EM algorithm, the E-step and the M-step, are
very easy to implement in many machine learning problems.
o The likelihood is guaranteed not to decrease after each iteration.
o It often yields a closed-form solution for the M-step.

Disadvantages of EM algorithm
o The convergence of the EM algorithm is very slow.
o It can make convergence for the local optima only.
o It takes both forward and backward probabilities into account, in contrast to
numerical optimization, which considers only forward probabilities.

Gaussian Mixture Model (GMM)

The Gaussian Mixture Model (GMM) is a mixture model that represents the data
as a combination of several probability distributions whose parameters are initially
unspecified. GMM therefore requires estimated statistics such as the mean and standard
deviation (the parameters) of each component. It is used to estimate the parameters of
the probability distributions that best fit the density of a given training dataset.
Although there are plenty of techniques available to estimate the parameters of the
Gaussian Mixture Model (GMM), Maximum Likelihood Estimation is one of the most popular
techniques among them.

Consider a case where we have a dataset with multiple data points generated by two
different processes. Both processes produce similar Gaussian probability
distributions, and their data are combined, so it is very difficult to tell which
distribution a given point belongs to.

The processes used to generate the data points represent a latent variable, or
unobservable data. In such cases, the Expectation-Maximization algorithm is one of the
best techniques for estimating the parameters of the Gaussian distributions. In the EM
algorithm, the E-step estimates the expected value of each latent variable, whereas the
M-step optimizes the parameters using Maximum Likelihood Estimation (MLE). This process
is repeated until a good set of latent values is obtained and a maximum of the
likelihood that fits the data is reached.
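In practice, the EM fitting of a Gaussian mixture described above is usually done with a library. The sketch below uses scikit-learn's GaussianMixture (which fits GMMs by EM); the two overlapping synthetic processes are an illustrative assumption.

# A minimal GMM sketch (assumes scikit-learn is installed).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Data points generated by two different (latent) processes and then combined.
X = np.concatenate([rng.normal(0.0, 1.0, (300, 2)),
                    rng.normal(3.0, 1.2, (200, 2))])

# GaussianMixture fits the component means, covariances and weights by EM.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print("means:\n", gmm.means_)
print("weights:", gmm.weights_)
# Soft assignments: the probability that each point came from each process.
print("responsibilities of first point:", gmm.predict_proba(X[:1]))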

The Curse of Dimensionality

Handling high-dimensional data is very difficult in practice; this difficulty is
commonly known as the curse of dimensionality. As the dimensionality of the input
dataset increases, any machine learning algorithm and model becomes more complex. As
the number of features increases, the number of samples required to cover the space
grows rapidly, and the chance of overfitting also increases. If a machine learning
model is trained on high-dimensional data, it becomes overfitted and gives poor
performance.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.

Dimensionality Reduction

The number of input features, variables, or columns present in a given dataset


is known as dimensionality, and the process to reduce these features is called
dimensionality reduction.

In many cases a dataset contains a huge number of input features, which makes the
predictive modeling task more complicated. Because it is very difficult to visualize
or make predictions for a training dataset with a large number of features,
dimensionality reduction techniques are required in such cases.

A dimensionality reduction technique can be defined as "a way of converting a
higher-dimensional dataset into a lower-dimensional dataset while ensuring that it
provides similar information." These techniques are widely used in machine learning
for obtaining a better-fitting predictive model when solving classification and
regression problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.

Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction technique to the given dataset are
given below:

o By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
o Less computation and training time are required with a reduced number of features.
o Reduced dimensions of features of the dataset help in visualizing the data
quickly.
o It removes the redundant features (if present) by taking care of
multicollinearity.

Disadvantages of dimensionality Reduction

There are also some disadvantages of applying the dimensionality reduction, which
are given below:

o Some data may be lost due to dimensionality reduction.


o In the PCA dimensionality reduction technique, the number of principal
components to retain is sometimes not known in advance.

Approaches of Dimension Reduction

There are two ways to apply the dimension reduction technique, which are given
below:

Feature Selection

Feature selection is the process of selecting the subset of the relevant features and
leaving out the irrelevant features present in a dataset to build a model of high
accuracy. In other words, it is a way of selecting the optimal features from the input
dataset.

Three methods are used for feature selection:

1. Filter Methods: In this method, the dataset is filtered, and a subset that contains
only the relevant features is taken. Some common filter techniques are:

o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.

2. Wrapper Methods

The wrapper method has the same goal as the filter method, but it uses a machine
learning model for its evaluation. In this method, subsets of features are fed to the
ML model and the performance is evaluated. The performance decides whether to add or
remove features to increase the accuracy of the model. This method is more accurate
than the filter method but more computationally expensive. Some common techniques of
wrapper methods are:

o Forward Selection
o Backward Selection
o Bi-directional Elimination

3. Embedded Methods: Embedded methods evaluate the importance of each feature
during the training iterations of the machine learning model itself. Some common
techniques of embedded methods are listed below, followed by a small sketch of
LASSO-based selection:

o LASSO
o Elastic Net
o Ridge Regression, etc.
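As referenced above, here is a small sketch of LASSO-based embedded feature selection using scikit-learn's SelectFromModel; the synthetic regression dataset and the alpha value are illustrative assumptions.

# A minimal embedded feature-selection sketch with LASSO (assumes scikit-learn).
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# 20 input features, only 5 of which are actually informative (illustrative data).
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# LASSO drives the coefficients of unimportant features to zero during training,
# so feature importance is obtained as part of model fitting (an embedded method).
selector = SelectFromModel(Lasso(alpha=1.0), threshold=1e-5).fit(X, y)

print("kept features:", list(selector.get_support(indices=True)))
X_reduced = selector.transform(X)
print("reduced shape:", X_reduced.shape)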

Feature Extraction:

Feature extraction is the process of transforming the space containing many


dimensions into space with fewer dimensions. This approach is useful when we want
to keep the whole information but use fewer resources while processing the
information.

Some common feature extraction techniques are:

a) Principal Component Analysis


b) Linear Discriminant Analysis
c) Kernel PCA
d) Quadratic Discriminant Analysis

Factor Analysis
Factor analysis is a special technique for reducing a huge number of variables
into a small number of factors; this process is known as factoring the data. It is a
purely statistical approach that describes the variability among observed, correlated
variables in terms of a potentially lower number of unobserved variables
called factors.
The factor analysis technique extracts the maximum common variance from all
the variables and puts it into a common score. It is used in training machine learning
models and is therefore closely related to data mining. The idea behind factor
analytic techniques is that information about the interdependencies between observed
variables can be used later to reduce the set of variables in a dataset.
Factor analysis is a very effective tool for inspecting relationships among variables
for complex concepts such as social status, economic status, dietary patterns,
psychological scales, biology, psychometrics, personality theories, marketing, product
management, operations research, finance, etc. It can help a researcher investigate
concepts that are not easily measured directly, in a much easier and quicker way, by
collapsing a large number of variables into a few easily interpretable fundamental
factors.
Types of factor analysis:
1. Exploratory factor analysis (EFA) :

It is used to identify composite inter-relationships among items and to group items
that form part of unified concepts. The analyst cannot make any prior assumptions
about the relationships among the factors. It is also used to find the fundamental
structure of a huge set of variables, reducing the large data to a much smaller set of
summary variables. It is closely related to Confirmatory Factor Analysis (CFA).
Similarities are:
 Evaluate the internal reliability of a measure.
 Examine the factors represented by item sets. They presume that the factors
aren’t correlated.
 Investigate the grade/class of each item.

However, there are some common differences, most of them concerning how the
factors are used. Basically, EFA is a data-driven approach that allows all items
to load on all the factors, while in CFA you need to specify which items load on
which factors. EFA is a good choice if you have no idea what common factors might
exist: it can generate a huge number of possible models for your data, which is not
possible if a researcher has to specify the factors in advance. If you have some idea
of what the model looks like and you then want to test your hypotheses about the data
structure, CFA is the better approach.
2. Confirmatory factor analysis (CFA) :

It is a more complex (composite) approach that tests the theory that the items are
associated with specific factors. Confirmatory Factor Analysis uses a properly
structured equation model to test a measurement model, whereby the loadings on the
factors allow for the evaluation of relationships between observed variables and
unobserved variables.
As we know, structural equation modelling approaches can accommodate measurement
error easily, and they are much less restrictive than least-squares estimation, thus
providing more scope to accommodate errors. Hypothesized models are tested
against actual data, and the analysis demonstrates the loadings of observed variables
on the latent variables (factors), as well as the correlations between the latent
variables.
Confirmatory Factor Analysis allows an analyst and researcher to figure out if a
relationship between a set of observed variables (also known as manifest variables)
and their underlying constructs exists. It is similar to the Exploratory Factor Analysis.

The main differences between the two are:


 Simply use Exploratory Factor Analysis to explore the pattern.
 Use Confirmatory Factor Analysis to perform hypothesis testing.

Confirmatory Factor Analysis provides information about the quality of fit for
the number of factors that are required to represent the data set. Using
Confirmatory Factor Analysis, you can define the total number of factors required.
For example, Confirmatory Factor Analysis can answer questions like "Can my
thousand-question survey accurately measure one specific factor?" Even though it is
technically applicable to any discipline, it is typically used in the social
sciences.

3. Multiple Factor Analysis :

This type of factor analysis is used when your variables are structured in
groups of variables. For example, you may have a teenager's health
questionnaire with several sections such as sleeping patterns, addictions,
psychological health, mobile phone addiction, or learning disabilities.
Multiple Factor Analysis is performed in two steps:
 First, Principal Component Analysis is performed on each section of the data.
This gives a useful eigenvalue, which is used to normalize the data sets for
further use.
 The newly formed data sets are then merged into a single matrix and a global
PCA is performed.

4. Generalized Procrustes Analysis (GPA) :

Procrustes analysis is a way of comparing two approximate sets of
configurations and shapes; it was originally developed to match two solutions
from factor analysis. The technique was extended to Generalized Procrustes
Analysis so that more than two shapes could be compared in many ways. The
shapes are properly aligned to achieve a common target shape. GPA (Generalized
Procrustes Analysis) mainly uses geometric transformations.
The geometric transformations are:
 Isotropic rescaling,
 Reflection,
 Rotation,
 Translation of matrices to compare the sets of data.

Eigenvalues
When factor analysis generates the factors, each factor has an associated
eigenvalue, which gives the total variance explained by that factor.
Usually, the factors having eigenvalues greater than 1 are retained:

Percentage of variation explained by F1 = Eigenvalue of Factor 1 / No. of Variables

Percentage of variation explained by F2 = Eigenvalue of Factor 2 / No. of Variables
X v = λ v

X is a general matrix as before, which is multiplied by some vector v, and λ is a characteristic
value. Looking at the equation, notice that when you multiply the matrix by the vector, the effect
is to reproduce the same vector, just multiplied by the value λ. This is unusual behaviour and
earns the vector and the quantity λ special names: the eigenvector and the eigenvalue.
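A quick numerical check of the relation X v = λ v can be done with NumPy, as in the sketch below; the small symmetric matrix is an arbitrary illustrative choice.

# A minimal eigenvalue/eigenvector sketch (NumPy only).
import numpy as np

# An arbitrary small symmetric matrix (illustrative only).
X = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(X)
print("eigenvalues:", eigenvalues)

# Verify X v = lambda v for the first eigenpair.
v = eigenvectors[:, 0]
lam = eigenvalues[0]
print("X @ v      :", X @ v)
print("lambda * v :", lam * v)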

Factor Loadings

In addition, not all factors are created equal; some factors carry more weight
and some carry less. As a simple example, imagine a car company, say Maruti
Suzuki, conducting a customer-satisfaction survey (using telephone surveys,
physical surveys, Google Forms, etc.), with the results showing the following
factor loadings:
VARIABLE  |   F1  |   F2  |   F3
Problem 1 | 0.985 | 0.111 | -0.032
Problem 2 | 0.724 | 0.008 |  0.167
Problem 3 | 0.798 | 0.180 |  0.345

Here –
F1 – Factor 1
F2 – Factor 2
F3 – Factor 3

The factor that affects each variable the most is the one with the highest factor
loading (here F1 in every row). Factor loadings are similar to correlation
coefficients in that they can vary from -1 to 1. The closer a loading is to -1 or 1,
the more strongly that factor affects the variable.
Note: A factor loading of 0 indicates no effect.
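A loadings table like the one above can be produced programmatically; the sketch below uses scikit-learn's FactorAnalysis on synthetic data as a stand-in for survey responses. The dataset, the three-factor choice, and the variable labels are illustrative assumptions, and the numbers will not match the table above.

# A minimal factor-analysis loadings sketch (assumes scikit-learn is installed).
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.datasets import make_blobs

# Synthetic stand-in for survey data: 200 respondents, 6 observed variables.
X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=0)

fa = FactorAnalysis(n_components=3, random_state=0).fit(X)

# components_ has shape (n_factors, n_variables); its entries play the role of
# factor loadings: how strongly each factor relates to each observed variable.
loadings = fa.components_.T
for i, row in enumerate(loadings, start=1):
    print(f"Variable {i}: " + "  ".join(f"{v:7.3f}" for v in row))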

PRINCIPAL COMPONENT ANALYSIS:
Principal Component Analysis is an unsupervised learning algorithm that is
used for dimensionality reduction in machine learning. It is a statistical process
that converts the observations of correlated features into a set of linearly
uncorrelated features with the help of an orthogonal transformation. These new
transformed features are called the Principal Components. It is one of the popular
tools used for exploratory data analysis and predictive modeling. It is a technique
for drawing strong patterns from a given dataset by reducing the dimensionality while
retaining as much variance as possible.
PCA generally tries to find the lower-dimensional surface onto which to project the
high-dimensional data.
PCA works by considering the variance of each attribute, because attributes with high
variance give a good split between the classes; projecting onto the high-variance
directions therefore reduces the dimensionality.
Some real-world applications of PCA are image processing, movie recommendation
systems, and optimizing the power allocation in various communication channels. It is
a feature extraction technique, so it keeps the important variables and drops the
least important ones.
The PCA algorithm is based on some mathematical concepts such as:
 Variance and Covariance
 Eigenvalues and Eigenvectors
Some common terms used in PCA algorithm:
Dimensionality: It is the number of features or variables present in the given dataset.
More easily, it is the number of columns present in the dataset.
Correlation: It signifies that how strongly two variables are related to each other.
Such as if one changes, the other variable also gets changed. The correlation value
ranges from -1 to +1. Here, -1 occurs if variables are inversely proportional to each
other, and +1 indicates that variables are directly proportional to each other.
Orthogonal: It means that the variables are not correlated with each other, and hence
the correlation between the pair of variables is zero.
Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of
M if Mv is a scalar multiple of v.
Covariance Matrix: A matrix containing the covariance between the pair of variables
is called the Covariance Matrix

Principal Components in PCA

As described above, the transformed new features, or the output of PCA, are the
Principal Components. The number of these PCs is either equal to or less than the
number of original features present in the dataset. Some properties of these principal
components are given below:

o The principal component must be the linear combination of the original


features.
o These components are orthogonal, i.e., the correlation between a pair of
variables is zero.
o The importance of each component decreases when going from 1 to n; that is, the
1st PC has the most importance and the nth PC has the least importance.

Steps for PCA algorithm


1.Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y,
where X is the training set, and Y is the validation set.

2.Representing data into a structure


Now we will represent our dataset in a structure. For example, we represent the
independent variables as a two-dimensional matrix X. Here each row corresponds to a
data item, and each column corresponds to a feature. The number of columns equals the
dimensionality of the dataset.

3.Standardizing the data


In this step, we will standardize our dataset. In a particular column, features with
high variance would otherwise appear more important than features with lower
variance.
If the importance of features should be independent of their variance, we divide each
data item in a column by the standard deviation of that column. The resulting matrix
is named Z.

4.Calculating the Covariance of Z


To calculate the covariance of Z, we will take the matrix Z, and will transpose it.
After transpose, we will multiply it by Z. The output matrix will be the Covariance
matrix of Z.

5.Calculating the Eigen Values and Eigen Vectors


Now we need to calculate the eigenvalues and eigenvectors of the resultant
covariance matrix of Z. The eigenvectors of the covariance matrix are the directions
of the axes with high information, and the corresponding eigenvalues measure the
amount of variance along those directions.

6.Sorting the Eigen Vectors


In this step, we will take all the eigenvalues and sort them in decreasing order,
i.e., from largest to smallest, and simultaneously sort the eigenvectors
accordingly in a matrix P. The resultant matrix is named P*.

7.Calculating the new features or Principal Components

Here we will calculate the new features. To do this, we multiply the P* matrix by Z.
In the resultant matrix Z*, each observation is a linear combination of the original
features. Each column of the Z* matrix is independent of the others.

8.Removing less important features from the new dataset

Now that the new feature set has been obtained, we decide what to keep and what to
remove: only the relevant or important features are kept in the new dataset, and the
unimportant features are removed.
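The eight steps above can be followed almost literally with NumPy, as in the sketch below; the tiny random dataset and the choice to keep two components are illustrative assumptions (scikit-learn's PCA class wraps the same computation).

# A minimal step-by-step PCA sketch following the steps above (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
# Steps 1-2: a small data matrix X, rows = data items, columns = features.
X = rng.normal(size=(100, 4))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]      # make two features correlated

# Step 3: standardize each column (subtract the mean, divide by the standard deviation).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of Z (Z transposed times Z, scaled by n - 1).
cov = (Z.T @ Z) / (len(Z) - 1)

# Step 5: eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 6: sort eigenvalues (and their eigenvectors) in decreasing order -> P*.
order = np.argsort(eigvals)[::-1]
eigvals, P_star = eigvals[order], eigvecs[:, order]

# Steps 7-8: project Z onto the top components and drop the rest (keep 2 here).
Z_star = Z @ P_star[:, :2]
print("explained variance ratio:", eigvals[:2] / eigvals.sum())
print("reduced data shape:", Z_star.shape)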

APPLICATIONS OF PRINCIPAL COMPONENT ANALYSIS:


PCA is mainly used as the dimensionality reduction technique in various AI
applications such as computer vision, image compression, etc.
It can also be used for finding hidden patterns when the data has high dimensionality.
Some fields where PCA is used are finance, data mining, psychology, etc.

PROBABILISTIC PCA:
Probabilistic principal component analysis (PPCA) is a dimensionality
reduction technique that analyzes data via a lower-dimensional latent space. It is
often used when there are missing values in the data or for multidimensional scaling.
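As a rough illustration of probabilistic PCA, the sketch below computes the closed-form maximum likelihood solution described by Tipping and Bishop (noise variance from the discarded eigenvalues, weight matrix from the leading eigenvectors); the synthetic data and the choice of a two-dimensional latent space are illustrative assumptions.

# A minimal probabilistic PCA sketch via the closed-form ML solution (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 5, 2, 500          # observed dims, latent dims, samples (illustrative)

# Generate data that really does live near a 2-D latent subspace plus noise.
W_true = rng.normal(size=(d, q))
X = rng.normal(size=(n, q)) @ W_true.T + 0.1 * rng.normal(size=(n, d))

# Sample covariance and its eigendecomposition.
mu = X.mean(axis=0)
S = np.cov(X - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# ML noise variance: average of the discarded eigenvalues.
sigma2 = eigvals[q:].mean()
# ML weight matrix: W = U_q (L_q - sigma^2 I)^(1/2) (up to a rotation).
W = eigvecs[:, :q] @ np.diag(np.sqrt(eigvals[:q] - sigma2))

# Posterior mean of the latent variables: z = M^-1 W^T (x - mu), M = W^T W + sigma^2 I.
M = W.T @ W + sigma2 * np.eye(q)
Z = (X - mu) @ W @ np.linalg.inv(M).T

print("estimated noise variance:", sigma2)
print("latent representation shape:", Z.shape)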

INDEPENDENT COMPONENT ANALYSIS:


Independent Component Analysis (ICA) is a machine learning technique to
separate independent sources from a mixed signal. Unlike principal component
analysis which focuses on maximizing the variance of the data points, the
independent component analysis focuses on independence, i.e. independent
components.

Problem: To extract independent sources’ signals from a mixed signal composed of


the signals from those sources.
Independent Component Analysis (ICA) is a technique for separating independent
signals from a multi-dimensional signal. It is used for signal processing, data
analysis, and machine learning applications. The goal of ICA is to find a linear
transformation of the data such that the transformed data is as close to being
statistically independent as possible.
1. The underlying idea of ICA is to find a set of basis functions that are as
independent as possible, and to represent the data in terms of these basis
functions. The transformed data is then assumed to be statistically
independent, and can be used for various applications, such as denoising,
feature extraction, and source separation.
2. There are various algorithms that can be used to perform ICA, including
FastICA, JADE, and Infomax. These algorithms differ in their optimization
objectives and the methods used to estimate the independent components.

Given: Mixed signal from five different independent sources.


Aim: To decompose the mixed signal into independent sources:
 Source 1
 Source 2
 Source 3
 Source 4
 Source 5

Solution: Independent Component Analysis (ICA). Consider the Cocktail Party
Problem, or Blind Source Separation problem, to understand the problem that is
solved by independent component analysis.

Here, a party is going on in a room full of people. There are 'n' speakers in the
room, and they are speaking simultaneously. In the same room there are also 'n'
microphones placed at different distances from the speakers, each recording the 'n'
speakers' voice signals; hence the number of speakers equals the number of
microphones. Using these microphone recordings, we want to separate all 'n' speakers'
voice signals, given that each microphone recorded the voice signal from each speaker
at a different intensity due to the differences in distance between them. Decomposing
each microphone's mixed recording into independent source speech signals can be done
using the machine learning technique of independent component analysis:
[ X1, X2, ….., Xn ] => [ Y1, Y2, ….., Yn ], where X1, X2, …, Xn are the signals in the
mixed recordings and Y1, Y2, …, Yn are the new features, the independent components,
which are independent of each other.
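A minimal sketch of this blind source separation using scikit-learn's FastICA is shown below; the two synthetic source signals and the mixing matrix are illustrative assumptions standing in for the speakers and microphones.

# A minimal blind-source-separation sketch with FastICA (assumes scikit-learn).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two independent source signals (stand-ins for two speakers).
s1 = np.sin(2 * t)                        # smooth sinusoid
s2 = np.sign(np.sin(3 * t))               # square wave
S = np.c_[s1, s2]

# Two "microphones": each records a different linear mixture of the sources.
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                # mixing matrix (illustrative)
X = S @ A.T

# Recover the independent components from the mixed recordings.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)

print("mixed recordings shape:", X.shape)
print("recovered sources shape:", S_est.shape)
# The recovered signals match the originals up to ordering and scale.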

Restrictions on ICA –

1. The independent components generated by the ICA are assumed to be


statistically independent of each other.
2. The independent components generated by the ICA must have non-gaussian
distribution.
3. The number of independent components generated by the ICA is equal to the
number of observed mixtures.
Advantages of Independent Component Analysis (ICA):
 Non-Gaussianity: ICA assumes that the source signals are non-Gaussian,
which makes it well-suited for separating signals that are not easily separable
by other methods, such as linear regression or PCA.
 Blind Source Separation: ICA is capable of separating signals without any
prior knowledge about the sources or their relationships. This is useful in many
applications where the sources are unknown, such as in speech separation or
EEG signal analysis.
 Computationally Efficient: ICA algorithms are computationally efficient and
can be applied to large datasets.
 Interpretability: ICA provides an interpretable representation of the data,
where each component represents a single source signal. This can help in
understanding the underlying structure of the data and in making informed
decisions about the data.

Disadvantages of Independent Component Analysis (ICA):


 Non-uniqueness: There is no unique solution to the ICA problem, and the
estimated independent components may not match the true sources. This can
lead to suboptimal results or incorrect interpretations.
 Non-deterministic: Some ICA algorithms are non-deterministic, meaning that
they can produce different results each time they are run on the same data.
 Limitations with Gaussian sources: If the source signals are Gaussian, then
ICA may not perform well, and other methods such as PCA or linear regression
may be more appropriate.

Difference between PCA and ICA:

Principal Component Analysis | Independent Component Analysis
It reduces the dimensions to avoid the problem of overfitting. | It decomposes the mixed signal into its independent sources' signals.
It deals with the Principal Components. | It deals with the Independent Components.
It focuses on maximizing the variance. | It doesn't focus on the issue of variance among the data points.
It focuses on the mutual orthogonality property of the principal components. | It doesn't focus on the mutual orthogonality of the components.
It doesn't focus on the mutual independence of the components. | It focuses on the mutual independence of the components.
