U20cs604 Machine Learning Unit III
Clustering
Example: Let's understand the clustering technique with the real-world example of a
shopping mall. When we visit a mall, we can observe that items with similar usage are
grouped together: t-shirts are grouped in one section, trousers in another; similarly, in
the vegetable section, apples, bananas, mangoes, etc. are kept in separate groups, so
that we can easily find what we need. The clustering technique works in the same way.
Another example of clustering is grouping documents according to their topic.
The clustering technique can be widely used in various tasks. Some of the most
common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system
to provide recommendations based on a user's past product searches. Netflix also uses
this technique to recommend movies and web series to its users based on their watch
history.
The diagram below illustrates the working of the clustering algorithm: different fruits
are divided into several groups with similar properties.
Types of Clustering Methods
The main clustering methods are:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
1. Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the number
of pre-defined groups. The cluster centres are chosen in such a way that the distance
between the data points of one cluster and their centroid is minimum compared to the
distance to other cluster centroids.
2. Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and
arbitrarily shaped clusters are formed as long as the dense regions can be connected.
The algorithm does this by identifying the dense regions in the data space and
connecting them into clusters; the dense areas are separated from each other by
sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has
varying densities and a high number of dimensions.
3. Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability of how a data point belongs to a particular distribution. The grouping is
done by assuming some distribution, most commonly the Gaussian distribution.
4. Hierarchical Clustering
Hierarchical clustering builds a tree-like structure of clusters (a dendrogram), so the
number of clusters does not have to be specified in advance; the tree can be cut at any
level to obtain the required grouping.
5. Fuzzy Clustering
Fuzzy clustering is a type of soft clustering method in which a data object may belong
to more than one group or cluster. Each data point has a set of membership
coefficients, which indicate its degree of membership in each cluster. The Fuzzy
C-means algorithm is the typical example of this type of clustering; it is sometimes
also known as the Fuzzy K-means algorithm.
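To make the membership idea concrete, the following is a minimal from-scratch Python sketch of the Fuzzy C-means update rules (the fuzzifier m = 2, the synthetic two-blob data, and all variable names are illustrative assumptions rather than part of the course material):

    # Minimal from-scratch sketch of Fuzzy C-means; illustrative only.
    import numpy as np

    def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        # Random membership matrix U (n x c) whose rows sum to 1
        U = rng.random((n, c))
        U /= U.sum(axis=1, keepdims=True)
        for _ in range(n_iter):
            # Update cluster centres as membership-weighted means
            W = U ** m
            centres = (W.T @ X) / W.sum(axis=0)[:, None]
            # Distance from every point to every centre
            d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-10
            # Update memberships: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
            U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
        return centres, U

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5])
    centres, U = fuzzy_c_means(X, c=2)
    print(centres)         # learned cluster centres
    print(U[:5].round(2))  # membership coefficients of the first five points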
Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. Many
types of clustering algorithms have been published, but only a few are commonly
used. The choice of algorithm depends on the kind of data we are using: some
algorithms require the number of clusters to be specified in advance, whereas others
work by finding the minimum distance between the observations of the dataset.
Applications of Clustering
Below are some commonly known applications of the clustering technique in
Machine Learning:
o In Identification of Cancer Cells: Clustering algorithms are widely used for the
identification of cancerous cells. They divide the cancerous and non-cancerous
data points into different groups.
o In Search Engines: Search engines also work on the clustering technique. The
search results appear based on the objects closest to the search query. This is done
by grouping similar data objects into one group that is far away from the dissimilar
objects. The accuracy of the results depends on the quality of the clustering
algorithm used.
o Customer Segmentation: It is used in market research to segment customers
based on their choices and preferences.
o In Biology: It is used in the biology stream to classify different species of plants
and animals using image recognition techniques.
o In Land Use: The clustering technique is used to identify areas of similar land use
in a GIS database. This is very useful for finding the purpose for which a particular
piece of land is most suitable.
K-Means Clustering
K-means is an iterative algorithm that divides the unlabeled dataset into k different
clusters in such a way that each data point belongs to only one group with similar
properties.
It allows us to cluster the data into different groups and is a convenient way to
discover the categories of groups in the unlabeled dataset on its own, without the need
for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data
points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides it into k clusters, and
repeats the process until it finds the best clusters. The value of k should be
predetermined in this algorithm.
Hence each cluster has data points with some commonalities and is away from the
other clusters.
The diagram below explains the working of the K-means clustering algorithm.
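As a concrete illustration, the following is a minimal Python sketch using scikit-learn's KMeans (assumed to be available); the synthetic blob dataset and the choice k = 3 are purely illustrative:

    # A minimal sketch of K-means with scikit-learn on synthetic data.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Generate an unlabeled dataset with three natural groupings
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # k must be chosen in advance, as noted above
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)

    print(kmeans.cluster_centers_)  # the learned centroids
    print(labels[:10])              # cluster assignment of the first ten points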
EM Algorithm (Expectation-Maximization)
A latent variable model consists of both observable and unobservable variables, where
the observable variables can be predicted, while the unobserved ones are inferred from
the observed variables. These unobservable variables are known as latent variables.
The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing values of the latent variables (the E-step) and then use
that completed data to update the values of the parameters (the M-step).
Steps in EM Algorithm
o 1st Step: The very first step is to initialize the parameter values. The system is
provided with incomplete observed data, with the assumption that the data is
obtained from a specific model.
o 2nd Step (Expectation, E-step): The observed data and the current parameter
values are used to estimate (expect) the values of the missing or latent variables.
o 3rd Step (Maximization, M-step): The complete data generated in the E-step is
used to update the parameter values, typically by maximizing the likelihood.
o 4th Step: The last step is to check whether the values are converging. If yes, stop
the process; otherwise, repeat from step 2 until convergence occurs.
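The following toy Python sketch walks through these four steps for a one-dimensional mixture of two Gaussians; the synthetic data, the fixed 50 iterations, and the variable names are illustrative assumptions:

    # Toy from-scratch sketch of the four EM steps for a 1-D two-Gaussian mixture.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

    # Step 1: initialize the parameters (means, variances, mixing weights)
    mu, var, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

    def gauss(x, m, v):
        return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    for _ in range(50):
        # Step 2 (E-step): estimate responsibilities of each latent component
        r = pi * gauss(x[:, None], mu, var)          # shape (n, 2)
        r /= r.sum(axis=1, keepdims=True)

        # Step 3 (M-step): update parameters by maximum likelihood
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)

        # Step 4: in practice, check the change in log-likelihood for convergence
    print(mu, var, pi)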
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent
variables through the observed data in the dataset. The EM algorithm, or latent
variable model, has a broad range of real-life applications in machine learning, for
example estimating the parameters of Gaussian Mixture Models (discussed below),
estimating the parameters of Hidden Markov Models, and handling datasets with
missing values.
Advantages of EM algorithm
o The two basic steps of the EM algorithm, the E-step and the M-step, are very easy
to implement for many machine learning problems.
o The likelihood is guaranteed to increase (or at least not decrease) after each
iteration.
o It often generates a solution for the M-step in the closed form.
Disadvantages of EM algorithm
o The convergence of the EM algorithm is very slow.
o It may converge only to a local optimum.
o It takes both forward and backward probabilities into consideration, unlike
numerical optimization, which considers only the forward probabilities.
Gaussian Mixture Model (GMM)
A Gaussian Mixture Model requires estimated statistics such as the mean and standard
deviation, i.e., its parameters. These parameters of the probability distributions are
estimated to best fit the density of a given training dataset. Although there are plenty
of techniques available to estimate the parameters of a Gaussian Mixture Model
(GMM), Maximum Likelihood Estimation is one of the most popular among them.
Let's consider a case where we have a dataset with multiple data points generated by
two different processes. Both processes produce similar Gaussian probability
distributions, and the data is combined; hence it is very difficult to determine which
distribution a given point belongs to.
The process used to generate each data point represents a latent variable, i.e.,
unobservable data. In such cases, the Expectation-Maximization algorithm is one of
the best techniques for estimating the parameters of the Gaussian distributions. In the
EM algorithm, the E-step estimates the expected value of each latent variable, whereas
the M-step optimizes the parameters using Maximum Likelihood Estimation (MLE).
This process is repeated until a good set of latent values and a maximum likelihood
that fits the data are achieved.
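For illustration, the following short Python sketch fits such a two-process dataset with scikit-learn's GaussianMixture, which internally runs the EM procedure described above; the synthetic data and parameter choices are assumptions for the example:

    # Fitting a GMM with EM via scikit-learn on synthetic two-process data.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two hidden processes generating the observed points
    X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1.5, (200, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(X)

    print(gmm.means_)          # estimated means of the two Gaussians
    print(gmm.covariances_)    # estimated covariance matrices
    print(gmm.predict(X[:5]))  # most likely latent component for each point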
In many real-world problems it is often required to reduce the number of features,
which can be done with dimensionality reduction.
Dimensionality Reduction
A dataset may contain a huge number of input features, which makes the predictive
modeling task more complicated. It is very difficult to visualize or make predictions
for a training dataset with a large number of features; for such cases, dimensionality
reduction techniques need to be used.
The dimensionality reduction technique can be defined as "a way of converting a
higher-dimensional dataset into a lower-dimensional dataset while ensuring that it
provides similar information." These techniques are widely used in machine learning
for obtaining a better-fitting predictive model while solving classification and
regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Some benefits of applying the dimensionality reduction technique to the given dataset
are given below:
o By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
o Less computation and training time is required for reduced feature dimensions.
o Reduced dimensions of features of the dataset help in visualizing the data
quickly.
o It removes the redundant features (if present) by taking care of
multicollinearity.
There are also some disadvantages of applying dimensionality reduction, which are
given below:
o Some amount of information may be lost when the dimensions are reduced.
o The reduced features are often harder to interpret than the original ones.
There are two ways to apply the dimensionality reduction technique, which are given
below: feature selection and feature extraction.
Feature Selection
Feature selection is the process of selecting a subset of the relevant features and
leaving out the irrelevant features present in a dataset in order to build a model of high
accuracy. In other words, it is a way of selecting the optimal features from the input
dataset.
1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the relevant
features is taken. Some common techniques of the filter method are listed below (a
brief code sketch follows the list):
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
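The sketch below shows a filter-style selection with scikit-learn's SelectKBest and the chi-square test; the Iris dataset and k = 2 are only illustrative choices:

    # Filter-method sketch: keep the k features with the best chi-square scores.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)

    # Keep the two features with the highest chi-square scores
    selector = SelectKBest(score_func=chi2, k=2)
    X_selected = selector.fit_transform(X, y)

    print(selector.scores_)   # chi-square score of every original feature
    print(X_selected.shape)   # (150, 2): only the selected features remain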
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine
learning model for its evaluation. In this method, some features are fed to the ML
model and its performance is evaluated. The performance decides whether to add or
remove those features so as to increase the accuracy of the model. This method is
more accurate than the filter method but more complex to work with. Some common
techniques of wrapper methods are listed below (a brief code sketch follows the list):
o Forward Selection
o Backward Selection
o Bi-directional Elimination
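The sketch below shows a wrapper-style selection using Recursive Feature Elimination (RFE) wrapped around a logistic regression model; the Iris dataset and the target of two features are illustrative assumptions:

    # Wrapper-method sketch: RFE repeatedly drops the weakest-ranked feature.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    # The wrapped model ranks the features and the weakest is eliminated
    estimator = LogisticRegression(max_iter=1000)
    rfe = RFE(estimator=estimator, n_features_to_select=2)
    rfe.fit(X, y)

    print(rfe.support_)   # boolean mask of the selected features
    print(rfe.ranking_)   # rank 1 marks the features that were kept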
3. Embedded Methods
Embedded methods combine the qualities of the filter and wrapper methods: feature
selection happens as part of the model training process itself, typically through
regularization. Some common techniques of embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
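The sketch below shows an embedded-style selection, where a LASSO model shrinks weak coefficients to zero during training and SelectFromModel keeps the surviving features; the diabetes dataset and alpha = 0.1 are illustrative choices:

    # Embedded-method sketch: LASSO zeroes out weak coefficients while training.
    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    X, y = load_diabetes(return_X_y=True)

    # Features whose LASSO coefficient shrinks to (near) zero are dropped
    lasso = Lasso(alpha=0.1)
    selector = SelectFromModel(lasso)
    X_selected = selector.fit_transform(X, y)

    print(selector.get_support())   # mask of features the LASSO kept
    print(X_selected.shape)         # reduced feature matrix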
Feature Extraction:
Feature extraction transforms the original features into a smaller set of new features.
Common techniques include Factor Analysis, Principal Component Analysis, and
Independent Component Analysis, which are discussed below.
Factor Analysis
Factor analysis is a technique for reducing a huge number of variables into a few
factors; this process is known as factoring the data. It is a statistical approach that
describes fluctuations among observed, correlated variables in terms of a potentially
lower number of unobserved variables called factors.
The factor analysis technique extracts the maximum common variance from all
the variables and puts it into a common score. It is a theory used in training machine
learning models, and so it is quite related to data mining. The belief behind factor
analytic techniques is that the information gained about the interdependencies
between observed variables can be used later to reduce the set of variables in a
dataset.
Factor analysis is a very effective tool for inspecting variable relationships
for complex concepts such as social status, economic status, dietary patterns,
psychological scales, biology, psychometrics, personality theories, marketing, product
management, operations research, finance, etc. It can help a researcher investigate
concepts that are not easily measured directly, in a much easier and quicker way, by
collapsing a large number of variables into a few easily interpretable fundamental
factors.
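As a rough illustration, the following Python sketch uses scikit-learn's FactorAnalysis to collapse six synthetic observed variables into two latent factors; the data-generating process and all names are assumptions for the example:

    # Factor analysis sketch: six observed variables driven by two hidden factors.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    factors = rng.normal(size=(300, 2))
    loadings = rng.normal(size=(2, 6))
    X = factors @ loadings + 0.3 * rng.normal(size=(300, 6))

    fa = FactorAnalysis(n_components=2, random_state=0)
    scores = fa.fit_transform(X)      # factor scores for every observation

    print(fa.components_.shape)       # (2, 6): estimated loadings of each factor
    print(scores[:5].round(2))        # two-factor representation of five rows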
Types of factor analysis:
1. Exploratory Factor Analysis (EFA):
It is used to identify composite inter-relationships among items and to group items
that are part of unified concepts. The analyst does not make any prior assumptions
about the relationships among factors. It is also used to find the fundamental structure
of a huge set of variables, reducing the large data to a much smaller set of summary
variables. It is quite similar to Confirmatory Factor Analysis (CFA).
Similarities are:
o Both evaluate the internal reliability of a measure.
o Both examine the factors represented by sets of items, presuming that the factors
are not correlated.
o Both investigate the grade/quality of each item.
However, there are some common differences, most of which concern how the
factors are used. Basically, EFA is a data-driven approach that allows all items to load
on all the factors, while in CFA you need to specify which factors each item is
required to load on. EFA is a good choice if you have no idea what common factors
might exist; it is able to generate a huge number of possible models for your data,
something that is not possible if a researcher has to specify the factors. If you already
have some idea of what the model looks like and you then want to test your
hypotheses about the data structure, CFA is the better approach.
2. Confirmatory Factor Analysis (CFA):
It is a more complex (composite) approach that tests the theory that the items are
associated with specific factors. Confirmatory Factor Analysis uses a structural
equation model to test a measurement model, whereby the loadings on the factors
allow for the evaluation of relationships between observed variables and unobserved
variables.
As we know, structural equation modelling approaches can accommodate
measurement error easily and are much less restrictive than least-squares estimation,
thus providing more scope to accommodate errors. Hypothesized models are tested
against actual data, and the analysis demonstrates the loadings of the observed
variables on the latent variables (factors), as well as the correlations between the
latent variables.
Confirmatory Factor Analysis allows an analyst and researcher to figure out if a
relationship between a set of observed variables (also known as manifest variables)
and their underlying constructs exists. It is similar to the Exploratory Factor Analysis.
In Confirmatory Factor Analysis, you can define the total number of factors required.
For example, Confirmatory Factor Analysis can answer questions such as "Does my
thousand-question survey accurately measure one specific factor?" Even though it is
technically applicable to any discipline, it is typically used in the social sciences.
3. Multiple Factor Analysis:
This type of factor analysis is used when your variables are structured in variable
groups. For example, you may have a teenager's health questionnaire with several
sections such as sleeping patterns, addictions, psychological health, mobile phone
addiction, or learning disabilities.
The Multiple Factor Analysis is performed in two steps:
o First, a Principal Component Analysis is performed on each section of the data.
This gives a useful eigenvalue, which is used to normalize the data sets for further
use.
o The newly formed data sets are then merged into a single matrix and a global PCA
is performed.
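A rough Python sketch of these two steps is given below, using scikit-learn's PCA; the way the data is split into groups and the eigenvalue-based weighting are simplified assumptions:

    # Simplified two-step Multiple Factor Analysis sketch with scikit-learn PCA.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Two groups of variables measured on the same 100 individuals
    group1 = rng.normal(size=(100, 4))
    group2 = rng.normal(size=(100, 6))

    def normalize_group(G):
        # Step 1: PCA on each group; divide by the square root of its first
        # eigenvalue so that no single group dominates the global analysis
        first_eigenvalue = PCA(n_components=1).fit(G).explained_variance_[0]
        return G / np.sqrt(first_eigenvalue)

    # Step 2: merge the normalized groups and run a global PCA
    merged = np.hstack([normalize_group(group1), normalize_group(group2)])
    global_pca = PCA(n_components=2).fit(merged)

    print(global_pca.explained_variance_ratio_)  # variance captured globally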
Eigenvalues
When factor analysis generates the factors, each factor has an associated eigenvalue,
which gives the total variance explained by that factor.
Usually, the factors having eigenvalues greater than 1 are considered useful.
Factor Loadings
Not all factors are created equal: some factors carry more weight than others. As a
simple example, imagine a car company such as Maruti Suzuki conducting a
customer-satisfaction survey (using telephonic surveys, physical surveys, Google
Forms, etc.), and the results show the following factor loadings:
VARIABLE  | F1    | F2    | F3
Problem 1 | 0.985 | 0.111 | -0.032
Problem 2 | 0.724 | 0.008 | 0.167
Problem 3 | 0.798 | 0.180 | 0.345
Here –
F1 – Factor 1
F2 – Factor 2
F3 – Factor 3
The factor that affects a question the most is the one with the highest factor loading
(here F1 for all three problems). Factor loadings are similar to correlation coefficients
in that they can vary from -1 to 1; the closer a loading is to -1 or 1, the more strongly
that factor affects the variable.
Note: A factor loading of 0 indicates no effect.
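The short numpy sketch below illustrates how eigenvalues of the correlation matrix are combined with the "greater than 1" rule of thumb; the fake survey responses are purely illustrative:

    # Kaiser-criterion sketch: keep factors whose correlation-matrix eigenvalue > 1.
    import numpy as np

    rng = np.random.default_rng(1)
    # Fake responses: 200 customers answering 6 survey questions,
    # driven by two hidden factors plus noise
    hidden = rng.normal(size=(200, 2))
    X = hidden @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(200, 6))

    # Eigenvalues of the correlation matrix give the variance explained per factor
    corr = np.corrcoef(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # sorted, largest first

    print(eigenvalues.round(2))
    print("Factors to keep:", int((eigenvalues > 1).sum()))  # Kaiser criterion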
PRINCIPAL COMPONENT ANALYSIS:
Principal Component Analysis (PCA) is an unsupervised learning algorithm that is
used for dimensionality reduction in machine learning. It is a statistical process that
converts the observations of correlated features into a set of linearly uncorrelated
features with the help of an orthogonal transformation. These new transformed
features are called the Principal Components. It is one of the popular tools used for
exploratory data analysis and predictive modeling. It is a technique for drawing out
the strong patterns in a dataset by reducing its dimensionality while retaining as much
of the variance as possible.
PCA generally tries to find a lower-dimensional surface onto which to project the
high-dimensional data.
PCA works by considering the variance of each attribute, because an attribute with
high variance provides a good separation between the classes; this is how it decides
which dimensions to keep when reducing the dimensionality.
Some real-world applications of PCA are image processing, movie recommendation
systems, and optimizing the power allocation in various communication channels. It
is a feature extraction technique, so it keeps the important variables and drops the
least important ones.
The PCA algorithm is based on some mathematical concepts such as:
o Variance and Covariance
o Eigenvalues and Eigenvectors
Some common terms used in the PCA algorithm are:
Dimensionality: It is the number of features or variables present in the given dataset.
More simply, it is the number of columns present in the dataset.
Correlation: It signifies how strongly two variables are related to each other, such
that if one changes, the other variable also changes. The correlation value
ranges from -1 to +1. Here, -1 occurs if variables are inversely proportional to each
other, and +1 indicates that variables are directly proportional to each other.
Orthogonal: It means that the variables are not correlated with each other, and hence
the correlation between a pair of variables is zero.
Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector
of M if Mv is a scalar multiple of v.
Covariance Matrix: A matrix containing the covariance between each pair of
variables is called the covariance matrix.
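The tiny numpy snippet below illustrates these terms: it builds a covariance matrix from data and checks that an eigenvector v of a matrix M satisfies Mv = λv; the random data is illustrative:

    # Covariance matrix and eigenvector check: M @ v == lambda * v.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))          # 100 samples, 3 features

    M = np.cov(X, rowvar=False)            # 3x3 covariance matrix of the features
    eigenvalues, eigenvectors = np.linalg.eigh(M)

    v = eigenvectors[:, -1]                # eigenvector of the largest eigenvalue
    print(np.allclose(M @ v, eigenvalues[-1] * v))  # True: M v = lambda v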
As described above, the transformed new features, i.e., the output of PCA, are the
Principal Components. The number of these PCs is either equal to or less than the
number of original features present in the dataset. Some properties of these principal
components are given below:
o Each principal component is a linear combination of the original features.
o The principal components are orthogonal, i.e., the correlation between any pair of
them is zero.
o The importance of the components decreases from the first to the last, so the first
principal component captures the most variance.
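The short sketch below applies scikit-learn's PCA to reduce four standardized features to two principal components; the Iris dataset is only an illustrative stand-in:

    # PCA sketch: standardize the features, then keep two principal components.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_iris(return_X_y=True)
    X_std = StandardScaler().fit_transform(X)     # standardize before PCA

    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_std)

    print(pca.explained_variance_ratio_)  # share of variance kept by each PC
    print(X_pca[:5].round(3))             # the first five samples in PC space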
PROBABILISTIC PCA:
Probabilistic Principal Component Analysis (probabilistic PCA) is a dimensionality
reduction technique that analyzes data via a lower-dimensional latent space. It is often
used when there are missing values in the data or for multidimensional scaling.
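As a hedged sketch, scikit-learn's PCA exposes this probabilistic view: after fitting, it reports the estimated noise variance and per-sample log-likelihood under the latent-variable (probabilistic PCA) model; the synthetic latent-space data below is an illustrative assumption:

    # Probabilistic PCA view: noise variance and log-likelihood of samples.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Data generated from a 2-dimensional latent space embedded in 10 dimensions
    Z = rng.normal(size=(500, 2))
    X = Z @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))

    ppca = PCA(n_components=2)
    ppca.fit(X)

    print(ppca.noise_variance_)       # estimated isotropic noise of the model
    print(ppca.score_samples(X[:5]))  # per-sample log-likelihood under PPCA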
INDEPENDENT COMPONENT ANALYSIS:
Consider a party going on in a room full of people. There are 'n' speakers in the room,
and they are speaking simultaneously. In the same room, there are also 'n'
microphones placed at different distances from the speakers, each recording the 'n'
speakers' voice signals. Hence, the number of speakers is equal to the number of
microphones in the room. Now, using these microphones' recordings, we want to
separate all the 'n' speakers' voice signals, given that each microphone recorded the
voice signals coming from each speaker at a different intensity due to the differences
in distance between them. Decomposing the mixed signal of each microphone's
recording into independent source signals can be done using the machine learning
technique Independent Component Analysis (ICA):
[ X1, X2, ..., Xn ] => [ Y1, Y2, ..., Yn ]
where X1, X2, ..., Xn are the original signals present in the mixed signal and
Y1, Y2, ..., Yn are the new features, i.e., the independent components, which are
independent of each other.
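The compact sketch below reproduces this cocktail-party setup with scikit-learn's FastICA; the two synthetic "speaker" signals and the mixing matrix are illustrative assumptions:

    # Cocktail-party sketch: unmix two synthetic sources with FastICA.
    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 8, 2000)
    s1 = np.sin(2 * t)                        # speaker 1
    s2 = np.sign(np.sin(3 * t))               # speaker 2
    S = np.c_[s1, s2] + 0.05 * np.random.RandomState(0).randn(2000, 2)

    # Each microphone records a different mixture of the two speakers
    A = np.array([[1.0, 0.5], [0.5, 2.0]])    # mixing matrix
    X = S @ A.T                               # observed microphone signals

    ica = FastICA(n_components=2, random_state=0)
    Y = ica.fit_transform(X)                  # recovered independent components

    print(Y.shape)                            # (2000, 2): one column per speaker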
Restrictions on ICA:
o The independent components are assumed to be statistically independent of each
other.
o The independent components must have non-Gaussian distributions.
Difference between PCA and ICA:
o Principal Component Analysis reduces the dimensions to avoid the problem of
overfitting.
o Independent Component Analysis decomposes the mixed signal into its
independent sources' signals.