
Department of Electronics & Computer Engineering Prof. S. K. Choudhary

Unit 3: Data Analysis in Depth


Contents:
Data Analysis Theory and Methods: Clustering – overview; K-means – overview of method, determining number of clusters; Association Rules – overview of method, Apriori algorithm, evaluation of association rules; Regression – overview of linear regression method, model description; Classification – overview, Naïve Bayes classifier.

Clustering – Overview:
The task of grouping data points based on their similarity to each other is called Clustering or Cluster Analysis. This method falls under the branch of Unsupervised Learning, which aims at gaining insights from unlabelled data points; that is, unlike supervised learning, there is no target variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity using a metric such as Euclidean distance, Cosine similarity, or Manhattan distance, and then groups together the points with the highest similarity scores.
For example, in a scatter plot of such data we may clearly see three roughly circular clusters forming on the basis of distance.

It is not necessary that the clusters formed are circular in shape; the shape of clusters can be arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters.
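As a quick illustration of the distance metrics mentioned above, the short NumPy sketch below computes the Euclidean distance, Manhattan distance, and cosine similarity between two hypothetical points a and b (the points are made up purely for illustration):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean distance: straight-line distance between the two points
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))

# Cosine similarity: 1 means same direction, 0 means orthogonal
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine_sim)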


Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group similar data
points:
1. Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or not at all. For example, let's say there are 4 data points and we have to cluster them into 2 clusters. Each data point will then belong either to cluster 1 or to cluster 2.
Data Points Clusters
A C1
B C2
C C2
D C1

2. Soft Clustering: In this type of clustering, instead of assigning each data point to a single cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example, let's say there are 4 data points and we have to cluster them into 2 clusters. We then evaluate, for every data point, the probability of it belonging to each of the two clusters.
Data Points Probability of C1 Probability of C2
A 0.91 0.09
B 0.3 0.7
C 0.17 0.83
D 1 0
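Such cluster-membership probabilities can be obtained in practice with a Gaussian Mixture Model. The sketch below is a minimal illustration using scikit-learn's GaussianMixture on a small made-up dataset (the data and parameter values are assumptions, not part of these notes); predict_proba returns, for each point, one probability per cluster, analogous to the table above.

import numpy as np
from sklearn.mixture import GaussianMixture

# Six illustrative 2-D points forming two loose groups (hypothetical data)
X = np.array([[1.0, 1.2], [0.9, 0.8], [1.3, 1.1],
              [8.0, 7.5], [7.8, 8.1], [8.4, 7.9]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Each row gives the probability of that point belonging to each of the two clusters
print(gm.predict_proba(X))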

Application of Clustering:
1. Market Segmentation – Businesses use clustering to group their customers and use
targeted advertisements to attract more audience.
2. Market Basket Analysis – Shop owners analyze their sales to figure out which items are most often bought together by customers. For example, according to a frequently cited study in the USA, diapers and beer were often bought together by fathers.
3. Social Network Analysis – Social media sites use your data to understand your browsing
behaviour and provide you with targeted friend recommendations or content
recommendations.


4. Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic
images like X-rays.
5. Anomaly Detection – Clustering can be used to find outliers in a real-time data stream or to flag potentially fraudulent transactions.

K-means – Overview of Method:


Unsupervised machine learning is the process of teaching a computer to use unlabeled, unclassified data and enabling the algorithm to operate on that data without supervision. Without any prior training, the machine's job in this case is to organize unsorted data according to similarities, patterns, and variations.
K-means clustering assigns data points to one of K clusters depending on their distance from the centers of the clusters. It starts by placing the cluster centroids randomly in the space. Each data point is then assigned to the cluster whose centroid is closest to it. After all points have been assigned, new cluster centroids are computed. This process runs iteratively until good clusters are found. In this analysis we assume that the number of clusters is given in advance and we have to put each point into one of the groups.
In some cases K is not clearly defined, and we have to think about the optimal value of K. K-means performs best when the data are well separated; when data points overlap, this clustering is not suitable. K-means is faster than many other clustering techniques and provides strong coupling between the data points within a cluster. However, K-means does not provide clear information regarding the quality of the clusters, different initial assignments of the cluster centroids may lead to different clusters, the algorithm is sensitive to noise, and it may get stuck in local minima.

How Does K-Means Clustering Work?


1. Initialization: Start by randomly selecting K points from the dataset. These points
will act as the initial cluster centroids.
2. Assignment: For each data point in the dataset, calculate the distance between that
point and each of the K centroids. Assign the data point to the cluster whose centroid
is closest to it. This step effectively forms K clusters.
3. Update centroids: Once all data points have been assigned to clusters, recalculate
the centroids of the clusters by taking the mean of all data points assigned to each
cluster.


4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the
centroids no longer change significantly or when a specified number of iterations is
reached.
5. Final Result: Once convergence is achieved, the algorithm outputs the final cluster
centroids and the assignment of each data point to a cluster.
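The steps above map almost line for line onto a minimal NumPy implementation. The sketch below is an illustrative from-scratch version (the random data, function name, and parameters are assumptions, not part of the original notes); in practice one would normally use sklearn.cluster.KMeans, as shown later in this unit.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random points from the data as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: each point goes to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until convergence (centroids stop moving)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 5. Final result: centroids and the cluster assignment of every point
    return centroids, labels

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)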

Objective of K means Clustering:


The main objective of k-means clustering is to partition the data into a specific number (k) of groups, where data points within each group are similar to each other and dissimilar to points in other groups. It achieves this by minimizing the distance between data points and their assigned cluster's center, called the centroid.
The key objectives are:
 Grouping similar data points: K-means aims to identify patterns in your data by
grouping data points that share similar characteristics together. This allows you to
discover underlying structures within the data.
 Minimizing within-cluster distance: The algorithm strives to make sure data
points within a cluster are as close as possible to each other, as measured by a
distance metric (usually Euclidean distance). This ensures tight-knit clusters with
high cohesiveness.
 Maximizing between-cluster distance: Conversely, k-means also tries to
maximize the separation between clusters. Ideally, data points from different
clusters should be far apart, making the clusters distinct from each other.

Determining number of clusters


K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that need to
be created in the process, as if K=2, there will be two clusters, and for K=3, there will be
three clusters, and so on.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training. It is
a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.


The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
1. Determines the best value for K center points or centroids by an iterative process.
2. Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster contains data points with some commonalities and is distinct from the other clusters. The diagram below illustrates the working of the K-means clustering algorithm:

Figure: K-Means Clustering Algorithm

How does the K-Means Algorithm Work?


Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid; this forms the predefined K clusters.
Step-4: Place a new centroid in each cluster by computing the mean of the points assigned to it.
Step-5: Repeat the third step, which means reassign each datapoint to the new closest
centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.


Example 1: Cluster the following eight points (with (x, y) representing locations) into three
clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
Solution:
Iteration-01:
We calculate the distance of each point from each of the center of the three clusters. The
distance is calculated by using the given distance function.
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (5, 8) of Cluster-02 | Distance from center (1, 2) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 5 | 9 | C1
A2(2, 5) | 5 | 6 | 4 | C3
A3(8, 4) | 12 | 7 | 9 | C2
A4(5, 8) | 5 | 0 | 10 | C2
A5(7, 5) | 10 | 5 | 9 | C2
A6(6, 4) | 10 | 5 | 7 | C2
A7(1, 2) | 9 | 10 | 0 | C3
A8(4, 9) | 3 | 2 | 10 | C2

From here, New clusters are-


Cluster-01: First cluster contains points- A1(2, 10)
Cluster-02: Second cluster contains points- A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
Cluster-03: Third cluster contains points- A2(2, 5), A7(1, 2)
Now, we re-compute the new cluster centers.
The new cluster center is computed by taking mean of all the points contained in that cluster.
For Cluster-01: We have only one point A1(2, 10) in Cluster-01. So, cluster center remains
the same.
For Cluster-02: Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
For Cluster-03: Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
This is completion of Iteration-01.
Iteration-02:
We calculate the distance of each point from each of the center of the three clusters. The
distance is calculated by using the given distance function.


Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (6, 6) of Cluster-02 | Distance from center (1.5, 3.5) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 8 | 7 | C1
A2(2, 5) | 5 | 5 | 2 | C3
A3(8, 4) | 12 | 4 | 7 | C2
A4(5, 8) | 5 | 3 | 8 | C2
A5(7, 5) | 10 | 2 | 7 | C2
A6(6, 4) | 10 | 2 | 5 | C2
A7(1, 2) | 9 | 9 | 2 | C3
A8(4, 9) | 3 | 5 | 8 | C1

From here, New clusters are-


Cluster-01: First cluster contains points- A1(2, 10), A8(4, 9)
Cluster-02: Second cluster contains points- A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4),
Cluster-03: Third cluster contains points-A2(2, 5), A7(1, 2)
Now, we re-compute the new cluster centers. The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01: Center of Cluster-01 = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
For Cluster-02: Center of Cluster-02 = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
For Cluster-03: Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
This is completion of Iteration-02.
After second iteration, the centers of the three clusters are-
 C1(3, 9.5)
 C2(6.5, 5.25)
 C3(1.5, 3.5)
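The hand computation above can be checked with a few lines of NumPy. The sketch below simply re-runs the two assignment/update iterations with the same Manhattan distance function; it is a verification aid, not part of the original solution.

import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)  # initial centers A1, A4, A7

for it in range(2):
    # Manhattan distance of every point to every current center
    d = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = d.argmin(axis=1)
    # New center = mean of the points assigned to that cluster
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print("After iteration", it + 1, ":", centers)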

How to choose the value of "K number of clusters" in K-means Clustering?


The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, but choosing the optimal number of clusters is a big task. There are several ways to find the optimal number of clusters; the most widely used method for finding the value of K is given below:


Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:

WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²

In the above formula of WCSS,


Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point in Cluster1 and its centroid C1; the other two terms are defined analogously.

To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
 It executes the K-means clustering on a given dataset for different K values (ranges
from 1-10).
 For each value of K, calculates the WCSS value.
 Plots a curve between calculated WCSS values and the number of clusters K.
 The sharp point of bend in the plot (where the curve looks like an arm) is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, this approach is known as the elbow method. The typical elbow-method graph plots WCSS against the number of clusters K and flattens out after the optimal K.


Note: The maximum possible number of clusters will be equal to the number of
observations in the dataset.

Python Implementation of K-means Clustering Algorithm:


We have a dataset of Mall_Customers, which is the data of customers who visit the mall and
spend there.
In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value indicating how much a customer has spent in the mall; the higher the value, the more they have spent). From this dataset we need to find some patterns; since this is an unsupervised method, we do not know exactly what to look for in advance.
The steps to be followed for the implementation are given below:
o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters

Step-1: Data pre-processing Step


The first step will be the data pre-processing, as we did in our earlier topics of Regression and
Classification. But for the clustering problem, it will be different from other models. Let's
discuss it:
 Importing Libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
 Importing the Dataset:
Next, we will import the dataset that we need to use. Here we are using the Mall_Customers_data.csv dataset. It can be imported using the code below:
dataset = pd.read_csv('Mall_Customers_data.csv')
By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset
looks like the below image:


From the above dataset, we need to find some patterns in it.


 Extracting Independent Variables
Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea what to determine. So we will just add a line of code for the matrix of features.
x = dataset.iloc[:, [3, 4]].values
As we can see, we are extracting only the features at column indices 3 and 4 (Annual Income and Spending Score). This is because we need a 2-D plot to visualize the clusters, and some features, such as customer_id, are not required.
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering
problem. So, as discussed above, here we are going to use the elbow method for this purpose.
As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS
values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate
the value for WCSS for different k values ranging from 1 to 10. Below is the code for it:


#finding optimal number of clusters using the elbow method
from sklearn.cluster import KMeans

wcss_list = []  # Initializing the list for the values of WCSS

# Using a for loop for iterations from 1 to 10.
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()

As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
Next, we have created the wcss_list variable as an empty list, which is used to hold the WCSS value computed for each value of k ranging from 1 to 10.
After that, we have written a for loop iterating over the different values of k from 1 to 10; since a for loop in Python excludes the upper bound, it is written as range(1, 11) to include the 10th value.
The rest of the code is similar to what we did in earlier topics: we have fitted the model on the matrix of features and then plotted the graph of WCSS against the number of clusters.
Output: After executing the above code, we will get the below output:


Step- 3: Training the K-means algorithm on the training dataset


As we have obtained the number of clusters, we can now train the model on the dataset.
To train the model, we will use the same two lines of code as in the section above, but instead of using i we will use 5, as we know there are 5 clusters to be formed. The code is given below:

kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)


y_predict= kmeans.fit_predict(x)

The first line is the same as above, creating an object of the KMeans class.
In the second line of code, we have created the variable y_predict, which stores the cluster label predicted for each data point.
By executing the above lines of code, we will get the y_predict variable. We can check it
under the variable explorer option in the Spyder IDE. We can now compare the values of
y_predict with our original dataset. Consider the below image:


From the above image, we can now see that CustomerID 1 belongs to cluster 3 (as the index starts from 0, the label 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.
Step-4: Visualizing the Clusters
The last step is to visualize the clusters. As we have 5 clusters for our model, so we will
visualize each cluster one by one.
To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib.

#visulaizing the clusters


mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
#for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
#for second cluster
mtp.scatter(x[y_predict== 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
#for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
#for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
#for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow',
label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

In the above lines of code, we have written one scatter call for each of the 5 clusters. The first argument of mtp.scatter, i.e., x[y_predict == 0, 0], selects the x-values (first feature) of the points whose predicted label is 0, and x[y_predict == 0, 1] selects the corresponding y-values; the labels in y_predict range from 0 to 4.


Output:

The output image clearly shows the five different clusters in different colors. The clusters are formed between two parameters of the dataset: Annual Income and Spending Score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:
 Cluster1 shows the customers with average salary and average spending, so we can categorize these customers as standard.
 Cluster2 shows the customer has a high income but low spending, so we can categorize
them as careful.
 Cluster3 shows the low income and also low spending so they can be categorized as
sensible.
 Cluster4 shows the customers with low income with very high spending so they can be
categorized as careless.
 Cluster5 shows the customers with high income and high spending so they can be
categorized as target, and these customers can be the most profitable customers for the
mall owner.
Elbow Method Drawbacks:
1. Subjectivity: The choice of the "elbow point" can be subjective and might vary between individuals analyzing the same data.
2. Non-Gaussian Data: It assumes that clusters are spherical and equally sized, which may
not hold for complex datasets with irregularly shaped or differently sized clusters.
3. Sensitivity to Initialization: K-means itself is sensitive to initial cluster centroids, which
can affect the WCSS values and, consequently, the choice of the optimal K.


4. Inefficient for Large Datasets: For large datasets, calculating WCSS for a range of K
values can be computationally expensive and time-consuming.
5. Unsuitable for All Distributions: The elbow method is not suitable for all data
distributions, especially when clusters have varying densities or are non-convex.
6. Limited to K-means: It specifically applies to K-means clustering and may not be
suitable for other clustering algorithms with different objectives.

Applications of the Elbow Method:


1. Customer Segmentation: Businesses often use the Elbow Method to determine the
optimal number of customer segments for personalized marketing and product
recommendations. It helps identify distinct customer groups based on their behaviors,
preferences, and demographics.
2. Image Compression: In image processing, the Elbow Method can be applied to find the
optimal number of colors or features to represent an image efficiently. This is crucial for
reducing the storage space or bandwidth required for image data.
3. Outliers Detection: Anomaly detection is the identification of rare or unusual data
points. By clustering data into different groups and observing clusters with significantly
fewer data points, the Elbow Method can assist in identifying anomalies or outliers more
effectively.
4. Recommender Systems: Recommendation engines, like those used by e-commerce
platforms or content streaming services, can benefit from the Elbow Method to find the
right number of user or item clusters. This, in turn, can improve the accuracy and
relevance of recommendations.
5. Genomic Data Analysis: In genomics, researchers use the Elbow Method to identify
clusters of genes or proteins for tasks like disease classification, gene expression
analysis, or identifying functionally related genes.

Evaluation metrics for clustering:


1. Inertia:
Inertia, also known as within-cluster sum of squares, measures the sum of squared distances
between each data point and its assigned cluster center.
Lower inertia indicates better clustering, as it implies that data points within each cluster are
closer to their cluster center. However, inertia alone does not consider the number of clusters
or the distribution of data points.


• Inter-cluster distance d(a, b) between two clusters a and b can be –


i. Single linkage distance: Closest distance between two objects belonging
to a and b respectively.
ii. Complete linkage distance: Distance between two most remote objects belonging
to a and b respectively.
iii. Average linkage distance: Average distance between all the objects belonging
to a and b respectively.
iv. Centroid linkage distance: Distance between the centroid of the two
clusters a and b respectively.

Intra-cluster distance D(a) of a cluster a can be –


a) Complete diameter linkage distance: Distance between two farthest objects
belonging to cluster a.
b) Average diameter linkage distance: Average distance between all the objects
belonging to cluster a.
c) Centroid diameter linkage distance: Twice the average distance between all the
objects and the centroid of the cluster a.

2. Dunn Index:
The Dunn index (DI), introduced by J. C. Dunn in 1974, is a metric for evaluating clustering algorithms. It is an internal evaluation scheme, where the result is based on the clustered data itself. Like all other such indices, the aim of the Dunn index is to identify sets
of clusters that are compact, with a small variance between members of the cluster, and
well separated, where the means of different clusters are sufficiently far apart, as
compared to the within cluster variance.
The higher the Dunn index value, the better the clustering. The number of clusters that maximizes the Dunn index is taken as the optimal number of clusters k. It also has some drawbacks: as the number of clusters and the dimensionality of the data increase, the computational cost also increases.
The Dunn index for c clusters is defined as:
DI = [ min over i ≠ j of dist(ci, cj) ] / [ max over l of diam(cl) ]
where dist(ci, cj) is the distance between clusters ci and cj, and diam(cl) is the diameter of cluster cl.

3. DB index :
The Davies–Bouldin index (DBI) (introduced by David L. Davies and Donald W.
Bouldin in 1979), a metric for evaluating clustering algorithms, is an internal evaluation
scheme, where the validation of how well the clustering has been done is made using
quantities and features inherent to the dataset.
The lower the DB index value, the better the clustering. It also has a drawback: a good value reported by this method does not imply the best information retrieval.
The DB index for k clusters is defined as:
DB = (1/k) Σ(i = 1 to k) max over j ≠ i of ( σi + σj ) / d(ci, cj)
where σi is the average distance of all points in cluster i from the cluster centroid ci, and d(ci, cj) is the distance between the centroids of clusters i and j.

4. Silhouette Score:
Silhouette analysis refers to a method of interpretation and validation of consistency
within clusters of data. The silhouette value is a measure of how similar an object is to its
own cluster (cohesion) compared to other clusters (separation). It can be used to study the
separation distance between the resulting clusters.
The silhouette plot displays a measure of how close each point in one cluster is to points
in the neighboring clusters and thus provides a way to assess parameters like number of
clusters visually. If the Silhouette index value is high, the object is well-matched to its
own cluster and poorly matched to neighbouring clusters. The Silhouette Coefficient is
calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance
(b) for each sample.
The Silhouette Coefficient is defined as –
S(i) = ( b(i) – a(i) ) / max{ a(i), b(i) }

where,
a(i) is the average dissimilarity of ith object to all other objects in the same cluster
b(i) is the average dissimilarity of ith object with all objects in the closest cluster.


Range of Silhouette Value –


S(i) lies in the range [-1, 1]. If the silhouette value is close to 1, the sample is well clustered and has been assigned to an appropriate cluster. If the silhouette value is close to 0, the sample could just as well be assigned to the neighbouring cluster; it lies roughly equally far from both clusters, which indicates overlapping clusters. If the silhouette value is close to –1, the sample has been misclassified and is merely placed somewhere in between the clusters.
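Several of the metrics discussed above are available in scikit-learn. The sketch below computes the inertia (WCSS), the Davies–Bouldin index, and the silhouette score for an illustrative K-means run on made-up data (the toy dataset and parameter values are assumptions).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Two well-separated blobs of toy data
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
labels = km.labels_

print("Inertia (WCSS):", km.inertia_)
print("Davies-Bouldin index (lower is better):", davies_bouldin_score(X, labels))
print("Silhouette score (closer to 1 is better):", silhouette_score(X, labels))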

Association Rules – Overview of Method:


Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly so that the relationship can be used profitably. It tries to find interesting relations or associations among the variables of a dataset, based on different rules for discovering interesting relations between variables in the database. Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc.
Association analysis is useful for discovering interesting relationships hidden in large data
sets. The uncovered relationships can be represented in the form of association rules or sets of
frequent items. Association rule mining is a procedure which is meant to find frequent
patterns, correlations, associations, or causal structures from data sets found in various kinds
of databases such as relational databases, transactional databases, and other forms of data
repositories.
Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
How does Association Rule Learning work?
Association rule learning works on the concept of if-then statements, such as: if A, then B.
Here the "if" element is called the antecedent, and the "then" part is called the consequent. Relationships in which we can find an association between two items in this way are known as single cardinality.


It is all about creating rules, and if the number of items increases, then cardinality also
increases accordingly. So, to measure the associations between thousands of data items, there
are several metrics.
Let's now see what an association rule exactly looks like. It consists of an antecedent and a consequent, both of which are lists of items. Note that implication here means co-occurrence and not causality. For a given rule, the itemset is the list of all the items in the antecedent and the consequent.

Evaluation of association rules:


1. Support: The number of transactions that include the items in both the {X} and {Y} parts of the rule, as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together, expressed as a fraction of all transactions: support(X → Y) = (transactions containing X ∪ Y) / (total transactions).

Consider itemset1 = {bread} and itemset2 = {shampoo}. There will be far more transactions
containing bread than those containing shampoo. So, itemset1 will generally have a higher
support than itemset2.
Now consider itemset1 = {bread, butter} and itemset2 = {bread, shampoo}. Many
transactions will have both bread and butter on the cart but bread and shampoo? Not so much.
So in this case, itemset1 will generally have a higher support than itemset2.


One might want to consider only the itemsets which occur at least 50 times out of a total of
10,000 transactions i.e. support = 0.005. If an itemset happens to have a very low support, we
do not have enough information on the relationship between its items and hence no
conclusions can be drawn from such a rule.

2. Confidence: It is the ratio of the number of transactions that include all items in both {A} and {B} to the number of transactions that include all items in {A}: confidence(A → B) = support(A ∪ B) / support(A).

This measure defines the likeliness of occurrence of consequent on the cart given that the cart
already has the antecedents. Technically, confidence is the conditional probability of
occurrence of consequent given the antecedent.
What do you think the confidence for {Butter} → {Bread} would be? That is, what fraction of transactions having butter also had bread? Very high, i.e., a value close to 1? That's right. What about {Yogurt} → {Milk}? High again. {Toothbrush} → {Milk}? Not so sure? The confidence for this rule will also be high, since {Milk} is such a frequent itemset and would be present in every other transaction.


3. Lift: The lift of the rule X → Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the frequency (support) of {Y}, so lift = confidence(X → Y) / support(Y).

• Lift = 1: The occurrence of the antecedent and the consequent are independent of each other.
• Lift > 1: It indicates the degree to which the two itemsets are dependent on each other.
• Lift < 1: It tells us that one item is a substitute for the other, meaning one item has a negative effect on the other.
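To make the three measures concrete, the short sketch below computes support, confidence, and lift for a hypothetical rule {bread} → {butter} over a small made-up list of transactions (all names and numbers are illustrative assumptions).

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "shampoo"},
    {"milk", "shampoo"},
    {"bread", "butter", "egg"},
]

X, Y = {"bread"}, {"butter"}
n = len(transactions)

support_X = sum(X <= t for t in transactions) / n            # fraction containing the antecedent
support_Y = sum(Y <= t for t in transactions) / n            # fraction containing the consequent
support_XY = sum((X | Y) <= t for t in transactions) / n     # fraction containing both together

confidence = support_XY / support_X
lift = confidence / support_Y

print("support =", support_XY, "confidence =", confidence, "lift =", lift)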

Apriori algorithm:
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected. This algorithm uses
a breadth-first search and Hash Tree to calculate the itemset associations efficiently. It is
the iterative process for finding the frequent itemsets from the large dataset.
This algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to find products that can be bought together. It can also be used in the healthcare field, for example to find drug reactions for patients.
1. Generating itemsets from a list of items
First step in generation of association rules is to get all the frequent itemsets on which binary
partitions can be performed to get the antecedent and the consequent. For example, if there
are 6 items {Bread, Butter, Egg, Milk, Notebook, Toothbrush} on all the transactions
combined, itemsets will look like {Bread}, {Butter}, {Bread, Notebook}, {Milk,
Toothbrush}, {Milk, Egg, Vegetables} etc. Size of an itemset can vary from one to the total
number of items that we have. Now, we seek only frequent itemsets from this and not all so
as to put a check on the number of total itemsets generated.

Frequent itemsets are the ones which occur at least a minimum number of times in the
transactions. Technically, these are the itemsets for which support value (fraction of
transactions containing the itemset) is above a minimum threshold — minsup.
So, {Bread, Notebook} might not be a frequent itemset if it occurs only 2 times out of 100
transactions and (2/100) = 0.02 falls below the value of minsup.
A brute force approach to find frequent itemsets is to form all possible itemsets and check the
support value of each of these. Apriori principle helps in making this search efficient. It
states that all subsets of a frequent itemset must also be frequent.
This is equivalent to saying that the number of transactions containing the items {Bread, Egg} is greater than or equal to the number of transactions containing {Bread, Egg, Vegetables}. If the latter occurs in 30 transactions, the former occurs in all 30 of them and possibly in even more transactions. So if the support value of {Bread, Egg, Vegetables}, i.e. (30/100) = 0.3, is above minsup, then we can be assured that the support value of {Bread, Egg}, i.e.
(≥30/100) = ≥0.3, is above minsup too. This is called the anti-monotone property of support: if we drop an item from an itemset, the support value of the new itemset generated will either stay the same or increase.
Apriori principle allows us to prune all the supersets of an itemset which does not satisfy the
minimum threshold condition for support. For example, if {Milk, Notebook} does not satisfy
our threshold of minsup, an itemset with any item added to this will never cross the threshold
too. The methodology that results is called the apriori algorithm.

2. Generating all possible rules from the frequent itemsets


Once the frequent itemsets are generated, identifying rules out of them is comparatively less
taxing. Rules are formed by binary partition of each itemset. If {Bread,Egg,Milk,Butter} is the
frequent itemset, candidate rules will look like:
(Egg, Milk, Butter → Bread), (Bread, Milk, Butter → Egg), (Bread, Egg → Milk, Butter),
(Egg, Milk → Bread, Butter), (Butter→ Bread, Egg, Milk)
From a list of all possible candidate rules, we aim to identify rules that fall above a minimum
confidence level (minconf). Just like the anti-monotone property of support, confidence of
rules generated from the same itemset also follows an anti-monotone property. It is anti-
monotone with respect to the number of elements in consequent.
This means that confidence of (A,B,C→ D) ≥ (B,C → A,D) ≥ (C → A,B,D). To remind,
confidence for {X → Y} = support of {X,Y}/support of {X}
As we know, the support of all the rules generated from the same itemset remains the same; the difference occurs only in the denominator of the confidence calculation. As the number of items in X decreases, support{X} increases (as follows from the anti-monotone property of support) and hence the confidence value decreases.
An intuitive explanation for the above will be as follows. Consider F1 and F2:
F1 = fraction of transactions having (butter) also having (egg, milk, bread)
There will be many transactions having butter and all three of egg, milk and bread will be able
to find place only in a small number of those.
F2 = fraction of transactions having (milk, butter, bread) also having (egg)
There will only be a handful of transactions having all three of milk, butter and bread (as
compared to having just butter) and there will be high chances of having egg on those.


So it will be observed that F1 < F2. Using this property of confidence, pruning is done in a
similar way as was done while looking for frequent itemsets. It is illustrated in the figure
below.
We start with a frequent itemset {a,b,c,d} and start forming rules with just one consequent.
Remove the rules failing to satisfy the minconf condition. Now, start forming rules using a
combination of consequents from the remaining ones. Keep repeating until only one item is left in the antecedent. This process has to be done for all frequent itemsets.

With these two steps, we have identified a set of association rules which satisfy both the
minimum support and minimum confidence condition. The number of such rules obtained will
vary with the values of minsup and minconf. Now, this subset of rules thus generated can be
searched for highest values of lift to make business decisions.
Generally, the apriori algorithm operates on a database containing a huge number of
transactions. Apriori algorithm helps the customers to buy their products with ease and
increases the sales performance of the particular store.
Let's say we have the following transaction data of a store (items are labelled 1 to 5):

TID | Items
T1 | 1, 3, 4
T2 | 2, 3, 5
T3 | 1, 2, 3, 5
T4 | 2, 5
T5 | 1, 3, 5


Iteration 1: Let's assume the minimum support value is 2. We create the itemsets of size 1 and calculate their support values: {1} = 3, {2} = 3, {3} = 4, {4} = 1, {5} = 4.
As you can see, item 4 has a support value of 1, which is less than the minimum support value. So we are going to discard {4} in the upcoming iterations. We have the final table F1 = { {1}, {2}, {3}, {5} }.

Iteration 2: Next we create itemsets of size 2 and calculate their support values. All combinations of the items in F1 are used in this iteration: {1,2} = 1, {1,3} = 3, {1,5} = 2, {2,3} = 2, {2,5} = 3, {3,5} = 3.
Itemsets having support less than 2 are eliminated again; in this case {1,2}. Now let's understand what pruning is and how it makes Apriori one of the best algorithms for finding frequent itemsets.
Pruning: We divide the itemsets in C3 into subsets and eliminate the subsets that have a support value less than 2.


Iteration 3: We will discard {1,2,3} and {1,2,5} as they both contain {1,2}; this is the main highlight of the Apriori algorithm. The remaining candidate 3-itemsets are {1,3,5} with support 2 and {2,3,5} with support 2, so F3 = { {1,3,5}, {2,3,5} }.
Iteration 4: Using the sets of F3 we create C4 = { {1,2,3,5} }. Since the support of this itemset is less than 2, we stop here, and the final frequent itemsets we have are those in F3.
Note: Till now we haven't calculated the confidence values yet.
With F3 we get the following itemsets:
For I = {1,3,5}, subsets are {1,3}, {1,5}, {3,5}, {1}, {3}, {5}
For I = {2,3,5}, subsets are {2,3}, {2,5}, {3,5}, {2}, {3}, {5}
Applying Rules: We will now create rules and apply them to the itemsets in F3. Let's assume the minimum confidence value is 60%.
For every subset S of I, we output the rule
S –> (I – S) (meaning S recommends I – S)
if support(I) / support(S) >= min_conf.
{1,3,5}
Rule 1: {1,3} –> ({1,3,5} – {1,3}) means 1 & 3 –> 5
Confidence = support(1,3,5)/support(1,3) = 2/3 = 66.66% > 60%
Hence Rule 1 is selected.
Rule 2: {1,5} –> ({1,3,5} – {1,5}) means 1 & 5 –> 3
Confidence = support(1,3,5)/support(1,5) = 2/2 = 100% > 60%
Rule 2 is selected.
Rule 3: {3,5} –> ({1,3,5} – {3,5}) means 3 & 5 –> 1

Confidence = support(1,3,5)/support(3,5) = 2/3 = 66.66% > 60%


Rule 3 is selected.
Rule 4: {1} –> ({1,3,5} – {1}) means 1 –> 3 & 5
Confidence = support(1,3,5)/support(1) = 2/3 = 66.66% > 60%
Rule 4 is selected.
Rule 5: {3} –> ({1,3,5} – {3}) means 3 –> 1 & 5
Confidence = support(1,3,5)/support(3) = 2/4 = 50% <60%
Rule 5 is rejected.
Rule 6: {5} –> ({1,3,5} – {5}) means 5 –> 1 & 3
Confidence = support(1,3,5)/support(5) = 2/4 = 50% < 60%
Rule 6 is rejected.
{2,3,5}
Rule 1: {2,3} –> ({2,3,5} – {2,3}) means 2 & 3 –> 5
Confidence = support(2,3,5)/support(2,3) = 2/2 = 100% > 60%
Hence Rule 1 is Selected
Rule 2: {2,5} –> ({2,3,5} – {2,5}) means 2 & 5 –> 3
Confidence = support(2,3,5)/support(2,5) = 2/3 = 66.66% > 60%
Rule 2 is Selected
Rule 3: {3,5} –> ({2,3,5} – {3,5}) means 3 & 5 –> 2
Confidence = support(2,3,5)/support(3,5) = 2/3 = 66.66% > 60%
Rule 3 is Selected
Rule 4: {2} –> ({2,3,5} – {2}) means 2 –> 3 & 5
Confidence = support(2,3,5)/support(2) = 2/3 = 66.66% > 60%
Rule 4 is Selected
Rule 5: {3} –> ({2,3,5} – {3}) means 3 –> 2 & 5
Confidence = support(2,3,5)/support(3) = 2/4 = 50% <60%
Rule 5 is Rejected
Rule 6: {5} –> ({2,3,5} – {5}) means 5 –> 2 & 3
Confidence = support(2,3,5)/support(5) = 2/4 = 50% < 60%
Rule 6 is Rejected
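For reference, the same frequent-itemset mining and rule generation can be reproduced in Python with the mlxtend library. The sketch below is only an illustration (it assumes mlxtend is installed, and the transaction list mirrors the support counts used in the worked example above).

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Transactions consistent with the worked example (items labelled 1 to 5)
transactions = [["1", "3", "4"], ["2", "3", "5"], ["1", "2", "3", "5"], ["2", "5"], ["1", "3", "5"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Minimum support of 2 out of 5 transactions, minimum confidence of 60%, as in the example
frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])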

Regression – Overview of Linear Regression Method:


Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables. More

specifically, Regression analysis helps us to understand how the value of the dependent
variable is changing corresponding to an independent variable when other independent
variables are held fixed. It predicts continuous/real values such as temperature, age, salary,
price, etc.
Linear regression predicts the relationship between two variables by assuming they have a
straight-line connection. It finds the best line that minimizes the differences between
predicted and actual values. Used in fields like economics and finance, it helps analyze and
forecast data trends. Linear regression can also involve several variables (multiple linear
regression) or be adapted for yes/no questions (logistic regression).
Simple Linear Regression
In a simple linear regression, there is one independent variable and one dependent
variable. The model estimates the slope and intercept of the line of best fit, which
represents the relationship between the variables. The slope represents the change in the
dependent variable for each unit change in the independent variable, while the intercept
represents the predicted value of the dependent variable when the independent variable is
zero.
Linear regression is perhaps the simplest statistical regression technique used for predictive analysis in machine learning. It models the linear relationship between the independent (predictor) variable, plotted on the X-axis, and the dependent (output) variable, plotted on the Y-axis. If there is a single input variable X (independent variable), such linear regression is called simple linear regression.

A scatter plot of such data shows the linear relationship between the output (y) and predictor (X) variables, with the best-fit straight line drawn through the points. Based on the given data points, we attempt to plot the line that fits the points best.


Simple Regression Calculation


To calculate the best-fit line, linear regression uses the traditional slope-intercept form given below:
Yi = β0 + β1Xi
where Yi = dependent variable, β0 = constant/intercept, β1 = slope/coefficient, and Xi = independent variable.
This algorithm describes the linear relationship between the dependent (output) variable Y and the independent (predictor) variable X using the straight line Y = β0 + β1X.
But how does the regression find out which line is the best-fit line?
The goal of the linear regression algorithm is to find the best values for B0 and B1 that define the best-fit line. The best-fit line is the line with the least error, which means the error between the predicted values and the actual values should be minimal.
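As a concrete illustration, B0 and B1 can be computed on a small made-up dataset with the closed-form least-squares estimates; the data and numbers below are assumptions for demonstration only.

import numpy as np

# Hypothetical data: y is roughly 2x + 1 plus a little noise
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# Closed-form least-squares estimates for simple linear regression
b1 = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = y.mean() - b1 * X.mean()

print("Best-fit line: y =", round(b0, 2), "+", round(b1, 2), "* x")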


Random Error(Residuals)
In regression, the difference between the observed value of the dependent variable( y i )
and the predicted value(ypredicted) is called the residuals.
εi = ypredicted – yi

where y predicted = B0 + B1Xi


What is the Best Fit Line?
In simple terms, the best-fit line is a line that best fits the given scatter plot.
Mathematically, you obtain the best-fit line by minimizing the Residual Sum of Squares
(RSS).

Cost Function for Linear Regression


The cost function helps to work out the optimal values for B0 and B1, which provides
the best-fit line for the data points.
In linear regression, the Mean Squared Error (MSE) cost function is generally used; it is the average of the squared errors between ypredicted and yi. Using the simple linear equation y = mx + b, the MSE is calculated as:
MSE = (1/n) Σ(i = 1 to n) ( yi – (m·xi + b) )²

Using the MSE function, we update the values of B0 and B1 so that the MSE value settles at its minimum. These parameters can be determined using the gradient descent method such that the value of the cost function is minimal.

Gradient Descent for Linear Regression


Gradient Descent is one of the optimization algorithms that optimize the cost function
(objective function) to reach the optimal minimal solution. To find the optimum solution,
we need to reduce the cost function (MSE) for all data points. This is done by updating
the values of the slope coefficient (B1) and the constant coefficient (B0) iteratively until
we get an optimal solution for the linear function.
A regression model uses the gradient descent algorithm to update the coefficients of the line: it starts from randomly selected coefficient values and then iteratively updates them to reduce the cost function until its minimum is reached.


Gradient Descent Example


Let's take an example to understand this. Imagine a U-shaped pit. You are standing at the uppermost point of the pit, and your objective is to reach the bottom of the pit. Suppose there is a treasure at the bottom, and you can only take a discrete number of steps to reach it. If you opt to take one step at a time, you will eventually get to the bottom of the pit, but this will take a longer time. If you decide to take larger steps each time, you may reach the bottom sooner, but there is a chance that you overshoot it and end up not even near the bottom. In the gradient descent algorithm, the size of the steps you take corresponds to the learning rate, and this decides how fast the algorithm converges to the minima.

To update B0 and B1, we take gradients from the cost function. To find these gradients, we take the partial derivatives of the cost function J with respect to B0 and B1:
∂J/∂B0 = (2/n) Σi ( ypredicted,i – yi )
∂J/∂B1 = (2/n) Σi ( ypredicted,i – yi ) · xi
B0 := B0 – α · ∂J/∂B0
B1 := B1 – α · ∂J/∂B1


We need to minimize the cost function J. One of the ways to achieve this is to apply the batch gradient descent algorithm. In batch gradient descent, the values are updated in each iteration (the last two equations above show the updating of the values). The partial derivatives are the gradients, and they are used to update the values of B0 and B1. Alpha (α) is the learning rate.
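A minimal batch-gradient-descent sketch for the MSE cost above might look like the following; the learning rate, iteration count, and data are illustrative assumptions rather than prescribed values.

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

b0, b1 = 0.0, 0.0          # start from arbitrary coefficient values
alpha, n = 0.01, len(X)    # learning rate and number of samples

for _ in range(5000):
    y_pred = b0 + b1 * X
    error = y_pred - y
    # Partial derivatives of the MSE cost with respect to b0 and b1
    grad_b0 = (2 / n) * error.sum()
    grad_b1 = (2 / n) * (error * X).sum()
    # Update both coefficients simultaneously
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print("Gradient descent estimates: b0 =", round(b0, 2), ", b1 =", round(b1, 2))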

Linear regression is important for a few reasons:


 Simplicity and interpretability: It's a relatively easy concept to understand and apply. The resulting simple linear regression model is a straightforward equation that shows how one variable affects another. This makes it easier to explain and trust the results compared to more complex models.


 Prediction: Linear regression allows you to predict future values based on existing data. For instance, you can use it to predict sales based on marketing spend or house prices based on square footage.
 Foundation for other techniques: It serves as a building block for many other
data science and machine learning methods. Even complex algorithms often rely
on linear regression as a starting point or for comparison purposes.
 Widespread applicability: Linear regression can be used in various fields, from finance and economics to science and social sciences. It's a versatile tool for uncovering relationships between variables in many real-world scenarios.
In essence, linear regression provides a solid foundation for understanding data and making predictions. It's a cornerstone technique that paves the way for more advanced data analysis methods.

Evaluation Metrics for Linear Regression


The strength of any linear regression model can be assessed using various evaluation
metrics. These evaluation metrics usually provide a measure of how well the observed
outputs are being generated by the model.
The most used metrics are,
1. Coefficient of Determination or R-squared (R²)
2. Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)

Coefficient of Determination or R-Squared (R2)


R-squared is a number that explains the amount of variation that is explained/captured by
the developed model. It always ranges between 0 & 1 . Overall, the higher the value of
R-squared, the better the model fits the data.
Mathematically it can be represented as,
R2 = 1 – ( RSS/TSS )
 Residual Sum of Squares (RSS) is defined as the sum of the squares of the residuals for each data point in the plot/data: RSS = Σi ( yi – ypredicted,i )². It is a measure of the difference between the expected and the actual observed output.


 Total Sum of Squares (TSS) is defined as the sum of the squared differences of the data points from the mean of the response variable. Mathematically,
TSS = Σi ( yi – ȳ )²
where ȳ is the mean of the observed values of the response variable.

The significance of R-squared is shown by the following figures,

Root Mean Squared Error


The Root Mean Squared Error is the square root of the variance of the residuals. It specifies the absolute fit of the model to the data, i.e., how close the observed data points are to the predicted values. Mathematically it can be represented as:
RMSE = sqrt( Σi ( yi – ypredicted,i )² / n )
To make this estimate unbiased, one has to divide the sum of the squared residuals by the degrees of freedom rather than the total number of data points in the model. This term is then called the Residual Standard Error (RSE). For simple linear regression it can be represented as:
RSE = sqrt( Σi ( yi – ypredicted,i )² / (n – 2) )

R-squared is a better measure than RMSE, because the value of the Root Mean Squared Error depends on the units of the variables (i.e., it is not a normalized measure) and can change when the units of the variables change.
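Both metrics can be computed directly with scikit-learn; the sketch below uses hypothetical observed and predicted values purely for illustration.

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of the mean squared error

print("R-squared =", round(r2, 3), ", RMSE =", round(rmse, 3))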

Assumptions of Linear Regression


Regression is a parametric approach, which means that it makes assumptions about the data for analysis. For successful regression analysis, it's essential to validate the following assumptions.


1. Linearity of residuals: There needs to be a linear relationship between the


dependent variable and independent variable(s).

2. Independence of residuals: The error terms should not be dependent on one another
(like in time-series data wherein the next value is dependent on the previous one).
There should be no correlation between the residual terms. The absence of this
phenomenon is known as Autocorrelation.
There should not be any visible patterns in the error terms.

3. Normal distribution of residuals: The residuals should follow a normal distribution with a mean equal to zero or close to zero. This is done to check whether the selected line is the line of best fit or not. If the error terms are non-normally distributed, it suggests that there are a few unusual data points that must be studied closely to build a better model.


4. The equal variance of residuals: The error terms must have constant variance. This
phenomenon is known as Homoscedasticity. The presence of non-constant variance in
the error terms is referred to as Heteroscedasticity. Generally, non-constant variance
arises in the presence of outliers or extreme leverage values.

Hypothesis in Linear Regression


Once you have fitted a straight line on the data, you need to ask, "Is this straight line a
significant fit for the data?" or "Does the beta coefficient explain the variance in the
data?" This is where hypothesis testing on the beta coefficient comes in.
The Null and Alternate hypotheses in this case are:
H0 : B1 = 0
HA : B1 ≠ 0
To test this hypothesis we use a t-test; the test statistic for the beta coefficient is given by
t = B1 / SE(B1)
where B1 here denotes the estimated coefficient and SE(B1) its standard error.

Assessing the Model Fit


Some other parameters to assess a model are:
 t statistic: It is used to determine the p-value and hence, helps in determining
whether the coefficient is significant or not
 F statistic: It is used to assess whether the overall model fit is significant or not.
Generally, the higher the value of the F-statistic, the more significant a model
turns out to be.
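In practice these statistics are usually read off a regression summary. A minimal sketch
(assuming the statsmodels package is installed; x and y below are hypothetical arrays) that
prints the t-statistics, p-values, F-statistic and R-squared in one table:

import numpy as np
import statsmodels.api as sm

# Hypothetical data, used only to illustrate the summary output
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

X = sm.add_constant(x)        # adds the intercept term B0
ols_fit = sm.OLS(y, X).fit()  # ordinary least squares fit
print(ols_fit.summary())      # includes t and F statistics, p-values, R-squared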

Multiple Linear Regression


Multiple linear regression is a technique to understand the relationship between
a single dependent variable and multiple independent variables.
The formulation for multiple linear regression is also similar to simple linear regression
with the small change that instead of having one beta variable, you will now have betas
for all the variables used. The formula is given as:
Y = B0 + B1 X1 + B2 X2 + … + Bp Xp + ε
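A minimal sketch of fitting such a model with scikit-learn is shown below; the column names
and values are hypothetical and only illustrate that several predictors are passed at once:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset with two independent variables (X1, X2) and one target (Y)
data = pd.DataFrame({
    'Weight-KG': [51, 62, 69, 65, 58, 54],
    'Age-Years': [21, 34, 41, 29, 25, 31],
    'Height-CM': [167, 182, 176, 173, 173, 168],
})

X = data[['Weight-KG', 'Age-Years']]   # multiple predictors
y = data['Height-CM']                  # single dependent variable

mlr = LinearRegression().fit(X, y)
print(mlr.intercept_, mlr.coef_)       # estimates of B0 and [B1, B2]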


Considerations of Multiple Linear Regression


All four assumptions made for Simple Linear Regression still hold for Multiple Linear
Regression along with a few new additional assumptions.
 Overfitting: When more and more variables are added to a model, the model may
become far too complex and usually ends up memorizing all the data points in the
training set. This phenomenon is known as the overfitting of a model. This
usually leads to high training accuracy and very low test accuracy.

 Multicollinearity: It is the phenomenon where, in a model with several independent
variables, some of the variables are interrelated (correlated with one another).

 Feature Selection: With more variables present, selecting the optimal set of
predictors from the pool of given features (many of which might be redundant)
becomes an important task for building a relevant and better model.

Multicollinearity
Multicollinearity makes it difficult to find out which variable actually contributes towards
the prediction of the response variable, and it can lead one to incorrect conclusions about
the effect of a variable on the target variable. Though it does not affect the precision of
the model predictions, it is essential to properly detect and deal with the multicollinearity
present in the model, as random removal of any of these correlated variables causes the
coefficient values to swing wildly and even change signs.

Multicollinearity can be detected using the following methods.


 Pairwise Correlations: Checking the pairwise correlations between different
pairs of independent variables can throw useful insights into detecting
multicollinearity.

 Variance Inflation Factor (VIF): Pairwise correlations may not always be useful, as it
is possible that no single variable completely explains another on its own, while
several variables taken together could. To check these sorts of relationships between
variables, one can use VIF. VIF quantifies the relationship of one independent
variable with all the other independent variables. VIF is given by,
VIFi = 1 / (1 – Ri²)


where i refers to the ith independent variable and Ri² is the R-squared obtained when that
variable is regressed on all of the remaining independent variables.
The common heuristic followed for VIF values is: if VIF > 10, multicollinearity is high
and the variable should be dropped; if VIF is between 5 and 10, the variable may be
acceptable but should be inspected first; if VIF < 5, it is considered a good VIF value.
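A minimal sketch of computing VIF with statsmodels (the feature matrix X below is
hypothetical; in practice it would hold only the independent variables of your model):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical feature matrix; x2 is nearly a multiple of x1, so a high VIF is expected
X = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5, 6],
    'x2': [2, 4, 6, 8, 10, 13],
    'x3': [5, 3, 6, 2, 7, 4],
})

Xc = sm.add_constant(X)   # VIF is normally computed with an intercept term included
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(Xc.values, i + 1) for i in range(X.shape[1])],
})
print(vif)   # VIF > 10: drop; between 5 and 10: inspect; < 5: acceptable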

Overfitting and Underfitting in Linear Regression


There have always been situations where a model performs well on training data but not
on the test data. While training models on a dataset, overfitting and underfitting are the
most common problems faced by people.
Before understanding overfitting and underfitting one must know about bias and
variance.
Bias
Bias is a measure of how far a model's predictions are likely to be from the true values on
future unseen data. Complex models, assuming there is enough training data available,
can make accurate predictions, whereas models that are too naive are very likely to
perform badly. Simply put, bias is the error introduced by the simplifying assumptions the
model makes when learning from the training data.
Generally, linear algorithms have a high bias, which makes them fast to learn and easier
to understand but, in general, less flexible, implying lower predictive performance on
complex problems.

Variance
Variance is the sensitivity of the model towards the training data; it quantifies how much
the model will react when the input data is changed.
Ideally, the model shouldn't change too much from one training dataset to the next,
which means that the algorithm is good at picking out the hidden underlying patterns
between the inputs and the output variables.
Ideally, a model should have low variance, which means that the model doesn't change
drastically after changing the training data (it is generalizable). Higher variance will
make a model change drastically even on a small change in the training dataset.


Bias Variance Trade-off

In the pursuit of optimal performance, a supervised machine learning algorithm seeks to
strike a balance between low bias and low variance for increased robustness.
In the realm of machine learning, there exists an inherent relationship between bias and
variance, characterized by an inverse correlation.
 Increased bias leads to reduced variance.
 Conversely, heightened variance results in diminished bias.
Finding an equilibrium between bias and variance is crucial, and algorithms must
navigate this trade-off for optimal outcomes.
In practice, calculating precise bias and variance error terms is challenging due to the
unknown nature of the underlying target function.

Overfitting
When a model learns every pattern and noise in the data to such an extent that it affects
the performance of the model on the unseen future dataset, it is referred to as overfitting.
The model fits the data so well that it interprets noise as patterns in the data.
When a model has low bias and higher variance it ends up memorizing the data and
causing overfitting. Overfitting causes the model to become specific rather than generic.
This usually leads to high training accuracy and very low test accuracy.
Detecting overfitting is useful, but it doesn't solve the actual problem. There are several
ways to prevent overfitting, which are stated below (a short sketch of two of them follows
the list):
 Cross-validation
 If the training data is too small to train on, add more relevant and clean data.


 If the training data is too large, do some feature selection and remove unnecessary
features.
 Regularization
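As a rough sketch of two of these remedies (the arrays below are synthetic and purely
illustrative), k-fold cross-validation estimates how well the model generalizes, while ridge
(L2) regularization penalizes overly large coefficients:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 60 samples, 8 features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=60)

ridge = Ridge(alpha=1.0)                     # alpha controls the regularization strength
scores = cross_val_score(ridge, X, y, cv=5)  # 5-fold cross-validated R-squared
print(scores.mean(), scores.std())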

Underfitting
Underfitting is not discussed as often as overfitting. When the model fails to learn from
the training dataset and is also not able to generalize to the test dataset, it is referred to as
underfitting. This type of problem can be very easily detected by the performance
metrics.
When a model has high bias and low variance it ends up not generalizing the data and
causing underfitting. It is unable to find the hidden underlying patterns in the data. This
usually leads to low training accuracy and very low test accuracy. The ways to prevent
underfitting are stated below,
 Increase the model complexity
 Increase the number of features in the training data
 Remove noise from the data.

Model description:
import pandas as pd
df = pd.read_csv('WeightnHeight.csv')
df

Sr. No. Weight-KG Height-CM


0 51 167
1 62 182
2 69 176
3 65 173
4 65 172
... ... ...
24995 54 177
24996 55 164
24997 54 164
24998 60 172
24999 57 175
25000 rows × 2 columns


import matplotlib.pyplot as plt


plt.scatter(x = df['Weight-KG'] ,y = df['Height-CM'])

data = df.iloc[0:100]
data

Sr. No. Weight-KG Height-CM


0 51 167
1 62 182
2 69 176
3 65 173
4 65 172
... ... ...
95 60 179
96 54 168
97 56 161
98 58 170
99 52 175
100 rows × 2 columns
plt.scatter(x = data['Weight-KG'] ,y = data['Height-CM'])


from sklearn.model_selection import train_test_split


x = data[0:75]
x

Sr. No. Weight-KG Height-CM


0 51 167
1 62 182
2 69 176
3 65 173
4 65 172
... ... ...
70 59 166
71 61 180
72 64 178
73 47 163
74 58 173
75 rows × 2 columns
x = data[['Weight-KG']]
y = data['Height-CM']
x_train, x_test , y_train,y_test = train_test_split(x,y, test_size=0.3 ,random_state=10)
x_train.shape, x_test.shape , y_train.shape,y_test.shape
print(x_train)


Sr. No. Weight-KG


42 63
34 64
84 59
52 57
35 58
.. …
89 60
28 49
64 61
15 64
9 55
[70 rows x 1 columns]
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
model = LinearRegression()
model.fit(x_train,y_train) ### Model Training
LinearRegression()
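# The fitted line has the form Height = m * Weight + c; the intercept c and slope m are read off below.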
c = model.intercept_
c
144.6936193619362
m = model.coef_
m
array([0.48482412])
x = 63
y=m*x+c
y
array([175.23753914])
x= 63
model.predict([[x]])
array([175.23753914])
x_train[0:5]


Sr. No. Weight-KG


42 63
34 64
84 59
52 57
35 58

y_train[0:5]

Sr. No. Height-CM


42 177
34 182
84 176
52 178
35 176
Name: Height-CM, dtype: int64
error = 177 - 175.23
error
1.7700000000000102
y_pred = model.predict(x_test)
y_pred
array([171.84377027, 170.3892979 , 174.75271502, 171.35894615,
176.20718739, 176.20718739, 171.84377027, 172.3285944 , 173.78306677,
167.48035316, 178.14648388, 174.75271502, 172.81341852, 169.90447378,
173.78306677, 172.81341852, 175.72236326, 176.20718739, 174.26789089,
176.69201151, 174.75271502, 173.29824264, 177.17683563, 167.96517729,
175.23753914, 177.17683563, 176.20718739, 174.75271502,
175.72236326, 172.81341852])
print(y_test)

Sr. No. Height-CM


19 171
14 173
43 173
37 172
66 175


3 173
79 172
41 166
38 172
68 171
2 176
1 182
60 169
53 176
95 179
74 173
92 182
26 180
59 172
46 170
90 173

error = y_test - y_pred


error

Sr. No. Error


19 -0.843770
14 2.610702
43 -1.752715
37 0.641054
66 -1.207187
3 -3.207187
79 0.156230
41 -6.328594
38 -1.783067
68 3.519647
2 -2.146484
1 7.247285
60 -3.813419
53 6.095526
95 5.216933
74 0.186581
92 6.277637
26 3.792813
59 -2.267891
46 -6.692012
90 -1.752715

sum(error)
-5.400947787086153
mse = mean_squared_error(y_test,y_pred)


mse
12.949567298780456
import numpy as np   # NumPy was not imported earlier; needed for the square root
rmse = np.sqrt(mse)
rmse
3.5985507219963506
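To tie this back to the evaluation metrics discussed earlier, the R-squared of this fit on the
test split can also be computed (a small optional addition, assuming y_test and y_pred from
the cells above are still in scope):

from sklearn.metrics import r2_score
r2_score(y_test, y_pred)   # the closer to 1, the more of the variation the line explains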

Classification- Overview
The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In classification, a program learns
from the given dataset or observations and then classifies new observations into a number of
classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can
also be called targets, labels or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with the
corresponding output.
In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):
y = f(x), where y = categorical output
The best example of an ML classification algorithm is Email Spam Detector.
The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.
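As a small sketch of this idea (hypothetical numeric feature and labels, using scikit-learn's
logistic regression as one possible classifier):

from sklearn.linear_model import LogisticRegression

# Hypothetical labelled data: x = hours of study, y = result (0 = fail, 1 = pass)
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)                       # learn the mapping y = f(x) from labelled examples
print(clf.predict([[2.0], [7.0]]))  # predicted categories for new observations -> [0 1]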
Classification can be pictured as a scatter of points belonging to two classes, Class A and
Class B, where the points within each class have features similar to each other and
dissimilar to those of the other class.


The algorithm which implements the classification on a dataset is known as a classifier.


Types of Classification:
o Binary Classifier: If the classification problem has only two possible outcomes, then it
is called a Binary Classifier. Examples: YES or NO, MALE or FEMALE, SPAM or
NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it
is called a Multi-class Classifier. Examples: classification of types of crops,
classification of types of music.
o Multi-Label Classification: Each instance can belong to multiple classes. Example:
tagging a news article with multiple tags like "sports", "politics", etc.
o Imbalanced Classification: Deals with datasets where the class distribution is not
uniform. Example: fraud detection, where fraudulent transactions are much less
frequent than legitimate ones.

Types of ML Classification Algorithms:


1. Linear Models
i. Logistic Regression- Suitable for binary classification.
ii. Support Vector Machines- Effective for complex boundaries.
2. Non-linear Models
i. Naïve Bayes- Simple and efficient, assuming feature independence.
ii. Decision Tree Classification- Easy to interpret, but can overfit.
iii. Random Forest Classification- Ensemble of decision trees, often more robust.
iv. Neural Networks- Powerful for complex tasks, but can be computationally
expensive.

Use cases of Classification Algorithms


1. Email Spam Detection
2. Speech Recognition
3. Identification of cancer tumor cells.
4. Drug classification
5. Biometric Identification, etc.

Naïve Bayes classifier:


o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.


o It is mainly used in text classification that includes a high-dimensional training


dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms and helps in building fast machine learning models that can make quick
predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.

Why is it called Naïve Bayes?


The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be
described as:
 Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple; each feature individually contributes to identifying it as an apple without
depending on the others.
 Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = ( P(B|A) × P(A) ) / P(B)
Where,
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.


Working of Naïve Bayes' Classifier:


Consider a fictional dataset that describes the weather conditions for playing a game of golf.
Given the weather conditions, each tuple classifies the conditions as fit ("Yes") or unfit ("No")
for playing golf. Using this dataset, we need to decide whether we should play or not on a
particular day according to the weather conditions.
Solution: To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Here is a tabular representation of our dataset:

Sr. No. Outlook Temperature Humidity Windy Play Golf

0 Rainy Hot High False No

1 Rainy Hot High True No

2 Overcast Hot High False Yes

3 Sunny Mild High False Yes

4 Sunny Cool Normal False Yes

5 Sunny Cool Normal True No

6 Overcast Cool Normal True Yes

7 Rainy Mild High False No

8 Rainy Cool Normal False Yes

9 Sunny Mild Normal False Yes

10 Rainy Mild Normal True Yes

11 Overcast Mild High True Yes

12 Overcast Hot Normal False Yes

13 Sunny Mild High True No

The dataset is divided into two parts, namely, the feature matrix and the response vector. The
feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of
the values

of the features (the independent variables). In the above dataset, the features are 'Outlook',
'Temperature', 'Humidity' and 'Windy'. The response vector contains the value of the class
variable (prediction or output) for each row of the feature matrix. In the above dataset, the
class variable is 'Play Golf'.
First, the dataset is converted into frequency tables, and a likelihood table is generated for
each feature (Outlook, Temperature, Humidity and Windy) against the class variable 'Play
Golf'.

Let us test it on a new set of features (let us call it today). Applying Bayes' theorem with the
naive independence assumption, the probability of playing golf today is

P(Yes | today) = P(Outlook | Yes) · P(Temperature | Yes) · P(Humidity | Yes) · P(Windy | Yes) · P(Yes) / P(today)

(each conditional term is the likelihood of today's value of the corresponding feature), and the
probability to not play golf is given by:

P(No | today) = P(Outlook | No) · P(Temperature | No) · P(Humidity | No) · P(Windy | No) · P(No) / P(today)

Since P(today) is common in both probabilities, we can ignore P(today) and find proportional
probabilities as:

P(Yes | today) ∝ P(Outlook | Yes) · P(Temperature | Yes) · P(Humidity | Yes) · P(Windy | Yes) · P(Yes) ≈ 0.02116

P(No | today) ∝ P(Outlook | No) · P(Temperature | No) · P(Humidity | No) · P(Windy | No) · P(No) ≈ 0.0068

Now, since P(Yes | today) + P(No | today) = 1,


These numbers can be converted into a probability by making the sum equal to 1
(normalization):
P(Yes | today) = 0.02116 / (0.02116 + 0.0068) ≈ 0.7568, and
P(No | today) = 0.0068 / (0.02116 + 0.0068) ≈ 0.2432
Since P(Yes | today) > P(No | today),
the prediction is that golf would be played, i.e. 'Yes'.
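The same kind of calculation can be reproduced programmatically. The sketch below is a
minimal illustration (assuming scikit-learn is available; the 'today' query is chosen here only
as an example), and its probabilities differ slightly from the hand calculation because
scikit-learn's CategoricalNB applies Laplace smoothing by default:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# The golf dataset from the table above (Windy kept as strings for simplicity)
golf = pd.DataFrame({
    'Outlook': ['Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny', 'Sunny', 'Overcast',
                'Rainy', 'Rainy', 'Sunny', 'Rainy', 'Overcast', 'Overcast', 'Sunny'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool',
                    'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High',
                 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Windy': ['False', 'True', 'False', 'False', 'False', 'True', 'True', 'False',
              'False', 'False', 'True', 'True', 'False', 'True'],
    'PlayGolf': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes',
                 'Yes', 'Yes', 'Yes', 'No'],
})

enc = OrdinalEncoder()   # CategoricalNB expects integer-coded categories
X = enc.fit_transform(golf[['Outlook', 'Temperature', 'Humidity', 'Windy']]).astype(int)
y = golf['PlayGolf']

clf = CategoricalNB()
clf.fit(X, y)

# A hypothetical new day to classify
today = pd.DataFrame([['Sunny', 'Mild', 'Normal', 'False']],
                     columns=['Outlook', 'Temperature', 'Humidity', 'Windy'])
print(clf.predict(enc.transform(today).astype(int)))        # -> ['Yes'] for this combination
print(clf.predict_proba(enc.transform(today).astype(int)))  # probabilities in the order of clf.classes_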

Types of Naive Bayes Model:


1. Gaussian Naive Bayes classifier:
In Gaussian Naive Bayes, continuous values associated with each feature are assumed to
be distributed according to a Gaussian distribution. A Gaussian distribution is also called
the Normal distribution; when plotted, it gives a bell-shaped curve which is symmetric
about the mean of the feature values. (A short code sketch of this variant appears after
this list.)
2. Multinomial Naive Bayes:
Feature vectors represent the frequencies with which certain events have been generated
by a multinomial distribution. This is the event model typically used for document
classification. It means a particular document belongs to which category such as Sports,
Politics, education, etc.
3. Bernoulli Naive Bayes:
In the multivariate Bernoulli event model, features are independent booleans (binary
variables) describing inputs. Like the multinomial model, this model is popular for
document classification tasks, where binary term occurrence (i.e. a word occurs in a
document or not) features are used rather than term frequencies (i.e. frequency of a word
in the document).
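For continuous features, the Gaussian variant is the usual starting point. A minimal, purely
illustrative sketch using scikit-learn's built-in Iris dataset (not the golf data above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)   # 4 continuous features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

gnb = GaussianNB()                  # assumes each feature is normally distributed within each class
gnb.fit(X_train, y_train)
print(gnb.score(X_test, y_test))    # classification accuracy on the held-out set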

Advantages of Naïve Bayes Classifier:


1. Easy to implement and computationally efficient.
2. Effective in cases with a large number of features.
3. Performs well even with limited training data.
4. It performs well in the presence of categorical features.
5. For numerical features data is assumed to come from normal distributions

Disadvantages of Naïve Bayes Classifier:


1. Assumes that features are independent, which may not always hold in real-world data.


2. Can be influenced by irrelevant attributes.


3. May assign zero probability to unseen events, leading to poor generalization.

Applications of Naïve Bayes Classifier:


1. Spam Email Filtering: Classifies emails as spam or non-spam based on features.
2. Text Classification: Used in sentiment analysis, document categorization, and topic
classification.
3. Medical Diagnosis: Helps in predicting the likelihood of a disease based on symptoms.
4. Credit Scoring: Evaluates creditworthiness of individuals for loan approval.
5. Weather Prediction: Classifies weather conditions based on various factors.

References:
1. Joel Grus, ―Data Science from Scratch‖, O‘Reilly Media Inc., ISBN: 9781491901427
2. David Dietrich, Barry Heller, and Beibei Yang, ―Data Science & Big Data Analytics‖,
Wiley-EMC education Services.
3. https://www.geeksforgeeks.org/
