Data Mining
Assignment - Part 1
Clustering Clean Ads
Submitted by
Ganesh M
Ajai Subramaniam K
Aditya Raman
Shree Raksha S
1 - Clustering Digital Ads Data: ads24x7 is a digital marketing
company that has received seed funding of $10 million. They are
expanding into marketing analytics. They have collected data
from their Marketing Intelligence team and now want you (their
newly appointed data analyst) to segment the types of ads based on the
features provided. Use a clustering procedure to segment the ads into
homogeneous groups.
The following three features are commonly used in digital marketing:
CPM = (Total Campaign Spend / Number of Impressions) x 1,000.
Note that the Total Campaign Spend refers to the 'Spend' Column in
the dataset and the Number of Impressions refers to the
'Impressions' Column in the dataset.
CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost
(spend) refers to the 'Spend' Column in the dataset and the Number
of Clicks refers to the 'Clicks' Column in the dataset.
CTR = (Total Measured Clicks / Total Measured Ad Impressions) x
100. Note that the Total Measured Clicks refers to the 'Clicks' Column
in the dataset and the Total Measured Ad Impressions refers to the
'Impressions' Column in the dataset.
The Data Dictionary and the detailed description of the formulas for
CPM, CPC and CTR are given in sheet 2 of the Clustering Clean
ads_data Excel file.
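The three formulas above can be sketched as small Python functions. This is only an illustration: the function names and the sample numbers are made up and are not taken from the dataset, and rows with zero clicks or zero impressions (which would raise ZeroDivisionError) are not handled here.

```python
def cpm(spend, impressions):
    """Cost per mille: (Spend / Impressions) x 1,000."""
    return spend / impressions * 1000


def cpc(spend, clicks):
    """Cost per click: Spend / Clicks."""
    return spend / clicks


def ctr(clicks, impressions):
    """Click-through rate in percent: (Clicks / Impressions) x 100."""
    return clicks / impressions * 100


# Made-up values for illustration only (not taken from the dataset).
spend, impressions, clicks = 50.0, 20000, 100
print(cpm(spend, impressions), cpc(spend, clicks), ctr(clicks, impressions))
```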
i. Read the data and perform basic analysis such
as printing a few rows (head and tail), info,
data summary, null values, duplicate values,
etc.
All checks were performed; the info output is summarised here for reference.
Total columns: 19
Total null items: 4736 (23066 - 18330), in CTR, CPM & CPC
Total duplicates: 0
Data types available:
float64: 6
int64: 7
object: 6
ii. Treat missing values in CPC, CTR and CPM
using the formula given in question using
lambda function
The missing values were filled in using the given formulae, applied
through lambda functions. The results were stored in a new set of
columns, CPMA, CTRA & CPCA, which are used for the clustering.
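A minimal sketch of this step in plain Python is shown here. It assumes that existing values are kept and only missing ones are computed; the sample rows, the use of None for a missing value, and the helper name fill_missing are illustrative assumptions, not the code used in the report. With pandas, similar lambdas could be passed to DataFrame.apply with axis=1.

```python
# One lambda per metric, following the formulas given in the question.
formulas = {
    "CPM": lambda r: r["Spend"] / r["Impressions"] * 1000,
    "CPC": lambda r: r["Spend"] / r["Clicks"],
    "CTR": lambda r: r["Clicks"] / r["Impressions"] * 100,
}


def fill_missing(rows):
    """Return copies of the rows with CPMA, CPCA and CTRA columns added."""
    filled = []
    for row in rows:
        new_row = dict(row)
        for name, formula in formulas.items():
            value = row[name]
            # Keep an existing value; compute it only when it is missing.
            new_row[name + "A"] = formula(row) if value is None else value
        filled.append(new_row)
    return filled


# Made-up rows (not from the dataset); None marks a missing metric.
rows = [
    {"Spend": 50.0, "Impressions": 20000, "Clicks": 100,
     "CPM": None, "CPC": None, "CTR": None},
    {"Spend": 12.0, "Impressions": 4000, "Clicks": 8,
     "CPM": 3.0, "CPC": 1.5, "CTR": 0.2},
]
filled = fill_missing(rows)
print(filled[0]["CPMA"], filled[0]["CPCA"], filled[0]["CTRA"])
```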
Checking & Treating the Outliers:
Box plot for Outliers check:
Outliers must be treated before K-Means clustering. K-Means can be highly sensitive to outliers,
because each cluster centre is a mean and is pulled towards extreme values. As most columns are
highly right skewed, we capped the outliers at the upper whisker, calculated as Q3 + 1.5*IQR, where
IQR is the interquartile range (Q3 - Q1). This was applied to all columns except CTRA, CPMA & CPCA.
Looking at the box plot, the variables
Available_Impressions, Matched_Queries, Impressions, Clicks, Spend, Fee Revenue, CTRA, CPMA, CPCA
contain outliers.
Removal of outliers using formula
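The capping rule can be sketched as follows. The quartile convention (linear interpolation between sorted values) and the helper names are assumptions, since the report does not state how the quartiles were computed; the sample values are made up.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q between 0 and 1) of a sorted list."""
    pos = (len(sorted_values) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac


def cap_upper_whisker(values):
    """Cap values above Q3 + 1.5*IQR at that whisker; return (capped, whisker)."""
    ordered = sorted(values)
    q1 = percentile(ordered, 0.25)
    q3 = percentile(ordered, 0.75)
    whisker = q3 + 1.5 * (q3 - q1)
    return [min(v, whisker) for v in values], whisker


# Made-up, right-skewed sample (not from the dataset).
capped, whisker = cap_upper_whisker([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(whisker, capped)
```

Only the upper whisker is applied here, matching the right-skewed case described above; values below the lower whisker are left unchanged.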
Box Plot view After Treatment:
After treatment with the IQR method, the box plot shows no remaining outliers.
iii. Perform z-score scaling and discuss how it affects the speed of
the algorithm.
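Z-score scaling is a method of standardization in which each value is converted to the number of
standard deviations it lies from the column mean: z = (x - mean) / standard deviation. After scaling,
every column has a mean of 0 and a standard deviation of 1. Because all features are then on a
comparable scale, no single large-valued column dominates the distance calculations, and
distance-based algorithms such as K-Means generally converge in fewer iterations.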
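As a small illustration, the scaling of a single column can be sketched with the standard library. Using the population standard deviation is an assumption (the report does not say which variant was used), and the sample values are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]


# Made-up column with mean 5 and population standard deviation 2.
scaled = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print(scaled)
```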
iv. Construction of Dendrogram
A dendrogram is a tree diagram that records the result of hierarchical clustering. Each leaf represents an
individual observation, and branches join where two clusters are merged; the height of each join shows the
distance at which the merge took place. The dendrogram can be used to visualize how the observations group
together and to choose the number of clusters by cutting the tree at a chosen height.
Truncated Dendrogram:
A truncated dendrogram shows only the upper part of the tree, typically the last few merges,
which makes it easier to read when there are many observations or to focus on a particular
group of data.
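The merge history that a dendrogram draws can be sketched in plain Python. Average linkage with Euclidean distance is an assumption here (the report does not state the linkage method), the function name and points are made up, and this brute-force version is only suitable for tiny examples. If SciPy was used, the equivalent calls would be scipy.cluster.hierarchy.linkage and dendrogram, with truncate_mode='lastp' for the truncated view.

```python
import math


def average_linkage_merges(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for idx, a in enumerate(ids):
            for b in ids[idx + 1:]:
                # Average pairwise distance between members of the two clusters.
                total = sum(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        # The merged cluster gets a new id, as in a linkage table.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Made-up 2-D points forming two obvious pairs.
merges = average_linkage_merges([(0, 0), (0, 1), (5, 0), (5, 2)])
last_two = merges[-2:]  # a truncated view keeps only the last few merges
print(merges)
```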
v. Make Elbow plot (up to n=10) and identify optimum
number of clusters for k-means algorithm.
The WSS elbow plot is a graphical tool used to determine the optimal number of clusters
in a data set. The plot is generated by calculating the within-cluster sum of squares (WSS)
for a range of cluster numbers and then plotting the results. The point on the plot where
the WSS begins to decrease more slowly is generally considered to be the optimal number
of clusters.
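The WSS values behind an elbow plot can be sketched with a small K-Means written in plain Python. This is not the implementation used in the report: the deterministic farthest-point initialisation, the function names and the toy points are all assumptions made so that the example is reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Basic Lloyd's algorithm; returns (labels, centroids)."""
    # Farthest-point initialisation (an assumption, chosen to be deterministic).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids


def wss(points, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Made-up points forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
wss_by_k = {}
for k in range(1, 6):
    labels, centroids = kmeans(points, k)
    wss_by_k[k] = wss(points, labels, centroids)
print(wss_by_k)
```

On this toy data the WSS drops sharply up to k=3 and only slightly afterwards, which is the "elbow" pattern described above.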
vi. Print silhouette scores for up to 10 clusters and identify
optimum number of clusters.
The silhouette score measures how well each point fits its own cluster compared with the nearest
other cluster. It ranges from -1 to 1; a higher score indicates that points are closer to the
members of their own cluster than to points in other clusters.
The optimum number of clusters is the one that gives the highest average silhouette score.
Based on the silhouette scores, the optimal number of clusters is 6,
as its silhouette score is higher than that of any other number of
clusters tested.
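For reference, the average silhouette score for a given labelling can be sketched in plain Python. The function name and toy data are made up, Euclidean distance is assumed, and a point that is alone in its cluster is scored as 0 by convention.

```python
import math


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points."""
    if len(set(labels)) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None:
            scores.append(0.0)  # singleton cluster
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Made-up 1-D points: a sensible labelling and a poor one.
pts = [(0,), (1,), (10,), (11,)]
good = silhouette_score(pts, [0, 0, 1, 1])
bad = silhouette_score(pts, [0, 1, 0, 1])
print(good, bad)
```

Repeating this for each candidate number of clusters and keeping the highest score is the selection rule described above.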
vii. Profile the ads based on optimum number of clusters using silhouette score and your
domain understanding [Hint: Group the data by clusters and take sum or mean to
identify trends in Clicks, spend, revenue, CPM, CTR, & CPC based on Device Type.
Make bar plots]
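The grouping suggested in the hint can be sketched as follows. The column name "Cluster", the device values and all numbers are made up for illustration; only the idea of grouping by cluster and Device Type and taking the mean comes from the question.

```python
from collections import defaultdict


def profile(rows, keys, metrics):
    """Group rows by the given keys and return the mean of each metric."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return {
        group: {m: sum(r[m] for r in members) / len(members) for m in metrics}
        for group, members in groups.items()
    }


# Made-up rows (not from the dataset); "Cluster" is a hypothetical label column.
rows = [
    {"Cluster": 0, "Device Type": "Mobile", "Clicks": 10, "Spend": 2.0},
    {"Cluster": 0, "Device Type": "Mobile", "Clicks": 30, "Spend": 4.0},
    {"Cluster": 1, "Device Type": "Desktop", "Clicks": 5, "Spend": 1.0},
]
summary = profile(rows, ["Cluster", "Device Type"], ["Clicks", "Spend"])
print(summary)
```

The resulting per-group means are the values that would be drawn as bar plots.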
viii. Project Summary:
The analysis used two clustering methods, hierarchical clustering and
K-Means clustering.
Hierarchical clustering identified 5 as the optimum number of clusters.
K-Means clustering, based on the silhouette scores, gave n=6 as the
chosen optimum number of clusters.
This project is aimed at providing ads24x7, a digital marketing company, with a
more efficient way to segment its ads data using hierarchical clustering and
K-Means clustering. The company will be able to use the results to determine
which advertisement campaigns are most successful and to target its
advertising better.