CUSTOMER SEGMENTATION
CLASSIFICATION
A Project report submitted in partial fulfillment of
the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
Submitted by
P.SAI VIVEK 319126511046
M.SOWMYA 319126511036
M.SAI VARDHAN 319126511037
P.S.R.TEJASWI 319126511042
P.RAJESH 319126511044
P.HARIKRISHNA 319126511045
Under the guidance of
Mrs.D.CHANDRIKA
(ASSISTANT PROFESSOR)
DEPARTMENT OF INFORMATION TECHNOLOGY
ANIL NEERUKONDA INSTITUTE OF TECHNOLOGY
AND SCIENCES (UGC AUTONOMOUS)
(Permanently Affiliated to AU, Approved by AICTE and Accredited by NBA & NAAC with ‘A’ Grade)
Sangivalasa, Bheemili Mandal, Visakhapatnam Dist. (A.P.) 2022-2023
DECLARATION
This is to certify that the project work entitled “CUSTOMER
SEGMENTATION CLASSIFICATION” is a bonafide work carried out by
P. Sai Vivek, M. Sowmya, M. Sai Vardhan, P. Srirama Tejaswi, P. Rajesh, and
P. HariKrishna as a part of the B.Tech fourth year, first semester, of
Information Technology of Andhra University, Visakhapatnam, during the
year 2019-2023.
We, P. Sai Vivek, M. Sowmya, M. Sai Vardhan, P. Srirama Tejaswi,
P. Rajesh, and P. HariKrishna, students of the seventh semester of B.Tech,
Information Technology, at ANITS, Visakhapatnam, hereby declare that
the project entitled “CUSTOMER SEGMENTATION CLASSIFICATION” was
carried out by us and is submitted in fulfillment of the requirements for the
award of Bachelor of Technology in Information Technology, under Anil
Neerukonda Institute of Technology and Sciences, during the academic
year 2019-2023, and has not been submitted to any other university for
the award of any kind of degree.
P.SAI VIVEK 319126511046
M.SOWMYA 319126511036
M.SAI VARDHAN 319126511037
P.S.R.TEJASWI 319126511042
P.RAJESH 319126511044
P.HARIKRISHNA 319126511045
ACKNOWLEDGEMENT
We would like to express our deep gratitude to our project guide,
Mrs. D. Chandrika, Associate Professor, Department of Information Technology,
ANITS, for her guidance, unsurpassed knowledge, and immense
encouragement. We are grateful to Dr. Mantripragada Rekha Sundari, Head of
the Department, Information Technology, for providing us with the required
facilities for the completion of the project work.
We are very thankful to the Principal, ANITS, Sangivalasa, for their
encouragement and cooperation in carrying out this work. We are also very
thankful to the Management, ANITS, Sangivalasa, for their encouragement
and cooperation in carrying out this work.
We express our thanks to all the teaching faculty of the Department of IT, whose
suggestions during reviews helped us in the accomplishment of our project. We
would like to thank all the non-teaching staff of the Department of IT, ANITS, for
providing great assistance in the accomplishment of our project.
We would like to thank our parents, friends, and classmates for their
encouragement throughout our project period. Last but not least, we
thank everyone who supported us directly or indirectly in completing this
project successfully.
PROJECT STUDENTS
P.SAI VIVEK 319126511046
M.SOWMYA 319126511036
M.SAI VARDHAN 319126511037
P.S.R.TEJASWI 319126511042
P.RAJESH 319126511044
P.HARIKRISHNA 319126511045
ANIL NEERUKONDA INSTITUTE OF TECHNOLOGY AND SCIENCES
(Affiliated to Andhra University)
SANGIVALASA, VISAKHAPATNAM -531162
2019-2022
CERTIFICATE
This is to certify that this project report “CUSTOMER SEGMENTATION
CLASSIFICATION” is the bonafide work of P. Sai Vivek, M. Sowmya, M. Sai
Vardhan, P. Srirama Tejaswi, P. Rajesh, and P. HariKrishna,
submitted in partial fulfillment of the requirements for the award of the
degree of Bachelor of Technology in Information Technology of Anil
Neerukonda Institute of Technology and Sciences, Visakhapatnam, and is a record of
bonafide work carried out under my guidance and supervision.
Project Guide
Mrs. D. Chandrika
(ASSISTANT PROFESSOR)
Department of IT
ANITS
Head of the Department
Dr. Mantripragada Rekha Sundari
Department of IT
ANITS
CUSTOMER SEGMENTATION
CLASSIFICATION
Introduction
Customer segmentation is one of the most important applications of unsupervised learning. Using clustering
techniques, companies can identify distinct segments of customers, allowing them to target the potential
user base. In this machine learning project, we will make use of K-means clustering, which is an essential
algorithm for clustering unlabelled datasets.
SCOPE
Whenever you need to find your best customer, customer segmentation is the ideal methodology. We will
perform one of the most essential applications of machine learning – Customer Segmentation. In this project,
we will implement customer segmentation in R.
PLATFORM: R STUDIO
R was specifically designed for statistical analysis, which makes it highly suitable for data science
applications. Although the learning curve for programming with R can be steep, especially for people
without prior programming experience, the tools now available for carrying out text analysis in R make it
easy to perform powerful, cutting-edge text analytics using only a few simple commands. One of the keys to
R’s explosive growth has been its densely populated collection of extension software libraries, known in R
terminology as packages, supplied and maintained by R’s extensive user community. Each package extends
the functionality of the base R language and core packages, and in addition to functions and data must
include documentation and examples, often in the form of vignettes demonstrating the use of the package.
The best-known package repository, the Comprehensive R Archive Network (CRAN), currently has over
10,000 published packages.
Text analysis in particular has become well established in R. There is a vast collection of dedicated text
processing and text analysis packages, covering everything from low-level string operations to advanced text modelling
techniques such as fitting Latent Dirichlet Allocation models. One of the main advantages
of performing text analysis in R is that it is often possible, and relatively easy, to switch between different
packages or to combine them. Recent efforts among the R text analysis developers’ community are designed
to promote this interoperability to maximize flexibility and choice among users. As a result, learning the
basics for text analysis in R provides access to a wide range of advanced text analysis features.
PROJECT SPECIFICATION
⮚ RStudio version 1.2.5033
HARDWARE SPECIFICATIONS
⮚ Microsoft® Windows® 7/8/10 (32- or 64-bit)
⮚ 3 GB RAM minimum, 8 GB RAM recommended;
⮚ 2 GB of available disk space minimum
⮚ Intel Core i3 processor or above.
DATASET
⮚ Mall_Customers.csv
PACKAGES REQUIRED:
plotrix
purrr
cluster
gridExtra
grid
NbClust
factoextra
ggplot2
dplyr
CHAPTER NO: 1
What is Customer Segmentation?
Customer segmentation is the process of dividing the customer base into several groups of individuals who
share similarities in ways that are relevant to marketing, such as gender, age, interests, and
miscellaneous spending habits.
Companies that deploy customer segmentation work on the premise that every customer has different
requirements and that a specific marketing effort is needed to address them appropriately. Companies aim to gain a
deeper understanding of the customers they are targeting. Therefore, their aim has to be specific and should be
tailored to address the requirements of each individual customer. Furthermore, through the data
collected, companies can gain a deeper understanding of customer preferences as well as of what is required
to discover valuable segments that would bring them maximum profit. This way, they can strategize their
marketing techniques more efficiently and minimize the risk to their investment.
The technique of customer segmentation depends on several key differentiators that divide customers
into groups to be targeted. Data related to demographics, geography, economic status, and behavioral
patterns play a crucial role in determining the company's direction in addressing the various segments.
IMPLEMENTATION:
In the first step of this data science project, we will perform data exploration. We will import the essential
packages required for this task and then read our data. Finally, we will go through the input data to gain
the necessary insights about it.
READING DATA FROM MALL_CUSTOMERS.CSV:
Before starting the customer segmentation analysis, the first step is to read the data on which the
analysis will be performed. The data is saved in a dataset named Mall_Customers.csv. This dataset contains 400
records of various types of customers. The records saved in the dataset are raw and have not yet been
processed. To perform the analysis, the dataset is read using the command “read.csv”.
customer_data=read.csv("C:/home/desktop/Mall_Customers.csv")
Fig no- 1 Mall_Customers.csv
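As a rough sketch of the exploration step, the snippet below reads a CSV file and inspects it with names(), str(), head() and summary(). It does not use the real Mall_Customers.csv: the header row is an assumption inferred from the column names that appear later in the plotting code, and the three data rows are made-up values. It also illustrates how read.csv() rewrites headers that contain spaces and brackets into syntactic R names.

```r
# Hypothetical stand-in for Mall_Customers.csv: the header is an assumption inferred
# from the column names used in the plotting code, and the rows are made-up values.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)",
  "1,Male,19,15,39",
  "2,Female,21,15,81",
  "3,Female,20,16,6"
), csv_path)

# read.csv() converts headers into syntactic names (check.names = TRUE by default),
# so spaces and brackets become dots.
toy_data <- read.csv(csv_path)

names(toy_data)    # column names after conversion
str(toy_data)      # column types and a preview of the values
head(toy_data)     # first rows of the data frame
summary(toy_data)  # per-column summary statistics
```

Running names() first is a cheap way to confirm the exact column names before they are used in later plotting or clustering calls.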
CHAPTER NO: 2
Customer Gender Visualization:
In this section, we will create a bar plot and a pie chart to show the gender distribution across our customer_data
dataset. A bar chart represents data as rectangular bars, with the length of each bar proportional to the value of the
variable. R uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars in
a bar chart, and each bar can be given a different color.
Fig no- 2 Gender Comparison
From the graph below, we conclude that the percentage of females in the customer dataset is 56%, whereas
the percentage of males is 44%.
Fig no- 3 Gender ratio
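The two gender charts described above could be produced along the following lines. This is only a sketch under stated assumptions: the gender vector is a made-up toy sample rather than the report's data, and base R's pie() is used here in place of any 3D pie function from plotrix.

```r
# Draw to a null device when run non-interactively (avoids writing Rplots.pdf).
if (!interactive()) pdf(NULL)

# Toy gender column (hypothetical values, not taken from Mall_Customers.csv).
gender <- c("Female", "Male", "Female", "Female", "Male",
            "Female", "Male", "Female", "Male", "Female")

# Count the customers in each gender category.
gender_counts <- table(gender)

# Bar plot: bar height is proportional to the count in each category.
barplot(gender_counts,
        main = "Gender Comparison",
        xlab = "Gender",
        ylab = "Count",
        col = c("tomato", "steelblue"))

# Pie chart: convert counts to rounded percentages and use them as slice labels.
gender_pct <- round(100 * prop.table(gender_counts))
pie(gender_counts,
    labels = paste0(names(gender_counts), " ", gender_pct, "%"),
    main = "Gender Ratio",
    col = c("tomato", "steelblue"))
```

With the real data, gender would simply be replaced by the gender column of customer_data.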
Visualization of Age Distribution
Let us plot a histogram to view the frequency distribution of customer ages. We will first proceed
by taking a summary of the Age variable.
Code:
summary(customer_data$Age)
hist(customer_data$Age,
col="blue",
main="Histogram to Show Count of Age Class",
xlab="Age Class",
ylab="Frequency",
labels=TRUE)
Fig no- 4 Age Distribution
From the summary and the histogram above, we conclude that most customers are aged between
30 and 35. The minimum age of customers is 18, whereas the maximum age is 70.
Analysis of the Annual Income of the Customers:
In this section of the R project, we will create visualizations to analyze the annual income of the customers.
We will plot a histogram and then we will proceed to examine this data using a density plot.
Fig no- 5 Annual Income
From the above descriptive analysis, we conclude that the minimum annual income of the
customers is 15 and the maximum income is 137 (in thousands, as the column name
Annual.Income..k.. indicates). People earning an average income of 70
have the highest frequency count in our histogram distribution. The average income of all the
customers is 60.56. In the kernel density plot, we observe that the
annual income is approximately normally distributed.
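A histogram and kernel density plot of this kind might be drawn as sketched below. The income values are simulated purely for illustration (an assumption, not the report's data), and the variable names income_k, h, modal_bin and dens are hypothetical.

```r
# Draw to a null device when run non-interactively (avoids writing Rplots.pdf).
if (!interactive()) pdf(NULL)

# Simulated annual incomes in thousands (toy values, not the report's data).
set.seed(7)
income_k <- pmax(round(rnorm(200, mean = 60, sd = 25)), 1)

# Descriptive statistics: minimum, quartiles, mean and maximum.
summary(income_k)

# Histogram of income; hist() also returns the bin counts and midpoints.
h <- hist(income_k,
          col = "steelblue",
          main = "Histogram of Annual Income",
          xlab = "Annual Income (thousands)",
          ylab = "Frequency",
          labels = TRUE)

# Midpoint of the most populated bin, i.e. the income class with the highest count.
modal_bin <- h$mids[which.max(h$counts)]

# Kernel density estimate, drawn as a filled curve.
dens <- density(income_k)
plot(dens,
     main = "Density Plot of Annual Income",
     xlab = "Annual Income (thousands)")
polygon(dens, col = "lightblue")
```

On the real data, income_k would be replaced by customer_data$Annual.Income..k...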
CHAPTER NO: 3
K-means Algorithm
When using the k-means clustering algorithm, the first step is to specify the number of clusters
(k) that we wish to produce in the final output. The algorithm starts by randomly selecting k objects from
the dataset to serve as the initial centers of our clusters. These selected objects are
the initial cluster means, also known as centroids. Each remaining object is then assigned to
its closest centroid, where closeness is measured by the Euclidean distance between the
object and the cluster mean. We refer to this step as “cluster assignment”. When the assignment
is complete, the algorithm calculates the new mean value of each cluster in the
data. After the centers have been recalculated, every observation is checked to see whether it is now
closer to a different cluster, and the objects are reassigned using the updated cluster means. This is
repeated over several iterations until the cluster assignments stop changing, that is, until the clusters
in the current iteration are the same as the ones obtained in the previous
iteration.
Summing up the K-means clustering –
⮚ We specify the number of clusters (k) that we need to create.
⮚ The algorithm selects k objects at random from the dataset. These objects are the initial cluster centers, or means.
⮚ Each observation is assigned to its closest centroid. We base this assignment on the
Euclidean distance between the object and the centroid.
⮚ For each of the k clusters, the centroid is updated by calculating the new mean values of all
the data points in the cluster. The k-th cluster’s centroid is a vector of length p that contains the means of all
variables for the observations in the k-th cluster, where p denotes the number of variables.
⮚ The total within-cluster sum of squares is minimized iteratively: the assignment and update steps are
repeated until the assignments stop changing or the maximum number of iterations is reached. The
default maximum number of iterations used by R's kmeans() is 10.
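The steps summarized above can be sketched as a small hand-written function. This is an illustrative version of the basic assignment and update loop, not the implementation behind R's kmeans(); the function name simple_kmeans and the toy data are hypothetical.

```r
# Minimal sketch of the k-means assignment/update loop on a numeric matrix.
simple_kmeans <- function(x, k, max_iter = 10) {
  x <- as.matrix(x)
  # Pick k distinct rows at random as the initial centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid (n x k matrix).
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    # Cluster assignment: index of the nearest centroid for each point.
    new_assignment <- max.col(-d2, ties.method = "first")
    # Stop once the assignments no longer change.
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Update step: each centroid becomes the mean of the points assigned to it.
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Toy data: two well-separated groups of 20 points each (made-up values).
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)  # number of points assigned to each cluster
```

In practice the built-in kmeans() is preferable, since it supports multiple random starts through nstart and several algorithm variants.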
To determine the optimal number of clusters, we run the clustering algorithm for several values of k. This
can be done by varying k from 1 to 10 clusters. We then calculate the total intra-cluster sum of squares (iss)
for each value. Then, we proceed to plot iss against the number of clusters k. This plot indicates the
appropriate number of clusters for our model: the location of a bend, or knee, in the plot is the
indication of the optimum number of clusters. Let us implement this in R as follows –
Code:
library(purrr)
set.seed(123)
# function to calculate total intra-cluster sum of square
iss <- function(k) {
kmeans(customer_data[,3:5],k,iter.max=100,nstart=100,algorithm="Lloyd" )$tot.withinss
}
k.values <- 1:10
iss_values <- map_dbl(k.values, iss)
plot(k.values, iss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total intra-clusters sum of squares")
Fig no- 6 Clusters
CHAPTER NO- 4
Visualizing the Clustering Results using the First Two Principal Components:
A scatter plot is a type of chart that displays each observation as a point positioned by the values of
two variables. It is a basic type of chart common in many fields and is helpful in depicting
how groups of observations separate from one another. We are going to use ggplot to plot annual income
against spending score, with each point coloured by its assigned cluster.
Code:
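library(ggplot2)
set.seed(1)
# Assumption: k6 holds the k-means result for k = 6 on the same columns used in the
# elbow analysis; its definition does not appear in this report, and nstart is a guess.
k6 <- kmeans(customer_data[,3:5], 6, iter.max=100, nstart=50, algorithm="Lloyd")
ggplot(customer_data, aes(x =Annual.Income..k.., y = Spending.Score..1.100.)) +
geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
scale_color_discrete(name=" ",
breaks=c("1", "2", "3", "4", "5","6"),
labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")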
Fig no-7 visualization
From the above visualization, we observe that there is a distribution of 6 clusters as follows –
Cluster 6 and 4 – These clusters represent the customers with a medium annual
income as well as a medium annual spend.
Cluster 1 – This cluster represents the customers having a high annual income as well as
a high annual spend.
Cluster 3 – This cluster denotes the customers with a low annual income as well as a low
yearly spend.
Cluster 2 – This cluster denotes customers with a high annual income and a low yearly spend.
Cluster 5 – This cluster represents customers with a low annual income but a high yearly expenditure.
Fig no- 8 k-mean visualization
Cluster 4 and 1 – These two clusters consist of customers with medium PCA1 and medium
PCA2 scores.
Cluster 6 – This cluster represents customers having a high PCA2 and a low PCA1.
Cluster 5 – In this cluster, there are customers with a medium PCA1 and a low PCA2 score.
Cluster 3 – This cluster comprises customers with a high PCA1 and a high PCA2.
Cluster 2 – This cluster comprises customers with a high PCA2 and a medium annual spend.
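A plot of clusters on the first two principal components, like the one interpreted above, could be produced roughly as follows. This is a sketch on simulated data with two clusters, not the report's six-cluster result; the data frame toy and its columns are hypothetical stand-ins for the age, income and spending score variables.

```r
# Draw to a null device when run non-interactively (avoids writing Rplots.pdf).
if (!interactive()) pdf(NULL)

# Simulated stand-in for the age, income and spending score columns (toy values).
set.seed(1)
toy <- data.frame(
  Age    = c(rnorm(30, mean = 25, sd = 3), rnorm(30, mean = 50, sd = 3)),
  Income = c(rnorm(30, mean = 30, sd = 5), rnorm(30, mean = 80, sd = 5)),
  Score  = c(rnorm(30, mean = 70, sd = 8), rnorm(30, mean = 30, sd = 8))
)

# Cluster the observations, then project them onto their principal components.
fit <- kmeans(toy, centers = 2, nstart = 25)
pca <- prcomp(toy, scale. = FALSE)
summary(pca)  # proportion of variance explained by each component

# Scatter plot of the first two principal components, coloured by cluster.
cluster_ids <- sort(unique(fit$cluster))
plot(pca$x[, 1:2],
     col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "K-means clusters on the first two principal components")
legend("topright",
       legend = paste("Cluster", cluster_ids),
       col = cluster_ids, pch = 19)
```

Whether to set scale. = TRUE depends on whether the variables share comparable units; it is left off here only to keep the sketch simple.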
With the help of clustering, we can understand the variables much better, which helps us
make careful decisions. Once customer segments have been identified, companies can release products
and services that target customers based on several parameters such as income, age, and spending
patterns. Furthermore, more complex data such as product reviews can be taken into
consideration for better segmentation.
CONCLUSION
In this data science project, we went through the customer segmentation model. We
developed this using a class of machine learning known as unsupervised learning.
Specifically, we made use of a clustering algorithm called K-means clustering. We analysed
and visualized the data and then proceeded to implement our algorithm in R.