UNVEILING CUSTOMER DIVERSITY: K-MEANS CLUSTERING
FOR SEGMENTATION ANALYSIS
A Project Report
SUBMITTED TO THE
DSEU DWARKA CAMPUS
In Partial Fulfilment of the Requirements
For the award of the degree in
Bachelor of Computer Application
SUBMITTED BY
ABHISHEK PARTI: 41221006
KARTIK MEENA: 41221076
PADMHASTAA GARG: 41221109
UNDER THE GUIDANCE OF
Ms. Komal Dhingra
DEPARTMENT OF COMPUTER SCIENCE
DSEU DWARKA CAMPUS
Sector 9, Dwarka, New Delhi
2023
Title of Project work: “Unveiling Customer Diversity: K-Means Clustering for
Segmentation Analysis”
Name of Students:
• ABHISHEK PARTI: 41221006
• KARTIK MEENA: 41221076
• PADMHASTAA GARG: 41221109
Name of Guide: Ms. KOMAL DHINGRA
DESIGNATION: Professor
Student’s signature:
Abhishek Parti:
Kartik Meena:
Padmhastaa Garg:
Head of Department
Guide’s signature
Index
S.No. Topic
1. Title of Project
2. Declaration
3. Acknowledgement
4. Introduction
5. Literature Review
6. Objective
7. Project Design
8. Work Plan and Methodology
9. Implementation/ Code etc.
10. Testing
11. Results and Findings
12. Limitations
13. Future Scope
14. Conclusions
15. References
DECLARATION
I hereby declare that the project work entitled “Unveiling Customer Diversity: K-Means
Clustering for Segmentation Analysis”, submitted to DSEU Dwarka Campus, is a record of
original work done by me under the guidance of Ms. Komal Dhingra. This project work
is submitted in partial fulfilment of the requirements for the award of the degree of
Bachelor of Computer Application. The results embodied in this report have not been
submitted to any other University or Institute for the award of any degree or diploma.
Signature of Candidates
Name of the Student
Abhishek Parti
Kartik Meena
Padmhastaa Garg
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to Professor Ms. Komal Dhingra for their guidance
and support throughout this project. Their valuable insights and expertise have been
instrumental in shaping this research. I am also thankful to my team members and friends who
provided assistance and encouragement during the project.
INTRODUCTION
Customer segmentation is a pivotal aspect of modern business strategies, aiming to
comprehend and categorize diverse customer groups based on shared characteristics.
This project employs the robust K-Means clustering algorithm to dissect a
comprehensive dataset, illuminating distinct customer segments. The journey begins
with meticulous data collection, collating multifaceted information encompassing
demographics, purchasing behaviors, and engagement patterns from various sources.
Upon assembling the dataset, thorough analysis and preprocessing techniques are
employed to ensure data quality and relevance. Exploratory data analysis (EDA) unveils
insights into customer attributes, guiding feature selection and dimensionality reduction
steps to enhance the efficacy of the clustering algorithm. Understanding the inherent
structure within the data is vital in preparing it for the subsequent model training.
Selecting the optimal number of clusters is a critical decision in K-Means clustering.
Techniques such as the Elbow Method or Silhouette Score aid in determining the most
suitable number of clusters that effectively capture the inherent patterns within the
dataset. This pivotal step significantly influences the accuracy of segmentation and
subsequent business insights.
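As an illustration of the Silhouette Score mentioned above, the following is a minimal sketch in plain Python (standard library only). The function name, the toy points, and the two candidate labelings are hypothetical and chosen only for illustration; they are not taken from the project's dataset or code.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (ranges from -1 to 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: two tight, well-separated groups.
points = [(1, 1), (1, 2), (9, 9), (9, 8)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(f"good labeling: {good:.3f}, bad labeling: {bad:.3f}")
```

A labeling that matches the natural groups scores close to 1, while one that cuts across them scores below 0. This is why the score can be compared across candidate cluster counts.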
Training the K-Means clustering model involves iteratively assigning data points to
the nearest cluster centroid and recomputing each centroid as the mean of its assigned
points, until the assignments stop changing. Through this iteration, the model partitions
the dataset into cohesive groups based on a distance metric, typically Euclidean distance.
Visualization acts as a powerful tool to comprehend the segmentation outcomes.
Projecting the clustered data into two or three dimensions, for example with
dimensionality reduction techniques like PCA or t-SNE, enables the clear representation
of distinct customer groups. These visualizations provide actionable insights, allowing
stakeholders to grasp and interpret the identified customer segments effectively.
In summary, this project embarks on an intricate journey of customer segmentation
using K-Means clustering. From diligent data collection to the selection of the optimal
number of clusters, model training, and culminating in the visual representation of
customer segments, the process unravels intricate patterns within the dataset, fostering
a deeper understanding of customer behavior and aiding informed business strategies.
LITERATURE REVIEW
1. Data Collection and Analysis: Numerous studies emphasize the significance
of comprehensive and high-quality data collection for effective customer
segmentation. Research by Kumar, Rajan, and Ravi (2016) stresses the
importance of incorporating diverse data sources, including demographic,
behavioral, and transactional data, to enrich customer profiles. Moreover,
studies by Han, Kamber, and Pei (2011) emphasize the role of exploratory
data analysis techniques to uncover meaningful patterns and insights within
the data, setting the stage for accurate segmentation.
2. Choosing the Number of Clusters: Determining the optimal number of
clusters is a critical step in K-Means clustering. The "Elbow Method," a
popular heuristic commonly traced to Thorndike (1953), identifies the
appropriate number of clusters as the point of diminishing returns in
variance explained. Additionally, Arbelaitz et al. (2013) conducted a
comparative study of various clustering validity indices, including the
Silhouette Score, highlighting their effectiveness in aiding the selection of the
optimal number of clusters.
3. Training the Model: MacQueen (1967) and Lloyd (1982) introduced the
foundational concepts of K-Means clustering, emphasizing its iterative
nature in assigning data points to the nearest cluster centroid. Later work
addresses challenges related to initialization, convergence criteria, and
scalability: Arthur and Vassilvitskii (2007) proposed the k-means++ seeding
scheme for choosing initial centroids, and Jain (2010) surveys enhancements
and variations of the algorithm, contributing to more efficient and robust
clustering models.
4. Visualization of Clusters: Visualization plays a pivotal role in interpreting
and communicating segmentation outcomes. Tung, Hou, and Han (2001)
discuss the significance of employing dimensionality reduction techniques
like PCA (Principal Component Analysis) to visualize high-dimensional data
in a lower-dimensional space, enabling the effective representation of
clusters; t-SNE (t-Distributed Stochastic Neighbor Embedding), introduced
later by van der Maaten and Hinton (2008), serves the same purpose.
Furthermore, Huang (1998) explores visualization methods that aid in the
intuitive interpretation of clustered data, facilitating actionable insights for
stakeholders.
5. Integration of Segmentation Results into Business Strategies: Studies by
Verhoef, Neslin, and Vroomen (2007) and Wedel and Kamakura (2000)
highlight the importance of integrating segmentation results into strategic
decision-making processes. These studies emphasize that successful
customer segmentation goes beyond algorithmic techniques and necessitates
the alignment of identified segments with specific marketing strategies,
product development, and personalized customer experiences.
OBJECTIVE
The objective of the "Customer Segmentation using K-Means Clustering" project
encompasses multiple facets aimed at extracting valuable insights from customer data:
1. Comprehensive Data Collection and Analysis: The primary goal is to gather a
diverse set of customer data encompassing demographics, behaviours, preferences,
and transactional history. Through thorough analysis, this project aims to
understand the inherent structure and patterns within the data, identifying key
attributes that contribute significantly to customer segmentation.
2. Identifying Optimal Number of Clusters: Another key objective is to determine
the ideal number of clusters that best represent the underlying customer segments.
Employing techniques like the Elbow Method, Silhouette Score, or other clustering
validity indices, the project seeks to find the most appropriate number of clusters
that effectively capture distinct customer groups without excessive granularity or
oversimplification.
3. Model Training for Accurate Segmentation: The project aims to train a robust K-
Means clustering model using the chosen number of clusters and relevant customer
features. Through iterative processes, the model assigns customers to clusters based
on similarity metrics, aiming for cohesive and meaningful groupings that
differentiate between various customer segments accurately.
4. Visualization for Interpretability and Insights: Visual representation of the
clustered data is crucial. Utilizing visualization techniques like PCA, t-SNE, or other
dimensionality reduction methods, the project endeavours to create visualizations
that succinctly exhibit the identified customer segments in a lower-dimensional
space. These visualizations enable stakeholders to comprehend and interpret the
clusters effectively, gaining actionable insights from the segmented data.
5. Enhancing Business Strategies and Decision-Making: Ultimately, the
overarching objective is to leverage the insights gained from customer segmentation
to inform strategic business decisions. By integrating the identified customer
segments into marketing strategies, product development, personalized
experiences, and targeted campaigns, the project seeks to optimize customer
engagement, retention, and overall business performance based on a nuanced
understanding of distinct customer groups.
In summary, the project's core objectives revolve around harnessing K-Means clustering
to dissect customer data, deriving meaningful segments, and using these segments to
drive informed business strategies, thereby fostering stronger customer relationships
and organizational growth.
PROJECT DESIGN
The design of the "Customer Segmentation using K-Means Clustering" project involves a
structured approach:
1. Data Collection and Analysis: Gathering diverse customer data covering
demographics, behaviors, and transactional history. Analyzing this data to identify
patterns and trends that will aid in segmenting customers effectively.
2. Choosing the Number of Clusters with Elbow Graph: Employing the Elbow
Method to determine the ideal number of clusters. This involves running K-Means
with varying numbers of clusters and plotting the within-cluster sum of squares
(WCSS) against the number of clusters to find the point beyond which adding
clusters yields only marginal reductions in WCSS.
3. Model Training with K-Means: Implementing the K-Means algorithm using the
chosen number of clusters and relevant customer features. The algorithm
iteratively assigns data points to clusters based on minimizing distances from
cluster centroids until convergence is achieved.
4. Visualization of Clusters: Employing visualization techniques like PCA or t-SNE
to reduce dimensions and visualize the clustered data in 2D or 3D space.
Generating scatter plots or other visual representations to showcase how
customers are grouped into distinct clusters based on their attributes.
This structured approach involves collecting, analyzing, and processing data,
determining the optimal number of clusters, training the K-Means model, and finally,
visually representing customer segments. The outcome is a clear understanding of
customer groups, facilitating informed business strategies and decision-making.
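For reference, the WCSS quantity plotted in the Elbow graph can be written as follows. The notation (k clusters C_j with centroids \mu_j and data points x_i) is introduced here for illustration only and is not taken from the project code.

```latex
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

For a fixed k, K-Means looks for cluster assignments that make this sum small. The Elbow Method then compares the resulting values across different choices of k.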
WORK PLAN
The work plan outlines the tasks required to complete the project. It follows a
structured methodology organized into five phases, from data collection and
preprocessing through to the visualization of clusters.
Phase 1: Data Collection and Preprocessing
• Gather diverse customer data including demographics, behaviors, and
transactional history.
• Cleanse and preprocess the data to ensure consistency and suitability for
analysis.
Phase 2: Exploratory Data Analysis (EDA) and Feature Selection
• Perform EDA to understand data patterns, correlations, and outliers.
• Select relevant features for segmentation based on EDA insights.
Phase 3: Choosing the Number of Clusters
• Implement the Elbow Method to determine the ideal number of clusters.
• Plot the Elbow graph using the within-cluster sum of squares (WCSS) to
identify the point of diminishing returns in cluster improvements.
Phase 4: Model Training with K-Means
• Utilize the chosen number of clusters to train the K-Means algorithm.
• Iterate the algorithm to assign data points to clusters and achieve convergence.
Phase 5: Visualization of Clusters
• Apply dimensionality reduction techniques (e.g., PCA or t-SNE) to visualize
clustered data in a lower-dimensional space.
• Generate visual representations such as scatter plots to exhibit distinct
customer clusters based on their attributes.
METHODOLOGY
Data Collection and Preprocessing:
• Collect diverse customer data and preprocess it to ensure uniformity.
• Handle missing values, encode categorical variables, and scale numerical
features.
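The three preprocessing steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the column names and values (gender, age, income) are hypothetical toy records, mean imputation is only one of several ways to handle missing values, and the helper function names are introduced here for the example.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode a categorical column as 0/1 indicator columns (sorted category order)."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def standardize(values):
    """Rescale a numeric column to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mu) / sigma for v in values]


# Hypothetical toy records, one list per column.
gender = ["F", "M", "F", "M"]
age = [20, None, 40, 30]              # one missing value
income = [15000, 40000, 65000, 40000]

gender_encoded = one_hot(gender)
age_scaled = standardize(impute_mean(age))
income_scaled = standardize(income)

# One feature row per customer: indicator columns followed by scaled numerics.
features = [g + [a, i] for g, a, i in zip(gender_encoded, age_scaled, income_scaled)]
for row in features:
    print([round(v, 3) for v in row])
```

Scaling matters here because K-Means relies on distances: without it, a column measured in tens of thousands (income) would dominate one measured in tens (age).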
Exploratory Data Analysis (EDA):
• Analyze statistical distributions, correlations, and outliers.
• Select relevant features that contribute significantly to customer segmentation.
Choosing the Number of Clusters:
• Implement the Elbow Method by varying the number of clusters in K-Means.
• Plot the WCSS against different cluster numbers to find the optimal value.
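The Elbow Method steps above can be sketched as follows. This is a minimal, standard-library-only illustration rather than the project's actual code: the compact `kmeans` helper, the use of several random restarts per value of k, and the toy data (three small groups of points) are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Compact K-Means; returns (centroids, WCSS) for one random start."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    wcss = sum(
        math.dist(p, centroids[j]) ** 2
        for j, members in enumerate(clusters)
        for p in members
    )
    return centroids, wcss


def wcss_curve(points, k_values, restarts=10):
    """Best (lowest) WCSS found for each k over several random starts."""
    return {k: min(kmeans(points, k, seed=s)[1] for s in range(restarts)) for k in k_values}


# Hypothetical toy data: three well-separated groups of four points each.
square = [(0, 0), (0, 1), (1, 0), (1, 1)]
points = [(x + dx, y + dy) for dx, dy in [(0, 0), (10, 0), (5, 10)] for x, y in square]

curve = wcss_curve(points, range(1, 7))
for k, value in curve.items():
    print(f"k={k}: WCSS={value:.2f}")
```

On data like this, the WCSS falls steeply up to k = 3 and changes little afterwards, so the bend at k = 3 is the elbow. Plotting `curve` against k gives the Elbow graph itself.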
Model Training with K-Means:
• Train the K-Means algorithm using the identified optimal number of clusters.
• Iterate the algorithm to assign data points to clusters and optimize cluster
centroids.
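The two alternating steps of training can be made explicit with a short sketch. The function names, the choice to pass the initial centroids in directly, and the (annual income, spending score) pairs are hypothetical and used only for illustration.

```python
import math


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        new_centroids.append(
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
        )
    return new_centroids


def fit_kmeans(points, initial_centroids, max_iter=100):
    """Alternate the two steps until the cluster labels stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
    return labels, centroids


# Hypothetical (annual income, spending score) pairs for six customers.
customers = [(15, 80), (16, 85), (17, 78), (80, 20), (85, 15), (82, 18)]
labels, centroids = fit_kmeans(customers, initial_centroids=[customers[0], customers[3]])
print(labels)     # cluster index per customer
print(centroids)  # final centroid per cluster
```

Passing the initial centroids explicitly keeps the example deterministic. In practice they are usually chosen at random or with a seeding scheme, and the algorithm is restarted several times.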
Visualization of Clusters:
• Utilize dimensionality reduction techniques to visualize clustered data in 2D or
3D space.
• Create visual representations (e.g., scatter plots) to display distinct customer
clusters based on their attributes.
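As one possible way to carry out the dimensionality reduction step, the sketch below projects data onto its first two principal components using only the standard library. It is an illustrative PCA built on power iteration with deflation, not the project's implementation. The function name, the fixed start vector, and the toy rows are assumptions, and a numerical library would normally be used instead.

```python
import math


def pca_2d(rows, iterations=200):
    """Project rows onto their first two principal components (needs >= 2 columns)."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    def top_eigenvector(matrix):
        # Power iteration; the start vector is arbitrary but must not be
        # orthogonal to the dominant eigenvector.
        vec = [float(i + 1) for i in range(d)]
        for _ in range(iterations):
            nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(v * v for v in nxt))
            if norm == 0:
                break
            vec = [v / norm for v in nxt]
        return vec

    components = []
    for _ in range(2):
        vec = top_eigenvector(cov)
        eigenvalue = sum(
            vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)
        )
        components.append(vec)
        # Deflation: remove the variance explained by the component just found.
        cov = [
            [cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(d)]
            for i in range(d)
        ]
    return [[sum(r[i] * comp[i] for i in range(d)) for comp in components] for r in centered]


# Hypothetical three-feature rows; most of the variation lies along one direction.
rows = [(1, 2.0, 0.5), (2, 4.1, 0.4), (3, 5.9, 0.6), (4, 8.2, 0.5), (5, 9.9, 0.45)]
projected = pca_2d(rows)
for x, y in projected:
    print(f"{x:8.3f} {y:8.3f}")
```

The two output columns can be used as the x and y coordinates of a scatter plot, with each point colored by its cluster label.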
This structured methodology involves sequential phases, from data collection and
analysis to model training and visualization, aimed at effectively segmenting customers
using K-Means clustering.
IMPLEMENTATION/ CODE etc.
The project begins with diverse data collection, followed by rigorous analysis. The
Elbow Method is used to choose the cluster count for K-Means clustering by plotting the
within-cluster sum of squares (WCSS) against the number of clusters and locating the
bend in the curve. The algorithm is then trained using this count, iteratively assigning
data points to clusters until convergence. Using techniques like PCA or t-SNE, the
clustered data is visualized in lower dimensions, and scatter plots of the result offer
clear insights into distinct customer segments, aiding stakeholder comprehension and
decision-making.
CODE:
RESULTS and FINDINGS
The key results and findings of the customer segmentation using K-Means clustering are
summarized below:
• Data Collection and Analysis: Comprehensive data collection across diverse
customer attributes provided a rich dataset for analysis. Thorough analysis unveiled
patterns and correlations among various customer characteristics, laying the
groundwork for segmentation.
• Determining Optimal Clusters: The Elbow Method was applied, revealing an
optimal cluster count through the within-cluster sum of squares (WCSS) graph. The
elbow of the curve, beyond which adding clusters reduced the WCSS only marginally,
determined the appropriate number of clusters for effective segmentation.
• Model Training and Segmentation: The K-Means algorithm efficiently segmented
customers based on shared attributes. Iterative clustering assignments resulted in
distinct and coherent customer segments reflecting different behaviors or
preferences.
• Visualization of Clusters: Leveraging visualization techniques like PCA or t-SNE, the
clustered data was projected into lower-dimensional spaces. Clear visual
representations, such as scatter plots, vividly displayed the segmented clusters,
showcasing their distinct boundaries and separations.
• Actionable Insights: The segmentation outcomes provided actionable insights into
diverse customer groups. This understanding enabled tailored marketing strategies,
personalized customer experiences, and informed decision-making, enhancing
customer engagement and satisfaction while optimizing business strategies for
specific customer segments.
LIMITATIONS
1. Sensitivity to Initial Centroids: K-Means clustering's performance can vary
significantly based on the initial placement of centroids, potentially leading to
different segmentations if initialized differently.
2. Assumption of Spherical Clusters: K-Means implicitly assumes roughly spherical
clusters of similar size, which might not be suitable for all types of data distributions.
It might struggle with elongated, non-convex, or otherwise irregularly shaped clusters.
3. Impact of Outliers: Outliers can substantially affect the clustering results, leading to
skewed centroids and potentially influencing the determination of clusters and their
boundaries.
4. Dependence on Feature Scaling: The algorithm is sensitive to feature scales.
Variables with different scales might disproportionately influence the clustering
process.
5. Selection of Optimal Clusters: Although the Elbow Method provides guidance,
determining the exact number of clusters can sometimes be subjective, especially if
the elbow point is not distinct.
6. Interpretability of Results: While visualization aids understanding, interpreting
and extracting actionable insights from high-dimensional data can still be challenging,
especially when visual separation between clusters isn't clear.
7. Handling High-Dimensional Data: K-Means might face challenges with
high-dimensional data due to the "curse of dimensionality": as dimensions are added,
Euclidean distances between points become less discriminative, which degrades
clustering quality and increases computational cost.
Addressing these limitations might involve employing alternative clustering algorithms
for irregularly shaped clusters, outlier handling techniques, careful feature selection, or
utilizing dimensionality reduction methods for enhanced interpretability, depending on
the nature of the data and objectives of the segmentation process.
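One widely used way to reduce the sensitivity to initial centroids is k-means++ seeding (Arthur and Vassilvitskii, 2007), which spreads the starting centroids apart. The sketch below is a minimal illustration and was not part of the project as described. The function name and the toy points are hypothetical.

```python
import math
import random


def kmeans_plus_plus(points, k, seed=0):
    """Pick k starting centroids, favoring points far from those already chosen."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]  # first centroid: uniformly at random
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        if sum(weights) == 0:
            break  # every point coincides with a chosen centroid; return fewer than k
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Hypothetical toy data: three well-separated groups of four points each.
square = [(0, 0), (0, 1), (1, 0), (1, 1)]
points = [(x + dx, y + dy) for dx, dy in [(0, 0), (10, 0), (5, 10)] for x, y in square]

seeds = kmeans_plus_plus(points, k=3)
print(seeds)
```

Because a point that is already a centroid has weight zero, it cannot be picked twice, and distant groups are much more likely to receive a starting centroid than with purely random initialization. The resulting seeds would then be passed to the usual K-Means iteration.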
FUTURE SCOPE
The future scope of the "Customer Segmentation using K-Means Clustering" project
presents various opportunities for advancement and expansion:
• Refinement of Segmentation Models: Further iterations can refine the K-Means
clustering model by exploring alternative clustering algorithms or ensemble
techniques to capture more intricate patterns in customer behavior beyond what K-
Means offers.
• Integration of Advanced Analytics: Incorporating advanced analytical methods
like predictive modeling or machine learning algorithms can enhance segmentation
accuracy, allowing for predictive insights into future customer behaviors and
preferences.
• Real-time Data Processing: Developing real-time or streaming data processing
capabilities can enable dynamic segmentation, allowing businesses to adapt
marketing strategies promptly based on evolving customer trends.
• Incorporating Additional Data Sources: Integration of diverse data sources, such
as social media, clickstream data, or external demographic information, can enrich
customer profiles and lead to more comprehensive segmentation.
• Personalization and Targeted Marketing: Utilizing segmentation insights to
implement personalized marketing strategies and recommendation systems,
fostering customer engagement and loyalty.
• Evaluation and Feedback Loop: Implementing a robust evaluation framework to
assess the effectiveness of segmentation strategies and incorporating feedback
loops for continuous improvement.
• AI-driven Segmentation: Exploring the use of Artificial Intelligence (AI) and
machine learning algorithms to automate segmentation processes and identify
complex patterns that might not be apparent through traditional methodologies.
• Ethical Considerations and Privacy: Integrating ethical considerations and
ensuring compliance with data privacy regulations when dealing with customer
data, maintaining transparency and trust with customers.
By embracing these future scopes, the project can evolve beyond its current state,
offering more refined and actionable insights into customer behavior and preferences,
ultimately aiding businesses in making more informed decisions and enhancing
customer-centric strategies.
CONCLUSIONS
In conclusion, the project on "Customer Segmentation using K-Means Clustering"
embarked on a journey to unravel the intricate tapestry of customer diversity and
behavior. Commencing with meticulous data collection and analysis across various
customer attributes, the project laid a robust foundation for segmentation. Leveraging
the Elbow Method to discern the optimal number of clusters facilitated precise
segmentation, aided by the visual representation of the within-cluster sum of squares
(WCSS) graph.
The implementation of the K-Means clustering algorithm efficiently partitioned
customers into distinct segments, reflecting shared characteristics and behaviors. The
iterative training process honed cohesive cluster assignments, providing a
comprehensive understanding of different customer groups. Subsequently, visualization
techniques like PCA or t-SNE projected these clusters into lower-dimensional spaces,
facilitating clear visual representations that delineated distinct boundaries among
segments.
Through this project, actionable insights into customer behavior emerged, empowering
businesses to tailor marketing strategies, create personalized experiences, and make
informed decisions. However, the project also encountered limitations inherent to the K-
Means algorithm, such as sensitivity to initial centroids and assumptions of spherical
clusters.
Despite these limitations, the project's findings laid a solid groundwork for businesses
to delve deeper into customer-centric approaches. The insights gleaned from this project
pave the way for further refinement, incorporating advanced methodologies, real-time
data processing, and ethical considerations, propelling businesses towards more
targeted, personalized, and effective strategies for enhanced customer engagement and
satisfaction.
REFERENCES
• YouTube: https://www.youtube.com/watch?v=SrY0sTJchHE&t=519s
• Kaggle: https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python/
• Medium: https://medium.com/data-and-beyond/customer-segmentation-using-k-means-clustering-with-pyspark-unveiling-insights-for-business-8c729f110fab