Customer Segmentation
Computer Science and Engineering Department
Thapar Institute of Engineering and Technology
(Deemed to be University), Patiala – 147004
Machine Learning Project
Submitted By:
Name : Yogesh Rathee
Roll No : 102103022
Name : Jagveer Singh
Roll No : 102103024
Submitted To:
Ms. Kudratdeep Aulakh
Index
1. Introduction
2. Libraries Used
3. Algorithm(s) Used
4. Code and Screenshots
1. Introduction
1.1 Mall Customer Segmentation Data
https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python
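This data set was created only for learning customer segmentation
concepts. Customer segmentation is sometimes loosely grouped with market
basket analysis, although the latter strictly refers to finding items
that are frequently bought together. We demonstrate segmentation using
an unsupervised ML technique (the KMeans clustering algorithm) in its
simplest form.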
1.2 Description of dataset
You own a supermarket mall and, through membership cards, you have
some basic data about your customers: Customer ID, age, gender, annual
income and spending score. Spending Score is a value you assign to each
customer based on parameters you define, such as customer behavior and
purchasing data.
2. Libraries Used:
NumPy: NumPy is a Python library for efficient numerical computation,
offering multi-dimensional array support and a wide range of
mathematical functions. It is widely used in data analysis, scientific
research, and machine learning.
Pandas: Pandas is a Python library for data manipulation and analysis,
offering DataFrames and Series for working with structured data
efficiently.
Matplotlib.pyplot: matplotlib.pyplot is the plotting interface of the
Matplotlib library, used for creating 2D data visualizations such as
plots and charts. It is a fundamental tool for data visualization in
Python.
Seaborn: Seaborn is a Python library built on top of Matplotlib for
creating appealing and informative statistical data visualizations.
Sklearn: Scikit-Learn (sklearn) is a Python library for machine learning,
offering a broad set of tools and algorithms for various tasks in data
science and artificial intelligence.
3. Algorithm(s) Used
K-means clustering : K-means clustering is a popular unsupervised machine
learning algorithm. Its main task is to group data into a fixed number of clusters,
often referred to as "k." These clusters are formed based on the similarities
between data points, aiding data segmentation and organization.
The algorithm operates iteratively. Initially, it places "k" cluster centers
randomly within the data space. Data points are then assigned to the nearest
cluster center, typically using Euclidean distance. The cluster centers are then
recalculated as the mean of their assigned data points. This process repeats until
the cluster assignments and centers no longer change significantly.
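The iteration described above can be sketched in plain NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this report. The function names (kmeans_sketch, nearest_center), the tolerance, and the synthetic two-blob data are assumptions made for the example, and the initial centers are drawn from the data points themselves, which is one common way of placing them randomly.

```python
import numpy as np


def nearest_center(X, centers):
    """Return, for each row of X, the index of its closest center."""
    # Pairwise Euclidean distances, shape (n_points, n_centers).
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans_sketch(X, k, max_iters=100, tol=1e-6, seed=0):
    """Hypothetical helper: plain k-means with random initial centers."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iters):
        # Step 2: assign every point to its nearest center.
        labels = nearest_center(X, centers)

        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)

        # Step 4: stop once the centers no longer move significantly.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break

    # Final assignment against the final centers.
    return nearest_center(X, centers), centers


# Two well-separated synthetic blobs (made-up data, not the mall dataset).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centers = kmeans_sketch(points, k=2)
print(centers)
```

On data this cleanly separated, the two centers should settle near the middle of each blob. Which blob ends up as cluster 0 depends on the random initial choice.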
K-means has applications in various fields, like marketing, image segmentation,
and document clustering. It is useful for revealing natural data groupings,
making it a valuable tool for data analysis and preprocessing. However, it does
have some limitations, such as sensitivity to the initial placement of cluster
centers and the need to specify "k" beforehand. The code in Section 4 addresses
these by using the k-means++ initialization, which spreads the initial centers
apart, and by using an elbow graph to choose "k". Despite these limitations,
K-means remains a versatile method for data clustering and pattern recognition.
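The sensitivity to initial centers can be observed directly. The sketch below is only an illustration under stated assumptions: it uses made-up Gaussian blobs rather than the mall dataset, and it deliberately sets init='random' and n_init=1 in scikit-learn's KMeans so that each run depends on a single random start. The variable names (X_demo, inertias, best_seed) are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data: five Gaussian blobs (not the mall dataset).
rng = np.random.default_rng(0)
blob_centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
X_demo = np.vstack(
    [rng.normal(loc=c, scale=6.0, size=(40, 2)) for c in blob_centers]
)

# One run per seed, with purely random initial centers and no restarts.
inertias = {}
for seed in range(5):
    model = KMeans(n_clusters=5, init='random', n_init=1, random_state=seed)
    model.fit(X_demo)
    inertias[seed] = model.inertia_
    print(f"seed={seed}  WCSS={model.inertia_:.1f}")

# Keeping the best of several restarts is what the n_init parameter automates.
best_seed = min(inertias, key=inertias.get)
print("lowest WCSS came from seed", best_seed)
```

If the printed WCSS values differ between seeds, some runs stopped at a poorer local optimum. Running several restarts (n_init greater than 1) and using k-means++ initialization are the usual ways to reduce this effect.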
4. Code and Screenshots
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
# loading the data from csv file to a Pandas DataFrame
customer_data = pd.read_csv('D:/ML project/ye rha tere project/Mall_Customers.csv')
# Display the first 5 rows in the dataframe
print(customer_data.head())
# finding the number of rows and columns
print(customer_data.shape)
# getting some information about the dataset
print(customer_data.info())
# checking for missing values
print(customer_data.isnull().sum())
# use columns 3 and 4 (annual income and spending score) as the clustering features
X = customer_data.iloc[:,[3,4]].values
print(X)
# finding wcss value for different number of clusters
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# plot an elbow graph
sns.set()
plt.plot(range(1,11), wcss)
plt.title('The Elbow Point Graph')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
# train the final model with 5 clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=0)
# return a label for each data point based on their cluster
Y = kmeans.fit_predict(X)
print(Y)
# plotting all the clusters and their Centroids
plt.figure(figsize=(8,8))
plt.scatter(X[Y==0,0], X[Y==0,1], s=50, c='green', label='Cluster 1')
plt.scatter(X[Y==1,0], X[Y==1,1], s=50, c='red', label='Cluster 2')
plt.scatter(X[Y==2,0], X[Y==2,1], s=50, c='yellow', label='Cluster 3')
plt.scatter(X[Y==3,0], X[Y==3,1], s=50, c='violet', label='Cluster 4')
plt.scatter(X[Y==4,0], X[Y==4,1], s=50, c='blue', label='Cluster 5')
# plot the centroids
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s=100, c='cyan', label='Centroids')
# show the legend so the cluster and centroid labels appear on the plot
plt.legend()
plt.title('Customer Groups')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()
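The WCSS value plotted in the elbow graph is the within-cluster sum of squares: the sum, over all points, of the squared distance to the center of the cluster the point belongs to. In scikit-learn this is the quantity reported by the inertia_ attribute of a fitted KMeans model. The small sketch below computes it by hand on four made-up points with hand-assigned labels. The helper name wcss_by_hand and the toy values are assumptions for illustration only.

```python
import numpy as np


def wcss_by_hand(X, labels):
    """Hypothetical helper: sum of squared distances to each cluster's mean."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        center = members.mean(axis=0)           # cluster center = mean of members
        total += ((members - center) ** 2).sum()
    return total


# Four made-up points, split into two clusters by hand.
toy_points = np.array([[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]])
toy_labels = np.array([0, 0, 1, 1])

# Cluster 0 has center (1, 2) and cluster 1 has center (6, 5); every point
# lies at squared distance 1 from its center, so the total is 4.
print(wcss_by_hand(toy_points, toy_labels))  # 4.0
```

WCSS always shrinks as the number of clusters grows, which is why the elbow graph looks for the point where the decrease levels off rather than for a minimum.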