3.2 E.D.A: DIGGING INTO DATASETS
Data scientists utilize Exploratory Data Analysis (EDA) to
analyze and examine data sets and to describe their key
properties, frequently with the help of data visualization
techniques. It assists data scientists in determining how to
best modify data sources to obtain the answers they require,
making it easier for them to detect patterns, identify
anomalies, test a hypothesis, or confirm assumptions. In
this session, we’ll get to understand it better through
hands-on learning with a simulated but close to real-life
dataset, as presented below.
A dedicated customer base says a lot about a company or
a service. It shows not only that customers are satisfied
with the company’s products or services, but also that the
company puts a lot of effort into building a relationship
with them. Loyal clients are more likely to stick with a
business, recommend it to their friends and coworkers, and
choose it over competitors.
It is much easier to keep existing customers than to find
new ones. This means that a business must focus more on
preserving its relationships with current customers than on
attempting to draw in new ones. This is also the reason it is
very important to know the customer churn rate.
Customer churn is the rate at which a business loses
customers over a given period of time. A high customer
turnover rate indicates that many customers have lost
interest in buying the products or services for a variety of
reasons, which may indicate that there are problems with the
business.
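As a quick illustration of the definition above, churn rate is commonly computed as the number of customers lost during a period divided by the number of customers at the start of that period. The sketch below assumes that formula. The function name churnRate and the figures used are hypothetical and are not part of this session's dataset.

```r
# Hypothetical helper: churn rate as a percentage for one period.
# Assumes churn rate = customers lost / customers at the start of the period.
churnRate <- function(customersAtStart, customersLost) {
  stopifnot(customersAtStart > 0,
            customersLost >= 0,
            customersLost <= customersAtStart)
  customersLost / customersAtStart * 100
}

# Made-up figures: 500 customers at the start of the month, 25 of them left
exampleRate <- churnRate(500, 25)
cat("Churn rate:", exampleRate, "%\n")
```

The same calculation can be repeated per month or per quarter to see whether the rate is rising or falling.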
HowTo : DataMine&AnalyzeWithR Page 114 of 131 EBBertulfo
In this session, you will examine a simple dataset and
determine why some customers abandon a service. The
provided dataset is designed to model customer behavior in
a subscription-based business. You will be taking a closer
look at the dataset to find possible reasons why some
customers stayed and others left the service being offered.
With this module, you will be able to execute basic data
analysis steps and hopefully master some key data
exploration approaches.
Objectives:
• Understand the structure of a dataset and explore its
contents.
• Calculate basic statistics (like averages) to summarize data.
• Use simple visualizations to identify patterns.
• Draw basic conclusions about customer behavior based on
your analysis.
The Dataset:
Let’s start first by understanding the dataset that we are
going to use in this learning session. When a customer
"churns," it means they have stopped using a product or
service. This typically raises a lot of questions for
companies that value their customers. More often than not,
they want to know why customers churn so they can
address these reasons and improve customer retention.
We will create a sample dataset that represents a close to
real-world situation, with a sample of 200 customers. The
dataset will include customerID [1-200], average
monthlyUsage [in hours], satisfactionScore [1-10],
subscriptionLength [in months], paymentStatus [0-1], where 0
means the customer missed a subscription payment and 1
means that the account is active, and churn [0-1], where 0
means the customer is still actively using the service and 1
means the customer quit using the service. To create this
simulated dataset, let’s run the R script below:
# Create the simulated dataset
set.seed(123) # fix the random seed so this dataset can be reproduced
# Start with one vector per column
customerID <- 1:200
monthlyUsage <- round(rnorm(200, mean = 30, sd = 10), 1)
satisfactionScore <- sample(1:10, 200, replace = TRUE)
subscriptionLength <- sample(1:24, 200, replace = TRUE)
paymentStatus <- sample(c(0, 1), 200, replace = TRUE, prob = c(0.2, 0.8))
churn <- sample(c(0, 1), 200, replace = TRUE, prob = c(0.7, 0.3))
# Once the vectors are all set, combine them using data.frame()
customerData <- data.frame(customerID, monthlyUsage, satisfactionScore,
                           subscriptionLength, paymentStatus, churn)
# Check that customerData has been created by printing its first six rows
head(customerData)
Now that we have a working dataset, you can describe it
first through these guide questions. Try writing some R
scripts to answer the questions below. Doing this before
starting any deeper data analysis will allow you to
understand the dataset better, which will then lead you to a
much more comprehensive description of the data. If you make
this process a habit, it will help you efficiently build your
R scripting and data analysis skills.
• How many customers are represented in this dataset?
• What information is available for each customer?
• Are there any missing values or unusual entries in each
column? (For example, negative values or zeros in
columns where they don’t make sense.)
• What are the minimum, maximum, and range of
subscription lengths in the dataset?
• What does this tell you about the customer base?
• What percentage of customers have missed a payment
versus those who haven’t?
• What does this distribution tell you about payment
reliability?
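One possible way to approach the guide questions above is sketched below using base R only. The helper name describeCustomers is hypothetical, and it assumes the data frame has the columns created earlier in this session. Treat it as a starting point for your own scripts rather than the expected answer.

```r
# Hypothetical helper that gathers quick answers to the guide questions.
# df is assumed to have the columns created earlier in this session.
describeCustomers <- function(df) {
  list(
    customerCount    = nrow(df),                   # how many customers
    columns          = names(df),                  # information kept per customer
    missingByColumn  = colSums(is.na(df)),         # count of NA values per column
    negativeUsage    = sum(df$monthlyUsage < 0, na.rm = TRUE),  # unusual entries
    lengthRange      = range(df$subscriptionLength),            # c(minimum, maximum)
    pctMissedPayment = mean(df$paymentStatus == 0) * 100        # 0 = missed payment
  )
}

# Run it only if the session's data frame is available
if (exists("customerData")) {
  str(describeCustomers(customerData))
}
```

Calling summary(customerData) is another quick way to see the minimum, maximum, and quartiles of every column at once.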
Once you get the hang of it, let’s dig deeper into the
dataset and determine the following:
1. Calculate average values for monthly usage and
satisfaction score to understand general customer
behavior.
# Calculate average monthly usage
avgUsage <- mean(customerData$monthlyUsage)
# Calculate average satisfaction score
avgSatisfaction <- mean(customerData$satisfactionScore)
cat("Average Monthly Usage:", avgUsage, "\n")
cat("Average Satisfaction Score:", avgSatisfaction, "\n")
• What is the average usage time per month?
• Are customers generally satisfied (based on the average
satisfaction score)?
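An average alone can hide how spread out the values are, so it can help to look at the median and standard deviation alongside it. The sketch below is one way to do that with base R. The helper name summarizeColumns is hypothetical, and the column names are the ones assumed from the dataset created earlier.

```r
# Hypothetical helper: mean, median, and standard deviation for chosen columns
summarizeColumns <- function(df, cols) {
  sapply(df[cols], function(x) {
    c(mean = mean(x), median = median(x), sd = sd(x))
  })
}

# Run it only if the session's data frame is available
if (exists("customerData")) {
  print(round(summarizeColumns(customerData,
                               c("monthlyUsage", "satisfactionScore")), 2))
}
```

The result is a small matrix with one column per variable and one row per statistic, which makes the two variables easy to compare side by side.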
2. See if customers who churn (leave the service) have lower
satisfaction scores or lower monthly usage than those
who stay.
# Average usage and satisfaction for customers who churned vs. stayed
avgUsageChrnd <- mean(customerData$monthlyUsage[customerData$churn == 1])
avgUsageRtnd <- mean(customerData$monthlyUsage[customerData$churn == 0])
avgSatChrnd <- mean(customerData$satisfactionScore[customerData$churn == 1])
avgSatRtnd <- mean(customerData$satisfactionScore[customerData$churn == 0])
# Display results
cat("Average Monthly Usage (Churned):", avgUsageChrnd, "\n")
cat("Average Monthly Usage (Retained):", avgUsageRtnd, "\n")
cat("Average Satisfaction Score (Churned):", avgSatChrnd, "\n")
cat("Average Satisfaction Score (Retained):", avgSatRtnd, "\n")
• Do customers who churn tend to have lower satisfaction
scores than those who stay?
• Is monthly usage lower for customers who churn?
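The four separate mean() calls above can also be expressed as a single grouped summary. The sketch below uses base R's aggregate() with a formula. The helper name meansByChurn is hypothetical, and it assumes the column names from the dataset created earlier.

```r
# Hypothetical helper: average usage and satisfaction for each churn group
meansByChurn <- function(df) {
  aggregate(cbind(monthlyUsage, satisfactionScore) ~ churn,
            data = df, FUN = mean)
}

# Run it only if the session's data frame is available
if (exists("customerData")) {
  print(meansByChurn(customerData))
}
```

The output has one row per churn value (0 = retained, 1 = churned). Because this dataset is simulated, any gap between the two groups may be small, so compare the numbers carefully before drawing conclusions.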
3. See if there is a connection between payment issues and
customer churn.
# Calculate the percentage of customers with payment issues
# (0 = missed payment) by churn status
# Note: R is case-sensitive, so the column must be written as churn, not Churn
paymentIssuesChrnd <-
  mean(customerData$paymentStatus[customerData$churn == 1] == 0)
paymentIssuesRtnd <-
  mean(customerData$paymentStatus[customerData$churn == 0] == 0)
# Display results
cat("Percentage of Payment Issues (Churned):", paymentIssuesChrnd * 100, "%\n")
cat("Percentage of Payment Issues (Retained):", paymentIssuesRtnd * 100, "%\n")
• Are customers who leave the service more likely to have
payment issues?
• Does it seem that missed payments contribute to
customer churn?
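A cross-tabulation gives the same comparison in one table. The sketch below uses base R's table() and prop.table() to show, for each churn group, the percentage of customers with each payment status. The helper name paymentByChurn is hypothetical.

```r
# Hypothetical helper: row percentages of payment status within each churn group
paymentByChurn <- function(df) {
  counts <- table(churn = df$churn, paymentStatus = df$paymentStatus)
  # margin = 1 computes percentages within each row (each churn group)
  round(prop.table(counts, margin = 1) * 100, 1)
}

# Run it only if the session's data frame is available
if (exists("customerData")) {
  print(paymentByChurn(customerData))
}
```

Read the table row by row: the column labeled 0 is the share of that churn group that missed a payment.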
4. Use simple plots to see how monthly usage and
satisfaction score vary by churn status.
# Boxplot for Monthly Usage by Churn Status
boxplot(customerData$monthlyUsage ~ customerData$churn,
main = "Monthly Usage by Churn Status",
xlab = "Churn Status (0 = Retained, 1 = Churned)",
ylab = "Monthly Usage (hours)",
col = c("lightblue", "lightpink"))
# Boxplot for Satisfaction Score by Churn Status
boxplot(customerData$satisfactionScore ~ customerData$churn,
main = "Satisfaction Score by Churn Status",
xlab = "Churn Status (0 = Retained, 1 = Churned)",
ylab = "Satisfaction Score",
col = c("lightgreen", "lightcoral"))
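The 0 and 1 labels on the x-axis can be replaced with readable group names by converting churn to a factor first. The sketch below is one way to do that. The helper name usageBoxplot and the labels "Retained" and "Churned" are assumptions made for illustration. Setting plot = FALSE skips the drawing and returns only the numbers behind each box.

```r
# Hypothetical helper: boxplot of monthly usage with readable group labels.
# With plot = FALSE, nothing is drawn and only the box statistics are returned.
usageBoxplot <- function(df, plot = TRUE) {
  df$churnLabel <- factor(df$churn, levels = c(0, 1),
                          labels = c("Retained", "Churned"))
  boxplot(monthlyUsage ~ churnLabel, data = df, plot = plot,
          main = "Monthly Usage by Churn Status",
          xlab = "Churn Status", ylab = "Monthly Usage (hours)",
          col = c("lightblue", "lightpink"))
}

# Run it only if the session's data frame is available
if (exists("customerData")) {
  boxStats <- usageBoxplot(customerData, plot = FALSE)
  print(boxStats$stats)  # rows: lower whisker, Q1, median, Q3, upper whisker
  print(boxStats$n)      # number of customers in each group
}
```

Comparing the median row between the two columns is a quick numeric check of what the plot shows visually.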
5. Summarize your findings to answer the main question:
Why do some customers churn?
Based on your analysis, write a brief summary of the
factors that appear to influence customer churn. Consider
these questions:
• Do customers with lower usage and satisfaction scores
churn more often?
• Are payment issues more common among customers
who churn?
This marks the end of this module’s learning objectives,
but please keep your data and make sure to save it, as we will
use it to prepare a document to better present our findings.
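One simple way to save the data is to write it to a CSV file and read it back to confirm that nothing was lost. The sketch below uses base R's write.csv() and read.csv(). The helper name saveAndCheck and the file name customerData.csv are assumptions, so use whatever name and folder suit your own work.

```r
# Hypothetical helper: write a data frame to CSV, then read it back to verify
saveAndCheck <- function(df, path) {
  write.csv(df, path, row.names = FALSE)
  reloaded <- read.csv(path)
  # TRUE when the saved copy has the same shape and column names
  identical(dim(reloaded), dim(df)) && identical(names(reloaded), names(df))
}

# Run it only if the session's data frame is available
if (exists("customerData")) {
  # "customerData.csv" is an assumed file name in the current working directory
  print(saveAndCheck(customerData, "customerData.csv"))
}
```

Because the script that creates the data uses set.seed(123), rerunning that script is another way to get the same dataset back.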