
CAPSTONE GRADED PROJECT -1

A PROJECT REPORT

Submitted by

BALAJI S (PGP-DSBA, JUNE 2023 TO JUNE 2024)


Introduction of the business problem
Problem Statement: -
A DTH company is facing intense competition in the current market, and retaining existing customers has become a challenge. The company therefore wants to develop a model through which it can predict churn at the account level and provide segmented offers to potential churners. Account churn is a major concern for this company because one account can have multiple customers, so losing a single account may mean losing more than one customer.
We have been assigned to develop a churn prediction model for this company and to provide business recommendations for the retention campaign. The model and campaign have to be focused and precise when offers are suggested. The offers should create a win-win situation for both the company and its customers, so that the company does not take a hit on revenue while still being able to retain customers.

Need of the study/project

This study is essential for the client to plan for the future in terms of product design, sales, and rolling out different offers for different customer segments. The outcome of this project will give a clear understanding of where the firm stands today and how much risk it can afford to take. It will also indicate the organisation's future prospects, how they can be improved, and how better planning can help retain customers in the long run.

Understanding business/social opportunity

This is a case study of a DTH company in which customers are assigned a unique account ID, and a single account ID can hold many customers (like a family plan) across gender and marital status. Customers get flexibility in the mode of payment they want to use. Customers are further segmented across the various plans they opt for based on their usage, which also depends on the device they use (computer or mobile); moreover, they earn cashback on bill payments.
The overall business runs on customer loyalty and stickiness, which in turn comes from providing quality and value-added services. Running various promotional and festival offers may also help the organisation acquire new customers and retain existing ones. We can conclude that a customer retained is recurring income for the organisation, a customer added is new income, and a customer lost has a compounded negative impact, because a single account ID holds multiple customers: the closure of one account ID means losing multiple customers.
This is a great opportunity for the company, as almost every individual or family needs a DTH connection, which in turn also leads to increased competition. The question is how a company can differentiate itself from its competitors, and which parameters play a vital role in earning customer loyalty and making customers stay. How well these responsibilities are handled will decide the best player in the market.
Data Report
Dataset of the problem: Customer Churn Data
Data Dictionary: -
• AccountID -- account unique identifier
• Churn -- account churn flag (target variable)
• Tenure -- tenure of the account
• City_Tier -- tier of the primary customer's city
• CC_Contacted_L12m -- how many times customers of the account have contacted customer care in the last 12 months
• Payment -- preferred payment mode of the customers in the account
• Gender -- gender of the primary customer of the account
• Service_Score -- satisfaction score given by customers of the account on the service provided by the company
• Account_user_count -- number of customers tagged with this account
• account_segment -- account segmentation on the basis of spend
• CC_Agent_Score -- satisfaction score given by customers of the account on the customer care service provided by the company
• Marital_Status -- marital status of the primary customer of the account
• rev_per_month -- monthly average revenue generated by the account in the last 12 months
• Complain_l12m -- whether any complaint has been raised by the account in the last 12 months
• rev_growth_yoy -- revenue growth percentage of the account (last 12 months vs. months 24 to 13)
• coupon_used_l12m -- how many times customers have used coupons for payment in the last 12 months
• Day_Since_CC_connect -- number of days since any customer in the account last contacted customer care
• cashback_l12m -- monthly average cashback generated by the account in the last 12 months
• Login_device -- preferred login device of the customers in the account
Data Ingestion: -
Loaded the required packages, set the working directory, and
loaded the data file.
The data set has 11,260 observations and 19 variables (18
independent and 1 dependent or target variable).

Table 1 – glimpse of the data-frame head with top 5 rows
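Below is a minimal loading sketch corresponding to this step; the file name, the file format, and the use of pandas' read_excel are assumptions, since the report does not state them.

    # Minimal data-ingestion sketch (file name/format are assumptions).
    import pandas as pd

    df = pd.read_excel("Customer Churn Data.xlsx")

    print(df.shape)    # expected: (11260, 19) -> 18 predictors + the target "Churn"
    print(df.head())   # glimpse of the top 5 rows, as in Table 1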


Understanding how data was collected in terms of time, frequency and methodology
• Data has been collected for 11,260 randomly selected unique account IDs, across gender and marital status.
• Looking at the variables "CC_Contacted_L12m", "rev_per_month", "Complain_l12m", "rev_growth_yoy", "coupon_used_l12m", "Day_Since_CC_connect" and "cashback_l12m", we can conclude that the data has been collected for the last 12 months.
• The data has 19 variables: 18 independent and 1 dependent (target) variable, which shows whether the account churned or not.
• The data combines the services customers are using, their payment options, and basic individual details. It is a mix of categorical and continuous variables.
Visual inspection of data (rows, columns, descriptive details)

The data has 11,260 rows and 19 variables.

Table 2: - Dataset Information

Fig 1: - Shape of dataset

• Describing the data: the summary statistics show considerable variation across variables, indicating that each variable is unique and carries different information.

Table 3: - Describing Dataset

• Except for the variables "AccountID", "Churn", "rev_growth_yoy" and "coupon_used_for_payment", all other variables have null values present.

Table 4: - Showing Null Values in Dataset

• The data has no duplicate observations.
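A sketch of the structural checks behind Tables 2-4 is shown below, assuming the data frame is named df as in the loading sketch above.

    df.info()                          # data types and non-null counts (Table 2)
    print(df.describe(include="all"))  # summary statistics per variable (Table 3)
    print(df.isnull().sum())           # null values per variable (Table 4)
    print(df.duplicated().sum())       # 0 -> no duplicate observations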

Understanding of attributes (variable info, renaming if required)


This project has 18 attributes contributing towards the target variable. Let's discuss these variables one by one.
• AccountID – a unique ID assigned to each account. It is of integer data type and has no null values.
• Churn – the target variable, which represents whether the account has churned or not. It is categorical in nature with no null values; "0" represents "No" and "1" represents "Yes".
• Tenure – the total tenure of the account since it was opened. It is a continuous variable with 102 null values.
• City_Tier – segregates customers into 3 tiers based on the city where the primary customer resides. It is categorical in nature and has 112 null values.
• CC_Contacted_L12m – the number of times all customers of the account have contacted customer care in the last 12 months. It is continuous in nature and has 102 null values.
• Payment – the preferred mode of bill payment opted for by the customer. It is categorical in nature and has 109 null values.
• Gender – the gender of the primary account holder. It is categorical in nature and has 108 null values.
• Service_Score – the score given by the customer for the service provided by the company. It is categorical in nature and has 98 null values.
• Account_user_count – the number of customers attached to an AccountID. It is continuous in nature and has 112 null values.
• account_segment – segregates customers into different segments based on their spend and revenue generation. It is categorical in nature and has 97 null values.
• CC_Agent_Score – the score given by the customer for the service provided by the company's customer care representatives. It is categorical in nature and has 116 null values.
• Marital_Status – the marital status of the primary account holder. It is categorical in nature and has 212 null values.
• rev_per_month – the average revenue generated per account ID in the last 12 months. It is continuous in nature and has 102 null values.
• Complain_l12m – whether the customer has raised any complaint in the last 12 months. It is categorical in nature and has 357 null values.
• rev_growth_yoy – the revenue growth percentage of the account, last 12 months vs. months 24 to 13. It is continuous in nature and has no null values.
• coupon_used_l12m – the number of times customers have used discount coupons for bill payment. It is continuous in nature and has no null values.
• Day_Since_CC_connect – the number of days since a customer of the account last contacted customer care; a higher number of days indicates better service. It is continuous in nature and has 357 null values.
• cashback_l12m – the amount of cashback earned by the customer during bill payment. It is continuous in nature and has 471 null values.
• Login_device – the device on which the customer avails the services, either phone or computer. It is categorical in nature and has 221 null values.

❖ With the above understanding of the data, renaming of variables is not required.
❖ We can now move to the EDA part, where we will understand the data better and treat bad data, null values, and outliers.

Exploratory data analysis


Univariate analysis (distribution and spread for every continuous attribute, distribution of data in categories for categorical ones)
Univariate Analysis: -
• Several variables show outliers in the data, which need to be treated in further steps.

Table 5: - Showing Outliers in data

❖ None of the variables shows a normal distribution; all are skewed in nature.

Fig 2: - Count plot of categorical variables

Inferences from the count plots (a code sketch for these plots follows below): -
• Most customers are from city tier "1", which indicates the high population density in this city type.
• Most customers prefer debit and credit cards as their mode of payment.
• The proportion of male customers is higher than that of female customers.
• The average service score given by customers is around "3", which shows room for improvement.
• Most customers are in the "Super+" segment, and the fewest are in the "Regular" segment.
• Most customers availing the services are "Married".
• Most customers prefer "Mobile" as the device to avail the services.
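A minimal sketch of how these univariate views can be produced (box plots for the continuous variables, count plots for the categorical ones), assuming the data frame df from the loading step and that seaborn/matplotlib are available; the column lists are illustrative, and the continuous columns are coerced to numeric because the raw file still contains bad values at this stage.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    num_cols = ["Tenure", "CC_Contacted_L12m", "rev_per_month", "rev_growth_yoy",
                "coupon_used_l12m", "Day_Since_CC_connect", "cashback_l12m"]
    cat_cols = ["City_Tier", "Payment", "Gender", "Service_Score",
                "account_segment", "Marital_Status", "Login_device"]

    for col in num_cols:                                        # box plots (Table 5)
        sns.boxplot(x=pd.to_numeric(df[col], errors="coerce"))
        plt.title(col)
        plt.show()

    for col in cat_cols:                                        # count plots (Fig 2)
        sns.countplot(x=df[col])
        plt.title(col)
        plt.show()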
Bi-variate Analysis: -
• Pair plot across all categorical data and its impact on the target variable.

Fig 4: - Pair plot across categorical variables

• The pair plot shown above indicates that the independent variables, taken individually, are weak or poor predictors of the target variable, as the density of each independent variable largely overlaps between the two classes of the target variable.

Correlation among variables: -

We performed the correlation analysis after treating bad data and missing values. We also converted the categorical variables into integer data types, since correlations cannot be computed on object/categorical columns; a code sketch follows below.

Fig 6: - Correlation among variables

Inferences from the correlation matrix: -
• The variable "Tenure" shows a relatively high correlation with Churn.
• The variable "Marital_Status" shows a relatively high correlation with Churn.
• The variable "Complain_ly" shows a relatively high correlation with Churn.
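A sketch of the correlation check, assuming the cleaning described in the next sections has already been applied so that all columns are numeric.

    import matplotlib.pyplot as plt
    import seaborn as sns

    corr = df.corr(numeric_only=True)        # pairwise correlations (Fig 6)
    plt.figure(figsize=(12, 8))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
    plt.title("Correlation among variables")
    plt.show()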

Removal of unwanted variables: - After an in-depth understanding of the data, we conclude that removal of variables is not required at this stage of the project. The only candidate is "AccountID", which is simply a unique ID assigned to each account; however, dropping it would leave 8 duplicate rows. All the remaining variables look important based on the univariate and bi-variate analysis.

Outlier treatment: -
This dataset is a mix of continuous and categorical variables. It does not make sense to perform outlier treatment on categorical variables, as each category denotes a type of customer. So we perform outlier treatment only on the continuous variables.
• Box plots were used to detect the presence of outliers in a variable.
• The dots beyond the whiskers of a box plot represent the outliers in the variable.
• We have 8 continuous variables in the dataset: "Tenure", "CC_Contacted_LY", "Account_user_count", "cashback", "rev_per_month", "Day_Since_CC_connect", "coupon_used_for_payment" and "rev_growth_yoy".
• We capped values at the upper and lower limits to remove outliers; a code sketch follows below. The figure shows the variables before and after outlier treatment.

Fig 7: - Before and after outlier treatment
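A sketch of the capping step, using the conventional IQR-based lower and upper limits; the exact limits used in the report and the exact column names in the data file are assumptions.

    import pandas as pd

    cont_cols = ["Tenure", "CC_Contacted_LY", "Account_user_count", "cashback",
                 "rev_per_month", "Day_Since_CC_connect",
                 "coupon_used_for_payment", "rev_growth_yoy"]

    for col in cont_cols:
        s = pd.to_numeric(df[col], errors="coerce")   # ensure numeric before capping
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        # cap values outside the limits instead of dropping rows
        df[col] = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)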
Missing value treatment and variable transformation: -
• Out of 19 variables, we have data anomalies present in 17 variables and null values in 15 variables.
• We use the median to impute null values where the variable is continuous, because the median is less sensitive to outliers than the mean.
• We use the mode to impute null values where the variable is categorical.
• We treated null values variable by variable, as each variable is unique.
Treating variable "Tenure"
• We look at the unique observations in the variable and see that we have "#" and "nan" present in the data, where "#" is an anomaly and "nan" represents a null value.

Fig 8: - before treatment

• We replace "#" with "nan" and then replace "nan" with the calculated median of the variable; after this we no longer see any bad data or null values.
• We converted the data type to integer, because the IDE had recognised it as object data type due to the presence of bad data. A code sketch of this treatment follows below; the same pattern is applied to the remaining variables.
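A minimal sketch of this treatment for "Tenure" (replace the "#" anomaly, impute the median, cast to integer); the same replace/impute/convert pattern can be reused for the other variables discussed below.

    import numpy as np
    import pandas as pd

    df["Tenure"] = df["Tenure"].replace("#", np.nan)                 # anomaly -> NaN
    df["Tenure"] = pd.to_numeric(df["Tenure"], errors="coerce")      # force numeric
    df["Tenure"] = df["Tenure"].fillna(df["Tenure"].median()).astype(int)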
Treating variable "City_Tier"
• We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 9: - before treatment

• We replaced "nan" with the calculated mode of the variable; after this we no longer see any null values.
• We converted the data type to integer, because the IDE had recognised it as object data type due to the presence of bad data.

Treating variable "CC_Contacted_LY"
• We look at the unique observations in the variable and see the presence of null values, as shown below.
• We replace "nan" with the calculated median of the variable; after this we no longer see any null values.
• We converted the data type to integer, because the IDE had recognised it as object data type due to the presence of bad data.
Treating variable "Payment"
• We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 10: - before treatment

• We replace "nan" with the calculated mode of the variable; after this we no longer see any null values.
• We also performed label encoding for the observations, where 1 = Debit card, 2 = UPI, 3 = Credit card, 4 = Cash on delivery and 5 = E-wallet, and then converted the variable to integer data type for use in model building.

Treating variable "Gender"
• We look at the unique observations in the variable and see the presence of null values as well as multiple abbreviations of the same observations, as shown below.

Fig 11: - before treatment

• We replace "nan" with the calculated mode of the variable; after this we no longer see any null values.
• We also performed label encoding for the observations, where 1 = Female and 2 = Male, and then converted the variable to integer data type for use in model building. A code sketch of the treatment for these two variables follows below.
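A sketch of the mode imputation and label encoding for "Payment" and "Gender"; the raw category spellings in the file are assumptions and may need to be adjusted.

    # Payment: mode imputation, then label encoding (1..5 as described above).
    df["Payment"] = df["Payment"].fillna(df["Payment"].mode()[0])
    df["Payment"] = df["Payment"].map({"Debit Card": 1, "UPI": 2, "Credit Card": 3,
                                       "Cash on Delivery": 4, "E wallet": 5}).astype(int)

    # Gender: harmonise abbreviations, mode imputation, then label encoding.
    df["Gender"] = df["Gender"].replace({"F": "Female", "M": "Male"})
    df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])
    df["Gender"] = df["Gender"].map({"Female": 1, "Male": 2}).astype(int)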
Treating variable "Service_Score"
• We look at the unique observations in the variable and see the presence of null values, as shown below.
• We replace "nan" with the calculated mode of the variable; after this we no longer see any null values.
• We then convert the variable to integer data type for use in model building.

Treating variable "Account_user_count"
• We look at the unique observations in the variable and see the presence of null values as well as "@" as bad data, as shown below.

Fig 12: - before treatment

• We replace "@" with "nan" and then replace "nan" with the calculated median of the variable; after this we no longer see any bad data or null values.
• We then convert the variable to integer data type for use in model building.
Treating variable "account_segment"
• We look at the unique observations in the variable and see the presence of null values as well as different denotations for the same type of observations, as shown below.

Fig 13: - before treatment

• We replace "nan" with the calculated mode of the variable and also label the different account segments, where 1 = Super, 2 = Regular Plus, 3 = Regular, 4 = HNI and 5 = Super Plus; after this we no longer see any bad data or null values.
• We then convert the variable to integer data type for use in model building.

Treating variable "CC_Agent_Score"
• We look at the unique observations in the variable and see the presence of null values, as shown below.
• We replace "nan" with the calculated mode of the variable; after this we no longer see any bad data or null values.
• We then convert the variable to integer data type for use in model building.

Treating variable "Marital_Status"
• We look at the unique observations in the variable and see the presence of null values, as shown below.

Fig 14: - before treatment

• We replace "nan" with the calculated mode of the variable and also label the observations, where 1 = Single, 2 = Divorced and 3 = Married; after this we no longer see any bad data or null values.
• We then convert the variable to integer data type for use in model building.

Treating variable "rev_per_month"
• We look at the unique observations in the variable and see the presence of null values as well as "+", which denotes bad data, as shown below.

Fig 15: - before treatment

• We replace "+" with "nan" and then replace "nan" with the calculated median of the variable; after this we no longer see any bad data or null values.
• We then convert the variable to integer data type for use in model building.
Treating variable "Complain_ly"
• We look at the unique observations in the variable and see the presence of null values.
• We replace "nan" with the calculated mode of the variable; after this we no longer see any null values.
• We then convert the variable to integer data type for use in model building.

Treating variable "rev_growth_yoy"
• We look at the unique observations in the variable and see the presence of "$", which denotes bad data, as shown below.

Fig 15: - before treatment

• We replace "$" with "nan" and then replace "nan" with the calculated median of the variable; after this we no longer see any bad data or null values.
• We then convert the variable to integer data type for use in model building.

Treating variable "coupon_used_for_payment"
• We look at the unique observations in the variable and see the presence of "$", "*" and "#", which denote bad data, as shown below.
• We replace "$", "*" and "#" with "nan" and then replace "nan" with the calculated median of the variable; after this we no longer see any bad data or null values.
• We then convert the variable to integer data type for use in model building.

Treating variable "Day_Since_CC_connect"
• We look at the unique observations in the variable and see the presence of "$", which denotes bad data, as well as null values.
• We replace "$" with "nan" and then replace "nan" with the calculated median of the variable; after this we no longer see any bad data or null values.
• We then convert the variable to integer data type for use in model building.

Treating variable "cashback"
• We look at the unique observations in the variable and see the presence of "$", which denotes bad data, as well as null values.
• We replace "$" with "nan" and then replace "nan" with the calculated median of the variable; after this we no longer see any bad data or null values.
• We then convert the variable to integer data type for use in model building.

Treating variable "Login_device"
• We look at the unique observations in the variable and see the presence of "&&&&", which denotes bad data, as well as null values.
• We replace "&&&&" with "nan" and then replace "nan" with the calculated mode of the variable.
• We also label the observations, where 1 = Mobile and 2 = Computer; after this we no longer see any bad data or null values.
• We then convert the variable to integer data type for use in model building.
Count of null values before and after treatment

Fig 16: - Before and after null value treatment

• We now see zero null values across all variables, which indicates that the data is clean and we can move on to data transformation where required.
Variable transformation: -
• The variables have different dimensions; for example, "cashback" denotes a currency amount, whereas "CC_Agent_Score" denotes a rating provided by customers, so their statistical ranges also differ.
• Scaling is therefore required for this dataset so that all variables are brought onto a comparable scale.
• We use the MinMax scaler to normalise the data, which maps each variable into the range 0 to 1; a code sketch follows below.
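A sketch of the scaling step with scikit-learn's MinMaxScaler; which columns were scaled (here: all predictors, excluding "AccountID" and the target "Churn") is an assumption.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    X = df.drop(columns=["AccountID", "Churn"])
    y = df["Churn"]

    scaler = MinMaxScaler()                        # maps each column into [0, 1]
    X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)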
Addition of new variables: -
At the current stage we do not see a need to create any new variables. New variables may be required at a later stage of model building and can be created accordingly.
Business insights from EDA
Is the data unbalanced? If so, what can be done? Please explain in the context of the business.
• The dataset provided is imbalanced in nature: the class counts of our target variable "Churn" show a large difference.
• We have 9,364 observations with "0" (no churn) and 1,896 with "1" (churn), i.e. roughly 83% vs. 17%; a code sketch of this check follows the figure below.

Fig 18: - Imbalanced dataset
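A sketch of the imbalance check; the counts match those quoted above, and the remedies mentioned in the comment are common options rather than steps the report confirms it took.

    print(df["Churn"].value_counts())                  # 0: 9364, 1: 1896
    print(df["Churn"].value_counts(normalize=True))    # roughly 0.83 vs 0.17

    # Possible remedies (assumptions, not confirmed by the report): oversample the
    # minority class (e.g. SMOTE) on the training split only, or use
    # class_weight="balanced" in models that support it.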


Any other business insights
• We see decent variation in the data, which combines the services provided, the ratings given by customers, and the customer profile.
• The business needs to increase its visibility in Tier-2 cities, where it can acquire new customers.
• The business can promote payment via standing instructions on bank accounts or UPI, which are hassle-free and safe for the customer.
• Service scores need improvement and leave a lot of grey area.
• The business can roll out a survey to better understand customers' expectations.
• The business can train its customer care executives to provide a better customer experience, which in turn will improve feedback scores.
• Plans can be curated for customers based not only on their spend but also on the tenure they have spent with the business.
• Curated plans can be offered to married customers, something like a family floater.
MODEL BUILDING

The objective of this project is to develop a classification model to predict whether a given customer will churn (leave the service) or not. The target variable is binary, indicating "Yes" for churn and "No" for no churn. We will evaluate several machine learning algorithms to determine which provides the best predictive performance for this binary classification problem.

Algorithms for Model Building


We have selected a diverse set of algorithms to build the classification
model. These algorithms range from simple linear models to complex
ensemble methods, providing a comprehensive evaluation framework.

Logistic Regression
• Description:
o Logistic Regression is a supervised machine learning
algorithm used for binary classification problems. It
models the probability of a certain class or event,
assuming linear separability of the data.
• Use Case:
o Suitable for problems where the relationship between the
independent variables and the target variable is
approximately linear.
• Pros:
o Simple, interpretable, effective for linearly separable data.
• Cons:
o May not perform well with non-linear relationships.
Decision Tree
• Description:
o Decision Trees classify instances by splitting the dataset
based on feature values, forming a tree structure. It works
by recursively partitioning the data into subsets based on
the most significant splits.
• Use Case:
o Effective for capturing non-linear relationships in the
data.
• Pros:
o Easy to interpret, handles both numerical and categorical
data.
• Cons:
o Prone to overfitting, especially with deep trees.
Random Forest (Bagging)
• Description:
o Random Forest is an ensemble method that builds
multiple decision trees using bootstrap sampling and
aggregates their predictions to reduce variance.
• Use Case:
o Robust against overfitting and effective for both
regression and classification problems.
• Pros:
o Robust, handles non-linear data, reduces overfitting.
• Cons:
o Less interpretable, computationally intensive.
Linear Discriminant Analysis (LDA)
• Description:
o LDA is a predictive modeling algorithm for multi-class
classification, which can also be used for dimensionality
reduction by projecting the training dataset in a way that
best separates the classes.
• Use Case:
o Effective when classes are well-separated and normally
distributed.
• Pros:
o Effective for well-separated classes.
• Cons:
o Assumes normality and equal class variances.
K-Nearest Neighbors (KNN)
• Description:
o KNN is a non-parametric algorithm that classifies a data point based on the majority class among its k-nearest neighbors.
• Use Case:
o Suitable for small to medium-sized datasets where the
decision boundary is not linear.
• Pros:
o Simple, effective for small datasets.
• Cons:
o Computationally expensive, sensitive to noise.
Naive Bayes
• Description:
o Naive Bayes is a probabilistic classifier based on Bayes'
Theorem, assuming independence between features.
• Use Case:
o Effective for high-dimensional datasets and situations
where the independence assumption approximately holds.
• Pros:
o Fast, handles high-dimensional data well.
• Cons:
o Assumes feature independence.
Gradient Boosting
• Description:
o Gradient Boosting builds an ensemble of weak learners
(typically decision trees) in a stage-wise manner,
optimizing for overall prediction error.
• Use Case:
o Suitable for complex datasets with non-linear
relationships.
• Pros:
o Handles complex data, reduces error iteratively.
• Cons:
o Prone to overfitting if not carefully tuned.
Extreme Gradient Boosting (XGBoost)
• Description:
o XGBoost is an advanced implementation of gradient
boosting that includes regularization and other
optimizations to enhance performance and prevent
overfitting.
• Use Case:
o Ideal for large datasets with complex patterns.
• Pros:
o Superior performance, robust against overfitting, efficient.
• Cons:
o Requires careful tuning, can be computationally intensive.
Extra Trees Classifier
• Description:
o Extra Trees (Extremely Randomized Trees) classifier
builds an ensemble of trees using random splits. It is
similar to Random Forest but differs in the way it chooses
to split points.
• Use Case:
o Suitable for reducing overfitting and variance in the
model.
• Pros:
o Fast computation, high variance reduction.
• Cons:
o Less interpretable, may require more trees to stabilize.

Model Evaluation Procedure

For each model, we performed the following steps to evaluate its performance:

• Model Prediction: generate predictions on the test dataset.
• Model Performance: evaluate using metrics such as accuracy, precision, recall, and F1 score.
• ROC-AUC Graph: plot the ROC curve and calculate the Area Under the Curve (AUC) to assess the model's ability to distinguish between the two classes.
• Model Performance Metrics: calculate and report additional metrics, including the confusion matrix and feature importance (where applicable).

Splitting Data into Train and Test Dataset: - Following accepted market practice, we divided the data into train and test datasets in a 70:30 ratio, built the various models on the training dataset, and tested their accuracy on the test dataset (see the sketch below).
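A sketch of the 70:30 split, assuming the scaled predictors X_scaled and target y from the scaling sketch; stratification and the random_state are assumptions.

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.30, random_state=42, stratify=y)

    print(X_train.shape, X_test.shape)   # roughly 7,882 train rows vs 3,378 test rows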

Detailed Evaluation of Algorithms

Logistic Regression
• Training Data Prediction: the model is used to predict the target labels for the training dataset. This step helps in assessing how well the model performs on the data it was trained on.
• Training Accuracy: the calculated accuracy score (log_train_acc) indicates the proportion of correctly predicted instances in the training dataset. In this case, the training accuracy is 86.12%, meaning the model correctly predicts the class labels for approximately 86.12% of the instances in the training data.
• Test Data Prediction: the model's performance is evaluated by predicting the target labels for the test dataset. This step assesses how well the model generalises to unseen data, which is crucial for evaluating its effectiveness in real-world scenarios.
• Test Accuracy: the calculated accuracy score (log_test_acc) represents the proportion of correctly predicted instances in the test dataset. In this case, the test accuracy is 86.77%, indicating that the model correctly predicts the class labels for approximately 86.77% of the instances in the test data.
• Intercept (model.intercept_): the intercept positions the decision boundary (hyperplane) at which the predicted probability of the positive class is 0.5. In this case, the intercept value is approximately -0.04015361.
• Coefficients (model.coef_): these are the coefficients associated with each feature in the logistic regression model. Each coefficient indicates the change in the log-odds of the target variable for a one-unit change in the corresponding feature, assuming all other features remain constant. For example, the coefficient for the first feature is approximately -7.08e-05. A code sketch reproducing these steps is given below.
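A minimal sketch of these steps with scikit-learn; the solver settings and other hyperparameters are assumptions.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    log_train_acc = accuracy_score(y_train, model.predict(X_train))   # ~0.8612
    log_test_acc = accuracy_score(y_test, model.predict(X_test))      # ~0.8677
    print(log_train_acc, log_test_acc)

    print(model.intercept_)                                     # decision-boundary intercept
    print(model.coef_)                                          # one coefficient per feature
    print(confusion_matrix(y_train, model.predict(X_train)))    # train confusion matrix
    print(confusion_matrix(y_test, model.predict(X_test)))      # test confusion matrix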

The confusion matrices for the training and test datasets of each model are shown in the corresponding figures. The train and test AUC values were as follows:

Model                                   Train AUC    Test AUC
Decision Tree                           0.898        0.891
Random Forest                           0.95936      0.92900
Linear Discriminant Analysis (LDA)      0.898        0.891
K-Nearest Neighbors (KNN)               0.7448       0.594
Naive Bayes (NB)                        0.8133       0.8087
Gradient Boosting                       0.9483       0.9312
Extreme Gradient Boosting (XGBoost)     0.9999       0.991
Extra Trees Classifier                  1.0          0.996
Each model's performance was evaluated on the training & testing
dataset using the metrics mentioned above. The results for each metric
were compiled into a data frame for easy comparison.
Training dataset

Test dataset

ROC Curve Plotting


Training dataset:
Test dataset:
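A sketch of the ROC plotting loop, assuming the fitted models are kept in a dict named models (illustrative) and each exposes predict_proba.

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    models = {"Logistic Regression": model}   # add the other fitted models here

    plt.figure(figsize=(8, 6))
    for name, clf in models.items():
        proba = clf.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, proba)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, proba):.3f})")

    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")   # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()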

This report provides a comprehensive evaluation of various machine learning models on the training and test datasets using multiple performance metrics. The ROC curves illustrate the trade-off between the true positive rate and the false positive rate for each model, providing a visual representation of their discriminative power. The choice of the best model will depend on the specific requirements, such as the need for higher recall, higher precision, or a balanced F1 score.
• Permutation importance is a model-agnostic approach that
measures the importance of a feature by calculating the decrease
in model accuracy when the values of that feature are randomly
shuffled. This shuffling breaks the relationship between the
feature and the target, allowing us to observe how much the
model's accuracy relies on that feature.
• The bar plot produced by this procedure (see the sketch below) illustrates the permutation importance of the features in the Random Forest model. The y-axis shows the mean decrease in accuracy when a feature is permuted, and the error bars indicate the standard deviation of this decrease over 10 repetitions.
• The features on the x-axis are ranked in descending order of importance, showing which features have the most significant impact on the model's performance. For instance, if Feature_1 shows the highest mean decrease in accuracy, it implies that Feature_1 is the most important feature for the model's predictions. Conversely, features with lower importance values contribute less to the model's accuracy.
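A sketch of the permutation-importance computation, assuming a Random Forest fitted on the training split (its hyperparameters are assumptions).

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

    result = permutation_importance(rf, X_test, y_test, scoring="accuracy",
                                    n_repeats=10, random_state=42)

    importances = pd.Series(result.importances_mean, index=X_test.columns)
    print(importances.sort_values(ascending=False))   # mean accuracy drop per feature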
The cross-validation procedure involves splitting the training data into 10 subsets, training the model on 9 subsets, and evaluating it on the remaining subset. This process is repeated 10 times, each time with a different subset serving as the validation set. The mean of the negative MSE values across these 10 iterations is computed for each model; a sketch of this procedure follows.
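A sketch of that procedure, reusing the models dict from the ROC sketch above; with 0/1 labels, the negative MSE score simply penalises the proportion of misclassified observations.

    from sklearn.model_selection import cross_val_score

    cv_scores = {}
    for name, clf in models.items():
        scores = cross_val_score(clf, X_train, y_train,
                                 scoring="neg_mean_squared_error", cv=10)
        cv_scores[name] = scores.mean()      # closer to 0 (less negative) is better

    print(cv_scores)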

The results indicate that tree-based models, particularly the Extra Trees Classifier, XGBoost, and Random Forest, outperform other types of models in terms of minimizing the mean squared error. Logistic Regression, LDA, and Naive Bayes, while useful in many contexts, are less effective on this dataset. The K-Nearest Neighbors model has the highest error, suggesting it may not be well-suited for this particular problem.

Based on the cross-validation scores, the top-performing models are:
• Gradient Boosting Classifier (Best Score: 0.963588175839881)
• XGBoost (Best Score: 0.9529305305646846)
• K-Nearest Neighbors (Best Score: 0.9488706629885828)
• Extra Trees Classifier (Best Score: 0.9487438399067818)
• Random Forest Classifier (Best Score: 0.9487435180207875)
These models are particularly well-suited for this dataset and should
be considered for final model selection and deployment. Further
hyperparameter tuning and model ensemble could potentially improve
performance even further.
Conclusion:
Effective churn management is essential for the sustained
success of a DTH company. By implementing a comprehensive
approach encompassing customer segmentation, acquisition
strategies, customer delight initiatives, and churn prevention tactics,
the company can enhance customer retention and maximize
profitability. Furthermore, leveraging partnerships with lifestyle
vendors, promoting hassle-free payment methods, and expanding
presence in Tier-2 cities can help tap into new customer segments and
broaden market reach.
By focusing on personalized communication, exceptional
customer service, and targeted offers, the company can foster stronger
relationships with its customers, leading to increased loyalty and
reduced churn rates.
Ultimately, by prioritizing customer satisfaction and
implementing data-driven strategies, the DTH company can position
itself for long-term growth and success in the competitive market
landscape.

RECOMMENDATIONS:
Four Stages of Churn Management
• Customer Segmentation: Categorize customers based on needs
and usage for tailored strategies.
• Acquiring Customers: Utilize targeted marketing and
promotions to attract new customers.
• Customer Delight: Exceed expectations with exceptional service
and personalized rewards.
• Churn Prevention: Identify and retain at-risk customers through
targeted campaigns.
Other Recommendations
• Referral Drive: Incentivize customer referrals for organic
growth.
• Partnerships: Collaborate with vendors for enhanced customer
experiences.
• Segmentation: Internal segmentation for targeted acquisition
strategies.
• Loyalty Perks: Offer free cloud storage to boost customer
engagement.
• Personalization: Customize email responses to improve
communication.
• Priority Service: Establish dedicated teams for high-value
customers.
• Appreciation: Send thank-you notes and gifts to foster loyalty.
• Feedback Loop: Prioritize complaint resolution and gather
feedback.
• Payment Promotion: Encourage e-wallet usage with discounts
and incentives.
• Subsidized Offers: Introduce discounts for single customers to
encourage loyalty.
• Family Plans: Offer comprehensive plans for multi-user
households.
• Market Expansion: Increase presence in Tier-2 cities for growth
opportunities.
• Convenient Payments: Promote hassle-free payment methods
for customer convenience.
Customer Segment Approach
• Segmentation: Divide customers based on spending and loyalty.
• Retention Strategies: Tailor retention efforts based on spending
and loyalty levels.
