Business Problem: Improving Customer Retention Rate for a Retail Bank
• Discuss and justify whether your problem relates to a regression analysis or a classification analysis.
Section 2: Understand the Data (30%)
• Explore and discuss whether any of the data attributes you have focused on are closely correlated
with other attributes - either positively or negatively.
• Include at least 3 Tableau-generated visualisations (e.g., map, scatter plot, bar chart, pie chart,
box-and-whisker plot) that give different insights to support your discussions.
Here are three ideas for Tableau-generated visualizations that can provide different
insights to support your discussions on customer churn prediction:
1. Geographical Churn Rate Map:
o Create a map visualization in Tableau that shows the churn rates across
different geographical regions.
o Use the Geography attribute to plot the data points on a map, with each region
represented by a color or size based on its churn rate.
o This visualization will help identify any spatial patterns or hotspots of churn,
allowing you to target retention efforts in specific regions.
o Insights: The map may reveal that certain regions have significantly higher
churn rates compared to others, indicating potential regional factors
influencing churn. It could also highlight areas where the bank has a strong
customer base with low churn rates.
2. Customer Tenure vs. Balance Scatter Plot:
o Create a scatter plot in Tableau with Tenure on the x-axis and Balance on the
y-axis.
o Color-code the data points based on the churn status (churned or retained).
o This visualization will help explore the relationship between customer tenure
and account balance, and how it relates to churn.
o Insights: The scatter plot may show that customers with longer tenures and
higher balances are more likely to be retained, while those with shorter tenures
and lower balances are at a higher risk of churn. It could also reveal any
outliers or clusters of customers with specific tenure and balance
combinations.
3. Age and Gender Churn Rate Bar Chart:
o Create a grouped (side-by-side) bar chart in Tableau that displays the churn rate by
age group and gender.
o Divide the age attribute into meaningful age groups (e.g., 18-24, 25-34, 35-44,
etc.) and place each age group on the x-axis.
o Within each age group, draw one bar per gender (male and female) whose height is
the churn rate for that age group and gender combination. Churn rates are
proportions, so the bars should sit side by side rather than be stacked; stack
them only if you plot counts of churned and retained customers instead.
o Insights: The bar chart may reveal that certain age groups have higher churn
rates than others, indicating that age is a significant factor in churn behavior. It
could also highlight any gender differences in churn rates within each age
group, suggesting potential demographic segments to focus on for retention
efforts.
• Include at least 3 R-generated plots or aggregation tables that give different insights to support
your discussions.
Here are three ideas for R-generated plots or aggregation tables that can provide
different insights to support your discussions on customer churn prediction:
1. Churn Rate by Number of Products:
o Create an aggregation table in R that shows the churn rate for each number of
products held by customers.
o Use the "No Of Products" attribute to group customers based on the number of
products they have (e.g., 1 product, 2 products, 3 products, etc.). Note that
this attribute records how many products a customer holds, not which ones.
o Calculate the churn rate for each group by dividing the number of churned
customers by the total number of customers in that group.
o Insights: The aggregation table may reveal that customers with a particular
number of products have higher or lower churn rates. For example,
customers with only one product may have a higher churn rate compared to
those with multiple products, indicating the importance of cross-selling and
product bundling in reducing churn.
2. Correlation Heatmap:
o Generate a correlation heatmap in R using the numerical attributes in your
dataset, such as Age, Tenure, Balance, CreditScore, and Estimated Salary.
o The heatmap will display the pairwise correlations between these attributes
using a color scale, with darker colors indicating stronger correlations.
o Insights: The correlation heatmap may highlight significant correlations
between certain attributes. For instance, if Age and Tenure have a strong
positive correlation, it suggests that older customers tend to have longer
relationships with the bank. This information can help identify important
factors influencing churn and guide feature selection for your predictive
models.
3. Decision Tree Visualization:
o Train a decision tree model in R using the customer churn data, with the churn
status as the target variable and relevant attributes as predictors.
o Visualize the decision tree using R's plotting capabilities, for example a tree
fitted with the "rpart" package and drawn with "rpart.plot", or the "party" package.
o The decision tree plot will display the split points and conditions at each node,
leading to the classification of customers as churned or retained.
o Insights: The decision tree visualization provides a clear and interpretable
representation of the churn prediction logic. It can identify the most important
attributes and thresholds that contribute to churn. For example, the tree may
reveal that customers with a Balance below a certain threshold and a
CreditScore above a specific value are more likely to churn, providing
actionable insights for targeted retention strategies.
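The aggregation table and the correlation check described above can be sketched in a few lines. This is a minimal illustration in plain Python (the same logic carries over to R); the rows are made-up toy values, and the function names `churn_rate_by_group` and `pearson` are hypothetical helpers, not part of any library.

```python
from collections import defaultdict
from math import sqrt

# Toy rows, made up for illustration: (number_of_products, age, churned)
customers = [
    (1, 42, 1),
    (1, 35, 0),
    (2, 29, 0),
    (2, 51, 0),
    (3, 44, 1),
    (1, 58, 1),
    (2, 33, 0),
    (1, 27, 0),
]


def churn_rate_by_group(rows, key_index, churn_index):
    """Return {group: (customers, churned, churn_rate)} for one grouping column."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += 1
        churned[row[key_index]] += row[churn_index]
    return {g: (totals[g], churned[g], churned[g] / totals[g]) for g in sorted(totals)}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Aggregation table: churn rate per number of products
table = churn_rate_by_group(customers, key_index=0, churn_index=2)
for group, (n, k, rate) in table.items():
    print(f"{group} product(s): {k} of {n} churned ({rate:.0%})")

# One cell of a correlation heatmap: Age against the churn flag
ages = [row[1] for row in customers]
churn_flags = [row[2] for row in customers]
print(f"Age vs churn correlation: {pearson(ages, churn_flags):+.2f}")
```

A full heatmap is simply this coefficient computed for every pair of numerical attributes and shown on a colour scale.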
• Include the R-code you used in the appendix of your report.
Section 3: Prepare the Data (10%)
• If required, explain how you have derived your chosen target attribute(s) in Tableau or in R.
I think this isn’t needed since the churn status is already given in the data set
• Discuss and justify what other steps you may have taken to prepare your data, including, where
relevant: removing attributes from consideration, adding further "derived" attributes (e.g., Dates),
imputing "reasonable" values for missing data, transforming attributes, and standardising data
values.
• Prepare suitable separate "Training" and "Testing" datasets.
80-20 split should be fine. I think.
• Include any R-code you used to prepare your data in the appendix of your report.
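For the 80-20 split, one common refinement is to stratify by churn status so that both sets keep the same share of churned customers. The sketch below shows the idea in plain Python; the row layout and the helper name `stratified_split` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.2, seed=42):
    """Split rows into (train, test) while keeping each class's share the same."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's list is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data, made up for illustration: 100 customers, 20 of whom churned
rows = [{"id": i, "churned": 1 if i < 20 else 0} for i in range(100)]
train, test = stratified_split(rows, label_of=lambda r: r["churned"])

print(len(train), len(test))
print(sum(r["churned"] for r in train), sum(r["churned"] for r in test))
```

With these toy numbers both sets end up with a 20% churn share, which a purely random split does not guarantee.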
Section 4: Generate and Test Prediction Models (40%)
• Select and justify at least 3 different prediction models (with at least 1 ensemble model) that are
likely to best help with your stated problem objectives.
When selecting prediction models for your customer churn analysis, it's important to consider
the nature of your problem, the characteristics of your dataset, and the interpretability and
performance requirements. Here are three different prediction models, including one
ensemble model, that are commonly used and likely to be effective for churn prediction:
1. Logistic Regression:
o Logistic regression is a popular choice for binary classification problems like
churn prediction.
o It models the probability of an outcome (churn or retention) based on a linear
combination of the predictor variables.
o Logistic regression provides interpretable coefficients that indicate the impact
of each predictor on the churn probability.
o It assumes a linear relationship between the predictors and the log-odds of the
outcome, making it a simple and interpretable model.
o Logistic regression can handle both numerical and categorical predictors and
is less prone to overfitting compared to more complex models.
2. Random Forest (Ensemble Model):
o Random Forest is an ensemble model that combines multiple decision trees to
make predictions.
o It is known for its ability to handle high-dimensional datasets and capture
complex interactions between predictors.
o Random Forest reduces overfitting by creating a diverse set of decision trees
and aggregating their predictions.
o It can handle both numerical and categorical predictors and automatically
captures non-linear relationships.
o Random Forest provides feature importance measures, indicating the relative
contribution of each predictor to the churn prediction.
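o It is robust to outliers. Missing data usually still needs attention: many
implementations (including R's "randomForest" package by default) do not accept
missing values, so impute them before training.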
3. Gradient Boosting (e.g., XGBoost, LightGBM):
o Gradient Boosting is another ensemble model that combines weak learners
(usually decision trees) to create a strong predictive model.
o It iteratively trains decision trees, with each tree focusing on correcting the
errors made by the previous trees.
o Gradient Boosting algorithms, such as XGBoost and LightGBM, have gained
popularity due to their excellent performance and scalability.
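o They capture complex non-linear relationships. Numerical predictors can be used
directly; categorical predictors generally need to be encoded first (e.g., one-hot
encoding), although LightGBM can also treat them natively.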
o Gradient Boosting models provide feature importance measures and can
handle missing data.
o They have hyperparameters that allow for fine-tuning the model's performance
and complexity.
Justification:
Logistic Regression serves as a simple and interpretable baseline model. It provides
insights into the impact of each predictor on churn probability and is suitable when
interpretability is important.
Random Forest, as an ensemble model, captures complex interactions between
predictors and reduces overfitting. It is robust to outliers and can handle high-
dimensional datasets, making it a strong candidate for churn prediction.
Gradient Boosting models, such as XGBoost or LightGBM, have shown excellent
performance in various classification tasks, including churn prediction. They can
capture non-linear relationships and handle large datasets efficiently.
By selecting these three models, you cover a range of approaches, from simple and
interpretable (Logistic Regression) to more complex and powerful (Random Forest and
Gradient Boosting). This allows you to compare their performance and gain insights from
different perspectives.
In your report, justify your choice of models based on their strengths and suitability for the
churn prediction problem. Explain how each model can capture the underlying patterns in
your dataset and contribute to achieving your problem objectives.
Remember to evaluate the performance of each model using appropriate evaluation metrics
for binary classification, such as accuracy, precision, recall, F1-score, and area under the
ROC curve (AUC-ROC). Compare the results and discuss the trade-offs between model
complexity and interpretability.
By providing a clear justification for your model selection and evaluating their performance,
you demonstrate a thoughtful approach to predictive modeling and ensure that your churn
prediction analysis is comprehensive and well-grounded.
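As a small illustration of the log-odds relationship described for Logistic Regression above, the sketch below turns a linear combination of predictors into a churn probability. The intercept, the coefficients, and the attribute name `is_active_member` are hypothetical values chosen for the example, not fitted results.

```python
import math


def churn_probability(intercept, coefficients, features):
    """Logistic model: probability = 1 / (1 + exp(-log_odds))."""
    log_odds = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients, for illustration only
intercept = -3.0
coefficients = {"age": 0.05, "is_active_member": -1.0}

customer = {"age": 40, "is_active_member": 1}
p = churn_probability(intercept, coefficients, customer)
print(f"Predicted churn probability: {p:.3f}")

# exp(coefficient) is the odds ratio: the factor by which the odds of churn
# change when that predictor increases by one unit, all else held equal.
for name, beta in coefficients.items():
    print(f"{name}: odds ratio {math.exp(beta):.3f}")
```

This is what makes the coefficients interpretable: a negative coefficient gives an odds ratio below 1, meaning the attribute is associated with lower odds of churn.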
• Configure your models (e.g., select attributes and/or other model tuning parameters) that you
expect will best deliver relevant insights and/or provide the lowest error rates, justifying your
decisions.
To configure your selected models (Logistic Regression, Random Forest, and Gradient
Boosting) for optimal performance and insights, you need to consider attribute selection and
model tuning parameters. Here's how you can approach this:
1. Attribute Selection:
o Start by exploring the relevance and importance of each attribute in relation to
customer churn.
o Use domain knowledge and insights from exploratory data analysis to identify
attributes that are likely to be predictive of churn.
o Consider statistical methods, such as correlation analysis or chi-square tests, to
assess the relationship between each attribute and the churn outcome.
o Use feature importance measures provided by Random Forest and Gradient
Boosting models to identify the most influential attributes.
o Select a subset of attributes that collectively provide the best predictive power
while avoiding multicollinearity and overfitting.
o Justify your attribute selection based on their relevance to churn prediction
and the insights they can provide to the business.
2. Model Tuning Parameters:
o Each model has its own set of tuning parameters that can impact its
performance and complexity.
o For Logistic Regression:
Consider regularization techniques, such as L1 (Lasso) or L2 (Ridge),
to control overfitting and improve model generalization.
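Adjust the regularization strength (e.g., the C parameter in scikit-learn, which is
the inverse of the regularization strength, so smaller values mean stronger
regularization) to balance between model complexity and performance.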
o For Random Forest:
Tune the number of trees in the forest (n_estimators) to find the
optimal balance between model complexity and performance.
Adjust the maximum depth of the trees (max_depth) or the minimum
number of samples required to split a node (min_samples_split) to
control overfitting.
Experiment with the number of features considered at each split
(max_features) to introduce randomness and reduce correlation
between trees.
o For Gradient Boosting (XGBoost or LightGBM):
Tune the learning rate (learning_rate) to control the contribution of
each tree to the final prediction.
Adjust the maximum depth of the trees (max_depth) to control the
complexity of the model.
Modify the number of boosting rounds (n_estimators) to find the
optimal number of trees in the ensemble.
Consider regularization parameters, such as L1 (reg_alpha) and L2
(reg_lambda), to prevent overfitting.
o Use techniques like grid search or random search to systematically explore
different combinations of tuning parameters and find the best configuration for
each model.
o Justify your choices of tuning parameters based on their impact on model
performance, generalization, and computational efficiency.
3. Model Evaluation:
o Split your dataset into training and testing sets to evaluate the performance of
each model.
o Use cross-validation techniques, such as k-fold cross-validation, to assess the
model's performance on multiple subsets of the training data.
o Evaluate the models using relevant performance metrics for churn prediction,
such as accuracy, precision, recall, F1-score, and AUC-ROC.
o Compare the performance of the models and select the one that provides the
best balance between predictive accuracy and interpretability.
o Justify your model selection based on the evaluation results and the business
requirements for churn prediction.
In your report, document the attribute selection process and explain the reasoning behind
your choices. Provide the statistical analysis or feature importance measures used to support
your decisions.
For model tuning, describe the tuning parameters considered for each model and the range of
values explored. Present the results of the grid search or random search, highlighting the best-
performing configurations for each model.
Evaluate the models using appropriate performance metrics and present the results in a clear
and concise manner. Justify your final model selection based on the evaluation outcomes and
the trade-offs between performance and interpretability.
By carefully selecting attributes and tuning model parameters, you can optimize your models
to deliver relevant insights and achieve the lowest error rates for churn prediction. This
systematic approach demonstrates a thorough understanding of the modeling process and
enhances the credibility of your analysis.
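The combination of grid search and k-fold cross-validation described above can be sketched without any modelling library. To keep the example short, a toy one-feature nearest-neighbour classifier stands in for the real models (it is not one of the three models discussed), and its `n_neighbors` setting plays the role of the tuning parameter. The data and all function names are assumptions made for illustration.

```python
from itertools import product


def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for k roughly equal folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def knn_predict(train_x, train_y, x, n_neighbors):
    """Stand-in model: majority vote among the nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = sum(train_y[i] for i in order[:n_neighbors])
    return 1 if votes * 2 > n_neighbors else 0


def cross_val_accuracy(xs, ys, params, k=5):
    """Mean validation accuracy over k folds for one parameter setting."""
    scores = []
    for train, validation in k_fold_indices(len(xs), k):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        hits = sum(knn_predict(tx, ty, xs[i], **params) == ys[i] for i in validation)
        scores.append(hits / len(validation))
    return sum(scores) / len(scores)


def grid_search(xs, ys, param_grid, k=5):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((cross_val_accuracy(xs, ys, params, k), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results


# Toy data, made up for illustration: age and churn flag for 20 customers
ages = [22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58, 61, 64, 67, 70, 73, 76, 79]
churned = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

best_params, best_score, results = grid_search(ages, churned, {"n_neighbors": [1, 3, 5]})
for score, params in results:
    print(params, f"mean CV accuracy = {score:.2f}")
print("best:", best_params)
```

The structure is the same for the real models: only the stand-in classifier and the parameter grid change, while the final comparison on the testing set stays outside the cross-validation loop.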
• Run these models, discussing the model outputs and drawing, where possible, insights related to
your problem.
• Select proper evaluation metrics to measure the accuracy of your models. Determine and
comment on the best model across your 3 prediction models.
To evaluate the accuracy of your churn prediction models (Logistic Regression, Random
Forest, and Gradient Boosting), you need to select appropriate evaluation metrics. Since
churn prediction is a binary classification problem, you can use the following metrics:
1. Accuracy:
o Accuracy measures the overall correctness of the model's predictions.
o It is calculated as the ratio of correct predictions (true positives + true
negatives) to the total number of instances.
o However, accuracy alone may not be sufficient if your dataset is imbalanced,
as it can be biased towards the majority class.
2. Precision:
o Precision focuses on the correctness of the positive predictions (churned
customers).
o It is calculated as the ratio of true positives (correctly predicted churned
customers) to the total predicted positives (true positives + false positives).
o High precision indicates that when the model predicts a customer as churned,
it is highly likely to be correct.
3. Recall (Sensitivity):
o Recall measures the model's ability to identify churned customers correctly.
o It is calculated as the ratio of true positives (correctly predicted churned
customers) to the total actual positives (true positives + false negatives).
o High recall indicates that the model captures a large portion of the actual
churned customers.
4. F1-score:
o The F1-score is the harmonic mean of precision and recall.
o It provides a balanced measure of the model's performance, considering both
precision and recall.
o F1-score is useful when you want to find a balance between precision and
recall and when the class distribution is imbalanced.
5. Area Under the ROC Curve (AUC-ROC):
o AUC-ROC measures the model's ability to discriminate between churned and
non-churned customers.
o It plots the true positive rate (recall) against the false positive rate at various
classification thresholds.
o A higher AUC-ROC indicates better model performance, with a value of 1
representing a perfect classifier.
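The five metrics above can be computed directly from their definitions. The sketch below does so in plain Python on ten made-up customers; the labels, scores, and function names are illustrative assumptions, and the AUC is computed through its pairwise interpretation (the chance that a randomly chosen churner receives a higher score than a randomly chosen non-churner).

```python
def classification_metrics(actual, predicted):
    """Accuracy, precision, recall and F1 for binary labels (1 = churned)."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auc_roc(actual, scores):
    """AUC as the share of (churner, non-churner) pairs ranked correctly; ties count half."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example, made up for illustration: 3 churners among 10 customers
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 classification threshold

metrics = classification_metrics(actual, predicted)
auc = auc_roc(actual, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"auc_roc: {auc:.3f}")
```

In this toy case accuracy is 0.8 even though one of the three churners is missed, which illustrates why accuracy alone can be misleading on imbalanced data.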
To determine the best model among your three prediction models, you should evaluate them
using these metrics on the testing dataset. Here's an example of how you can present and
comment on the results (the figures below are illustrative placeholders, not actual model output):
Model                 Accuracy   Precision   Recall   F1-score   AUC-ROC
Logistic Regression   0.85       0.80        0.75     0.77       0.83
Random Forest         0.88       0.85        0.82     0.83       0.90
Gradient Boosting     0.90       0.88        0.86     0.87       0.93
Based on the evaluation metrics, the Gradient Boosting model appears to be the best-
performing model among the three. It achieves the highest accuracy, precision, recall, F1-
score, and AUC-ROC compared to Logistic Regression and Random Forest.
The high accuracy indicates that the Gradient Boosting model correctly predicts the churn
status for a large proportion of customers. The high precision suggests that when the model
predicts a customer as churned, it is highly likely to be correct. The high recall indicates that
the model captures a significant portion of the actual churned customers.
The F1-score, which balances precision and recall, is also the highest for the Gradient
Boosting model, indicating a good overall performance. The high AUC-ROC value suggests
that the model has a strong ability to discriminate between churned and non-churned
customers.
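However, it's important to consider the specific business requirements and the trade-off
between different metrics. For example, if the cost of false positives (incorrectly predicting a
customer as churned) is high, you may prioritize precision over recall. Conversely, if missing a
customer who is about to leave (a false negative) costs more than an unnecessary retention
offer, you may prioritize recall.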
In your report, present the evaluation metrics for each model in a clear and concise manner,
using tables or visualizations. Provide a commentary on the performance of each model and
justify your selection of the best model based on the evaluation results and business
considerations.
Remember to also discuss any limitations or assumptions made during the evaluation process
and consider the practical implications of implementing the chosen model for churn
prediction in the business context.
• Discuss what steps you may have taken to improve your individual models.
• Include any R-code you used in the appendix of your report.
To improve the performance of your individual churn prediction models (Logistic
Regression, Random Forest, and Gradient Boosting), you can take several steps. Here are
some common techniques and considerations:
1. Feature Engineering:
o Explore creating new features or transforming existing ones to capture
additional information relevant to churn prediction.
o Derive features based on domain knowledge, such as calculating customer
lifetime value, customer engagement scores, or aggregating transactional data.
o Experiment with feature interactions or combinations that may have predictive
power.
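o Apply feature scaling or normalization so that the features have similar ranges and
no single feature dominates the others. This matters most for regularized Logistic
Regression; tree-based models such as Random Forest and Gradient Boosting are
largely insensitive to scaling.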
2. Handling Class Imbalance:
o If your dataset has a significant class imbalance (i.e., a much higher proportion
of non-churned customers compared to churned customers), it can affect
model performance.
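o Consider techniques such as oversampling the minority class (e.g., using
SMOTE), undersampling the majority class, or assigning class weights to
balance the importance of each class during training. Apply any resampling to the
training set only, so that the testing set keeps the real class distribution.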
o Evaluate the impact of these techniques on the model's performance and
choose the approach that yields the best results.
3. Hyperparameter Tuning:
o Fine-tune the hyperparameters of each model to optimize its performance.
o Use techniques like grid search or random search to systematically explore
different combinations of hyperparameters.
o For Logistic Regression, tune the regularization strength (e.g., C parameter)
and consider different penalty terms (L1 or L2).
o For Random Forest, experiment with the number of trees, maximum depth,
minimum samples per leaf, and other relevant hyperparameters.
o For Gradient Boosting, tune the learning rate, maximum depth, number of
boosting rounds, and regularization parameters.
o Use cross-validation to evaluate the model's performance for each
hyperparameter combination and select the best configuration.
4. Ensemble Methods:
o Explore combining multiple models to create an ensemble that leverages the
strengths of each individual model.
o Techniques like voting (majority or weighted) or stacking can be used to
aggregate the predictions of different models.
o Experiment with different combinations of models and evaluate the
performance of the ensemble compared to the individual models.
5. Model Interpretation and Feature Importance:
o Analyze the coefficients or feature importances of each model to gain insights
into the most influential predictors of churn.
o Use techniques like permutation importance or SHAP (SHapley Additive
exPlanations) to assess the impact of each feature on the model's predictions.
o Visualize the feature importances or create partial dependence plots to
understand the relationship between the features and churn probability.
o Use these insights to refine the feature selection process or guide further data
exploration and business decisions.
6. Cross-Validation and Model Stability:
o Assess the stability and generalization ability of each model using cross-
validation techniques.
o Use strategies like k-fold cross-validation or stratified k-fold cross-validation
to evaluate the model's performance across different subsets of the data.
o Analyze the variation in performance metrics across the cross-validation folds
to ensure that the model is not overfitting or underfitting.
o Consider using techniques like early stopping or model checkpointing to
prevent overfitting during training; these apply mainly to iteratively trained
models such as Gradient Boosting.
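Two of the class-imbalance options mentioned above can be sketched in plain Python. Note that this is simple random oversampling (duplicating minority rows), not SMOTE, which synthesises new rows instead. The class-weight formula is the common "balanced" heuristic, n_samples / (n_classes * class_count). The data and function names are assumptions made for illustration.

```python
import random
from collections import Counter


def oversample_minority(rows, label_of, seed=0):
    """Randomly duplicate rows of smaller classes until all classes are equal in size.

    Intended for the training set only; the testing set should stay untouched.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap (nothing is added for the largest class)
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {label: n_samples / (len(counts) * count) for label, count in counts.items()}


# Toy training labels, made up for illustration: 80 retained, 20 churned
training_rows = [{"id": i, "churned": 1 if i < 20 else 0} for i in range(100)]

balanced_rows = oversample_minority(training_rows, label_of=lambda r: r["churned"])
weights = balanced_class_weights([r["churned"] for r in training_rows])

print(Counter(r["churned"] for r in balanced_rows))
print(weights)
```

Class weights leave the data unchanged and instead make each churned customer count more in the loss, which avoids the duplicated rows that oversampling introduces.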
In your report, discuss the specific steps you took to improve each individual model. Provide
details on the feature engineering techniques applied, the handling of class imbalance,
hyperparameter tuning process, and any ensemble methods used.
Present the results of these improvement steps, comparing the performance metrics before
and after each modification. Highlight the most effective techniques for each model and
discuss the insights gained from model interpretation and feature importance analysis.
Finally, summarize the overall impact of these improvements on the models' performance and
discuss any trade-offs or limitations encountered during the optimization process.
By documenting the steps taken to improve the individual models, you demonstrate a
thorough and iterative approach to model development and showcase your ability to optimize
the models for better churn prediction performance.