INTERNSHIP REPORT
A report submitted in partial fulfilment of the requirements for the Award of
Degree of
BACHELOR OF TECHNOLOGY
In
INFORMATION TECHNOLOGY
Submitted By:
Sehajdeep Singh
(06411503121)
Bharati Vidyapeeth College of Engineering
New Delhi
Under the Guidance of:
Internship Supervisor
Mr. Biswajit Chakraborty
Analytics Centre of Excellence
Punjab National Bank, HO
New Delhi
(Duration: 17th July, 2023 – 14th August, 2023)
TABLE OF CONTENTS
1. Acknowledgement
2. Internship Objective
3. Profile of the Organisation
4. Introduction
a. Background and problem Statement
b. Objectives of the Project
5. Data Collection and Pre-processing
a. Data Sources
b. Data Cleaning and transformation
6. Exploratory Data Analysis (EDA)
a. Descriptive Statistics
b. Visualizations and Insights
7. Models
8. Scoring
9. Conclusion
10. Future work
11. References
DECLARATION
I, Sehajdeep Singh, hereby declare that the presented report of
internship titled "Machine Learning Driven Customer Segmentation
And Churn Analysis" is a unique and original work that I have
prepared based on my one month of internship experience at Punjab
National Bank, HO, New Delhi. The report covers various concepts
and techniques related to data science and machine learning that I
have learned and applied during my internship, such as data
cleansing, Exploratory Data Analysis (EDA), feature engineering, and
Machine learning models. I confirm that the report is solely intended
for my academic requirement and not for any other purpose. I assure
that the information and data presented in the report is true and
accurate to the best of my knowledge. I also understand that the
report may not be used for any commercial or personal gain, and it
should not be used with the interest of any opposite party or
corporation without my prior consent.
Signed: ___________
Date: 14 August 2023
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to the following individuals and
organizations for their invaluable support and guidance during my internship:
Mr. Biswajit Chakraborty : Thank you for being a supportive mentor and
providing valuable insights throughout my internship. Your guidance has
been instrumental in my learning journey.
Punjab National Bank: I extend my appreciation to Punjab National Bank
for providing me with the opportunity to gain hands-on experience in
the Data Analytics Cell. The exposure I gained has been invaluable for my
professional development.
I would also like to acknowledge the contribution of Bharati Vidyapeeth
College of Engineering for facilitating and endorsing this internship. Your
support has allowed me to apply classroom learning to real-world
scenarios.
This internship experience has enriched my skills, expanded my horizons, and
provided me with a deeper understanding of Data Analytics. I am truly grateful
for the lessons learned and the relationships formed.
Sehajdeep Singh
Bharati Vidyapeeth College of Engineering
Date: 14 August, 2023
INTERNSHIP OBJECTIVE
Training Internships offer a valuable opportunity for a diverse range of
individuals to gain real-world experience and enhance their skill sets.
The objective for this position should highlight both our existing skills in
the field and our eagerness to further expand our knowledge and
capabilities.
Internships are utilized in a number of different career fields, including
architecture, engineering, healthcare, economics, advertising and many
more.
Some internships allow individuals to perform scientific
research, while others are specifically designed to let people gain
first-hand work experience.
Utilizing internships has proven to be a great method for constructing
our resume and cultivating skills that can be accentuated in our CV for
prospective employment. When applying for a Training Internship,
ensure that we showcase any distinctive skills or talents that can set us
apart from other candidates, thereby increasing our likelihood of
securing the position.
Profile of the Organization
Punjab National Bank (abbreviated as PNB) is an Indian public sector bank based in New Delhi. The
bank was founded in May 1894 and is the second largest public sector bank in India in terms of its
business volumes and in terms of its network, with assets of ₹17.95 lakh crore (US$220 billion),
180 million customers, 12,248 branches, and 13,000+ ATMs.
Punjab National Bank is a PSU working under the Government of India regulated by the Reserve
Bank of India Act, 1934 and the Banking Regulation Act, 1949. It was registered on 19 May 1894
under the Indian Companies Act, with its office in Anarkali Bazaar, in pre-independent India
(present-day Pakistan). The founding board was drawn from different parts of India professing
different faiths and of varying backgrounds, with the common objective of creating a truly national
bank that would further the economic interest of the country. PNB's founders included several
leaders of the Swadeshi movement such as Dyal Singh Majithia and Lala Harkishen Lal, Lala Lalchand,
Kali Prosanna Roy, E. C. Jessawala, Prabhu Dayal, Bakshi Jaishi Ram, and Lala Dholan Dass.
VISION:
“To be a globally trusted banking partner through customer-centric innovations,
empowering employees and enriching lives of all stakeholders.”
MISSION:
“To offer quality financial services by leveraging technology to create value for
customers and other stakeholders, opportunities for employees and thus, contributing to
the economic growth of nation.”
VALUES:
We will work as a team for the benefit of customers.
We will incorporate innovation to drive business.
We will be objective in decision making.
We will always be willing to learn and embrace change.
We will adopt ethical practices to develop a culture of trust.
Online Services
PNB SMS Banking
Available to customers registered for SMS Alert.
SMS Alerts on registered mobile number for all the transactions done through Cards (ATMs,
POS, E Commerce), Internet Banking, Mobile Banking, CDM, etc.
PNB Merchant Acquiring Products
Merchant Acquiring Business is an integral part of the digital ecosystem: it provides the necessary
infrastructure to acquire merchants and enables them to accept payments
through Debit/Credit Card, UPI, Aadhaar Enabled Payment System (AePS), etc.
PNB Retail Internet Banking
Anytime & Anywhere Banking through Internet.
Quick, simple and convenient way of Banking.
PNB ONE
PNB ONE is a unified Mobile Banking application enriched with several features providing all
banking facilities at a single platform.
It allows users to meet their major banking requirements through the application on a 24x7 basis,
anywhere and anytime, without visiting a branch.
PNB’s Banking Services through WhatsApp
In September 2022, Punjab National Bank launched WhatsApp banking for customers and
non-customers.
At present, PNB offers non-financial services such as balance inquiry, last
five transactions, stop cheque, and cheque book requests to its account holders through the
WhatsApp banking service.
Corporate social responsibility
The Bank also continues to discharge its social obligations and address environmental concerns
with added vigour. Its activities include free medical camps, distribution of artificial limbs, tree plantation and
blood donation camps, besides donations to hospitals, schools, etc. The Bank supports various
societies, charitable institutions and NGOs working for the benefit of the downtrodden and
weaker sections of society, including orphans, underprivileged people, people with disabilities, children with
intellectual disabilities, and women in shelter homes. The Bank also contributes to fighting diseases such as diabetes,
tuberculosis, AIDS and leprosy. Donations are also extended for the purchase of water coolers and
ambulances and for building infrastructure facilities at hospitals and schools.
For the financial year 2018–2019, PNB spent ₹2,954.15 lakh (US$3.7 million) on CSR
initiatives. One such initiative is PNB Ladli, which started in 2014. The scheme
was initiated to promote education among girls in rural and semi-urban areas.
To help young talent in India, the bank also set up the PNB Hockey Academy, one of its major efforts
towards supporting the national games.
Present Status of the Organization
PNB has always looked at technology as a key facilitator to provide better customer service and
ensured that its 'IT strategy' follows the 'Business strategy' so as to arrive at "Best Fit". The bank has
made rapid strides in this direction. Along with the achievement of 100% branch computerization,
one of the major achievements of the Bank is covering all the branches of the Bank under Core
Banking Solution (CBS), thus covering 100% of its business and providing 'Anytime Anywhere'
banking facility to all customers including customers of more than 2000 rural branches. The bank has
also been offering Internet banking services to the customers of CBS branches, such as booking of tickets,
payment of utility bills, and purchase of airline tickets. To develop cost-effective
alternative channels of delivery, the bank has built a network of more than 13,000 ATMs, the largest ATM
network amongst Nationalised Banks.
Future Expansion of the Organization
Under its long-term vision, the Bank proposes to start operations in Fiji, Australia and
Indonesia. The Bank continues with its goal to become a household brand with global expertise. Amongst
the Top 1000 Banks in the World, 'The Banker' listed PNB at 250th place. Further, PNB is at the 16th
position among 48 Indian firms making it to a list of the world's biggest companies compiled by the
US magazine 'Forbes'.
INTRODUCTION
Segmentation is the process of dividing a bank's customer base into distinct groups or
segments based on specific characteristics, behaviours, or needs. The idea is to design
tailored marketing strategies for selected segments in order to serve clients' needs
better. Banks can offer tailored products to these market segments in order to increase
customer profitability. In applying a marketing strategy, the
provider of banking or financial services distinguishes among several market
segments. Services, the range of products for sale, and the methods of communication are
customized for one or more selected groups of customers. When applied to the retail
consumer market, segmentation may be done according to several criteria, e.g.
geographic, demographic, psychographic and behavioural.
In this study, I attempted to segment the customers based on their transaction timings.
Through specific marketing activities, the bank primarily aims to up-sell to these customers,
increasing their share-of-wallet and overall profitability. Within the banking sector, there
exist five overarching market segments (categorized by type of market segmentation), the
majority of which can be further consolidated into two main groups. This consolidation
results in a total of eight potential market segments, outlined as follows:
High Value Customers
Uses a wider range of banking products and services.
Generates substantial transaction volumes.
Medium value customers
Maintains moderate account balance.
Contributes to bank’s overall financial health
Low-value customers
Limited account activity
Requires cost-effective services and basic banking options
High-Maintenance Customers
Requires significant assistance and support
Frequent interactions with customer service or branch personnel.
Digital-First Customers
Primarily interact with the bank through digital channels.
Appreciates technology-driven solutions
Long-Term Customers
Maintained accounts with the bank for an extended period.
Appreciates recognition and rewards for their continued commitment
Inactive
Closed Account
Problem Statement:
To segment customers based on the time at which they prefer to be contacted by the
bank.
Motivation:
A large amount of customer data is stored in the bank's databases. This data
remains unexplored unless it is analyzed and processed, and proper steps are taken to
improve the services provided to customers.
When customers are contacted through SMS, email, WhatsApp or
phone calls, the majority of them do not receive or read the message the bank wants to
convey.
Client segmentation not only enhances customer service but also contributes to
fostering client loyalty and retention. Marketing materials tailored through client
segmentation are often more cherished and well-received by customers due to their
personalized nature.
The overarching goals of the segmentation:
Better targeting and positioning of products.
Encouraging two-way communication between the customer and the bank.
Maintaining effective relationships with customers.
Retaining existing customers and attracting new ones.
Improving the bank's marketing strategies and marketing plans.
TASK PERFORMED
During this internship, data analytics and machine learning techniques were applied in Python, including data
cleansing, Exploratory Data Analysis (EDA), feature engineering and ML models to make the required
predictions.
DATA EXTRACTION
A sample database was taken from an open-source public portal, Kaggle (kaggle.com). Appropriate
features were selected from multiple tables in the database with the help of SQL.
MySQL was used to extract the dataset by applying JOINs to the tables and by creating various sessions with
the help of aggregate functions to obtain the features required for the model.
The data was then exported as a .csv file.
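The extraction step can be sketched as follows. This is only an illustration, not the query used in the project: Python's built-in sqlite3 module stands in for MySQL, and the table names (customers, transactions), column names and sample rows are all hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used as a stand-in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, gender TEXT, location TEXT);
    CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, customer_id TEXT, amount REAL);
    INSERT INTO customers VALUES ('C1', 'F', 'DELHI'), ('C2', 'M', 'MUMBAI');
    INSERT INTO transactions VALUES (1, 'C1', 250.0), (2, 'C1', 750.0), (3, 'C2', 100.0);
    """
)

# JOIN the two tables and use aggregate functions to build per-customer features.
query = """
SELECT c.customer_id, c.gender, c.location,
       COUNT(t.txn_id) AS txn_count,
       SUM(t.amount)   AS total_amount
FROM customers AS c
JOIN transactions AS t ON t.customer_id = c.customer_id
GROUP BY c.customer_id, c.gender, c.location
ORDER BY c.customer_id
"""
features = pd.read_sql_query(query, conn)
conn.close()

# Export as CSV text; pass a file path to to_csv() to write a .csv file instead.
csv_text = features.to_csv(index=False)
print(csv_text)
```

Aggregating in SQL before export keeps the exported file at one row per customer, which is usually easier to analyse in a notebook than raw transaction rows.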
About Dataset:
The derived dataset is imported into the python notebook where it will be analyzed further.
Most banks have a large customer base - with different characteristics in terms of age, income,
values, lifestyle, and more. Customer segmentation is the process of dividing a customer dataset
into specific groups based on shared traits.
According to a report from Ernst & Young, “A more granular understanding of consumers is no
longer a nice-to-have item, but a strategic and competitive imperative for banking providers.
Customer understanding should be a living, breathing part of everyday business, with insights
underpinning the full range of banking operations.”
This dataset consists of more than 1 million transactions by over 800K customers of a bank in India. The data
contains information such as transaction history, customer demographics and account details.
Cleaning the Data
Data preprocessing involved handling missing values, outlier detection, and normalization.
Relevant features were selected, and new features were engineered for analysis.
The ‘CustomerDOB’ and ‘TransactionTime’ columns were converted to the datetime data type.
The age of each customer was then calculated from these columns.
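The age calculation can be sketched as below. The report does not give the exact formula, so this assumes age is measured in completed years at the time of the transaction; the sample rows are hypothetical, while the column names follow the text above.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the exported .csv file.
df = pd.DataFrame(
    {
        "CustomerDOB": ["1990-05-14", "1985-12-01", "2001-03-30"],
        "TransactionTime": [
            "2016-08-02 14:30:00",
            "2016-08-02 09:05:00",
            "2016-08-03 22:45:00",
        ],
    }
)

# Convert both columns to the datetime data type.
df["CustomerDOB"] = pd.to_datetime(df["CustomerDOB"])
df["TransactionTime"] = pd.to_datetime(df["TransactionTime"])

# True when the customer's birthday has already occurred in the transaction year.
txn, dob = df["TransactionTime"], df["CustomerDOB"]
had_birthday = (txn.dt.month > dob.dt.month) | (
    (txn.dt.month == dob.dt.month) & (txn.dt.day >= dob.dt.day)
)

# Age in completed years (assumption: measured at transaction time).
df["Age"] = txn.dt.year - dob.dt.year - (~had_birthday).astype(int)
print(df[["CustomerDOB", "TransactionTime", "Age"]])
```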
Handling NULL values:
Rows with NULL values in the CustomerDOB, Account_balance, or Gender columns were
dropped from the DataFrame.
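A minimal sketch of this step with pandas is shown below. The sample rows are hypothetical, and the code assumes a row is dropped when any one of the three columns is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in each of the three columns.
df = pd.DataFrame(
    {
        "CustomerDOB": ["1990-05-14", None, "1985-12-01", "1979-07-22"],
        "Account_balance": [15000.0, 2300.5, np.nan, 87000.0],
        "Gender": ["M", "F", "M", None],
    }
)

required = ["CustomerDOB", "Account_balance", "Gender"]

# Count missing values per column before dropping anything.
print(df[required].isna().sum())

# Drop every row that has a NULL in any of the required columns.
cleaned = df.dropna(subset=required).reset_index(drop=True)
print(cleaned)
```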
Normalization:
Normalization is the technique in which the goal is to change the values of numeric columns in
the dataset to use a common scale, without distorting differences in the ranges of values or
losing information.
The Age column showed some anomalous values, which were removed.
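The normalization described above can be illustrated with min-max scaling. The report does not state which technique was applied, so the use of scikit-learn's MinMaxScaler is an assumption, and the sample values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric columns on very different scales.
df = pd.DataFrame(
    {
        "Age": [22, 35, 48, 61],
        "Account_balance": [1200.0, 56000.0, 9800.0, 310000.0],
    }
)

# Rescale each column to the [0, 1] range (assumed technique: min-max scaling).
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.round(3))
```

After scaling, both columns share a common range, so a feature with large raw values such as account balance no longer dominates one with small raw values such as age.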
Exploratory Data Analysis (EDA):
Exploratory Data Analysis (EDA) is a process of examining and analyzing data to understand
its characteristics, patterns, and relationships. It involves visually exploring the data,
summarizing its main features, and identifying potential trends, outliers, and anomalies.
EDA is typically performed as an initial step in the data analysis process, before applying
more advanced statistical techniques or building predictive models.
EDA aims to gain insights and generate hypotheses about the data, and it helps in
formulating questions to be tested in subsequent analyses. By examining
the data closely, EDA can reveal patterns, correlations, and trends that may not be apparent
at first glance. It also helps in identifying data quality issues, missing values, and outliers that
may impact the validity of subsequent analyses.
Descriptive statistics and visualizations were used to gain insights into customer profiles,
transaction patterns, and product usage
Summary of Analysis:
Exploring "Customer Gender", it is observed that men outnumber women across all age
groups: women make up only 27% of total customers, alongside an
insignificant number of records with undefined gender.
This recognised gender gap is one the bank can address with new product offerings aimed
at attracting more women to the business.
Comparing Gender relation under different age groups
Before comparing, customer ages were grouped into the following ranges:
Less than 18
(18<x<25)
(25<x<30)
(30<x<35)
(35<x<40)
More than 40.
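The grouping above can be sketched with pandas. The ranges in the text use strict inequalities, which leave the boundary ages unassigned, so this sketch assumes each range includes its lower bound; the label strings and sample ages are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages, one per group.
ages = pd.Series([16, 22, 27, 33, 38, 52], name="Age")

# Bin edges taken from the ranges listed in the text.
bins = [0, 18, 25, 30, 35, 40, np.inf]
labels = ["<18", "18-25", "25-30", "30-35", "35-40", ">40"]

# right=False makes each bin include its lower edge and exclude its upper edge.
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(pd.concat([ages, age_group.rename("age_group")], axis=1))
```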
The majority of the bank's customers fall in the Adult and Middle-age groups. This is a gap that the bank
should investigate, as well as a chance to develop and offer suitable solutions for other customer
age groups, such as the Flex group, which includes young entrepreneurs, retirees, and men and
women in their fifties and sixties. It is also observed that the Mid_Adult age group has the highest
number of customers, followed by the Adults and young_adults groups.
We can observe that most of the bank's customers are aged 18 or above. To improve this
composition, the bank may offer better schemes for minors that encourage this age group to
interact with the bank and handle transactions on their own.
Below is the gender gap between customers with respect to different age groups.
It can be observed that in every age group, women account for less than about 30% of the total
customers in that age group. To improve this, more awareness programmes and
women-centric schemes should be launched.
Displaying relation between the age Groups and timing of transaction:
We observed that as age increases, the percentage of customers transacting in the
before_lunch period increases and the percentage transacting at night decreases.
We may also infer that most transactions are made during the evening.
Almost 44% of customers aged 45 and above prefer to transact in the evening (the highest share),
while about 40% of customers in the other age groups transact during the evening.
Young customers transact late at night nearly twice as often as older
customers.
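Percentages of this kind can be produced with a row-normalised cross-tabulation. The sketch below is illustrative only: the hour boundaries of each time period, the period names other than those mentioned in the text, and the sample rows are hypothetical.

```python
import pandas as pd


def time_period(hour: int) -> str:
    """Map an hour of the day to a time period (hypothetical boundaries)."""
    if hour < 6:
        return "late_night"
    if hour < 13:
        return "before_lunch"
    if hour < 20:
        return "evening"
    return "night"


# Hypothetical sample: age group and transaction hour for eight customers.
df = pd.DataFrame(
    {
        "age_group": ["18-25"] * 4 + [">40"] * 4,
        "hour": [2, 10, 18, 23, 9, 11, 17, 19],
    }
)
df["period"] = df["hour"].apply(time_period)

# Share of each age group's transactions falling in each period, in percent.
share = pd.crosstab(df["age_group"], df["period"], normalize="index") * 100
print(share.round(1))
```

Normalising by row rather than by the whole table makes age groups of different sizes directly comparable.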
Analysing the gender gap with respect to account balance:
Graph displaying the relation between account balance and the count of customers transacting in different
time periods:
This graph leads to the same conclusion: in every group, the number of customers
transacting during the evening period is the highest.
Also, the higher the account_balance, the higher the probability that customers remain active during
the late-night period.
This may also imply that the wealthier group of customers is generally busy during the evening / night
and is more active before lunch compared to other groups.
Analysing transaction timings of male and female customers separately:
Female customers tend to transact during the evening period more than at other times,
while male customers are more evenly spread across time periods.
Comparing the location of the customers:
"Customer Location": The bank has customers in 2829 locations across India, which are not
evenly distributed, so only the top 20 locations were presented. This is also an
opportunity for the bank to examine client dispersion and implement creative methods of
expanding its customer base across India.
We can observe that the metro cities are the highest transacting cities, with over 70% of
all transactions.
We can also conclude that transactions in metro cities are more spread out
throughout the day, whereas in the remaining cities the Evening / Before Lunch time periods dominate
the transaction timings.
Creation of Model
A machine learning model is a file that has been trained to recognize certain types of
patterns. You train a model over a set of data, providing it an algorithm that it can use to
reason over and learn from those data.
The learning algorithm discovers patterns within the training data, and it outputs an ML
model which captures these patterns and makes predictions on new data.
Creating dummies
Before passing the variables into the model, categorical variables are converted into 0/1 indicator columns by creating
dummies with the help of the pandas library. Dummies can be passed into the model because each
categorical value is converted into a numerical value that the model can process
without any issue. Each category thus becomes its own binary variable.
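Dummy creation can be sketched with pandas.get_dummies as follows. The column names and sample values are hypothetical and stand in for the project's actual categorical features.

```python
import pandas as pd

# Hypothetical sample with two categorical columns and one numeric column.
df = pd.DataFrame(
    {
        "Gender": ["M", "F", "M"],
        "age_group": ["18-25", ">40", "25-30"],
        "Account_balance": [1200.0, 56000.0, 9800.0],
    }
)

# Each category becomes its own 0/1 column; numeric columns are left unchanged.
encoded = pd.get_dummies(df, columns=["Gender", "age_group"], dtype=int)
print(encoded)
```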
The dataset obtained was divided into 3 subsets:
a) Training set : 55%
b) Testing set: 25%
c) Validation set: 20%
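A 55% / 25% / 20% split can be obtained with two calls to scikit-learn's train_test_split, as sketched below. The synthetic data, the use of stratification and the random seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the binary target.
X = np.arange(2000).reshape(1000, 2)
y = np.tile([0, 1], 500)

n_total = len(X)
n_val = round(0.20 * n_total)   # validation set: 20% of all rows
n_test = round(0.25 * n_total)  # testing set: 25% of all rows

# First hold out the validation set, then split the remainder into train and test.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=n_val, stratify=y, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=n_test, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_test), len(X_val))
```

Passing absolute row counts instead of fractions to the second call avoids having to re-express 25% of the full dataset as a fraction of the remaining 80%.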
Worked with 3 types of models under Supervised Learning:
a) Random Forest Classifier
b) Decision Tree classifier
c) Logistic Regression
Random Forest Classifier
For the classification task, the outcome of the random forest is taken from the
majority of votes. Whereas in the regression task, the outcome is taken from the
mean or average of the predictions generated by each tree.
Library used: sklearn.ensemble
Variables chosen for this model were:
‘clf’ : stores the RandomForestClassifier model with random state set to 42.
‘param_dist’ : stores all the parameters passed for k-fold validation.
‘cv’ : stores the data splits produced by stratified k-fold.
Parameters Used for the Model making:
'n_estimators': This hyperparameter specifies the number of decision trees in the random
forest. Averaging over more trees also helps reduce over-fitting.
'max_depth': Similar to a single Decision Tree, this hyperparameter determines the
maximum depth of individual decision trees within the forest.
'min_samples_split': This hyperparameter sets the minimum number of samples required to
split an internal node in each decision tree.
'min_samples_leaf': This hyperparameter sets the minimum number of samples required to
be at a leaf node in each decision tree.
'bootstrap': This hyperparameter determines whether bootstrap samples (randomly
sampled with replacement) are used when building trees.
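The variables and hyperparameters described above can be combined as in the sketch below. The report does not say which search utility was used, so RandomizedSearchCV is an assumption, as are the candidate values, the number of folds and the synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the encoded customer features and the target.
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

clf = RandomForestClassifier(random_state=42)

# Candidate values are hypothetical; only the parameter names come from the text.
param_dist = {
    "n_estimators": [10, 25, 50],
    "max_depth": [4, 8, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Stratified k-fold keeps the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = RandomizedSearchCV(
    clf,
    param_distributions=param_dist,
    n_iter=5,
    cv=cv,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```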
Decision Tree Classifier
A decision tree uses a tree-like structure of decisions along with their possible consequences and
outcomes. Each internal node represents a test on an attribute, and each branch
represents an outcome of that test. The more nodes a decision tree has, the more closely it fits the
training data, although a very large tree is also more prone to overfitting.
The advantage of decision trees is that they are intuitive and easy to implement, but a single tree is often less
accurate than an ensemble of trees.
Libraries used: sklearn.tree and sklearn.model_selection
Variables used:
tree_classifier: Stores the DecisionTreeClassifier model with maximum depth set to 12.
score: Gives an estimate of how well the Decision Tree classifier is likely to perform on
unseen data. The mean accuracy score provides a more robust evaluation metric than
a single train-test split.
y_pred: Stores the model's predictions for the Testing dataset.
cm: Stores the confusion matrix for the predictions on the Testing dataset.
y_validate_score: Stores the model's predictions for the Validation dataset.
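The variables listed above might be produced along the following lines. This sketch assumes the mean accuracy comes from 5-fold cross-validation; the synthetic data, split sizes and random seed are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project's data.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_rest, X_validate, y_rest, y_validate = train_test_split(
    X, y, test_size=100, stratify=y, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=125, stratify=y_rest, random_state=42
)

# Decision tree limited to a depth of 12, as stated in the text.
tree_classifier = DecisionTreeClassifier(max_depth=12, random_state=42)

# Mean accuracy over 5 cross-validation folds of the training set (assumed setup).
score = cross_val_score(
    tree_classifier, X_train, y_train, cv=5, scoring="accuracy"
).mean()

# Fit on the training set, then predict on the testing and validation sets.
tree_classifier.fit(X_train, y_train)
y_pred = tree_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
y_validate_score = tree_classifier.predict(X_validate)

print(round(score, 3))
print(cm)
```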
Parameters for Hyper Tuning:
These hyperparameters control various aspects of how the Decision Tree is constructed and how it
prevents overfitting.
'criterion': This hyperparameter defines the function to measure the quality of a split. The
two options are 'gini' and 'entropy'.
'max_depth': This hyperparameter determines the maximum depth of the Decision Tree. A
deeper tree can capture more complex relationships in the data, but it's also more prone to
overfitting.
'min_samples_split': This hyperparameter sets the minimum number of samples required to
split an internal node. It controls how fine-grained the tree's splits can be.
'min_samples_leaf': This hyperparameter sets the minimum number of samples required to
be at a leaf node. It prevents further splitting of nodes if the number of samples is below this
threshold.
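Tuning these four hyperparameters can be sketched with an exhaustive grid search. GridSearchCV is an assumption, since the report does not name the search utility, and the candidate values and synthetic data are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

# Candidate values are hypothetical; the parameter names come from the text.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [4, 8, 12],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

# Every combination in the grid is scored with 5-fold cross-validation.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```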
Logistic Regression
Logistic Regression is used to solve classification problems in machine learning. It is similar
to linear regression but is used to predict categorical variables. It can predict an output such as
Yes or No, 0 or 1, True or False, etc. However, rather than giving the exact class directly, it provides
probabilities between 0 and 1.
Library used: sklearn.linear_model
Parameters Used:
'C': This hyperparameter represents the inverse of the regularization strength. In Logistic
Regression, smaller values of C increase the strength of regularization, which helps prevent
overfitting by penalizing large coefficients.
'penalty': This hyperparameter specifies the type of regularization to be applied. 'l2' refers
to Ridge regularization, which adds the squared magnitudes of the coefficients to the cost
function.
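The effect of 'C' and 'penalty' can be explored as in the sketch below. The candidate values of C, the use of GridSearchCV, the max_iter setting and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

# Smaller C means stronger regularization; 'l2' is Ridge-style regularization.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l2"]}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)

# predict_proba returns one probability per class for each row.
proba = grid.predict_proba(X[:5])
print(grid.best_params_)
print(proba.round(3))
```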
Scoring
Performance and accuracy are the key criteria when comparing data analysis models.
Several metrics and methods can be used to evaluate and compare the models
on these criteria.
1. Accuracy of the Model: Accuracy is one metric for evaluating classification models.
Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has
the following definition:
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
2. Precision: Precision is the proportion of positive identifications which are actually correct.
A model that produces no false positives has a precision of 1.0.
3. Recall: Recall is the proportion of actual positives which were identified correctly.
A model that produces no false negatives has a recall of 1.0.
4. ROC curve: An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve plots two
parameters:
a. True Positive Rate (TPR)
$$\text{TPR} = \frac{TP}{TP + FN}$$
b. False Positive Rate (FPR)
$$\text{FPR} = \frac{FP}{FP + TN}$$
5. AUC: AUC stands for "Area under the ROC Curve." That is, AUC measures the entire
two-dimensional area underneath the entire ROC curve.
6. F1 Score: The F1 score is a machine learning evaluation metric that combines
the precision and recall scores of a model into a single value, their harmonic mean.
Precision measures how many of the “positive” predictions made by the model were
correct.
Recall measures how many of the positive class samples present in the dataset were
correctly identified by the model.
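The metrics defined above, together with the cross-entropy loss and Cohen's kappa that appear in the result tables, can be computed with scikit-learn as sketched below. The labels, predictions and predicted probabilities are a small hypothetical example, not project data.

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    confusion_matrix,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth, hard predictions and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.1, 0.2, 0.3, 0.2, 0.1, 0.6]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN), i.e. the TPR
specificity = tn / (tn + fp)                 # TN / (TN + FP), i.e. 1 - FPR
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
kappa = cohen_kappa_score(y_true, y_pred)

# ROC AUC and cross-entropy are designed to use probabilities, not hard labels.
roc_auc = roc_auc_score(y_true, y_score)
cross_entropy = log_loss(y_true, y_score)

for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("Specificity", specificity),
    ("F1", f1),
    ("Kappa", kappa),
    ("ROC AUC", roc_auc),
    ("Cross-entropy", cross_entropy),
]:
    print(f"{name}: {value:.4f}")
```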
For Random Forest Classifier
Metric                  Training Set   Testing        Validation
Accuracy of model       0.576834233    0.578081423    0.576057837
ROC AUC score           0.504114167    0.504861519    0.504045104
F1 score                0.050693795    0.061378305    0.051111518
Precision               0.51552795     0.510211146    0.513413993
Recall (sensitivity)    0.026657567    0.032653242    0.026894461
Specificity             0.981570767    0.977069795    0.981195747
Cross-entropy loss      15.25244023    15.20748696    15.28042437
Kappa                   0.009407478    0.011120323    0.009238167
For Decision Tree Classifier
Metric                  Training Set   Testing        Validation
Accuracy of model       0.567027105    0.567158206    0.562499496
ROC AUC score           0.509124451    0.508194397    0.508421193
F1 score                0.113061048    0.112560488    0.112729948
Precision               0.519471488    0.508265248    0.516327142
Recall (sensitivity)    0.063433565    0.063288144    0.063272086
Specificity             0.954815338    0.953100649    0.953570299
Cross-entropy loss      15.60592496    15.6011996     15.76911652
Kappa                   0.020297728    0.018259487    0.018611392
For Logistic Regression
Metric                  Training Set   Testing        Validation
Accuracy of model       0.576157842    0.577538606    0.575765375
ROC AUC score           0.50021197     0.500228168    0.500524444
F1 score                0.003141665    0.003267974    0.003955175
Precision               0.501597444    0.503401361    0.610169492
Recall (sensitivity)    0.001575767    0.001639308    0.001984018
Specificity             0.998848173    0.998817028    0.99906487
Cross-entropy loss      15.27681985    15.22705207    15.29096579
Kappa                   0.000488276    0.000526829    0.001206596
Conclusion
The internship project successfully analysed customer data to provide insights into segmentation
and churn analysis for Punjab National Bank. The findings can guide strategic decisions to enhance
customer satisfaction and bank performance.
Comparing the models on the basis of their accuracy scores:
We can conclude that the accuracy of all three models is nearly equal (about 0.56 to 0.58) and too low for a
reliable model. The ROC AUC scores of roughly 0.50 to 0.51 likewise indicate performance close to chance.
Thus we can infer that the data in the acquired dataset was inadequate and that additional variables are
required to improve the accuracy of these models.
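The comparison can be reproduced from the testing-set figures reported in the Scoring section. The sketch below simply tabulates those reported values with pandas; the variable names are hypothetical.

```python
import pandas as pd

# Testing-set metrics as reported in the Scoring section.
comparison = pd.DataFrame(
    {
        "Accuracy": [0.578081423, 0.567158206, 0.577538606],
        "ROC AUC": [0.504861519, 0.508194397, 0.500228168],
        "F1": [0.061378305, 0.112560488, 0.003267974],
    },
    index=["Random Forest", "Decision Tree", "Logistic Regression"],
)

# Rank the models by accuracy and measure how far apart they are.
print(comparison.sort_values("Accuracy", ascending=False).round(3))
best_by_accuracy = comparison["Accuracy"].idxmax()
spread = comparison["Accuracy"].max() - comparison["Accuracy"].min()
print(best_by_accuracy, round(spread, 4))
```

The gap between the best and worst accuracy is only about one percentage point, which is consistent with the conclusion that the models are practically indistinguishable on this data.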
Future Work
Future work could involve refining the churn prediction model, exploring advanced segmentation
techniques, and implementing A/B tests for the recommended strategies.
The following additional variables would be helpful for improving this project:
Occupation of the customer
Type of transaction
Medium of the transaction
Transaction Amount
Education
Credit card availability
These variables may improve the accuracy, precision and other metrics of the models.
References
https://www.kaggle.com/datasets/apoorvwatsky/bank-transaction-data
https://www.kaggle.com/datasets/shivamb/bank-customer-segmentation
https://www.w3schools.com/python/pandas/default.asp
https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
https://github.com/NishantBhavsar/data-science-internship-project/blob/master/final_report.pdf
https://sjcit.ac.in/wp-content/uploads/2022/11/1SJ18CS101-SUBHASH-K-V.pdf
https://www.studocu.com/in/document/maharshi-dayanand-university/master-of-business-admistration/internship-report/56362328