Data Analyst Interview Assignment
# Necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_columns", None)
https://deepnote.com/app/lyraxvincent/Data-Analyst-Interview-Assignment-7e8895fe-2ae6-45cf-9cd5-788364d86d1f 1/26
6/4/25, 3:51 PM Data Analyst Interview Assignment
# Load data
data = pd.read_csv("data/pilot Report as at 31st jan.csv", parse_dates=['CreateDate', 'Inv
data
# Data descriptions
data_dict = pd.read_excel("data/data dictionary.xlsx")
data_dict
Attribute Descriprion
datastand(data)
General stats:
==================
Size of DataFrame: 30667
Shape of DataFrame: (2359, 13)
Number of unique data types : {dtype('float64'), dtype('O'), dtype('int64'), dtype('<M8[ns]')}
Number of numerical columns: 3
Number of non-numerical columns: 8
Head of DataFrame:
__________________
PartnerID CreditLimit SONumber Cleared Overdue CreditUsed AmountRepaid \
0 36262 26,100 SO11705794 True False 1,464 1,464
1 36262 26,100 SO11705909 True True 146 148
2 36262 26,100 SO11780664 True False 1,650 1,650
3 36262 26,100 SO11833594 True False 8,220 8,220
4 36262 26,100 SO11909592 True False 2,080 2,080
Tail of DataFrame:
__________________
PartnerID CreditLimit SONumber Cleared Overdue CreditUsed \
2354 1298401 1,669 SO13572455 True True 870
2355 1298401 1,669 SO13572754 True True 220
2356 1298401 1,669 SO13810848 True False 1,344
Column:
SONumber
_______________
You can visualize missing data automatically right away or you can use the
function plot_missing() after importing it from DataStand. Visualize now?(y/n): y
<datastand.datastand.datastand at 0x7fbc74280a30>
def to_int(value):
    # Strip thousands separators (e.g. "26,100" -> 26100); missing values (NaN) pass through unchanged
    if str(value) != 'nan':
        value = int(''.join(str(value).split(',')))
    return value

def fix_dtypes(df):
    # Convert the comma-formatted amount columns to numbers
    df.CreditLimit = df.CreditLimit.apply(to_int)
    #df.Cleared = df.Cleared.astype(bool)
    #df.Overdue = df.Overdue.astype(bool)
    df.CreditUsed = df.CreditUsed.apply(to_int)
    df.AmountRepaid = df.AmountRepaid.apply(to_int)
    df.Balance = df.Balance.apply(to_int)
    return df
data = fix_dtypes(data)
print(data.dtypes)
PartnerID int64
CreditLimit int64
SONumber object
Cleared object
Overdue object
CreditUsed float64
AmountRepaid float64
Balance float64
Fees float64
DaysOverdue float64
CreateDate datetime64[ns]
InvoiceDate datetime64[ns]
group object
dtype: object
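The row-wise to_int conversion above could also be done column-wise. This is a minimal sketch, assuming the amount columns hold strings with comma thousands separators. The to_number helper and the toy frame are illustrative and not part of the original notebook:

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Strip thousands separators and convert to numbers; missing values stay missing."""
    cleaned = series.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "CreditLimit": ["26,100", "1,669"],
    "CreditUsed": ["1,464", None],
})
for col in ["CreditLimit", "CreditUsed"]:
    toy[col] = to_number(toy[col])

print(toy.dtypes)
```

As in the dtypes listing above, a column that contains missing values ends up as a float column, because a plain integer column cannot hold NaN.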
data.dropna(subset=['Overdue',
'CreditUsed', 'AmountRepaid', 'Balance', 'Fees', 'DaysOverdue',
'CreateDate'], axis=0, inplace=True)
# reset index
data.reset_index(drop=True, inplace=True)
data.isnull().sum()
PartnerID 0
CreditLimit 0
SONumber 72
Cleared 44
Overdue 0
CreditUsed 0
AmountRepaid 0
Balance 0
Fees 0
DaysOverdue 0
CreateDate 0
InvoiceDate 0
group 0
dtype: int64
data[data.Cleared.isna()].sample(5, random_state=101)
data[data.Fees == 0].count()[0]
1351
Two columns still have missing values: SONumber and Cleared. Since SONumber is only an index (identifier) column,
Cleared is the only column we need to fill.
We fill it using information from other columns, in this case CreditUsed and AmountRepaid.
If the amount repaid equals the credit used (zero-fee loans) or exceeds it (loans with a fee charged), the
customer has repaid the loan.
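The cell that performed this fill is not shown in the export, so the following is only a sketch of how the rule could be applied. The toy frame and the names inferred and missing are assumptions for illustration:

```python
import pandas as pd

# Toy rows (illustrative values); the last row already has a Cleared value
toy = pd.DataFrame({
    "CreditUsed":   [1464.0, 146.0, 870.0, 220.0],
    "AmountRepaid": [1464.0, 148.0, 500.0, 220.0],
    "Cleared":      [None, None, None, True],
})

# Rule from the text: repaid >= used means the loan was repaid
inferred = toy["AmountRepaid"] >= toy["CreditUsed"]

# Only fill rows where Cleared is missing; existing values are kept
missing = toy["Cleared"].isna()
toy.loc[missing, "Cleared"] = inferred[missing]

print(toy)
```

Restricting the assignment to the missing rows keeps any Cleared values already recorded in the data.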
Feature Engineering
Design features from the already available ones:
def feature_eng(df):
    # datetime features
    df['CreateDate_year'] = df.CreateDate.dt.year.astype(int)
    df['CreateDate_month'] = df.CreateDate.dt.month.astype(int)
    df['CreateDate_day'] = df.CreateDate.dt.day.astype(int)
    df['CreateDate_dayname'] = df.CreateDate.dt.day_name()
    df['InvoiceDate_year'] = df.InvoiceDate.dt.year.astype(int)
    df['InvoiceDate_month'] = df.InvoiceDate.dt.month.astype(int)
    df['InvoiceDate_day'] = df.InvoiceDate.dt.day.astype(int)
    df['InvoiceDate_dayname'] = df.InvoiceDate.dt.day_name()
    return df
data = feature_eng(data)
data.head()
# as counts
print(data.Cleared.value_counts())
# as a percentage
print(data.Cleared.value_counts()*100 / len(data))
plt.figure(figsize=(10,6))
sns.countplot(x=data.Cleared)
plt.title("Distribution of Cleared status")
True 1937
False 400
Name: Cleared, dtype: int64
True 82.884039
False 17.115961
Name: Cleared, dtype: float64
# as counts
print(data.Overdue.value_counts())
# as a percentage
print(data.Overdue.value_counts()*100 / len(data))
plt.figure(figsize=(10,6))
sns.countplot(x=data.Overdue)
plt.title("Distribution of Overdue status")
False 1351
True 986
Name: Overdue, dtype: int64
False 57.809157
True 42.190843
Name: Overdue, dtype: float64
loan_times_dict = data.groupby('PartnerID')['SONumber'].count().to_dict()
# sort dictionary
marklist = sorted(loan_times_dict.items(), key=lambda x:x[1], reverse=True)
loan_times_dict = dict(marklist)
388436 : 122
363796 : 116
548447 : 86
437063 : 84
400649 : 78
105975 : 73
303101 : 60
302148 : 54
410793 : 52
340828 : 44
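The same ranking can be produced without building and re-sorting a dictionary. This is a sketch on toy data, with illustrative IDs and order numbers:

```python
import pandas as pd

# Toy data: partner 2 has one order with a missing SONumber
toy = pd.DataFrame({
    "PartnerID": [1, 1, 1, 2, 2, 2, 3],
    "SONumber": ["SO1", "SO2", "SO3", "SO4", "SO5", None, "SO7"],
})

# count() skips missing SONumber values; size() would count every row instead
loan_counts = (
    toy.groupby("PartnerID")["SONumber"]
    .count()
    .sort_values(ascending=False)
)

top = loan_counts.head(2)
for partner_id, n_loans in top.items():
    print(f"{partner_id} : {n_loans}")
```

Because count() ignores missing values, orders without an SONumber are not included in a customer's total.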
PartnerID Total_balance
0 398110 27739.0
1 274819 26048.0
2 805373 20759.0
3 105975 19337.0
4 548447 18940.0
defaulting_customers.head(10)
PartnerID Total_balance
0 398110 27739.0
1 274819 26048.0
2 805373 20759.0
3 105975 19337.0
4 548447 18940.0
5 400649 15790.0
6 437063 12970.0
7 668827 11477.0
8 351590 10022.0
9 174216 9802.0
Out of 190 customers, 108 have not yet cleared their loans, and 9 of them have total pending balances
above KShs. 10,000.
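The cell that built defaulting_customers is not shown in the export. One way such a table and the counts quoted here could be derived is sketched below, assuming that "not cleared" means a positive Balance (the notebook may have used the Cleared flag instead). The toy frame and variable names are illustrative:

```python
import pandas as pd

# Toy data (illustrative values): one row per loan order
toy = pd.DataFrame({
    "PartnerID": [10, 10, 20, 30, 30],
    "Balance":   [0.0, 6000.0, 0.0, 7000.0, 5000.0],
})

# Keep only orders with money still owed, then total them per customer
outstanding = toy[toy["Balance"] > 0]
defaulting = (
    outstanding.groupby("PartnerID")["Balance"]
    .sum()
    .sort_values(ascending=False)
    .rename("Total_balance")
    .reset_index()
)

n_customers = toy["PartnerID"].nunique()   # all customers
n_defaulting = len(defaulting)             # customers with a pending balance
n_above = int((defaulting["Total_balance"] > 10_000).sum())

print(defaulting)
print(n_customers, n_defaulting, n_above)
```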
15
plt.figure(figsize=(15,8))
sns.lineplot(x='CreateDate', y='CreditUsed', data=data.sort_values(by='CreateDate'),
hue='Cleared')
plt.title("Credit usage over time")
plt.figure(figsize=(10,6))
sns.countplot(y=data.CreateDate_dayname,
order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
plt.title("Number of Loan orders day-wise(CreateDate)")
plt.figure(figsize=(10,6))
sns.countplot(y=data.InvoiceDate_dayname,
order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
plt.title("Number of Invoices day-wise(InvoiceDate)")
Most loan orders are made on Thursday and Saturday while most invoices are sent out on Wednesday.
Wednesdays, Thursdays and Saturdays are the busy days in a week.
data.CreateDate_year.value_counts()
2021 2072
2022 265
Name: CreateDate_year, dtype: int64
data.CreateDate_month.value_counts()
11 790
10 772
12 510
1 265
Name: CreateDate_month, dtype: int64
data.InvoiceDate_month.value_counts()
11 783
10 700
12 577
1 277
Name: InvoiceDate_month, dtype: int64
The dataset spans 4 months: October, November and December 2021, and January 2022.
January orders dropped to roughly half of December's (265 against 510), and the most orders were made in
November 2021 (790).
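The month-on-month comparison can be checked directly from the CreateDate_month counts printed above. The year-month labels are added here for readability:

```python
# Monthly order counts copied from the CreateDate_month output
orders = {"2021-10": 772, "2021-11": 790, "2021-12": 510, "2022-01": 265}

# January orders as a fraction of December orders
jan_vs_dec = orders["2022-01"] / orders["2021-12"]

# Month with the most orders
peak_month = max(orders, key=orders.get)

print(f"January / December: {jan_vs_dec:.2f}")
print(f"Peak month: {peak_month}")
```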
data.CrtInv_dateDiff.value_counts()
2 1452
3 540
4 266
6 30
7 19
9 16
5 14
Name: CrtInv_dateDiff, dtype: int64
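The definition of CrtInv_dateDiff is not shown in the export. A plausible construction, assuming it is the number of whole days from CreateDate to InvoiceDate, is sketched below on toy dates:

```python
import pandas as pd

# Toy dates (illustrative only)
toy = pd.DataFrame({
    "CreateDate":  pd.to_datetime(["2021-10-01", "2021-11-15", "2022-01-03"]),
    "InvoiceDate": pd.to_datetime(["2021-10-03", "2021-11-18", "2022-01-07"]),
})

# Subtracting datetime columns gives timedeltas; .dt.days extracts whole days
toy["CrtInv_dateDiff"] = (toy["InvoiceDate"] - toy["CreateDate"]).dt.days

print(toy["CrtInv_dateDiff"].value_counts())
```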
cols, rows = 3, 2
print(data[['CreditLimit', 'CreditUsed']].corr())
sns.heatmap(data[['CreditLimit', 'CreditUsed']].corr())
CreditLimit CreditUsed
CreditLimit 1.000000 0.360038
CreditUsed 0.360038 1.000000
<AxesSubplot:>
Normally we would expect a customer with a higher credit limit to borrow more. The correlation above (about 0.36)
is only weakly positive: a higher credit limit tends to go with larger borrowed amounts, but raising a
customer's credit limit does not necessarily mean they will borrow more.
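As a rough guide to how weak this relationship is, squaring the Pearson coefficient gives the share of variance in one variable that a straight-line fit on the other would account for. A short calculation using the value printed above:

```python
# Pearson correlation between CreditLimit and CreditUsed, from the output above
r = 0.360038

# Coefficient of determination for a simple linear fit
r_squared = r ** 2

print(f"r^2 = {r_squared:.3f}")
```

Under that reading, credit limit accounts for only about 13% of the variation in credit used.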
<seaborn.axisgrid.JointGrid at 0x7fbc2e9d1bb0>
Small loan amounts are concentrated among customers with small credit limits, which suggests that most borrowers are small-scale.
We finish the EDA with a pairplot, to see whether plotting more than two variables together gives further insights, and
a pandas profile report to summarise everything.
<seaborn.axisgrid.PairGrid at 0x7fbc2e947d60>
We see a linear relationship between the CreditUsed and Balance columns for rows where the balance is not zero.
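This observation could be checked numerically by computing the correlation on the non-zero-balance rows only. This is a sketch on toy data, with illustrative values and variable names:

```python
import pandas as pd

# Toy data: unpaid orders carry the full amount as balance, repaid ones carry zero
toy = pd.DataFrame({
    "CreditUsed": [1000.0, 2000.0, 3000.0, 1500.0, 2500.0],
    "Balance":    [1000.0, 2000.0, 3000.0, 0.0, 0.0],
})

# Restrict to rows with a non-zero balance before correlating
unpaid = toy[toy["Balance"] != 0]
r_unpaid = unpaid["CreditUsed"].corr(unpaid["Balance"])

# For comparison: correlation over all rows, including cleared loans
r_all = toy["CreditUsed"].corr(toy["Balance"])

print(f"non-zero balance: {r_unpaid:.2f}, all rows: {r_all:.2f}")
```

Including the zero-balance rows pulls the coefficient down, which is why the filter matters when quantifying the relationship.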
ProfileReport(data)
Overview
Dataset statistics
Number of variables 24
Missing cells 72
Duplicate rows 0
Variable types
Numeric 10
Categorical 10