Data Analyst Interview Assignment

The document outlines a data analysis assignment for a Data Analyst interview, focusing on a dataset of loans issued to customers as of January 31st. It includes instructions for data loading, cleaning, imputation, and feature engineering, along with necessary Python imports and data descriptions. The assignment emphasizes visualizing key aspects of the dataset to derive insights on loan statuses and repayment behaviors.

Uploaded by

邓雯卿

6/4/25, 3:51 PM Data Analyst Interview Assignment

Data Analyst Interview Assignment

Data Analysis - Pilot report as at 31st Jan


Assignment Task:
Using the attached dataset (pilot report as at 31st jan.csv), analyse the data and visualize the most important
aspects using your preferred method. This dataset contains information on loans that have been issued to
customers and their status as at 31st January. The attached data dictionary (data dictionary.xlsx) explains
the different attributes.

# Necessary imports
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)

import matplotlib.pyplot as plt


%matplotlib inline
import seaborn as sns
sns.set_style("darkgrid")

from tqdm import tqdm, trange


import warnings; warnings.filterwarnings("ignore")

https://deepnote.com/app/lyraxvincent/Data-Analyst-Interview-Assignment-7e8895fe-2ae6-45cf-9cd5-788364d86d1f 1/26

# Load data
data = pd.read_csv("data/pilot Report as at 31st jan.csv", parse_dates=['CreateDate', 'Inv
data
 

PartnerID CreditLimit SONumber Cleared Overdue CreditUsed Amo

0 36262 26,100 SO11705794 True False 1,464 1,4

1 36262 26,100 SO11705909 True True 146 148

2 36262 26,100 SO11780664 True False 1,650 1,6

3 36262 26,100 SO11833594 True False 8,220 8,2

4 36262 26,100 SO11909592 True False 2,080 2,0

... ... ... ... ... ... ... ...

2354 1298401 1,669 SO13572455 True True 870 875

2355 1298401 1,669 SO13572754 True True 220 222

2356 1298401 1,669 SO13810848 True False 1,344 1,3

2357 1298401 1,669 SO13810938 True False 236 236

2358 1298401 1,669 SO14061060 False False 1,320 0

2359 rows × 13 columns


# Data descriptions
data_dict = pd.read_excel("data/data dictionary.xlsx")
data_dict

Attribute Descriprion

0 PartnerID Customer Unique Identifier

1 CreditLimit Maximum amount a customer can borrow at a give...

2 SONumber Unique loan identifier

3 Cleared Loan Status\nTrue = Loan has been paid\nFalse ...

4 Overdue Loan Tenure Status\nTrue = Loan has exceeded i...

5 CreditUsed Total Amount borrowed

6 AmountRepaid Total Loan amount paid back

7 Balance CreditUsed - AmountRepaid

8 Fees Fees accrued from late repayment

9 DaysOverdue Number of days the loan is overdue by

10 CreatedDate Date order was placed

11 InvoiceDate Date order was delivered


# Correct the misspelled column name and print the descriptions in full
data_dict.columns = ['Attribute', 'Description']

for row in data_dict.index:
    print(f"{data_dict.loc[row, 'Attribute']} : {data_dict.loc[row, 'Description']}\n")

PartnerID : Customer Unique Identifier

CreditLimit : Maximum amount a customer can borrow at a given time

SONumber : Unique loan identifier

Cleared : Loan Status


True = Loan has been paid
False = Loan is still pending

Overdue : Loan Tenure Status


True = Loan has exceeded its repayment days
False = Loan is still within its repayment days

CreditUsed : Total Amount borrowed

AmountRepaid : Total Loan amount paid back

Balance : CreditUsed - AmountRepaid

Fees : Fees accrued from late repayment

DaysOverdue : Number of days the loan is overdue by

CreatedDate : Date order was placed

InvoiceDate : Date order was delivered

General Data Statistics


Preview general dataset statistics using the datastand package.
Source code: https://github.com/lyraxvincent/datastand/blob/master/datastand/datastand.py


from datastand.datastand import datastand

datastand(data)


General stats:
==================
Size of DataFrame: 30667
Shape of DataFrame: (2359, 13)
Number of unique data types : {dtype('float64'), dtype('O'), dtype('int64'), dtype('<M8[ns]')}
Number of numerical columns: 3
Number of non-numerical columns: 8

Head of DataFrame:
__________________
PartnerID CreditLimit SONumber Cleared Overdue CreditUsed AmountRepaid \
0 36262 26,100 SO11705794 True False 1,464 1,464
1 36262 26,100 SO11705909 True True 146 148
2 36262 26,100 SO11780664 True False 1,650 1,650
3 36262 26,100 SO11833594 True False 8,220 8,220
4 36262 26,100 SO11909592 True False 2,080 2,080

Balance Fees DaysOverdue CreateDate InvoiceDate group


0 0 0.0 0.0 2021-10-15 2021-10-18 Test
1 0 2.0 0.0 2021-10-15 2021-10-18 Test
2 0 0.0 0.0 2021-10-19 2021-10-21 Test
3 0 0.0 0.0 2021-10-22 2021-10-25 Test
4 0 0.0 0.0 2021-10-27 2021-10-29 Test

Tail of DataFrame:
__________________
PartnerID CreditLimit SONumber Cleared Overdue CreditUsed \
2354 1298401 1,669 SO13572455 True True 870
2355 1298401 1,669 SO13572754 True True 220
2356 1298401 1,669 SO13810848 True False 1,344 

Do you wish to long-list missing data statistics?(y/n): y

Column:
SONumber
_______________

Missing data points 94 out of total 2359.


Most occurring value: SO11554320, count: 1
Column:
Cleared
_______________


Missing data points 66 out of total 2359. 

Most occurring value: True, count: 400


Column:
Overdue
_______________

Missing data points 22 out of total 2359.


Most occurring value: False, count: 1351
Column:
CreditUsed
_______________

Missing data points 22 out of total 2359.


Most occurring value: 860, count: 61
Column:
AmountRepaid
_______________

You can visualize missing data automatically right away or you can use the
function plot_missing() after importing it from DataStand. Visualize now?(y/n): y


<datastand.datastand.datastand at 0x7fbc74280a30>

Data Cleaning and Imputation


Fix inconsistent data types
Deal with missing values:
Drop rows that have missing values in almost all columns
Fill other missing data points

# fix data types

def to_int(value):
    # Strip thousands separators (e.g. "26,100" -> 26100); leave missing values (NaN) untouched
    if str(value) != 'nan':
        value = int(''.join(str(value).split(',')))
    return value

def fix_dtypes(df):
    df.CreditLimit = df.CreditLimit.apply(to_int)
    #df.Cleared = df.Cleared.astype(bool)
    #df.Overdue = df.Overdue.astype(bool)
    df.CreditUsed = df.CreditUsed.apply(to_int)
    df.AmountRepaid = df.AmountRepaid.apply(to_int)
    df.Balance = df.Balance.apply(to_int)
    return df

data = fix_dtypes(data)
print(data.dtypes)

PartnerID int64
CreditLimit int64
SONumber object
Cleared object
Overdue object
CreditUsed float64
AmountRepaid float64
Balance float64
Fees float64
DaysOverdue float64
CreateDate datetime64[ns]
InvoiceDate datetime64[ns]
group object
dtype: object
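
The same conversion can be sketched without a row-wise `apply`. This is a minimal illustration on a made-up toy frame, not the notebook's own method; the helper name `strip_thousands` and the sample values are assumptions for the example.

```python
import pandas as pd

# Toy frame mimicking the raw string columns; values are illustrative only.
raw = pd.DataFrame({
    "CreditLimit": ["26,100", "1,669", "957"],
    "CreditUsed": ["1,464", None, "146"],
})

def strip_thousands(series: pd.Series) -> pd.Series:
    """Remove comma separators and convert to numbers; missing values become NaN."""
    return pd.to_numeric(series.str.replace(",", "", regex=False))

# Apply the helper to every column of the toy frame
cleaned = raw.apply(strip_thousands)
print(cleaned.dtypes)
```

If the separators are known up front, `pd.read_csv` also accepts a `thousands=","` argument, which would parse such columns as numbers at load time.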


Missing values imputation:


First, drop the rows with missing data points in several columns. These are the 22 rows with missing values
in the columns from Overdue up to CreateDate, as shown in the general stats output above.

data.dropna(subset=['Overdue',
'CreditUsed', 'AmountRepaid', 'Balance', 'Fees', 'DaysOverdue',
'CreateDate'], axis=0, inplace=True)

# reset index
data.reset_index(drop=True, inplace=True)

data.isnull().sum()

PartnerID 0
CreditLimit 0
SONumber 72
Cleared 44
Overdue 0
CreditUsed 0
AmountRepaid 0
Balance 0
Fees 0
DaysOverdue 0
CreateDate 0
InvoiceDate 0
group 0
dtype: int64
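
One way to check that the dropped rows really are the ones missing a whole block of columns is to count missing values per row. The sketch below uses a small made-up frame; the column subset and values are assumptions for illustration, not the actual data.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 1 and 3 are missing the whole block of loan columns.
toy = pd.DataFrame({
    "PartnerID": [1, 2, 3, 4],
    "Overdue": [True, np.nan, False, np.nan],
    "CreditUsed": [100.0, np.nan, 250.0, np.nan],
    "Fees": [0.0, np.nan, 2.0, np.nan],
    "Cleared": [True, np.nan, np.nan, np.nan],
})

block = ["Overdue", "CreditUsed", "Fees"]

# Number of missing values per row within the block of columns
missing_per_row = toy[block].isna().sum(axis=1)

# Rows where the entire block is missing are the candidates to drop
all_missing = missing_per_row == len(block)
trimmed = toy[~all_missing].reset_index(drop=True)

print(all_missing.sum(), "rows dropped;", len(trimmed), "rows kept")
```

Rows that only miss a single value (such as `Cleared` in row 2 of the toy frame) survive this step and are left for imputation.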

data[data.Cleared.isna()].sample(5, random_state=101)

PartnerID CreditLimit SONumber Cleared Overdue CreditUsed Amoun

73 60592 13100 NaN NaN True 1050.0 1053.

1853 370338 3556 NaN NaN True 2910.0 3270.

1794 47288 1975 NaN NaN True 1875.0 1935.

916 363796 16000 NaN NaN True 860.0 869.0

2218 855202 1142 NaN NaN True 445.0 473.0


(data.Fees == 0).sum()

1351

Two columns still have missing data points. SONumber is only a loan identifier, so the Cleared column is the
only one left to fill.
We use information from other columns to fill it, in this case the CreditUsed and AmountRepaid
columns.
If the amount repaid equals the credit used (no fees charged) or exceeds it (fees charged), the customer
has repaid the loan.

for idx in tqdm(data[data.Cleared.isna()].index):
    if data.loc[idx, 'AmountRepaid'] >= data.loc[idx, 'CreditUsed']:
        data.loc[idx, 'Cleared'] = True
    else:
        data.loc[idx, 'Cleared'] = False
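
The same rule can be applied without an explicit loop. This is a minimal sketch on a made-up toy frame (values are illustrative), assuming the columns are named as in the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame: two loans have an unknown Cleared status.
toy = pd.DataFrame({
    "Cleared": [True, np.nan, np.nan, False],
    "CreditUsed": [1464.0, 1050.0, 500.0, 1320.0],
    "AmountRepaid": [1464.0, 1053.0, 120.0, 0.0],
})

# Rows whose Cleared status is unknown
missing = toy["Cleared"].isna()

# A loan counts as cleared when the amount repaid covers the credit used
toy.loc[missing, "Cleared"] = (
    toy.loc[missing, "AmountRepaid"] >= toy.loc[missing, "CreditUsed"]
)

# With no missing values left, the column can be stored as a real boolean
toy["Cleared"] = toy["Cleared"].astype(bool)
print(toy)
```

Converting to a boolean dtype at the end is a design choice of this sketch; it makes later filters such as `~toy["Cleared"]` behave as expected.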

 

# Check for duplicates


data[data.duplicated(subset=['CreditLimit', 'SONumber', 'Cleared', 'Overdue',
'CreditUsed', 'AmountRepaid', 'Balance', 'Fees', 'DaysOverdue',
'CreateDate', 'InvoiceDate', 'group'])]

PartnerID CreditLimit SONumber Cleared Overdue CreditUsed AmountRep

Feature Engineering
Design features from the already available ones:


def feature_eng(df):
    # datetime features
    df['CreateDate_year'] = df.CreateDate.dt.year.astype(int)
    df['CreateDate_month'] = df.CreateDate.dt.month.astype(int)
    df['CreateDate_day'] = df.CreateDate.dt.day.astype(int)
    df['CreateDate_dayname'] = df.CreateDate.dt.day_name()

    df['InvoiceDate_year'] = df.InvoiceDate.dt.year.astype(int)
    df['InvoiceDate_month'] = df.InvoiceDate.dt.month.astype(int)
    df['InvoiceDate_day'] = df.InvoiceDate.dt.day.astype(int)
    df['InvoiceDate_dayname'] = df.InvoiceDate.dt.day_name()

    # time difference in days between create date and invoice date
    df['CrtInv_dateDiff'] = (df.InvoiceDate - df.CreateDate).apply(lambda x: int(str(x).sp

    # binary category columns for cleared and overdue columns
    df['cleared_cat'] = df.Cleared.map({True: 'Cleared', False: 'Not Cleared'})
    df['overdue_cat'] = df.Overdue.map({True: 'Overdue', False: 'On time'})

    return df

data = feature_eng(data)
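
The day difference between the two dates can also be read directly from the timedelta accessor instead of parsing its string form. This is a sketch of an alternative, not the notebook's exact code; the toy dates below are the first few CreateDate/InvoiceDate pairs shown in the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "CreateDate": pd.to_datetime(["2021-10-15", "2021-10-19", "2021-10-22"]),
    "InvoiceDate": pd.to_datetime(["2021-10-18", "2021-10-21", "2021-10-25"]),
})

# Subtracting two datetime columns gives a timedelta column;
# .dt.days extracts the whole number of days as integers.
toy["CrtInv_dateDiff"] = (toy["InvoiceDate"] - toy["CreateDate"]).dt.days
print(toy)
```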
 

data.head()

PartnerID CreditLimit SONumber Cleared Overdue CreditUsed Amount

0 36262 26100 SO11705794 True False 1464.0 1464.0

1 36262 26100 SO11705909 True True 146.0 148.0

2 36262 26100 SO11780664 True False 1650.0 1650.0

3 36262 26100 SO11833594 True False 8220.0 8220.0

4 36262 26100 SO11909592 True False 2080.0 2080.0

Exploratory Data Analysis


Univariate Analysis
Studying selected columns one by one:


# Total number of customers spanned in this dataset


print(f"Total Customers: {data.PartnerID.nunique()}")

Total Customers: 190

# Distribution of Cleared column


print(data.Cleared.value_counts())

# as a percentage
print(data.Cleared.value_counts()*100 / len(data))
plt.figure(figsize=(10,6))
sns.countplot(x=data.Cleared)
plt.title("Distribution of Cleared status")

True 1937
False 400
Name: Cleared, dtype: int64
True 82.884039
False 17.115961
Name: Cleared, dtype: float64

Text(0.5, 1.0, 'Distribution of Cleared status')

Most loans (about 83%) have been cleared.
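
The percentages above are computed per loan, since each row is one loan. A per-customer figure needs a grouping step first. The sketch below contrasts the two on a made-up toy frame; the values are illustrative only.

```python
import pandas as pd

# Toy data: customer 1 cleared everything, customers 2 and 3 did not.
toy = pd.DataFrame({
    "PartnerID": [1, 1, 1, 2, 2, 3],
    "Cleared": [True, True, True, True, False, False],
})

# Share of loans that are cleared (row level)
loan_share = toy["Cleared"].mean() * 100

# Share of customers whose loans are all cleared (customer level)
all_cleared = toy.groupby("PartnerID")["Cleared"].all()
customer_share = all_cleared.mean() * 100

print(f"Cleared loans: {loan_share:.2f}%")
print(f"Customers with all loans cleared: {customer_share:.2f}%")
```

In this toy example two thirds of the loans are cleared, yet only one customer in three has cleared every loan, so the two figures should not be used interchangeably.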


# Distribution of Overdue column


print(data.Overdue.value_counts())

# as a percentage
print(data.Overdue.value_counts()*100 / len(data))
plt.figure(figsize=(10,6))
sns.countplot(x=data.Overdue)
plt.title("Distribution of Overdue status")

False 1351
True 986
Name: Overdue, dtype: int64
False 57.809157
True 42.190843
Name: Overdue, dtype: float64

Text(0.5, 1.0, 'Distribution of Overdue status')

Most loans (about 58%) are not overdue.


The 10 customers who have taken the most loans:


loan_times_dict = data.groupby('PartnerID')['SONumber'].count().to_dict()

# sort dictionary
marklist = sorted(loan_times_dict.items(), key=lambda x:x[1], reverse=True)
loan_times_dict = dict(marklist)

for key, val in zip(list(loan_times_dict.keys())[:10], list(loan_times_dict.values())[:10]
    print(f"{key} : {val}")
 

388436 : 122
363796 : 116
548447 : 86
437063 : 84
400649 : 78
105975 : 73
303101 : 60
302148 : 54
410793 : 52
340828 : 44
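
A ranking like the one above can also be obtained without building and sorting a dictionary. This is a minimal sketch on a toy frame; the SONumber values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "PartnerID": [36262, 36262, 36262, 60592, 60592, 60592, 1298401],
    "SONumber": ["SO1", "SO2", "SO3", "SO4", "SO5", None, "SO7"],
})

# count() ignores missing SONumber values, as in the groupby used above
loans_per_customer = toy.groupby("PartnerID")["SONumber"].count()

# nlargest returns the top entries already sorted in descending order
top = loans_per_customer.nlargest(2)

for partner_id, n_loans in top.items():
    print(f"{partner_id} : {n_loans}")
```

Note that loans with a missing SONumber are not counted here; counting rows with `size()` instead of `count()` would include them.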

Customers that are still defaulting on their loans:


defaulting_customers = pd.DataFrame(sorted(data[data.Cleared == False].groupby('PartnerID'


key=lambda x:x[1], reverse=True),
columns=['PartnerID', 'Total_balance'])
defaulting_customers = defaulting_customers[defaulting_customers.Total_balance > 0]
defaulting_customers
 

PartnerID Total_balance

0 398110 27739.0

1 274819 26048.0

2 805373 20759.0

3 105975 19337.0

4 548447 18940.0

... ... ...

103 360373 85.0

104 171681 67.0

105 309779 20.0

106 538014 18.0

107 65627 5.0

108 rows × 2 columns

defaulting_customers.head(10)

PartnerID Total_balance

0 398110 27739.0

1 274819 26048.0

2 805373 20759.0

3 105975 19337.0

4 548447 18940.0

5 400649 15790.0

6 437063 12970.0

7 668827 11477.0

8 351590 10022.0

9 174216 9802.0


Out of 190 customers, 108 still have uncleared loans with an outstanding balance, and 9 of them have total
pending balances above KShs. 10,000.
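
The two counts quoted above can be derived with a filter, a groupby and a threshold. The sketch below shows the idea on a made-up toy frame (values are illustrative), assuming the column names used in the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "PartnerID": [1, 1, 2, 2, 3],
    "Cleared": [False, True, False, False, True],
    "Balance": [5000.0, 0.0, 7000.0, 4500.0, 0.0],
})

# Cast to bool first: on an object column, ~ would not negate Python booleans correctly
is_open = ~toy["Cleared"].astype(bool)

# Total outstanding balance per customer, largest first
total_balance = (
    toy[is_open]
    .groupby("PartnerID")["Balance"]
    .sum()
    .sort_values(ascending=False)
)
total_balance = total_balance[total_balance > 0]

n_defaulting = len(total_balance)
n_above_10k = int((total_balance > 10_000).sum())
print(n_defaulting, "customers with open balances,", n_above_10k, "above 10,000")
```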

Loans that are long overdue:

data[data.DaysOverdue >= 100]#.count()[0]

PartnerID CreditLimit SONumber Cleared Overdue CreditUsed Amo

63 58981 11200 SO11566125 False True 540.0 0.0

64 58981 11200 SO11566450 False True 324.0 0.0

288 174216 12700 SO11671068 False True 995.0 31.

646 334599 10500 SO11593867 False True 2626.0 0.0

647 334599 10500 SO11640698 False True 1081.0 0.0

1646 668827 18500 SO11640140 False True 2120.0 324

1718 964932 6200 SO11563593 False True 178.0 0.0

1719 964932 6200 SO11633125 False True 272.0 0.0

1720 964932 6200 SO11633163 False True 106.0 0.0

1721 964932 6200 SO11634061 False True 378.0 0.0

1722 964932 6200 SO11635044 False True 403.0 0.0

1723 964932 6200 SO11635099 False True 255.0 0.0

1724 964932 6200 SO11635347 False True 769.0 0.0

1821 309779 957 SO11667635 False True 220.0 220

1973 519688 1925 SO11664370 False True 89.0 89.

len(data[data.DaysOverdue >= 100])

15

Credit usage over time:


plt.figure(figsize=(15,8))
sns.lineplot(x='CreateDate', y='CreditUsed', data=data.sort_values(by='CreateDate'),
hue='Cleared')
plt.title("Credit usage over time")

Text(0.5, 1.0, 'Credit usage over time')

Investigate the spike in credit used that is not yet cleared:

data[(data.CreateDate > pd.to_datetime('2021-12-01')) & (data.CreateDate < pd.to_datetime(


(data.CreditUsed > 6000) & (data.Cleared == False)]
 

PartnerID CreditLimit SONumber Cleared Overdue CreditUsed Amou

990 384931 14900 SO12926220 False True 13152.0 1322

It is partly paid, with a small remaining balance of KShs. 127.


Days when most loan orders are created and when they are invoiced:

plt.figure(figsize=(10,6))
sns.countplot(y=data.CreateDate_dayname,
order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
plt.title("Number of Loan orders day-wise(CreateDate)")

Text(0.5, 1.0, 'Number of Loan orders day-wise(CreateDate)')


plt.figure(figsize=(10,6))
sns.countplot(y=data.InvoiceDate_dayname,
order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
plt.title("Number of Invoices day-wise(InvoiceDate)")

Text(0.5, 1.0, 'Number of Invoices day-wise(InvoiceDate)')

Most loan orders are made on Thursday and Saturday, while most invoices are sent out on Wednesday.
Wednesdays, Thursdays and Saturdays are the busiest days of the week.

data.CreateDate_year.value_counts()

2021 2072
2022 265
Name: CreateDate_year, dtype: int64

data.CreateDate_month.value_counts()

11 790
10 772
12 510
1 265
Name: CreateDate_month, dtype: int64


data.InvoiceDate_month.value_counts()

11 783
10 700
12 577
1 277
Name: InvoiceDate_month, dtype: int64

The dataset spans 4 months: October, November and December 2021, and January 2022.
January orders dropped to about half of December's (265 vs. 510), with the most orders made in
November 2021.
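
Counting by month number alone works here because the data covers a single October-to-January window, but it would merge the same month from different years in a longer dataset. A hedged sketch of a year-aware count, on made-up toy dates, followed by the January-to-December ratio computed from the counts printed above:

```python
import pandas as pd

toy = pd.DataFrame({
    "CreateDate": pd.to_datetime(
        ["2021-10-15", "2021-11-02", "2021-11-20", "2021-12-05", "2022-01-10"]
    )
})

# Year-month periods keep January 2022 distinct from any other January
orders_per_month = toy["CreateDate"].dt.to_period("M").value_counts().sort_index()
print(orders_per_month)

# Ratio of January to December orders, using the counts from the output above
jan_vs_dec = 265 / 510
print(f"January orders are {jan_vs_dec:.0%} of December orders")
```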

Difference in days from create date to invoice date:

data.CrtInv_dateDiff.value_counts()

2 1452
3 540
4 266
6 30
7 19
9 16
5 14
Name: CrtInv_dateDiff, dtype: int64


fig, ax = plt.subplots(1,2, figsize=(14,6))


ax[0].pie(data.CrtInv_dateDiff.value_counts(), labels=['2 days','3 days','4 days','6 days'
sns.countplot(y=data.CrtInv_dateDiff, ax=ax[1])
fig.suptitle("Difference in days from create date to invoice date")
 

Text(0.5, 0.98, 'Difference in days from create date to invoice date')

The shortest time it takes for a loan order to be invoiced is 2 days, which is also the most common case
(1452 orders).
In the 16 worst cases, orders took 9 days to be invoiced; in total, 35 orders took a week or longer
(7 or 9 days).

Bivariate Analysis
Studying relationships between columns:
Boxplots to visualize quartiles and their ranges as well as detect outliers:


cols, rows = 3, 2

fig, axes = plt.subplots(rows, cols, figsize=(16,12))

columns = ['CreditLimit', 'CreditUsed', 'AmountRepaid', 'Balance', 'Fees', 'DaysOverdue']

# draw one boxplot per numerical column on the grid of axes created above
for ax, col in zip(axes.flat, columns):
    sns.boxplot(x='cleared_cat', y=col, data=data, ax=ax)
    ax.set_title(f"{col}")

fig.suptitle("Numerical columns in relation to Cleared status")

plt.show()
 

Inspecting correlation of credit limit and credit used:



print(data[['CreditLimit', 'CreditUsed']].corr())
sns.heatmap(data[['CreditLimit', 'CreditUsed']].corr())

CreditLimit CreditUsed
CreditLimit 1.000000 0.360038
CreditUsed 0.360038 1.000000

<AxesSubplot:>

Normally we would expect a customer with a high credit limit to borrow more. The above heatmap shows a
weak positive correlation (about 0.36): customers with higher limits tend to borrow somewhat more, but a
higher credit limit does not reliably translate into a larger loan.
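
One way to put the strength of this correlation in perspective is to square it. Under a simple linear fit, the squared Pearson coefficient is the share of variance in one variable explained by the other. The short calculation below uses the coefficient printed above.

```python
# Pearson correlation between CreditLimit and CreditUsed, from the output above
r = 0.360038

# For a simple linear regression, r squared is the share of variance explained
r_squared = r ** 2
print(f"r^2 = {r_squared:.3f}")
```

So, read as a linear relationship, the credit limit would account for only about 13% of the variation in the amount borrowed.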

Further credit limit vs credit used analysis:


sns.jointplot(x='CreditLimit', y='CreditUsed', data=data, kind='reg')

<seaborn.axisgrid.JointGrid at 0x7fbc2e9d1bb0>

Loans cluster at low credit limits and small amounts, which suggests that most borrowers are small-scale.

Finish off EDA with a pairplot to see if we gain insights from an overall plot with more than two variables and
a pandas profile report to summarise everything.


sns.pairplot(data, vars=['CreditLimit', 'CreditUsed', 'AmountRepaid', 'Balance', 'Fees', '


 

<seaborn.axisgrid.PairGrid at 0x7fbc2e947d60>

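We see a linear relationship between the CreditUsed and Balance columns (where the balance is not zero).
This is expected: Balance is CreditUsed - AmountRepaid, so a loan with nothing repaid has a balance equal
to the credit used.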

from pandas_profiling import ProfileReport


ProfileReport(data)

Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]

Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]

Render HTML: 0%| | 0/1 [00:00<?, ?it/s]

Overview

Dataset statistics
Number of variables 24

Number of observations 2337

Missing cells 72

Missing cells (%) 0.1%

Duplicate rows 0

Duplicate rows (%) 0.0%

Total size in memory 438.3 KiB

Average record size in memory 192.1 B

Variable types
Numeric 10

Categorical 10


Overall Data Analysis Report


There are 190 customers in the dataset.
Most loans have been cleared (about 82.88%).
Most loans are not overdue (about 57.8%).
The customer who has taken the most loans is Partner ID 388436 with 122 loans, followed by Partner ID 363796 with 116.
Out of 190 customers, 108 still have uncleared loans with an outstanding balance, and 9 of them have total pending balances above KShs. 10,000.
A number of loans are long overdue: 15 uncleared loans are at least 100 days (more than 3 months) overdue.
Most loan orders are made on Thursday and Saturday, while most invoices are sent out on Wednesday. Wednesdays, Thursdays and Saturdays are the busiest days of the week.
The dataset spans 4 months: October, November and December 2021, and January 2022.
January orders dropped to about half of December's (265 vs. 510), with the most orders made in November 2021.
The shortest time it takes for a loan order to be invoiced is 2 days.
In the 16 worst cases, orders took 9 days to be invoiced; in total, 35 orders took a week or longer (7 or 9 days).
There is a weak positive relationship between credit limit and credit used: customers with higher limits tend to borrow somewhat more, but a higher credit limit does not reliably make a customer borrow more.
Loans cluster at low credit limits and small amounts, which suggests that most borrowers are small-scale.
