Business Statistics II
Project Report-Group 9
Submitted by:
Midhun Mohan
Braja kishor Prida
Paras kohli
Deepak Kannan
Mahesh kumar Yadav
Ashutosh Tiwari
Project 1 (Instalment 3)
Q1: a. If the amounts garnished from each merchant’s credit card transactions are accumulating
at a rate that will pay off the loan at around 12 months, then approximately what should be
the slope and intercept of the least squares regression of the amount repaid at six months (y)
versus the total amount to be repaid (x)?
Assuming repayments accumulate uniformly over the 12 months, half of the total is repaid by month six, so the slope should be approximately 0.5 and the intercept approximately 0.
The equation is y = 0.5x.
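A minimal sketch of this reasoning, using made-up loan totals (the values in `total` are illustrative and not taken from the project data): if every loan is repaid at a uniform rate over 12 months, half of each total is repaid by month six, and least squares should recover a slope of 0.5 and an intercept of 0.

```r
# Hypothetical totals to be repaid (illustrative values only)
total <- c(10000, 22656, 31654, 49000, 51700)

# Uniform repayment over 12 months: half is repaid after six months
repaid_6m <- 0.5 * total

# Regress amount repaid at six months (y) on total to be repaid (x)
ideal_fit <- lm(repaid_6m ~ total)
ideal_coef <- coef(ideal_fit)
print(ideal_coef)
```

Real data will scatter around this line, so the fitted slope and intercept are only expected to be close to these values if loans are on pace.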
b. For your data set, what is the slope and intercept of the least squares regression of
the amount repaid at six months (y) versus the total amount to be repaid (x)?
Compare these estimates with the values anticipated in “a”? Explain any differences.
> library(readr)
> Project_dataset <- read_csv("F:/Term 2/Business Statistics 2/Project/Project_dataset.csv")
> View(Project_dataset)
> Project_dataset
# A tibble: 628 x 27
     Sno CaseNumber Amt.Repaid.at.6.Months Nominal.Loan.Amount Total.Amt.to.be.Repaid
   <int>      <int>                  <int>               <dbl>                  <dbl>
 1     1  344012260                   7879               19200               22656.00
 2     2  378114786                  19392               37000               43475.00
 3     3  351015636                  14253               26600               31654.00
 4     4   79719718                   6934               15600               18330.00
 5     5  104110253                   6199               18400               22816.00
 6     6  371413226                  21268               40000               49000.00
 7     7  253511876                  27233               44000               51700.00
 8     8  355114334                   4841                8900               10635.55
 9     9  214917684                   8040               14800               18352.00
10    10  161017385                   5885               15000               17850.00
# ... with 618 more rows, and 22 more variables: Repayment.Percentage <dbl>,
#   Commission.Upfront <dbl>, Validated.Monthly.Batch <dbl>,
#   Historical.Monthly.Credit.Card.Receipts <dbl>, Loan.Type <chr>, Loan.Size.Class <chr>,
#   FICO <int>, Years.In.Business <int>, Num.of.Credit.Lines <int>,
#   Num.of.Paid.off.Credit.Lines <int>, Current.Delinquent.Credit.Lines <int>,
#   Previous.Delinquent.Credit.Lines <int>, Business.Entity.Type <chr>, Num.of.Trade.Lines <int>,
#   Num.of.Derog.Legal.Item <int>, Two.Digit.SIC.Code <int>, Two.Digit.SIC.Description <chr>,
#   Population.in.Zip.Code <int>, Average.House.Value.in.Zip.Code <int>,
#   Income.Per.Household.in.Zip.... <int>, State <chr>, ISO.Name <chr>
> y<-Project_dataset$Amt.Repaid.at.6.Months
> y
 [1]  7879 19392 14253  6934  6199 21268 27233  4841  8040  5885 12249
[12]  1207 27408 10628 52687 46141 41531 69413  4253
> x<-Project_dataset$Total.Amt.to.be.Repaid
> plot(y,x)
> plot(x,y)
> plot(x, y, xlab = "Total amount to be repaid", ylab = "Amount paid in six months", main="scatter plot of y and x")
> lm(x,y)
Error in formula.default(object, env = baseenv()) : invalid formula
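> # Note: this formula has x on the left, so it regresses x on y; the regression asked for in (b) is lm(y ~ x)
> lm(x~y)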
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
4.349e+04 4.516e-02
> model1<-lm(x~y)
> model1$fitted.values
       1        2        3        4        5        6        7        8        9
43841.32 44361.21 44129.15 43798.65 43765.46 44445.92 44715.27 43704.14 43848.59
      10       11       12       13       14       15       16       17
> yfit<-model1$fitted.values
> plot(x,y)
> lines(x,yfit)
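> # Note: yfit holds fitted values of x (the response in model1), so y-yfit is not that model's residual; resid(model1) returns the model residuals
> residual<-y-yfit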
> plot(x,residual)
> cor(x,y)
[1] 0.1422644
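The question asks for the regression of y on x, which in R's formula syntax is `y ~ x` (response on the left of the tilde). The sketch below uses simulated data, since the project file is not reproduced here; `x_sim` and `y_sim` are hypothetical stand-ins for the two columns, and the 0.4 slope is an arbitrary choice made for the simulation.

```r
set.seed(1)
# Simulated stand-ins for the two columns (not the project data)
x_sim <- runif(200, min = 5000, max = 100000)   # total to be repaid
y_sim <- 0.4 * x_sim + rnorm(200, sd = 3000)    # repaid at six months

# Response on the left of ~, explanatory variable on the right
fit_yx <- lm(y_sim ~ x_sim)
print(coef(fit_yx))

# Residuals and fitted values taken from the model object
res_yx <- resid(fit_yx)
fitted_yx <- fitted(fit_yx)

# Residuals equal observed response minus fitted response
max_gap <- max(abs(res_yx - (y_sim - fitted_yx)))
print(max_gap)
```

Taking residuals from the model object with `resid()` avoids mixing up which variable was the response.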
(c) Use residuals to determine whether the data used to estimate the simple regression in
“b” conform to the assumptions of the simple regression model (SRM). If the data do
conform to the SRM, report and briefly interpret the value of R2 and RMSE. If the
data are not consistent with the assumptions of the SRM, explain how the data deviate
from those assumptions.
Checklist for Assumptions in Linear Regression:
✓ Is the association between Y and X linear?
✓ Have we ruled out obvious lurking variables?
✓ Are the errors evidently independent?
✓ Are the variances of the residuals similar?
✓ Are the residuals nearly normal?
Linearity:
The variables Amt.Repaid.at.6.Months and Total.Amt.to.be.Repaid are not linearly related, as
the scatter plot of y against x shows.
Residual plot:
Independence:
The residuals follow a visible pattern, which suggests that a straight line is being forced
onto variables that are not linearly related.
Variance:
The pattern in the residuals also shows that their variance is not constant.
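These checks are usually made from plots of the residuals. The following is a sketch on simulated data with deliberately non-constant variance; all names and values are illustrative assumptions, not the project data.

```r
set.seed(2)
x_sim <- runif(300, min = 5000, max = 100000)
# Error spread grows with x, so the variance is not constant
y_sim <- 0.5 * x_sim + rnorm(300, sd = 0.1 * x_sim)

fit <- lm(y_sim ~ x_sim)
res <- resid(fit)

# Residuals against the explanatory variable: look for curvature or a fan shape
plot(x_sim, res, xlab = "Total amount to be repaid", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal quantile plot: points far from the line suggest non-normal errors
qqnorm(res)
qqline(res)

# A rough numeric check: do absolute residuals grow with x?
spread_cor <- cor(x_sim, abs(res))
print(spread_cor)
```

A clearly positive correlation between x and the absolute residuals is one informal sign of a fan-shaped residual plot.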
2. The lender uses a performance metric known as PRSM, performance ratio at six
months. In this question, you will form and describe this new variable. To construct
PRSM, define a new column in your data table using a formula. In the formula, divide
two times the amount repaid at six months by the total amount to be repaid:
PRSM = 2 * (Amt.Repaid.at.6.Months / Total.Amt.to.be.Repaid)
(a) If small loans and large loans are performing comparably and accumulating at a
rate that will pay off the loans at around 12 months, then approximately what
should be the slope and intercept of the least squares regression of PRSM (y)
versus the total amount to be repaid (x)?
Y = PRSM
X = total amount to be repaid
If loans of every size are accumulating at a rate that pays them off in about 12 months, each
loan is performing as planned and merchants are paying their instalments without defaulting.
In this scenario the amount repaid at six months is half the total amount to be repaid (x),
so PRSM equals 1 for every loan regardless of its size.
Equation: Y = 1
Intercept = 1, slope = 0
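A short sketch of this ideal case, again with made-up totals (the `total` values are hypothetical): when exactly half of every loan is repaid by month six, PRSM is 1 for each loan, so regressing PRSM on the total gives an intercept of 1 and a slope of 0.

```r
# Hypothetical totals to be repaid (illustrative values only)
total <- c(10000, 20000, 40000, 80000, 160000)
repaid_6m <- 0.5 * total   # on pace to pay off in 12 months

# PRSM: two times the amount repaid at six months over the total
prsm <- 2 * repaid_6m / total
print(prsm)

# Regress PRSM (y) on total to be repaid (x)
ideal_coef <- coef(lm(prsm ~ total))
print(ideal_coef)
```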
(b) For your data set, what is the slope and intercept of the least squares regression
of PRSM (y) versus the total amount to be repaid (x)? Compare these
estimates with the values of the slope and intercept anticipated in “a”? Explain
any differences.
Reg4 <- lm(Treated_sample2$PRSM~Treated_sample2$Total.Amt.to.be.Repaid)
Call:
lm(formula = Treated_sample2$PRSM ~
Treated_sample2$Total.Amt.to.be.Repaid)
Coefficients:
(Intercept) 7.801e-01
Treated_sample2$Total.Amt.to.be.Repaid 7.279e-07
Equation: PRSM = 7.801e-01+ 7.279e-07 * Total.Amt.to.be.Repaid
summary(Reg4)
Residuals:
Min 1Q Median 3Q Max
-0.67970 -0.14559 0.00587 0.15022 0.76540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.801e-01 1.100e-02 70.908 < 2e-16 ***
Treated_sample2$Total.Amt.to.be.Repaid 7.279e-07 1.502e-07 4.847 1.58e-06 ***
Explanation:
1) In the ideal case of question 2(a) the intercept is 1; here it is 7.801e-01. The
intercept is the predicted PRSM at a total of zero, which lies outside the range of the
data, so it is an extrapolation.
2) In the ideal case the slope is zero; here it is 7.279e-07. This can arise from
irregular or partial payment of instalments. The p-value of the slope is 1.58e-06, so
at the 5% significance level we reject the null hypothesis that the slope is zero.
The slope in this equation is statistically significant.
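The test of the slope can be reproduced from the reported estimate and standard error. In the sketch below the residual degrees of freedom (`df_resid = 600`) are an assumed value for illustration, because the size of Treated_sample2 is not shown; the t statistic itself does not depend on that assumption.

```r
slope_est <- 7.279e-07   # reported estimate
slope_se  <- 1.502e-07   # reported standard error

# t statistic for H0: slope = 0
t_stat <- slope_est / slope_se
print(t_stat)

# Two-sided p-value; df_resid is an assumption, not taken from the output
df_resid <- 600
p_value <- 2 * pt(-abs(t_stat), df = df_resid)
print(p_value)
```

The t statistic differs slightly from the 4.847 in the summary because the printed estimate and standard error are rounded.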
(c) Use residuals to determine whether the data used to estimate the simple
regression in “b” conform to the assumptions of the simple regression model
(SRM). If the data do conform to the SRM, report and briefly interpret the
value of R2 and RMSE. If the data are not consistent with the assumptions of
the SRM, explain how the data deviate from those assumptions.
Assumptions of SRM :
1) Linearity:
2) Residuals should be independent:
(d) Does your analysis of the data indicate that loans with larger total amount to be
repaid have smaller average PRSM (i.e., tend on average to underperform)
compared to those with smaller principal?
The analysis does not indicate this. The estimated slope is positive, so loans with a larger
total amount to be repaid have a larger average PRSM (i.e., tend on average to outperform)
compared to those with a smaller principal. However, this inference is not reliable because the
residuals do not satisfy the SRM conditions (they are heteroscedastic).
3. Suppose that the lender could lend sufficient money so that $160,000 has to be repaid,
but in one of two ways. It can lend the money to one merchant, who needs to pay back the
entire $160,000, or to 16 distinct merchants, each of which must repay $10,000. Due to its
own credit obligations, the lender must have at least $60,000 of its money paid back within
six months. Which approach would you recommend to management of the lender: lend to a
single merchant or spread the money over 16 merchants? Support your choice by supplying
an estimate of the probability that at least $60,000 will be paid back within six months
under both scenarios. Your answer should note any key assumptions that are necessary for
your analysis. The following questions will help guide your answer.
(a) Which of the two models considered in questions 1 and 2 better conforms to the Simple
Regression Model (SRM) assumptions? Briefly explain why.
Model 2 conforms better to the SRM assumptions because the second regression shows a more
nearly linear relationship between the predictor and the response variable. Multicollinearity
is not a concern in either model, since each has a single predictor.
(b) Using the model from Q2, estimate the average PRSM of loans which must repay $10k
and of loans which must repay $160k.
Case 1 ($10k loans) : PRSM = 7.801e-01 + (7.279e-07)(10,000) = 0.787
Case 2 ($160k loan) : PRSM = 7.801e-01 + (7.279e-07)(160,000) = 0.897
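The fitted equation takes the total in dollars, so $10k and $160k enter as 10000 and 160000. A quick check of the arithmetic with the reported coefficients (the variable names are illustrative):

```r
b0 <- 7.801e-01   # reported intercept
b1 <- 7.279e-07   # reported slope (PRSM per dollar to be repaid)

totals <- c(10000, 160000)    # totals in dollars
pred_prsm <- b0 + b1 * totals
print(round(pred_prsm, 3))
```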
(c) For each lending scheme, what is the smallest average PRSM that the lender must
observe for each lending scheme to ensure $60k has been paid back within 6 months?
PRSM = 2 * (Amt.Repaid.at.6.Months/Total.Amt.to.be.Repaid)
Single merchant ($160k) : 2 * (60/(160*1)) = 0.75
16 merchants ($10k each) : 2 * (60/(10*16)) = 0.75
(d) What is the approximate distribution of the average PRSM score in each of the two
lending schemes?
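Case 1 (average of 16 loans of $10k) : approximately normal ; Mean = 0.787 ; standard error = RMSE/sqrt(16) = 0.218477/4 = 0.0546, assuming the 16 loans repay independently
Case 2 (one loan of $160k) : approximately normal ; Mean = 0.897 ; standard deviation = RMSE = 0.218477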
(e) For each lending scheme, determine the probability that at least $60k is paid back within
6 months. Which lending scheme is better – that is, which has the higher probability of
repayment?
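Case 1 (average of 16 loans of $10k) : Mean = 0.787 ; standard error = 0.218477/4 = 0.0546 ; Z value of PRSM = 0.75 (calculated above)
= (0.75 - 0.787)/0.0546 = -0.68, so P(average PRSM >= 0.75) is approximately 0.75
Case 2 (one loan of $160k) : Mean = 0.897 ; RMSE = 0.218477 ; Z value of PRSM = 0.75 (calculated above)
= (0.75 - 0.897)/0.218477 = -0.67, so P(PRSM >= 0.75) is approximately 0.75
Under these assumptions (approximately normal errors and independent loans), the two schemes have
nearly the same probability of repaying at least $60k, with the 16-merchant scheme marginally higher.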
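The two probabilities can be sketched with `pnorm()`, using the reported coefficients and RMSE. This assumes normally distributed errors with standard deviation equal to the RMSE and, for the 16-merchant scheme, independent loans so that the average PRSM has standard error RMSE/sqrt(16). These are modelling assumptions rather than results from the data.

```r
rmse <- 0.218477     # reported RMSE of the PRSM regression
threshold <- 0.75    # PRSM needed for $60k of $160k by six months

# Predicted mean PRSM from the fitted equation (totals in dollars)
mean_160k <- 7.801e-01 + 7.279e-07 * 160000
mean_10k  <- 7.801e-01 + 7.279e-07 * 10000

# One loan: the spread of a single PRSM value is about the RMSE
p_single <- pnorm(threshold, mean = mean_160k, sd = rmse, lower.tail = FALSE)

# Average of 16 independent loans: standard error is RMSE / sqrt(16)
p_sixteen <- pnorm(threshold, mean = mean_10k, sd = rmse / sqrt(16),
                   lower.tail = FALSE)

print(c(single = p_single, sixteen = p_sixteen))
```

The single large loan has the higher predicted mean but the larger spread, while the average of 16 small loans has a lower mean and a much smaller spread, which is why the two probabilities end up close.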
4. To gauge the risk of loans, the lender has available several easily-obtained variables: the
FICO score, the Years in Business, the number of Credit Lines, and the Income per
Household in the zip code. Choose the best one of these four variables as an explanatory
variable in a simple regression to explain variation in PRSM. You want to choose the single
explanatory variable that makes these predictions as accurate as possible. (Be aware that a
transformation of an explanatory variable may produce a better fitting, more predictive
model, but keep any transformations simple and do not combine the predictor variables).
Absolutely, do not transform the PRSM score.
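library(data.table)   # fread()
library(dplyr)        # filter(), mutate(), select(), %>%
library(ggplot2)
setwd("C:/NIIT/Term 2/Business Stats II/proj/")
data <- fread("insdata.csv")
# Coerce the six-month repayment to integer; non-numeric entries become NA
data$Amt.Repaid.at.6.Months <- as.integer(data$Amt.Repaid.at.6.Months)
# Drop rows with a missing repayment and build PRSM (left untransformed)
data <- data %>%
  filter(!is.na(Amt.Repaid.at.6.Months)) %>%
  mutate(prsm = 2*(Amt.Repaid.at.6.Months/Total.Amt.to.be.Repaid))
# Keep the four candidate predictors and prsm, selected by column position
# (assumes the column order shown in the tibble printout for Q1)
four <- data %>% select(c(12:14,25,28))
# One simple regression of prsm per candidate predictor
mod_fico <- lm(prsm~FICO, data = four)
summary(mod_fico)
mod_yib <- lm(prsm~Years.In.Business, data = four)
summary(mod_yib)
mod_ncl <- lm(prsm~Num.of.Credit.Lines, data = four)
summary(mod_ncl)
mod_iph <- lm(prsm~Income.Per.Household.in.Zip...., data = four)
summary(mod_iph)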
(a) Explain briefly the reasoning behind your choice of the explanatory variable used
in your simple regression. Empirically, why choose this variable?
Among the four candidates (FICO, Years.In.Business, Num.of.Credit.Lines and
Income.Per.Household), ‘Years.In.Business’ explains the largest share of the variation in PRSM.
This is evident from comparing the summaries and estimated coefficients of the four fitted models.
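One way to make such a comparison explicit is to tabulate R2 and RMSE for each single-predictor model. The sketch below runs on simulated data in which PRSM is constructed to depend on Years.In.Business only; the data frame `sim`, the shortened column name `Income`, and all numeric values are hypothetical, so the output says nothing about the project data.

```r
set.seed(4)
n <- 200
# Simulated stand-ins for the four candidate predictors (not the project data)
sim <- data.frame(
  FICO = rnorm(n, 650, 50),
  Years.In.Business = rpois(n, 8),
  Num.of.Credit.Lines = rpois(n, 5),
  Income = rnorm(n, 60000, 15000)
)
# In this simulation PRSM depends on Years.In.Business only
sim$prsm <- 0.7 + 0.05 * sim$Years.In.Business + rnorm(n, sd = 0.2)

# Fit one simple regression per predictor and collect R-squared and RMSE
predictors <- c("FICO", "Years.In.Business", "Num.of.Credit.Lines", "Income")
fit_stats <- t(sapply(predictors, function(v) {
  s <- summary(lm(reformulate(v, response = "prsm"), data = sim))
  c(r.squared = s$r.squared, rmse = s$sigma)
}))
print(round(fit_stats, 4))

# Predictor with the largest R-squared
best <- rownames(fit_stats)[which.max(fit_stats[, "r.squared"])]
print(best)
```

Because all four models have the same response and one predictor each, ranking by R2 and ranking by RMSE give the same order.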
(b) Summarize the estimated simple regression, including R2, RMSE, and the
estimated slope and intercept. Interpret the values of these estimates appropriately.
coefficients(mod_yib)
(Intercept) Years.In.Business
0.8127613 0.1207524
summary(mod_yib)$sigma
[1] 28.03813
The intercept, 0.8127613, is the expected value of PRSM for a business with zero years in
business (Years.In.Business = 0). The slope means that each additional year in business is
associated with an average increase in PRSM of 0.1207524. The residual standard error (RMSE)
reported by summary(mod_yib)$sigma is 28.03813.