Exploring the Dynamics of Annual Income, Home Ownership, and
Loan Amount: An In-depth Exploratory Data Analysis
Name & Student ID
Abstract
This study employs exploratory data analysis methods to examine the
relationships between annual income, home ownership, and loan amount. Through
statistical analysis and data visualization, we investigate correlations, patterns, and
potential outliers within the loan50 dataset. The findings provide valuable insights for
financial institutions, policymakers, and individuals making housing and loan
decisions. The methodology establishes a foundation for future socio-economic
research and modeling.
Keywords: annual income, loan amount, home ownership
Introduction
The increasing popularity of online lending platforms, such as Lending Club,
has revolutionized the way individuals access loans. As the lending industry evolves,
it becomes crucial to explore the dynamics between key variables that influence loan
decisions. Previous research in the field of personal finance and lending has focused
on factors like credit scores and debt-to-income ratios (Agarwal et al. 2015).
However, limited attention has been given to the interplay between annual income,
home ownership, and loan amount. Understanding the relationships between these
variables can provide valuable insights into the loan borrowing process and inform
better lending practices. (Research Background)
The objective of this study is to conduct an in-depth exploratory data analysis to
investigate the dynamics between annual income, home ownership, and loan amount
within the loan50 dataset. By analyzing the relationships and patterns that emerge
from the data, this study aims to uncover potential insights and identify any
significant associations between these variables. The findings will contribute to our
understanding of the factors influencing loan decisions and provide valuable
information for financial institutions, policymakers, and individuals in making
informed housing and loan choices. Additionally, the study aims to lay the
groundwork for future socio-economic research and modeling in the lending industry.
(Statement of Purpose)
Methods
The loan50 dataset was obtained from the "OpenIntro Statistics" website and
consists of 50 observations and 18 variables. This analysis focuses on three
variables: home_ownership, annual_income, and loan_amount. The home_ownership
variable records the ownership status of the applicant's residence, annual_income
records the applicant's annual income, and loan_amount records the amount of the
loan the applicant received. (Data Description)
The home_ownership variable is categorical, with categories including rent,
mortgage, and own. The annual_income and loan_amount variables are numerical,
representing continuous values. (Nature of Variables)
To gain insights from the data, descriptive statistics were calculated and
visualizations were created. The descriptive statistics provide measures of central
tendency (mean and median) and dispersion (standard deviation) for each numerical
variable. Four visualizations were created to explore the relationships, patterns,
and potential outliers within the loan50 dataset: a histogram of annual_income, a
bar chart of home_ownership, a scatter plot of annual_income and loan_amount, and
parallel box plots of annual_income by home_ownership. The parallel box plots allow
a visual comparison of the annual income distributions among the homeownership
groups (rent, mortgage, own). By examining the positions and shapes of the boxes
and whiskers, we can compare the center and spread of income across groups. The
scatter plot helps determine whether there is a relationship between annual income
and loan amount. If the data points cluster tightly around a line, this suggests a
strong linear correlation. If the points are widely scattered and show no clear
linear pattern, this suggests a weak linear correlation or none. Together, these
methods give an overview of the dataset and support further analysis and
interpretation of the findings. (Descriptive Statistics and Data Visualization)
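The descriptive statistics named above can be sketched with Python's standard
library. This is a minimal illustration, not the code used in the study: the
helper name describe and the income values are hypothetical and are not taken
from loan50.

```python
import statistics

# Hypothetical incomes in dollars, NOT taken from loan50; for illustration only.
incomes = [32_000, 45_000, 58_000, 61_000, 75_000, 90_000, 240_000]


def describe(values):
    """Return the mean, median, and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 in the denominator)
    }


summary = describe(incomes)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

In this toy sample the single large value pulls the mean above the median, which
is the pattern used later in the paper as a sign of right skew.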
Results
Table 1 shows that the mean annual income in the dataset is $82,276. The median
income is slightly lower, at $75,000, indicating that the distribution of incomes
may be slightly right-skewed (Figure 1). The standard deviation of annual income,
$66,631.74, is high relative to the mean, indicating a wide dispersion of income
values. The mean loan amount is $17,989.74. The median loan amount of $16,000 is
slightly lower than the mean, also suggesting a slightly right-skewed distribution.
Loan amounts range from a minimum of $5,825 to a maximum of $40,000. The standard
deviation of loan amount is $8,195.35, indicating considerable variation in loan
amounts around the mean.
Table 1. Descriptive statistics for the annual_income and loan_amount variables.
Variable         Mean          Median     Standard Deviation
Annual Income    $82,276       $75,000    $66,631.74
Loan Amount      $17,989.74    $16,000    $8,195.35
Figure 1: The histogram with density plot for annual_income.
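As a rough numerical check on the skew and dispersion described above, the values
reported in Table 1 can be turned into two summary ratios. The sketch below is not
part of the original analysis: it assumes the coefficient of variation (standard
deviation divided by mean) and Pearson's median skewness coefficient,
3 * (mean - median) / standard deviation, are acceptable rough indicators, and the
function names are hypothetical.

```python
# Summary statistics as reported in Table 1 (dollars).
table1 = {
    "annual_income": {"mean": 82_276.00, "median": 75_000.00, "sd": 66_631.74},
    "loan_amount": {"mean": 17_989.74, "median": 16_000.00, "sd": 8_195.35},
}


def coefficient_of_variation(stats):
    """Standard deviation expressed as a fraction of the mean."""
    return stats["sd"] / stats["mean"]


def pearson_median_skewness(stats):
    """Pearson's second skewness coefficient; positive values suggest right skew."""
    return 3 * (stats["mean"] - stats["median"]) / stats["sd"]


results = {}
for name, stats in table1.items():
    results[name] = {
        "cv": coefficient_of_variation(stats),
        "skew": pearson_median_skewness(stats),
    }
    print(f"{name}: CV = {results[name]['cv']:.2f}, "
          f"median skewness = {results[name]['skew']:.2f}")
```

Both skewness coefficients are positive, which agrees with the mean exceeding the
median for each variable. These ratios summarize the reported figures only and do
not replace inspecting the histogram.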
Figure 2 shows that the most common homeownership status in the dataset is
"mortgage," which suggests that a large share of individuals in the sample are
paying off a mortgage on their home.
Figure 2: The bar plot for the home_ownership variable.
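The counts behind a bar plot like Figure 2 are simple category frequencies. The
following sketch shows one way to tabulate them; the list of statuses is
hypothetical and does not reproduce the loan50 records.

```python
from collections import Counter

# Hypothetical home_ownership values, NOT the loan50 records.
home_ownership = ["rent", "mortgage", "own", "mortgage", "rent", "mortgage"]

counts = Counter(home_ownership)
total = sum(counts.values())

# Print each category with its count and share, most frequent first.
for status, n in counts.most_common():
    print(f"{status:<9}{n:>3}  {n / total:.0%}")

most_common_status = counts.most_common(1)[0][0]
```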
It is also of interest to explore the relationship between homeownership status
(rent, mortgage, and own) and annual income. In Figure 3, the horizontal line inside
each box represents the median annual income for that homeownership category.
Figure 3 shows that the median annual income for mortgage holders is higher than the
median annual income for renters and for outright owners. The interquartile range
(IQR) of annual income is wider for renters than for outright owners and mortgage
holders, meaning that the middle half of renters' incomes spans a wider range than
in the other two groups. There are a few outliers in the box plots of annual income
for renters and mortgage holders.
Figure 3: The parallel box plots of the variables home_ownership and annual_income.
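The quantities read off a box plot (median, IQR, and points flagged as outliers)
can be computed directly. The sketch below assumes the common 1.5 x IQR rule for
flagging outliers and the "inclusive" quartile method of Python's statistics
module; plotting libraries may use slightly different quartile conventions. The
helper name box_summary and the income values are hypothetical, not loan50 data.

```python
import statistics

# Hypothetical incomes (dollars) by home_ownership group; NOT the loan50 values.
income_by_group = {
    "rent": [28_000, 35_000, 42_000, 50_000, 61_000, 150_000],
    "mortgage": [55_000, 70_000, 82_000, 95_000, 110_000, 130_000],
}


def box_summary(values):
    """Return median, IQR, and values outside the 1.5 * IQR fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"median": median, "iqr": iqr, "outliers": outliers}


summaries = {group: box_summary(vals) for group, vals in income_by_group.items()}
for group, s in summaries.items():
    print(f"{group}: median = {s['median']:,.0f}, "
          f"IQR = {s['iqr']:,.0f}, outliers = {s['outliers']}")
```

Comparing the median and IQR entries across groups gives the same kind of
comparison that the parallel box plots support visually.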
From the scatter plot in Figure 4, we can see a general positive relationship
between annual income and loan amount. As annual income increases, there is a
tendency for the loan amount to also increase. This suggests that individuals with
higher incomes tend to qualify for larger loan amounts.
Figure 4: The scatter plot of the variables annual_income and loan_amount.
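The strength of a linear relationship like the one in Figure 4 is commonly
summarized with the Pearson correlation coefficient. The study itself reports no
coefficient, so the sketch below only illustrates the calculation; the function
name pearson_r and the (income, loan) pairs are hypothetical and not drawn from
loan50.

```python
import math

# Hypothetical (income, loan) pairs in dollars; NOT taken from loan50.
annual_income = [30_000, 45_000, 60_000, 80_000, 120_000]
loan_amount = [6_000, 15_000, 10_000, 22_000, 25_000]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(annual_income, loan_amount)
print(f"r = {r:.2f}")
```

A value near +1 corresponds to points clustered tightly around an upward-sloping
line, while a value near 0 corresponds to no clear linear pattern.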
Conclusions
In Table 1, the median annual income of $75,000 is
slightly lower than the mean ($82,276), suggesting a slightly right-skewed distribution
of incomes (Figure 1). This skewness indicates that there may be a few high-income
outliers or a group of individuals with relatively higher incomes. The standard
deviation of annual income is $66,631.74, which is relatively high. This indicates a
wide dispersion of income values around the mean. The large standard deviation
suggests significant variability in income levels within the dataset, with some
individuals having considerably higher or lower incomes compared to the mean.
The most common homeownership status in the dataset is "mortgage,"
implying that a large share of individuals in the sample are paying off a mortgage
on their home (Figure 2). This finding aligns with the expectation that
homeownership is a common housing arrangement. The mean loan amount of $17,989.74
is far below a typical home purchase price, which suggests that loan_amount
describes a separate loan and not the mortgage itself.
The slightly right-skewed distribution of loan amounts also indicates that there
may be a few higher loan amount outliers or a group of individuals with relatively
larger loans. The standard deviation for loan amount ($8,195.35) indicates a
considerable variation in loan amounts around the mean. This suggests that there is a
range of loan sizes within the dataset, with some individuals borrowing substantially
higher or lower amounts compared to the average.
Based on Figure 3, there appears to be an income disparity between homeownership
groups, with mortgage holders having the highest median income and renters the
lowest. Annual income is also less dispersed among outright owners than among
renters or mortgage holders. The outliers in the box plots for renters and mortgage
holders indicate that some people in these groups have unusually high or low
incomes relative to the rest of their group. This could be due to a number of
factors, such as differences in employment status, education level, or occupation.
Figure 4 shows a positive association between annual income and loan amount:
borrowers with higher incomes tend to receive larger loans. This relationship is
consistent with general expectations, as higher incomes provide a stronger financial
foundation for borrowing larger sums. The majority of data points appear to be
concentrated in the
lower range of both annual income and loan amount. This indicates that the dataset
predominantly consists of individuals with lower to moderate incomes and smaller
loan amounts. There are a few outliers in the scatter plot, represented by data points
that deviate significantly from the general pattern. The scatter plot reveals that the
data points are spread out rather than forming a tightly clustered pattern, suggesting
that while there is a positive relationship, it is not a perfect correlation. There is
variation in loan amounts even within similar income levels, indicating the influence
of other factors in determining loan amounts.
This study provides initial insights into the relationship between homeownership,
annual income, and loan amounts. However, it is important to acknowledge that the
conclusions are based on a relatively small dataset of only 50 observations. Therefore,
the findings may not be representative of the entire population or generalize to
broader contexts. Further research and analysis are warranted to explore additional
factors that may influence loan decisions, such as credit scores, employment status, or
loan terms. Additionally, studying a larger dataset and conducting more sophisticated
statistical analyses, such as hypothesis testing or regression analysis, would enhance
the reliability and validity of the conclusions.
References
Agarwal, S., I. Ben-David and V.W. Yao. 2015. Collateral Valuation and Borrower
Financial Constraints: Evidence from the Residential Real Estate Market.
Management Science 61: 2220–2240.