This project focuses on predicting individual income levels from demographic and socioeconomic factors using the Adult (Census Income) dataset, which was extracted from the 1994 U.S. Census database. We build and evaluate several machine learning models to identify the factors most strongly associated with income and to create a robust predictive system.
The objective is to develop a model that predicts whether an individual's income exceeds $50,000 from demographic and work-related features such as age, education, and occupation.
Dataset source: the Adult (Census Income) dataset from the UCI Machine Learning Repository.
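A minimal loading sketch is shown below. The column names follow the standard Adult schema, and `adult.csv` is a placeholder path for a headerless copy of the raw data, not the project's actual file location.

```python
import pandas as pd

# Column names follow the Adult (Census Income) schema; "adult.csv" is a
# placeholder path for a headerless copy of the raw data.
columns = [
    "age", "workclass", "fnlwgt", "education", "educational-num",
    "marital-status", "occupation", "relationship", "race", "gender",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

df = pd.read_csv("adult.csv", names=columns, skipinitialspace=True, na_values="?")

# Encode the binary target: 1 if income exceeds $50K, 0 otherwise.
df["income"] = (df["income"].str.strip() == ">50K").astype(int)
```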
Our exploratory data analysis (EDA) revealed several key patterns:
- Age, education, and hours worked per week correlate positively with income.
This heatmap shows the correlation between numeric features. We observe that age, hours-per-week, and educational-num have moderate positive correlations with income, indicating that these factors are good predictors of higher income. This justifies their inclusion in the model.
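The heatmap can be reproduced with a sketch like the one below, assuming the `df` loaded earlier; the selection of numeric columns is illustrative.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix over the numeric columns, including the encoded income target.
numeric_cols = ["age", "educational-num", "hours-per-week",
                "capital-gain", "capital-loss", "income"]
corr = df[numeric_cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between numeric features and income")
plt.tight_layout()
plt.show()
```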
This violin plot highlights the distribution of ages across income groups. It shows that higher-income individuals tend to be older, particularly between the ages of 35 and 50. This further supports the insight that age is a significant predictor of income.
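A sketch of the violin plot, again assuming the binary income encoding from the loading step:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Age distribution by income group, using the binary income column encoded above.
plt.figure(figsize=(8, 5))
sns.violinplot(x="income", y="age", data=df)
plt.xticks([0, 1], ["<=50K", ">50K"])
plt.title("Age distribution across income groups")
plt.tight_layout()
plt.show()
```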
- Education, workclass & occupation, and marital status also appear to influence income levels:
The bar plot shows the income distribution across workclass and occupation. We see that managerial and professional occupations are associated with higher income levels. This insight highlights the importance of occupation in determining income, as higher-skilled professions tend to lead to higher earnings.
This analysis of relationship status versus income reveals that married individuals, especially those in dual-income households, are more likely to be high earners. This provides additional evidence that marital status is a significant predictor of income.
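One way to sketch these categorical comparisons is to plot the share of high earners per group; the exact chart styling here is an assumption, not the project's original figure.

```python
import matplotlib.pyplot as plt

# Share of high earners per occupation and per marital status (mean of the
# binary income column); column names assume the schema used above.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

(df.groupby("occupation")["income"].mean()
   .sort_values(ascending=False)
   .plot(kind="bar", ax=axes[0], title="Share earning >$50K by occupation"))

(df.groupby("marital-status")["income"].mean()
   .sort_values(ascending=False)
   .plot(kind="bar", ax=axes[1], title="Share earning >$50K by marital status"))

plt.tight_layout()
plt.show()
```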
These insights guided us in selecting the most important features for model development.
Through statistical tests, we found significant differences between individuals earning above and below $50,000 (see the sketch after this list):
- Marital status: Married individuals, especially in dual-income households, were more likely to be high earners.
- Education: Higher educational attainment was associated with higher income.
- Age: Older individuals were more likely to earn more.
- Occupation and work hours: Managerial roles and longer work hours correlated strongly with higher income.
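The section does not specify which tests were run, so the following is a hedged sketch of two plausible ones using SciPy: a Welch t-test on age and a chi-square test of independence for marital status, both on the `df` from the loading step.

```python
import pandas as pd
from scipy import stats

# Welch's t-test: do individuals above and below $50K differ in mean age?
age_high = df.loc[df["income"] == 1, "age"]
age_low = df.loc[df["income"] == 0, "age"]
t_stat, p_val = stats.ttest_ind(age_high, age_low, equal_var=False)
print(f"Age t-test: t = {t_stat:.2f}, p = {p_val:.4g}")

# Chi-square test of independence between marital status and income level.
contingency = pd.crosstab(df["marital-status"], df["income"])
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"Marital status vs. income: chi2 = {chi2:.2f}, p = {p:.4g}")
```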
With these insights, we built several predictive models to classify whether an individual's income exceeds $50,000 (a training sketch follows the list):
- Logistic Regression served as our initial benchmark, offering interpretability but lacking the ability to capture complex, non-linear relationships.
- Decision Trees and Random Forests captured intricate interactions between factors like marital status, education, and work hours.
- Gradient Boosting emerged as the top performer, delivering the highest accuracy and precision.
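A minimal training-and-comparison sketch with scikit-learn is shown below. The preprocessing choices (one-hot encoding for categoricals, scaling for numerics) and the train/test split parameters are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score

X = df.drop(columns=["income"])
y = df["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One-hot encode categorical columns, scale numeric ones.
categorical = X.select_dtypes(include="object").columns
numeric = X.select_dtypes(exclude="object").columns
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
          f"precision={precision_score(y_test, preds):.3f}")
```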
Through cross-validation and hyperparameter tuning, we further improved model performance. Both the Random Forest and Gradient Boosting models achieved high accuracy and precision, outperforming Logistic Regression, since the tree-based models can capture more complex relationships between the variables.
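For the tuning step, a grid search over the Gradient Boosting pipeline might look like the sketch below; it reuses `preprocess`, `X_train`, and `y_train` from the previous sketch, and the grid values are placeholders rather than the project's actual search space.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Placeholder hyperparameter grid for illustration only.
param_grid = {
    "model__n_estimators": [100, 200],
    "model__learning_rate": [0.05, 0.1],
    "model__max_depth": [2, 3, 4],
}

gb_pipe = Pipeline([("prep", preprocess),
                    ("model", GradientBoostingClassifier(random_state=42))])
search = GridSearchCV(gb_pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```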
The following chart highlights the top 10 most important features for predicting income levels using the Decision Tree model:
As seen in the chart, features such as marital status, education, and age are the strongest predictors of income, which aligns with our initial findings from the exploratory analysis.
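A chart like this can be produced from a fitted decision-tree pipeline as sketched below; the step names, the depth limit of 8, and the plot layout are illustrative assumptions built on the earlier pipeline sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree on the preprocessed features and plot its top-10 importances.
tree_pipe = Pipeline([("prep", preprocess),
                      ("model", DecisionTreeClassifier(max_depth=8, random_state=42))])
tree_pipe.fit(X_train, y_train)

feature_names = tree_pipe.named_steps["prep"].get_feature_names_out()
importances = tree_pipe.named_steps["model"].feature_importances_

top = np.argsort(importances)[::-1][:10]
plt.figure(figsize=(8, 5))
plt.barh(feature_names[top][::-1], importances[top][::-1])
plt.xlabel("Importance")
plt.title("Top 10 features (Decision Tree)")
plt.tight_layout()
plt.show()
```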
This project not only produced a reliable model for income classification but also deepened our understanding of the factors driving income inequality. Education, occupation, and marital status consistently emerged as the most impactful variables, both for prediction and for understanding socioeconomic disparities.
Future directions for this project could include:
- Exploring interaction effects between demographic variables.
- Extending the model to predict income across different industries or regions.
- Using more advanced models, such as neural networks, to capture additional complexities in the data.
- Languages: Python
- Libraries: Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib, SciPy
- Machine Learning Models: Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, Support Vector Machines (SVM), K-Nearest Neighbors, Naive Bayes