This project focuses on predicting individual income levels from demographic and socioeconomic factors using the Adult (Census Income) dataset, which was extracted from the 1994 U.S. Census database. We build and evaluate several machine learning models to identify the factors most strongly associated with income and to create a robust predictive system.
The objective is to develop a model that predicts whether an individual's income exceeds $50,000 from demographic and work-related features such as age, education, and occupation.
Dataset source: the Adult (Census Income) dataset from the UCI Machine Learning Repository.
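A minimal loading sketch is shown below. The column names follow the standard Adult schema, and `adult.csv` is a placeholder path for a headerless copy of the raw data, not the project's actual file location.

```python
import pandas as pd

# Column names follow the Adult (Census Income) schema; "adult.csv" is a
# placeholder path for a headerless copy of the raw data.
columns = [
    "age", "workclass", "fnlwgt", "education", "educational-num",
    "marital-status", "occupation", "relationship", "race", "gender",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

df = pd.read_csv("adult.csv", names=columns, skipinitialspace=True, na_values="?")

# Encode the binary target: 1 if income exceeds $50K, 0 otherwise.
df["income"] = (df["income"].str.strip() == ">50K").astype(int)
```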
Our exploratory data analysis (EDA) revealed several key patterns:
- Age, education, and hours worked per week correlate positively with income.
This heatmap shows the correlation between numeric features. We observe that age, hours-per-week, and educational-num have moderate positive correlations with income, indicating that these factors are good predictors of higher income. This justifies their inclusion in the model.
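The heatmap can be reproduced with a sketch like the one below, assuming the `df` loaded earlier; the selection of numeric columns is illustrative.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix over the numeric columns, including the encoded income target.
numeric_cols = ["age", "educational-num", "hours-per-week",
                "capital-gain", "capital-loss", "income"]
corr = df[numeric_cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between numeric features and income")
plt.tight_layout()
plt.show()
```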
This violin plot highlights the distribution of ages across income groups. It shows that higher-income individuals tend to be older, particularly between the ages of 35 and 50. This further supports the insight that age is a significant predictor of income.
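A sketch of the violin plot, again assuming the binary income encoding from the loading step:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Age distribution by income group, using the binary income column encoded above.
plt.figure(figsize=(8, 5))
sns.violinplot(x="income", y="age", data=df)
plt.xticks([0, 1], ["<=50K", ">50K"])
plt.title("Age distribution across income groups")
plt.tight_layout()
plt.show()
```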
- Education, workclass & occupation, and marital status also appear to influence income levels:
The bar plot shows the income distribution across workclass and occupation. We see that managerial and professional occupations are associated with higher income levels. This insight highlights the importance of occupation in determining income, as higher-skilled professions tend to lead to higher earnings.
This analysis of relationship status versus income reveals that married individuals, especially those in dual-income households, are more likely to be high earners. This provides additional evidence that marital status is a significant predictor of income.
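One way to sketch these categorical comparisons is to plot the share of high earners per group; the exact chart styling here is an assumption, not the project's original figure.

```python
import matplotlib.pyplot as plt

# Share of high earners per occupation and per marital status (mean of the
# binary income column); column names assume the schema used above.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

(df.groupby("occupation")["income"].mean()
   .sort_values(ascending=False)
   .plot(kind="bar", ax=axes[0], title="Share earning >$50K by occupation"))

(df.groupby("marital-status")["income"].mean()
   .sort_values(ascending=False)
   .plot(kind="bar", ax=axes[1], title="Share earning >$50K by marital status"))

plt.tight_layout()
plt.show()
```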
These insights guided us in selecting the most important features for model development.
Through statistical tests, we found significant differences between individuals earning above and below $50,000 (see the sketch after this list):
- Marital status: Married individuals, especially in dual-income households, were more likely to be high earners.
- Education: Higher educational attainment was associated with higher income.
- Age: Older individuals were more likely to earn more.
- Occupation and work hours: Managerial roles and longer work hours correlated strongly with higher income.
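The section does not specify which tests were run, so the following is a hedged sketch of two plausible ones using SciPy: a Welch t-test on age and a chi-square test of independence for marital status, both on the `df` from the loading step.

```python
import pandas as pd
from scipy import stats

# Welch's t-test: do individuals above and below $50K differ in mean age?
age_high = df.loc[df["income"] == 1, "age"]
age_low = df.loc[df["income"] == 0, "age"]
t_stat, p_val = stats.ttest_ind(age_high, age_low, equal_var=False)
print(f"Age t-test: t = {t_stat:.2f}, p = {p_val:.4g}")

# Chi-square test of independence between marital status and income level.
contingency = pd.crosstab(df["marital-status"], df["income"])
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"Marital status vs. income: chi2 = {chi2:.2f}, p = {p:.4g}")
```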
With these insights, we built several predictive models to classify whether an individual's income exceeds $50,000 (a training sketch follows the list):
- Logistic Regression served as our initial benchmark, offering interpretability but lacking the ability to capture complex, non-linear relationships.
- Decision Trees and Random Forests captured intricate interactions between factors like marital status, education, and work hours.
- Gradient Boosting emerged as the top performer, delivering the highest accuracy and precision.
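A minimal training-and-comparison sketch with scikit-learn is shown below. The preprocessing choices (one-hot encoding for categoricals, scaling for numerics) and the train/test split parameters are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score

X = df.drop(columns=["income"])
y = df["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One-hot encode categorical columns, scale numeric ones.
categorical = X.select_dtypes(include="object").columns
numeric = X.select_dtypes(exclude="object").columns
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
          f"precision={precision_score(y_test, preds):.3f}")
```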
Through cross-validation and hyperparameter tuning, we further improved model performance. Both the Random Forest and Gradient Boosting models achieved high accuracy and precision, outperforming Logistic Regression, since the tree-based models can capture more complex relationships between the variables.
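For the tuning step, a grid search over the Gradient Boosting pipeline might look like the sketch below; it reuses `preprocess`, `X_train`, and `y_train` from the previous sketch, and the grid values are placeholders rather than the project's actual search space.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Placeholder hyperparameter grid for illustration only.
param_grid = {
    "model__n_estimators": [100, 200],
    "model__learning_rate": [0.05, 0.1],
    "model__max_depth": [2, 3, 4],
}

gb_pipe = Pipeline([("prep", preprocess),
                    ("model", GradientBoostingClassifier(random_state=42))])
search = GridSearchCV(gb_pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```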
The following chart highlights the top 10 most important features for predicting income levels using the Decision Tree model:
As seen in the chart, features such as marital status, education, and age are the strongest predictors of income, which aligns with our initial findings from the exploratory analysis.
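A chart like this can be produced from a fitted decision-tree pipeline as sketched below; the step names, the depth limit of 8, and the plot layout are illustrative assumptions built on the earlier pipeline sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree on the preprocessed features and plot its top-10 importances.
tree_pipe = Pipeline([("prep", preprocess),
                      ("model", DecisionTreeClassifier(max_depth=8, random_state=42))])
tree_pipe.fit(X_train, y_train)

feature_names = tree_pipe.named_steps["prep"].get_feature_names_out()
importances = tree_pipe.named_steps["model"].feature_importances_

top = np.argsort(importances)[::-1][:10]
plt.figure(figsize=(8, 5))
plt.barh(feature_names[top][::-1], importances[top][::-1])
plt.xlabel("Importance")
plt.title("Top 10 features (Decision Tree)")
plt.tight_layout()
plt.show()
```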
This project not only produced a reliable model for income classification but also deepened our understanding of the factors driving income inequality. Education, occupation, and marital status consistently emerged as the most impactful variables, both for prediction and for understanding socioeconomic disparities.
Future directions for this project could include:
- Exploring interaction effects between demographic variables.
- Extending the model to predict income across different industries or regions.
- Using more advanced models, such as neural networks, to capture additional complexities in the data.
- Languages: Python
- Libraries: Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib, SciPy
- Machine Learning Models: Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, Support Vector Machines (SVM), K-Nearest Neighbors, Naive Bayes