The goal of NumericEnsembles is to automatically conduct a thorough analysis of numeric data. The user only needs to provide the data and answer a few questions (such as which column to analyze). NumericEnsembles fits 18 individual models to the training data, makes predictions, and checks accuracy for each individual model. It then builds 14 ensembles from the predictions of the individual models, fits each ensemble model to the training data, and makes predictions and tracks accuracy for each ensemble. The package also automatically returns 30 plots (such as train vs holdout for the best model), 6 tables (such as the head of the data), and a grand summary table sorted by accuracy, with the best model at the top of the report.
You can install the development version of NumericEnsembles like so:
devtools::install_github("InfiniteCuriosity/NumericEnsembles")
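Installing from GitHub requires the devtools package. If it is not already available, a standard CRAN install adds it first (sketch; any GitHub-install helper such as remotes works equally well):

```r
# install devtools from CRAN if it is not already present
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
```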
NumericEnsembles will automatically build 32 models to predict the median home value (medv, column 14) in the Boston housing data set from the MASS package.
library(NumericEnsembles)
Numeric(data = MASS::Boston,
        colnum = 14,
        numresamples = 2,
        remove_VIF_above = 5.00,
        remove_ensemble_correlations_greater_than = 1.00,
        scale_all_predictors_in_data = "N",
        data_reduction_method = 0,
        ensemble_reduction_method = 0,
        how_to_handle_strings = 0,
        predict_on_new_data = "N",
        save_all_trained_models = "N",
        set_seed = "N",
        save_all_plots = "N",
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20)
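The train_amount, test_amount and validation_amount arguments control how rows are partitioned on each resample. A minimal base-R sketch of one 60/20/20 split is below; this is illustrative only, and the package performs its own resampling internally:

```r
set.seed(42)  # for a reproducible illustration only
n <- nrow(MASS::Boston)

# randomly assign each row to one of the three partitions
idx <- sample(c("train", "test", "validation"),
              size = n, replace = TRUE,
              prob = c(0.60, 0.20, 0.20))

train_data      <- MASS::Boston[idx == "train", ]
test_data       <- MASS::Boston[idx == "test", ]
validation_data <- MASS::Boston[idx == "validation", ]
```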
The 32 models, all built automatically and without error, are:
- Bagging
- BayesGLM
- BayesRNN
- Cubist
- Earth
- Elastic (optimized by cross-validation)
- Ensemble Bagging
- Ensemble BayesGLM
- Ensemble BayesRNN
- Ensemble Cubist
- Ensemble Earth
- Ensemble Elastic (optimized by cross-validation)
- Ensemble Gradient Boosted
- Ensemble Lasso (optimized by cross-validation)
- Ensemble Linear (tuned)
- Ensemble Ridge (optimized by cross-validation)
- Ensemble RPart
- Ensemble SVM (tuned)
- Ensemble Trees
- Ensemble XGBoost
- GAM (Generalized Additive Models, with smoothing splines)
- Gradient Boosted (optimized)
- Lasso
- Linear (tuned)
- Neuralnet
- PCR (Principal Components Regression)
- PLS (Partial Least Squares)
- Ridge (optimized by cross-validation)
- RPart
- SVM (Support Vector Machines, tuned)
- Tree
- XGBoost
The 30 plots created automatically are:
- Correlation plot of the numeric data (as numbers and colors)
- Correlation plot of the numeric data (as circles with colors)
- Cook's D Bar Plot
- Four plots in one for the most accurate model: Predicted vs actual, Residuals, Histogram of residuals, Q-Q plot
- Most accurate model: Predicted vs actual
- Most accurate model: Residuals
- Most accurate model: Histogram of residuals
- Most accurate model: Q-Q plot
- Accuracy by resample and model, fixed scales
- Accuracy by resample and model, free scales
- Holdout RMSE/train RMSE, fixed scales
- Holdout RMSE/train RMSE, free scales
- Histograms of each numeric column
- Boxplots of each numeric column
- Predictor vs target variable
- Model accuracy bar chart (RMSE)
- t-test p-value bar chart
- Train vs holdout by resample and model, free scales
- Train vs holdout by resample and model, fixed scales
- Duration bar chart
- Holdout RMSE / train RMSE bar chart
- Mean bias bar chart
- Mean MSE bar chart
- Mean MAE bar chart
- Mean SSE bar chart
- Kolmogorov-Smirnov test bar chart
- Bias plot by model and resample
- MSE plot by model and resample
- MAE plot by model and resample
- SSE plot by model and resample
The tables created automatically (which are both searchable and sortable) are:
- Variance Inflation Factor
- Correlation of the ensemble
- Head of the ensemble
- Data summary
- Correlation of the data
- Grand summary table includes:
- Mean holdout RMSE
- Standard deviation of mean holdout RMSE
- t-test value
- t-test p-value
- t-test p-value standard deviation
- Kolmogorov-Smirnov stat mean
- Kolmogorov-Smirnov stat p-value
- Kolmogorov-Smirnov stat standard deviation
- Mean bias
- Mean bias standard deviation
- Mean MAE
- Mean MAE standard deviation
- Mean MSE
- Mean MSE standard deviation
- Mean SSE
- Mean SSE standard deviation
- Mean data (the mean of the target column in the original data set)
- Standard deviation of mean data (the standard deviation of the target column in the original data set)
- Mean train RMSE
- Mean test RMSE
- Mean validation RMSE
- Holdout vs train mean
- Holdout vs train standard deviation
- Duration
- Duration standard deviation
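The error summaries in the grand table are standard accuracy measures. As a sketch of how such values can be computed from a vector of holdout predictions (illustrative vectors, not the package's internal code):

```r
# y: actual holdout values; yhat: model predictions (made-up example values)
y    <- c(24.0, 21.6, 34.7, 33.4)
yhat <- c(25.1, 20.9, 33.0, 34.2)

bias <- mean(yhat - y)       # mean bias: average signed error
mae  <- mean(abs(yhat - y))  # mean absolute error
mse  <- mean((yhat - y)^2)   # mean squared error
sse  <- sum((yhat - y)^2)    # sum of squared errors
rmse <- sqrt(mse)            # root mean squared error
```

In the report these statistics are averaged over the resamples, which is why each one is paired with a standard deviation.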
The NumericEnsembles package can also save trained models and use those same pre-trained models to make predictions on totally unseen data.
The package contains two example data sets to demonstrate this feature. Boston_housing is the Boston Housing data set with the first five rows removed; we will build our models on that data set. NewBoston is totally new data: it is in fact the first five rows of the original Boston Housing data set.
library(NumericEnsembles)
Numeric(data = Boston_housing,
        colnum = 14,
        numresamples = 25,
        remove_VIF_above = 5.00,
        remove_ensemble_correlations_greater_than = 1.00,
        scale_all_predictors_in_data = "N",
        data_reduction_method = 0,
        ensemble_reduction_method = 0,
        how_to_handle_strings = 0,
        predict_on_new_data = "Y",
        set_seed = "N",
        save_all_trained_models = "N",
        save_all_plots = "N",
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20)
When asked "What is the URL of the new data?", use the NewBoston data set. The URL for the new data is: https://raw.githubusercontent.com/InfiniteCuriosity/EnsemblesData/refs/heads/main/NewBoston.csv
External data may be used to accomplish the same result, provided the new data has the same column structure as the data the models were trained on.
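As a quick sanity check before running the analysis, the NewBoston data referenced above can be read directly from its URL with base R (sketch; the package prompts for this URL itself):

```r
# read the new data from the URL given above and inspect it
new_data <- read.csv("https://raw.githubusercontent.com/InfiniteCuriosity/EnsemblesData/refs/heads/main/NewBoston.csv")
head(new_data)
str(new_data)  # column names and types should match the training data
```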