diff --git a/doc/modules/impute.rst b/doc/modules/impute.rst
index 45523d74fe9b8..1db20e9c6dcdb 100644
--- a/doc/modules/impute.rst
+++ b/doc/modules/impute.rst
@@ -9,19 +9,19 @@ Imputation of missing values
 For various reasons, many real world datasets contain missing values, often
 encoded as blanks, NaNs or other placeholders. Such datasets however are
 incompatible with scikit-learn estimators which assume that all values in an
-array are numerical, and that all have and hold meaning. A basic strategy to use
-incomplete datasets is to discard entire rows and/or columns containing missing
-values. However, this comes at the price of losing data which may be valuable
-(even though incomplete). A better strategy is to impute the missing values,
-i.e., to infer them from the known part of the data. See the :ref:`glossary`
-entry on imputation.
+array are numerical, and that all have and hold meaning. A basic strategy to
+use incomplete datasets is to discard entire rows and/or columns containing
+missing values. However, this comes at the price of losing data which may be
+valuable (even though incomplete). A better strategy is to impute the missing
+values, i.e., to infer them from the known part of the data. See the
+:ref:`glossary` entry on imputation.
 
 Univariate vs. Multivariate Imputation
 ======================================
 
-One type of imputation algorithm is univariate, which imputes values in the i-th
-feature dimension using only non-missing values in that feature dimension
+One type of imputation algorithm is univariate, which imputes values in the
+i-th feature dimension using only non-missing values in that feature dimension
 (e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
 algorithms use the entire set of available feature dimensions to estimate the
 missing values (e.g. :class:`impute.IterativeImputer`).
@@ -66,9 +66,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
      [6. 3.]
      [7. 6.]]
 
-Note that this format is not meant to be used to implicitly store missing values
-in the matrix because it would densify it at transform time. Missing values encoded
-by 0 must be used with dense input.
+Note that this format is not meant to be used to implicitly store missing
+values in the matrix because it would densify it at transform time. Missing
+values encoded by 0 must be used with dense input.
 
 The :class:`SimpleImputer` class also supports categorical data represented as
 string values or pandas categoricals when using the ``'most_frequent'`` or
@@ -110,7 +110,7 @@ round are returned.
     IterativeImputer(imputation_order='ascending', initial_strategy='mean',
              max_value=None, min_value=None, missing_values=nan, n_iter=10,
              n_nearest_features=None, predictor=None, random_state=0,
-             sample_posterior=False, verbose=False)
+             sample_posterior=False, verbose=0)
     >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
    >>> # the model learns that the second feature is double the first
    >>> print(np.round(imp.transform(X_test)))
@@ -118,23 +118,35 @@ round are returned.
      [ 6. 12.]
      [ 3.  6.]]
 
-Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a Pipeline
-as a way to build a composite estimator that supports imputation.
+Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
+Pipeline as a way to build a composite estimator that supports imputation.
 See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
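+
+For illustration, a minimal sketch of such a pipeline (the toy data and the
+:class:`~sklearn.linear_model.LinearRegression` step below are chosen purely
+for this example)::
+
+    >>> import numpy as np
+    >>> from sklearn.pipeline import make_pipeline
+    >>> from sklearn.impute import SimpleImputer
+    >>> from sklearn.linear_model import LinearRegression
+    >>> X = [[1, 2], [np.nan, 3], [7, 6]]
+    >>> y = [14.0, 16.0, 25.0]
+    >>> pipe = make_pipeline(SimpleImputer(strategy='mean'),
+    ...                      LinearRegression())
+    >>> pipe = pipe.fit(X, y)
+    >>> pipe.predict([[np.nan, 5]]).shape
+    (1,)
+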
+Flexibility of IterativeImputer
+-------------------------------
+
+There are many well-established imputation packages in the R data science
+ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns
+out to be a particular instance of a family of sequential imputation
+algorithms that can all be implemented with :class:`IterativeImputer` by
+passing in different regressors to be used for predicting missing feature
+values. In the case of missForest, this regressor is a Random Forest.
+See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
+
+
 .. _multiple_imputation:
 
 Multiple vs. Single Imputation
-==============================
+------------------------------
 
-In the statistics community, it is common practice to perform multiple imputations,
-generating, for example, ``m`` separate imputations for a single feature matrix.
-Each of these ``m`` imputations is then put through the subsequent analysis pipeline
-(e.g. feature engineering, clustering, regression, classification). The ``m`` final
-analysis results (e.g. held-out validation errors) allow the data scientist
-to obtain understanding of how analytic results may differ as a consequence
-of the inherent uncertainty caused by the missing values. The above practice
-is called multiple imputation.
+In the statistics community, it is common practice to perform multiple
+imputations, generating, for example, ``m`` separate imputations for a single
+feature matrix. Each of these ``m`` imputations is then put through the
+subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
+classification). The ``m`` final analysis results (e.g. held-out validation
+errors) allow the data scientist to obtain understanding of how analytic
+results may differ as a consequence of the inherent uncertainty caused by the
+missing values. The above practice is called multiple imputation.
 
 Our implementation of :class:`IterativeImputer` was inspired by the R MICE
 package (Multivariate Imputation by Chained Equations) [1]_, but differs from
@@ -144,13 +156,13 @@ it repeatedly to the same dataset with different random seeds when
 ``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
 vs. single imputations.
 
-It is still an open problem as to how useful single vs. multiple imputation is in
-the context of prediction and classification when the user is not interested in
-measuring uncertainty due to missing values.
+It is still an open problem as to how useful single vs. multiple imputation is
+in the context of prediction and classification when the user is not
+interested in measuring uncertainty due to missing values.
 
-Note that a call to the ``transform`` method of :class:`IterativeImputer` is not
-allowed to change the number of samples. Therefore multiple imputations cannot be
-achieved by a single call to ``transform``.
+Note that a call to the ``transform`` method of :class:`IterativeImputer` is
+not allowed to change the number of samples. Therefore multiple imputations
+cannot be achieved by a single call to ``transform``.
 
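+For illustration, a rough sketch of generating ``m`` imputations by re-fitting
+the imputer with different seeds (the toy matrix below is made up for this
+example)::
+
+    >>> import numpy as np
+    >>> from sklearn.impute import IterativeImputer
+    >>> X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
+    >>> imputations = []
+    >>> for i in range(5):
+    ...     imp = IterativeImputer(sample_posterior=True, random_state=i)
+    ...     imputations.append(imp.fit_transform(X))
+    >>> len(imputations)
+    5
+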
 References
 ==========
diff --git a/examples/impute/plot_iterative_imputer_variants_comparison.py b/examples/impute/plot_iterative_imputer_variants_comparison.py
new file mode 100644
index 0000000000000..a850deb273f24
--- /dev/null
+++ b/examples/impute/plot_iterative_imputer_variants_comparison.py
@@ -0,0 +1,126 @@
+"""
+=========================================================
+Imputing missing values with variants of IterativeImputer
+=========================================================
+
+The :class:`sklearn.impute.IterativeImputer` class is very flexible - it can be
+used with a variety of predictors to do round-robin regression, treating every
+variable as an output in turn.
+
+In this example we compare some predictors for the purpose of missing feature
+imputation with :class:`sklearn.impute.IterativeImputer`::
+
+    :class:`~sklearn.linear_model.BayesianRidge`: regularized linear regression
+    :class:`~sklearn.tree.DecisionTreeRegressor`: non-linear regression
+    :class:`~sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R
+    :class:`~sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN
+    imputation approaches
+
+Of particular interest is the ability of
+:class:`sklearn.impute.IterativeImputer` to mimic the behavior of missForest, a
+popular imputation package for R. In this example, we have chosen to use
+:class:`sklearn.ensemble.ExtraTreesRegressor` instead of
+:class:`sklearn.ensemble.RandomForestRegressor` (as in missForest) due to its
+increased speed.
+
+Note that :class:`sklearn.neighbors.KNeighborsRegressor` is different from KNN
+imputation, which learns from samples with missing values by using a distance
+metric that accounts for missing values, rather than imputing them.
+
+The goal is to compare different predictors to see which one is best for the
+:class:`sklearn.impute.IterativeImputer` when using a
+:class:`sklearn.linear_model.BayesianRidge` estimator on the California housing
+dataset with a single value randomly removed from each row.
+
+For this particular pattern of missing values we see that
+:class:`sklearn.ensemble.ExtraTreesRegressor` and
+:class:`sklearn.linear_model.BayesianRidge` give the best results.
+""" +print(__doc__) + +import numpy as np +import matplotlib.pyplot as plt +import pandas as pd + +from sklearn.datasets import fetch_california_housing +from sklearn.impute import SimpleImputer +from sklearn.impute import IterativeImputer +from sklearn.linear_model import BayesianRidge +from sklearn.tree import DecisionTreeRegressor +from sklearn.ensemble import ExtraTreesRegressor +from sklearn.neighbors import KNeighborsRegressor +from sklearn.pipeline import make_pipeline +from sklearn.model_selection import cross_val_score + +N_SPLITS = 5 + +rng = np.random.RandomState(0) + +X_full, y_full = fetch_california_housing(return_X_y=True) +n_samples, n_features = X_full.shape + +# Estimate the score on the entire dataset, with no missing values +br_estimator = BayesianRidge() +score_full_data = pd.DataFrame( + cross_val_score( + br_estimator, X_full, y_full, scoring='neg_mean_squared_error', + cv=N_SPLITS + ), + columns=['Full Data'] +) + +# Add a single missing value to each row +X_missing = X_full.copy() +y_missing = y_full +missing_samples = np.arange(n_samples) +missing_features = rng.choice(n_features, n_samples, replace=True) +X_missing[missing_samples, missing_features] = np.nan + +# Estimate the score after imputation (mean and median strategies) +score_simple_imputer = pd.DataFrame() +for strategy in ('mean', 'median'): + estimator = make_pipeline( + SimpleImputer(missing_values=np.nan, strategy=strategy), + br_estimator + ) + score_simple_imputer[strategy] = cross_val_score( + estimator, X_missing, y_missing, scoring='neg_mean_squared_error', + cv=N_SPLITS + ) + +# Estimate the score after iterative imputation of the missing values +# with different predictors +predictors = [ + BayesianRidge(), + DecisionTreeRegressor(max_features='sqrt', random_state=0), + ExtraTreesRegressor(n_estimators=10, n_jobs=-1, random_state=0), + KNeighborsRegressor(n_neighbors=15) +] +score_iterative_imputer = pd.DataFrame() +for predictor in predictors: + estimator = make_pipeline( + IterativeImputer(random_state=0, predictor=predictor), + br_estimator + ) + score_iterative_imputer[predictor.__class__.__name__] = \ + cross_val_score( + estimator, X_missing, y_missing, scoring='neg_mean_squared_error', + cv=N_SPLITS + ) + +scores = pd.concat( + [score_full_data, score_simple_imputer, score_iterative_imputer], + keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1 +) + +# plot boston results +fig, ax = plt.subplots(figsize=(13, 6)) +means = -scores.mean() +errors = scores.std() +means.plot.barh(xerr=errors, ax=ax) +ax.set_title('California Housing Regression with Different Imputation Methods') +ax.set_xlabel('MSE (smaller is better)') +ax.set_yticks(np.arange(means.shape[0])) +ax.set_yticklabels([" w/ ".join(label) for label in means.index.get_values()]) +plt.tight_layout(pad=1) +plt.show() diff --git a/examples/impute/plot_missing_values.py b/examples/impute/plot_missing_values.py index 43d7ddfc497f3..897b66aad246c 100644 --- a/examples/impute/plot_missing_values.py +++ b/examples/impute/plot_missing_values.py @@ -12,12 +12,13 @@ round-robin linear regression, treating every variable as an output in turn. The version implemented assumes Gaussian (output) variables. If your features are obviously non-Normal, consider transforming them to look more -Normal so as to improve performance. +Normal so as to potentially improve performance. 
 In addition of using an imputing method, we can also keep an indication of the
 missing information using :func:`sklearn.impute.MissingIndicator` which might
 carry some information.
 """
+print(__doc__)
 
 import numpy as np
 import matplotlib.pyplot as plt
@@ -31,6 +32,19 @@
 
 rng = np.random.RandomState(0)
 
+N_SPLITS = 5
+REGRESSOR = RandomForestRegressor(random_state=0, n_estimators=100)
+
+
+def get_scores_for_imputer(imputer, X_missing, y_missing):
+    estimator = make_pipeline(
+        make_union(imputer, MissingIndicator(missing_values=0)),
+        REGRESSOR)
+    impute_scores = cross_val_score(estimator, X_missing, y_missing,
+                                    scoring='neg_mean_squared_error',
+                                    cv=N_SPLITS)
+    return impute_scores
+
 
 def get_results(dataset):
     X_full, y_full = dataset.data, dataset.target
@@ -38,9 +52,9 @@ def get_results(dataset):
     n_features = X_full.shape[1]
 
     # Estimate the score on the entire dataset, with no missing values
-    estimator = RandomForestRegressor(random_state=0, n_estimators=100)
-    full_scores = cross_val_score(estimator, X_full, y_full,
-                                  scoring='neg_mean_squared_error', cv=5)
+    full_scores = cross_val_score(REGRESSOR, X_full, y_full,
+                                  scoring='neg_mean_squared_error',
+                                  cv=N_SPLITS)
 
     # Add missing values in 75% of the lines
     missing_rate = 0.75
@@ -51,35 +65,27 @@
                                          dtype=np.bool)))
     rng.shuffle(missing_samples)
     missing_features = rng.randint(0, n_features, n_missing_samples)
-
-    # Estimate the score after replacing missing values by 0
     X_missing = X_full.copy()
     X_missing[np.where(missing_samples)[0], missing_features] = 0
     y_missing = y_full.copy()
-    estimator = RandomForestRegressor(random_state=0, n_estimators=100)
-    zero_impute_scores = cross_val_score(estimator, X_missing, y_missing,
-                                         scoring='neg_mean_squared_error',
-                                         cv=5)
+
+    # Estimate the score after replacing missing values by 0
+    imputer = SimpleImputer(missing_values=0,
+                            strategy='constant',
+                            fill_value=0)
+    zero_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
 
     # Estimate the score after imputation (mean strategy) of the missing values
-    X_missing = X_full.copy()
-    X_missing[np.where(missing_samples)[0], missing_features] = 0
-    y_missing = y_full.copy()
-    estimator = make_pipeline(
-        make_union(SimpleImputer(missing_values=0, strategy="mean"),
-                   MissingIndicator(missing_values=0)),
-        RandomForestRegressor(random_state=0, n_estimators=100))
-    mean_impute_scores = cross_val_score(estimator, X_missing, y_missing,
-                                         scoring='neg_mean_squared_error',
-                                         cv=5)
+    imputer = SimpleImputer(missing_values=0, strategy="mean")
+    mean_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
 
     # Estimate the score after iterative imputation of the missing values
-    estimator = make_pipeline(
-        make_union(IterativeImputer(missing_values=0, random_state=0),
-                   MissingIndicator(missing_values=0)),
-        RandomForestRegressor(random_state=0, n_estimators=100))
-    iterative_impute_scores = cross_val_score(estimator, X_missing, y_missing,
-                                              scoring='neg_mean_squared_error')
+    imputer = IterativeImputer(missing_values=0,
+                               random_state=0,
+                               n_nearest_features=5)
+    iterative_impute_scores = get_scores_for_imputer(imputer,
+                                                     X_missing,
+                                                     y_missing)
 
     return ((full_scores.mean(), full_scores.std()),
             (zero_impute_scores.mean(), zero_impute_scores.std()),
diff --git a/sklearn/impute.py b/sklearn/impute.py
index 6dfce49f7b1f2..ef4e552260e05 100644
--- a/sklearn/impute.py
+++ b/sklearn/impute.py
@@ -556,7 +556,7 @@ def __init__(self,
                  initial_strategy="mean",
                  min_value=None,
                  max_value=None,
-                 verbose=False,
+                 verbose=0,
                  random_state=None):
         self.missing_values = missing_values