[MRG] Merge IterativeImputer into master branch #11977


Merged: 25 commits, Feb 15, 2019

Commits
dc67ec0
FEA Reinstate ChainedImputer
jnothman Sep 3, 2018
cbf89ec
Fix import of time
jnothman Sep 3, 2018
e4fa514
Merge branch 'master' into iterativeimputer
jnothman Sep 15, 2018
a4f2a89
[MRG] ChainedImputer -> IterativeImputer, and documentation update (#…
sergeyf Sep 16, 2018
09a9a21
[MRG] sample from a truncated normal instead of clipping samples from…
benlawson Oct 5, 2018
d854b45
Merge branch 'master' into iterativeimputer
jnothman Oct 7, 2018
caa089e
DOC Merge IterativeImputer what's news
jnothman Oct 7, 2018
1550d65
Merge branch 'master' into iterativeimputer
jnothman Jan 16, 2019
f103c6b
Undo changes to v0.20.rst
jnothman Jan 16, 2019
9e10658
Revert changes to v0.20.rst
jnothman Jan 16, 2019
0aab6dc
DOC Normalize whitespace in doctest
jnothman Jan 16, 2019
d34f227
Fix for SciPy 0.17
jnothman Jan 17, 2019
b44dff8
Fix doctest
jnothman Jan 17, 2019
0453c19
Create examples/impute gallery
jnothman Jan 22, 2019
8758561
Add missing readme file
jnothman Jan 22, 2019
f4d970e
Undo change to circle build
jnothman Jan 22, 2019
34b7a46
DOC Make IterativeImputer doctest more stable (#13026)
jnothman Jan 23, 2019
b58bd0b
TST IterativeImputer: Check predictor type (#13039)
jnothman Jan 24, 2019
cf4670c
EHN: Changing default model for IterativeImputer to BayesianRidge (#1…
sergeyf Jan 24, 2019
dc304a4
EXA Add IterativeImputer extended example (#12100)
sergeyf Jan 25, 2019
3c2716a
Merge branch 'master' into iterativeimputer
jnothman Jan 30, 2019
92e7316
ENH IterativeImputer: n_iter->max_iter (#13061)
sergeyf Feb 12, 2019
d409e5a
Merge branch 'master' into iterativeimputer
jnothman Feb 12, 2019
cb3ec84
pep8
jnothman Feb 12, 2019
c123440
API estimator is now first param of IterativeImputer (#13153)
sergeyf Feb 13, 2019
3 changes: 2 additions & 1 deletion doc/modules/classes.rst
@@ -655,8 +655,9 @@ Kernels:
:template: class.rst

impute.SimpleImputer
impute.IterativeImputer
impute.MissingIndicator

Review comment (Member): Could you remove this

Reply (Contributor): @jnothman I assume you don't want me to make a PR for this =)

Reply (Member): Oh sorry, I did not see it was @jnothman's PR. I'll fix it then :)

.. _kernel_approximation_ref:

:mod:`sklearn.kernel_approximation` Kernel Approximation
121 changes: 110 additions & 11 deletions doc/modules/impute.rst
@@ -9,12 +9,28 @@ Imputation of missing values
For various reasons, many real world datasets contain missing values, often
encoded as blanks, NaNs or other placeholders. Such datasets however are
incompatible with scikit-learn estimators which assume that all values in an
array are numerical, and that all have and hold meaning. A basic strategy to
use incomplete datasets is to discard entire rows and/or columns containing
missing values. However, this comes at the price of losing data which may be
valuable (even though incomplete). A better strategy is to impute the missing
values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.


Univariate vs. Multivariate Imputation
======================================

One type of imputation algorithm is univariate, which imputes values in the
i-th feature dimension using only non-missing values in that feature dimension
(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
algorithms use the entire set of available feature dimensions to estimate the
missing values (e.g. :class:`impute.IterativeImputer`).
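
For illustration, here is a minimal univariate sketch (the input values are
made up for demonstration): each missing entry is filled using statistics of
its own column only::

    >>> import numpy as np
    >>> from sklearn.impute import SimpleImputer
    >>> X = [[1, 2], [np.nan, 4], [7, 6]]
    >>> # the NaN in the first column becomes that column's mean, (1 + 7) / 2 = 4
    >>> print(SimpleImputer(strategy='mean').fit_transform(X))
    [[1. 2.]
     [4. 4.]
     [7. 6.]]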


.. _single_imputer:

Univariate feature imputation
=============================

The :class:`SimpleImputer` class provides basic strategies for imputing missing
values. Missing values can be imputed with a provided constant value, or using
@@ -50,9 +66,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
[6. 3.]
[7. 6.]]
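
The lines above are the tail of a doctest collapsed by the diff view; in full
it runs along these lines (a reconstruction, so treat the exact values as an
assumption)::

    >>> import scipy.sparse as sp
    >>> from sklearn.impute import SimpleImputer
    >>> X = sp.csc_matrix([[1, 2], [0, -1], [8, 4]])
    >>> imp = SimpleImputer(missing_values=-1, strategy='mean')
    >>> imp = imp.fit(X)
    >>> X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])
    >>> print(imp.transform(X_test).toarray())
    [[3. 2.]
     [6. 3.]
     [7. 6.]]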

Note that this format is not meant to be used to implicitly store missing
values in the matrix because it would densify it at transform time. Missing
values encoded by 0 must be used with dense input.
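
For instance, a dense array where ``0`` marks missingness (a minimal sketch
with illustrative values)::

    >>> import numpy as np
    >>> from sklearn.impute import SimpleImputer
    >>> imp = SimpleImputer(missing_values=0, strategy='mean')
    >>> print(imp.fit_transform(np.array([[0, 2], [4, 0], [8, 6]])))
    [[6. 2.]
     [4. 4.]
     [8. 6.]]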

The :class:`SimpleImputer` class also supports categorical data represented as
string values or pandas categoricals when using the ``'most_frequent'`` or
@@ -71,9 +87,92 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
['a' 'y']
['b' 'y']]
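
The collapsed lines above belong to a doctest along the following lines (a
reconstruction, so treat the exact input values as an assumption)::

    >>> import numpy as np
    >>> import pandas as pd
    >>> from sklearn.impute import SimpleImputer
    >>> df = pd.DataFrame([["a", "x"],
    ...                    [np.nan, "y"],
    ...                    ["a", np.nan],
    ...                    ["b", "y"]], dtype="category")
    >>> imp = SimpleImputer(strategy="most_frequent")
    >>> print(imp.fit_transform(df))
    [['a' 'x']
     ['a' 'y']
     ['a' 'y']
     ['b' 'y']]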

.. _iterative_imputer:


Multivariate feature imputation
===============================

A more sophisticated approach is to use the :class:`IterativeImputer` class,
which models each feature with missing values as a function of other features,
and uses that estimate for imputation. It does so in an iterated round-robin
fashion: at each step, a feature column is designated as output ``y`` and the
other feature columns are treated as inputs ``X``. A regressor is fit on ``(X,
y)`` for known ``y``. Then, the regressor is used to predict the missing values
of ``y``. This is done for each feature in an iterative fashion, and then is
repeated for ``max_iter`` imputation rounds. The results of the final
imputation round are returned.

>>> import numpy as np
>>> from sklearn.impute import IterativeImputer
>>> imp = IterativeImputer(max_iter=10, random_state=0)
>>> imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]) # doctest: +NORMALIZE_WHITESPACE
IterativeImputer(estimator=None, imputation_order='ascending',
                 initial_strategy='mean', max_iter=10, max_value=None,
                 min_value=None, missing_values=nan, n_nearest_features=None,
                 random_state=0, sample_posterior=False, tol=0.001, verbose=0)
>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
>>> # the model learns that the second feature is double the first
>>> print(np.round(imp.transform(X_test)))
[[ 1.  2.]
 [ 6. 12.]
 [ 3.  6.]]
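
For intuition, the round-robin procedure can be sketched in a few lines (a
simplification, not scikit-learn's actual implementation: it omits
``imputation_order``, the ``tol`` stopping criterion, ``n_nearest_features``
and ``sample_posterior``)::

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    def round_robin_impute(X, max_iter=10):
        X = np.asarray(X, dtype=float)
        mask = np.isnan(X)
        # initialize missing entries with column means (initial_strategy='mean')
        X_filled = np.where(mask, np.nanmean(X, axis=0), X)
        for _ in range(max_iter):
            for j in range(X.shape[1]):
                if not mask[:, j].any():
                    continue
                # column j is the output y; all other columns are the inputs X
                rest = np.delete(X_filled, j, axis=1)
                reg = BayesianRidge().fit(rest[~mask[:, j]],
                                          X_filled[~mask[:, j], j])
                X_filled[mask[:, j], j] = reg.predict(rest[mask[:, j]])
        return X_filled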

Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
Pipeline as a way to build a composite estimator that supports imputation.
See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
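
A minimal sketch of such a pipeline (the choice of final regressor here is
arbitrary)::

    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.impute import IterativeImputer
    >>> from sklearn.linear_model import BayesianRidge
    >>> pipe = make_pipeline(IterativeImputer(random_state=0), BayesianRidge())
    >>> # fitting the pipeline imputes the training data before the regressor sees it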

Flexibility of IterativeImputer
-------------------------------

There are many well-established imputation packages in the R data science
ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and it
turns out to be a particular instance of a family of sequential imputation
algorithms that can all be implemented with :class:`IterativeImputer` by
passing in different regressors to be used for predicting missing feature
values. In the case of missForest, this regressor is a Random Forest.
See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
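
For example, a missForest-like imputer can be set up as follows (the
hyperparameters are illustrative)::

    >>> from sklearn.ensemble import ExtraTreesRegressor
    >>> from sklearn.impute import IterativeImputer
    >>> # missForest fits a random forest per feature; ExtraTrees is a close,
    >>> # faster relative
    >>> imp = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10,
    ...                                                      random_state=0),
    ...                        random_state=0)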


.. _multiple_imputation:

Multiple vs. Single Imputation
------------------------------

In the statistics community, it is common practice to perform multiple
imputations, generating, for example, ``m`` separate imputations for a single
feature matrix. Each of these ``m`` imputations is then put through the
subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
classification). The ``m`` final analysis results (e.g. held-out validation
errors) allow the data scientist to understand how analytic results may differ
as a consequence of the inherent uncertainty caused by the missing values.
This practice is called multiple imputation.

Our implementation of :class:`IterativeImputer` was inspired by the R MICE
package (Multivariate Imputation by Chained Equations) [1]_, but differs from
it by returning a single imputation instead of multiple imputations. However,
:class:`IterativeImputer` can also be used for multiple imputations by applying
it repeatedly to the same dataset with different random seeds when
``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
vs. single imputations.

It is still an open problem as to how useful single vs. multiple imputation is
in the context of prediction and classification when the user is not
interested in measuring uncertainty due to missing values.

Note that a call to the ``transform`` method of :class:`IterativeImputer` is
not allowed to change the number of samples. Therefore multiple imputations
cannot be achieved by a single call to ``transform``.
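
The repeated-application recipe can be sketched as follows (the number of
imputations, 5 here, is arbitrary)::

    >>> import numpy as np
    >>> from sklearn.impute import IterativeImputer
    >>> X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
    >>> imputations = [
    ...     IterativeImputer(sample_posterior=True,
    ...                      random_state=seed).fit_transform(X)
    ...     for seed in range(5)
    ... ]
    >>> # each element is one completed dataset; the downstream analysis is then
    >>> # run once per element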

References
==========

.. [1] Stef van Buuren, Karin Groothuis-Oudshoorn (2011). "mice: Multivariate
Imputation by Chained Equations in R". Journal of Statistical Software 45:
1-67.

.. [2] Roderick J A Little and Donald B Rubin (1986). "Statistical Analysis
with Missing Data". John Wiley & Sons, Inc., New York, NY, USA.

.. _missing_indicator:

9 changes: 9 additions & 0 deletions doc/whats_new/v0.21.rst
@@ -115,6 +115,15 @@ Support for Python 3.4 and below has been officially dropped.
- |API| Deprecated :mod:`externals.six` since we have dropped support for
Python 2.7. :issue:`12916` by :user:`Hanmin Qin <qinhanmin2014>`.

:mod:`sklearn.impute`
.....................

- |MajorFeature| Added :class:`impute.IterativeImputer`, which is a strategy
  for imputing missing values by modeling each feature with missing values as
  a function of other features in a round-robin fashion. :issue:`8478` and
  :issue:`12177` by :user:`Sergey Feldman <sergeyf>` and :user:`Ben Lawson
  <benlawson>`.

:mod:`sklearn.linear_model`
...........................

6 changes: 6 additions & 0 deletions examples/impute/README.txt
@@ -0,0 +1,6 @@
.. _impute_examples:

Missing Value Imputation
------------------------

Examples concerning the :mod:`sklearn.impute` module.
126 changes: 126 additions & 0 deletions examples/impute/plot_iterative_imputer_variants_comparison.py
@@ -0,0 +1,126 @@
"""
=========================================================
Imputing missing values with variants of IterativeImputer
=========================================================

The :class:`sklearn.impute.IterativeImputer` class is very flexible - it can be
used with a variety of estimators to do round-robin regression, treating every
variable as an output in turn.

In this example we compare some estimators for the purpose of missing feature
imputation with :class:`sklearn.impute.IterativeImputer`:

* :class:`~sklearn.linear_model.BayesianRidge`: regularized linear regression
* :class:`~sklearn.tree.DecisionTreeRegressor`: non-linear regression
* :class:`~sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R
* :class:`~sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN
  imputation approaches

Of particular interest is the ability of
:class:`sklearn.impute.IterativeImputer` to mimic the behavior of missForest, a
popular imputation package for R. In this example, we have chosen to use
:class:`sklearn.ensemble.ExtraTreesRegressor` instead of
:class:`sklearn.ensemble.RandomForestRegressor` (as in missForest) due to its
increased speed.

Note that :class:`sklearn.neighbors.KNeighborsRegressor` is different from KNN
imputation, which learns from samples with missing values by using a distance
metric that accounts for missing values, rather than imputing them.

The goal is to compare the different estimators to see which one works best
with :class:`sklearn.impute.IterativeImputer` when a
:class:`sklearn.linear_model.BayesianRidge` estimator is used as the
downstream regressor, on the California housing dataset with a single value
randomly removed from each row.

For this particular pattern of missing values we see that
:class:`sklearn.ensemble.ExtraTreesRegressor` and
:class:`sklearn.linear_model.BayesianRidge` give the best results.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

N_SPLITS = 5

rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples, n_features = X_full.shape

# Estimate the score on the entire dataset, with no missing values
br_estimator = BayesianRidge()
score_full_data = pd.DataFrame(
    cross_val_score(
        br_estimator, X_full, y_full, scoring='neg_mean_squared_error',
        cv=N_SPLITS
    ),
    columns=['Full Data']
)

# Add a single missing value to each row
X_missing = X_full.copy()
y_missing = y_full
missing_samples = np.arange(n_samples)
missing_features = rng.choice(n_features, n_samples, replace=True)
X_missing[missing_samples, missing_features] = np.nan

# Estimate the score after imputation (mean and median strategies)
score_simple_imputer = pd.DataFrame()
for strategy in ('mean', 'median'):
    estimator = make_pipeline(
        SimpleImputer(missing_values=np.nan, strategy=strategy),
        br_estimator
    )
    score_simple_imputer[strategy] = cross_val_score(
        estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
        cv=N_SPLITS
    )

# Estimate the score after iterative imputation of the missing values
# with different estimators
estimators = [
    BayesianRidge(),
    DecisionTreeRegressor(max_features='sqrt', random_state=0),
    ExtraTreesRegressor(n_estimators=10, n_jobs=-1, random_state=0),
    KNeighborsRegressor(n_neighbors=15)
]
score_iterative_imputer = pd.DataFrame()
# use a distinct loop variable so the column label is the imputation
# estimator's class name, not 'Pipeline'
for impute_estimator in estimators:
    estimator = make_pipeline(
        IterativeImputer(random_state=0, estimator=impute_estimator),
        br_estimator
    )
    score_iterative_imputer[impute_estimator.__class__.__name__] = \
        cross_val_score(
            estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
            cv=N_SPLITS
        )

scores = pd.concat(
    [score_full_data, score_simple_imputer, score_iterative_imputer],
    keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
)

# plot california housing results
fig, ax = plt.subplots(figsize=(13, 6))
means = -scores.mean()
errors = scores.std()
means.plot.barh(xerr=errors, ax=ax)
ax.set_title('California Housing Regression with Different Imputation Methods')
ax.set_xlabel('MSE (smaller is better)')
ax.set_yticks(np.arange(means.shape[0]))
ax.set_yticklabels([" w/ ".join(label) for label in means.index.get_values()])
plt.tight_layout(pad=1)
plt.show()