[MRG] Merge IterativeImputer into master branch #11977


Merged: 25 commits, Feb 15, 2019

Commits
dc67ec0
FEA Reinstate ChainedImputer
jnothman Sep 3, 2018
cbf89ec
Fix import of time
jnothman Sep 3, 2018
e4fa514
Merge branch 'master' into iterativeimputer
jnothman Sep 15, 2018
a4f2a89
[MRG] ChainedImputer -> IterativeImputer, and documentation update (#…
sergeyf Sep 16, 2018
09a9a21
[MRG] sample from a truncated normal instead of clipping samples from…
benlawson Oct 5, 2018
d854b45
Merge branch 'master' into iterativeimputer
jnothman Oct 7, 2018
caa089e
DOC Merge IterativeImputer what's news
jnothman Oct 7, 2018
1550d65
Merge branch 'master' into iterativeimputer
jnothman Jan 16, 2019
f103c6b
Undo changes to v0.20.rst
jnothman Jan 16, 2019
9e10658
Revert changes to v0.20.rst
jnothman Jan 16, 2019
0aab6dc
DOC Normalize whitespace in doctest
jnothman Jan 16, 2019
d34f227
Fix for SciPy 0.17
jnothman Jan 17, 2019
b44dff8
Fix doctest
jnothman Jan 17, 2019
0453c19
Create examples/impute gallery
jnothman Jan 22, 2019
8758561
Add missing readme file
jnothman Jan 22, 2019
f4d970e
Undo change to circle build
jnothman Jan 22, 2019
34b7a46
DOC Make IterativeImputer doctest more stable (#13026)
jnothman Jan 23, 2019
b58bd0b
TST IterativeImputer: Check predictor type (#13039)
jnothman Jan 24, 2019
cf4670c
EHN: Changing default model for IterativeImputer to BayesianRidge (#1…
sergeyf Jan 24, 2019
dc304a4
EXA Add IterativeImputer extended example (#12100)
sergeyf Jan 25, 2019
3c2716a
Merge branch 'master' into iterativeimputer
jnothman Jan 30, 2019
92e7316
ENH IterativeImputer: n_iter->max_iter (#13061)
sergeyf Feb 12, 2019
d409e5a
Merge branch 'master' into iterativeimputer
jnothman Feb 12, 2019
cb3ec84
pep8
jnothman Feb 12, 2019
c123440
API estimator is now first param of IterativeImputer (#13153)
sergeyf Feb 13, 2019
3 changes: 2 additions & 1 deletion doc/modules/classes.rst
@@ -655,8 +655,9 @@ Kernels:
:template: class.rst

impute.SimpleImputer
impute.IterativeImputer
impute.MissingIndicator

Review comment (Member): Could you remove this

Reply (Contributor): @jnothman I assume you don't want me to make a PR for this =)

Reply (Member): Oh sorry, I did not see it was @jnothman's PR. I'll fix it then :)

.. _kernel_approximation_ref:

:mod:`sklearn.kernel_approximation` Kernel Approximation
121 changes: 110 additions & 11 deletions doc/modules/impute.rst
@@ -9,12 +9,28 @@ Imputation of missing values
For various reasons, many real world datasets contain missing values, often
encoded as blanks, NaNs or other placeholders. Such datasets however are
incompatible with scikit-learn estimators which assume that all values in an
array are numerical, and that all have and hold meaning. A basic strategy to
use incomplete datasets is to discard entire rows and/or columns containing
missing values. However, this comes at the price of losing data which may be
valuable (even though incomplete). A better strategy is to impute the missing
values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.


Univariate vs. Multivariate Imputation
======================================

One type of imputation algorithm is univariate, which imputes values in the
i-th feature dimension using only non-missing values in that feature dimension
(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
algorithms use the entire set of available feature dimensions to estimate the
missing values (e.g. :class:`impute.IterativeImputer`).
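
For illustration, here is a minimal univariate sketch (the input values are
made up for demonstration): each missing entry is filled using statistics of
its own column only::

    >>> import numpy as np
    >>> from sklearn.impute import SimpleImputer
    >>> X = [[1, 2], [np.nan, 4], [7, 6]]
    >>> # the NaN in the first column becomes that column's mean, (1 + 7) / 2 = 4
    >>> print(SimpleImputer(strategy='mean').fit_transform(X))
    [[1. 2.]
     [4. 4.]
     [7. 6.]]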


.. _single_imputer:

Univariate feature imputation
=============================

The :class:`SimpleImputer` class provides basic strategies for imputing missing
values. Missing values can be imputed with a provided constant value, or using
@@ -50,9 +66,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
[6. 3.]
[7. 6.]]
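
The lines above are the tail of a doctest collapsed by the diff view; in full
it runs along these lines (a reconstruction, so treat the exact values as an
assumption)::

    >>> import scipy.sparse as sp
    >>> from sklearn.impute import SimpleImputer
    >>> X = sp.csc_matrix([[1, 2], [0, -1], [8, 4]])
    >>> imp = SimpleImputer(missing_values=-1, strategy='mean')
    >>> imp = imp.fit(X)
    >>> X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])
    >>> print(imp.transform(X_test).toarray())
    [[3. 2.]
     [6. 3.]
     [7. 6.]]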

Note that this format is not meant to be used to implicitly store missing
values in the matrix because it would densify it at transform time. Missing
values encoded by 0 must be used with dense input.
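
For instance, a dense array where ``0`` marks missingness (a minimal sketch
with illustrative values)::

    >>> import numpy as np
    >>> from sklearn.impute import SimpleImputer
    >>> imp = SimpleImputer(missing_values=0, strategy='mean')
    >>> print(imp.fit_transform(np.array([[0, 2], [4, 0], [8, 6]])))
    [[6. 2.]
     [4. 4.]
     [8. 6.]]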

The :class:`SimpleImputer` class also supports categorical data represented as
string values or pandas categoricals when using the ``'most_frequent'`` or
@@ -71,9 +87,92 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
['a' 'y']
['b' 'y']]
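
The collapsed lines above belong to a doctest along the following lines (a
reconstruction, so treat the exact input values as an assumption)::

    >>> import numpy as np
    >>> import pandas as pd
    >>> from sklearn.impute import SimpleImputer
    >>> df = pd.DataFrame([["a", "x"],
    ...                    [np.nan, "y"],
    ...                    ["a", np.nan],
    ...                    ["b", "y"]], dtype="category")
    >>> imp = SimpleImputer(strategy="most_frequent")
    >>> print(imp.fit_transform(df))
    [['a' 'x']
     ['a' 'y']
     ['a' 'y']
     ['b' 'y']]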

.. _iterative_imputer:


Multivariate feature imputation
===============================

A more sophisticated approach is to use the :class:`IterativeImputer` class,
which models each feature with missing values as a function of other features,
and uses that estimate for imputation. It does so in an iterated round-robin
fashion: at each step, a feature column is designated as output ``y`` and the
other feature columns are treated as inputs ``X``. A regressor is fit on ``(X,
y)`` for known ``y``. Then, the regressor is used to predict the missing values
of ``y``. This is done for each feature in an iterative fashion, and then is
repeated for ``max_iter`` imputation rounds. The results of the final
imputation round are returned.

>>> import numpy as np
>>> from sklearn.impute import IterativeImputer
>>> imp = IterativeImputer(max_iter=10, random_state=0)
>>> imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]) # doctest: +NORMALIZE_WHITESPACE
IterativeImputer(estimator=None, imputation_order='ascending',
                 initial_strategy='mean', max_iter=10, max_value=None,
                 min_value=None, missing_values=nan, n_nearest_features=None,
                 random_state=0, sample_posterior=False, tol=0.001, verbose=0)
>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
>>> # the model learns that the second feature is double the first
>>> print(np.round(imp.transform(X_test)))
[[ 1.  2.]
 [ 6. 12.]
 [ 3.  6.]]
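
For intuition, the round-robin procedure can be sketched in a few lines (a
simplification, not scikit-learn's actual implementation: it omits
``imputation_order``, the ``tol`` stopping criterion, ``n_nearest_features``
and ``sample_posterior``)::

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    def round_robin_impute(X, max_iter=10):
        X = np.asarray(X, dtype=float)
        mask = np.isnan(X)
        # initialize missing entries with column means (initial_strategy='mean')
        X_filled = np.where(mask, np.nanmean(X, axis=0), X)
        for _ in range(max_iter):
            for j in range(X.shape[1]):
                if not mask[:, j].any():
                    continue
                # column j is the output y; all other columns are the inputs X
                rest = np.delete(X_filled, j, axis=1)
                reg = BayesianRidge().fit(rest[~mask[:, j]],
                                          X_filled[~mask[:, j], j])
                X_filled[mask[:, j], j] = reg.predict(rest[mask[:, j]])
        return X_filled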

Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
Pipeline as a way to build a composite estimator that supports imputation.
See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
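
A minimal sketch of such a pipeline (the choice of final regressor here is
arbitrary)::

    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.impute import IterativeImputer
    >>> from sklearn.linear_model import BayesianRidge
    >>> pipe = make_pipeline(IterativeImputer(random_state=0), BayesianRidge())
    >>> # fitting the pipeline imputes the training data before the regressor sees it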

Flexibility of IterativeImputer
-------------------------------

There are many well-established imputation packages in the R data science
ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and it
turns out to be a particular instance of a family of sequential imputation
algorithms that can all be implemented with :class:`IterativeImputer` by
passing in different regressors to be used for predicting missing feature
values. In the case of missForest, this regressor is a Random Forest.
See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
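
For example, a missForest-like imputer can be set up as follows (the
hyperparameters are illustrative)::

    >>> from sklearn.ensemble import ExtraTreesRegressor
    >>> from sklearn.impute import IterativeImputer
    >>> # missForest fits a random forest per feature; ExtraTrees is a close,
    >>> # faster relative
    >>> imp = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10,
    ...                                                      random_state=0),
    ...                        random_state=0)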


.. _multiple_imputation:

Multiple vs. Single Imputation
------------------------------

In the statistics community, it is common practice to perform multiple
imputations, generating, for example, ``m`` separate imputations for a single
feature matrix. Each of these ``m`` imputations is then put through the
subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
classification). The ``m`` final analysis results (e.g. held-out validation
errors) allow the data scientist to understand how analytic results may differ
as a consequence of the inherent uncertainty caused by the missing values.
This practice is called multiple imputation.

Our implementation of :class:`IterativeImputer` was inspired by the R MICE
package (Multivariate Imputation by Chained Equations) [1]_, but differs from
it by returning a single imputation instead of multiple imputations. However,
:class:`IterativeImputer` can also be used for multiple imputations by applying
it repeatedly to the same dataset with different random seeds when
``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
vs. single imputations.

It is still an open problem as to how useful single vs. multiple imputation is
in the context of prediction and classification when the user is not
interested in measuring uncertainty due to missing values.

Note that a call to the ``transform`` method of :class:`IterativeImputer` is
not allowed to change the number of samples. Therefore multiple imputations
cannot be achieved by a single call to ``transform``.
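
The repeated-application recipe can be sketched as follows (the number of
imputations, 5 here, is arbitrary)::

    >>> import numpy as np
    >>> from sklearn.impute import IterativeImputer
    >>> X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
    >>> imputations = [
    ...     IterativeImputer(sample_posterior=True,
    ...                      random_state=seed).fit_transform(X)
    ...     for seed in range(5)
    ... ]
    >>> # each element is one completed dataset; the downstream analysis is then
    >>> # run once per element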

References
==========

.. [1] Stef van Buuren, Karin Groothuis-Oudshoorn (2011). "mice: Multivariate
Imputation by Chained Equations in R". Journal of Statistical Software 45:
1-67.

.. [2] Roderick J A Little and Donald B Rubin (1986). "Statistical Analysis
with Missing Data". John Wiley & Sons, Inc., New York, NY, USA.

.. _missing_indicator:

9 changes: 9 additions & 0 deletions doc/whats_new/v0.21.rst
@@ -115,6 +115,15 @@ Support for Python 3.4 and below has been officially dropped.
- |API| Deprecated :mod:`externals.six` since we have dropped support for
Python 2.7. :issue:`12916` by :user:`Hanmin Qin <qinhanmin2014>`.

:mod:`sklearn.impute`
.....................

- |MajorFeature| Added :class:`impute.IterativeImputer`, which is a strategy
  for imputing missing values by modeling each feature with missing values as
  a function of other features in a round-robin fashion. :issue:`8478` and
  :issue:`12177` by :user:`Sergey Feldman <sergeyf>` and :user:`Ben Lawson
  <benlawson>`.

:mod:`sklearn.linear_model`
...........................

6 changes: 6 additions & 0 deletions examples/impute/README.txt
@@ -0,0 +1,6 @@
.. _impute_examples:

Missing Value Imputation
------------------------

Examples concerning the :mod:`sklearn.impute` module.
126 changes: 126 additions & 0 deletions examples/impute/plot_iterative_imputer_variants_comparison.py
@@ -0,0 +1,126 @@
"""
=========================================================
Imputing missing values with variants of IterativeImputer
=========================================================

The :class:`sklearn.impute.IterativeImputer` class is very flexible - it can be
used with a variety of estimators to do round-robin regression, treating every
variable as an output in turn.

In this example we compare some estimators for the purpose of missing feature
imputation with :class:`sklearn.impute.IterativeImputer`:

* :class:`~sklearn.linear_model.BayesianRidge`: regularized linear regression
* :class:`~sklearn.tree.DecisionTreeRegressor`: non-linear regression
* :class:`~sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R
* :class:`~sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN
  imputation approaches

Of particular interest is the ability of
:class:`sklearn.impute.IterativeImputer` to mimic the behavior of missForest, a
popular imputation package for R. In this example, we have chosen to use
:class:`sklearn.ensemble.ExtraTreesRegressor` instead of
:class:`sklearn.ensemble.RandomForestRegressor` (as in missForest) due to its
increased speed.

Note that :class:`sklearn.neighbors.KNeighborsRegressor` is different from KNN
imputation, which learns from samples with missing values by using a distance
metric that accounts for missing values, rather than imputing them.

The goal is to compare the different estimators to see which one works best
with :class:`sklearn.impute.IterativeImputer` when a
:class:`sklearn.linear_model.BayesianRidge` estimator is used as the
downstream regressor, on the California housing dataset with a single value
randomly removed from each row.

For this particular pattern of missing values we see that
:class:`sklearn.ensemble.ExtraTreesRegressor` and
:class:`sklearn.linear_model.BayesianRidge` give the best results.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

N_SPLITS = 5

rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples, n_features = X_full.shape

# Estimate the score on the entire dataset, with no missing values
br_estimator = BayesianRidge()
score_full_data = pd.DataFrame(
    cross_val_score(
        br_estimator, X_full, y_full, scoring='neg_mean_squared_error',
        cv=N_SPLITS
    ),
    columns=['Full Data']
)

# Add a single missing value to each row
X_missing = X_full.copy()
y_missing = y_full
missing_samples = np.arange(n_samples)
missing_features = rng.choice(n_features, n_samples, replace=True)
X_missing[missing_samples, missing_features] = np.nan

# Estimate the score after imputation (mean and median strategies)
score_simple_imputer = pd.DataFrame()
for strategy in ('mean', 'median'):
    estimator = make_pipeline(
        SimpleImputer(missing_values=np.nan, strategy=strategy),
        br_estimator
    )
    score_simple_imputer[strategy] = cross_val_score(
        estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
        cv=N_SPLITS
    )

# Estimate the score after iterative imputation of the missing values
# with different estimators
estimators = [
    BayesianRidge(),
    DecisionTreeRegressor(max_features='sqrt', random_state=0),
    ExtraTreesRegressor(n_estimators=10, n_jobs=-1, random_state=0),
    KNeighborsRegressor(n_neighbors=15)
]
score_iterative_imputer = pd.DataFrame()
# use a distinct loop variable so the column label is the imputation
# estimator's class name, not 'Pipeline'
for impute_estimator in estimators:
    estimator = make_pipeline(
        IterativeImputer(random_state=0, estimator=impute_estimator),
        br_estimator
    )
    score_iterative_imputer[impute_estimator.__class__.__name__] = \
        cross_val_score(
            estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
            cv=N_SPLITS
        )

scores = pd.concat(
    [score_full_data, score_simple_imputer, score_iterative_imputer],
    keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
)

# plot california housing results
fig, ax = plt.subplots(figsize=(13, 6))
means = -scores.mean()
errors = scores.std()
means.plot.barh(xerr=errors, ax=ax)
ax.set_title('California Housing Regression with Different Imputation Methods')
ax.set_xlabel('MSE (smaller is better)')
ax.set_yticks(np.arange(means.shape[0]))
ax.set_yticklabels([" w/ ".join(label) for label in means.index.get_values()])
plt.tight_layout(pad=1)
plt.show()