
Commit b8d1226

FEA Add IterativeImputer (#11977)
Authored by Sergey Feldman
1 parent 64f5630, commit b8d1226

File tree: 9 files changed, +1346 -45 lines changed


doc/modules/classes.rst

Lines changed: 2 additions & 1 deletion
@@ -655,8 +655,9 @@ Kernels:
    :template: class.rst
 
    impute.SimpleImputer
+   impute.IterativeImputer
    impute.MissingIndicator
-
+
 .. _kernel_approximation_ref:
 
 :mod:`sklearn.kernel_approximation` Kernel Approximation

doc/modules/impute.rst

Lines changed: 110 additions & 11 deletions
@@ -9,12 +9,28 @@ Imputation of missing values
 For various reasons, many real world datasets contain missing values, often
 encoded as blanks, NaNs or other placeholders. Such datasets however are
 incompatible with scikit-learn estimators which assume that all values in an
-array are numerical, and that all have and hold meaning. A basic strategy to use
-incomplete datasets is to discard entire rows and/or columns containing missing
-values. However, this comes at the price of losing data which may be valuable
-(even though incomplete). A better strategy is to impute the missing values,
-i.e., to infer them from the known part of the data. See the :ref:`glossary`
-entry on imputation.
+array are numerical, and that all have and hold meaning. A basic strategy to
+use incomplete datasets is to discard entire rows and/or columns containing
+missing values. However, this comes at the price of losing data which may be
+valuable (even though incomplete). A better strategy is to impute the missing
+values, i.e., to infer them from the known part of the data. See the
+:ref:`glossary` entry on imputation.
+
+
+Univariate vs. Multivariate Imputation
+======================================
+
+One type of imputation algorithm is univariate, which imputes values in the
+i-th feature dimension using only non-missing values in that feature dimension
+(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
+algorithms use the entire set of available feature dimensions to estimate the
+missing values (e.g. :class:`impute.IterativeImputer`).
+
+
+.. _single_imputer:
+
+Univariate feature imputation
+=============================
 
 The :class:`SimpleImputer` class provides basic strategies for imputing missing
 values. Missing values can be imputed with a provided constant value, or using
@@ -50,9 +66,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
     [6. 3.]
     [7. 6.]]
 
-Note that this format is not meant to be used to implicitly store missing values
-in the matrix because it would densify it at transform time. Missing values encoded
-by 0 must be used with dense input.
+Note that this format is not meant to be used to implicitly store missing
+values in the matrix because it would densify it at transform time. Missing
+values encoded by 0 must be used with dense input.
 
 The :class:`SimpleImputer` class also supports categorical data represented as
 string values or pandas categoricals when using the ``'most_frequent'`` or
@@ -71,9 +87,92 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
     ['a' 'y']
     ['b' 'y']]
 
+.. _iterative_imputer:
+
+
+Multivariate feature imputation
+===============================
+
+A more sophisticated approach is to use the :class:`IterativeImputer` class,
+which models each feature with missing values as a function of other features,
+and uses that estimate for imputation. It does so in an iterated round-robin
+fashion: at each step, a feature column is designated as output ``y`` and the
+other feature columns are treated as inputs ``X``. A regressor is fit on ``(X,
+y)`` for known ``y``. Then, the regressor is used to predict the missing values
+of ``y``. This is done for each feature in an iterative fashion, and then is
+repeated for ``max_iter`` imputation rounds. The results of the final
+imputation round are returned.
+
+    >>> import numpy as np
+    >>> from sklearn.impute import IterativeImputer
+    >>> imp = IterativeImputer(max_iter=10, random_state=0)
+    >>> imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])  # doctest: +NORMALIZE_WHITESPACE
+    IterativeImputer(estimator=None, imputation_order='ascending',
+                     initial_strategy='mean', max_iter=10, max_value=None,
+                     min_value=None, missing_values=nan, n_nearest_features=None,
+                     random_state=0, sample_posterior=False, tol=0.001, verbose=0)
+    >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
+    >>> # the model learns that the second feature is double the first
+    >>> print(np.round(imp.transform(X_test)))
+    [[ 1.  2.]
+     [ 6. 12.]
+     [ 3.  6.]]
+
+Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
+Pipeline as a way to build a composite estimator that supports imputation.
+See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
+
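A minimal sketch of the Pipeline usage described just above; it is not taken from this commit, and the toy data, the downstream BayesianRidge estimator, and the experimental enable import (needed in released scikit-learn versions) are assumptions:

# Sketch only: chain an imputer with a downstream estimator so that imputation
# happens inside each cross-validation fold. Toy data and the estimator choice
# are illustrative, not part of the commit.
import numpy as np
# In released scikit-learn (>= 0.21) IterativeImputer is experimental and
# needs this enable import first (assumption about the reader's version).
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X.dot(rng.randn(5)) + 0.1 * rng.randn(100)
X[rng.rand(100, 5) < 0.2] = np.nan  # knock out roughly 20% of the entries

pipe = make_pipeline(IterativeImputer(random_state=0), BayesianRidge())
print(cross_val_score(pipe, X, y, cv=5).mean())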
+Flexibility of IterativeImputer
+-------------------------------
+
+There are many well-established imputation packages in the R data science
+ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns
+out to be a particular instance of different sequential imputation algorithms
+that can all be implemented with :class:`IterativeImputer` by passing in
+different regressors to be used for predicting missing feature values. In the
+case of missForest, this regressor is a Random Forest.
+See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
+
+
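A minimal sketch of the missForest-style usage noted in the Flexibility section above; the toy matrix and forest settings are assumptions, not part of the commit:

# Sketch only: swap the default regressor for a random forest to approximate
# missForest. Data and n_estimators are illustrative choices.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa (needed in released versions)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 4.0, 9.0],
              [5.0, 10.0, 15.0]])

forest_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=10, random_state=0)
print(forest_imputer.fit_transform(X))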
+.. _multiple_imputation:
+
+Multiple vs. Single Imputation
+------------------------------
+
+In the statistics community, it is common practice to perform multiple
+imputations, generating, for example, ``m`` separate imputations for a single
+feature matrix. Each of these ``m`` imputations is then put through the
+subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
+classification). The ``m`` final analysis results (e.g. held-out validation
+errors) allow the data scientist to obtain understanding of how analytic
+results may differ as a consequence of the inherent uncertainty caused by the
+missing values. The above practice is called multiple imputation.
+
+Our implementation of :class:`IterativeImputer` was inspired by the R MICE
+package (Multivariate Imputation by Chained Equations) [1]_, but differs from
+it by returning a single imputation instead of multiple imputations. However,
+:class:`IterativeImputer` can also be used for multiple imputations by applying
+it repeatedly to the same dataset with different random seeds when
+``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
+vs. single imputations.
+
+It is still an open problem as to how useful single vs. multiple imputation is
+in the context of prediction and classification when the user is not
+interested in measuring uncertainty due to missing values.
+
+Note that a call to the ``transform`` method of :class:`IterativeImputer` is
+not allowed to change the number of samples. Therefore multiple imputations
+cannot be achieved by a single call to ``transform``.
+
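A minimal sketch of the multiple-imputation recipe described above: fit with ``sample_posterior=True`` under several random seeds and keep one completed dataset per seed. The toy matrix, the number of imputations, and the enable import are assumptions, not part of the commit:

# Sketch only: draw m imputations by refitting with different random seeds.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa (needed in released versions)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0], [np.nan, 3.0], [7.0, np.nan]])

m = 5
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(m)
]
# Each element of `imputations` is one complete dataset; the downstream
# analysis would be run once per element and the m results pooled.
print(np.round(np.mean(imputations, axis=0), 2))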
+References
+==========
+
+.. [1] Stef van Buuren, Karin Groothuis-Oudshoorn (2011). "mice: Multivariate
+    Imputation by Chained Equations in R". Journal of Statistical Software 45:
+    1-67.
 
-:class:`SimpleImputer` can be used in a Pipeline as a way to build a composite
-estimator that supports imputation. See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
+.. [2] Roderick J A Little and Donald B Rubin (1986). "Statistical Analysis
+    with Missing Data". John Wiley & Sons, Inc., New York, NY, USA.
 
 .. _missing_indicator:
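A minimal sketch contrasting the two strategies defined in the "Univariate vs. Multivariate Imputation" section of the diff above: :class:`SimpleImputer` fills each missing entry from its own column only, while :class:`IterativeImputer` also uses the other columns. The toy matrix and the enable import are assumptions, not from the commit:

# Sketch only: the same missing entries filled by a univariate and a
# multivariate imputer.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa (needed in released versions)
from sklearn.impute import SimpleImputer, IterativeImputer

X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0], [np.nan, 3.0], [7.0, np.nan]])

# Univariate: each column is imputed with its own mean, ignoring the other column.
print(SimpleImputer(strategy='mean').fit_transform(X))
# Multivariate: the relationship between the two columns is used instead.
print(np.round(IterativeImputer(random_state=0).fit_transform(X)))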

doc/whats_new/v0.21.rst

Lines changed: 9 additions & 0 deletions
@@ -125,6 +125,15 @@ Support for Python 3.4 and below has been officially dropped.
 - |API| Deprecated :mod:`externals.six` since we have dropped support for
   Python 2.7. :issue:`12916` by :user:`Hanmin Qin <qinhanmin2014>`.
 
+:mod:`sklearn.impute`
+.....................
+
+- |MajorFeature| Added :class:`impute.IterativeImputer`, which is a strategy
+  for imputing missing values by modeling each feature with missing values as a
+  function of other features in a round-robin fashion. :issue:`8478` and
+  :issue:`12177` by :user:`Sergey Feldman <sergeyf>` and :user:`Ben Lawson
+  <benlawson>`.
+
 :mod:`sklearn.linear_model`
 ...........................

examples/impute/README.txt

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+.. _impute_examples:
+
+Missing Value Imputation
+------------------------
+
+Examples concerning the :mod:`sklearn.impute` module.
Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
"""
=========================================================
Imputing missing values with variants of IterativeImputer
=========================================================

The :class:`sklearn.impute.IterativeImputer` class is very flexible - it can be
used with a variety of estimators to do round-robin regression, treating every
variable as an output in turn.

In this example we compare some estimators for the purpose of missing feature
imputation with :class:`sklearn.impute.IterativeImputer`::

    :class:`~sklearn.linear_model.BayesianRidge`: regularized linear regression
    :class:`~sklearn.tree.DecisionTreeRegressor`: non-linear regression
    :class:`~sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R
    :class:`~sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN
    imputation approaches

Of particular interest is the ability of
:class:`sklearn.impute.IterativeImputer` to mimic the behavior of missForest, a
popular imputation package for R. In this example, we have chosen to use
:class:`sklearn.ensemble.ExtraTreesRegressor` instead of
:class:`sklearn.ensemble.RandomForestRegressor` (as in missForest) due to its
increased speed.

Note that :class:`sklearn.neighbors.KNeighborsRegressor` is different from KNN
imputation, which learns from samples with missing values by using a distance
metric that accounts for missing values, rather than imputing them.

The goal is to compare different estimators to see which one is best for the
:class:`sklearn.impute.IterativeImputer` when using a
:class:`sklearn.linear_model.BayesianRidge` estimator on the California housing
dataset with a single value randomly removed from each row.

For this particular pattern of missing values we see that
:class:`sklearn.ensemble.ExtraTreesRegressor` and
:class:`sklearn.linear_model.BayesianRidge` give the best results.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

N_SPLITS = 5

rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples, n_features = X_full.shape

# Estimate the score on the entire dataset, with no missing values
br_estimator = BayesianRidge()
score_full_data = pd.DataFrame(
    cross_val_score(
        br_estimator, X_full, y_full, scoring='neg_mean_squared_error',
        cv=N_SPLITS
    ),
    columns=['Full Data']
)

# Add a single missing value to each row
X_missing = X_full.copy()
y_missing = y_full
missing_samples = np.arange(n_samples)
missing_features = rng.choice(n_features, n_samples, replace=True)
X_missing[missing_samples, missing_features] = np.nan

# Estimate the score after imputation (mean and median strategies)
score_simple_imputer = pd.DataFrame()
for strategy in ('mean', 'median'):
    estimator = make_pipeline(
        SimpleImputer(missing_values=np.nan, strategy=strategy),
        br_estimator
    )
    score_simple_imputer[strategy] = cross_val_score(
        estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
        cv=N_SPLITS
    )

# Estimate the score after iterative imputation of the missing values
# with different estimators
estimators = [
    BayesianRidge(),
    DecisionTreeRegressor(max_features='sqrt', random_state=0),
    ExtraTreesRegressor(n_estimators=10, n_jobs=-1, random_state=0),
    KNeighborsRegressor(n_neighbors=15)
]
score_iterative_imputer = pd.DataFrame()
for impute_estimator in estimators:
    estimator = make_pipeline(
        IterativeImputer(random_state=0, estimator=impute_estimator),
        br_estimator
    )
    score_iterative_imputer[impute_estimator.__class__.__name__] = \
        cross_val_score(
            estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
            cv=N_SPLITS
        )

scores = pd.concat(
    [score_full_data, score_simple_imputer, score_iterative_imputer],
    keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
)

# Plot the California Housing results
fig, ax = plt.subplots(figsize=(13, 6))
means = -scores.mean()
errors = scores.std()
means.plot.barh(xerr=errors, ax=ax)
ax.set_title('California Housing Regression with Different Imputation Methods')
ax.set_xlabel('MSE (smaller is better)')
ax.set_yticks(np.arange(means.shape[0]))
ax.set_yticklabels([" w/ ".join(label) for label in means.index.get_values()])
plt.tight_layout(pad=1)
plt.show()
