Commit 93382cc

sergeyf authored and ogrisel committed
Rename MICEImputer to ChainedImputer (#11314)
1 parent 5b29ae6 commit 93382cc

File tree: 8 files changed (+158, -113 lines)

doc/glossary.rst

Lines changed: 8 additions & 1 deletion
@@ -435,6 +435,13 @@ General Concepts
     hyper-parameter
         See :term:`parameter`.

+    impute
+    imputation
+        Most machine learning algorithms require that their inputs have no
+        :term:`missing values`, and will not work if this requirement is
+        violated. Algorithms that attempt to fill in (or impute) missing values
+        are referred to as imputation algorithms.
+
     indexable
         An :term:`array-like`, :term:`sparse matrix`, pandas DataFrame or
         sequence (usually a list).

@@ -486,7 +493,7 @@ General Concepts
         do (e.g. in :class:`impute.SimpleImputer`), NaN is the preferred
         representation of missing values in float arrays. If the array has
         integer dtype, NaN cannot be represented. For this reason, we support
-        specifying another ``missing_values`` value when imputation or
+        specifying another ``missing_values`` value when :term:`imputation` or
         learning can be performed in integer space. :term:`Unlabeled data`
         is a special case of missing values in the :term:`target`.

doc/modules/classes.rst

Lines changed: 1 addition & 1 deletion
@@ -653,7 +653,7 @@ Kernels:
    :template: class.rst

    impute.SimpleImputer
-   impute.MICEImputer
+   impute.ChainedImputer

 .. _kernel_approximation_ref:

doc/modules/impute.rst

Lines changed: 54 additions & 18 deletions
@@ -13,9 +13,22 @@ array are numerical, and that all have and hold meaning. A basic strategy to use
 incomplete datasets is to discard entire rows and/or columns containing missing
 values. However, this comes at the price of losing data which may be valuable
 (even though incomplete). A better strategy is to impute the missing values,
-i.e., to infer them from the known part of the data.
+i.e., to infer them from the known part of the data. See the :ref:`glossary`
+entry on imputation.


+Univariate vs. Multivariate Imputation
+======================================
+
+One type of imputation algorithm is univariate, which imputes values in the i-th
+feature dimension using only non-missing values in that feature dimension
+(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
+algorithms use the entire set of available feature dimensions to estimate the
+missing values (e.g. :class:`impute.ChainedImputer`).
+
+
+.. _single_imputer:
+
 Univariate feature imputation
 =============================

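The hunk above introduces the univariate vs. multivariate distinction. Below is a minimal sketch of the difference, using only the two classes named there and assuming the 0.20-dev API shown elsewhere in this commit:

    import numpy as np
    from sklearn.impute import SimpleImputer, ChainedImputer

    X = [[1, 2], [np.nan, 3], [7, np.nan]]

    # Univariate: each column is completed from that column alone
    # (here, with its per-column mean).
    print(SimpleImputer(strategy='mean').fit_transform(X))

    # Multivariate: each column with missing entries is modeled as a
    # function of the other columns, in a round-robin fashion.
    print(ChainedImputer(n_imputations=10, random_state=0).fit_transform(X))
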
@@ -74,35 +87,58 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
     ['a' 'y']
     ['b' 'y']]

-.. _mice:
+.. _chained_imputer:
+

 Multivariate feature imputation
 ===============================

-A more sophisticated approach is to use the :class:`MICEImputer` class, which
-implements the Multivariate Imputation by Chained Equations technique. MICE
-models each feature with missing values as a function of other features, and
-uses that estimate for imputation. It does so in a round-robin fashion: at
-each step, a feature column is designated as output `y` and the other feature
-columns are treated as inputs `X`. A regressor is fit on `(X, y)` for known `y`.
-Then, the regressor is used to predict the unknown values of `y`. This is
-repeated for each feature, and then is done for a number of imputation rounds.
-Here is an example snippet::
+A more sophisticated approach is to use the :class:`ChainedImputer` class, which
+implements the imputation technique from MICE (Multivariate Imputation by
+Chained Equations). MICE models each feature with missing values as a function of
+other features, and uses that estimate for imputation. It does so in a round-robin
+fashion: at each step, a feature column is designated as output `y` and the other
+feature columns are treated as inputs `X`. A regressor is fit on `(X, y)` for known `y`.
+Then, the regressor is used to predict the unknown values of `y`. This is repeated
+for each feature in a chained fashion, and then is done for a number of imputation
+rounds. Here is an example snippet::

     >>> import numpy as np
-    >>> from sklearn.impute import MICEImputer
-    >>> imp = MICEImputer(n_imputations=10, random_state=0)
+    >>> from sklearn.impute import ChainedImputer
+    >>> imp = ChainedImputer(n_imputations=10, random_state=0)
     >>> imp.fit([[1, 2], [np.nan, 3], [7, np.nan]])
-    MICEImputer(imputation_order='ascending', initial_strategy='mean',
-          max_value=None, min_value=None, missing_values=nan, n_burn_in=10,
-          n_imputations=10, n_nearest_features=None, predictor=None,
-          random_state=0, verbose=False)
+    ChainedImputer(imputation_order='ascending', initial_strategy='mean',
+          max_value=None, min_value=None, missing_values=nan, n_burn_in=10,
+          n_imputations=10, n_nearest_features=None, predictor=None,
+          random_state=0, verbose=False)
     >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
     >>> print(np.round(imp.transform(X_test)))
     [[ 1.  2.]
      [ 6.  4.]
      [13.  6.]]

-Both :class:`SimpleImputer` and :class:`MICEImputer` can be used in a Pipeline
+Both :class:`SimpleImputer` and :class:`ChainedImputer` can be used in a Pipeline
 as a way to build a composite estimator that supports imputation.
 See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
+
+
+.. _multiple_imputation:
+
+Multiple vs. Single Imputation
+==============================
+
+In the statistics community, it is common practice to perform multiple imputations,
+generating, for example, 10 separate imputations for a single feature matrix.
+Each of these 10 imputations is then put through the subsequent analysis pipeline
+(e.g. feature engineering, clustering, regression, classification). The 10 final
+analysis results (e.g. held-out validation error) allow the data scientist to
+obtain understanding of the uncertainty inherent in the missing values. The above
+practice is called multiple imputation. As implemented, the :class:`ChainedImputer`
+class generates a single (averaged) imputation for each missing value because this
+is the most common use case for machine learning applications. However, it can also be used
+for multiple imputations by applying it repeatedly to the same dataset with different
+random seeds with the ``n_imputations`` parameter set to 1.
+
+Note that a call to the ``transform`` method of :class:`ChainedImputer` is not
+allowed to change the number of samples. Therefore multiple imputations cannot be
+achieved by a single call to ``transform``.

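The new "Multiple vs. Single Imputation" section above spells out a recipe: re-run the imputer on the same data with a different random seed each time and ``n_imputations`` set to 1. A minimal sketch of that recipe, assuming only the ``ChainedImputer`` parameters shown in this commit:

    import numpy as np
    from sklearn.impute import ChainedImputer

    X = [[1, 2], [np.nan, 3], [7, np.nan]]

    # One completed copy of X per seed. Each copy would then be run through
    # the full downstream analysis pipeline; the spread of the final results
    # reflects the uncertainty inherent in the missing values.
    imputations = [
        ChainedImputer(n_imputations=1, random_state=seed).fit_transform(X)
        for seed in range(10)
    ]
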
doc/whats_new/v0.20.rst

Lines changed: 1 addition & 1 deletion
@@ -123,7 +123,7 @@ Preprocessing
   back to the original space via an inverse transform. :issue:`9041` by
   `Andreas Müller`_ and :user:`Guillaume Lemaitre <glemaitre>`.

-- Added :class:`impute.MICEImputer`, which is a strategy for imputing missing
+- Added :class:`impute.ChainedImputer`, which is a strategy for imputing missing
   values by modeling each feature with missing values as a function of
   other features in a round-robin fashion. :issue:`8478` by
   :user:`Sergey Feldman <sergeyf>`.

examples/plot_missing_values.py

Lines changed: 13 additions & 12 deletions
@@ -8,10 +8,11 @@
 The median is a more robust estimator for data with high magnitude variables
 which could dominate results (otherwise known as a 'long tail').

-Another option is the MICE imputer. This uses round-robin linear regression,
-treating every variable as an output in turn. The version implemented assumes
-Gaussian (output) variables. If your features are obviously non-Normal,
-consider transforming them to look more Normal so as to improve performance.
+Another option is the ``ChainedImputer``. This uses round-robin linear
+regression, treating every variable as an output in turn. The version
+implemented assumes Gaussian (output) variables. If your features are obviously
+non-Normal, consider transforming them to look more Normal so as to improve
+performance.
 """

 import numpy as np

@@ -21,7 +22,7 @@
 from sklearn.datasets import load_boston
 from sklearn.ensemble import RandomForestRegressor
 from sklearn.pipeline import Pipeline
-from sklearn.impute import SimpleImputer, MICEImputer
+from sklearn.impute import SimpleImputer, ChainedImputer
 from sklearn.model_selection import cross_val_score

 rng = np.random.RandomState(0)

@@ -66,18 +67,18 @@ def get_results(dataset):
     mean_impute_scores = cross_val_score(estimator, X_missing, y_missing,
                                          scoring='neg_mean_squared_error')

-    # Estimate the score after imputation (MICE strategy) of the missing values
-    estimator = Pipeline([("imputer", MICEImputer(missing_values=0,
-                                                  random_state=0)),
+    # Estimate the score after chained imputation of the missing values
+    estimator = Pipeline([("imputer", ChainedImputer(missing_values=0,
+                                                     random_state=0)),
                           ("forest", RandomForestRegressor(random_state=0,
                                                            n_estimators=100))])
-    mice_impute_scores = cross_val_score(estimator, X_missing, y_missing,
-                                         scoring='neg_mean_squared_error')
+    chained_impute_scores = cross_val_score(estimator, X_missing, y_missing,
+                                            scoring='neg_mean_squared_error')

     return ((full_scores.mean(), full_scores.std()),
             (zero_impute_scores.mean(), zero_impute_scores.std()),
             (mean_impute_scores.mean(), mean_impute_scores.std()),
-            (mice_impute_scores.mean(), mice_impute_scores.std()))
+            (chained_impute_scores.mean(), chained_impute_scores.std()))


 results_diabetes = np.array(get_results(load_diabetes()))

@@ -94,7 +95,7 @@ def get_results(dataset):
 x_labels = ['Full data',
             'Zero imputation',
             'Mean Imputation',
-            'MICE Imputation']
+            'Chained Imputation']
 colors = ['r', 'g', 'b', 'orange']

 # plot diabetes results

sklearn/impute.py

Lines changed: 21 additions & 19 deletions
@@ -30,13 +30,13 @@
 zip = six.moves.zip
 map = six.moves.map

-MICETriplet = namedtuple('MICETriplet', ['feat_idx',
-                                         'neighbor_feat_idx',
-                                         'predictor'])
+ImputerTriplet = namedtuple('ImputerTriplet', ['feat_idx',
+                                               'neighbor_feat_idx',
+                                               'predictor'])

 __all__ = [
     'SimpleImputer',
-    'MICEImputer',
+    'ChainedImputer',
 ]

@@ -423,12 +423,12 @@ def transform(self, X):
         return X


-class MICEImputer(BaseEstimator, TransformerMixin):
-    """MICE transformer to impute missing values.
+class ChainedImputer(BaseEstimator, TransformerMixin):
+    """Chained imputer transformer to impute missing values.

-    Basic implementation of MICE (Multivariate Imputations by Chained
-    Equations) package from R. This version assumes all of the features are
-    Gaussian.
+    Basic implementation of chained imputer from MICE (Multivariate
+    Imputations by Chained Equations) package from R. This version assumes all
+    of the features are Gaussian.

     Read more in the :ref:`User Guide <mice>`.

@@ -453,11 +453,11 @@ class MICEImputer(BaseEstimator, TransformerMixin):
         A random order for each round.

     n_imputations : int, optional (default=100)
-        Number of MICE rounds to perform, the results of which will be
-        used in the final average.
+        Number of chained imputation rounds to perform, the results of which
+        will be used in the final average.

     n_burn_in : int, optional (default=10)
-        Number of initial MICE rounds to perform the results of which
+        Number of initial imputation rounds to perform the results of which
         will not be returned.

     predictor : estimator object, default=BayesianRidge()

@@ -858,7 +858,8 @@ def fit_transform(self, X, y=None):
         Xt = np.zeros((n_samples, n_features), dtype=X.dtype)
         self.imputation_sequence_ = []
         if self.verbose > 0:
-            print("[MICE] Completing matrix with shape %s" % (X.shape,))
+            print("[ChainedImputer] Completing matrix with shape %s"
+                  % (X.shape,))
         start_t = time()
         for i_rnd in range(n_rounds):
             if self.imputation_order == 'random':

@@ -871,15 +872,15 @@ def fit_transform(self, X, y=None):
                 X_filled, predictor = self._impute_one_feature(
                     X_filled, mask_missing_values, feat_idx, neighbor_feat_idx,
                     predictor=None, fit_mode=True)
-                predictor_triplet = MICETriplet(feat_idx,
-                                                neighbor_feat_idx,
-                                                predictor)
+                predictor_triplet = ImputerTriplet(feat_idx,
+                                                   neighbor_feat_idx,
+                                                   predictor)
                 self.imputation_sequence_.append(predictor_triplet)

             if i_rnd >= self.n_burn_in:
                 Xt += X_filled
             if self.verbose > 0:
-                print('[MICE] Ending imputation round '
+                print('[ChainedImputer] Ending imputation round '
                       '%d/%d, elapsed time %0.2f'
                       % (i_rnd + 1, n_rounds, time() - start_t))

@@ -921,7 +922,8 @@ def transform(self, X):
         i_rnd = 0
         Xt = np.zeros(X.shape, dtype=X.dtype)
         if self.verbose > 0:
-            print("[MICE] Completing matrix with shape %s" % (X.shape,))
+            print("[ChainedImputer] Completing matrix with shape %s"
+                  % (X.shape,))
         start_t = time()
         for it, predictor_triplet in enumerate(self.imputation_sequence_):
             X_filled, _ = self._impute_one_feature(

@@ -936,7 +938,7 @@ def transform(self, X):
             if i_rnd >= self.n_burn_in:
                 Xt += X_filled
             if self.verbose > 1:
-                print('[MICE] Ending imputation round '
+                print('[ChainedImputer] Ending imputation round '
                       '%d/%d, elapsed time %0.2f'
                       % (i_rnd + 1, n_rounds, time() - start_t))
             i_rnd += 1

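A note for readers tracing the renamed internals above: after fitting, the estimator's ``imputation_sequence_`` list holds one ``ImputerTriplet`` per (round, feature) step, recording which feature was imputed, which neighbor features served as inputs, and the fitted predictor. A small sketch of inspecting it, assuming only the attributes visible in this diff:

    import numpy as np
    from sklearn.impute import ChainedImputer

    imp = ChainedImputer(n_imputations=2, random_state=0)
    imp.fit([[1, 2], [np.nan, 3], [7, np.nan]])

    # One triplet per (round, feature) step: the imputed feature's index,
    # the indices of the neighbor features used as inputs, and the fitted
    # predictor for that step.
    for triplet in imp.imputation_sequence_:
        print(triplet.feat_idx, triplet.neighbor_feat_idx, triplet.predictor)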