Imputation of missing values
============================

For various reasons, many real world datasets contain missing values, often
encoded as blanks, NaNs or other placeholders. Such datasets however are
incompatible with scikit-learn estimators which assume that all values in an
array are numerical, and that all have and hold meaning. A basic strategy to
use incomplete datasets is to discard entire rows and/or columns containing
missing values. However, this comes at the price of losing data which may be
valuable (even though incomplete). A better strategy is to impute the missing
values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.

Univariate vs. Multivariate Imputation
======================================

One type of imputation algorithm is univariate, which imputes values in the
i-th feature dimension using only non-missing values in that feature dimension
(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
algorithms use the entire set of available feature dimensions to estimate the
missing values (e.g. :class:`impute.IterativeImputer`).


.. _single_imputer:

Univariate feature imputation
=============================

The :class:`SimpleImputer` class provides basic strategies for imputing missing
values. Missing values can be imputed with a provided constant value, or using
The :class:`SimpleImputer` class also supports sparse matrices::

     [6. 3.]
     [7. 6.]]

Note that this format is not meant to be used to implicitly store missing
values in the matrix because it would densify it at transform time. Missing
values encoded by 0 must be used with dense input.
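For instance, 0-encoded missing values can be imputed from a dense array as in
the following sketch (the small array here is made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# 0 marks a missing entry; this encoding requires dense input
X = np.array([[0, 2],
              [4, 6],
              [10, 4]])
imp = SimpleImputer(missing_values=0, strategy='mean')
# column means ignore the zeros: (4 + 10) / 2 = 7 and (2 + 6 + 4) / 3 = 4
print(imp.fit_transform(X))
```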

The :class:`SimpleImputer` class also supports categorical data represented as
string values or pandas categoricals when using the ``'most_frequent'`` or
``'constant'`` strategy::

     ['a' 'y']
     ['b' 'y']]
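A self-contained sketch of ``'most_frequent'`` imputation on string data,
using a made-up pandas categorical DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up categorical data; NaN marks the missing entries
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")
imp = SimpleImputer(strategy="most_frequent")
# each column's mode fills its missing entries: "a" and "y"
print(imp.fit_transform(df))
```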

.. _iterative_imputer:


Multivariate feature imputation
===============================

A more sophisticated approach is to use the :class:`IterativeImputer` class,
which models each feature with missing values as a function of other features,
and uses that estimate for imputation. It does so in an iterated round-robin
fashion: at each step, a feature column is designated as output ``y`` and the
other feature columns are treated as inputs ``X``. A regressor is fit on
``(X, y)`` for known ``y``. Then, the regressor is used to predict the missing
values of ``y``. This is done for each feature in an iterative fashion, and
then is repeated for ``max_iter`` imputation rounds. The results of the final
imputation round are returned.

    >>> import numpy as np
    >>> from sklearn.impute import IterativeImputer
    >>> imp = IterativeImputer(max_iter=10, random_state=0)
    >>> imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])  # doctest: +NORMALIZE_WHITESPACE
    IterativeImputer(estimator=None, imputation_order='ascending',
                     initial_strategy='mean', max_iter=10, max_value=None,
                     min_value=None, missing_values=nan, n_nearest_features=None,
                     random_state=0, sample_posterior=False, tol=0.001, verbose=0)
    >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
    >>> # the model learns that the second feature is double the first
    >>> print(np.round(imp.transform(X_test)))
    [[ 1.  2.]
     [ 6. 12.]
     [ 3.  6.]]

Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
Pipeline as a way to build a composite estimator that supports imputation.
See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
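Such a composite estimator can be sketched as follows (the regressor and the
toy data are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# toy data with missing entries in X
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0], [7.0, np.nan]])
y = np.array([3.0, 5.0, 11.0, 15.0])

# the imputer fills NaNs before the regressor ever sees the data
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
model.fit(X, y)
print(model.predict([[np.nan, 8.0]]))
```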

Flexibility of IterativeImputer
-------------------------------

There are many well-established imputation packages in the R data science
ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns
out to be a particular instance of different sequential imputation algorithms
that can all be implemented with :class:`IterativeImputer` by passing in
different regressors to be used for predicting missing feature values. In the
case of missForest, this regressor is a Random Forest.
See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
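A missForest-like variant can be sketched by plugging a random forest into
:class:`IterativeImputer` (the hyperparameters below are arbitrary, and on
scikit-learn versions where the estimator is experimental the extra import is
required):

```python
import numpy as np
# required on scikit-learn versions where IterativeImputer is experimental
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# each feature's missing values are predicted by a forest fit on the others
imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=10,
    random_state=0,
)
X = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]], dtype=float)
print(imp.fit_transform(X))
```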


.. _multiple_imputation:

Multiple vs. Single Imputation
------------------------------

In the statistics community, it is common practice to perform multiple
imputations, generating, for example, ``m`` separate imputations for a single
feature matrix. Each of these ``m`` imputations is then put through the
subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
classification). The ``m`` final analysis results (e.g. held-out validation
errors) allow the data scientist to obtain understanding of how analytic
results may differ as a consequence of the inherent uncertainty caused by the
missing values. The above practice is called multiple imputation.

Our implementation of :class:`IterativeImputer` was inspired by the R MICE
package (Multivariate Imputation by Chained Equations) [1]_, but differs from
it by returning a single imputation instead of multiple imputations. However,
:class:`IterativeImputer` can also be used for multiple imputations by applying
it repeatedly to the same dataset with different random seeds when
``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
vs. single imputations.
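For example, ``m`` imputations could be drawn by re-running the imputer with
different seeds, as in this sketch (the toy data is arbitrary; the default
:class:`BayesianRidge` estimator supports ``sample_posterior=True``, and the
experimental import may be required on recent scikit-learn versions):

```python
import numpy as np
# required on scikit-learn versions where IterativeImputer is experimental
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]], dtype=float)

# draw m complete datasets, one per random seed
m = 5
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(m)
]
# downstream analysis would run once per imputed dataset; the spread of the
# m imputed values reflects uncertainty due to missingness
print(np.std([arr[3, 0] for arr in imputations]))
```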

It is still an open problem as to how useful single vs. multiple imputation is
in the context of prediction and classification when the user is not
interested in measuring uncertainty due to missing values.

Note that a call to the ``transform`` method of :class:`IterativeImputer` is
not allowed to change the number of samples. Therefore multiple imputations
cannot be achieved by a single call to ``transform``.

References
==========

.. [1] Stef van Buuren, Karin Groothuis-Oudshoorn (2011). "mice: Multivariate
    Imputation by Chained Equations in R". Journal of Statistical Software 45:
    1-67.

.. [2] Roderick J A Little and Donald B Rubin (1986). "Statistical Analysis
    with Missing Data". John Wiley & Sons, Inc., New York, NY, USA.

.. _missing_indicator: