@@ -13,9 +13,22 @@ array are numerical, and that all have and hold meaning. A basic strategy to use
incomplete datasets is to discard entire rows and/or columns containing missing
values. However, this comes at the price of losing data, which may be valuable
(even though incomplete). A better strategy is to impute the missing values,
-i.e., to infer them from the known part of the data.
+i.e., to infer them from the known part of the data. See the :ref:`glossary`
+entry on imputation.


+Univariate vs. Multivariate Imputation
+======================================
+
+One type of imputation algorithm is univariate, which imputes values in the
+i-th feature dimension using only non-missing values in that feature dimension
+(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
+algorithms use the entire set of available feature dimensions to estimate the
+missing values (e.g. :class:`impute.ChainedImputer`).
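+
+For instance, a univariate imputer fills each column using only that column's
+observed values. A minimal sketch with :class:`impute.SimpleImputer` and its
+default ``strategy='mean'`` (the printed output assumes recent NumPy print
+formatting)::
+
+    >>> import numpy as np
+    >>> from sklearn.impute import SimpleImputer
+    >>> imp = SimpleImputer(strategy='mean')
+    >>> print(imp.fit_transform([[1, 2], [np.nan, 3], [7, 6]]))
+    [[1. 2.]
+     [4. 3.]
+     [7. 6.]]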
+
+
+.. _single_imputer:
+
Univariate feature imputation
=============================
@@ -74,35 +87,58 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
     ['a' 'y']
     ['b' 'y']]

-.. _mice:
+.. _chained_imputer:
+

Multivariate feature imputation
===============================
-A more sophisticated approach is to use the :class:`MICEImputer` class, which
-implements the Multivariate Imputation by Chained Equations technique. MICE
-models each feature with missing values as a function of other features, and
-uses that estimate for imputation. It does so in a round-robin fashion: at
-each step, a feature column is designated as output `y` and the other feature
-columns are treated as inputs `X`. A regressor is fit on `(X, y)` for known `y`.
-Then, the regressor is used to predict the unknown values of `y`. This is
-repeated for each feature, and then is done for a number of imputation rounds.
-Here is an example snippet::
+A more sophisticated approach is to use the :class:`ChainedImputer` class, which
+implements the imputation technique from MICE (Multivariate Imputation by
+Chained Equations). MICE models each feature with missing values as a function
+of other features, and uses that estimate for imputation. It does so in a
+round-robin fashion: at each step, a feature column is designated as output `y`
+and the other feature columns are treated as inputs `X`. A regressor is fit on
+`(X, y)` for known `y`. Then, the regressor is used to predict the unknown
+values of `y`. This is repeated for each feature in a chained fashion, and the
+whole process is run for a number of imputation rounds. Here is an example
+snippet::

    >>> import numpy as np
-    >>> from sklearn.impute import MICEImputer
-    >>> imp = MICEImputer(n_imputations=10, random_state=0)
+    >>> from sklearn.impute import ChainedImputer
+    >>> imp = ChainedImputer(n_imputations=10, random_state=0)
    >>> imp.fit([[1, 2], [np.nan, 3], [7, np.nan]])
-    MICEImputer(imputation_order='ascending', initial_strategy='mean',
-          max_value=None, min_value=None, missing_values=nan, n_burn_in=10,
-          n_imputations=10, n_nearest_features=None, predictor=None,
-          random_state=0, verbose=False)
+    ChainedImputer(imputation_order='ascending', initial_strategy='mean',
+          max_value=None, min_value=None, missing_values=nan, n_burn_in=10,
+          n_imputations=10, n_nearest_features=None, predictor=None,
+          random_state=0, verbose=False)
    >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
    >>> print(np.round(imp.transform(X_test)))
    [[ 1.  2.]
     [ 6.  4.]
     [13.  6.]]

-Both :class:`SimpleImputer` and :class:`MICEImputer` can be used in a Pipeline
+Both :class:`SimpleImputer` and :class:`ChainedImputer` can be used in a Pipeline
as a way to build a composite estimator that supports imputation.
See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
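+
+As a minimal sketch (the choice of final estimator here is arbitrary and only
+for illustration), such a composite estimator can be built with
+:func:`pipeline.make_pipeline`::
+
+    >>> from sklearn.pipeline import make_pipeline
+    >>> from sklearn.impute import SimpleImputer
+    >>> from sklearn.linear_model import LinearRegression
+    >>> pipe = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())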
+
+
+.. _multiple_imputation:
+
+Multiple vs. Single Imputation
+==============================
+
+In the statistics community, it is common practice to perform multiple
+imputations, generating, for example, 10 separate imputations for a single
+feature matrix. Each of these 10 imputations is then put through the
+subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
+classification). The 10 final analysis results (e.g. held-out validation
+error) allow the data scientist to assess the uncertainty inherent in the
+missing values. This practice is called multiple imputation. As implemented,
+the :class:`ChainedImputer` class generates a single (averaged) imputation for
+each missing value, because this is the most common use case for machine
+learning applications. However, it can also be used for multiple imputations
+by applying it repeatedly to the same dataset with different random seeds and
+with the ``n_imputations`` parameter set to 1.
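+
+For example (an illustrative sketch rather than a library doctest), 10
+separate single imputations could be drawn as follows::
+
+    >>> import numpy as np
+    >>> from sklearn.impute import ChainedImputer
+    >>> X = [[1, 2], [np.nan, 3], [7, np.nan]]
+    >>> imputations = [
+    ...     ChainedImputer(n_imputations=1, random_state=seed).fit_transform(X)
+    ...     for seed in range(10)
+    ... ]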
+
+Note that a call to the ``transform`` method of :class:`ChainedImputer` is not
+allowed to change the number of samples. Therefore multiple imputations cannot
+be achieved by a single call to ``transform``.