[MRG] Added k-Nearest Neighbor imputation for missing data #9212
Conversation
Please merge in #9348 and make use of that implementation.
Adding tests and documentation might help you think this through at this stage.
sklearn/preprocessing/imputation.py
Outdated
@@ -247,6 +340,11 @@ def _sparse_fit(self, X, strategy, missing_values, axis):
            return most_frequent

        # KNN
unhelpful comment
sklearn/preprocessing/imputation.py
Outdated
@@ -298,6 +396,80 @@ def _dense_fit(self, X, strategy, missing_values, axis):
            return most_frequent

        # KNN
unhelpful
sklearn/preprocessing/imputation.py
Outdated
        X = X.transpose()
        mask = mask.transpose()

        #Get dimensions and missing count
PEP8 requires space between # and comment. But I also don't think this comment helps.
sklearn/preprocessing/imputation.py
Outdated
@@ -115,12 +204,16 @@ class Imputer(BaseEstimator, TransformerMixin):
        contain missing values).
    """
    def __init__(self, missing_values="NaN", strategy="mean",
                 axis=0, verbose=0, copy=True):
                 axis=0, verbose=0, copy=True, n_neighbors=10,
Please describe the new parameters in the class docstring.
We need to default to current behaviour, i.e. n_neighbors=None.
sklearn/preprocessing/imputation.py
Outdated
@@ -115,12 +204,16 @@ class Imputer(BaseEstimator, TransformerMixin):
        contain missing values).
    """
    def __init__(self, missing_values="NaN", strategy="mean",
                 axis=0, verbose=0, copy=True):
                 axis=0, verbose=0, copy=True, n_neighbors=10,
                 row_max_missing=0.5, col_max_missing=0.8):
col_max_missing isn't really part of the KNN imputation feature (except when axis=1, which is a weird case). Raising an error in general would break current behaviour. Perhaps this deserves a warning, not an error; and I think it belongs in a separate PR.
Avoiding using largely-missing rows in KNN is a sensible feature, and differs from the whole-dataset statistics behaviour. It may even be sufficient to justify a separate class for KNN.
sklearn/preprocessing/imputation.py
Outdated
        # Arg-partition (quasi-argsort) of n_neighbors and retrieve them
        nbors_index = np.argpartition(dist, n_neighbors - 1, axis=1)
        knn_row_index = nbors_index[:, :n_neighbors]
What do we do in the case that the same feature(s) is missing in some or many of the nearest neighbors? What do other implementations do? This case may be harder to handle using the nearest neighbors infrastructure than rewriting it as you do here.
As far as I can remember, the package referenced above does the following:
i) if some of the neighbors have that feature not missing then only those neighbors are used for that particular feature
ii) if none of the neighbors have a non-missing value for a particular feature, then it uses the column/feature mean
I will have a look at their source code when I start working on this.
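For illustration, here is a minimal sketch of the fallback rule described above; it is not the PR's code, and the names impute_row, dist, and n_neighbors are just illustrative. X is assumed to be a dense array with np.nan marking missing values, and dist a precomputed pairwise distance matrix.

```python
import numpy as np

def impute_row(X, dist, i, n_neighbors):
    """Impute the missing entries of row i using its nearest neighbors."""
    row = X[i].copy()
    # Candidate neighbors: the n_neighbors closest rows, excluding row i itself
    order = np.argsort(dist[i])
    order = order[order != i][:n_neighbors]
    col_means = np.nanmean(X, axis=0)  # per-feature fallback statistic
    for j in np.where(np.isnan(row))[0]:
        donor_vals = X[order, j]
        donor_vals = donor_vals[~np.isnan(donor_vals)]
        # i) average only the neighbors that observed feature j;
        # ii) if none of them did, fall back to the column mean
        row[j] = donor_vals.mean() if donor_vals.size else col_means[j]
    return row
```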
sklearn/preprocessing/imputation.py
Outdated
@@ -298,6 +396,80 @@ def _dense_fit(self, X, strategy, missing_values, axis):
            return most_frequent

        # KNN
        elif strategy == "knn":
I don't think knn is a separate strategy. You could still opt to adopt the median of the k-nearest neighbors' values. (Perhaps this makes limited sense given the masked-euclidean metric, and maybe we should use masked-manhattan (equivalently Gower distance, I think) in the strategy='median' case...?)
sklearn/preprocessing/imputation.py
Outdated
                and np.ma.allequal(masked_X, statistics):
            X = statistics.data
        else:
            X = self._dense_fit(X,
why are we fitting in transform??
sklearn/preprocessing/imputation.py
Outdated
        mask_after_knn = _get_mask(X, self.missing_values)
        if np.any(mask_after_knn):
            missing_index = np.where(mask_after_knn)
            X_col_means = masked_X.mean(axis=0).data
Should we support a distance-weighted mean, as KNNRegressor does?
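As a rough illustration of that option (the names below are hypothetical, not from the PR), a distance-weighted mean would weight each donor's value by the inverse of its distance, analogous to KNeighborsRegressor(weights='distance'):

```python
import numpy as np

def weighted_donor_mean(donor_vals, donor_dists, eps=1e-12):
    """Distance-weighted mean of donor values; closer donors count more."""
    weights = 1.0 / (donor_dists + eps)  # eps guards against zero distances
    return np.average(donor_vals, weights=weights)

# unweighted: donor_vals.mean()
# weighted:   weighted_donor_mean(donor_vals, donor_dists)
```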
@jnothman Before we dive too deep into this implementation, I was wondering what your thoughts were on implementing KNNImputer as a separate class of its own?
I had thought it belonged in the same class, particularly because
strategies are orthogonal to number of neighbors, but I don't mind it
appearing in a class of its own. FYI, I have proposed in #9463 that we
ditch the axis parameter to Imputer. So if we go with extending Imputer,
you might not need to worry about the axis=1 case.
Since you are okay with a new class, I think I will do that. I have a feeling that if features get added to knn-imputation in the future (e.g. sequential imputation) it might become too awkward for the general-purpose Imputer.
@jnothman tests are failing after your commits.
common tests are failing still...
The :class:`KNNImputer` class provides imputation for completing missing
values using the k-Nearest Neighbors approach. Each sample's missing values
are imputed using values from ``n_neighbors`` nearest neighbors found in the
training set. Note that if a sample has more than one feature missing, then
This sentence is a bit cryptic to me. Maybe say "If multiple features are missing, then the sets of neighbors used for imputation can be different" or something like that?
What was the conclusion of the discussion of matching the reference implementation vs matching the paper? Did we hear from Hastie?
Looks ok apart from the comments. Mostly agree with @jnothman, and it looks like some of the comments from December are not addressed yet.
Each missing feature is then imputed as the average, either weighted or
unweighted, of these neighbors. Where the number of donor neighbors is less
than ``n_neighbors``, the training set average for that feature is used
I find this sentence a bit cryptic. Maybe define donors somewhere?
examples/plot_missing_values.py
Outdated
    return ((full_scores.mean(), full_scores.std()),
            (zero_impute_scores.mean(), zero_impute_scores.std()),
            (mean_impute_scores.mean(), mean_impute_scores.std()))
            (mean_impute_scores.mean(), mean_impute_scores.std()),
            (mice_impute_scores.mean(), mice_impute_scores.std()),
mice_impute_scores? We don't have mice, right?
Each sample's missing values are imputed using values from ``n_neighbors``
nearest neighbors found in the training set. Each missing feature is then
imputed as the average, either weighted or unweighted, of these neighbors.
Note that if a sample has more than one feature missing, then the
This sentence is better than the sentence I disliked above, I think. So feel free to replicate it in the user guide?
imputed as the average, either weighted or unweighted, of these neighbors.
Note that if a sample has more than one feature missing, then the
neighbors for that sample can be different depending on the particular
feature being imputed. Finally, where the number of donor neighbors is
again the sentence about the number of neighbors seems unclear without explaining row_max_missing.
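For orientation, here is a minimal usage sketch of the class under discussion. The import path and parameters follow the API this work eventually converged on (sklearn.impute.KNNImputer with n_neighbors and weights); at the time of this PR the class still lived on the branch, so treat the path and defaults as assumptions rather than the final API.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is replaced by the (unweighted) mean of the two
# nearest neighbors that have a value for that feature.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print(imputer.fit_transform(X))
```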
            X = X_merged
        return X

    def fit_transform(self, X, y=None, **fit_params):
Isn't this inherited?
        # Check if % missing in any row > row_max_missing
        bad_rows = mask.sum(axis=1) > (mask.shape[1] * self.row_max_missing)
        if np.any(bad_rows):
            warnings.warn(
I would only print a message for that if there's some verbosity set. I feel like this doesn't require a warning.
        if X.shape[0] < self.n_neighbors:
            raise ValueError("There are only %d samples, but n_neighbors=%d."
                             % (X.shape[0], self.n_neighbors))
        self.fitted_X_ = X
+1
        if np.any(mask.sum(axis=0) > (X.shape[0] * self.col_max_missing)):
            raise ValueError("Some column(s) have more than {}% missing values"
                             .format(self.col_max_missing * 100))
        X_col_means = np.ma.array(X, mask=mask).mean(axis=0).data
shouldn't we compute this after the input validation?
@@ -615,6 +619,598 @@ def test_missing_indicator_sparse_param(arr_type, missing_values,
    assert isinstance(X_trans_mask, np.ndarray)


#############################################################################
these should move to the impute module, right?
        # Get distance from potential donors
        dist_pdonors = dist[receivers_row_idx][:, pdonors_row_idx]
        dist_pdonors = dist_pdonors.reshape(-1,
indeed?
IIRC you heard from Hastie that there might be something dodgy going on, but not so much as an admission of a bug... We concluded that it makes sense to match the paper, not the reference implementation. That is, we always seek the k nearest neighbors that have a value for the target feature.
Btw, I'm initially just trying to get this working at master, and I or someone else can go back and iron out all the requests above, in this PR or another. I don't know yet when I'll get to that.
Ok cool. I want to prioritize the estimator tags first. Do you wanna review the deprecation removals by any chance? ;)
We might want to add Tutz, Gerhard; Ramzan, Shahla (13 October 2014): Improved Methods for the Imputation of Missing Data by Nearest Neighbor Method to the references, and also possibly think about adding Gower similarity in the future. But maybe not delay this.
Gower's approach to missing values is, if I understand correctly, just a L1 equivalent of what we have here for L2. (There may be some scaling proposed by Gower too but that's already a problem in the proposed Gower implementation because scaling statistics need to be collected external to pairwise_distances in order for subset invariance to hold.)
What does Tutz and Ramzan (2014) add?
they use weighted distances I think.
They are not easy to rename unambiguously and succinctly :)
Reference Issue
Fixes #2989
Modifies and closes #4844
Builds upon #9348
This PR implements a k-Nearest Neighbor based missing data imputation algorithm. The algorithm is based on the one proposed in Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, "Missing value estimation methods for DNA microarrays", Bioinformatics, Vol. 17, No. 6, 2001, pages 520-525, and implemented in the R package Impute from Bioconductor.
The algorithm uses euclidean distance to find the k nearest neighbors of a data point with one or more missing coordinates. The mean coordinate value of those neighbors is then used to impute each missing coordinate. The algorithm can also handle missing values in the neighbors themselves, whether in the same coordinate or in other coordinates. While calculating the euclidean distance, it zero-weights coordinates with a missing value in either vector of the pair and up-weights the remaining coordinates.
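A minimal sketch of that distance follows; it is not the PR's implementation, and it assumes np.nan marks missing values. Coordinates missing in either vector are dropped (zero-weighted), and the remaining squared differences are scaled up by the ratio of total to usable coordinates.

```python
import numpy as np

def masked_euclidean(u, v):
    """Euclidean distance that ignores coordinates missing in either vector."""
    usable = ~np.isnan(u) & ~np.isnan(v)
    if not usable.any():
        return np.nan  # no coordinate is observed in both vectors
    sq_dist = np.sum((u[usable] - v[usable]) ** 2)
    # up-weight the remaining coordinates to compensate for the dropped ones
    return np.sqrt(sq_dist * len(u) / usable.sum())
```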
Any other comments?
As things stand, the imputation procedure runs fine if the matrices passed to fit() and transform() are the same, which is probably the vast majority of use cases. I am working on making it work when two different matrices are passed to the two methods. It should not be too much work and I am hoping to finish this soon.
Further, this implementation currently only works with dense matrices, and it would be really awesome if somebody could take up the sparse matrix case (please let us know if you do, so that there is no duplication of effort).
PS: This is my first ever submission to Scikit Learn, so please forgive me for the numerous silly and/or frustrating things you are bound to bump into while you examine the code :)
Task List
NOTE:
For those folks who would like to use KNNImputer in the meantime, I have released it as a separate package called missingpy.
That package will be kept up-to-date with any changes to this branch. (Some of the newer changes might not be carried over, but that does not affect the core imputation algorithm, which is ready for use.)