[MRG] Added k-Nearest Neighbor imputation for missing data #9212

Closed
wants to merge 106 commits

Conversation

@ashimb9 (Contributor) commented Jun 24, 2017

Reference Issue

Fixes #2989
Modifies and closes #4844
Builds upon #9348

This PR implements a k-Nearest Neighbor based missing-data imputation algorithm. The algorithm is based on the one proposed in Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, "Missing value estimation methods for DNA microarrays", Bioinformatics, Vol. 17, No. 6, 2001, pages 520-525, and implemented in the R package impute from Bioconductor.

This algorithm uses euclidean distance to calculate the k nearest neighbors of the data point with one or more missing coordinate(s). The mean coordinate value of the neighbors is then used to impute the missing coordinate value. The algorithm can also handle missing values in the neighbors themselves, whether in the same coordinate or in other coordinates. While calculating the euclidean distance, it zero-weights coordinates with a missing value in either vector of the pair and up-weights the remaining coordinates.
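
A minimal NumPy sketch of this masked euclidean distance (the zero-weight/up-weight scheme described above); the function name `masked_euclidean` is illustrative, not the PR's actual helper:

```python
import numpy as np

def masked_euclidean(x, y):
    """Euclidean distance over the coordinates observed in both vectors,
    rescaled by total_dims / used_dims to compensate for dropped coordinates."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    both_present = ~np.isnan(x) & ~np.isnan(y)
    n_used = both_present.sum()
    if n_used == 0:
        return np.nan  # no overlapping coordinates to compare
    sq_diff = (x[both_present] - y[both_present]) ** 2
    return np.sqrt(len(x) / n_used * sq_diff.sum())

# The NaN in the second coordinate is zero-weighted:
print(masked_euclidean([1.0, np.nan, 3.0], [2.0, 5.0, 4.0]))  # ~1.73
```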

Any other comments?

As things stand, the imputation procedure runs fine if the matrices passed to fit() and transform() are the same, which is probably the vast majority of use cases. I am working on making it work when two different matrices are passed to the two methods. It should not be too much work and I am hoping to finish this soon. Further, this implementation currently only works with dense matrices, and it would be really awesome if somebody could take up the sparse-matrix case (please let us know if you do, so that there is no duplication of effort).

PS: This is my first ever submission to Scikit Learn, so please forgive me for the numerous silly and/or frustrating things you are bound to bump into while you examine the code :)

Task List

  • Resolve issue of euclidean distance calculation with NaN values
  • Get kNN imputation working
  • Fix issue with passing separate matrices in fit() and transform()
  • Add weighted imputation
  • Add documentation
  • Add tests
  • Add examples

NOTE:

For those folks who would like to use KNNImputer in the meantime, I have released it as a separate package called missingpy. That package will be kept up-to-date with any changes to this branch. (Some of the newer changes might not be carried over but that does not affect the core imputation algorithm, which is ready for use).

@jnothman (Member):

Please merge in #9348 and make use of that implementation.

@jnothman (Member) left a comment:

Adding tests and documentation might help you think this through at this stage.

@@ -247,6 +340,11 @@ def _sparse_fit(self, X, strategy, missing_values, axis):

return most_frequent

# KNN
Member:

unhelpful comment

@@ -298,6 +396,80 @@ def _dense_fit(self, X, strategy, missing_values, axis):

return most_frequent

# KNN
Member:

unhelpful

X = X.transpose()
mask = mask.transpose()

#Get dimensions and missing count
Member:

PEP8 requires a space between # and the comment. But I also don't think this comment helps.

@@ -115,12 +204,16 @@ class Imputer(BaseEstimator, TransformerMixin):
contain missing values).
"""
def __init__(self, missing_values="NaN", strategy="mean",
axis=0, verbose=0, copy=True):
axis=0, verbose=0, copy=True, n_neighbors=10,
Member:

Please describe the new parameters in the class docstring.

Member:

We need to default to current behaviour, i.e. n_neighbors=None.

@@ -115,12 +204,16 @@ class Imputer(BaseEstimator, TransformerMixin):
contain missing values).
"""
def __init__(self, missing_values="NaN", strategy="mean",
axis=0, verbose=0, copy=True):
axis=0, verbose=0, copy=True, n_neighbors=10,
row_max_missing=0.5, col_max_missing=0.8):
Member:

col_max_missing isn't really part of the KNN imputation feature (except when axis=1, which is a weird case). Raising an error in general would break current behaviour. Perhaps this deserves a warning, not an error; and I think it belongs in a separate PR.

Avoiding the use of largely-missing rows in KNN is a sensible feature, and it differs from the whole-dataset statistics behaviour. It may even be enough to justify a separate class for KNN.


# Arg-partition (quasi-argsort) of n_neighbors and retrieve them
nbors_index = np.argpartition(dist, n_neighbors - 1, axis=1)
knn_row_index = nbors_index[:, :n_neighbors]
Member:

What do we do in the case that the same feature(s) is missing in some or many of the nearest neighbors? What do other implementations do? This case may be harder to handle using the nearest neighbors infrastructure than rewriting it as you do here.

Contributor (author):

As far as I can remember, the package referenced above does the following:
(i) if some of the neighbors have that feature non-missing, then only those neighbors are used for that particular feature;
(ii) if none of the neighbors have a non-missing value for a particular feature, then it uses the column/feature mean.
I will have a look at their source code when I start working on this.
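
A rough sketch of that per-feature fallback rule, assuming the k nearest neighbors' values for the target feature have already been gathered; the helper name `impute_one_value` is hypothetical, not taken from the PR or the R package:

```python
import numpy as np

def impute_one_value(neighbor_values, col_mean):
    """Impute one missing entry from the k nearest neighbors' values for that
    feature; fall back to the feature's training-set mean if every neighbor is
    also missing it."""
    neighbor_values = np.asarray(neighbor_values, dtype=float)
    observed = neighbor_values[~np.isnan(neighbor_values)]
    if observed.size:
        # (i) only neighbors with an observed value contribute
        return observed.mean()
    # (ii) no neighbor has this feature observed: use the column mean
    return col_mean

print(impute_one_value([2.0, np.nan, 4.0], col_mean=10.0))  # 3.0
print(impute_one_value([np.nan, np.nan], col_mean=10.0))    # 10.0
```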

@@ -298,6 +396,80 @@ def _dense_fit(self, X, strategy, missing_values, axis):

return most_frequent

# KNN
elif strategy == "knn":
Member:

I don't think knn is a separate strategy: you could still opt to adopt the median of the k-nearest neighbors' values. (Perhaps this makes limited sense given the masked-euclidean metric, and maybe we should use masked-manhattan (equivalently Gower distance, I think) in the strategy='median' case...?)
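
For illustration only, a sketch of such a masked-manhattan distance, following the same zero-weight/rescale convention as the euclidean case (this is an assumption about what was meant, not code from the PR):

```python
import numpy as np

def masked_manhattan(x, y):
    """Manhattan (L1) distance over the coordinates observed in both vectors,
    rescaled by total_dims / used_dims (Gower-style)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    both_present = ~np.isnan(x) & ~np.isnan(y)
    n_used = both_present.sum()
    if n_used == 0:
        return np.nan  # no overlapping coordinates to compare
    return len(x) / n_used * np.abs(x[both_present] - y[both_present]).sum()

print(masked_manhattan([1.0, np.nan, 3.0], [2.0, 5.0, 4.0]))  # 3.0
```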

and np.ma.allequal(masked_X, statistics):
X = statistics.data
else:
X = self._dense_fit(X,
Member:

why are we fitting in transform??

mask_after_knn = _get_mask(X, self.missing_values)
if np.any(mask_after_knn):
missing_index = np.where(mask_after_knn)
X_col_means = masked_X.mean(axis=0).data
Member:

Should we support a distance-weighted mean, as KNNRegressor does?
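
A hedged sketch of what such a distance-weighted mean could look like, mirroring KNeighborsRegressor's weights='distance' behaviour; the name `weighted_mean` is illustrative, not the PR's code:

```python
import numpy as np

def weighted_mean(values, distances):
    """Distance-weighted mean of donor values, weighting each donor by the
    inverse of its (masked euclidean) distance to the receiving sample."""
    values = np.asarray(values, dtype=float)
    distances = np.asarray(distances, dtype=float)
    if np.any(distances == 0):
        # Exact matches dominate: average only the zero-distance donors
        return values[distances == 0].mean()
    return np.average(values, weights=1.0 / distances)

print(weighted_mean([2.0, 4.0], [1.0, 3.0]))  # closer donor counts more -> 2.5
```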

@ashimb9 (Contributor, author) commented Jul 30, 2017

@jnothman Before we dive too deep into this implementation, I was wondering what your thoughts were on implementing KNNImputer as a separate class of its own?

@jnothman (Member) commented Jul 30, 2017 via email

@ashimb9 (Contributor, author) commented Jul 31, 2017

Since you are okay with a new class, I think I will do that. I have a feeling that if features get added to knn-imputation in the future (e.g., sequential imputation), it might become too awkward for the general-purpose Imputer.

@amueller (Member) commented Oct 1, 2018

@jnothman tests are failing after your commits.

@amueller (Member) commented Oct 3, 2018

common tests are failing still...

The :class:`KNNImputer` class provides imputation for completing missing
values using the k-Nearest Neighbors approach. Each sample's missing values
are imputed using values from ``n_neighbors`` nearest neighbors found in the
training set. Note that if a sample has more than one feature missing, then
Member:

This sentence is a bit cryptic to me. Maybe say "If multiple features are missing, note that the sets of neighbors used for imputation can be different" or something like that?
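
For context, a minimal usage sketch of the proposed class, assuming it is exposed as sklearn.impute.KNNImputer with an n_neighbors parameter as in this PR:

```python
import numpy as np
from sklearn.impute import KNNImputer  # assumes the class proposed here is available

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2)
# Each missing entry is replaced by the (unweighted) mean of that feature
# over the 2 nearest neighbors, measured with the masked euclidean distance.
print(imputer.fit_transform(X))
```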

@amueller (Member) commented Oct 3, 2018

What was the conclusion of the discussion about matching the reference implementation vs matching the paper? Did we hear from Hastie?

@amueller (Member) left a comment:

Looks ok apart from the comments. Mostly agree with @jnothman, and it looks like some of the comments from December are not addressed yet.


Each missing feature is then imputed as the average, either weighted or
unweighted, of these neighbors. Where the number of donor neighbors is less
than ``n_neighbors``, the training set average for that feature is used
Member:

I find this sentence a bit cryptic. Maybe define donors somewhere?

return ((full_scores.mean(), full_scores.std()),
(zero_impute_scores.mean(), zero_impute_scores.std()),
(mean_impute_scores.mean(), mean_impute_scores.std()))
(mean_impute_scores.mean(), mean_impute_scores.std()),
(mice_impute_scores.mean(), mice_impute_scores.std()),
Member:

mice_impute_scores? We don't have mice, right?

Each sample's missing values are imputed using values from ``n_neighbors``
nearest neighbors found in the training set. Each missing feature is then
imputed as the average, either weighted or unweighted, of these neighbors.
Note that if a sample has more than one feature missing, then the
Member:

This sentence is better than the sentence I disliked above, I think. So feel free to replicate it in the user guide?

imputed as the average, either weighted or unweighted, of these neighbors.
Note that if a sample has more than one feature missing, then the
neighbors for that sample can be different depending on the particular
feature being imputed. Finally, where the number of donor neighbors is
Member:

again the sentence about the number of neighbors seems unclear without explaining row_max_missing.

X = X_merged
return X

def fit_transform(self, X, y=None, **fit_params):
Member:

Isn't this inherited?

# Check if % missing in any row > row_max_missing
bad_rows = mask.sum(axis=1) > (mask.shape[1] * self.row_max_missing)
if np.any(bad_rows):
warnings.warn(
Member:

I would only print a message for that if there's some verbosity set. I feel like this doesn't require a warning.

if X.shape[0] < self.n_neighbors:
raise ValueError("There are only %d samples, but n_neighbors=%d."
% (X.shape[0], self.n_neighbors))
self.fitted_X_ = X
Member:

+1

if np.any(mask.sum(axis=0) > (X.shape[0] * self.col_max_missing)):
raise ValueError("Some column(s) have more than {}% missing values"
.format(self.col_max_missing * 100))
X_col_means = np.ma.array(X, mask=mask).mean(axis=0).data
Member:

shouldn't we compute this after the input validation?

@@ -615,6 +619,598 @@ def test_missing_indicator_sparse_param(arr_type, missing_values,
assert isinstance(X_trans_mask, np.ndarray)


#############################################################################
Member:

these should move to the impute module, right?


# Get distance from potential donors
dist_pdonors = dist[receivers_row_idx][:, pdonors_row_idx]
dist_pdonors = dist_pdonors.reshape(-1,
Member:

indeed?

@jnothman (Member) commented Oct 4, 2018

> What was the conclusion of the discussion about matching the reference implementation vs matching the paper? Did we hear from Hastie?

IIRC you heard from Hastie that there might be something dodgy going on, but not so much as an admission of a bug...

We concluded that it makes sense to match the paper, not the reference implementation. That is, we always seek the k nearest neighbors that have a value for the target feature.
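
A condensed sketch of that per-feature rule (for each missing entry, always take the k nearest candidates that actually have the target feature observed); function and argument names here are illustrative, not the PR's:

```python
import numpy as np

def knn_impute_entry(X, row, col, dist_row, n_neighbors):
    """Impute X[row, col] using the k nearest rows that have feature `col`
    observed. dist_row holds the (masked euclidean) distances from `row`
    to every row of X."""
    # Only rows with the target feature observed can donate; the receiving
    # row itself has NaN there, so it never donates to itself.
    candidates = np.flatnonzero(~np.isnan(X[:, col]))
    # Among those, keep the n_neighbors closest to the receiving row
    donors = candidates[np.argsort(dist_row[candidates])[:n_neighbors]]
    return X[donors, col].mean()
```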

@jnothman (Member) commented Oct 4, 2018

Btw, I'm initially just trying to get this working on master, and I or someone else can go back and iron out all the requests above, in this PR or another. I don't know yet when I'll get to that.

@amueller (Member) commented Oct 4, 2018

Ok cool. I want to prioritize the estimator tags first. Do you wanna review the deprecation removals by any chance? ;)

@amueller (Member):

we might want to add

Tutz, Gerhard; Ramzan, Shahla (13 October 2014): Improved Methods for the Imputation of Missing Data by Nearest Neighbor Method

to the references, and also possibly think about adding Gower similarity in the future. But maybe that shouldn't delay this.

@jnothman (Member) commented Oct 17, 2018 via email

@amueller (Member):

they use weighted distances I think.

@jnothman (Member) commented Feb 19, 2019 via email

@amueller added the "Superseded (PR has been replaced by a newer PR)" label and removed the "Stalled" and "help wanted" labels on Aug 6, 2019.