[MRG] Added k-Nearest Neighbor imputation for missing data #9212
Conversation
Please merge in #9348 and make use of that implementation.
Adding tests and documentation might help you think this through at this stage.
sklearn/preprocessing/imputation.py
Outdated
@@ -247,6 +340,11 @@ def _sparse_fit(self, X, strategy, missing_values, axis):
            return most_frequent

        # KNN
unhelpful comment
sklearn/preprocessing/imputation.py
Outdated
@@ -298,6 +396,80 @@ def _dense_fit(self, X, strategy, missing_values, axis):
            return most_frequent

        # KNN
unhelpful
sklearn/preprocessing/imputation.py
Outdated
        X = X.transpose()
        mask = mask.transpose()

        #Get dimensions and missing count
PEP8 requires space between # and comment. But I also don't think this comment helps.
sklearn/preprocessing/imputation.py
Outdated
@@ -115,12 +204,16 @@ class Imputer(BaseEstimator, TransformerMixin):
        contain missing values).
    """
    def __init__(self, missing_values="NaN", strategy="mean",
                 axis=0, verbose=0, copy=True):
                 axis=0, verbose=0, copy=True, n_neighbors=10,
Please describe the new parameters in the class docstring.
We need to default to current behaviour, i.e. n_neighbors=None.
sklearn/preprocessing/imputation.py
Outdated
@@ -115,12 +204,16 @@ class Imputer(BaseEstimator, TransformerMixin):
        contain missing values).
    """
    def __init__(self, missing_values="NaN", strategy="mean",
                 axis=0, verbose=0, copy=True):
                 axis=0, verbose=0, copy=True, n_neighbors=10,
                 row_max_missing=0.5, col_max_missing=0.8):
col_max_missing isn't really part of the KNN imputation feature (except when axis=1, which is a weird case). Raising an error in general would break current behaviour. Perhaps this deserves a warning, not an error; and I think it belongs in a separate PR.
Avoiding using largely-missing rows in KNN is a sensible feature, and differs from the whole-dataset statistics behaviour. It may even be sufficient to justify a separate class for KNN.
sklearn/preprocessing/imputation.py
Outdated
        # Arg-partition (quasi-argsort) of n_neighbors and retrieve them
        nbors_index = np.argpartition(dist, n_neighbors - 1, axis=1)
        knn_row_index = nbors_index[:, :n_neighbors]
What do we do in the case that the same feature(s) is missing in some or many of the nearest neighbors? What do other implementations do? This case may be harder to handle using the nearest neighbors infrastructure than rewriting it as you do here.
As far as I can remember, the package referenced above does the following:
i) if some of the neighbors have that feature not missing then only those neighbors are used for that particular feature
ii) if none of the neighbors have a non-missing value for a particular feature, then it uses the column/feature mean
I will have a look at their source code when I start working on this.
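For illustration, here is a minimal sketch of the fallback rule described above; it is not the PR's code, and the names impute_row, dist, and n_neighbors are just illustrative. X is assumed to be a dense array with np.nan marking missing values, and dist a precomputed pairwise distance matrix.

```python
import numpy as np

def impute_row(X, dist, i, n_neighbors):
    """Impute the missing entries of row i using its nearest neighbors."""
    row = X[i].copy()
    # Candidate neighbors: the n_neighbors closest rows, excluding row i itself
    order = np.argsort(dist[i])
    order = order[order != i][:n_neighbors]
    col_means = np.nanmean(X, axis=0)  # per-feature fallback statistic
    for j in np.where(np.isnan(row))[0]:
        donor_vals = X[order, j]
        donor_vals = donor_vals[~np.isnan(donor_vals)]
        # i) average only the neighbors that observed feature j;
        # ii) if none of them did, fall back to the column mean
        row[j] = donor_vals.mean() if donor_vals.size else col_means[j]
    return row
```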
sklearn/preprocessing/imputation.py
Outdated
@@ -298,6 +396,80 @@ def _dense_fit(self, X, strategy, missing_values, axis):
            return most_frequent

        # KNN
        elif strategy == "knn":
I don't think knn is a separate strategy. You could still opt to adopt the median of the k-nearest neighbors' values. (Perhaps this makes limited sense given the masked-euclidean metric, and maybe we should use masked-manhattan (equivalently Gower distance, I think) in the strategy='median' case...?)
sklearn/preprocessing/imputation.py
Outdated
                and np.ma.allequal(masked_X, statistics):
            X = statistics.data
        else:
            X = self._dense_fit(X,
why are we fitting in transform??
sklearn/preprocessing/imputation.py
Outdated
        mask_after_knn = _get_mask(X, self.missing_values)
        if np.any(mask_after_knn):
            missing_index = np.where(mask_after_knn)
            X_col_means = masked_X.mean(axis=0).data
Should we support a distance-weighted mean, as KNNRegressor does?
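As a rough illustration of that option (the names below are hypothetical, not from the PR), a distance-weighted mean would weight each donor's value by the inverse of its distance, analogous to KNeighborsRegressor(weights='distance'):

```python
import numpy as np

def weighted_donor_mean(donor_vals, donor_dists, eps=1e-12):
    """Distance-weighted mean of donor values; closer donors count more."""
    weights = 1.0 / (donor_dists + eps)  # eps guards against zero distances
    return np.average(donor_vals, weights=weights)

# unweighted: donor_vals.mean()
# weighted:   weighted_donor_mean(donor_vals, donor_dists)
```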
@jnothman Before we dive too deep into this implementation, I was wondering what your thoughts were on implementing KNNImputer as a separate class of its own?
I had thought it belonged in the same class, particularly because
strategies are orthogonal to number of neighbors, but I don't mind it
appearing in a class of its own. FYI, I have proposed in #9463 that we
ditch the axis parameter to Imputer. So if we go with extending Imputer,
you might not need to worry about the axis=1 case.
Since you are okay with a new class, I think I will do that. I have a feeling that if features get added to knn-imputation in the future (e.g. sequential imputation) it might become too awkward for the general-purpose Imputer.
@jnothman tests are failing after your commits.
common tests are failing still...
The :class:`KNNImputer` class provides imputation for completing missing
values using the k-Nearest Neighbors approach. Each sample's missing values
are imputed using values from ``n_neighbors`` nearest neighbors found in the
training set. Note that if a sample has more than one feature missing, then
This sentence is a bit cryptic to me. Maybe say "If multiple features are missing, then the sets of neighbors used for imputation can be different" or something like that?
What was the conclusion of the discussion of matching the reference implementation vs matching the paper? Did we hear from Hastie?
Looks ok apart from the comments. Mostly agree with @jnothman, and it looks like some of the comments from December are not addressed yet.
Each missing feature is then imputed as the average, either weighted or
unweighted, of these neighbors. Where the number of donor neighbors is less
than ``n_neighbors``, the training set average for that feature is used
I find this sentence a bit cryptic. Maybe define donors somewhere?
examples/plot_missing_values.py
Outdated
    return ((full_scores.mean(), full_scores.std()),
            (zero_impute_scores.mean(), zero_impute_scores.std()),
            (mean_impute_scores.mean(), mean_impute_scores.std()))
            (mean_impute_scores.mean(), mean_impute_scores.std()),
            (mice_impute_scores.mean(), mice_impute_scores.std()),
mice_impute_scores? We don't have mice, right?
Each sample's missing values are imputed using values from ``n_neighbors``
nearest neighbors found in the training set. Each missing feature is then
imputed as the average, either weighted or unweighted, of these neighbors.
Note that if a sample has more than one feature missing, then the
This sentence is better than the sentence I disliked above, I think. So feel free to replicate it in the user guide?
imputed as the average, either weighted or unweighted, of these neighbors.
Note that if a sample has more than one feature missing, then the
neighbors for that sample can be different depending on the particular
feature being imputed. Finally, where the number of donor neighbors is
again the sentence about the number of neighbors seems unclear without explaining row_max_missing.
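For orientation, here is a minimal usage sketch of the class under discussion. The import path and parameters follow the API this work eventually converged on (sklearn.impute.KNNImputer with n_neighbors and weights); at the time of this PR the class still lived on the branch, so treat the path and defaults as assumptions rather than the final API.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is replaced by the (unweighted) mean of the two
# nearest neighbors that have a value for that feature.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print(imputer.fit_transform(X))
```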
            X = X_merged
        return X

    def fit_transform(self, X, y=None, **fit_params):
Isn't this inherited?
        # Check if % missing in any row > row_max_missing
        bad_rows = mask.sum(axis=1) > (mask.shape[1] * self.row_max_missing)
        if np.any(bad_rows):
            warnings.warn(
I would only print a message for that if there's some verbosity set. I feel like this doesn't require a warning.
        if X.shape[0] < self.n_neighbors:
            raise ValueError("There are only %d samples, but n_neighbors=%d."
                             % (X.shape[0], self.n_neighbors))
        self.fitted_X_ = X
+1
        if np.any(mask.sum(axis=0) > (X.shape[0] * self.col_max_missing)):
            raise ValueError("Some column(s) have more than {}% missing values"
                             .format(self.col_max_missing * 100))
        X_col_means = np.ma.array(X, mask=mask).mean(axis=0).data
shouldn't we compute this after the input validation?
@@ -615,6 +619,598 @@ def test_missing_indicator_sparse_param(arr_type, missing_values,
    assert isinstance(X_trans_mask, np.ndarray)


#############################################################################
these should move to the impute module, right?
        # Get distance from potential donors
        dist_pdonors = dist[receivers_row_idx][:, pdonors_row_idx]
        dist_pdonors = dist_pdonors.reshape(-1,
indeed?
IIRC you heard from Hastie that there might be something dodgy going on, but not so much as an admission of a bug... We concluded that it makes sense to match the paper, not the reference implementation. That is, we always seek the k nearest neighbors that have a value for the target feature.
Btw, I'm initially just trying to get this working at master, and I or someone else can go back and iron out all the requests above, in this PR or another. I don't know yet when I'll get to that.
Ok cool. I want to prioritize the estimator tags first. Do you wanna review the deprecation removals by any chance? ;)
We might want to add Tutz, Gerhard; Ramzan, Shahla (13 October 2014): Improved Methods for the Imputation of Missing Data by Nearest Neighbor Method to the references, and also possibly think about adding Gower similarity in the future. But maybe not delay this.
Gower's approach to missing values is, if I understand correctly, just a L1 equivalent of what we have here for L2. (There may be some scaling proposed by Gower too but that's already a problem in the proposed Gower implementation because scaling statistics need to be collected external to pairwise_distances in order for subset invariance to hold.)
What does Tutz and Ramzan (2014) add?
they use weighted distances I think.
They are not easy to rename unambiguously and succinctly :)
Reference Issue
Fixes #2989
Modifies and closes #4844
Builds upon #9348
This PR implements a k-Nearest Neighbor based missing data imputation algorithm. The algorithm is based on the one proposed in Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, "Missing value estimation methods for DNA microarrays", Bioinformatics, Vol. 17, No. 6, 2001, pages 520-525, and implemented in the R package Impute from Bioconductor.
The algorithm uses euclidean distance to find the k nearest neighbors of a data point with one or more missing coordinates. The mean coordinate value of those neighbors is then used to impute each missing coordinate. The algorithm can also handle missing values in the neighbors themselves, whether in the same coordinate or in other coordinates. While calculating the euclidean distance, it zero-weights coordinates with a missing value in either vector of the pair and up-weights the remaining coordinates.
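A minimal sketch of that distance follows; it is not the PR's implementation, and it assumes np.nan marks missing values. Coordinates missing in either vector are dropped (zero-weighted), and the remaining squared differences are scaled up by the ratio of total to usable coordinates.

```python
import numpy as np

def masked_euclidean(u, v):
    """Euclidean distance that ignores coordinates missing in either vector."""
    usable = ~np.isnan(u) & ~np.isnan(v)
    if not usable.any():
        return np.nan  # no coordinate is observed in both vectors
    sq_dist = np.sum((u[usable] - v[usable]) ** 2)
    # up-weight the remaining coordinates to compensate for the dropped ones
    return np.sqrt(sq_dist * len(u) / usable.sum())
```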
Any other comments?
As things stand, the imputation procedure runs fine if the matrices passed to fit() and transform() are the same, which is probably the vast majority of use cases. I am working on making it work when two different matrices are passed to the two methods. It should not be too much work and I am hoping to finish this soon.
Further, this implementation currently only works with dense matrices, and it would be really awesome if somebody could take up the sparse matrix case (please let us know if you do, so that there is no duplication of effort).
PS: This is my first ever submission to Scikit Learn, so please forgive me for the numerous silly and/or frustrating things you are bound to bump into while you examine the code :)
Task List
NOTE:
For those folks who would like to use KNNImputer in the meantime, I have released it as a separate package called missingpy.
That package will be kept up-to-date with any changes to this branch. (Some of the newer changes might not be carried over, but that does not affect the core imputation algorithm, which is ready for use.)