
[MRG] Adds KNNImputer #12852


Merged: 269 commits merged into scikit-learn:master on Sep 3, 2019

Conversation

@thomasjpfan (Member) commented on Dec 21, 2018:

Reference Issues/PRs

Continues #9348 and is part of #9212.

Closes #2989
Closes #9348
Closes #9212

What does this implement/fix? Explain your changes.

This PR cleans up the work done on #9348 and #9212:

  1. Adds more tests.
  2. Cleans up warnings/errors.
  3. Currently, the distance between two completely-nan vectors is nan, which can make entries on the diagonal nan (see the sketch below). This behavior can be changed depending on our preferences.
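
For concreteness, a minimal sketch of item 3 using the `nan_euclidean_distances` helper this work introduces (behavior as released in scikit-learn 0.22):

import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# The first row is entirely missing, so it shares no observed coordinate
# with any other row (or with itself).
X = np.array([[np.nan, np.nan],
              [3.0, 4.0],
              [6.0, 8.0]])

# The all-nan row yields nan everywhere, including its own diagonal entry.
print(nan_euclidean_distances(X))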

Edit: KNNImputer has been added to this PR with the following updates:

  1. Addresses comments in [MRG] Added k-Nearest Neighbor imputation for missing data #9212.
  2. Reduces the number of variables.
  3. Adds memory optimizations.
  4. Raises errors as soon as possible, before any computation (see the sketch below).
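
A minimal sketch of that fail-fast behavior (the exact error message varies across versions):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan]])

# An invalid `weights` value is rejected during validation, before any
# distance matrix is computed.
try:
    KNNImputer(weights="not-a-weighting").fit(X)
except ValueError as exc:
    print(exc)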

@jnothman (Member) left a comment:

This is another comment I had pending. Not sure when I'll give it another full pass.

@jnothman (Member) left a comment:

@amueller, this one doesn't store any attributes ending in _.

Sorry, still not really able to give this a full review.

        return X[:, valid_idx]

    def _more_tags(self):
        return {'allow_nan': True}
Member:

Suggested change:
- return {'allow_nan': True}
+ return {'allow_nan': is_scalar_nan(self.missing_values)}
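
The point of the suggestion: `allow_nan` should only be advertised when NaN really is the missing-value marker. A rough re-implementation of what `is_scalar_nan` decides (illustrative, not the library's exact code):

import math
import numbers

def is_scalar_nan(x):
    # True only for a real NaN scalar; False for markers like -1 or "missing".
    return isinstance(x, numbers.Real) and math.isnan(x)

print(is_scalar_nan(float("nan")))  # True  -> {'allow_nan': True}
print(is_scalar_nan(-1))            # False -> {'allow_nan': False}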


Parameters
----------
dist_pot_donors : array-like, shape=(n_receivers, n_train_samples)
Member:

Should reformat according to the recent edicts on these matters.
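
Presumably this refers to the docstring shape convention scikit-learn adopted around that time; a hedged sketch of the expected rewrite (the description line is illustrative):

dist_pot_donors : ndarray of shape (n_receivers, n_train_samples)
    Distance matrix between the rows to impute and the potential donors.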

@thomasjpfan (Member, Author):

@jnothman check_is_fitted checks for _ in the beginning as well:

if (v.endswith("_") or v.startswith("_"))
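
A quick sketch of that auto-detection, using `check_is_fitted` with no explicit attributes (supported from scikit-learn 0.22 on):

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

class Toy(BaseEstimator):
    def fit(self, X):
        # Fitted state stored with a leading underscore only, no trailing one.
        self._mask_fit_X = np.asarray(X)
        return self

try:
    check_is_fitted(Toy())            # no "_"-attributes yet
except NotFittedError:
    print("not fitted")

check_is_fitted(Toy().fit([[0.0]]))   # passes: leading "_" counts as fitted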

@amueller (Member):

tests are failing...

@thomasjpfan (Member, Author):

The test failure was unrelated; now that #14689 is merged, merging master into this branch should fix it.

least one neighbor with a defined distance, the weighted or unweighted average
of the remaining neighbors will be used during imputation. If a feature is
always missing in training, it is removed during `transform`. For more
information on the methodology, see ref. [OL2001]_.
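
The always-missing case is easy to see on the released estimator (a minimal sketch):

import numpy as np
from sklearn.impute import KNNImputer

X_train = np.array([[1.0, np.nan],
                    [2.0, np.nan],
                    [3.0, np.nan]])  # second feature is never observed

imp = KNNImputer(n_neighbors=2).fit(X_train)
print(imp.transform(X_train).shape)  # (3, 1): the all-missing column is dropped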
Member:

Do they consider distance weighting? It might be worth noting the differences...

Member (Author):

It looks like they use distance weighting by default.

@thomasjpfan (Member, Author) commented on Sep 3, 2019:

Gerhard Tutz and Shahla Ramzan, Improved Methods for the Imputation of Missing Data by Nearest Neighbor Methods, show in their experiments that weighted kNN performs a little better than unweighted.

Member (Author):

Lorenzo Beretta and Alessandro Santaniello, Nearest neighbor imputation algorithms: a critical evaluation, show that weighted is slightly better than unweighted.
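
Both weightings ended up exposed through the `weights` parameter; a quick side-by-side on a small example:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Plain average of the two nearest donors.
print(KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X))
# Donors weighted by the inverse of their nan-aware euclidean distance.
print(KNNImputer(n_neighbors=2, weights="distance").fit_transform(X))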

])
knn = KNNImputer(missing_values=na, n_neighbors=2).fit(X)

X_transform = knn.transform(X)
Member:

Please check with test data where the feature has values.

Member (Author):

PR updated with this suggestion.

dist = nan_euclidean_distances(X)
r1c1_nbor_dists = dist[1, [0, 2, 3, 4, 5]]
r1c3_nbor_dists = dist[1, [0, 3, 4, 5, 6]]
r1c1_nbor_wt = (1 / r1c1_nbor_dists)
Member:

Suggested change:
- r1c1_nbor_wt = (1 / r1c1_nbor_dists)
+ r1c1_nbor_wt = 1 / r1c1_nbor_dists

r1c1_nbor_dists = dist[1, [0, 2, 3, 4, 5]]
r1c3_nbor_dists = dist[1, [0, 3, 4, 5, 6]]
r1c1_nbor_wt = (1 / r1c1_nbor_dists)
r1c3_nbor_wt = (1 / r1c3_nbor_dists)
Member:

Suggested change:
- r1c3_nbor_wt = (1 / r1c3_nbor_dists)
+ r1c3_nbor_wt = 1 / r1c3_nbor_dists



@pytest.mark.parametrize("na", [-1, np.nan])
def test_knn_imputer_with_weighted_features(na):
Member:

"weighted features" seems to be a strange name. This also appears to overlap in purpose with test_knn_imputer_weight_distance. What's the distinction?

Member (Author):

There was no distinction. test_knn_imputer_with_weighted_features was removed and its tests were moved into test_knn_imputer_weight_distance.

@jnothman (Member) left a comment:

Otherwise LGTM!!!

@thomasjpfan (Member, Author):

The mask of `fit_X` is now stored as a private attribute during `fit` (see the sketch below).
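
A minimal sketch of that change (the helper here is a stand-in; the merged code computes the mask with an internal utility and stores it as `_mask_fit_X`):

import numpy as np

def _missing_mask(X, missing_values):
    # Stand-in helper: boolean mask marking the missing entries.
    if isinstance(missing_values, float) and np.isnan(missing_values):
        return np.isnan(X)
    return X == missing_values

class KNNImputerSketch:
    # Skeleton only: shows where the mask is cached, nothing more.
    def __init__(self, missing_values=np.nan):
        self.missing_values = missing_values

    def fit(self, X):
        self._fit_X = np.asarray(X, dtype=float)
        # Computed once here so transform() does not have to redo it.
        self._mask_fit_X = _missing_mask(self._fit_X, self.missing_values)
        return self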

@jnothman (Member) left a comment:

I think this is good to hit the road.

Congrats @ashimb9 and thanks @thomasjpfan for finishing it off!

@jnothman jnothman merged commit 2d1b9e3 into scikit-learn:master Sep 3, 2019
@amueller (Member) commented on Sep 4, 2019:

yayyyy finally ;)

@banilo (Contributor) commented on Sep 4, 2019 via email.

@chitcode:

Very happy to see this.
