-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] Adds KNNImputer #12852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Adds KNNImputer #12852
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another comment I had pending. Not sure when I'll give it another full pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amueller, this one doesn't store any attributes ending in _
.
Sorry, still not really able to give this a full review.
sklearn/impute/_knn.py
Outdated
return X[:, valid_idx] | ||
|
||
def _more_tags(self): | ||
return {'allow_nan': True} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return {'allow_nan': True} | |
return {'allow_nan': is_scalar_nan(self.missing_values)} |
sklearn/impute/_knn.py
Outdated
|
||
Parameters | ||
---------- | ||
dist_pot_donors : array-like, shape=(n_receivers, n_train_samples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should reformat according to the recent edicts on these matters.
@jnothman scikit-learn/sklearn/utils/validation.py Line 927 in 35c0ca0
|
tests are failing... |
Test failure was unrelated, now that #14689 is merged, merging this with master should fix it. |
least one neighbor with a defined distance, the weighted or unweighted average | ||
of the remaining neighbors will be used during imputation. If a feature is | ||
always missing in training, it is removed during `transform`. For more | ||
information on the methodology, see ref. [OL2001]_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do they consider distance weighting? It might be worth noting the differences...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like they use distance weighting by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gerhard Tutz, Shahla Ramzan,
Improved Methods for the Imputation of Missing Data by Nearest Neighbor Methods shows in their experiments that weighted kNN performs a little better than unweighted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lorenzo Beretta* and Alessandro Santaniello, Nearest neighbor imputation algorithms: a critical evaluation, shows that weighted is slightly better than unweighted.
]) | ||
knn = KNNImputer(missing_values=na, n_neighbors=2).fit(X) | ||
|
||
X_transform = knn.transform(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check with test data where the feature has values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR updated with this suggestion.
sklearn/impute/tests/test_knn.py
Outdated
dist = nan_euclidean_distances(X) | ||
r1c1_nbor_dists = dist[1, [0, 2, 3, 4, 5]] | ||
r1c3_nbor_dists = dist[1, [0, 3, 4, 5, 6]] | ||
r1c1_nbor_wt = (1 / r1c1_nbor_dists) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r1c1_nbor_wt = (1 / r1c1_nbor_dists) | |
r1c1_nbor_wt = 1 / r1c1_nbor_dists |
sklearn/impute/tests/test_knn.py
Outdated
r1c1_nbor_dists = dist[1, [0, 2, 3, 4, 5]] | ||
r1c3_nbor_dists = dist[1, [0, 3, 4, 5, 6]] | ||
r1c1_nbor_wt = (1 / r1c1_nbor_dists) | ||
r1c3_nbor_wt = (1 / r1c3_nbor_dists) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r1c3_nbor_wt = (1 / r1c3_nbor_dists) | |
r1c3_nbor_wt = 1 / r1c3_nbor_dists |
sklearn/impute/tests/test_knn.py
Outdated
|
||
|
||
@pytest.mark.parametrize("na", [-1, np.nan]) | ||
def test_knn_imputer_with_weighted_features(na): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"weighted features" seems to be a strange name. This also appears to overlap in purpose with test_knn_imputer_weight_distance
. What's the distinction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was not a distinction. test_knn_imputer_with_weighted_features
was removed and the tests from there was moved into test_knn_imputer_weight_distance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM!!!
The mask of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good to hit the road.
Congrats @ashimb9 and thanks @thomasjpfan for finishing it off!
yayyyy finally ;) |
Glad to see this one in, too!
Danilo
… Am 04.09.2019 um 6:43 PM schrieb Andreas Mueller ***@***.***>:
yayyyy finally ;)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#12852?email_source=notifications&email_token=AA5Z6RABX22LODMU7SHEIDDQH7QSZA5CNFSM4GL4N2AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD54GS7A#issuecomment-527985020>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA5Z6RB4ZL7CGTRMJDWPFODQH7QSZANCNFSM4GL4N2AA>.
|
Very happy to see this. |
Reference Issues/PRs
Continues #9348, and a part of #9212
Closes #2989
Closes #9348
Closes #9212
What does this implement/fix? Explain your changes.
This PR cleans up the work done on #9348 and #9212:
Edit:
KNNImputer
has been added this to PR with the following updates: