
[MRG] Adds KNNImputer #12852


Merged: 269 commits merged into scikit-learn:master on Sep 3, 2019

Conversation

@thomasjpfan (Member) commented on Dec 21, 2018:

Reference Issues/PRs

Continues #9348 and is part of #9212.

Closes #2989
Closes #9348
Closes #9212

What does this implement/fix? Explain your changes.

This PR cleans up the work done on #9348 and #9212:

  1. Adds more tests.
  2. Cleans up warnings/errors.
  3. Currently, the distance between two completely-nan vectors is nan, which can make entries on the diagonal nan (see the sketch below). This behavior can be changed depending on our preferences.
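
For concreteness, a minimal sketch of item 3 using the `nan_euclidean_distances` helper this work introduces (behavior as released in scikit-learn 0.22):

import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# The first row is entirely missing, so it shares no observed coordinate
# with any other row (or with itself).
X = np.array([[np.nan, np.nan],
              [3.0, 4.0],
              [6.0, 8.0]])

# The all-nan row yields nan everywhere, including its own diagonal entry.
print(nan_euclidean_distances(X))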

Edit: KNNImputer has been added to this PR with the following updates:

  1. Addresses comments in [MRG] Added k-Nearest Neighbor imputation for missing data #9212.
  2. Reduces the number of variables.
  3. Adds memory optimizations.
  4. Raises errors as soon as possible, before any computation (see the sketch below).
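
A minimal sketch of that fail-fast behavior (the exact error message varies across versions):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan]])

# An invalid `weights` value is rejected during validation, before any
# distance matrix is computed.
try:
    KNNImputer(weights="not-a-weighting").fit(X)
except ValueError as exc:
    print(exc)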

@jnothman (Member) left a comment:

This is another comment I had pending. Not sure when I'll give it another full pass.

@jnothman (Member) left a comment:

@amueller, this one doesn't store any attributes ending in _.

Sorry, still not really able to give this a full review.

        return X[:, valid_idx]

    def _more_tags(self):
        return {'allow_nan': True}
Member:

Suggested change:
- return {'allow_nan': True}
+ return {'allow_nan': is_scalar_nan(self.missing_values)}
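
The point of the suggestion: `allow_nan` should only be advertised when NaN really is the missing-value marker. A rough re-implementation of what `is_scalar_nan` decides (illustrative, not the library's exact code):

import math
import numbers

def is_scalar_nan(x):
    # True only for a real NaN scalar; False for markers like -1 or "missing".
    return isinstance(x, numbers.Real) and math.isnan(x)

print(is_scalar_nan(float("nan")))  # True  -> {'allow_nan': True}
print(is_scalar_nan(-1))            # False -> {'allow_nan': False}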


Parameters
----------
dist_pot_donors : array-like, shape=(n_receivers, n_train_samples)
Member:

Should reformat according to the recent edicts on these matters.
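
Presumably this refers to the docstring shape convention scikit-learn adopted around that time; a hedged sketch of the expected rewrite (the description line is illustrative):

dist_pot_donors : ndarray of shape (n_receivers, n_train_samples)
    Distance matrix between the rows to impute and the potential donors.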

@thomasjpfan (Member, Author):

@jnothman check_is_fitted checks for _ in the beginning as well:

if (v.endswith("_") or v.startswith("_"))
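
A quick sketch of that auto-detection, using `check_is_fitted` with no explicit attributes (supported from scikit-learn 0.22 on):

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

class Toy(BaseEstimator):
    def fit(self, X):
        # Fitted state stored with a leading underscore only, no trailing one.
        self._mask_fit_X = np.asarray(X)
        return self

try:
    check_is_fitted(Toy())            # no "_"-attributes yet
except NotFittedError:
    print("not fitted")

check_is_fitted(Toy().fit([[0.0]]))   # passes: leading "_" counts as fitted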

@amueller (Member):

tests are failing...

@thomasjpfan (Member, Author):

The test failure was unrelated; now that #14689 is merged, merging master into this branch should fix it.

least one neighbor with a defined distance, the weighted or unweighted average
of the remaining neighbors will be used during imputation. If a feature is
always missing in training, it is removed during `transform`. For more
information on the methodology, see ref. [OL2001]_.
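
The always-missing case is easy to see on the released estimator (a minimal sketch):

import numpy as np
from sklearn.impute import KNNImputer

X_train = np.array([[1.0, np.nan],
                    [2.0, np.nan],
                    [3.0, np.nan]])  # second feature is never observed

imp = KNNImputer(n_neighbors=2).fit(X_train)
print(imp.transform(X_train).shape)  # (3, 1): the all-missing column is dropped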
Member:

Do they consider distance weighting? It might be worth noting the differences...

Member (Author):

It looks like they use distance weighting by default.

@thomasjpfan (Member, Author) commented on Sep 3, 2019:

Gerhard Tutz and Shahla Ramzan, Improved Methods for the Imputation of Missing Data by Nearest Neighbor Methods, show in their experiments that weighted kNN performs a little better than unweighted.

Member (Author):

Lorenzo Beretta and Alessandro Santaniello, Nearest neighbor imputation algorithms: a critical evaluation, show that weighted is slightly better than unweighted.
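
Both weightings ended up exposed through the `weights` parameter; a quick side-by-side on a small example:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Plain average of the two nearest donors.
print(KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X))
# Donors weighted by the inverse of their nan-aware euclidean distance.
print(KNNImputer(n_neighbors=2, weights="distance").fit_transform(X))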

])
knn = KNNImputer(missing_values=na, n_neighbors=2).fit(X)

X_transform = knn.transform(X)
Member:

Please check with test data where the feature has values.

Member (Author):

PR updated with this suggestion.

dist = nan_euclidean_distances(X)
r1c1_nbor_dists = dist[1, [0, 2, 3, 4, 5]]
r1c3_nbor_dists = dist[1, [0, 3, 4, 5, 6]]
r1c1_nbor_wt = (1 / r1c1_nbor_dists)
Member:

Suggested change:
- r1c1_nbor_wt = (1 / r1c1_nbor_dists)
+ r1c1_nbor_wt = 1 / r1c1_nbor_dists

r1c1_nbor_dists = dist[1, [0, 2, 3, 4, 5]]
r1c3_nbor_dists = dist[1, [0, 3, 4, 5, 6]]
r1c1_nbor_wt = (1 / r1c1_nbor_dists)
r1c3_nbor_wt = (1 / r1c3_nbor_dists)
Member:

Suggested change:
- r1c3_nbor_wt = (1 / r1c3_nbor_dists)
+ r1c3_nbor_wt = 1 / r1c3_nbor_dists



@pytest.mark.parametrize("na", [-1, np.nan])
def test_knn_imputer_with_weighted_features(na):
Member:

"weighted features" seems to be a strange name. This also appears to overlap in purpose with test_knn_imputer_weight_distance. What's the distinction?

Member (Author):

There was no distinction. test_knn_imputer_with_weighted_features was removed and its tests were moved into test_knn_imputer_weight_distance.

@jnothman (Member) left a comment:

Otherwise LGTM!!!

@thomasjpfan (Member, Author):

The mask of `fit_X` is now stored as a private attribute during `fit` (see the sketch below).
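
A minimal sketch of that change (the helper here is a stand-in; the merged code computes the mask with an internal utility and stores it as `_mask_fit_X`):

import numpy as np

def _missing_mask(X, missing_values):
    # Stand-in helper: boolean mask marking the missing entries.
    if isinstance(missing_values, float) and np.isnan(missing_values):
        return np.isnan(X)
    return X == missing_values

class KNNImputerSketch:
    # Skeleton only: shows where the mask is cached, nothing more.
    def __init__(self, missing_values=np.nan):
        self.missing_values = missing_values

    def fit(self, X):
        self._fit_X = np.asarray(X, dtype=float)
        # Computed once here so transform() does not have to redo it.
        self._mask_fit_X = _missing_mask(self._fit_X, self.missing_values)
        return self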

@jnothman (Member) left a comment:

I think this is good to hit the road.

Congrats @ashimb9 and thanks @thomasjpfan for finishing it off!

@jnothman jnothman merged commit 2d1b9e3 into scikit-learn:master Sep 3, 2019
@amueller (Member) commented on Sep 4, 2019:

yayyyy finally ;)

@banilo (Contributor) commented on Sep 4, 2019 via email.

@chitcode:

Very happy to see this.
