Sequential feature selection - unsupervised learning #19568
Conversation
Hi @ShyamDesai, thanks for your pull request. There is a linting error that should be fixed. Do you mind having a look? Thanks!
Fixed linting error within fit docstring & more formatting
While I am not opposed to adding support for unsupervised estimators to SequentialFeatureSelector in general, I doubt that it can be of any use with the default .score function of KMeans: by default, KMeans uses the negative inertia (sum of squared distances to the nearest centroid). Depending on the relative magnitudes of the features, the inertia can vary significantly and is not necessarily meaningful for feature selection.
Alternative unsupervised clustering metrics such as the Davies-Bouldin index or the silhouette coefficient would probably make more sense, but those are currently not registered in sklearn.metrics.SCORERS and therefore cannot be passed as a valid scoring= argument (I think).
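Not part of this PR, but to make the point above concrete: `scoring` in `SequentialFeatureSelector` also accepts a callable with the signature `scorer(estimator, X, y)`, so an unsupervised metric such as the silhouette coefficient can be wrapped by hand rather than passed as a registered scorer string. A rough sketch (the scorer name and the parameter values are illustrative, not taken from the PR):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import silhouette_score


def silhouette_scorer(estimator, X, y=None):
    # Re-cluster the scored subset so at least two labels are present,
    # then rate the candidate feature subset by its silhouette coefficient.
    labels = estimator.fit_predict(X)
    return silhouette_score(X, labels)


X, _ = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(
    KMeans(n_clusters=3, n_init=10, random_state=0),
    n_features_to_select=2,
    scoring=silhouette_scorer,
)
sfs.fit(X)  # no target needed
print(sfs.get_support())
```

This sidesteps the fact that the silhouette coefficient is not registered as a scorer string, at the cost of re-fitting the clusterer inside the scoring callable.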
jnothman
left a comment
It might be good to also add a test with another non-conventional y, but I'm not immediately thinking of what that might be (complex numbers, perhaps??)
We would also need a changelog entry.
These were my initial thoughts as well. Would the alternative option be to exclude unsupervised estimators from SequentialFeatureSelector?
Cleaner code suggestion Co-authored-by: Olivier Grisel <[email protected]>
sfs.fit(X)
assert sfs.transform(X).shape[1] == n_features_to_select

@pytest.mark.parametrize('y', ('no_validation', 1j, 99.9, np.nan, 3))
I had meant an array of complex numbers sized like X. This certainly tests a lack of validation! Now I'm not sure you need a separate test for None.
My thinking was to test for any kind of input similar to y=None, and yes, it seems we don't need a separate test for None. What would we expect SFS to do in the case of unconventional input with the shape of X (complex numbers, NaN, strings, etc.)?
We expect it to let the underlying estimator deal with it!
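For what it's worth, here is a sketch of the kind of test this thread is circling around (illustrative only, not the PR's actual test, and assuming SFS forwards `y` untouched to the underlying estimator as discussed above): pass `y` values that only the estimator would ever look at, and check that SFS itself does not choke on them because KMeans simply ignores `y`.

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans
from sklearn.feature_selection import SequentialFeatureSelector


@pytest.mark.parametrize(
    "y",
    [None, np.full(50, np.nan), np.full(50, 1j)],  # unconventional targets
)
def test_sfs_leaves_y_to_the_estimator(y):
    rng = np.random.RandomState(0)
    X = rng.randn(50, 5)
    sfs = SequentialFeatureSelector(
        KMeans(n_clusters=2, n_init=10, random_state=0),
        n_features_to_select=2,
    )
    # SFS passes y through to cross-validation; KMeans.fit and KMeans.score ignore it.
    sfs.fit(X, y)
    assert sfs.transform(X).shape[1] == 2
```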
sphinx error fix Co-authored-by: Chiara Marmo <[email protected]>
jnothman
left a comment
Apart from the tests now having rather strange examples of unusual y, this LGTM
Hi @ShyamDesai, thanks for your patience and your work so far! While we wait for a second approval, do you mind synchronizing with upstream? The continuous integration workflows have changed for version 1.0, and we need all the checks to be rerun before merging. Thanks!
glemaitre
left a comment
LGTM. I resolved the conflicts. We will check that the tests still pass, but the code looks fine.
…thod (scikit-learn#19568) Co-authored-by: Olivier Grisel <[email protected]> Co-authored-by: Chiara Marmo <[email protected]> Co-authored-by: Guillaume Lemaitre <[email protected]>
Reference Issues/PRs
Fixes #19538.
What does this implement/fix? Explain your changes.
Modified the `fit` signature to default to `y=None`, modified the `validate_data` function call accordingly, changed the docstring, and added SFS test cases to demonstrate usability. For cases such as KMeans clustering and other unsupervised learning methods that do not require `y` validation, `fit` will no longer produce an error.
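For reference, a minimal usage sketch of the behaviour this change enables (the data and parameter values are illustrative, not taken from the PR):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.RandomState(0)
X = rng.randn(100, 6)

sfs = SequentialFeatureSelector(
    KMeans(n_clusters=3, n_init=10, random_state=0),
    n_features_to_select=3,
)
# No target is required any more; with the default scoring, the estimator's own
# score method is used (negative inertia for KMeans, as discussed above).
sfs.fit(X)
print(sfs.transform(X).shape)  # (100, 3)
```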
Any other comments?
Please let me know if there are any fixes or suggestions; I am fairly new at this and would be more than happy to go back and work on it :)