Conversation

@ShyamDesai
Contributor

Reference Issues/PRs

Fixes #19538.

What does this implement/fix? Explain your changes.

Changed the fit signature so that y defaults to None, updated the validate_data call accordingly, revised the docstring, and added SFS test cases to demonstrate usability. For KMeans clustering and other unsupervised learning methods that do not require y validation, fit no longer raises an error when y is omitted.

Any other comments?

Please let me know if there are any fixes or suggestions. I am fairly new at this and would be more than happy to go back and work on it :)

@cmarmo
Contributor

cmarmo commented Feb 26, 2021

Hi @ShyamDesai, thanks for your pull request. There is a linting error that should be fixed.

sklearn/feature_selection/_sequential.py:128:80: E501 line too long (83 > 79 characters)

Do you mind having a look? Thanks!

Fixed linting error within fit docstring & more formatting
@ShyamDesai ShyamDesai changed the title from "Sequential forwards selection - unsupervised learning #19538" to "Sequential forwards selection - unsupervised learning" Feb 26, 2021
Member

@ogrisel ogrisel left a comment

While I am not opposed to adding support for unsupervised estimators to SequentialFeatureSelector in general, I doubt that it can be of any use with the default .score function of KMeans: by default, KMeans uses the negative inertia (sum of squared distances to the nearest centroid). Depending on the relative magnitude of the features, the inertia can vary significantly and is not necessarily meaningful for feature selection.

Alternative unsupervised clustering metrics such as the Davies-Bouldin Index or the Silhouette Coefficient would probably make more sense, but those are currently not registered in sklearn.metrics.SCORERS and therefore cannot be passed as a valid scoring= argument (I think).

Member

@jnothman jnothman left a comment

It might be good to also add a test with another non-conventional y, but I'm not immediately thinking of what that might be (complex numbers, perhaps??)

@ogrisel
Member

ogrisel commented Feb 27, 2021

We would also need a changelog entry in doc/whats_new/v1.0.rst.

@ShyamDesai
Contributor Author

> While I am not opposed to adding support for unsupervised estimators to SequentialFeatureSelector in general, I doubt that it can be of any use with the default .score function of KMeans: by default, KMeans uses the negative inertia (sum of squared distances to the nearest centroid). Depending on the relative magnitude of the features, the inertia can vary significantly and is not necessarily meaningful for feature selection.
>
> Alternative unsupervised clustering metrics such as the Davies-Bouldin Index or the Silhouette Coefficient would probably make more sense, but those are currently not registered in sklearn.metrics.SCORERS and therefore cannot be passed as a valid scoring= argument (I think).

These were my initial thoughts as well. Would the alternative be to exclude unsupervised estimators from SequentialFeatureSelector, given that the default scoring method of some of these models is incompatible with, or irrelevant to, feature selection?

Cleaner code suggestion

Co-authored-by: Olivier Grisel <[email protected]>
@ShyamDesai ShyamDesai changed the title from "Sequential forwards selection - unsupervised learning" to "Sequential feature selection - unsupervised learning" Feb 27, 2021
sfs.fit(X)
assert(sfs.transform(X).shape[1] == n_features_to_select)

@pytest.mark.parametrize('y', ('no_validation', 1j, 99.9, np.nan, 3))
Member

I had meant an array of complex numbers sized like X. This certainly tests a lack of validation! Now I'm not sure you need a separate test for None.

Contributor Author

My thinking was to test for any kind of input similar to y=None, and yes, it seems that we don't need a separate test for None. What would we expect SFS to do in the case of unconventional input with the shape of X (whether complex numbers, NaN, strings, etc.)?

Member

We expect it to let the underlying estimator deal with it!

sphinx error fix

Co-authored-by: Chiara Marmo <[email protected]>
Member

@jnothman jnothman left a comment

Apart from the tests now having rather strange examples of unusual y, this LGTM

@cmarmo
Contributor

cmarmo commented May 27, 2021

Hi @ShyamDesai, thanks for your patience and your work so far! While we wait for a second approval, do you mind synchronizing with upstream? The continuous integration workflows have changed for version 1.0 and we need all the checks to be rerun before merging. Thanks!

@glemaitre glemaitre self-assigned this Jul 29, 2021
Member

@glemaitre glemaitre left a comment

LGTM. I solved the conflicts. We will check if the tests are still fine but the code is fine.

@glemaitre glemaitre merged commit acc13de into scikit-learn:main Jul 29, 2021
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Jul 29, 2021
…thod (scikit-learn#19568)

Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Chiara Marmo <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
…thod (scikit-learn#19568)

Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Chiara Marmo <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>

Development

Successfully merging this pull request may close these issues.

Sequential forward selection - unsupervised fit_transform bug

5 participants