Sequential feature selection - unsupervised learning #19568

ShyamDesai · 2021-02-26T03:54:36Z

Reference Issues/PRs

Fixes #19538 .

What does this implement/fix? Explain your changes.

Changed the fit signature so that y defaults to None, updated the validate_data call accordingly, updated the docstring, and added SFS test cases to demonstrate usability. For estimators such as KMeans and other unsupervised learning methods that do not require y, fit no longer raises an error.

Any other comments?

Please let me know if there are any fixes or suggestions; I am fairly new at this and would be more than happy to go back and work on it :)

cmarmo · 2021-02-26T13:56:20Z

Hi @ShyamDesai, thanks for your pull request. There is a linting error that should be fixed.

sklearn/feature_selection/_sequential.py:128:80: E501 line too long (83 > 79 characters)

Do you mind having a look? Thanks!

sklearn/feature_selection/tests/test_sequential.py

ogrisel

While I am not opposed to adding support for unsupervised estimators to SequentialFeatureSelector in general, I doubt that it can be of any use with the default .score function of KMeans: by default KMeans uses the negative inertia (sum of squared distances to the nearest centroid). Depending on the relative magnitude of the features, the inertia can vary significantly and is not necessarily meaningful for feature selection.
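A toy illustration of this concern, not taken from the PR: the two synthetic columns and the helper name default_score are assumptions made up for the example. It compares the default KMeans score on a clearly clustered feature against a structureless feature of much smaller magnitude:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# Informative feature: two well-separated groups around -5 and +5.
informative = np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)])
# Uninformative feature: noise with no cluster structure and a tiny scale.
noise = rng.normal(0, 0.01, 200)


def default_score(column):
    """Fit KMeans on one feature and return KMeans.score (negative inertia)."""
    data = column.reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    return km.score(data)


score_informative = default_score(informative)
score_noise = default_score(noise)

# The noise feature gets the higher (closer to zero) score purely because
# its values are small, not because it clusters better.
print(score_informative, score_noise)
```

A selector that greedily maximizes this score would therefore tend to favor the small-magnitude noise column, which is the behavior questioned above.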

Alternative unsupervised clustering metrics such as the Davies-Bouldin Index or the Silhouette Coefficient would probably make more sense, but those are currently not registered in sklearn.metrics.SCORERS and therefore cannot be passed as a valid scoring= argument (I think).

jnothman

It might be good to also add a test with another non-conventional y, but I'm not immediately thinking of what that might be (complex numbers, perhaps??)

ogrisel · 2021-02-27T14:01:43Z

We would also need a changelog entry in doc/whats_new/v1.0.rst.

ShyamDesai · 2021-02-27T18:48:54Z

> While I am not opposed to adding support for unsupervised estimators to SequentialFeatureSelector in general, I doubt that it can be of any use with the default .score function of KMeans: by default KMeans uses the negative inertia (sum of squared distances to the nearest centroid). Depending on the relative magnitude of the features, the inertia can vary significantly and is not necessarily meaningful for feature selection.
>
> Alternative unsupervised clustering metrics such as the Davies-Bouldin Index or the Silhouette Coefficient would probably make more sense, but those are currently not registered in sklearn.metrics.SCORERS and therefore cannot be passed as a valid scoring= argument (I think).

These were my initial thoughts as well. Would the alternative be to exclude unsupervised estimators from SequentialFeatureSelector, given that the default scoring method for some of these models is incompatible with and/or irrelevant for feature selection?

jnothman · 2021-02-27T23:11:37Z

sklearn/feature_selection/tests/test_sequential.py

    sfs.fit(X)
    assert(sfs.transform(X).shape[1] == n_features_to_select)
+
+@pytest.mark.parametrize('y', ('no_validation', 1j, 99.9, np.nan, 3))


I had meant an array of complex numbers sized like X. This certainly tests a lack of validation! Now I'm not sure you need a separate test for None

ShyamDesai · 2021-03-01

My thinking was to test for any kind of input similar to y=None, and yes, it seems that we don't need a separate test for None. What would we expect SFS to do in the case of unconventional input with the shape of X (whether complex numbers, NaN, strings, etc.)?

jnothman · 2021-03-02

We expect it to let the underlying estimator deal with it!

jnothman

Apart from the tests now having rather strange examples of unusual y, this LGTM

cmarmo · 2021-05-27T14:00:30Z

Hi @ShyamDesai, thanks for your patience and your work so far! While we wait for a second approval, do you mind synchronizing with upstream? The continuous integration workflows have changed for version 1.0 and we need all the checks to be rerun before merging. Thanks!

glemaitre

LGTM. I solved the conflicts. We will check if the tests are still fine but the code is fine.

…thod (scikit-learn#19568) Co-authored-by: Olivier Grisel Co-authored-by: Chiara Marmo Co-authored-by: Guillaume Lemaitre

Sequential forwards selection - unsupervised learning scikit-learn#19538

18ab2e4

github-actions bot added the module:feature_selection label Feb 26, 2021

ShyamDesai added 2 commits February 26, 2021 13:50

Update _sequential.py

accc339

Fixed linting error within fit docstring & more formatting

_sequential.py trailing white space

e302be8

ShyamDesai changed the title ~~Sequential forwards selection - unsupervised learning #19538~~ Sequential forwards selection - unsupervised learning Feb 26, 2021

ogrisel reviewed Feb 26, 2021

sklearn/feature_selection/tests/test_sequential.py

ogrisel reviewed Feb 26, 2021

jnothman reviewed Feb 27, 2021

Update sklearn/feature_selection/tests/test_sequential.py

3c97beb

Cleaner code suggestion Co-authored-by: Olivier Grisel

ShyamDesai changed the title ~~Sequential forwards selection - unsupervised learning~~ Sequential feature selection - unsupervised learning Feb 27, 2021

ShyamDesai added 2 commits February 27, 2021 14:58

further y validation tests & updated whats_new/v1.0 SFS fix

a80355a

test file lint fix

ab70ffd

jnothman reviewed Feb 27, 2021

cmarmo reviewed Mar 1, 2021

doc/whats_new/v1.0.rst

Update doc/whats_new/v1.0.rst

02f858d

sphinx error fix Co-authored-by: Chiara Marmo

jnothman approved these changes Mar 7, 2021

cmarmo added the Waiting for Reviewer label May 27, 2021

glemaitre self-assigned this Jul 29, 2021

Merge remote-tracking branch 'origin/main' into pr/ShyamDesai/19568

c213e3e

glemaitre approved these changes Jul 29, 2021

glemaitre merged commit acc13de into scikit-learn:main Jul 29, 2021
