
[MRG] Allow nan/inf in feature selection #11635


Merged

Conversation

adpeters
Contributor

@adpeters adpeters commented Jul 19, 2018

Reference Issues/PRs

Closes #10821
Closes #10985

What does this implement/fix? Explain your changes.

Allows for NaN/Inf values in RFE/RFECV.fit method as well as SelectorMixin.transform. This affects all feature selection estimators that inherit from SelectorMixin (except for univariate selectors), which includes those in sklearn.feature_selection.variance_threshold and sklearn.linear_model.randomized_l1.

The RFE/RFECV.fit method does not need to check y, as any checks should be done by the estimator when it runs its own fit, so I changed check_X_y to check_array with just X, and allowed NaN/Inf values.

For SelectorMixin.transform, the method itself does not require finite input, so we should let any inheritors do that check themselves if they need it.
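As a minimal illustration of the validation behavior this relies on (a sketch, not the exact PR diff): passing force_all_finite=False to check_array lets NaN/Inf values through validation.

```python
import numpy as np
from sklearn.utils import check_array

# With the default (force_all_finite=True) this input would raise a
# ValueError; force_all_finite=False accepts NaN and Inf.
X = np.array([[1.0, np.nan],
              [np.inf, 2.0]])
X_checked = check_array(X, force_all_finite=False)
```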

Any other comments?

@jnothman
Member

You have test failures. Also, we are working towards a release, which this will not be included in, so please ping for review after 0.20 is released.

@georgewambold

georgewambold commented Jan 7, 2019

Is anyone working on this issue? I'm also running into problems with check_X_y in RFE and I think this fix would be great.

@jnothman jnothman left a comment
Member

This is looking pretty good. Please add a test that univariate selection accepts NaN at transform time. We could (and should) also use force_all_finite=False in univariate_selection._BaseFilter.fit, since the underlying scoring functions check it. SelectFromModel should already be lenient in fit.

You could consider creating a feature_selection/tests/test_common.py file that checks this for all the feature selection estimators (although we would have to use nanvar; mean_variance_axis should already exclude NaNs).

It would be really wonderful to have all feature selectors be insensitive to missing values and this is only a few lines of code away.
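For reference, the difference between np.var and np.nanvar that makes the suggestion above work:

```python
import numpy as np

# np.var propagates NaN; np.nanvar excludes NaN entries, which is what a
# NaN-tolerant per-feature variance computation needs.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [3.0, 5.0]])
var = np.var(X, axis=0)        # second feature's variance is nan
nanvar = np.nanvar(X, axis=0)  # NaN excluded: variance of [3.0, 5.0]
```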

rfe = RFE(estimator=clf)
rfe.fit(X, y)
rfe.transform(X)

Member

Two blank lines

rfecv = RFECV(estimator=clf)
rfecv.fit(X, y)
rfecv.transform(X)

Member

No blank lines

rfe.fit(X, y)
rfe.transform(X)

def test_rfecv_allow_nan_inf_in_x():
Member

use a loop or pytest.mark.parametrize rather than duplicating code

Contributor Author

Added pytest.mark.parametrize.

@adpeters
Contributor Author

adpeters commented Jan 8, 2019

Okay I made most of the changes mentioned above: fixed pep8 errors, added tests for univariate fit and transform and SelectFromModel transform, and updated univariate fit to allow nan/inf as well. I also updated estimator_checks to not perform the nan/inf check on SelectorMixin objects.

There's probably a more logical/efficient way to do all the tests, but this is what I've got for now.

@jnothman
Member

jnothman commented Jan 8, 2019

Please merge in an updated master; Circle CI should not be running on Python 2 anymore.

@@ -103,7 +104,7 @@ def _yield_non_meta_checks(name, estimator):
     # cross-decomposition's "transform" returns X and Y
     yield check_pipeline_consistency

-    if name not in ALLOW_NAN:
+    if name not in ALLOW_NAN and not isinstance(estimator, SelectorMixin):
Member

Hmm... Not sure if we'd rather this. There still may be selectors (not in our library) that should error on NaN/inf, and many of these selectors with their default parameters still should/will error on NaN/inf.

In the (near) future, estimator tags (#8022) will solve this.

Contributor Author

Ahh okay. I'm running into an issue with adding the univariate estimators to the ALLOW_NAN list because they fail check_estimators_pickle. The issue is that while _BaseFilter allows NaN/inf in fit and transform, the default scorer (and all scorers thus far) don't allow them, so when fit is called it errors on NaN. In my tests I use a dummy scorer that avoids this, but check_estimators_pickle is a generic check that applies to all estimators so I'm not sure what the best way to avoid this is without changing the default scorer for _BaseFilter.
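A minimal reproduction of the scorer behavior described above: the default scorer validates its own input, so it rejects NaN even when the surrounding fit would let it through.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# f_classif runs its own input validation, so NaN raises a ValueError
# regardless of how lenient _BaseFilter.fit is.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [3.0, 4.0],
              [4.0, 5.0]])
y = np.array([0, 0, 1, 1])
try:
    f_classif(X, y)
    raised = False
except ValueError:
    raised = True
```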

Member

Why do they fail the pickle test with default parameters? For default parameters, all the feature selectors except VarianceThreshold should still raise an error on nan/inf, so you should not otherwise need to modify estimator_checks, should you?

Contributor Author

If you add them to ALLOW_NAN then they fail the pickle test because they don’t allow nan/inf with the default scorer. However if you don’t add them to ALLOW_NAN then check_estimators_nan_inf errors because the transform method does not check for nan/inf.

Member

Thanks for explaining.
I'm trying to work out if there's a better solution than creating ALLOW_NAN_TRANSFORM. Basically, at the moment we need to classify an estimator (and its parameter setting) as allowing or disallowing NaNs. Alternatively we could weaken the pickle test to fit on non-NaN data in the case that fitting with NaNs raises an error. (The ability to fit on NaN data there is not the point of the check. It's the interpretation of NaNs in transform/predict if NaNs are pickled with the model parameters/attributes that is the concern of that check.) Then we would be redefining ALLOW_NAN to mean "allows NaN in transform/predict, and may also allow NaN in fit". Your thoughts?

Member

Sorry this is harder than I thought!

Contributor Author

Thanks for helping work through it. I think relaxing the pickle test makes sense because it's not supposed to be directly checking NaN/Inf handling in the estimators. There are really two issues here:

  1. a discrepancy in whether an estimator allows NaN/Inf between its different methods (e.g. VarianceThreshold currently allows in transform but not fit), and
  2. the default parameter instance of an estimator not exhibiting its allowance of NaN/Inf.

Your suggestion of ALLOW_NAN_TRANSFORM would solve 1 and I could try to implement it if we decide it's the right way to go. But 2 would still not be solved since these estimators would belong in the ALLOW_NAN_FIT category but still error in the pickle check with NaN. So I think weakening the pickle check is the best option without changing the default scorer. I've updated my code to include this new version of check_estimators_pickle.
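The idea of the relaxed pickle check could be sketched roughly like this (a sketch of the intent, not the actual estimator_checks code): try fitting with NaNs, fall back to clean data if the estimator refuses, then verify the pickle round-trip.

```python
import pickle
import numpy as np
from sklearn.feature_selection import SelectKBest

rng = np.random.RandomState(0)
X = rng.rand(20, 4)
y = rng.randint(0, 2, 20)
X_nan = X.copy()
X_nan[0, 0] = np.nan

selector = SelectKBest(k=2)
# The check only cares about pickling fitted state, not NaN support in fit,
# so fall back to clean data if fitting with NaN raises.
try:
    selector.fit(X_nan, y)
except ValueError:
    selector.fit(X, y)

restored = pickle.loads(pickle.dumps(selector))
X_t = restored.transform(X)
```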

@georgewambold

Thanks @jnothman for the quick comments and @adpeters for the changes, and thank you both for your time!

@jnothman jnothman left a comment
Member

I think VarianceThreshold should be updated as well. All it should require is use of nanvar as far as I can see (and testing in both the sparse and dense cases).

I'm still not certain that the change to the pickle test is the best thing to do, but it's okay.

Are there other changes you hope to make before making this MRG?

When the scope of the PR is finalised, please add an entry to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:
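The VarianceThreshold change suggested above should behave roughly like this sketch (assuming nanvar-based per-feature variances): fit tolerates NaN, the constant feature is dropped, and the varying feature with a NaN is kept.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# First feature is constant (variance 0); second varies but contains a NaN.
# With np.nanvar the NaN is ignored when computing variances, so only the
# second feature survives the default threshold.
X = np.array([[0.0, 1.0],
              [0.0, np.nan],
              [0.0, 3.0]])
X_sel = VarianceThreshold().fit_transform(X)
```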

@jnothman
Member

jnothman commented Jan 10, 2019 via email

@jnothman jnothman mentioned this pull request Jan 17, 2019
@adpeters
Contributor Author

Okay I updated VarianceThreshold, added tests, and added to Notes documentation and whats_new.
I don't really like the change to the pickle test either, but I'm not sure of a better way right now. The other option would be relaxing the default scorer, f_classif, to allow NaN/Inf just by removing rows that contain them, but I don't think that's how we want functions to handle those cases.

I think this is all I can think of for this PR right now. Should I change it to MRG?

@jnothman
Member

jnothman commented Jan 18, 2019 via email

@jnothman jnothman left a comment
Member

Thanks for continuing on this. I wonder if we should pull univariate into a separate PR and merge the rest :|

-        X = check_array(X, dtype=None, accept_sparse='csr')
+        tags = self._get_tags()
+        X = check_array(X, dtype=None, accept_sparse='csr',
+                        force_all_finite=not tags.get('allow_nan', True))
Member

this logic now won't work for univariate :(

Contributor Author

Ah yeah I removed the test for the univariate but I wasn't sure what to do for that case. I guess the only way you could really use it would be to subclass and override the tags.

@@ -361,6 +362,9 @@ def fit(self, X, y):
     def _check_params(self, X, y):
         pass

+    def _more_tags(self):
+        return {'allow_nan': False}
Member

This should probably have a FIXME or similar to say that this should depend on the score func.

@adpeters
Contributor Author

Yeah, if we're not quite sure how we want it to work, we could leave that part out and keep the univariate selectors as not allowing nan/inf until it gets figured out.

@jnothman
Member

We'd like to release very soon. If you could strip this back to leave out univariate, and then propose that in another PR with a question on what to do about tags, I think that would be the most pragmatic solution.

@jnothman jnothman left a comment
Member

Thank you. Can we get another reviewer before release??

@amueller
Member

amueller commented Nov 2, 2019

I'm slightly confused (but also severely jet-lagged so that might be it).
The transform method of a feature selector always supports NaN, right? It's only fit that depends on the base estimator, right?

@jnothman
Member

jnothman commented Nov 2, 2019 via email

@amueller
Member

amueller commented Nov 2, 2019

hm... but do we want to restrict this because of estimator checks? I'm not sure if we check nan in transform. Would the "right" fix be to add separate tags for fit and transform?
I was just very confused because the test is for NaN in transform, not in fit, which seemed very strange to me.

@@ -320,3 +336,25 @@ def test_threshold_without_refitting():
     # Set a higher threshold to filter out more features.
     model.threshold = "1.0 * mean"
     assert X_transform.shape[1] > model.transform(data).shape[1]


+def test_transform_accepts_nan_inf():
Member

can you also check for fit, maybe use HistGradientBoostingClassifier?

@amueller amueller left a comment
Member

happy to go with this solution for the release, adding a test for SelectFromModel with NaN in fit would be nice, though.

@adpeters
Contributor Author

adpeters commented Nov 4, 2019

Thanks for the reviews. I added a test for NaN and Inf in SelectFromModel.fit using HistGradientBoostingClassifier as suggested. We still have to use a different classifier for the transform test since HistGradientBoostingClassifier does not generate feature importance metrics that SelectFromModel relies on to transform (coef_ or feature_importances_), but I think it's still an effective test of NaN/Inf in fit.

@jnothman jnothman merged commit 70b0dde into scikit-learn:master Nov 5, 2019
@jnothman
Member

jnothman commented Nov 5, 2019

Thanks @adpeters! It's been very nice working with you.

@adpeters
Contributor Author

adpeters commented Nov 5, 2019

@jnothman thanks for all your help on this! Glad I could contribute a little.

Successfully merging this pull request may close these issues:

- Is there any reason for SelectFromModel.transform to use force_all_finite=True in check_array?
- Unnecessary call to check_X_y in RFE