[MRG] Allow nan/inf in feature selection #11635
You have test failures. Also, we are working towards a release, which this will not be included in, so please ping for review after 0.20 is released.
Is anyone working on this issue? I'm also running into problems with |
This is looking pretty good. Please add a test that transforming for univariate selection with NaN at transform time is acceptable. But we could (and should) also use force_all_finite=False in univariate_selection._BaseFilter.fit since the underlying scoring functions check it. SelectFromModel should already be lenient in fit.
You could consider creating a feature_selection/tests/test_common.py file that checks this for all the feature selection estimators (although we would have to use nanvar; mean_variance_axis should already exclude NaNs).
It would be really wonderful to have all feature selectors be insensitive to missing values and this is only a few lines of code away.
    rfe = RFE(estimator=clf)
    rfe.fit(X, y)
    rfe.transform(X)
Two blank lines
    rfecv = RFECV(estimator=clf)
    rfecv.fit(X, y)
    rfecv.transform(X)
No blank lines
    rfe.fit(X, y)
    rfe.transform(X)


def test_rfecv_allow_nan_inf_in_x():
Use a loop or pytest.mark.parametrize rather than duplicating code.
Added pytest.mark.parametrize.
Okay, I made most of the changes mentioned above: fixed PEP 8 errors, added tests for univariate fit and transform and SelectFromModel transform, and updated univariate fit to allow nan/inf as well. I also updated estimator_checks to not perform the nan/inf check on SelectorMixin objects. There's probably a more logical/efficient way to do all the tests, but this is what I've got for now.
Please merge in an updated master. Circle CI should not be running on Python 2 anymore.
sklearn/utils/estimator_checks.py
Outdated
@@ -103,7 +104,7 @@ def _yield_non_meta_checks(name, estimator):
         # cross-decomposition's "transform" returns X and Y
         yield check_pipeline_consistency

-    if name not in ALLOW_NAN:
+    if name not in ALLOW_NAN and not isinstance(estimator, SelectorMixin):
Hmm... Not sure if we'd rather this. There still may be selectors (not in our library) that should error on NaN/inf, and many of these selectors with their default parameters still should/will error on NaN/inf.
In the (near) future, estimator tags (#8022) will solve this.
Ahh okay. I'm running into an issue with adding the univariate estimators to the ALLOW_NAN list because they fail check_estimators_pickle. The issue is that while _BaseFilter allows NaN/inf in fit and transform, the default scorer (and all scorers thus far) don't allow them, so when fit is called it errors on NaN. In my tests I use a dummy scorer that avoids this, but check_estimators_pickle is a generic check that applies to all estimators so I'm not sure what the best way to avoid this is without changing the default scorer for _BaseFilter.
Why do they fail the pickle test with default parameters? I think that, with default parameters, all the feature selectors except VarianceThreshold should still raise an error on nan/inf, so you should not otherwise need to modify estimator_checks, should you?
If you add them to ALLOW_NAN then they fail the pickle test because they don't allow nan/inf with the default scorer. However, if you don't add them to ALLOW_NAN then check_estimators_nan_inf errors because the transform method does not check for nan/inf.
Thanks for explaining.
I'm trying to work out if there's a better solution than creating ALLOW_NAN_TRANSFORM. Basically, at the moment we need to classify an estimator (and its parameter setting) as allowing or disallowing NaNs. Alternatively we could weaken the pickle test to fit on non-NaN data in the case that fitting with NaNs raises an error. (The ability to fit on NaN data there is not the point of the check. It's the interpretation of NaNs in transform/predict if NaNs are pickled with the model parameters/attributes that is the concern of that check.) Then we would be redefining ALLOW_NAN to mean "allows NaN in transform/predict, and may also allow NaN in fit". Your thoughts?
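A rough sketch of what the weakened pickle check proposed here might look like, using only the standard library. ToyScaler and check_pickle_roundtrip are hypothetical names invented for illustration; this is not the actual check_estimators_pickle from sklearn.utils.estimator_checks, only the fallback idea: try to fit on NaN data, refit on clean data if that raises, and compare transform output on NaN data before and after pickling.

```python
import math
import pickle


class ToyScaler:
    """Hypothetical stand-in: rejects NaN in fit, passes NaN through transform."""

    def fit(self, X):
        flat = [v for row in X for v in row]
        if any(math.isnan(v) for v in flat):
            raise ValueError("Input contains NaN")
        self.max_ = max(flat)
        return self

    def transform(self, X):
        return [[v / self.max_ for v in row] for row in X]


def check_pickle_roundtrip(estimator, X_nan, X_clean):
    """Fit on NaN data if the estimator allows it, else fall back to clean data."""
    try:
        estimator.fit(X_nan)
    except ValueError:
        # Fitting on NaN is not the point of the check, so refit on clean data.
        estimator.fit(X_clean)
    restored = pickle.loads(pickle.dumps(estimator))
    # NaN != NaN, so compare the printed form of both outputs instead.
    return repr(restored.transform(X_nan)) == repr(estimator.transform(X_nan))


X_clean = [[1.0, 2.0], [4.0, 8.0]]
X_nan = [[1.0, float("nan")], [4.0, 8.0]]
roundtrip_ok = check_pickle_roundtrip(ToyScaler(), X_nan, X_clean)
```

Under this reading, ALLOW_NAN only promises NaN support in transform/predict, and the check no longer fails just because the default parameters reject NaN in fit.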
Sorry this is harder than I thought!
Thanks for helping work through it. I think relaxing the pickle test makes sense because it's not supposed to be directly checking NaN/Inf handling in the estimators. There are really two issues here:
1. a discrepancy in whether an estimator allows NaN/Inf between its different methods (e.g. VarianceThreshold currently allows them in transform but not fit), and
2. the default parameter instance of an estimator not exhibiting its allowance of NaN/Inf.
Your suggestion of ALLOW_NAN_TRANSFORM would solve 1 and I could try to implement it if we decide it's the right way to go. But 2 would still not be solved since these estimators would belong in the ALLOW_NAN_FIT category but still error in the pickle check with NaN. So I think weakening the pickle check is the best option without changing the default scorer. I've updated my code to include this new version of check_estimators_pickle.
I think VarianceThreshold should be updated as well. All it should require is use of nanvar as far as I can see (and testing in both the sparse and dense cases).
I'm still not certain that the change to the pickle test is the best thing to do, but it's okay.
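For readers unfamiliar with the nanvar idea mentioned here (numpy.nanvar computes variance while skipping NaN entries), a dependency-free sketch follows. The helper names nan_variance and select_by_variance are made up for this illustration and are not scikit-learn's implementation; the sketch covers dense data only, not the sparse case.

```python
import math


def nan_variance(column):
    """Population variance of a column, ignoring NaN entries."""
    values = [v for v in column if not math.isnan(v)]
    if not values:
        return float("nan")  # nothing observed, variance is undefined
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def select_by_variance(X, threshold=0.0):
    """Indices of columns whose NaN-ignoring variance exceeds the threshold."""
    columns = list(zip(*X))
    return [j for j, col in enumerate(columns) if nan_variance(col) > threshold]


X = [
    [1.0, 5.0, float("nan")],
    [2.0, 5.0, 3.0],
    [float("nan"), 5.0, 7.0],
]
kept = select_by_variance(X)  # the constant middle column is dropped
```

An all-NaN column yields a NaN variance, which never exceeds the threshold, so in this sketch such a column is dropped rather than raising.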
Are there other changes you hope to make before making this MRG?
When the scope of the PR is finalised, please add an entry to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:
It would probably be good to document this feature somewhere, e.g. in Notes sections of estimator docstrings.
Okay I updated VarianceThreshold, added tests, and added to Notes documentation and whats_new. I think this is all I can think of for this PR right now. Should I change it to MRG?
Yes, change to mrg. I hope to review soon! Thanks
… default score_funcs do not currently allow nans
…or univariate selectors
Thanks for continuing on this. I wonder if we should pull univariate into a separate PR and merge the rest :|
-        X = check_array(X, dtype=None, accept_sparse='csr')
+        tags = self._get_tags()
+        X = check_array(X, dtype=None, accept_sparse='csr',
+                        force_all_finite=not tags.get('allow_nan', True))
this logic now won't work for univariate :(
Ah yeah I removed the test for the univariate but I wasn't sure what to do for that case. I guess the only way you could really use it would be to subclass and override the tags.
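To make the "subclass and override the tags" workaround concrete, here is a toy model of the tag lookup, written without scikit-learn. All class names are hypothetical, and _get_tags here is a simplified stand-in for the real estimator-tags machinery; only the condition in transform mirrors the diff above (non-finite values are rejected unless the allow_nan tag is true).

```python
import math


class ToySelector:
    """Hypothetical stand-in for a selector whose transform consults a tag."""

    def _more_tags(self):
        return {"allow_nan": True}

    def _get_tags(self):
        tags = {}
        # Walk base classes first so subclasses override inherited tags.
        for klass in reversed(type(self).__mro__):
            more = klass.__dict__.get("_more_tags")
            if more is not None:
                tags.update(more(self))
        return tags

    def transform(self, X):
        # Same condition as force_all_finite=not tags.get('allow_nan', True).
        if not self._get_tags().get("allow_nan", True):
            for row in X:
                if not all(math.isfinite(v) for v in row):
                    raise ValueError("Input contains NaN or infinity")
        return X


class ToyUnivariate(ToySelector):
    """Mimics a univariate selector that is tagged as rejecting NaN."""

    def _more_tags(self):
        return {"allow_nan": False}


class NanTolerantUnivariate(ToyUnivariate):
    """User subclass that re-enables NaN, e.g. for a NaN-aware score function."""

    def _more_tags(self):
        return {"allow_nan": True}
```

The limitation raised in this thread is visible in the sketch: the tag is fixed per class, so it cannot follow a score function chosen at construction time without subclassing.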
@@ -361,6 +362,9 @@ def fit(self, X, y):
     def _check_params(self, X, y):
         pass

+
+    def _more_tags(self):
+        return {'allow_nan': False}
This should probably have a FIXME or similar to say this should depend on the score func
Yeah if we're not quite sure how we want it to work, we could leave that part out of it and just leave Univariate as not allowing nan/inf until it gets figured out.
We'd like to release very soon. If you could strip this back to leave out univariate, and then propose that in another PR with a question on what to do about tags, I think that would be the most pragmatic solution.
Thank you. Can we get another reviewer before release??
I'm slightly confused (but also severely jet-lagged so that might be it).
At master, transform does the default check for finiteness. We could make it always permissive but not sure if that plays nicely with estimator checks.
hm... but do we want to restrict this because of estimator checks? I'm not sure if we check nan in transform. Would the "right" fix be to add separate tags for fit and transform?
@@ -320,3 +336,25 @@ def test_threshold_without_refitting():
     # Set a higher threshold to filter out more features.
     model.threshold = "1.0 * mean"
     assert X_transform.shape[1] > model.transform(data).shape[1]
+
+
+def test_transform_accepts_nan_inf():
Can you also check for fit, maybe use HistGradientBoostingClassifier?
Happy to go with this solution for the release; adding a test for SelectFromModel with NaN in fit would be nice, though.
Thanks for the reviews. I added a test for NaN and Inf in
Thanks @adpeters! It's been very nice working with you.
@jnothman thanks for all your help on this! Glad I could contribute a little.
Reference Issues/PRs
Closes #10821
Closes #10985
What does this implement/fix? Explain your changes.
Allows NaN/Inf values in the RFE/RFECV.fit method as well as in SelectorMixin.transform. This affects all feature selection estimators that inherit from SelectorMixin (except for univariate selectors), which includes those in sklearn.feature_selection.variance_threshold and sklearn.linear_model.randomized_l1.
The RFE/RFECV.fit method does not need to check y, as any checks should be done by the estimator when it runs its own fit, so I changed check_X_y to check_array with just X, and allowed NaN/Inf values.
For SelectorMixin.transform, the method itself does not need the input to be free of NaN/Inf, so we should let inheritors do that check themselves if they need it.
Any other comments?