Missing values break feature selection with pipeline estimator #21743
Comments
In the interim, I had hoped that I could hack away the problem, but this didn't work:

    import types
    from sklearn.base import _safe_tags

    def custom_tag_hacks(self):
        return {
            "pairwise": _safe_tags(self.steps[0][1], "pairwise"),  # from pipeline
            "allow_nan": True,  # fix our problem?
        }

    pipe._more_tags = types.MethodType(custom_tag_hacks, pipe)
    pipe._get_tags()["allow_nan"]
    # still outputs False
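A likely reason the patch has no effect, assuming `_get_tags` gathers tags by walking the class MRO and calling each class's `_more_tags` with the instance (as scikit-learn's `BaseEstimator` appeared to do around the time of this issue): a method bound on the instance is never consulted. The toy classes below are hypothetical stand-ins, not scikit-learn code; they only mimic that lookup.

```python
import inspect
import types


class ToyBase:
    """Hypothetical stand-in for an estimator base class."""

    def _more_tags(self):
        return {"allow_nan": False}

    def _get_tags(self):
        # Collect tags from every class in the MRO, most generic first.
        # The lookup goes through the class, never through the instance.
        tags = {}
        for klass in reversed(inspect.getmro(type(self))):
            if "_more_tags" in vars(klass):
                tags.update(klass._more_tags(self))
        return tags


class ToyPipeline(ToyBase):
    pass


def custom_tag_hacks(self):
    return {"allow_nan": True}


# Same trick as in the comment above: bind the method on the instance.
pipe = ToyPipeline()
pipe._more_tags = types.MethodType(custom_tag_hacks, pipe)
instance_patch_result = pipe._get_tags()["allow_nan"]


class NanFriendlyPipeline(ToyPipeline):
    # Overriding on a subclass puts the method where the MRO walk finds it.
    def _more_tags(self):
        return {"allow_nan": True}


subclass_result = NanFriendlyPipeline()._get_tags()["allow_nan"]
print(instance_patch_result, subclass_result)
```

If that assumption holds for the installed version, overriding `_more_tags` on a `Pipeline` subclass would be the equivalent workaround, rather than patching the instance.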
This is indeed an issue and I am not sure what the right fix is. We could
I would think that the first option would be permissive enough, but I don't know how badly it can fail. The second option would require thinking about the mechanism to actually set the tag.
Is there a reason not to just skip validation entirely and leave it to the estimator? Trying to get a pipeline (and other composites; consider a large column transformer in this situation!) to figure out whether it can handle missing values seems tricky. Some transformers might accept but pass along NaNs, until later an imputer may fill them, or the final predictor may accept missing values...
This would be option 1. But then, this is weird to me that
Yes, it might be simplest for RFE to always permit NaN. RFE only really needs to validate that X has columns that can be indexed. However, I don't think it's right that Pipeline defaults to an allow_nan: False state. It should behave permissively until proven otherwise.
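One possible reading of "permissive until proven otherwise", sketched with plain dictionaries rather than real estimators: the composite reports `allow_nan` as true unless its first step, the only one that sees the raw input, explicitly says otherwise. The function name `composite_allows_nan` and the `(name, tags)` step format are hypothetical and chosen only for illustration.

```python
def composite_allows_nan(steps, default=True):
    """Return the allow_nan tag a composite might report.

    `steps` is a list of (name, tags) pairs, where `tags` is a dict.
    Only the first step receives the raw input, so only its tag matters here;
    later steps see whatever the earlier ones produce.
    """
    if not steps:
        return default
    first_tags = steps[0][1]
    # Stay permissive when the first step does not declare the tag at all.
    return first_tags.get("allow_nan", default)


imputer_first = [("imputer", {"allow_nan": True}), ("clf", {"allow_nan": False})]
strict_first = [("scaler", {"allow_nan": False}), ("clf", {"allow_nan": True})]
undeclared = [("custom", {}), ("clf", {"allow_nan": False})]

print(composite_allows_nan(imputer_first))
print(composite_allows_nan(strict_first))
print(composite_allows_nan(undeclared))
```

Under this reading a pipeline that starts with an imputer is treated as NaN-tolerant even if its final predictor is not, which matches the concern raised above about transformers that pass NaNs along until a later step handles them.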
Describe the bug
Feature selectors were adapted to allow missing values if their underlying estimator does in #11635, but that does not work if the estimator is a pipeline. In the traceback (below), the reason becomes apparent: the fix in the above-mentioned PR checks the tag "allow_nan" of the underlying estimator, but Pipeline doesn't inherit that tag from a (its first, or any) step, and so keeps its default False.
Discovered at https://stackoverflow.com/q/69883401/10495893
Steps/Code to Reproduce
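A minimal sketch of the kind of setup described, assuming RFE wrapped around a pipeline of `SimpleImputer` and `LogisticRegression`. The dataset, the step choices, and the `importance_getter` path are assumptions made for illustration, not the reporter's snippet. Whether the fit raises depends on the installed scikit-learn version, so the outcome is captured rather than asserted.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X = X.copy()
X[0, 0] = np.nan  # a single missing value is enough

# The pipeline itself handles NaN: the imputer fills it before the classifier.
pipe = make_pipeline(SimpleImputer(), LogisticRegression(max_iter=1000))

selector = RFE(
    pipe,
    n_features_to_select=2,
    # RFE cannot find coef_ on a Pipeline directly, so point it at the final step.
    importance_getter="named_steps.logisticregression.coef_",
)

try:
    selector.fit(X, y)
    outcome = "fit succeeded"
except ValueError as exc:
    # On affected versions, input validation in RFE rejects NaN
    # before the pipeline ever sees the data.
    outcome = f"ValueError: {exc}"

print(outcome)
```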
The same error occurs with SelectKBest and SequentialFeatureSelector. A similar error occurs in SelectFromModel, but at transform time rather than fit.
Expected Results
No error is thrown
Actual Results
Versions