Missing values break feature selection with pipeline estimator #21743

Closed
bmreiniger opened this issue Nov 22, 2021 · 5 comments · Fixed by #21807

Comments

@bmreiniger
Contributor

bmreiniger commented Nov 22, 2021

Describe the bug

Feature selectors were adapted in #11635 to allow missing values when their underlying estimator does, but that does not work if the estimator is a pipeline. The traceback below makes the reason apparent: the fix in the above-mentioned PR checks the "allow_nan" tag of the underlying estimator, but Pipeline doesn't inherit that tag from any of its steps (not even the first), and so keeps its default of False.
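A quick way to see the tag mismatch (a small sketch poking at sklearn's private _get_tags API, so details may vary between versions): the imputer itself advertises allow_nan=True, while the pipeline wrapping it reports the default False, and the default is what the selector ends up checking.

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

print(SimpleImputer()._get_tags()["allow_nan"])
# True: the imputer accepts NaN
print(make_pipeline(SimpleImputer(), LogisticRegression())._get_tags()["allow_nan"])
# False: Pipeline keeps the default tag, and this is the value the selector checks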

Discovered at https://stackoverflow.com/q/69883401/10495893

Steps/Code to Reproduce

from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, SequentialFeatureSelector
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
nan_inds = np.random.rand(*X.shape) < 0.1  # mark roughly 10% of the entries as missing
X[nan_inds] = np.nan

pipe = make_pipeline(
    SimpleImputer(),
    LogisticRegression(),
)

fs = RFE(estimator=pipe)
fs.fit(X, y)

The same error occurs with SelectKBest and SequentialFeatureSelector. A similar error occurs with SelectFromModel, but at transform time rather than at fit time.

Expected Results

No error is thrown

Actual Results

ValueError                                Traceback (most recent call last)
<ipython-input-11-2744b13fcc5c> in <module>()
      1 fs = RFE(estimator=pipe)
----> 2 fs.fit(X, y)

5 frames
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, **fit_params)
    220             Fitted estimator.
    221         """
--> 222         return self._fit(X, y, **fit_params)
    223 
    224     def _fit(self, X, y, step_score=None, **fit_params):

/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in _fit(self, X, y, step_score, **fit_params)
    235             ensure_min_features=2,
    236             force_all_finite=not tags.get("allow_nan", True),
--> 237             multi_output=True,
    238         )
    239         error_msg = (

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    574                 y = check_array(y, **check_y_params)
    575             else:
--> 576                 X, y = check_X_y(X, y, **check_params)
    577             out = X, y
    578 

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    966         ensure_min_samples=ensure_min_samples,
    967         ensure_min_features=ensure_min_features,
--> 968         estimator=estimator,
    969     )
    970 

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    790 
    791         if force_all_finite:
--> 792             _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    793 
    794     if ensure_min_samples > 0:

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    114             raise ValueError(
    115                 msg_err.format(
--> 116                     type_err, msg_dtype if msg_dtype is not None else X.dtype
    117                 )
    118             )

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Versions

System:
    python: 3.7.12 (default, Sep 10 2021, 00:21:48)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
      pip: 21.1.3
      setuptools: 57.4.0
      sklearn: 1.0.1
      numpy: 1.19.5
      scipy: 1.4.1
      Cython: 0.29.24
      pandas: 1.1.5
      matplotlib: 3.2.2
      joblib: 1.1.0
      threadpoolctl: 3.0.0

Built with OpenMP: True
@bmreiniger
Contributor Author

In the interim, I had hoped that I could hack around the problem, but this didn't work:

import types
from sklearn.utils._tags import _safe_tags

def custom_tag_hacks(self):
    return {
        "pairwise": _safe_tags(self.steps[0][1], "pairwise"),  # from pipeline
        "allow_nan": True,  # fix our problem?
    }

pipe._more_tags = types.MethodType(custom_tag_hacks, pipe)
pipe._get_tags()["allow_nan"]
# still outputs False
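The instance-level patch is ignored because _get_tags() collects _more_tags from the classes in the MRO of type(self), not from attributes set on the instance. A rough workaround sketch, assuming it is acceptable to blanket-allow NaN (NaNTolerantPipeline is just an illustrative name):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

class NaNTolerantPipeline(Pipeline):
    def _more_tags(self):
        # Assumption: let NaN through and rely on the individual steps to
        # complain if they truly cannot handle missing values.
        return {"allow_nan": True}

nan_pipe = NaNTolerantPipeline([
    ("imputer", SimpleImputer()),
    ("clf", LogisticRegression()),
])
print(nan_pipe._get_tags()["allow_nan"])  # True

The feature selectors look up the wrapped estimator's allow_nan tag, so with this subclass that lookup comes back True.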

@glemaitre glemaitre added Bug and removed Bug: triage labels Nov 22, 2021
@glemaitre
Member

This is indeed an issue, and I am not sure what the right fix is. We could:

  • expose allow_nan for a pipeline and delegate the management of missing values to the underlying steps
  • introspect the pipeline and check whether there is a step that should handle missing values

I would think the first option is permissive enough, but I don't know how badly it can fail. The second option would require thinking about the mechanism for actually setting the tag; a rough sketch of that second option follows.
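A rough sketch of what the second option might look like (a hypothetical helper, not an existing sklearn API): derive the pipeline's allow_nan from its first step, since that is the only step that ever sees the raw input.

from sklearn.utils._tags import _safe_tags

def _pipeline_allows_nan(pipeline):
    # Hypothetical helper: the raw X is only seen by the first step, so the
    # pipeline can accept NaN input if that step (e.g. an imputer) does.
    first_step = pipeline.steps[0][1]
    if first_step is None or first_step == "passthrough":
        return False
    return _safe_tags(first_step, key="allow_nan")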

@bmreiniger
Contributor Author

bmreiniger commented Nov 22, 2021

Is there a reason not to simply skip the validation and leave it to the estimator?

Trying to get a pipeline (and other composites; consider a large column transformer in this situation!) to figure out whether it can handle missing values seems tricky. Some transformers might accept NaNs and pass them along until an imputer later fills them, or the final predictor may accept missing values itself...
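For illustration, a made-up composite where a single allow_nan answer is hard to pin down: one branch tolerates NaN and another raises, so the honest answer depends on which columns the NaNs land in.

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

ct = ColumnTransformer([
    ("imputed", SimpleImputer(), [0, 1]),      # tolerates NaN in its columns
    ("reduced", PCA(n_components=1), [2, 3]),  # raises if its columns contain NaN
])
# No single boolean tag on ct can capture "NaN is fine in columns 0-1 but not in 2-3".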

@glemaitre
Member

Is there a reason not to simply skip the validation and leave it to the estimator?

This would be option 1. But then, it is weird to me that RFE does not already follow exactly that pattern by delegating.

@jnothman
Member

Yes, it might be simplest for RFE to always permit NaN. RFE only really needs to validate that X has columns that can be indexed.

However, I don't think it's right that Pipeline defaults to an allow_nan: False state. It should behave permissively until proven otherwise.
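In the meantime, a user-level approximation of "RFE always permits NaN" (a hypothetical subclass, reusing pipe, X and y from the reproduction above): force the allow_nan tag on RFE itself, since _fit uses that tag to decide whether to run the finiteness check seen in the traceback.

from sklearn.feature_selection import RFE

class NaNPermissiveRFE(RFE):
    def _more_tags(self):
        # Forcing allow_nan=True makes _fit validate with force_all_finite=False,
        # leaving NaN handling to the wrapped estimator.
        return {"allow_nan": True}

fs = NaNPermissiveRFE(
    estimator=pipe,
    # A Pipeline exposes no coef_ of its own, so point RFE at the final step's
    # coefficients explicitly.
    importance_getter="named_steps.logisticregression.coef_",
)
fs.fit(X, y)  # the NaN validation error from the report no longer triggers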
