Missing values break feature selection with pipeline estimator #21743

bmreiniger · 2021-11-22T16:39:49Z

Describe the bug

Feature selectors were adapted in #11635 to allow missing values when their underlying estimator does, but that does not work if the estimator is a pipeline. The traceback below shows why: the fix in that PR checks the "allow_nan" tag of the underlying estimator, but Pipeline doesn't inherit that tag from its first (or any) step, and so keeps the default of False.

Discovered at https://stackoverflow.com/q/69883401/10495893
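
The mechanism described in this report can be illustrated with a small toy model that does not use scikit-learn at all. The classes `ToyImputer`, `ToyPipeline`, and `ToySelector` and the `get_tags` method are hypothetical stand-ins; only the tag name `allow_nan` and the `not tags.get("allow_nan", True)` lookup are taken from this report. It is a simplified sketch, assuming the composite reports a default tag instead of consulting its steps:

```python
import math

DEFAULT_TAGS = {"allow_nan": False}  # assumed default, as described in the report


class ToyImputer:
    """Hypothetical step that can handle NaN (replaces it with 0.0)."""

    def get_tags(self):
        return {**DEFAULT_TAGS, "allow_nan": True}

    def fit_transform(self, X):
        return [[0.0 if math.isnan(v) else v for v in row] for row in X]


class ToyPipeline:
    """Hypothetical composite that does NOT propagate tags from its steps."""

    def __init__(self, *steps):
        self.steps = steps

    def get_tags(self):
        return dict(DEFAULT_TAGS)  # the first step's allow_nan=True is ignored


class ToySelector:
    """Hypothetical selector that validates X based on the estimator's tag."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X):
        tags = self.estimator.get_tags()
        force_all_finite = not tags.get("allow_nan", True)
        if force_all_finite and any(math.isnan(v) for row in X for v in row):
            raise ValueError("Input contains NaN")
        return self


X = [[1.0, float("nan")], [3.0, 4.0]]

ToySelector(ToyImputer()).fit(X)  # passes: the tag says NaN is allowed

try:
    ToySelector(ToyPipeline(ToyImputer())).fit(X)
    pipeline_rejected = False
except ValueError:
    pipeline_rejected = True  # rejected before the imputer ever sees X
```

In this toy, the selector rejects the data even though the pipeline's first step would have imputed it, which mirrors the failure reported here.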

Steps/Code to Reproduce

from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, SequentialFeatureSelector
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
nan_inds = np.random.rand(*X.shape) < 0.1  # mark roughly 10% of the entries as missing
X[nan_inds] = np.nan

pipe = make_pipeline(
    SimpleImputer(),
    LogisticRegression(),
)

fs = RFE(estimator=pipe)
fs.fit(X, y)

The same error occurs with SelectKBest and SequentialFeatureSelector. A similar error occurs in SelectFromModel but at transform time rather than fit.

Expected Results

No error is thrown

Actual Results

ValueError                                Traceback (most recent call last)
<ipython-input-11-2744b13fcc5c> in <module>()
      1 fs = RFE(estimator=pipe)
----> 2 fs.fit(X, y)

5 frames
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, **fit_params)
    220             Fitted estimator.
    221         """
--> 222         return self._fit(X, y, **fit_params)
    223 
    224     def _fit(self, X, y, step_score=None, **fit_params):

/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in _fit(self, X, y, step_score, **fit_params)
    235             ensure_min_features=2,
    236             force_all_finite=not tags.get("allow_nan", True),
--> 237             multi_output=True,
    238         )
    239         error_msg = (

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    574                 y = check_array(y, **check_y_params)
    575             else:
--> 576                 X, y = check_X_y(X, y, **check_params)
    577             out = X, y
    578 

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    966         ensure_min_samples=ensure_min_samples,
    967         ensure_min_features=ensure_min_features,
--> 968         estimator=estimator,
    969     )
    970 

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    790 
    791         if force_all_finite:
--> 792             _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    793 
    794     if ensure_min_samples > 0:

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    114             raise ValueError(
    115                 msg_err.format(
--> 116                     type_err, msg_dtype if msg_dtype is not None else X.dtype
    117                 )
    118             )

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Versions

System:
    python: 3.7.12 (default, Sep 10 2021, 00:21:48)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
      pip: 21.1.3
      setuptools: 57.4.0
      sklearn: 1.0.1
      numpy: 1.19.5
      scipy: 1.4.1
      Cython: 0.29.24
      pandas: 1.1.5
      matplotlib: 3.2.2
      joblib: 1.1.0
      threadpoolctl: 3.0.0

Built with OpenMP: True

bmreiniger · 2021-11-22T16:55:18Z

In the interim, I had hoped that I could hack away the problem, but this didn't work:

import types
from sklearn.base import _safe_tags

def custom_tag_hacks(self):
    return {
        "pairwise": _safe_tags(self.steps[0][1], "pairwise"),  # from pipeline
        "allow_nan": True,  # fix our problem?
    }

pipe._more_tags = types.MethodType(custom_tag_hacks, pipe)
pipe._get_tags()["allow_nan"]
# still outputs False
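
One possible explanation for why this hack has no effect, assuming tag collection walks the class hierarchy and calls `_more_tags` through each class rather than through the instance: a method attached to the instance is never consulted. The toy below reproduces that behavior without scikit-learn. `ToyBase`, `ToyPipeline`, and `TolerantPipeline` are hypothetical, and the `_get_tags` body is a simplified sketch, not a copy of the library's code:

```python
import inspect
import types


class ToyBase:
    def _more_tags(self):
        return {"allow_nan": False}

    def _get_tags(self):
        # Collect tags class by class, from the most generic base to the
        # most specific one; later classes override earlier ones.
        collected = {}
        for klass in reversed(inspect.getmro(type(self))):
            if hasattr(klass, "_more_tags"):
                collected.update(klass._more_tags(self))
        return collected


class ToyPipeline(ToyBase):
    pass


def custom_tag_hacks(self):
    return {"allow_nan": True}


pipe = ToyPipeline()
pipe._more_tags = types.MethodType(custom_tag_hacks, pipe)
# The instance attribute is bypassed because lookup goes through the classes.
hacked_instance_tag = pipe._get_tags()["allow_nan"]


class TolerantPipeline(ToyPipeline):
    # Overriding at class level is picked up by the class-by-class walk.
    def _more_tags(self):
        return {"allow_nan": True}


subclass_tag = TolerantPipeline()._get_tags()["allow_nan"]
```

In this toy, overriding `_more_tags` on a subclass changes the tag while patching the instance does not. Whether the same workaround applies to a given scikit-learn version is not confirmed in this thread.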

glemaitre · 2021-11-22T19:21:05Z

This is indeed an issue and I am not sure what the right fix is. We could:

1. expose allow-nan for a pipeline and delegate the management of missing values to the underlying steps
2. introspect the pipeline and check if there is a step that should handle missing values

I would think that the first option would be permissive enough, but I don't know how badly it could fail. The second option would require thinking about the mechanism to actually set the tag.
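
The two options in this comment can be contrasted in a toy sketch. `Step`, `pipeline_tag_permissive`, and `pipeline_tag_introspect` are hypothetical names, and the introspection rule used here (look only at the first step) is just one possible rule, chosen for illustration:

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical pipeline step described only by its name and allow_nan tag."""

    name: str
    allow_nan: bool


def pipeline_tag_permissive(steps):
    # Option 1: always advertise allow_nan and let the steps raise on their own.
    return True


def pipeline_tag_introspect(steps):
    # Option 2 (one possible rule): the pipeline accepts NaN only if its first
    # step does, since the first step is what receives the raw input.
    return bool(steps) and steps[0].allow_nan


imputer_first = [Step("imputer", True), Step("classifier", False)]
scaler_first = [Step("strict_scaler", False), Step("imputer", True)]

results = {
    "imputer_first": (
        pipeline_tag_permissive(imputer_first),
        pipeline_tag_introspect(imputer_first),
    ),
    "scaler_first": (
        pipeline_tag_permissive(scaler_first),
        pipeline_tag_introspect(scaler_first),
    ),
}
```

Under option 1, the second pipeline would pass the selector's check and fail only later, inside the step that cannot handle NaN. Under option 2, it would be rejected up front, but the rule for combining step tags has to be designed.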

bmreiniger · 2021-11-22T19:33:54Z

Is there a reason not to just not do any validation, and leave it to the estimator?

Trying to get a pipeline (and other composites; consider a large column transformer in this situation!) to figure out whether it can handle missing values seems tricky. Some transformers might accept NaNs and pass them along until a later imputer fills them, or the final predictor may accept missing values...

glemaitre · 2021-11-22T20:09:59Z

> Is there a reason not to just not do any validation, and leave it to the estimator?

This would be option 1. But then, it is weird to me that RFE does not follow the exact same pattern by delegating.

jnothman · 2021-11-25T07:39:53Z

Yes, might be simplest for RFE to always permit NaN. RFE only really needs to validate that X has columns that can be indexed.

However, I don't think it's right that Pipeline defaults to an allow_nan: False state. It should behave permissively until proven otherwise.
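
As a rough illustration of validating only that X has columns that can be indexed, here is a plain-Python sketch. `check_indexable_columns` and `select_columns` are hypothetical helpers, not scikit-learn functions; the minimum of two features follows the `ensure_min_features=2` argument visible in the traceback:

```python
def check_indexable_columns(X, min_features=2):
    """Hypothetical minimal check: X is a non-empty, rectangular 2D table.

    Deliberately does not inspect the values, so NaN passes through
    and is left for the wrapped estimator to handle or reject.
    """
    if len(X) == 0:
        raise ValueError("X has no rows")
    n_features = len(X[0])
    if any(len(row) != n_features for row in X):
        raise ValueError("rows of X have different lengths")
    if n_features < min_features:
        raise ValueError(
            f"X has {n_features} feature(s); need at least {min_features}"
        )
    return n_features


def select_columns(X, support):
    """Keep the columns whose entry in the boolean mask `support` is True."""
    return [[v for v, keep in zip(row, support) if keep] for row in X]


X = [[1.0, float("nan"), 3.0], [4.0, 5.0, float("nan")]]
n_features = check_indexable_columns(X)  # no error despite the NaNs
X_reduced = select_columns(X, [True, False, True])
```

The selector's own work (dropping columns by mask) never needs finite values, so in this sketch the finiteness check is left entirely to the estimator.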

bmreiniger added the Bug: triage label Nov 22, 2021

glemaitre added Bug and removed Bug: triage labels Nov 22, 2021

thomasjpfan mentioned this issue Nov 28, 2021

FIX Adjust tags in RFE to allow nans by default #21807

Merged

cmarmo added module:feature_selection module:pipeline labels Sep 14, 2022

ogrisel closed this as completed in #21807 Oct 20, 2023
