Missing values break feature selection with pipeline estimator #21743

Closed
@bmreiniger

Description

Describe the bug

Feature selectors were adapted in #11635 to allow missing values when their underlying estimator does, but this does not work when the estimator is a Pipeline. The traceback below shows why: the fix in that PR checks the "allow_nan" tag of the underlying estimator, but Pipeline does not inherit that tag from its first step (or from any step), so it keeps the default value of False.
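The mechanism can be illustrated with a toy model that does not use scikit-learn at all. Every class and function here (ToyImputer, ToyPipeline, ForwardingPipeline, selector_rejects_nan) is hypothetical and only mimics the tag lookup visible in the traceback; ForwardingPipeline shows one possible way a wrapper could forward the tag, which is not necessarily how the library resolves it.

```python
class ToyImputer:
    """Hypothetical stand-in for a step that accepts NaN input."""

    def _get_tags(self):
        return {"allow_nan": True}


class ToyPipeline:
    """Hypothetical wrapper that reports the default tag and ignores its steps."""

    def __init__(self, steps):
        self.steps = steps

    def _get_tags(self):
        # The default is returned without consulting any step.
        return {"allow_nan": False}


class ForwardingPipeline(ToyPipeline):
    """Hypothetical variant that forwards the tag of its first step."""

    def _get_tags(self):
        return {"allow_nan": self.steps[0]._get_tags()["allow_nan"]}


def selector_rejects_nan(estimator):
    # Mirrors the expression in the traceback:
    # force_all_finite=not tags.get("allow_nan", True)
    tags = estimator._get_tags()
    return not tags.get("allow_nan", True)


plain = ToyPipeline([ToyImputer()])
forwarding = ForwardingPipeline([ToyImputer()])

print(selector_rejects_nan(plain))       # NaN is rejected before the imputer runs
print(selector_rejects_nan(forwarding))  # NaN is passed through to the imputer
```

In the toy model the selector never looks inside the wrapper, so an imputing first step makes no difference unless the wrapper itself reports the tag.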

Discovered at https://stackoverflow.com/q/69883401/10495893

Steps/Code to Reproduce

from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, SequentialFeatureSelector
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import numpy as np

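X, y = load_breast_cancer(return_X_y=True)
# Set roughly 10% of the entries to NaN. (randint(2, ...) < 0.1 would
# instead mask about half of them, since it only yields 0 or 1.)
nan_inds = np.random.rand(*X.shape) < 0.1
X[nan_inds] = np.nan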

pipe = make_pipeline(
    SimpleImputer(),
    LogisticRegression(),
)

fs = RFE(estimator=pipe)
fs.fit(X, y)

The same error occurs with SelectKBest and SequentialFeatureSelector. SelectFromModel raises a similar error, but at transform time rather than at fit time.

Expected Results

No error is thrown
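As long as the selector rejects NaN on behalf of a pipeline, one possible workaround is to move the imputer outside the selector so that RFE only ever sees imputed data. The sketch below is not part of the original report; the fixed seed, the StandardScaler step, and max_iter=1000 are assumptions added so the example converges quickly and is reproducible.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Set roughly 10% of the entries to NaN, with a fixed seed (an assumption).
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Imputation happens before the selector, so RFE's own input check
# never encounters NaN.
outer = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    RFE(estimator=LogisticRegression(max_iter=1000)),
)
outer.fit(X, y)

# RFE keeps half of the features by default.
n_selected = int(outer[-1].support_.sum())
print(n_selected)
```

This changes the setup slightly: imputation is fitted once on the full set of columns rather than refitted inside each RFE iteration. For the default mean strategy of SimpleImputer, which works column by column, the imputed values should be the same either way.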

Actual Results

ValueError                                Traceback (most recent call last)
<ipython-input-11-2744b13fcc5c> in <module>()
      1 fs = RFE(estimator=pipe)
----> 2 fs.fit(X, y)

5 frames
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, **fit_params)
    220             Fitted estimator.
    221         """
--> 222         return self._fit(X, y, **fit_params)
    223 
    224     def _fit(self, X, y, step_score=None, **fit_params):

/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in _fit(self, X, y, step_score, **fit_params)
    235             ensure_min_features=2,
    236             force_all_finite=not tags.get("allow_nan", True),
--> 237             multi_output=True,
    238         )
    239         error_msg = (

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    574                 y = check_array(y, **check_y_params)
    575             else:
--> 576                 X, y = check_X_y(X, y, **check_params)
    577             out = X, y
    578 

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    966         ensure_min_samples=ensure_min_samples,
    967         ensure_min_features=ensure_min_features,
--> 968         estimator=estimator,
    969     )
    970 

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    790 
    791         if force_all_finite:
--> 792             _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    793 
    794     if ensure_min_samples > 0:

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    114             raise ValueError(
    115                 msg_err.format(
--> 116                     type_err, msg_dtype if msg_dtype is not None else X.dtype
    117                 )
    118             )

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Versions

System:
    python: 3.7.12 (default, Sep 10 2021, 00:21:48)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
      pip: 21.1.3
      setuptools: 57.4.0
      sklearn: 1.0.1
      numpy: 1.19.5
      scipy: 1.4.1
      Cython: 0.29.24
      pandas: 1.1.5
      matplotlib: 3.2.2
      joblib: 1.1.0
      threadpoolctl: 3.0.0

Built with OpenMP: True
