Description
Describe the bug
Feature selectors were adapted in #11635 to allow missing values when their underlying estimator does, but this does not work when the estimator is a pipeline. The traceback below makes the reason apparent: the fix in that PR checks the "allow_nan" tag of the underlying estimator, but Pipeline does not inherit that tag from its first (or any) step, and so keeps the default, False.
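The mechanism can be illustrated without scikit-learn. The sketch below is a toy model only: every name in it (`ToyEstimator`, `ToyImputer`, `ToyClassifier`, `ToyPipeline`, `ForwardingPipeline`, `selector_rejects_nan`) is hypothetical and not part of the library. It assumes tags are a dict of defaults plus per-class overrides, and that the selector decides whether to reject NaN the same way the traceback's `not tags.get("allow_nan", True)` expression does.

```python
DEFAULT_TAGS = {"allow_nan": False}


class ToyEstimator:
    """Hypothetical base class: tags are the defaults plus class overrides."""

    extra_tags = {}

    def get_tags(self):
        return {**DEFAULT_TAGS, **self.extra_tags}


class ToyImputer(ToyEstimator):
    # An imputer accepts NaN by design.
    extra_tags = {"allow_nan": True}


class ToyClassifier(ToyEstimator):
    # No override, so allow_nan stays False.
    pass


class ToyPipeline(ToyEstimator):
    """Stores its steps but never consults them when reporting tags."""

    def __init__(self, *steps):
        self.steps = steps


class ForwardingPipeline(ToyPipeline):
    """One possible remedy (an assumption, not the library's fix):
    report the first step's allow_nan tag as the pipeline's own."""

    def get_tags(self):
        tags = super().get_tags()
        tags["allow_nan"] = self.steps[0].get_tags()["allow_nan"]
        return tags


def selector_rejects_nan(estimator):
    # Same shape as the check in the traceback:
    # force_all_finite = not tags.get("allow_nan", True)
    return not estimator.get_tags().get("allow_nan", True)


plain = ToyPipeline(ToyImputer(), ToyClassifier())
forwarding = ForwardingPipeline(ToyImputer(), ToyClassifier())
```

Here `selector_rejects_nan(plain)` is true even though the first step would happily impute the NaNs, which is the situation described in this report; `selector_rejects_nan(forwarding)` is false.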
Discovered at https://stackoverflow.com/q/69883401/10495893
Steps/Code to Reproduce
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, SequentialFeatureSelector
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
nan_inds = np.random.rand(*X.shape) < 0.1  # mask roughly 10% of the entries
X[nan_inds] = np.nan
pipe = make_pipeline(
    SimpleImputer(),
    LogisticRegression(),
)
fs = RFE(estimator=pipe)
fs.fit(X, y)
The same error occurs with SelectKBest and SequentialFeatureSelector. A similar error occurs in SelectFromModel, but at transform time rather than at fit time.
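Until the tag is propagated, one way to sidestep the error is to invert the nesting so that the imputer runs before the selector instead of inside it. The following is a minimal sketch under stated assumptions: the data is synthetic (not the breast cancer dataset used above), and `n_features_to_select=3` is an arbitrary choice for illustration. Note that this is not equivalent to the failing snippet, since imputation happens once up front rather than inside each refit that RFE performs.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for this sketch): 200 samples, 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # roughly 10% missing values

# Impute first, then select: RFE only ever sees the imputed array,
# so its finiteness check on the bare LogisticRegression passes.
workaround = make_pipeline(
    SimpleImputer(),
    RFE(estimator=LogisticRegression(), n_features_to_select=3),
)
workaround.fit(X, y)

# Boolean mask of the features RFE kept.
support = workaround[-1].support_
```

The same inversion should apply to SequentialFeatureSelector and SelectFromModel, though it has only been sketched here for RFE.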
Expected Results
No error is thrown
Actual Results
ValueError Traceback (most recent call last)
<ipython-input-11-2744b13fcc5c> in <module>()
1 fs = RFE(estimator=pipe)
----> 2 fs.fit(X, y)
5 frames
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, **fit_params)
220 Fitted estimator.
221 """
--> 222 return self._fit(X, y, **fit_params)
223
224 def _fit(self, X, y, step_score=None, **fit_params):
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in _fit(self, X, y, step_score, **fit_params)
235 ensure_min_features=2,
236 force_all_finite=not tags.get("allow_nan", True),
--> 237 multi_output=True,
238 )
239 error_msg = (
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
574 y = check_array(y, **check_y_params)
575 else:
--> 576 X, y = check_X_y(X, y, **check_params)
577 out = X, y
578
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
966 ensure_min_samples=ensure_min_samples,
967 ensure_min_features=ensure_min_features,
--> 968 estimator=estimator,
969 )
970
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
790
791 if force_all_finite:
--> 792 _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
793
794 if ensure_min_samples > 0:
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
114 raise ValueError(
115 msg_err.format(
--> 116 type_err, msg_dtype if msg_dtype is not None else X.dtype
117 )
118 )
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Versions
System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0.1
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True