Description
Describe the bug
I tried using RFECV with RandomForestClassifier in version 1.4.0 on data containing NaNs and got the following error:
ValueError: Input contains NaN.
This is my first time opening an issue on an open-source project, so I apologize if it is ill-formatted or lacking in detail. Please let me know if I can provide more information.
Steps/Code to Reproduce
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X, y = load_iris(as_frame=True, return_X_y=True)
rng = np.random.RandomState(42)
X_missing = X.copy()
# Mask each cell with a probability proportional to the row's petal length
mask = rng.binomial(n=np.array([1, 1, 1, 1]).reshape(1, -1),
                    p=(X['petal length (cm)'] / 8).values.reshape(-1, 1)).astype(bool)
# np.nan rather than np.NaN: the np.NaN alias was removed in NumPy 2.0
X_missing[mask] = np.nan
X_train, X_test, y_train, y_test = train_test_split(X_missing, y, random_state=13)
clf = RandomForestClassifier()
selector = RFECV(clf, cv=3)
# Raises "ValueError: Input contains NaN." on scikit-learn 1.4.0
selector.fit(X_train, y_train)
Expected Results
I would expect no error since RandomForestClassifier supports NaNs and according to the documentation for RFECV,
For instance, fitting the classifier directly on the same data works just fine:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
Actual Results
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[32], line 14
     11 clf = RandomForestClassifier()
     12 selector = RFECV(clf, cv=3)
---> 14 selector.fit(X_train, y_train)
File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\base.py:1351, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1344     estimator._validate_params()
   1346 with config_context(
   1347     skip_parameter_validation=(
   1348         prefer_skip_nested_validation or global_skip_validation
   1349     )
   1350 ):
-> 1351     return fit_method(estimator, *args, **kwargs)
File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:746, in RFECV.fit(self, X, y, groups)
    743     parallel = Parallel(n_jobs=self.n_jobs)
    744     func = delayed(_rfe_single_fit)
--> 746 scores = parallel(
    747     func(rfe, self.estimator, X, y, train, test, scorer)
    748     for train, test in cv.split(X, y, groups)
    749 )
    751 scores = np.array(scores)
    752 scores_sum = np.sum(scores, axis=0)
File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:747, in <genexpr>(.0)
    743     parallel = Parallel(n_jobs=self.n_jobs)
    744     func = delayed(_rfe_single_fit)
    746 scores = parallel(
--> 747     func(rfe, self.estimator, X, y, train, test, scorer)
    748     for train, test in cv.split(X, y, groups)
    749 )
    751 scores = np.array(scores)
    752 scores_sum = np.sum(scores, axis=0)
File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:35, in _rfe_single_fit(rfe, estimator, X, y, train, test, scorer)
     33 X_train, y_train = _safe_split(estimator, X, y, train)
     34 X_test, y_test = _safe_split(estimator, X, y, test, train)
---> 35 return rfe._fit(
     36     X_train,
     37     y_train,
     38     lambda estimator, features: _score(
     39         # TODO(SLEP6): pass score_params here
     40         estimator,
     41         X_test[:, features],
     42         y_test,
     43         scorer,
     44         score_params=None,
     45     ),
     46 ).scores_
File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:308, in RFE._fit(self, X, y, step_score, **fit_params)
    305 estimator.fit(X[:, features], y, **fit_params)
    307 # Get importance and rank them
--> 308 importances = _get_feature_importances(
    309     estimator,
    310     self.importance_getter,
    311     transform_func="square",
    312 )
    313 ranks = np.argsort(importances)
    315 # for sparse case ranks is matrix
File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_base.py:238, in _get_feature_importances(estimator, getter, transform_func, norm_order)
    236 elif transform_func == "square":
    237     if importances.ndim == 1:
--> 238         importances = safe_sqr(importances)
    239     else:
    240         importances = safe_sqr(importances).sum(axis=0)
File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\__init__.py:773, in safe_sqr(X, copy)
    757 def safe_sqr(X, *, copy=True):
    758     """Element wise squaring of array-likes and sparse matrices.
    759
    760     Parameters
   (...)
    771     Return the element-wise square of the input.
    772     """
--> 773     X = check_array(X, accept_sparse=["csr", "csc", "coo"], ensure_2d=False)
    774     if issparse(X):
    775         if copy:
File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\validation.py:1003, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    997     raise ValueError(
    998         "Found array with dim %d. %s expected <= 2."
    999         % (array.ndim, estimator_name)
   1000     )
   1002 if force_all_finite:
-> 1003     _assert_all_finite(
   1004         array,
   1005         input_name=input_name,
   1006         estimator_name=estimator_name,
   1007         allow_nan=force_all_finite == "allow-nan",
   1008     )
   1010 if copy:
   1011     if _is_numpy_namespace(xp):
   1012         # only make a copy if `array` and `array_orig` may share memory`
File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\validation.py:126, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    123 if first_pass_isfinite:
    124     return
--> 126 _assert_all_finite_element_wise(
    127     X,
    128     xp=xp,
    129     allow_nan=allow_nan,
    130     msg_dtype=msg_dtype,
    131     estimator_name=estimator_name,
    132     input_name=input_name,
    133 )
File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\validation.py:175, in _assert_all_finite_element_wise(X, xp, allow_nan, msg_dtype, estimator_name, input_name)
    158 if estimator_name and input_name == "X" and has_nan_error:
    159     # Improve the error message on how to handle missing values in
    160     # scikit-learn.
    161     msg_err += (
    162         f"\n{estimator_name} does not accept missing values"
    163         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    173         "#estimators-that-handle-nan-values"
    174     )
--> 175 raise ValueError(msg_err)
ValueError: Input contains NaN.
Versions
System:
    python: 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)]
    executable: c:\Users\eugen\anaconda3\envs\ufc\python.exe
    machine: Windows-10-10.0.19045-SP0
Python dependencies:
    sklearn: 1.4.0
    pip: 23.0.1
    setuptools: 66.0.0
    numpy: 1.26.0
    scipy: 1.11.3
    Cython: 3.0.0
    pandas: 1.5.3
    matplotlib: 3.6.2
    joblib: 1.2.0
    threadpoolctl: 2.2.0
Built with OpenMP: True
threadpoolctl info:
    filepath: C:\Users\eugen\anaconda3\envs\ufc\Library\bin\mkl_rt.2.dll
    prefix: mkl_rt
    user_api: blas
    internal_api: mkl
    version: 2022.1-Product
    num_threads: 8
    threading_layer: intel

    filepath: C:\Users\eugen\anaconda3\envs\ufc\vcomp140.dll
    prefix: vcomp
    user_api: openmp
    internal_api: openmp
    version: None
    num_threads: 16