DecisionTree does not handle properly missing values in criterion partitioning #28254

Closed
@ehan03

Description

Describe the bug

I tried using RFECV with RandomForestClassifier in version 1.4.0 on data containing NaNs and got the following error:

ValueError: Input contains NaN.

This is my first time opening an issue on an open-source project, so I apologize if it is poorly formatted or lacking in detail. Please let me know if I can provide more information.

Steps/Code to Reproduce

import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=True, return_X_y=True)

rng = np.random.RandomState(42)
X_missing = X.copy()
mask = rng.binomial(n=np.array([1, 1, 1, 1]).reshape(1, -1),
                    p=(X['petal length (cm)'] / 8).values.reshape(-1, 1)).astype(bool)
X_missing[mask] = np.nan  # np.NaN alias was removed in NumPy 2.0; np.nan works on all versions

X_train, X_test, y_train, y_test = train_test_split(X_missing, y, random_state=13)

clf = RandomForestClassifier()
selector = RFECV(clf, cv=3)

selector.fit(X_train, y_train)
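
The mask in the snippet above relies on NumPy broadcasting: `n` has shape `(1, 4)` and `p` has shape `(150, 1)`, so `binomial` draws a `(150, 4)` array of 0/1 values in which rows with a longer petal are more likely to be marked missing. The following is only an illustrative sketch of that step. It assumes plain NumPy arrays instead of the DataFrame used in the report (column index 2 stands in for `'petal length (cm)'`), and the names `n` and `p` are introduced here for readability.

```python
import numpy as np
from sklearn.datasets import load_iris

# Plain arrays instead of a DataFrame; column 2 is petal length (cm).
X, y = load_iris(return_X_y=True)

rng = np.random.RandomState(42)
n = np.ones((1, 4), dtype=int)      # one Bernoulli trial per feature column
p = (X[:, 2] / 8).reshape(-1, 1)    # per-row probability of being missing

# (1, 4) and (150, 1) broadcast to (150, 4): one draw per cell.
mask = rng.binomial(n=n, p=p).astype(bool)

X_missing = X.copy()
X_missing[mask] = np.nan

print(mask.shape)          # shape of the boolean mask
print(mask.mean(axis=0))   # fraction of missing values per feature
```

Because the missingness probability depends on petal length, the values are not missing completely at random, which may matter when comparing against other datasets.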

Expected Results

I would expect no error, since RandomForestClassifier supports NaNs and according to the documentation for RFECV:

[image: screenshot of the RFECV documentation]

For instance, the following code works just fine:

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[32], line 14
     11 clf = RandomForestClassifier()
     12 selector = RFECV(clf, cv=3)
---> 14 selector.fit(X_train, y_train)

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\base.py:1351, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1344     estimator._validate_params()
   1346 with config_context(
   1347     skip_parameter_validation=(
   1348         prefer_skip_nested_validation or global_skip_validation
   1349     )
   1350 ):
-> 1351     return fit_method(estimator, *args, **kwargs)

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:746, in RFECV.fit(self, X, y, groups)
    743     parallel = Parallel(n_jobs=self.n_jobs)
    744     func = delayed(_rfe_single_fit)
--> 746 scores = parallel(
    747     func(rfe, self.estimator, X, y, train, test, scorer)
    748     for train, test in cv.split(X, y, groups)
    749 )
    751 scores = np.array(scores)
    752 scores_sum = np.sum(scores, axis=0)

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:747, in <genexpr>(.0)
    743     parallel = Parallel(n_jobs=self.n_jobs)
    744     func = delayed(_rfe_single_fit)
    746 scores = parallel(
--> 747     func(rfe, self.estimator, X, y, train, test, scorer)
    748     for train, test in cv.split(X, y, groups)
    749 )
    751 scores = np.array(scores)
    752 scores_sum = np.sum(scores, axis=0)

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:35, in _rfe_single_fit(rfe, estimator, X, y, train, test, scorer)
     33 X_train, y_train = _safe_split(estimator, X, y, train)
     34 X_test, y_test = _safe_split(estimator, X, y, test, train)
---> 35 return rfe._fit(
     36     X_train,
     37     y_train,
     38     lambda estimator, features: _score(
     39         # TODO(SLEP6): pass score_params here
     40         estimator,
     41         X_test[:, features],
     42         y_test,
     43         scorer,
     44         score_params=None,
     45     ),
     46 ).scores_

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:308, in RFE._fit(self, X, y, step_score, **fit_params)
    305 estimator.fit(X[:, features], y, **fit_params)
    307 # Get importance and rank them
--> 308 importances = _get_feature_importances(
    309     estimator,
    310     self.importance_getter,
    311     transform_func="square",
    312 )
    313 ranks = np.argsort(importances)
    315 # for sparse case ranks is matrix

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_base.py:238, in _get_feature_importances(estimator, getter, transform_func, norm_order)
    236 elif transform_func == "square":
    237     if importances.ndim == 1:
--> 238         importances = safe_sqr(importances)
    239     else:
    240         importances = safe_sqr(importances).sum(axis=0)

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\__init__.py:773, in safe_sqr(X, copy)
    757 def safe_sqr(X, *, copy=True):
    758     """Element wise squaring of array-likes and sparse matrices.
    759
    760     Parameters
   (...)
    771          Return the element-wise square of the input.
    772     """
--> 773     X = check_array(X, accept_sparse=["csr", "csc", "coo"], ensure_2d=False)
    774     if issparse(X):
    775         if copy:

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\validation.py:1003, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    997     raise ValueError(
    998         "Found array with dim %d. %s expected <= 2."
    999         % (array.ndim, estimator_name)
   1000     )
   1002 if force_all_finite:
-> 1003     _assert_all_finite(
   1004         array,
   1005         input_name=input_name,
   1006         estimator_name=estimator_name,
   1007         allow_nan=force_all_finite == "allow-nan",
   1008     )
   1010 if copy:
   1011     if _is_numpy_namespace(xp):
   1012         # only make a copy if `array` and `array_orig` may share memory`

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\validation.py:126, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    123 if first_pass_isfinite:
    124     return
--> 126 _assert_all_finite_element_wise(
    127     X,
    128     xp=xp,
    129     allow_nan=allow_nan,
    130     msg_dtype=msg_dtype,
    131     estimator_name=estimator_name,
    132     input_name=input_name,
    133 )

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\validation.py:175, in _assert_all_finite_element_wise(X, xp, allow_nan, msg_dtype, estimator_name, input_name)
    158 if estimator_name and input_name == "X" and has_nan_error:
    159     # Improve the error message on how to handle missing values in
    160     # scikit-learn.
    161     msg_err += (
    162         f"\n{estimator_name} does not accept missing values"
    163         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    173         "#estimators-that-handle-nan-values"
    174     )
--> 175 raise ValueError(msg_err)

ValueError: Input contains NaN.

Versions

System:
    python: 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)]
executable: c:\Users\eugen\anaconda3\envs\ufc\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.4.0
          pip: 23.0.1
   setuptools: 66.0.0
        numpy: 1.26.0
        scipy: 1.11.3
       Cython: 3.0.0
       pandas: 1.5.3
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: C:\Users\eugen\anaconda3\envs\ufc\Library\bin\mkl_rt.2.dll
         prefix: mkl_rt
       user_api: blas
   internal_api: mkl
        version: 2022.1-Product
    num_threads: 8
threading_layer: intel

       filepath: C:\Users\eugen\anaconda3\envs\ufc\vcomp140.dll
         prefix: vcomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 16

Labels

Bug, High Priority
