DecisionTree does not handle properly missing values in criterion partitioning #28254

Closed
ehan03 opened this issue Jan 25, 2024 · 9 comments · Fixed by #28295
Labels: Bug, High Priority

@ehan03
ehan03 commented Jan 25, 2024

Describe the bug

I tried using RFECV with RandomForestClassifier in version 1.4.0 on data containing NaNs and got the following error:

ValueError: Input contains NaN.

This is my first time opening an issue on an open-source project, so I apologize if this is ill-formatted or lacking in detail. Please let me know if I can provide more information.

Steps/Code to Reproduce

import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=True, return_X_y=True)

rng = np.random.RandomState(42)
X_missing = X.copy()
mask = rng.binomial(n=np.array([1, 1, 1, 1]).reshape(1, -1),
                    p=(X['petal length (cm)'] / 8).values.reshape(-1, 1)).astype(bool)
X_missing[mask] = np.nan  # the np.NaN alias was removed in NumPy 2.0

X_train, X_test, y_train, y_test = train_test_split(X_missing, y, random_state=13)

clf = RandomForestClassifier()
selector = RFECV(clf, cv=3)

selector.fit(X_train, y_train)

Expected Results

I would expect no error, since RandomForestClassifier supports NaNs and the documentation for RFECV states the following:
[image: screenshot of the RFECV documentation]

For instance, the following code works just fine:

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[32], line 14
     11 clf = RandomForestClassifier()
     12 selector = RFECV(clf, cv=3)
---> 14 selector.fit(X_train, y_train)

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\base.py:1351, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1344     estimator._validate_params()
   1346 with config_context(
   1347     skip_parameter_validation=(
   1348         prefer_skip_nested_validation or global_skip_validation
   1349     )
   1350 ):
-> 1351     return fit_method(estimator, *args, **kwargs)

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:746, in RFECV.fit(self, X, y, groups)
    743     parallel = Parallel(n_jobs=self.n_jobs)
    744     func = delayed(_rfe_single_fit)
--> 746 scores = parallel(
    747     func(rfe, self.estimator, X, y, train, test, scorer)
    748     for train, test in cv.split(X, y, groups)
    749 )
    751 scores = np.array(scores)
    752 scores_sum = np.sum(scores, axis=0)

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:747, in <genexpr>(.0)
    743     parallel = Parallel(n_jobs=self.n_jobs)
    744     func = delayed(_rfe_single_fit)
    746 scores = parallel(
--> 747     func(rfe, self.estimator, X, y, train, test, scorer)
    748     for train, test in cv.split(X, y, groups)
    749 )
    751 scores = np.array(scores)
    752 scores_sum = np.sum(scores, axis=0)

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:35, in _rfe_single_fit(rfe, estimator, X, y, train, test, scorer)
     33 X_train, y_train = _safe_split(estimator, X, y, train)
     34 X_test, y_test = _safe_split(estimator, X, y, test, train)
---> 35 return rfe._fit(
     36     X_train,
     37     y_train,
     38     lambda estimator, features: _score(
     39         # TODO(SLEP6): pass score_params here
     40         estimator,
     41         X_test[:, features],
     42         y_test,
     43         scorer,
     44         score_params=None,
     45     ),
     46 ).scores_

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_rfe.py:308, in RFE._fit(self, X, y, step_score, **fit_params)
    305 estimator.fit(X[:, features], y, **fit_params)
    307 # Get importance and rank them
--> 308 importances = _get_feature_importances(
    309     estimator,
    310     self.importance_getter,
    311     transform_func="square",
    312 )
    313 ranks = np.argsort(importances)
    315 # for sparse case ranks is matrix

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\feature_selection\_base.py:238, in _get_feature_importances(estimator, getter, transform_func, norm_order)
    236 elif transform_func == "square":
    237     if importances.ndim == 1:
--> 238         importances = safe_sqr(importances)
    239     else:
    240         importances = safe_sqr(importances).sum(axis=0)

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\__init__.py:773, in safe_sqr(X, copy)
    757 def safe_sqr(X, *, copy=True):
    758     """Element wise squaring of array-likes and sparse matrices.
    759 
    760     Parameters
   (...)
    771          Return the element-wise square of the input.
    772     """
--> 773     X = check_array(X, accept_sparse=["csr", "csc", "coo"], ensure_2d=False)
    774     if issparse(X):
    775         if copy:

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\validation.py:1003, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    997     raise ValueError(
    998         "Found array with dim %d. %s expected <= 2."
    999         % (array.ndim, estimator_name)
   1000     )
   1002 if force_all_finite:
-> 1003     _assert_all_finite(
   1004         array,
   1005         input_name=input_name,
   1006         estimator_name=estimator_name,
   1007         allow_nan=force_all_finite == "allow-nan",
   1008     )
   1010 if copy:
   1011     if _is_numpy_namespace(xp):
   1012         # only make a copy if `array` and `array_orig` may share memory`

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\validation.py:126, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    123 if first_pass_isfinite:
    124     return
--> 126 _assert_all_finite_element_wise(
    127     X,
    128     xp=xp,
    129     allow_nan=allow_nan,
    130     msg_dtype=msg_dtype,
    131     estimator_name=estimator_name,
    132     input_name=input_name,
    133 )

File c:\Users\eugen\anaconda3\envs\ufc\lib\site-packages\sklearn\utils\validation.py:175, in _assert_all_finite_element_wise(X, xp, allow_nan, msg_dtype, estimator_name, input_name)
    158 if estimator_name and input_name == "X" and has_nan_error:
    159     # Improve the error message on how to handle missing values in
    160     # scikit-learn.
    161     msg_err += (
    162         f"\n{estimator_name} does not accept missing values"
    163         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    173         "#estimators-that-handle-nan-values"
    174     )
--> 175 raise ValueError(msg_err)

ValueError: Input contains NaN.

Versions

System:
    python: 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)]
executable: c:\Users\eugen\anaconda3\envs\ufc\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.4.0
          pip: 23.0.1
   setuptools: 66.0.0
        numpy: 1.26.0
        scipy: 1.11.3
       Cython: 3.0.0
       pandas: 1.5.3
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: C:\Users\eugen\anaconda3\envs\ufc\Library\bin\mkl_rt.2.dll
         prefix: mkl_rt
       user_api: blas
   internal_api: mkl
        version: 2022.1-Product
    num_threads: 8
threading_layer: intel

       filepath: C:\Users\eugen\anaconda3\envs\ufc\vcomp140.dll
         prefix: vcomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 16

ehan03 added the Bug and Needs Triage labels on Jan 25, 2024
@glemaitre
Member

The problem is not RFECV but a bug in RandomForest, where feature_importances_ is not computed properly:

clf.fit(X_missing, y)
clf.feature_importances_
array([nan, nan, nan, nan])

feature_importances_ needs to be NaN-aware as well.
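
To illustrate why a single bad node can be enough to turn every entry into NaN, here is a minimal sketch in plain Python. It assumes, as a simplification, that the importances are raw per-feature impurity decreases divided by their total; the helper name `normalize_importances` and the numbers are hypothetical and are not scikit-learn internals.

```python
import math


def normalize_importances(raw):
    """Scale raw per-feature contributions so that they sum to one."""
    total = sum(raw)  # a single NaN contribution makes the total NaN
    return [value / total for value in raw]


# Hypothetical raw contributions: feature 1 picked up a NaN impurity decrease.
raw_importances = [0.2, math.nan, 0.5, 0.3]
normalized = normalize_importances(raw_importances)

# Dividing by a NaN total poisons every feature, not only the affected one.
all_nan = all(math.isnan(value) for value in normalized)
```

Under this assumption, the healthy features are lost as well, which matches the all-NaN array shown above.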

glemaitre removed the Needs Triage label on Jan 25, 2024
glemaitre changed the title from "RFECV does not work with data containing NaNs even if estimator supports NaN" to "feature_importances_ in the tree-based model does not take into account nan values in X_train" on Jan 25, 2024
@ehan03
Author

ehan03 commented Jan 25, 2024

Thank you for the correction. Maybe I'll use something like LightGBM as my estimator in the meantime, until this gets fixed.

@glemaitre
Member

In both cases, you should be extremely careful when using feature_importances_, since it tends to be biased and can overfit: https://scikit-learn.org/dev/auto_examples/inspection/plot_permutation_importance.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-py

Depending on your use case, it could be better to use permutation importances, which do not have this issue with missing values.
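
As a rough illustration of the idea, here is a hand-rolled sketch in plain Python. In practice one would call `sklearn.inspection.permutation_importance`; everything below (the `predict` rule, the toy data, and the helper names) is hypothetical and only shows that the technique needs nothing more than a predictor that tolerates NaN inputs.

```python
import math
import random


def predict(row):
    # Hypothetical NaN-tolerant model: only feature 0 matters, NaN maps to class 0.
    x0 = row[0]
    return 1 if (not math.isnan(x0) and x0 > 0.5) else 0


def accuracy(rows, labels):
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance_sketch(rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when one column is shuffled, per column."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, shuffled)
            ]
            drops.append(baseline - accuracy(permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: feature 0 drives the label, feature 1 is noise, some entries are NaN.
rows = [[i / 20, (i * 7 % 5) / 5] for i in range(20)]
rows[3][0] = math.nan
rows[15][1] = math.nan
labels = [predict(row) for row in rows]
importances = permutation_importance_sketch(rows, labels)
```

The score is computed from predictions only, so it does not depend on the impurity values stored in the tree nodes.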

@ehan03
Author

ehan03 commented Jan 25, 2024

Thank you!

@glemaitre
Member

OK. I'm posting a minimal reproducer:

import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

seed = 2
n_samples, n_missing_per_features = 100, 10
X, y = datasets.make_classification(n_samples=n_samples, n_features=4, random_state=0)
rng = np.random.RandomState(0)
for col in range(X.shape[1]):
    indices = rng.choice(X.shape[0], size=n_missing_per_features, replace=False)
    X[indices, col] = np.nan
tree = DecisionTreeClassifier(random_state=seed).fit(X, y)

Actually, this is not just a bug in the computation of the feature importances. The tree above looks like this:

[image: plot of the fitted decision tree]

We can see that the Gini index in node #20 (the rightmost one) is NaN. I checked the reason for it, and it appears that weighted_n_right in the criterion is actually 0 for this particular node. However, none of the 3 samples at this node have missing values. It looks like we are subtracting the number of missing values from weighted_n_right at node #14, even though the criterion indicates that the missing values are sent to the left at that node. So there is something wrong in the subtraction there.

I assume that what we observe here surfaced only by luck: we have a real bug in the partitioning, or in the way missing values are tracked, and it is visible here because of the zero division.
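
To make the zero division concrete, here is a small sketch in plain Python. The class counts are hypothetical and the `gini` helper is only an illustration, not the criterion code; it shows how subtracting the missing-value counts from the wrong child can leave a total weight of 0 and hence a NaN impurity.

```python
import math


def gini(class_weights):
    """Gini impurity from per-class weighted counts."""
    total = sum(class_weights)
    if total == 0:
        # Compiled code would evaluate 0.0 / 0.0 and get NaN; Python raises
        # ZeroDivisionError instead, so this sketch returns NaN explicitly.
        return math.nan
    return 1.0 - sum((weight / total) ** 2 for weight in class_weights)


# Hypothetical right child holding 3 samples (2 of class 0, 1 of class 1).
right_counts = [2.0, 1.0]
# Hypothetical per-class counts of the missing-value samples, which went LEFT.
missing_counts = [2.0, 1.0]

correct = gini(right_counts)
# Wrong bookkeeping: the missing samples are subtracted from the right child.
corrupted = gini([r - m for r, m in zip(right_counts, missing_counts)])
```

Here `correct` is a valid impurity, while `corrupted` is NaN even though none of the samples in the right child has a missing value.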

I also now recall that @ogrisel and @ArturoAmorQ noticed a huge drop in performance with a random forest that used the missing-values mechanism, compared to imputation, in one of the exercises of the scikit-learn MOOC. I'll try to reproduce it to be sure that I am not saying anything wrong.

@thomasjpfan would you mind helping me find the root of the bug regarding the missing values issue? I'm almost there, but it could speed up the debugging quite a bit :)

@glemaitre
Member

glemaitre commented Jan 26, 2024

Here is a full example where we can observe the bug. It is a regression problem on the Ames Housing dataset, which has quite a lot of missing values. Here is a pipeline with imputation:

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

ames_housing = fetch_openml("house_prices")
X, y = ames_housing.data, ames_housing.target

preprocessor = ColumnTransformer(transformers=[
    (
        "encoder",
        make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        ),
        selector(dtype_include=object)
    ),
], remainder=SimpleImputer(strategy="mean"))
model = make_pipeline(preprocessor, RandomForestRegressor(random_state=0))

cv_results = cross_validate(
    model, X, y, cv=10, scoring="neg_mean_absolute_percentage_error", n_jobs=-1
)
cv_results = pd.DataFrame(cv_results)
mape = -cv_results["test_score"]
print(f"MAPE: {mape.mean() * 100:.1f}% +/- {mape.std() * 100:.1f}%")
MAPE: 10.1% +/- 1.0%

and now leveraging the current missing values mechanism:

preprocessor = ColumnTransformer(transformers=[
    (
        "encoder",
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        selector(dtype_include=object)
    ),
], remainder="passthrough")
model = make_pipeline(preprocessor, RandomForestRegressor(random_state=0))

cv_results = cross_validate(
    model, X, y, cv=10, scoring="neg_mean_absolute_percentage_error", n_jobs=-1
)
cv_results = pd.DataFrame(cv_results)
mape = -cv_results["test_score"]
print(f"MAPE: {mape.mean() * 100:.1f}% +/- {mape.std() * 100:.1f}%")
MAPE: 19.4% +/- 2.2%

and here are the stats about Ames Housing:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     1452 non-null   object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
dtypes: float64(3), int64(34), object(43)
memory usage: 912.6+ KB

glemaitre changed the title from "feature_importances_ in the tree-based model does not take into account nan values in X_train" to "DecisionTree does not handle properly missing values in criterion partitioning" on Jan 26, 2024
glemaitre added the High Priority label on Jan 26, 2024
glemaitre added this to the 1.4.1 milestone on Jan 26, 2024
@glemaitre
Member

The above pipelines with a HistGradientBoostingRegressor lead to MAPE: 9.3% +/- 1.0% for the first pipeline and MAPE: 9.4% +/- 1.1% for the second. So we certainly have a bug.

@glemaitre
Member

glemaitre commented Jan 27, 2024

Here is a smaller reproducer:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

y = np.arange(6)
X = np.array([np.nan, np.nan, 3, 4, 5, 6]).reshape(-1, 1)
tree = DecisionTreeRegressor().fit(X, y)

From this example, I think it will be easier to spot what is going wrong because there are few splits.

Fixing max_depth=2 already shows some weird stuff:

[image: plot of the decision tree fitted with max_depth=2]

The mean squared error for node #3 is negative, while it should be 0 because we have a single sample.

@glemaitre
Member

So I found a first bug: we don't reinitialize the number of missing values of the criterion for each split. Therefore, when we consider a split without missing values after a split that had missing values, the computed statistics are wrong because they use the n_missing_values from the previous split.

I'll make a PR for that, and I think that the above example makes a good regression test because, when we have a single sample in each leaf, we should always have an MSE of 0.
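
The stale-state problem can be sketched with a toy criterion in plain Python. This is a hypothetical simplification, not the Cython criterion: `ToyMSECriterion`, its methods, and the choice of sending missing values to the left are assumptions made only to show how cached missing-value statistics from a previous node can produce a negative MSE.

```python
class ToyMSECriterion:
    """Toy variance criterion that caches statistics of missing-value samples."""

    def __init__(self):
        self.n_missing = 0
        self.sum_missing = 0.0
        self.sq_sum_missing = 0.0

    def init_node(self, y_present, y_missing=()):
        # Totals cover every sample in the node, missing or not.
        self.y_present = list(y_present)
        y_all = self.y_present + list(y_missing)
        self.n_total = len(y_all)
        self.sum_total = float(sum(y_all))
        self.sq_sum_total = float(sum(v * v for v in y_all))

    def init_missing(self, y_missing):
        # Must be called for every node, even when nothing is missing.
        self.n_missing = len(y_missing)
        self.sum_missing = float(sum(y_missing))
        self.sq_sum_missing = float(sum(v * v for v in y_missing))

    def children_impurity(self, pos):
        # Missing samples go left; the right child is "total minus left".
        left = self.y_present[:pos]
        n_left = len(left) + self.n_missing
        sum_left = sum(left) + self.sum_missing
        sq_left = sum(v * v for v in left) + self.sq_sum_missing
        n_right = self.n_total - n_left
        sum_right = self.sum_total - sum_left
        sq_right = self.sq_sum_total - sq_left
        mse_left = sq_left / n_left - (sum_left / n_left) ** 2
        mse_right = sq_right / n_right - (sum_right / n_right) ** 2
        return mse_left, mse_right


def right_mse_of_child(reinitialize):
    criterion = ToyMSECriterion()
    # Parent node: targets 0 and 1 belong to samples with a missing feature.
    criterion.init_node([2, 3, 4, 5], y_missing=[0, 1])
    criterion.init_missing([0, 1])
    # Child node: same criterion object, but no missing values at all.
    criterion.init_node([2, 3, 4, 5])
    if reinitialize:
        criterion.init_missing([])
    return criterion.children_impurity(pos=1)[1]


buggy = right_mse_of_child(reinitialize=False)
fixed = right_mse_of_child(reinitialize=True)
```

With the stale statistics, the right child's MSE comes out negative, which is impossible for a variance; resetting the missing-value state for each node gives the variance of the targets 3, 4 and 5.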
