DecisionTree does not properly handle missing values in criterion partitioning #28254
The problem is not `clf.fit(X_missing, y)` but `clf.feature_importances_`: `feature_importances_` in the tree-based model does not take into account nan values in X_train.
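Illustrative only (this snippet is not part of the thread): `feature_importances_` is an impurity-based quantity, so if the impurities are computed incorrectly in the presence of NaN, the importances are wrong too. A quick sanity check on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
X[::10, 0] = np.nan  # inject some missing values into the first feature

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.feature_importances_)  # should be finite, >= 0, and sum to 1
```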
Thank you for the correction. Maybe I'll use something like LightGBM as my estimator in the meantime until this gets fixed.
In both cases, you should be extremely careful when using the impurity-based `feature_importances_`. Depending on your use case, it could be better to use the permutation importances, which will not have this issue with the missing values.
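A minimal sketch (not from the original comment) of computing permutation importances on a NaN-tolerant model instead of relying on `feature_importances_`; the estimator and data below are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X[::7, 1] = np.nan  # some missing values

# HistGradientBoostingClassifier handles NaN natively; any NaN-tolerant
# estimator would do here.
clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)

# Permutation importances do not rely on the tree impurities at all, so they
# are unaffected by the impurity bug discussed in this issue.
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```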
Thank you!
OK. I'm posting a minimum reproducer:

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

seed = 2
n_samples, n_missing_per_features = 100, 10

X, y = datasets.make_classification(n_samples=n_samples, n_features=4, random_state=0)
rng = np.random.RandomState(0)
for col in range(X.shape[1]):
    indices = rng.choice(X.shape[0], size=n_missing_per_features, replace=False)
    X[indices, col] = np.nan

tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
```

Actually, this is not just a bug in the computation of the feature importance. The above tree looks like the following:

[plot of the fitted tree not reproduced here]

We can see that the Gini index in node #20 (the rightmost one) takes a value that should not be possible for a Gini impurity. I assume that what we observe here is visible only by luck and that we have a real bug in the partitioning or in the way missing values are tracked, which shows up here because of the zero division.

However, I now recall that @ogrisel and @ArturoAmorQ noticed a huge drop in performance with a random forest using the missing-values mechanism, compared to imputation, in one of the exercises of the scikit-learn MOOC. I'll try to reproduce it to be sure that I don't say anything wrong.

@thomasjpfan would you mind assisting me in finding the root cause of the missing-values bug? I'm almost there, but it could quite speed up the debugging :)
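One way to see the invalid impurity without plotting (a sketch reusing the `tree` fitted in the reproducer above; not part of the original comment):

```python
import numpy as np

impurity = tree.tree_.impurity
# For binary classification the Gini impurity must lie in [0, 0.5], so any
# entry outside that range is a symptom of the partitioning bug.
print(impurity)
print("suspicious nodes:", np.flatnonzero((impurity < 0) | (impurity > 0.5)))
```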
Here is a full example where we can observe the bug: a regression on the Ames Housing dataset, which contains quite a lot of missing values. Here is a pipeline with imputation:

```python
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

ames_housing = fetch_openml("house_prices")
X, y = ames_housing.data, ames_housing.target

preprocessor = ColumnTransformer(transformers=[
    (
        "encoder",
        make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        ),
        selector(dtype_include=object),
    ),
], remainder=SimpleImputer(strategy="mean"))

model = make_pipeline(preprocessor, RandomForestRegressor(random_state=0))
cv_results = cross_validate(
    model, X, y, cv=10, scoring="neg_mean_absolute_percentage_error", n_jobs=-1
)
cv_results = pd.DataFrame(cv_results)
mape = -cv_results["test_score"]
print(f"MAPE: {mape.mean() * 100:.1f}% +/- {mape.std() * 100:.1f}%")
```
and now leveraging the current missing-values mechanism:

```python
preprocessor = ColumnTransformer(transformers=[
    (
        "encoder",
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        selector(dtype_include=object),
    ),
], remainder="passthrough")

model = make_pipeline(preprocessor, RandomForestRegressor(random_state=0))
cv_results = cross_validate(
    model, X, y, cv=10, scoring="neg_mean_absolute_percentage_error", n_jobs=-1
)
cv_results = pd.DataFrame(cv_results)
mape = -cv_results["test_score"]
print(f"MAPE: {mape.mean() * 100:.1f}% +/- {mape.std() * 100:.1f}%")
```
and here are the stats about Ames Housing:
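For reference, such missing-value stats can be obtained with something like the following (assuming `X` is the pandas DataFrame returned by `fetch_openml` above; this snippet is not part of the original comment):

```python
# Count missing values per column and the share of rows with at least one NaN.
n_missing = X.isna().sum()
print(n_missing[n_missing > 0].sort_values(ascending=False))
print(f"rows with at least one NaN: {X.isna().any(axis=1).mean():.1%}")
```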
> feature_importances_ in the tree-based model does not take into account nan values in X_train

The above pipeline with an imputer ...
Here is a smaller reproducer:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

y = np.arange(6)
X = np.array([np.nan, np.nan, 3, 4, 5, 6]).reshape(-1, 1)

tree = DecisionTreeRegressor().fit(X, y)
```

From this example, I think it will be easier to spot what is going wrong because there are few splits.

[plot of the fitted tree not reproduced here]

The mean squared error for node #3 is negative, while it should be 0 because the node contains a single sample.
So I found a first bug: we don't reinitialize the number of missing values of the criterion for each split. Therefore, when we consider a split without missing values but a previously evaluated split had missing values, the statistics computed are wrong because they reuse the number of missing values left over from that previous split. I'll make a PR for that, and I think that the above example makes a good regression test because, when we have a single sample in each leaf, we should always have an MSE of 0.
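A sketch of the kind of regression check this suggests (not the actual test added to scikit-learn): node impurities of a fitted tree can never legitimately be negative, so the small reproducer above can be turned into an assertion.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([np.nan, np.nan, 3, 4, 5, 6]).reshape(-1, 1)
y = np.arange(6, dtype=float)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# The MSE stored for every node must be non-negative; the bug described above
# could make the node handling the missing values report a negative value.
assert np.all(tree.tree_.impurity >= 0), tree.tree_.impurity
```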
Describe the bug

I tried using RFECV with RandomForestClassifier in version 1.4.0 on data containing NaNs and got the following error:

This is my first time opening an issue on an open-source project, so I apologize if it is ill-formatted or lacking details. Please let me know if I can provide more information.
Steps/Code to Reproduce
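A hedged reconstruction of the setup described above (synthetic data and arbitrary sizes; not the reporter's original code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=100, n_features=6, random_state=0)
rng = np.random.RandomState(0)
X[rng.choice(100, size=10, replace=False), 0] = np.nan  # data containing NaNs

# On scikit-learn 1.4.0 this is the call that the report says raises an error.
RFECV(estimator=RandomForestClassifier(random_state=0), cv=3).fit(X, y)
```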
Expected Results
I would expect no error since RandomForestClassifier supports NaNs and, according to the documentation for RFECV, the estimator only needs to expose feature importances (e.g. feature_importances_ or coef_). For instance, the following code works just fine:
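An illustrative sketch of the kind of snippet described as working, i.e. the forest used directly on NaN-containing data (not the reporter's exact code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=6, random_state=0)
X[::10, 0] = np.nan  # data containing NaNs

# Fitting and predicting with the forest alone completes without error.
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```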
Actual Results
Versions