Trees are doing too many splits with missing values #28298

@glemaitre

Description

In the following example:

import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=True, return_X_y=True)

rng = np.random.RandomState(42)
X_missing = X.copy()
mask = rng.binomial(n=np.array([1, 1, 1, 1]).reshape(1, -1),
                    p=(X['petal length (cm)'] / 8).values.reshape(-1, 1)).astype(bool)
X_missing[mask] = np.nan
indices = np.array([  2,  81,  39,  97,  91,  38,  46,  31, 101,  13,  89,  82, 100,
        42,  69,  27,  81,  16,  73,  74,  51,  47, 107,  17,  75, 110,
        20,  15, 104,  57,  26,  15,  75,  79,  35,  77,  90,  51,  46,
        13,  94,  91,  23,   8,  93,  93,  73,  77,  12,  13,  74, 109,
       110,  24,  10,  23, 104,  27,  92,  52,  20, 109,   8,   8,  28,
        27,  35,  12,  12,   7,  43,   0,  30,  31,  78,  12,  24, 105,
        50,   0,  73,  12, 102, 105,  13,  31,   1,  69,  11,  32,  75,
        90, 106,  94,  60,  56,  35,  17,  62,  85,  81,  39,  80,  16,
        63,   6,  80,  84,   3,   3,  76,  78], dtype=np.int32)
X_train, X_test, y_train, y_test = train_test_split(X_missing, y, random_state=13)
seed = 1857819720
clf = DecisionTreeClassifier(max_depth=None, max_features="sqrt", random_state=seed).fit(X_train.iloc[indices], y_train.iloc[indices])

we get the following tree:

[image: rendered decision tree]

The path #12/#14 is weird. Node #14 should contain some missing values, yet the tree still splits it on a threshold of np.inf, which should not be possible. So there is something fishy to investigate there.
