Trees are doing too many splits with missing values #28298

@glemaitre

Description

In the following example:

import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=True, return_X_y=True)

rng = np.random.RandomState(42)
X_missing = X.copy()
mask = rng.binomial(n=np.array([1, 1, 1, 1]).reshape(1, -1),
                    p=(X['petal length (cm)'] / 8).values.reshape(-1, 1)).astype(bool)
X_missing[mask] = np.nan
indices = np.array([  2,  81,  39,  97,  91,  38,  46,  31, 101,  13,  89,  82, 100,
        42,  69,  27,  81,  16,  73,  74,  51,  47, 107,  17,  75, 110,
        20,  15, 104,  57,  26,  15,  75,  79,  35,  77,  90,  51,  46,
        13,  94,  91,  23,   8,  93,  93,  73,  77,  12,  13,  74, 109,
       110,  24,  10,  23, 104,  27,  92,  52,  20, 109,   8,   8,  28,
        27,  35,  12,  12,   7,  43,   0,  30,  31,  78,  12,  24, 105,
        50,   0,  73,  12, 102, 105,  13,  31,   1,  69,  11,  32,  75,
        90, 106,  94,  60,  56,  35,  17,  62,  85,  81,  39,  80,  16,
        63,   6,  80,  84,   3,   3,  76,  78], dtype=np.int32)
X_train, X_test, y_train, y_test = train_test_split(X_missing, y, random_state=13)
seed = 1857819720
clf = DecisionTreeClassifier(max_depth=None, max_features="sqrt", random_state=seed).fit(X_train.iloc[indices], y_train.iloc[indices])

we get the following tree:

[image: rendered decision tree]

The path #12/#14 is weird. Node #14 should contain some missing values, yet the tree still splits it on a threshold of np.inf, which should not be possible. So there is something fishy to investigate there.
