Closed
Description
In the following example:
import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
X, y = load_iris(as_frame=True, return_X_y=True)
rng = np.random.RandomState(42)
X_missing = X.copy()
mask = rng.binomial(
    n=np.array([1, 1, 1, 1]).reshape(1, -1),
    p=(X['petal length (cm)'] / 8).values.reshape(-1, 1),
).astype(bool)
X_missing[mask] = np.nan
indices = np.array([ 2, 81, 39, 97, 91, 38, 46, 31, 101, 13, 89, 82, 100,
42, 69, 27, 81, 16, 73, 74, 51, 47, 107, 17, 75, 110,
20, 15, 104, 57, 26, 15, 75, 79, 35, 77, 90, 51, 46,
13, 94, 91, 23, 8, 93, 93, 73, 77, 12, 13, 74, 109,
110, 24, 10, 23, 104, 27, 92, 52, 20, 109, 8, 8, 28,
27, 35, 12, 12, 7, 43, 0, 30, 31, 78, 12, 24, 105,
50, 0, 73, 12, 102, 105, 13, 31, 1, 69, 11, 32, 75,
90, 106, 94, 60, 56, 35, 17, 62, 85, 81, 39, 80, 16,
63, 6, 80, 84, 3, 3, 76, 78], dtype=np.int32)
X_train, X_test, y_train, y_test = train_test_split(X_missing, y, random_state=13)
seed = 1857819720
clf = DecisionTreeClassifier(max_depth=None, max_features="sqrt", random_state=seed).fit(X_train.iloc[indices], y_train.iloc[indices])
we get the following tree:
The path #12/#14 is weird. Node #14 should still contain some missing values, yet the tree is still able to split on np.inf,
which should not be possible. So there is something fishy to investigate there.
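A quick way to spot such a split without reading the plot might be to scan the fitted tree's arrays for internal nodes whose threshold is not finite. The sketch below is only an illustration: the helper name `nonfinite_split_nodes` is hypothetical, and the toy arrays are hand-built rather than taken from the tree in this report. It assumes the usual scikit-learn layout, where a leaf has both children set to -1.

```python
import numpy as np


def nonfinite_split_nodes(children_left, children_right, threshold):
    """Return ids of internal nodes whose split threshold is not finite.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    children_left = np.asarray(children_left)
    children_right = np.asarray(children_right)
    threshold = np.asarray(threshold, dtype=float)
    # Assumption: leaves have children_left == children_right == -1,
    # so any node with differing children is an internal (split) node.
    is_internal = children_left != children_right
    suspicious = is_internal & ~np.isfinite(threshold)
    return np.flatnonzero(suspicious).tolist()


# Hand-built toy arrays (not the tree from this report):
# node 0 splits at 2.45, node 2 splits at inf, nodes 1, 3 and 4 are leaves.
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
threshold = [2.45, -2.0, np.inf, -2.0, -2.0]

suspicious = nonfinite_split_nodes(children_left, children_right, threshold)
print(suspicious)
```

For the `clf` fitted above, the same helper could be called with `clf.tree_.children_left`, `clf.tree_.children_right` and `clf.tree_.threshold`; any node id it returns would be a candidate for the np.inf split described here.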