Describe the bug
In decision trees (both classification and regression), the impurity decrease is sometimes computed incorrectly when X contains missing values.
This can lead to unexpectedly shallow trees when min_impurity_decrease is used to control depth.
This was discovered while investigating issue #32175.
Steps/Code to Reproduce
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.vstack([
    [0, 0, 0, 0, 1, 2, 3, 4],
    [1, 2, 1, 2, 1, 2, 1, 2]
]).swapaxes(0, 1).astype(float)
y = [0, 0, 0, 0, 1, 1, 1, 1]

n_leaves = []
for _ in range(1000):
    tree = DecisionTreeRegressor(max_depth=1, min_impurity_decrease=0.25).fit(X, y)
    # all the trees have two leaves
    assert tree.tree_.n_leaves == 2

X[X == 0] = np.nan
n_leaves_w_missing = []
for _ in range(1000):
    tree = DecisionTreeRegressor(max_depth=1, min_impurity_decrease=0.25).fit(X, y)
    n_leaves_w_missing.append(tree.tree_.n_leaves)

print(np.bincount(n_leaves_w_missing))
# prints [0 ~500 ~500]
The last print shows that in approximately half of the runs, the tree has only one leaf (i.e. no split).
Expected Results
Replacing 0 with nan should have no impact on the tree construction in this example.
The tree should always have one split (and hence two leaves).
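To make the expectation concrete, here is a minimal hand computation of the impurity decrease for the split that separates the first four samples from the last four. It assumes the weighted impurity decrease formula given in the scikit-learn documentation for min_impurity_decrease, N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity), with MSE as the impurity. The helper names below are hypothetical and are not part of scikit-learn.

```python
def mse_impurity(values):
    """Mean squared error around the mean (hypothetical helper)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease, following the formula documented
    for min_impurity_decrease (hypothetical helper)."""
    n_t = len(parent)
    return (n_t / n_total) * (
        mse_impurity(parent)
        - (len(right) / n_t) * mse_impurity(right)
        - (len(left) / n_t) * mse_impurity(left)
    )


y = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
# Split on the first feature: the four samples with value 0 (or nan)
# go to one side, the four samples with values 1..4 to the other.
left, right = y[:4], y[4:]

decrease = impurity_decrease(y, left, right, n_total=len(y))
print(decrease)  # 0.25
```

Under these assumptions the decrease is exactly 0.25, which equals the min_impurity_decrease threshold used in the reproducer, so the split is expected to be accepted whether the first four values of that feature are 0 or nan.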
Actual Results
In approximately half of the runs, the tree has only one leaf (i.e. no split).
Versions
System:
    python: 3.12.11 (main, Aug 18 2025, 19:19:11) [Clang 20.1.4 ]
    executable: /home/arthur/dev-perso/scikit-learn/sklearn-env/bin/python
    machine: Linux-6.14.0-29-generic-x86_64-with-glibc2.39

Python dependencies:
    sklearn: 1.8.dev0
    pip: None
    setuptools: 80.9.0
    numpy: 2.3.3
    scipy: 1.16.2
    Cython: 3.1.3
    pandas: None
    matplotlib: 3.10.6
    joblib: 1.5.2
    threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    num_threads: 16
    prefix: libscipy_openblas
    filepath: /home/arthur/dev-perso/scikit-learn/sklearn-env/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-8fb3d286.so
    version: 0.3.30
    threading_layer: pthreads
    architecture: Haswell

    user_api: blas
    internal_api: openblas
    num_threads: 16
    prefix: libscipy_openblas
    filepath: /home/arthur/dev-perso/scikit-learn/sklearn-env/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-b75cc656.so
    version: 0.3.29.dev
    threading_layer: pthreads
    architecture: Haswell

    user_api: openmp
    internal_api: openmp
    num_threads: 16
    prefix: libgomp
    filepath: /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
    version: None