Trees: impurity decrease calculation is buggy when there are missing values #32178

@cakedev0

Describe the bug

In decision trees (both classification and regression), the impurity decrease calculation is sometimes wrong when there are missing values in X.

This can lead to unexpectedly shallow trees when using min_impurity_decrease to control depth.

This was discovered while investigating issue #32175.

Steps/Code to Reproduce

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.vstack([
    [0, 0, 0, 0, 1, 2, 3, 4],
    [1, 2, 1, 2, 1, 2, 1, 2]
]).swapaxes(0, 1).astype(float)
y = [0, 0, 0, 0, 1, 1, 1, 1]

for _ in range(1000):
    tree = DecisionTreeRegressor(max_depth=1, min_impurity_decrease=0.25).fit(X, y)
    # without missing values, every tree has one split (two leaves)
    assert tree.tree_.n_leaves == 2

X[X == 0] = np.nan
n_leaves_w_missing = []
for _ in range(1000):
    tree = DecisionTreeRegressor(max_depth=1, min_impurity_decrease=0.25).fit(X, y)
    n_leaves_w_missing.append(tree.tree_.n_leaves)

# counts of trees with 0, 1 and 2 leaves
print(np.bincount(n_leaves_w_missing))
# prints approximately [0 500 500]

The last print shows that in approximately half of the cases, the tree has only one leaf (i.e. no split).

Expected Results

Replacing 0 with NaN should have no impact on the tree construction in this example.

The tree should always have one split (and hence two leaves).
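As a sanity check on that expectation, the impurity decrease of the ideal split can be worked out by hand. The sketch below assumes the weighted impurity decrease formula given in the scikit-learn documentation for min_impurity_decrease, with the MSE criterion (node variance) as the impurity. The helper names `mse_impurity` and `weighted_impurity_decrease` are hypothetical and not part of scikit-learn, and the split assumed here is the one that sends the four 0/NaN rows to one child and the other four rows to the other child.

```python
def mse_impurity(values):
    """Variance of the targets in a node (impurity under the MSE criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def weighted_impurity_decrease(parent, left, right, n_total):
    """N_t / N * (impurity - N_t_R / N_t * right_impurity
                            - N_t_L / N_t * left_impurity)."""
    n_t = len(parent)
    return (n_t / n_total) * (
        mse_impurity(parent)
        - (len(right) / n_t) * mse_impurity(right)
        - (len(left) / n_t) * mse_impurity(left)
    )


y = [0, 0, 0, 0, 1, 1, 1, 1]
# Assumed split: rows whose first feature is 0 (or NaN) vs. the rest.
left, right = y[:4], y[4:]

decrease = weighted_impurity_decrease(y, left, right, n_total=len(y))
print(decrease)  # 0.25
```

Under these assumptions the decrease is 0.25, which equals min_impurity_decrease=0.25 and therefore meets the documented "greater than or equal to" condition for splitting, whether the first four rows hold 0 or NaN.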

Actual Results

In approximately half of the cases, the tree has only one leaf (i.e. no split).

Versions

System:
    python: 3.12.11 (main, Aug 18 2025, 19:19:11) [Clang 20.1.4 ]
executable: /home/arthur/dev-perso/scikit-learn/sklearn-env/bin/python
   machine: Linux-6.14.0-29-generic-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.8.dev0
          pip: None
   setuptools: 80.9.0
        numpy: 2.3.3
        scipy: 1.16.2
       Cython: 3.1.3
       pandas: None
   matplotlib: 3.10.6
       joblib: 1.5.2
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libscipy_openblas
       filepath: /home/arthur/dev-perso/scikit-learn/sklearn-env/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-8fb3d286.so
        version: 0.3.30
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libscipy_openblas
       filepath: /home/arthur/dev-perso/scikit-learn/sklearn-env/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-b75cc656.so
        version: 0.3.29.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
        version: None
