DecisionTreeClassifier having unexpected behaviour with 'min_weight_fraction_leaf=0.5' #30917

Closed
@snath-xoc

Describe the bug

When fitting DecisionTreeClassifier on a duplicated sample set (i.e. each sample repeated twice), the result is not the same as when fitting on the original sample set. This only happens for 'min_weight_fraction_leaf' specified as <0.5. This also affects ExtraTreesClassifier and ExtraTreeClassifier.

Steps/Code to Reproduce

from sklearn.tree import DecisionTreeClassifier
from scipy.stats import kstest
import numpy as np

rng = np.random.RandomState(0)
    
n_samples = 20
X = rng.rand(n_samples, n_samples * 2)
y = rng.randint(0, 3, size=n_samples)

X_repeated = np.repeat(X,2,axis=0)
y_repeated = np.repeat(y,2)

predictions = []
predictions_dup = []

## Fit estimator
for seed in range(100):
    est = DecisionTreeClassifier(random_state=seed, max_features=0.5, min_weight_fraction_leaf=0.5).fit(X,y)
    est_dup = DecisionTreeClassifier(random_state=seed, max_features=0.5, min_weight_fraction_leaf=0.5).fit(X_repeated,y_repeated)

    ##Get predictions
    predictions.append(est.predict_proba(X)[:,:-1])
    predictions_dup.append(est_dup.predict_proba(X)[:,:-1])

predictions = np.vstack(predictions)
predictions_dup = np.vstack(predictions_dup)

## Compare the two prediction distributions class by class
for pred, pred_dup in zip(predictions.T, predictions_dup.T):
    print(kstest(pred, pred_dup).pvalue)

Expected Results

p-values are greater than ~0.05
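
This expectation follows from the documented meaning of `min_weight_fraction_leaf`: each leaf must hold at least that share of the total sample weight. The sketch below is plain Python, not scikit-learn's implementation. It assumes unit sample weights, all-distinct feature values (as with the random `X` above), and that a split can only fall between distinct values. The helper `admissible_left_fractions` is hypothetical and only illustrates the arithmetic.

```python
from fractions import Fraction


def admissible_left_fractions(n_unique, repeats, min_weight_fraction_leaf):
    """Return the left-child weight fractions a single split may produce.

    Hypothetical helper, not scikit-learn code. Assumes unit sample weights,
    n_unique distinct feature values each repeated `repeats` times, and that
    a split can only fall between distinct values.
    """
    total = n_unique * repeats
    # Minimum weight each child must carry, computed in exact arithmetic.
    min_leaf_weight = Fraction(min_weight_fraction_leaf) * total
    fractions = set()
    for k in range(1, n_unique):  # k distinct values go to the left child
        left = k * repeats
        right = total - left
        if left >= min_leaf_weight and right >= min_leaf_weight:
            fractions.add(Fraction(left, total))
    return fractions


# 20 original samples versus the same samples each repeated twice.
original = admissible_left_fractions(20, 1, Fraction(1, 2))
duplicated = admissible_left_fractions(20, 2, Fraction(1, 2))
print(original == duplicated)
```

Under these assumptions the two sets are identical, so repeating every sample should not change which splits satisfy the constraint. Any difference observed in practice would have to come from how the constraint is evaluated, not from the constraint itself.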

Actual Results

p-values = 2.0064970441275627e-69

Versions

System:
    python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:13:44) [Clang 16.0.6 ]
executable: /Users/shrutinath/micromamba/envs/scikit-learn/bin/python
   machine: macOS-14.3-arm64-arm-64bit

Python dependencies:
      sklearn: 1.7.dev0
          pip: 24.0
   setuptools: 75.8.0
        numpy: 2.0.0
        scipy: 1.14.0
       Cython: 3.0.10
       pandas: 2.2.2
   matplotlib: 3.9.0
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
...
    num_threads: 8
         prefix: libomp
       filepath: /Users/shrutinath/micromamba/envs/scikit-learn/lib/libomp.dylib
        version: None
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
