
DecisionTreeClassifier having unexpected behaviour with 'min_weight_fraction_leaf=0.5' #30917

Open
snath-xoc opened this issue Feb 28, 2025 · 5 comments

@snath-xoc
Contributor

snath-xoc commented Feb 28, 2025

Describe the bug

When fitting DecisionTreeClassifier on a duplicated sample set (i.e., each sample repeated twice), the result is not the same as when fitting on the original sample set. This only happens when 'min_weight_fraction_leaf' is set to 0.5. This also affects ExtraTreesClassifier and ExtraTreeClassifier.

Steps/Code to Reproduce

from sklearn.tree import DecisionTreeClassifier
from scipy.stats import kstest
import numpy as np

rng = np.random.RandomState(0)
    
n_samples = 20
X = rng.rand(n_samples, n_samples * 2)
y = rng.randint(0, 3, size=n_samples)

X_repeated = np.repeat(X, 2, axis=0)
y_repeated = np.repeat(y, 2)

predictions = []
predictions_dup = []

## Fit estimator
for seed in range(100):
    est = DecisionTreeClassifier(random_state=seed, max_features=0.5, min_weight_fraction_leaf=0.5).fit(X,y)
    est_dup = DecisionTreeClassifier(random_state=seed, max_features=0.5, min_weight_fraction_leaf=0.5).fit(X_repeated,y_repeated)

    ## Get predictions
    predictions.append(est.predict_proba(X)[:,:-1])
    predictions_dup.append(est_dup.predict_proba(X)[:,:-1])

predictions = np.vstack(predictions)
predictions_dup = np.vstack(predictions_dup)

for pred, pred_dup in (predictions.T,predictions_dup.T):
    print(kstest(pred,pred_dup).pvalue)

Expected Results

p-values are greater than ~0.05

Actual Results

p-values = 2.0064970441275627e-69

Versions

System:
    python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:13:44) [Clang 16.0.6 ]
executable: /Users/shrutinath/micromamba/envs/scikit-learn/bin/python
   machine: macOS-14.3-arm64-arm-64bit

Python dependencies:
      sklearn: 1.7.dev0
          pip: 24.0
   setuptools: 75.8.0
        numpy: 2.0.0
        scipy: 1.14.0
       Cython: 3.0.10
       pandas: 2.2.2
   matplotlib: 3.9.0
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
...
    num_threads: 8
         prefix: libomp
       filepath: /Users/shrutinath/micromamba/envs/scikit-learn/lib/libomp.dylib
        version: None
@snath-xoc snath-xoc added Bug Needs Triage Issue requires triage labels Feb 28, 2025
@snath-xoc
Contributor Author

This also occurs for seed in range(300), with a p-value of:

1.336814951659136e-178

@jeremiedbb
Member

This is related to the RFC about sample weight invariance properties (#15657), and this case in particular has already been identified as problematic; see #15657 (comment).

This is another example of a hyperparameter that depends on n_samples but becomes odd or not very user-friendly if modified to depend on the sum of sample weights instead.
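
For context, min_weight_fraction_leaf is defined as a fraction of the total sample weight, so duplicating every sample doubles the weight total without changing what the threshold means. A quick illustration of its effect at 0.5 (the data here is made up, purely for illustration):

# With min_weight_fraction_leaf=0.5, every leaf must hold at least half of
# the total sample weight, so the fitted tree can make at most one split.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(20, 5)
y = rng.randint(0, 2, size=20)

tree = DecisionTreeClassifier(min_weight_fraction_leaf=0.5).fit(X, y)
print(tree.get_depth())     # 0 or 1: any deeper split would starve a leaf
print(tree.get_n_leaves())  # at most 2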

@jeremiedbb jeremiedbb removed the Needs Triage Issue requires triage label Feb 28, 2025
@pedroL0pes
Contributor

/take

@pedroL0pes
Contributor

Hi,

After investigating this issue and running unit tests, I believe the extremely low p-values reported might be due to incorrect usage of kstest rather than an issue with scikit-learn.

In the provided reproduction code, the line for pred, pred_dup in (predictions.T, predictions_dup.T): should probably be for pred, pred_dup in zip(predictions.T, predictions_dup.T):.
Without zip(), the loop iterates over the two-element tuple and unpacks each matrix along its first axis, so it never pairs the predictions from the original and duplicated datasets. Instead, it mistakenly compares different class columns within the same predictions matrix, which likely explains the observed low p-values.
When corrected, kstest returns a p-value of 1.0, confirming that the distribution of predictions is identical across both datasets, in line with the expected invariance between trees trained on the original and duplicated datasets.
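
For reference, here is the corrected comparison loop (the only change to the reproduction script):

## Pair each class column of the original predictions with the corresponding
## column of the duplicated-data predictions
for pred, pred_dup in zip(predictions.T, predictions_dup.T):
    print(kstest(pred, pred_dup).pvalue)  # prints 1.0 for each class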

I've also created unit tests to verify this invariance independently of kstest (for DecisionTreeClassifier and ExtraTreeClassifier, for seeds in range(100) and range(300), and for min_weight_fraction_leaf values 0, 0.1, ..., 0.5). They confirm that np.array_equal(predictions, predictions_dup) holds: since the same random_state is used for both trees, their structures remain identical, leading to perfectly matching probability predictions. In other words, the classifiers behave as expected.
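
A minimal sketch of the kind of test I mean (the names and parameter grid here are illustrative, not the exact patch):

import numpy as np
import pytest
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

# Hypothetical test: duplicating every sample must not change predictions
@pytest.mark.parametrize("Tree", [DecisionTreeClassifier, ExtraTreeClassifier])
@pytest.mark.parametrize("mwfl", [0.0, 0.1, 0.25, 0.5])
def test_duplicated_samples_invariance(Tree, mwfl):
    rng = np.random.RandomState(0)
    X = rng.rand(20, 40)
    y = rng.randint(0, 3, size=20)
    X_rep, y_rep = np.repeat(X, 2, axis=0), np.repeat(y, 2)
    for seed in range(100):
        est = Tree(random_state=seed, max_features=0.5,
                   min_weight_fraction_leaf=mwfl).fit(X, y)
        est_rep = Tree(random_state=seed, max_features=0.5,
                       min_weight_fraction_leaf=mwfl).fit(X_rep, y_rep)
        assert np.array_equal(est.predict_proba(X), est_rep.predict_proba(X))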

Since this does not appear to be a bug in scikit-learn, should I still submit a patch with these unit tests? Also, in case I've misunderstood the problem, I'd be happy to take another look and work on a fix if it really is a scikit-learn issue.

Let me know how you'd like to proceed. Thanks!

@snath-xoc
Contributor Author

Hi @pedroL0pes, thanks for that. I can confirm that I also get p-values of 1.0 locally; good call on checking the zip. Submitting a patch sounds good to me; let me know when it's ready for review or if you need help.
