DecisionTreeClassifier having unexpected behaviour with 'min_weight_fraction_leaf=0.5' #30917
Comments
This occurs for seed in range(300) as well with p-values of:
This is related to the RFC about sample weight invariance properties (#15657), and this case in particular has already been identified as problematic, see #15657 (comment). This is another example of a hyperparameter that depends on n_samples but becomes weird or not very user-friendly if modified to depend on the weight sum instead.
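To make the invariance question concrete, here is a minimal sketch of the arithmetic in plain Python. It assumes the leaf constraint is checked as "leaf weight >= fraction × total weight", which is how the `min_weight_fraction_leaf` documentation describes it. The helper names are hypothetical and are not scikit-learn functions.

```python
def leaf_allowed_by_fraction(leaf_weight, total_weight, min_weight_fraction_leaf):
    """Hypothetical helper: the leaf must hold at least the given fraction of the total weight."""
    return leaf_weight >= min_weight_fraction_leaf * total_weight


def leaf_allowed_by_count(leaf_count, min_samples_leaf):
    """Hypothetical helper: the leaf must hold at least a fixed number of samples."""
    return leaf_count >= min_samples_leaf


# Original data: 10 samples, candidate leaf holding 4 of them.
fraction_original = leaf_allowed_by_fraction(4, 10, 0.5)
# Every sample duplicated: 20 samples, the same leaf now holds 8.
fraction_duplicated = leaf_allowed_by_fraction(8, 20, 0.5)

# A count-based constraint gives a different answer after duplication.
count_original = leaf_allowed_by_count(4, 5)
count_duplicated = leaf_allowed_by_count(8, 5)

print(fraction_original, fraction_duplicated)  # same decision before and after duplication
print(count_original, count_duplicated)        # decision flips after duplication
```

Under that assumption, a fraction-based threshold scales with the data, so duplicating every sample leaves the decision unchanged, while an integer count threshold does not.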
/take
Hi,

After investigating this issue and running unit tests, I believe the extremely low p-values reported might be due to incorrect usage of

In the provided reproduction code, the line

I've also created unit tests to verify this invariance independently of

Since this does not appear to be a bug in scikit-learn, should I still submit a patch with these unit tests? Also, in case I've misunderstood the problem, I'd be happy to take another look and work on a fix if it truly does have something to do with scikit-learn. Let me know how you'd like to proceed. Thanks!
Hi @pedroL0pes, thanks for that. I can confirm that I also get p-values of 1.0 locally. Good call on checking the zip. Submitting a patch sounds good to me; let me know when it's ready for review or if you need help.
Describe the bug
When fitting DecisionTreeClassifier on a duplicated sample set (i.e. each sample repeated twice), the result is not the same as when fitting on the original sample set. This only happens for 'min_weight_fraction_leaf' specified as <0.5. This also affects ExtraTreesClassifier and ExtraTreeClassifier.
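A minimal sketch of the comparison described above, not the reproduction script from this report. The dataset, the value 0.4, the seed, and the helper `fit_and_predict` are all assumptions chosen for illustration. Samples are duplicated with `np.repeat` so that features and labels stay aligned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_and_predict(X_train, y_train, X_test, seed):
    """Hypothetical helper: fit a tree and return class probabilities on X_test."""
    clf = DecisionTreeClassifier(min_weight_fraction_leaf=0.4, random_state=seed)
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_test)


# Synthetic data (an assumption; the report's own data is not shown here).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Repeat every sample twice, keeping each label next to its features.
X_dup = np.repeat(X, 2, axis=0)
y_dup = np.repeat(y, 2, axis=0)

proba_original = fit_and_predict(X, y, X, seed=0)
proba_duplicated = fit_and_predict(X_dup, y_dup, X, seed=0)

# Largest disagreement between the two fitted trees on the original points.
max_abs_diff = float(np.abs(proba_original - proba_duplicated).max())
print(max_abs_diff)
```

A single seed only gives a spot check; comparing the two fits over many seeds, as the report does with p-values, is needed before drawing any conclusion.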
Steps/Code to Reproduce
Expected Results
p-values are greater than ~0.05
Actual Results
Versions