Fix FEATURE_THRESHOLD initialization in trees #32259
Overall LGTM. Good catch!
If you have some bandwidth to detail the use-case that relied on this "ignore almost constant features" behavior, I would be happy. But that's just for my curiosity ^^
# Mitigate precision differences between 32 bit and 64 bit
FEATURE_THRESHOLD = 1e-7
I've been working on sklearn/tree/* quite a lot lately, but this comment has remained a mystery to me. It seems you rely on this behavior, so maybe you can explain in a bit more detail the purpose of "mitigating precision differences between 32 bit and 64 bit"?
(100% optional though)
Thanks @cakedev0, I am also unsure of the purpose of this threshold. Actually, the test that failed on our side was based on randomly generated fake data. I don't believe we have features with such low min/max values. So, I think we also don't rely on this behavior.
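For readers following the thread, here is a minimal sketch of the "ignore almost constant features" behavior under discussion. It assumes the splitter treats a feature as constant when its maximum does not exceed its minimum by more than FEATURE_THRESHOLD. The helper `is_almost_constant` and the sample values are hypothetical illustrations, not scikit-learn code.

```python
# Hypothetical illustration only; not the scikit-learn implementation.
FEATURE_THRESHOLD = 1e-7  # the value this PR restores


def is_almost_constant(values, threshold=FEATURE_THRESHOLD):
    """Return True if the spread of `values` does not exceed `threshold`."""
    return max(values) <= min(values) + threshold


# A feature whose values differ only by a few 1e-9.
almost_constant = [0.5, 0.5 + 1e-9, 0.5 + 5e-9]
# A feature with a clearly usable spread.
informative = [0.1, 0.5, 0.9]

# With the intended threshold, the tiny spread is ignored.
print(is_almost_constant(almost_constant))                 # True
print(is_almost_constant(informative))                     # False

# With the accidental threshold of 0.0, the same feature is no longer
# considered constant, so it becomes a split candidate.
print(is_almost_constant(almost_constant, threshold=0.0))  # False
```

Under this assumption, a threshold of 0.0 only skips exactly constant features, which would explain why a test on randomly generated data changed behavior between versions.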
sklearn/tree/tests/test_tree.py
def test_almost_constant_feature():
    random_state = check_random_state(0)
    X = random_state.rand(10, 20)
Suggested change:
- X = random_state.rand(10, 20)
+ X = random_state.rand(10, 2)
I think you just need 2 features for this test to work. It would make it clearer IMO.
Makes sense, let me push a commit for this. Also, I will add an assertion that the other feature has an importance higher than 0.
Side note on force pushing: I like doing it as well but it seems to mess with links from notifications. Which means people get a notification, click on the link in it and then end up "in the middle of nowhere". So we recommend that people don't force push. The PR gets merged via squashing, so an "ugly" history doesn't matter so much.
@betatim understood. Sorry for the noise! Will keep in mind for future contributions.
LGTM.
Thanks for finding this. I think it would be good to get the eyes of a Cython guru on this, as well as other reviewers.
Reference Issues/PRs
What does this implement/fix? Explain your changes.
I noticed one of our tests failing after upgrading from 1.5 to 1.6 and above. I traced the issue to the tree implementation change in #29458. A cdef constant cannot be initialized in the pxd file. This resulted in FEATURE_THRESHOLD being initialized to 0.0 instead of 1e-7. This PR fixes that by moving the initialization to the pyx file.
Any other comments?
It's my first time contributing to scikit-learn, so please let me know if anything is missing.