
Conversation

sercant

@sercant sercant commented Sep 23, 2025

Reference Issues/PRs

What does this implement/fix? Explain your changes.

I noticed one of our tests failing after upgrading from 1.5 to 1.6 and above. I traced the issue to the tree implementation change in #29458: the initialization of a cdef constant cannot be done in a .pxd file, which resulted in FEATURE_THRESHOLD being initialized to 0.0 instead of 1e-7. This PR fixes that by moving the initialization to the .pyx file.
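For context, here is a rough Python sketch of the check this constant guards. It is a simplified stand-in for the Cython code in sklearn/tree/_partitioner.pyx; `is_effectively_constant` is an illustrative name, not the actual function:

```python
# The intended constant; the regression left it at 0.0 because the
# initializer lived in the .pxd file, where Cython silently ignores it.
FEATURE_THRESHOLD = 1e-7

def is_effectively_constant(feature_values, threshold=FEATURE_THRESHOLD):
    # The splitter skips features whose value range does not exceed the
    # threshold; with threshold == 0.0, only exactly constant features
    # were skipped, so near-constant features became split candidates.
    return max(feature_values) - min(feature_values) <= threshold

print(is_effectively_constant([0.0, 5e-8, 1e-8]))   # True: skipped with 1e-7
print(is_effectively_constant([0.0, 5e-8], 0.0))    # False: considered at 0.0
```

With the threshold stuck at 0.0, trees could split on features that 1.5 treated as constant, which is what made previously passing tests fail.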

Any other comments?

It's my first time contributing to scikit-learn, so please let me know if anything is missing.

  • Implementation
  • Add the change to docs


github-actions bot commented Sep 23, 2025

βœ”οΈ Linting Passed

All linting checks passed. Your pull request is in excellent shape! β˜€οΈ

Generated for commit: 97966d1. Link to the linter CI: here

@sercant sercant force-pushed the fix-tree-feature-threshold-regression branch 2 times, most recently from eb15f6b to b3efc12 on September 23, 2025 23:51
@sercant sercant marked this pull request as ready for review September 23, 2025 23:51
@sercant sercant force-pushed the fix-tree-feature-threshold-regression branch from b3efc12 to 76e630e on September 23, 2025 23:52
@sercant sercant force-pushed the fix-tree-feature-threshold-regression branch from 76e630e to 2a3f7ec on September 24, 2025 00:08
Contributor

@cakedev0 cakedev0 left a comment


Overall LGTM. Good catch!

If you have some bandwidth to detail the use case that relied on this "ignore almost constant features" behavior, I'd be happy to hear it. But that's just for my curiosity ^^

Comment on lines +21 to +22
# Mitigate precision differences between 32 bit and 64 bit
FEATURE_THRESHOLD = 1e-7
Contributor


I've been working on sklearn/tree/* quite a lot lately, but this comment has remained a mystery to me. It seems you rely on this behavior, so maybe you can explain a bit more what the purpose of "mitigating precision differences between 32 bit and 64 bit" is?

(100% optional though)

Author


Thanks @cakedev0, I am also unsure of the purpose of this threshold. The test that failed on our side was actually based on randomly generated fake data; I don't believe we have features with such low min/max values, so I think we don't rely on this behavior either.


def test_almost_constant_feature():
    random_state = check_random_state(0)
    X = random_state.rand(10, 20)
Contributor


Suggested change

    -    X = random_state.rand(10, 20)
    +    X = random_state.rand(10, 2)

I think you just need 2 features for this test to work. It would make it clearer IMO.

Author

@sercant sercant Sep 25, 2025


Makes sense, let me push a commit for this. I will also add an assertion that the other feature has an importance higher than 0.

@betatim
Member

betatim commented Sep 25, 2025

Side note on force pushing: I like doing it as well but it seems to mess with links from notifications. Which means people get a notification, click on the link in it and then end up "in the middle of nowhere". So we recommend that people don't force push. The PR gets merged via squashing, so an "ugly" history doesn't matter so much.

@sercant
Author

sercant commented Sep 25, 2025

Which means people get a notification, click on the link in it and then end up "in the middle of nowhere".

@betatim understood. Sorry for the noise! Will keep in mind for future contributions.

@sercant sercant requested review from betatim and cakedev0 September 25, 2025 13:04
Contributor

@cakedev0 cakedev0 left a comment


LGTM.

Member

@betatim betatim left a comment


Thanks for finding this. I think it would be good to get the eyes of a Cython guru on this, as well as other reviewers.

@betatim betatim added the Waiting for Second Reviewer First reviewer is done, need a second one! label Sep 26, 2025
@lesteve
Copy link
Member

lesteve commented Sep 26, 2025

Thanks a lot @sercant for tracking this down and fixing it 🙏!

I am definitely not a Cython expert, so maybe @adam2392, in case you have some spare bandwidth and insight into why a variable initialization in a .pxd doesn't do anything?

Here is what I double-checked:

  • initialization in a .pxd indeed doesn't do anything (based on printing the value; I was not able to find anything explicit about this in the Cython docs, but I found this forum post from 2012). In an ideal world, Cython would not compile this silently ...
  • the added test makes sense and fails on main

I tweaked the test to be parametrized on the tree class and added a comment about the origin of the 1e-7 value.

From a quick git grep, we likely use this pattern in other places in .pxd files, so I guess this would need to be looked at in more detail 😅 (at least in two places, according to the regexp below).

❯ git grep -P 'cdef\s+\S+\s+\S+\s+=' **/*.pxd
sklearn/neighbors/_quad_tree.pxd:cdef float EPSILON = 1e-6
sklearn/tree/_utils.pxd:#   cdef float32_t *p = NULL

Comment on lines +1261 to +1275
def test_almost_constant_feature():
    # Non regression test for
    # https://github.com/scikit-learn/scikit-learn/pull/32259
    # Make sure that almost constant features are discarded.
    random_state = check_random_state(0)
    X = random_state.rand(10, 2)
    X[:, 0] *= 1e-7  # almost constant feature
    y = random_state.randint(0, 2, (10,))
    for _, TreeEstimator in ALL_TREES.items():
        est = TreeEstimator(random_state=0)
        est.fit(X, y)
        # the almost constant feature should not be used
        assert est.feature_importances_[0] == 0
        # other feature should be used
        assert est.feature_importances_[1] > 0
Member

@lesteve lesteve Sep 26, 2025


@sercant for some reason I cannot push to your PR branch (maybe you have unticked the "allow edits by maintainers" box?), so I'm doing this as a suggestion instead, which you will need to accept. These are the changes to the test I had in mind (parametrize + comment to explain where 1e-7 comes from):

Suggested change

    def test_almost_constant_feature():
        # Non regression test for
        # https://github.com/scikit-learn/scikit-learn/pull/32259
        # Make sure that almost constant features are discarded.
        random_state = check_random_state(0)
        X = random_state.rand(10, 2)
        X[:, 0] *= 1e-7  # almost constant feature
        y = random_state.randint(0, 2, (10,))
        for _, TreeEstimator in ALL_TREES.items():
            est = TreeEstimator(random_state=0)
            est.fit(X, y)
            # the almost constant feature should not be used
            assert est.feature_importances_[0] == 0
            # other feature should be used
            assert est.feature_importances_[1] > 0

becomes:

    @pytest.mark.parametrize("tree_cls", ALL_TREES.values())
    def test_almost_constant_feature(tree_cls):
        # Non regression test for
        # https://github.com/scikit-learn/scikit-learn/pull/32259
        # Make sure that almost constant features are discarded.
        random_state = check_random_state(0)
        X = random_state.rand(10, 2)
        # FEATURE_THRESHOLD = 1e-7 is defined in sklearn/tree/_partitioner.pyx
        # but is not accessible from Python
        feature_threshold = 1e-7
        X[:, 0] *= feature_threshold  # almost constant feature
        y = random_state.randint(0, 2, (10,))
        est = tree_cls(random_state=0)
        est.fit(X, y)
        # the almost constant feature should not be used
        assert est.feature_importances_[0] == 0
        # other feature should be used
        assert est.feature_importances_[1] > 0


# Mitigate precision differences between 32 bit and 64 bit
cdef float32_t FEATURE_THRESHOLD = 1e-7
# Note: Has to be initialized in pyx file, not in the pxd file
Member

@lesteve lesteve Sep 26, 2025


I am not too sure this comment is really needed. We are probably not going to add a similar comment to each Cython module-level variable declaration, but at the same time the behaviour is very surprising (full disclosure: I am definitely not a Cython expert).

@cakedev0
Contributor

@lesteve

From a quick git grep, we likely use this pattern in other places in .pxd files, so I guess this would need to be looked at in more detail 😅

sklearn/neighbors/_quad_tree.pxd:cdef float EPSILON = 1e-6
sklearn/tree/_utils.pxd:#   cdef float32_t *p = NULL

(nice idea the git grep + regexp ^^)

I took a quick look:

The first one is probably worth opening an issue for: the code in sklearn/neighbors/_quad_tree.pyx uses this EPSILON, so it likely behaves differently than what was intended.

The second one is ok, it's in a comment:

# safe_realloc(&p, n) resizes the allocation of p to n * sizeof(*p) bytes or
# raises a MemoryError. It never calls free, since that's __dealloc__'s job.
#   cdef float32_t *p = NULL
#   safe_realloc(&p, n)
# is equivalent to p = malloc(n * sizeof(*p)) with error checking.

@lesteve
Member

lesteve commented Sep 26, 2025

@cakedev0 a PR would be more than welcome to fix the _quad_tree.pyx one 🙏! Ideally with a non-regression test, to make sure we have a test that covers the functionality this variable is supposed to support.

Note there may be a few more, found with a slightly different regex:

git grep -C10 -P '\w+\s*=\s*\d' **/*.pxd

Some of them are inside enums and my guess is that those work fine (to be double-checked ...), but at least one of them is not:

sklearn/utils/_random.pxd:cdef inline uint32_t DEFAULT_SEED = 1

The value is set in the .pyx so I guess in this case it's fine to remove the initialization from the pxd (and I guess also the type declaration in the pyx).
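A minimal sketch of that split, assuming the layout described above (declaration without an initializer in the .pxd, the actual value assigned in the .pyx; file paths and names follow the grep output, but treat the exact lines as illustrative):

```cython
# sklearn/utils/_random.pxd -- declaration only; an initializer here
# would be silently ignored by Cython
cdef uint32_t DEFAULT_SEED

# sklearn/utils/_random.pyx -- the definition; no repeated type
# declaration, just the assignment
DEFAULT_SEED = 1
```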

Labels
cython module:tree Waiting for Second Reviewer First reviewer is done, need a second one!
4 participants