Fix FEATURE_THRESHOLD initialization in trees #32259

sercant · 2025-09-23T23:38:53Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

I noticed one of our tests failing after upgrading from 1.5 to 1.6 and above. I traced the issue to the tree implementation change in #29458. A cdef module-level variable cannot be initialized in the pxd file. As a result, FEATURE_THRESHOLD was initialized to 0.0 instead of 1e-7. This PR fixes that by moving the initialization to the pyx file.
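To make the effect of the wrong value concrete, here is a minimal pure-Python sketch. It assumes the splitter treats a feature as constant when its largest value does not exceed its smallest value by more than the threshold; the helper name `is_effectively_constant` is hypothetical and this is not the actual Cython code.

```python
import random

FEATURE_THRESHOLD = 1e-7  # value this PR restores; 0.0 was used by mistake


def is_effectively_constant(values, threshold):
    """Return True if the spread of `values` does not exceed `threshold`.

    Assumed stand-in for the splitter's constant-feature check.
    """
    return max(values) <= min(values) + threshold


rng = random.Random(0)
# Same construction as in the PR's test: uniform values scaled by 1e-7.
almost_constant = [rng.random() * 1e-7 for _ in range(10)]
informative = [rng.random() for _ in range(10)]

# With the intended threshold the tiny feature is skipped ...
skipped_with_fix = is_effectively_constant(almost_constant, FEATURE_THRESHOLD)
# ... with a threshold of 0.0 only exactly constant features are skipped.
skipped_with_bug = is_effectively_constant(almost_constant, 0.0)
informative_skipped = is_effectively_constant(informative, FEATURE_THRESHOLD)
```

Under that assumption, a threshold of 0.0 lets the tree split on the almost constant feature, which is what the added test guards against.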

Any other comments?

It's my first time contributing to scikit-learn, so please let me know if anything is missing.

Implementation
Add the change to docs

github-actions · 2025-09-23T23:39:52Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 97966d1. Link to the linter CI: here

sklearn/tree/tests/test_tree.py

cakedev0

Overall LGTM. Good catch!

If you have some bandwidth to detail the use-case that relied on this "ignore almost constant features" behavior, I would be happy. But that's just for my curiosity ^^

cakedev0 · 2025-09-24T18:59:08Z

sklearn/tree/_partitioner.pyx

+# Mitigate precision differences between 32 bit and 64 bit
+FEATURE_THRESHOLD = 1e-7


I've been working on sklearn/tree/* quite a lot lately, but this comment has remained a mystery to me. It seems you rely on this behavior, so maybe you can detail a bit more what's the purpose of "mitigating precision differences between 32 bit and 64 bit"?

(100% optional though)

Thanks @cakedev0, I am also unsure of the purpose of this threshold. Actually, the test that failed on our side was based on randomly generated fake data. I don't believe we have features with such low min/max values, so I think we don't rely on this behavior either.
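Nobody in the thread confirms what the comment means. As background only, and as one possible reading rather than the documented rationale, the sketch below shows that two values differing by less than roughly 1e-7 relative to 1.0 are distinct as 64-bit floats but collapse to the same 32-bit float. The helper `to_float32` is hypothetical and uses only the standard library.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


a = 1.0
b = 1.0 + 1e-8  # difference well below float32 resolution near 1.0

distinct_in_float64 = a != b
distinct_in_float32 = to_float32(a) != to_float32(b)
```

If features are stored as float32 while thresholds are computed in float64, differences this small cannot be represented reliably, which may be why such features are treated as constant.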

cakedev0 · 2025-09-24T19:02:32Z

sklearn/tree/tests/test_tree.py


+def test_almost_constant_feature():
+    random_state = check_random_state(0)
+    X = random_state.rand(10, 20)


Suggested change

-    X = random_state.rand(10, 20)
+    X = random_state.rand(10, 2)

I think you just need 2 features for this test to work. It would make it clearer IMO.

Makes sense, let me push a commit for this. I will also add an assertion that the other feature has an importance higher than 0.

betatim · 2025-09-25T07:19:19Z

Side note on force pushing: I like doing it as well but it seems to mess with links from notifications. Which means people get a notification, click on the link in it and then end up "in the middle of nowhere". So we recommend that people don't force push. The PR gets merged via squashing, so an "ugly" history doesn't matter so much.

sercant · 2025-09-25T13:03:20Z

Which means people get a notification, click on the link in it and then end up "in the middle of nowhere".

@betatim understood. Sorry for the noise! Will keep in mind for future contributions.

cakedev0

LGTM.

betatim

Thanks for finding this. I think it would be good to get the eyes of a Cython guru on this, as well as other reviewers.

lesteve · 2025-09-26T09:01:45Z

Thanks a lot @sercant for tracking this and fixing it 🙏!

I am definitely not a Cython expert, so maybe @adam2392, in case you have some spare bandwidth, you have insights into why a variable initialization in a pxd file doesn't do anything?

Here is what I double-checked:

- Initialization in the .pxd indeed doesn't do anything (based on printing the value; I was not able to find anything explicit about this in the Cython docs, but I found this forum post from 2012). In an ideal world Cython would not compile this silently ...
- The added test makes sense and fails on main.

I tweaked the test to be parametrized on the tree class and added a comment about the origin of the 1e-7 value.

From a quick git grep, we likely use this pattern in other places in .pxd files, so I guess this would need to be looked at in more detail 😅 (at least in two places according to the regexp below).

❯ git grep -P 'cdef\s+\S+\s+\S+\s+=' **/*.pxd
sklearn/neighbors/_quad_tree.pxd:cdef float EPSILON = 1e-6
sklearn/tree/_utils.pxd:#   cdef float32_t *p = NULL

lesteve · 2025-09-26T09:13:11Z

sklearn/tree/tests/test_tree.py

+def test_almost_constant_feature():
+    # Non regression test for
+    # https://github.com/scikit-learn/scikit-learn/pull/32259
+    # Make sure that almost constant features are discarded.
+    random_state = check_random_state(0)
+    X = random_state.rand(10, 2)
+    X[:, 0] *= 1e-7  # almost constant feature
+    y = random_state.randint(0, 2, (10,))
+    for _, TreeEstimator in ALL_TREES.items():
+        est = TreeEstimator(random_state=0)
+        est.fit(X, y)
+        # the almost constant feature should not be used
+        assert est.feature_importances_[0] == 0
+        # other feature should be used
+        assert est.feature_importances_[1] > 0


@sercant for some reason, I cannot push to your PR branch (maybe you have unticked the box "allow edits by maintainers"?), so I am doing this as a suggestion instead, which you will need to accept. These are the changes to the test I had in mind (parametrize + comment to explain where 1e-7 comes from):

Suggested change

-def test_almost_constant_feature():
-    # Non regression test for
-    # https://github.com/scikit-learn/scikit-learn/pull/32259
-    # Make sure that almost constant features are discarded.
-    random_state = check_random_state(0)
-    X = random_state.rand(10, 2)
-    X[:, 0] *= 1e-7  # almost constant feature
-    y = random_state.randint(0, 2, (10,))
-    for _, TreeEstimator in ALL_TREES.items():
-        est = TreeEstimator(random_state=0)
-        est.fit(X, y)
-        # the almost constant feature should not be used
-        assert est.feature_importances_[0] == 0
-        # other feature should be used
-        assert est.feature_importances_[1] > 0
+@pytest.mark.parametrize("tree_cls", ALL_TREES.values())
+def test_almost_constant_feature(tree_cls):
+    # Non regression test for
+    # https://github.com/scikit-learn/scikit-learn/pull/32259
+    # Make sure that almost constant features are discarded.
+    random_state = check_random_state(0)
+    X = random_state.rand(10, 2)
+    # FEATURE_THRESHOLD=1e-7 is defined in sklearn/tree/_partitioner.pyx but not
+    # accessible from Python
+    feature_threshold = 1e-7
+    X[:, 0] *= feature_threshold  # almost constant feature
+    y = random_state.randint(0, 2, (10,))
+    est = tree_cls(random_state=0)
+    est.fit(X, y)
+    # the almost constant feature should not be used
+    assert est.feature_importances_[0] == 0
+    # other feature should be used
+    assert est.feature_importances_[1] > 0

lesteve · 2025-09-26T09:16:03Z

sklearn/tree/_partitioner.pxd


 # Mitigate precision differences between 32 bit and 64 bit
-cdef float32_t FEATURE_THRESHOLD = 1e-7
+# Note: Has to be initialized in pyx file, not in the pxd file


I am not too sure this comment is really needed. We are probably not going to add a similar comment to each Cython module-level variable declaration, but at the same time the behaviour is very surprising (full disclosure: I am definitely not a Cython expert).

cakedev0 · 2025-09-26T10:52:43Z

@lesteve

From a quick git grep, we likely use this pattern in other places in .pxd files, so I guess this would need to be looked at in more details 😅
sklearn/neighbors/_quad_tree.pxd:cdef float EPSILON = 1e-6
sklearn/tree/_utils.pxd:#   cdef float32_t *p = NULL

(nice idea the git grep + regexp ^^)

I took a quick look:

The first one is probably worth opening an issue for: the code in sklearn/neighbors/_quad_tree.pyx uses this EPSILON, so it likely behaves differently from what was intended.

The second one is ok, it's in a comment:

# safe_realloc(&p, n) resizes the allocation of p to n * sizeof(*p) bytes or
# raises a MemoryError. It never calls free, since that's __dealloc__'s job.
#   cdef float32_t *p = NULL
#   safe_realloc(&p, n)
# is equivalent to p = malloc(n * sizeof(*p)) with error checking.
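The manual triage above (one real hit, one hit inside a comment) could be automated with a small standard-library scanner. This is only a sketch: the regular expression and the name `find_pxd_initializations` are assumptions for illustration, not an existing scikit-learn or Cython tool, and the pattern covers only the simple `cdef <type> <name> = <value>` form.

```python
import re

# Matches lines such as `cdef float EPSILON = 1e-6`.
# Lines starting with `#` never match because `cdef` must come first.
PXD_INIT = re.compile(r"^\s*cdef\s+\S+\s+\w+\s*=(?!=)")


def find_pxd_initializations(text):
    """Return (line number, stripped line) for each suspicious declaration."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if PXD_INIT.match(line):
            hits.append((lineno, line.strip()))
    return hits


# Sample mixing the cases discussed in the thread.
sample = """\
#   cdef float32_t *p = NULL
cdef float EPSILON = 1e-6
cdef float32_t FEATURE_THRESHOLD
"""

hits = find_pxd_initializations(sample)
```

To scan a checkout, the text of each file found with `pathlib.Path("sklearn").rglob("*.pxd")` could be passed to the function.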

lesteve · 2025-09-26T11:26:10Z

@cakedev0 a PR would be more than welcome to fix the _quad_tree.pyx one 🙏! Ideally with a non-regression test, to make sure that we have a test that covers the functionality that this variable is supposed to be useful for.

Note that slightly different regexes may turn up a few more cases:

git grep -C10 -P '\w+\s*=\s*\d' **/*.pxd

Some of them are inside an enum and my guess is that this works fine (to be double-checked ...), but at least one of them is not:

sklearn/utils/_random.pxd:cdef inline uint32_t DEFAULT_SEED = 1

The value is set in the .pyx so I guess in this case it's fine to remove the initialization from the pxd (and I guess also the type declaration in the pyx).

github-actions bot added module:tree cython labels Sep 23, 2025

sercant force-pushed the fix-tree-feature-threshold-regression branch 2 times, most recently from eb15f6b to b3efc12 Compare September 23, 2025 23:51

sercant marked this pull request as ready for review September 23, 2025 23:51

sercant force-pushed the fix-tree-feature-threshold-regression branch from b3efc12 to 76e630e Compare September 23, 2025 23:52

fix FEATURE_THRESHOLD initialization in trees

2a3f7ec

sercant force-pushed the fix-tree-feature-threshold-regression branch from 76e630e to 2a3f7ec Compare September 24, 2025 00:08

betatim reviewed Sep 24, 2025


cakedev0 suggested changes Sep 24, 2025


address review comments

97966d1

sercant requested review from betatim and cakedev0 September 25, 2025 13:04

cakedev0 approved these changes Sep 25, 2025


betatim approved these changes Sep 26, 2025


betatim added the Waiting for Second Reviewer First reviewer is done, need a second one! label Sep 26, 2025

lesteve reviewed Sep 26, 2025


lesteve approved these changes Sep 26, 2025
