FIX Draw indices using sample_weight in Forest #31529

antoinebaker · 2025-06-12T08:19:04Z

Part of #16298. Similar to #31414 (Bagging estimators) but for Forest estimators.

What does this implement/fix? Explain your changes.

When subsampling is activated (bootstrap=True), the sample_weight values are now used as probabilities to draw the indices. Forest estimators then pass the statistical repeated/weighted equivalence test.
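To make the equivalence concrete, here is a minimal NumPy-only sketch (not the code of this PR). It assumes integer weights and compares drawing with weight-proportional probabilities against drawing uniformly from a dataset where each row is repeated sample_weight[i] times. The helper name draw_bootstrap_indices and the toy weights are made up for the illustration.

```python
import numpy as np


def draw_bootstrap_indices(seed, n_samples, n_samples_bootstrap, sample_weight=None):
    """Hypothetical helper: draw indices with probability proportional to weight."""
    rng = np.random.RandomState(seed)
    if sample_weight is None:
        sample_weight = np.ones(n_samples)
    p = np.asarray(sample_weight, dtype=float)
    p = p / p.sum()
    return rng.choice(n_samples, n_samples_bootstrap, replace=True, p=p)


# Weighted dataset: 3 rows with integer weights (toy values).
sample_weight = np.array([1, 3, 0])
# Repeated dataset: row i appears sample_weight[i] times -> rows [0, 1, 1, 1].
repeated_to_original = np.repeat(np.arange(3), sample_weight)

n_draws = 20000
# Weighted draw on the original rows.
weighted = draw_bootstrap_indices(0, 3, n_draws, sample_weight)
# Uniform draw on the repeated rows, mapped back to the original row ids.
uniform = draw_bootstrap_indices(1, len(repeated_to_original), n_draws)
repeated = repeated_to_original[uniform]

# Empirical frequency of each original row under both schemes.
freq_weighted = np.bincount(weighted, minlength=3) / n_draws
freq_repeated = np.bincount(repeated, minlength=3) / n_draws
```

Both frequency vectors should be close to sample_weight / sample_weight.sum(), which is the sense in which the two fits are expected to agree statistically rather than draw by draw.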

Comments

This PR does not fix Forest estimators when bootstrap=False (no subsampling). In that case sample_weight is still passed to the decision trees, and Forest estimators fail the statistical repeated/weighted equivalence test because the individual trees also fail it (probably because of tied splits in decision trees, #23728).

TODO

choose how to generate indices in the sample_weight=None case
fix relative (float) max_samples as done in FIX Draw indices using sample_weight in Bagging #31414
docstrings
how to handle class_weight = "balanced", "balanced_subsample" options
changelog

github-actions · 2025-06-12T08:19:52Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: eae5d27. Link to the linter CI: here

antoinebaker · 2025-06-12T08:53:13Z

sklearn/ensemble/_forest.py

+    if sample_weight is None:
+        sample_weight = np.ones(n_samples)
+    normalized_sample_weight = sample_weight / np.sum(sample_weight)
+    sample_indices = random_instance.choice(
+        n_samples, n_samples_bootstrap, replace=True, p=normalized_sample_weight
    )


I hesitate between two options for dealing with the sample_weight=None case.

1. Convert to all ones.

    if sample_weight is None:
        sample_weight = np.ones(n_samples)
    normalized_sample_weight = sample_weight / np.sum(sample_weight)
    sample_indices = random_instance.choice(
        n_samples, n_samples_bootstrap, replace=True, p=normalized_sample_weight
    )

2. Use the old code path when sample_weight=None.

    if sample_weight is None:
        sample_indices = random_instance.randint(
            0, n_samples, n_samples_bootstrap, dtype=np.int32
        )
    else:
        normalized_sample_weight = sample_weight / np.sum(sample_weight)
        sample_indices = random_instance.choice(
            n_samples,
            n_samples_bootstrap,
            replace=True,
            p=normalized_sample_weight,
        )

The benefit of option 2 is that the code stays backward compatible when sample_weight=None: this PR and main give the same fit for a given random_state.

The benefit of option 1 is that sample_weight=None and sample_weight=np.ones(n_samples) give the same fit for a given random_state.
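The trade-off can be checked with a small standalone sketch. The two functions below wrap the snippets above; the names draw_option_1 and draw_option_2 are made up for this illustration and do not exist in sklearn.

```python
import numpy as np


def draw_option_1(random_instance, n_samples, n_samples_bootstrap, sample_weight=None):
    # Option 1: None is converted to all ones, then choice() is always used.
    if sample_weight is None:
        sample_weight = np.ones(n_samples)
    p = sample_weight / np.sum(sample_weight)
    return random_instance.choice(n_samples, n_samples_bootstrap, replace=True, p=p)


def draw_option_2(random_instance, n_samples, n_samples_bootstrap, sample_weight=None):
    # Option 2: None keeps the old randint() code path.
    if sample_weight is None:
        return random_instance.randint(
            0, n_samples, n_samples_bootstrap, dtype=np.int32
        )
    p = sample_weight / np.sum(sample_weight)
    return random_instance.choice(n_samples, n_samples_bootstrap, replace=True, p=p)


n = 10
ones = np.ones(n)

# Option 1: None and all-ones weights consume the RNG identically.
opt1_none = draw_option_1(np.random.RandomState(0), n, n)
opt1_ones = draw_option_1(np.random.RandomState(0), n, n, ones)

# Option 2: None reproduces the legacy randint() draw for the same seed.
opt2_none = draw_option_2(np.random.RandomState(0), n, n)
legacy = np.random.RandomState(0).randint(0, n, n, dtype=np.int32)
```

Here opt1_none matches opt1_ones, and opt2_none matches legacy. Neither option gives both guarantees at once, since randint and choice are separate code paths.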

antoinebaker · 2025-06-12T08:56:40Z

sklearn/ensemble/_forest.py

+            # NOTE: "balanced_subsample" option is ignored, treated as "balanced"
+            class_weight = self.class_weight
+            if class_weight == "balanced_subsample":
+                class_weight = "balanced"
+            expanded_class_weight = compute_sample_weight(class_weight, y_original)


Here I choose to simply ignore the "balanced_subsample" option and treat it as the "balanced" case.

In the "balanced" case, class_weight is set to n_samples / (n_classes * np.bincount(y)).

EDIT: we should probably compute the class_weight using the sample_weight as in #30057
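A rough sketch of what a weight-aware "balanced" computation could look like, written with NumPy only. This is an assumption about the general idea, not the implementation of #30057: class counts are replaced by per-class weight totals and n_samples by the total weight. The helper balanced_class_weight is hypothetical, and it assumes labels encoded as 0..K-1 with a nonzero total weight in every class.

```python
import numpy as np


def balanced_class_weight(y, sample_weight=None):
    """Hypothetical helper: 'balanced' class weights, optionally weight-aware."""
    y = np.asarray(y)
    n_classes = int(y.max()) + 1  # assumes labels are 0..K-1
    if sample_weight is None:
        sample_weight = np.ones(len(y))
    # Per-class total weight; equals np.bincount(y) when all weights are 1.
    class_totals = np.bincount(y, weights=sample_weight, minlength=n_classes)
    return np.sum(sample_weight) / (n_classes * class_totals)


y = np.array([0, 0, 0, 1])

# Unweighted: n_samples / (n_classes * np.bincount(y)) = 4 / (2 * [3, 1]).
unweighted = balanced_class_weight(y)

# Weighted: the single class-1 row carries weight 3, so both classes
# have the same total weight and the class weights come out equal.
weighted = balanced_class_weight(y, np.array([1.0, 1.0, 1.0, 3.0]))
```

With unit weights this reduces to the formula quoted above.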

antoinebaker · 2025-06-12T09:14:29Z

The forest estimators now pass the statistical repeated/weighted equivalence test, for example

antoinebaker · 2025-06-16T07:49:21Z

Relative (float) max_samples, with the new meaning of drawing max_samples * sw_sum indices as done in #31414, also passes the statistical repeated/weighted equivalence test.
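A short sketch of why scaling by sw_sum lines up with the repeated dataset, assuming integer weights. The helper name and the rounding rule (truncate, with a floor of 1) are assumptions made for the illustration; the exact rule used in the PR is not shown in this thread.

```python
import numpy as np


def n_samples_bootstrap_from_weights(max_samples, sample_weight):
    """Hypothetical helper: number of indices to draw for a float max_samples."""
    sw_sum = float(np.sum(sample_weight))
    # Assumed rounding: truncate, but always draw at least one index.
    return max(int(max_samples * sw_sum), 1)


sample_weight = np.array([2, 0, 3, 1])

# Repeated dataset: row i appears sample_weight[i] times, so it has sw_sum rows.
n_repeated = len(np.repeat(np.arange(4), sample_weight))

# Weighted fit on 4 rows versus unweighted fit on the repeated rows.
n_weighted = n_samples_bootstrap_from_weights(0.5, sample_weight)
n_from_repeated = n_samples_bootstrap_from_weights(0.5, np.ones(n_repeated))
```

Both fits draw the same number of indices here, whereas max_samples * n_samples on the 4 weighted rows would give a different count than on the 6 repeated rows.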

use sample_weight in choice

9458a1c

github-actions bot added the module:ensemble label Jun 12, 2025

antoinebaker added 2 commits June 13, 2025 17:40

use old code path

2f30d7d

relative max_samples

a55643b

antoinebaker and others added 4 commits June 19, 2025 14:50

adapt tests

bcad08a

Merge branch 'main' into random_forest_sample_weight

cce2060

changelog

f77059e

add relative max_sample test

eae5d27
