Added sample weight handling to BinMapper under HGBT #29641

snath-xoc · 2024-08-08T21:46:57Z

Fixes #29640. See also #27117.

Passes sample_weight to _BinMapper, which forwards it to _find_binning_thresholds, where the bin midpoints are calculated using a weighted percentile. The sample weights are also passed to the rng.choice() subsampling in _BinMapper when there are more than 2e5 samples.
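To illustrate the subsampling step described above, here is a minimal sketch of weight-proportional subsampling with `rng.choice`. The helper name `weighted_subsample` and the choice of `replace=True` are assumptions made for this illustration and are not taken from the PR's diff.

```python
import numpy as np


def weighted_subsample(X, sample_weight, subsample, seed=0):
    """Hypothetical helper: draw rows with probability proportional to weight."""
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    if n_samples <= subsample:
        # Small inputs are not subsampled; weights must be handled downstream.
        return X, sample_weight
    probabilities = None
    if sample_weight is not None:
        probabilities = sample_weight / sample_weight.sum()
    # Assumption: sampling with replacement, so zero-weight rows are never drawn.
    indices = rng.choice(n_samples, size=subsample, replace=True, p=probabilities)
    # The weights are already reflected in the draw, so they are dropped here.
    return X[indices], None


X_demo = np.arange(1000, dtype=np.float64).reshape(-1, 1)
sw_demo = (np.arange(1000) % 2).astype(np.float64)  # even rows get weight 0
X_sub, sw_sub = weighted_subsample(X_demo, sw_demo, subsample=100)
```

After such a draw the thresholds can be computed on `X_sub` without weights, which matches the "set to None" behaviour listed under the major changes.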

NOTE: when the number of distinct values is smaller than n_bins, the best workaround was to set the bin thresholds to the midpoints between distinct values. In the future it may be worth getting rid of this altogether, at the risk of getting an inhomogeneous array from the weighted percentile. We will need to agree on the best method of trimming.

Major changes proposed:

Subsampling with weights, after which sample_weight is no longer propagated in _BinMapper (set to None)
Tests added for sample_weight invariance of _BinMapper in the deterministic case
Some tests failed with the new changes and n_samples had to be increased, e.g. test_interaction_cst_numerically
Some tests fail in special cases when sample weights are passed but distinct values < max_bins. What to do here remains to be decided
Statistical test added in the comments below

NOTE: this also allows HGBT to pass weighted tests for more than 256 samples (but still less than 2e6)

TO DO:

Update tests under test_gradient_boosting so that sample weight is passed as positional argument
Fix test failures; something fishy changed such that e.g. test_interaction_cst_numerically is not working
Modified tests to have a larger number of samples, otherwise the expected results are not attained (i.e. the r2 value threshold etc. is not reached). I was expecting this to be the other way around, so I am not sure if I set off another bug.
When there are fewer distinct values than max_bins and sample weights are given, we get a discrepancy between the weighted and repeated cases under special conditions (see test_zero_sample_weights_classification in test_gradient_boosting.py). Need to see what to do here.
Further test failure when max_bins is specified as > distinct values in the weighted and repeated cases
Check that sample weight invariance holds in the deterministic case (test added in test_binning)
For the case n_samples > subsample, conduct a statistical analysis to check that the repeated/reweighted equivalence holds for the binning procedure in expectation, similar to the issue "Incorrect sample weight handling in KBinsDiscretizer" #29906. Run the analysis for many different values of random_state and check that the mean bin edges match.

github-actions · 2024-08-08T21:48:23Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 2dc5e0a.

ogrisel

I did not analyse the test failures yet but here is some early feedback.

For the case n_samples > subsample we would need to conduct a statistical analysis to check that the repeated/reweighted equivalence holds for the binning procedure in expectation.

Once the tests pass for the deterministic case, we should conduct such an analysis (e.g. using a side notebook, not included in the repo, where we rerun the binning for many different values of random_state and then check that the mean bin edges match).

Please add a TODO item to the description of this PR not to forget about this.

sklearn/ensemble/_hist_gradient_boosting/binning.py

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py

snath-xoc · 2025-03-21T13:29:43Z

Statistical tests were conducted for n_samples>subsample (subsample=int(2e5)) using the following code:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import kstest

from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

BONFERRONI_CORRECTION = 1  # To adjust later according to agreed test dim.

rng = np.random.RandomState(0)

n_samples = int(3e5)
X = rng.randint(0, 30, size=(n_samples, 3))
# Use random integers (including zero) as weights.
sw = rng.randint(0, 5, size=n_samples)

X_repeated = np.repeat(X, sw, axis=0)
assert len(X_repeated) > int(2e5)

bin_thresholds_weighted = []
bin_thresholds_repeated = []
for seed in np.arange(100):
    est_weighted = _BinMapper(n_bins=6, random_state=seed).fit(X, sample_weight=sw)
    # A different seed is used so that the two subsamples are drawn independently.
    est_repeated = _BinMapper(n_bins=6, random_state=seed + 500).fit(
        X_repeated, sample_weight=None
    )
    bin_thresholds_weighted.append(est_weighted.bin_thresholds_)
    bin_thresholds_repeated.append(est_repeated.bin_thresholds_)

# Stack into arrays indexed as [seed, feature, threshold].
bin_thresholds_weighted = np.asarray(bin_thresholds_weighted)
bin_thresholds_repeated = np.asarray(bin_thresholds_repeated)

fig, axs = plt.subplots(3, 4, figsize=(14, 12))

# One row per feature, one column per bin threshold.
for i, ax in enumerate(axs.flatten()):
    feature_idx, threshold_idx = divmod(i, 4)
    weighted = bin_thresholds_weighted[:, feature_idx, threshold_idx]
    repeated = bin_thresholds_repeated[:, feature_idx, threshold_idx]
    ax.hist(weighted)
    ax.hist(repeated, alpha=0.5)
    # Two-sample Kolmogorov-Smirnov test between both threshold distributions.
    pval = kstest(weighted, repeated).pvalue
    status = "failed" if pval < 0.05 * BONFERRONI_CORRECTION else "passed"
    ax.set_title(f"p-value: {pval:.4f}, {status}")

The output is as follows:

snath-xoc · 2025-03-21T13:40:08Z

The ARM lint test is still failing due to the test_find_binning_thresholds_small_regular_data assertion error but I can't reproduce it locally.

ogrisel

Modified tests to have a larger number of samples, otherwise the expected results are not attained (i.e. the r2 value threshold etc. is not reached). I was expecting this to be the other way around, so I am not sure if I set off another bug.

This is annoying, but it's probably caused by the fact that we switched from np.percentile(..., method="linear") to np.percentile(..., method="averaged_inverted_cdf") in the sample_weight=None case. Assuming this change is really the cause and since it's necessary to fix the weight/repetition equivalence, we might consider this behavior change as a bug fix. It should be clearly documented as such in the changelog, possibly with a dedicated note in the "Changed models" section of the release notes to better warn the users about this change.
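As a small illustration of why the percentile method matters for the weight/repetition equivalence, the sketch below compares the two methods on a toy array and on the same array with every value repeated twice. It assumes NumPy >= 1.22 (where the `method` keyword is available) and is not taken from the PR.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Repeating every sample twice is equivalent to giving it a weight of 2.
a_repeated = np.repeat(a, 2)

results = {}
for method in ("linear", "averaged_inverted_cdf"):
    original = np.percentile(a, 30, method=method)
    repeated = np.percentile(a_repeated, 30, method=method)
    results[method] = (original, repeated)
    print(f"{method}: original={original}, repeated={repeated}")
```

With "linear" the interpolation position depends on the number of samples, so the repeated array gives a different value on this example, whereas "averaged_inverted_cdf" only depends on the empirical CDF and returns the same value for both inputs.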

Another concern is that the _averaged_weighted_percentile implementation is currently very naive and will cause a performance regression whenever the user passes sample_weight != None. We might want to postpone the final merge of this PR until #30945 has been reviewed and merged.

ogrisel · 2025-03-21T14:07:41Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

+        func = _averaged_weighted_percentile
+        if len(distinct_values) <= max_bins:
+            max_bins = len(distinct_values)
+            func = _weighted_percentile


Why would _averaged_weighted_percentile not work when len(distinct_values) <= max_bins?

>>> import numpy as np
>>> from sklearn.utils.stats import _weighted_percentile, _averaged_weighted_percentile
>>> a = np.random.randint(0, 10, size=10000).astype(np.float64)
>>> max_bins = 30
>>> max_bins > np.unique(a).shape[0]
True
>>> np.array(
...     [_averaged_weighted_percentile(a, np.ones_like(a), percentile) for percentile in percentiles]
... )
array([0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4., 5.,
       5., 5., 6., 6., 6., 7., 7., 7., 8., 8., 8., 9., 9., 9.])
>>> np.array(
...     [_weighted_percentile(a, np.ones_like(a), percentile) for percentile in percentiles]
... )
array([0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4., 5.,
       5., 5., 6., 6., 6., 7., 7., 7., 8., 8., 8., 9., 9., 9.])

When distinct values are less than max_bins and sample weights are given we get a discrepancy between weighted and repeated under special cases (see test_zero_sample_weights_classification under test_gradient_boosting.py). Need to see what to do here.

The manual midpoints between distinct values in the weighted case are probably responsible for this discrepancy.
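One way such a discrepancy can arise is sketched below: if midpoints are taken between all distinct values regardless of weight, a value whose weight is zero still contributes a threshold, while it disappears entirely from the repeated data. The helper `midpoint_thresholds` is hypothetical and only mimics the midpoint rule discussed here; it is not the PR's code.

```python
import numpy as np


def midpoint_thresholds(col):
    """Midpoints between consecutive distinct values of a 1D column."""
    distinct_values = np.unique(col).astype(np.float64)
    return 0.5 * (distinct_values[:-1] + distinct_values[1:])


col = np.array([0.0, 1.0, 2.0])
sample_weight = np.array([1, 0, 1])

# Repeating rows by their weight drops the zero-weight value 1.0 entirely.
col_repeated = np.repeat(col, sample_weight)

naive_weighted = midpoint_thresholds(col)  # ignores the weights
repeated = midpoint_thresholds(col_repeated)
# Masking zero-weight rows first restores the equivalence on this example.
masked_weighted = midpoint_thresholds(col[sample_weight > 0])
```

Here the naive weighted path yields two thresholds and the repeated path only one, which is the kind of mismatch a zero-weight test would expose.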

We should probably explore whether it's possible to post-process the output of _averaged_weighted_percentile to trim duplicated thresholds. If the number of trimmed thresholds is lower than max_bins, it might be possible to post-process further to recover thresholds that match the unweighted case by shifting the thresholds to use 0.5 midpoints.

sklearn/ensemble/_hist_gradient_boosting/binning.py

sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

ogrisel · 2025-03-21T14:37:56Z

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

@@ -198,6 +255,25 @@ def test_bin_mapper_repeated_values_invariance(n_distinct):
    assert_array_equal(binned_1, binned_2)


+def test_binmapper_weighted_vs_repeated_equivalence(global_random_seed):


Could you expand this test (maybe with @pytest.mark.parametrize) to include a case where n_bins > np.unique(X[:, some_arbitrary_feature_idx]).shape[0]?

It's expected to fail but we need to solve this case anyway.

Yes, it fails, but it is good to have this test in.

sklearn/ensemble/_hist_gradient_boosting/binning.py

ogrisel · 2025-03-21T15:26:56Z

sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py

+    # and return different values as compared to when using weights
+    # since the extra 1 breakes the tie. For now the test is slightly
+    # modified, in future we need to decide what to do when
+    # distinct_values<=max_bins but sample weights are supplied


Please do not comment out this case since we need to find a way to make this test pass.

Uncommented; the test now fails. It may also be worth trying a strategy such as

midpoints = np.array([_averaged_weighted_percentile ...
if np.unique(midpoints).shape[0] != midpoints.shape[0]:
    midpoints, counts = np.unique(midpoints, return_counts=True)

and doing some calculations based on that during trimming. I need to think a bit more about it.

UPDATE: test passes now after some fixes

sklearn/ensemble/_hist_gradient_boosting/binning.py

snath-xoc · 2025-03-28T13:16:00Z

The only remaining failing test is the weighted_vs_repeated equivalence test under test_binning when the max_bins>distinct values. I think this would need further changes to the _averaged_weighted_percentile implementation as mentioned before.

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

ogrisel · 2025-03-28T13:37:28Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

-        midpoints = np.percentile(col_data, percentiles, method="midpoint").astype(
-            X_DTYPE
+
+        midpoints = np.array(


I think this variable should be renamed bin_thresholds since they are not midpoints when n_unique > max_bins.

Bin thresholds are set to mid_points when there are fewer observed unique feature values than number of bins.

ogrisel

Some more feedback.

sklearn/ensemble/_hist_gradient_boosting/binning.py

ogrisel · 2025-03-28T13:39:39Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

    else:
        # We could compute approximate midpoint percentiles using the output of
        # np.unique(col_data, return_counts) instead but this is more
        # work and the performance benefit will be limited because we
        # work on a fixed-size subsample of the full data.
+        # TO DO: check if there is a better way to implement this


nitpick: the convention is to use TODO which is easier to grep consistently and often syntax highlighted by IDEs.

Suggested change

-        # TO DO: check if there is a better way to implement this
+        # TODO: check if there is a better way to implement this

ogrisel · 2025-03-28T13:40:55Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

+        else:
+            percentiles = np.linspace(0, 100, num=max_bins + 1)
+            percentiles = percentiles[1:-1]
+            midpoints = np.percentile(


Same here, those are no longer midpoints since we use method="averaged_inverted_cdf".

ogrisel · 2025-03-28T13:44:05Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

+        # by 0.5 if the unique points are less than distinct
+        # values
+        if np.unique(midpoints).shape[0] != midpoints.shape[0]:
+            midpoints = np.unique(midpoints)


We should avoid calling np.unique(midpoints) repeatedly. At this stage, the input array has max_bins values so it's not that costly, but still, let's store this in a local variable.

ogrisel · 2025-03-28T14:21:40Z

sklearn/ensemble/_hist_gradient_boosting/binning.py

+        if np.unique(midpoints).shape[0] != midpoints.shape[0]:
+            midpoints = np.unique(midpoints)
+        if len(distinct_values) <= len(midpoints):
+            midpoints *= 0.5


Actually, this cannot correctly match the midpoint-based bin threshold of the sample_weight=None side of the condition.

Maybe we should just abandon the idea of trying to set midpoint-based thresholds for the len(distinct_values) < max_bins case.

Instead, we should just call np.percentile(..., method="averaged_inverted_cdf") for the unweighted case and _averaged_weighted_percentile() for the weighted case and trim the redundant thresholds a posteriori.

The code will be simpler and the equivalence guaranteed by construction in all the cases.
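A minimal sketch of the "compute percentiles, then trim" idea for the unweighted side, assuming NumPy >= 1.22. The helper name `trimmed_thresholds` is hypothetical, and integer weights are emulated here by repeating samples rather than by calling the private _averaged_weighted_percentile helper.

```python
import numpy as np


def trimmed_thresholds(col, max_bins):
    """Percentile thresholds with duplicates removed (unweighted case)."""
    percentiles = np.linspace(0, 100, num=max_bins + 1)[1:-1]
    thresholds = np.percentile(col, percentiles, method="averaged_inverted_cdf")
    # np.unique sorts and deduplicates; it is called once and its result reused.
    return np.unique(thresholds)


col = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 5.0])
col_repeated = np.repeat(col, 3)  # every sample carries an integer weight of 3

thresholds = trimmed_thresholds(col, max_bins=6)
thresholds_repeated = trimmed_thresholds(col_repeated, max_bins=6)
```

Because the number of surviving thresholds can differ from one feature to the next, the per-feature results would have to be kept in a list of arrays rather than stacked into a single 2D array.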

Currently the PR still uses midpoints. I tested locally without the if len(distinct_values) <= len(midpoints) branch, and the following tests fail, amongst others:

test_bin_mapper_repeated_values_invariance (we seem to get one more bin value than expected, so the shapes don't match)

test_find_binning_thresholds_small_regular_data: the last test, which expects midpoints, now fails (we could instead just change the expected output here)

test_binmapper_weighted_vs_repeated_equivalence[42-None]: we get an inhomogeneous array since each column returns a different number of unique percentile values

test_bin_mapper_identity_small: I need to dig into this one more; I would not expect it to fail, but it may be due to the np.percentile calculation

For some of these it may be fine to change the expected result, but I am not sure what to do when non-homogeneous arrays are returned. Let me know your thoughts.

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

sklearn/ensemble/_hist_gradient_boosting/binning.py

snath-xoc · 2025-04-25T19:46:11Z

I can reproduce the test_sample_weight_effect failure locally, but it only happens for specific seeds, e.g. global_random_seed=3. This may be due to some larger issue, but it is the remaining test failure. @ogrisel and @antoinebaker, do you have any opinions?

Never mind: it seems sensitive to the sample size; I have lowered it to 200 now.

snath-xoc added 4 commits August 2, 2024 15:52

add weighted percentile to binning threshold calculation in HGBT

43da2a6

add weighted percentile to binning threshold calculation in HGBT

b7b1ce2

add sample_weight pass through to binning

d21a19c

fix _find_binning_thresholds and BinMapper sample weight handling

6b3f8ea

github-actions bot added the module:ensemble label Aug 8, 2024

snath-xoc added 4 commits August 8, 2024 17:49

minor fix

cf1149b

updated sampe weight handling in test_binning

fbd0365

updated sampe weight handling in test_binning

22c4c11

[all random seeds]

90665f1

ogrisel reviewed Aug 12, 2024

ogrisel mentioned this pull request Oct 25, 2024

List of estimators with known incorrect handling of sample_weight #16298

snath-xoc and others added 11 commits March 13, 2025 21:43

Merge branch 'main' into HGBT_bin_weights

da8344e

Update sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_b…

00ead80

…oosting.py Co-authored-by: Olivier Grisel <[email protected]>

fix _find_binning_thresholds and BinMapper sample weight handling

2d74df6

[all random seeds]

398598e

comment

8d9544f

add None handling to _find_binning_threshold

3ffcbca

change sampling prob to None when no sample weights passed

0758886

fix distinct value handling and modify tests for convergence

4f58074

Merge branch 'main' into HGBT_bin_weights

b7ce7a9

update tests

664d9e4

Merge branch 'main' into HGBT_bin_weights

47e6b65

snath-xoc marked this pull request as ready for review March 21, 2025 13:32

ogrisel reviewed Mar 21, 2025

ogrisel reviewed Mar 21, 2025

ogrisel reviewed Mar 21, 2025

ogrisel reviewed Mar 27, 2025

snath-xoc and others added 5 commits March 27, 2025 18:33

modified tests

467ea6d

Merge branch 'main' into HGBT_bin_weights

8379c28

Update sklearn/ensemble/_hist_gradient_boosting/binning.py

0dce741

Co-authored-by: Olivier Grisel <[email protected]>

fix tests and add further review comments

edd8210

fix minor error

8ce5c2d

ogrisel reviewed Mar 28, 2025

snath-xoc and others added 4 commits March 28, 2025 18:37

Update sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py

92dbbdf

Co-authored-by: Olivier Grisel <[email protected]>

Apply suggestions from code review

b790d11

Co-authored-by: Olivier Grisel <[email protected]>

change midpoints name and handling of distinc_values<max_bins

e255e46

Merge branch 'main' into HGBT_bin_weights

2584e43

ogrisel reviewed Apr 10, 2025

ogrisel reviewed Apr 10, 2025

snath-xoc and others added 5 commits April 10, 2025 16:09

Apply suggestions from code review

8573b61

Co-authored-by: Olivier Grisel <[email protected]>

Merge branch 'main' into HGBT_bin_weights

0b24d8a

remove redundant comment

d3d7ae8

add handling of null waited values for distinct value edge case

2935cce

Merge branch 'main' into HGBT_bin_weights

759cf12

snath-xoc commented Apr 10, 2025

snath-xoc commented Apr 10, 2025

snath-xoc and others added 3 commits April 10, 2025 20:09

Apply suggestions from code review

34b97f4

[all random seeds] test_sample_weight_effect

14180d2

Merge branch 'main' into HGBT_bin_weights

fc7556f

snath-xoc added 2 commits April 25, 2025 20:55

fix sample size in test_sample_weight_effect

546648f

fix documentation

2dc5e0a
