Added sample weight handling to BinMapper under HGBT #29641


Open · wants to merge 48 commits into base: main

Conversation

@snath-xoc snath-xoc (Contributor) commented Aug 8, 2024

Fixes #29640. See also #27117.

Sample weight is now handled within BinMapper and passed on to _find_binning_thresholds, where the bin midpoints are calculated using a weighted percentile. Sample weights are also passed to the rng.choice() subsampling in BinMapper for sample sizes larger than 2e5.

NOTE: when the number of distinct values is smaller than n_bins, the best workaround was to set the bins as the midpoints between distinct values (see the sketch below). In the future it may be worth getting rid of this altogether, however at the risk of getting an inhomogeneous array from weighted_percentile. We will need to agree on the best method of trimming.
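
As a minimal sketch of that workaround (a hypothetical standalone helper, not the exact code in _find_binning_thresholds): when there are fewer distinct values than requested bins, the thresholds are simply the midpoints between consecutive distinct values, ignoring zero-weight samples.

import numpy as np

def midpoint_thresholds(col_data, sample_weight=None):
    # Keep only samples with a non-zero weight before computing the
    # distinct values, then place one threshold halfway between each
    # pair of consecutive distinct values.
    if sample_weight is not None:
        col_data = col_data[sample_weight != 0]
    distinct_values = np.unique(col_data)
    return (distinct_values[:-1] + distinct_values[1:]) * 0.5

# e.g. midpoint_thresholds(np.array([1.0, 1.0, 2.0, 5.0])) -> array([1.5, 3.5])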

Major changes proposed:

  • Subsampling is done with weights, after which sample_weight is no longer propagated within _BinMapper (it is set to None)
  • Tests for sample_weight invariance of _BinMapper in the deterministic case were added
  • Some tests failed with the new changes and n_samples had to be increased, e.g., test_interaction_cst_numerically
  • Some tests fail in special cases when sample weights are passed through but the number of distinct values is smaller than max_bins. It remains to be decided what to do here
  • Statistical test added in comments below

NOTE: this also allows HGBT to pass weighted tests for more than 256 samples (but still less than 2e6)

TO DO:

  • Update tests under test_gradient_boosting so that sample weight is passed as a positional argument
  • Fix test failures; something fishy changes such that, e.g., test_interaction_cst_numerically is not working
  • Modified tests to use a larger number of samples, otherwise the expected results are not attained (i.e., the r2 threshold etc. is not reached). I was expecting this to go the other way, so I am not sure if I set off another bug.
  • When the number of distinct values is less than max_bins and sample weights are given, we get a discrepancy between weighted and repeated in special cases (see test_zero_sample_weights_classification under test_gradient_boosting.py). Need to see what to do here.
  • Further test failures when max_bins is specified to be greater than the number of distinct values, in both the weighted and repeated cases
  • Check that sample weight invariance holds in the deterministic case (test added under test_binning)
  • For the case n_samples > subsample, conduct a statistical analysis to check that the repeated/reweighted equivalence holds for the binning procedure in expectation, similar to issue "Incorrect sample weight handling in KBinsDiscretizer" #29906. Run the analysis for many different random_state values and check that the mean bin edges match.

github-actions bot commented Aug 8, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 39ed7f3. Link to the linter CI: here

@ogrisel ogrisel (Member) left a comment

I did not analyse the test failures yet but here is some early feedback.

For the case n_samples > subsample we would need to conduct a statistical analysis to check that the repeated/reweighted equivalence holds for the binning procedure in expectation.

Once the tests pass for the deterministic case, we should conduct such an analysis (e.g. using a side notebook, not included in the repo, where we rerun the binning for many different values of random_state and then check that the mean bin edges match).

Please add a TODO item to the description of this PR not to forget about this.

@snath-xoc (Contributor, Author) commented:

Statistical tests were conducted for n_samples>subsample (subsample=int(2e5)) using the following code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kstest

from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

BONFERRONI_CORRECTION = 1  # to adjust later according to the agreed test dimension

rng = np.random.RandomState(0)

n_samples = int(3e5)
X = rng.randint(0, 30, size=(n_samples, 3))
# Use random integers (including zero) as weights.
sw = rng.randint(0, 5, size=n_samples)

X_repeated = np.repeat(X, sw, axis=0)
assert len(X_repeated) > int(2e5)  # make sure the subsampling branch is exercised

bin_thresholds_weighted = []
bin_thresholds_repeated = []
for seed in np.arange(100):
    est_weighted = _BinMapper(n_bins=6, random_state=seed).fit(X, sample_weight=sw)
    est_repeated = _BinMapper(n_bins=6, random_state=seed + 500).fit(
        X_repeated, sample_weight=None
    )
    bin_thresholds_weighted.append(est_weighted.bin_thresholds_)
    bin_thresholds_repeated.append(est_repeated.bin_thresholds_)

bin_thresholds_weighted = np.asarray(bin_thresholds_weighted)
bin_thresholds_repeated = np.asarray(bin_thresholds_repeated)

# One panel per (feature, threshold) pair: compare the distributions of the
# weighted and repeated bin thresholds across the 100 seeds with a KS test.
fig, axs = plt.subplots(3, 4, figsize=(14, 12))
j = 0
for i, ax in enumerate(axs.flatten()):
    if i > 0 and i % 4 == 0:
        j += 1
    ax.hist(bin_thresholds_weighted[:, j, i % 4].flatten())
    ax.hist(bin_thresholds_repeated[:, j, i % 4].flatten(), alpha=0.5)
    pval = kstest(
        bin_thresholds_weighted[:, j, i % 4].flatten(),
        bin_thresholds_repeated[:, j, i % 4].flatten(),
    ).pvalue
    if pval < (0.05 * BONFERRONI_CORRECTION):
        ax.set_title(f"p-value: {pval:.4f}, failed")
    else:
        ax.set_title(f"p-value: {pval:.4f}, passed")

The output is as follows:

[Figure: per-threshold histograms of the weighted vs. repeated bin thresholds across seeds, with the KS-test p-values in the panel titles.]

@snath-xoc snath-xoc marked this pull request as ready for review March 21, 2025 13:32
@snath-xoc (Contributor, Author) commented:

The ARM lint test is still failing due to the test_find_binning_thresholds_small_regular_data assertion error, but I can't reproduce it locally.

@ogrisel ogrisel (Member) left a comment

Modified tests to use a larger number of samples, otherwise the expected results are not attained (i.e., the r2 threshold etc. is not reached). I was expecting this to go the other way, so I am not sure if I set off another bug.

This is annoying, but it's probably caused by the fact that we switched from np.percentile(..., method="linear") to np.percentile(..., method="averaged_inverted_cdf") in the sample_weight=None case. Assuming this change is really the cause, and since it's necessary to fix the weight/repetition equivalence, we might consider this behavior change as a bug fix. It should be clearly documented as such in the changelog, possibly with a dedicated note in the "Changed models" section of the release notes to better warn users about this change.

Another concern is that the _averaged_weighted_percentile implementation is currently very naive and will cause a performance regression whenever the user passes sample_weight != None (a rough sketch of the idea follows). We might want to postpone the final merge of this PR until #30945 has been reviewed and merged.
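
For context, here is a rough sketch of the averaged inverted-CDF idea and of why a naive implementation is costly (hypothetical helpers, not the actual scikit-learn code): every percentile query re-sorts the column and re-accumulates the weights, so evaluating max_bins - 1 percentiles per feature multiplies that cost.

import numpy as np

def weighted_percentile_inverted_cdf(values, weights, percentile):
    # Inverted CDF: smallest value whose cumulative weight reaches the
    # requested fraction of the total weight.
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_weights = np.cumsum(weights)
    target = percentile / 100.0 * cum_weights[-1]
    return values[np.searchsorted(cum_weights, target)]

def averaged_weighted_percentile(values, weights, percentile):
    # Average the inverted CDF with its mirrored counterpart
    # (the "averaged_inverted_cdf" idea).
    lower = weighted_percentile_inverted_cdf(values, weights, percentile)
    upper = -weighted_percentile_inverted_cdf(-values, weights, 100 - percentile)
    return 0.5 * (lower + upper)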

func = _averaged_weighted_percentile
if len(distinct_values) <= max_bins:
    max_bins = len(distinct_values)
    func = _weighted_percentile
Member

Why would _averaged_weighted_percentile not work when len(distinct_values) <= max_bins?

>>> import numpy as np
>>> from sklearn.utils.stats import _weighted_percentile, _averaged_weighted_percentile
>>> a = np.random.randint(0, 10, size=10000).astype(np.float64)
>>> max_bins = 30
>>> max_bins > np.unique(a).shape[0]
True
>>> np.array(
...     [_averaged_weighted_percentile(a, np.ones_like(a), percentile) for percentile in percentiles]
... )
array([0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4., 5.,
       5., 5., 6., 6., 6., 7., 7., 7., 8., 8., 8., 9., 9., 9.])
>>> np.array(
...     [_weighted_percentile(a, np.ones_like(a), percentile) for percentile in percentiles]
... )
array([0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4., 5.,
       5., 5., 6., 6., 6., 7., 7., 7., 8., 8., 8., 9., 9., 9.])

Member

When distinct values are less than max_bins and sample weights are given we get a discrepancy between weighted and repeated under special cases (see test_zero_sample_weights_classification under test_gradient_boosting.py). Need to see what to do here.

The manual midpoints between distinct values in the weighted case are probably responsible for this discrepancy.

We should probably explore whether it's possible to post-process the output of _averaged_weighted_percentile to trim duplicated thresholds. If the number of trimmed thresholds is lower than max_bins, then it might be possible to post-process further to recover thresholds that match the unweighted case by shifting them to the 0.5 midpoints, as sketched below.
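
A possible sketch of that post-processing, just to illustrate the trim-and-shift idea being discussed (the name, signature and exact shifting rule are hypothetical):

import numpy as np

def trim_and_shift_thresholds(thresholds, col_data, max_bins):
    # 1) Trim: drop thresholds duplicated by the weighted percentile computation.
    thresholds = np.unique(thresholds)
    if thresholds.shape[0] < max_bins:
        # 2) Shift: move each remaining threshold onto the 0.5 midpoint between
        # the surrounding distinct values, to match the unweighted case.
        distinct_values = np.unique(col_data)
        right = np.searchsorted(distinct_values, thresholds, side="right")
        right = np.clip(right, 1, len(distinct_values) - 1)
        thresholds = 0.5 * (distinct_values[right - 1] + distinct_values[right])
    return np.unique(thresholds)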

@snath-xoc snath-xoc (Contributor, Author) commented Apr 25, 2025

I can reproduce the test_sample_weight_effect failure locally; it only happens for specific seeds though, e.g., global_random_seed=3. This may be due to some larger issue, but it is the only remaining test failure. @ogrisel and @antoinebaker, do you have any opinions?

Never mind: it seems sensitive to the sample size; I have lowered it to 200 now.

@snath-xoc (Contributor, Author) commented:

@ogrisel reviving this: I added a changelog entry, let me know what you think!

@ogrisel ogrisel (Member) left a comment

Here is a new pass of review:

# Remove duplicated midpoints if they exist and shift
unique_bin_values = np.unique(bin_thresholds)
if unique_bin_values.shape[0] != bin_thresholds.shape[0]:
    bin_thresholds = unique_bin_values
Member

Let's try to craft a test case to cover this line. I think the following might help trigger this case:

>>> import numpy as np
>>> col_data = np.asarray(([1] * 1000) + [2, 3, 4, 5, 6])
>>> col_data.shape
(1005,)
>>> np.unique(col_data).shape
(6,)
>>> max_bins = 4
>>> percentiles = np.linspace(0, 100, num=max_bins + 1)[1:-1]
>>> np.percentile(col_data, percentiles)
array([1., 1., 1.])

Then the test could also check that the result matches what we get from:

>>> col_data = np.asarray([1, 2, 3, 4, 5, 6])
>>> sample_weight = np.asarray([1000, 1, 1, 1, 1, 1])
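
A sketch of how such a test could look, assuming the PR's fit(X, sample_weight=...) signature on _BinMapper and that n_bins = max_bins + 1 accounts for the missing-values bin (the exact assertion may need adapting):

import numpy as np
from numpy.testing import assert_allclose

from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

def test_binmapper_trims_duplicated_thresholds():
    # One dominant value so that plain percentiles collapse onto it.
    col_repeated = np.asarray(([1] * 1000) + [2, 3, 4, 5, 6], dtype=np.float64)
    col_weighted = np.asarray([1, 2, 3, 4, 5, 6], dtype=np.float64)
    sample_weight = np.asarray([1000, 1, 1, 1, 1, 1], dtype=np.float64)

    max_bins = 4
    est_repeated = _BinMapper(n_bins=max_bins + 1).fit(
        col_repeated.reshape(-1, 1), sample_weight=None
    )
    est_weighted = _BinMapper(n_bins=max_bins + 1).fit(
        col_weighted.reshape(-1, 1), sample_weight=sample_weight
    )
    assert_allclose(est_weighted.bin_thresholds_[0], est_repeated.bin_thresholds_[0])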

@@ -198,6 +198,27 @@ def test_bin_mapper_repeated_values_invariance(n_distinct):
    assert_array_equal(binned_1, binned_2)


@pytest.mark.parametrize("n_bins", [50, None])
def test_binmapper_weighted_vs_repeated_equivalence(global_random_seed, n_bins):
Member

I found an occurrence of a test failure with global_random_seed=63 when running:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -vl sklearn/ensemble/_hist_gradient_boosting/tests -k test_binmapper_weighted_vs_repeated_equivalence

Can you reproduce?

Here is the error message I get:

=========================================================================== test session starts ============================================================================
platform darwin -- Python 3.13.3, pytest-8.4.0, pluggy-1.6.0 -- /Users/ogrisel/miniforge3/envs/dev/bin/python3.13
cachedir: .pytest_cache
rootdir: /Users/ogrisel/code/scikit-learn
configfile: pyproject.toml
plugins: xdist-3.7.0, run-parallel-0.4.3, anyio-4.9.0
collected 451 items / 449 deselected / 2 selected                                                                                                                          
Collected 0 items to run in parallel

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py::test_binmapper_weighted_vs_repeated_equivalence[63-50] FAILED                                        [ 50%]
sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py::test_binmapper_weighted_vs_repeated_equivalence[63-None] PASSED                                      [100%]

================================================================================= FAILURES =================================================================================
__________________________________________________________ test_binmapper_weighted_vs_repeated_equivalence[63-50] __________________________________________________________

global_random_seed = 63, n_bins = 50

    @pytest.mark.parametrize("n_bins", [50, None])
    def test_binmapper_weighted_vs_repeated_equivalence(global_random_seed, n_bins):
        rng = np.random.RandomState(global_random_seed)
    
        n_samples = 200
        X = rng.randn(n_samples, 3)
        if n_bins is None:
            n_bins = np.unique(X[:, rng.randint(3)]).shape[0] + rng.randint(5) + 1
    
        sw = rng.randint(0, 5, size=n_samples)
        X_repeated = np.repeat(X, sw, axis=0)
    
        est_weighted = _BinMapper(n_bins=n_bins).fit(X, sample_weight=sw)
        est_repeated = _BinMapper(n_bins=n_bins).fit(X_repeated, sample_weight=None)
>       assert_allclose(est_weighted.bin_thresholds_, est_repeated.bin_thresholds_)
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       Mismatched elements: 6 / 144 (4.17%)
E       Max absolute difference among violations: 0.022745
E       Max relative difference among violations: 0.02664642
E        ACTUAL: array([[-1.712595, -1.285816, -1.226962, -1.009446, -0.960101, -0.882069,
E               -0.83596 , -0.69571 , -0.675828, -0.603487, -0.399731, -0.361047,
E               -0.340069, -0.237937, -0.207451, -0.087749, -0.075609,  0.032785,...
E        DESIRED: array([[-1.712595, -1.285816, -1.226962, -1.009446, -0.960101, -0.882069,
E               -0.83596 , -0.69571 , -0.675828, -0.593016, -0.399731, -0.361047,
E               -0.340069, -0.237937, -0.207451, -0.087749, -0.075609,  0.032785,...

X          = array([[-2.13897865e+00,  1.11206124e+00,  3.58015526e-02],
       [-6.30157742e-01,  3.54051160e-05, -1.20895742e+00]...      [-3.99730932e-01,  9.48906177e-01,  2.40632339e-01],
       [-3.83554471e-02, -1.19674458e+00, -1.27746484e+00]])
X_repeated = array([[-2.13897865,  1.11206124,  0.03580155],
       [ 0.29162807, -1.65065515, -1.50712909],
       [ 0.29162807, -...3234],
       [-0.39973093,  0.94890618,  0.24063234],
       [-0.39973093,  0.94890618,  0.24063234]], shape=(392, 3))
est_repeated = _BinMapper(n_bins=50)
est_weighted = _BinMapper(n_bins=50)
global_random_seed = 63
n_bins     = 50
n_samples  = 200
rng        = RandomState(MT19937) at 0x13D5D3240
sw         = array([1, 0, 4, 2, 4, 0, 4, 0, 4, 0, 4, 0, 2, 1, 3, 4, 4, 3, 0, 0, 3, 1,
       0, 4, 2, 2, 0, 1, 2, 2, 0, 4, 3, 4, 0,...4, 4, 1, 3, 2, 4, 2, 2, 0, 4, 0,
       2, 1, 0, 2, 2, 0, 0, 4, 3, 1, 4, 1, 0, 0, 0, 2, 4, 4, 3, 3, 2, 2,
       4, 0])

sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py:215: AssertionError
-------------------------------------------------------------------------- Captured stdout setup ---------------------------------------------------------------------------
I: Seeding RNGs with 819376607
========================================================================= short test summary info ==========================================================================
FAILED sklearn/ensemble/_hist_gradient_boosting/tests/test_binning.py::test_binmapper_weighted_vs_repeated_equivalence[63-50] - AssertionError: 
=============================================================== 1 failed, 1 passed, 449 deselected in 0.46s ================================================================

So the problem happens when max_bins=50 < np.unique(X[sw != 0], axis=0).shape[0]=155, which means it's not related to the branch of the code that uses midpoints as bin edges, but rather to the branch that uses _averaged_weighted_percentile. I think we should try to extract a minimal reproducer to understand the root cause of this failure case.

Member

Maybe the root cause is a bug in _averaged_weighted_percentile that does not implement the expected weighted/repeated equivalence? If that's the case, we should try to extract a minimal reproducer for that function in isolation, then debug and fix it in a dedicated PR (a sketch of such a reproducer follows).
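
A minimal reproducer along those lines could look like this (a sketch; _averaged_weighted_percentile is assumed to take (array, sample_weight, percentile) positionally, as in the snippets above):

import numpy as np
from sklearn.utils.stats import _averaged_weighted_percentile

rng = np.random.RandomState(63)
x = rng.randn(200)
sw = rng.randint(0, 5, size=200)
x_repeated = np.repeat(x, sw)

percentiles = np.linspace(0, 100, num=50)[1:-1]
weighted = [_averaged_weighted_percentile(x, sw, p) for p in percentiles]
repeated = [
    _averaged_weighted_percentile(x_repeated, np.ones_like(x_repeated), p)
    for p in percentiles
]
# If the weighted/repeated equivalence holds, the two should match.
np.testing.assert_allclose(weighted, repeated)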

Contributor Author

I can confirm that I get the same failure for seed 63, will investigate a bit more, strange :/

@antoinebaker antoinebaker (Contributor) left a comment

Here is a first round of review!

Comment on lines 1239 to 1242
self._bin_mapper = self._bin_mapper.fit(
    X, sample_weight=sample_weight
)  # F-aligned array
X_binned = self._bin_mapper.transform(X)
Contributor

Suggested change:

-self._bin_mapper = self._bin_mapper.fit(
-    X, sample_weight=sample_weight
-)  # F-aligned array
-X_binned = self._bin_mapper.transform(X)
+self._bin_mapper.fit(X, sample_weight=sample_weight)
+X_binned = self._bin_mapper.transform(X)  # F-aligned array

Comment on lines 52 to 59
# if sample weight is not None and null values exist
# we need to remove those before calculating the
# distinct points
if sample_weight is not None:
    col_data_non_null = col_data[sample_weight != 0]
    distinct_values = np.unique(col_data_non_null).astype(X_DTYPE)
else:
    distinct_values = np.unique(col_data).astype(X_DTYPE)
Contributor

Suggested change:

-# if sample weight is not None and null values exist
-# we need to remove those before calculating the
-# distinct points
-if sample_weight is not None:
-    col_data_non_null = col_data[sample_weight != 0]
-    distinct_values = np.unique(col_data_non_null).astype(X_DTYPE)
-else:
-    distinct_values = np.unique(col_data).astype(X_DTYPE)
+if sample_weight is not None:
+    # A zero sample_weight should be equivalent to removing the sample.
+    # We discard sample_weight=0 when computing the distinct values.
+    distinct_values = np.unique(col_data[sample_weight != 0]).astype(X_DTYPE)
+else:
+    distinct_values = np.unique(col_data).astype(X_DTYPE)

Comment on lines 80 to 83
# We could compute approximate midpoint percentiles using the output of
# np.unique(col_data, return_counts) instead but this is more
# work and the performance benefit will be limited because we
# work on a fixed-size subsample of the full data.
Contributor

I think this comment belongs to the elif sample_weight is None: block above.

Contributor

Besides, maybe the comment should be updated (midpoints are now replaced with the averaged_inverted_cdf method)?

Contributor Author

I updated this comment now but not sure if it is what you had in mind, feel free to check and provide suggestions!

percentiles = np.linspace(0, 100, num=max_bins + 1)
percentiles = percentiles[1:-1]
midpoints = np.percentile(col_data, percentiles, method="midpoint").astype(
    X_DTYPE
)
sample_weight = sample_weight[~missing_mask]
Contributor

I think we should do this in the if missing_mask.any(): above to make sure col_data and sample_weight are always aligned.

Contributor Author

This gives a bit of a spaghetti if, but I am O.K. with it for now.

Contributor Author

Let me know if the fix makes sense for you now!

@ogrisel ogrisel (Member) commented Jul 2, 2025

BTW, I re-ran #29641 (comment) on the current state of the PR and the results are good for the subsampling branch of the code:

[Figure: statistical_test — re-run of the weighted vs. repeated histogram/KS-test comparison above on the current state of the PR.]

Successfully merging this pull request may close these issues.

BinMapper within HGBT does not handle sample weights