Added sample weight handling to BinMapper under HGBT #29641
Conversation
I did not analyse the test failures yet, but here is some early feedback.

For the case n_samples > subsample, we would need to conduct a statistical analysis to check that the repeated/reweighted equivalence holds for the binning procedure in expectation.

Once the tests pass for the deterministic case, we should conduct such an analysis (e.g. using a side notebook, not included in the repo, where we rerun the binning for many different values of random_state and then check that the mean bin edges match).

Please add a TODO item to the description of this PR so we don't forget about this.
…oosting.py Co-authored-by: Olivier Grisel <[email protected]>
Statistical tests were conducted for n_samples > subsample (subsample=int(2e5)) using the following code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kstest

from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

# To adjust later according to the agreed test dimension (e.g. set to
# 1 / n_tests for a proper Bonferroni correction).
BONFERRONI_CORRECTION = 1

rng = np.random.RandomState(0)
n_samples = int(3e5)
X = rng.randint(0, 30, size=(n_samples, 3))
# Use random integers (including zero) as weights.
sw = rng.randint(0, 5, size=n_samples)
X_repeated = np.repeat(X, sw, axis=0)
assert len(X_repeated) > int(2e5)

bin_thresholds_weighted = []
bin_thresholds_repeated = []
for seed in np.arange(100):
    est_weighted = _BinMapper(n_bins=6, random_state=seed).fit(X, sample_weight=sw)
    est_repeated = _BinMapper(n_bins=6, random_state=seed + 500).fit(
        X_repeated, sample_weight=None
    )
    bin_thresholds_weighted.append(est_weighted.bin_thresholds_)
    bin_thresholds_repeated.append(est_repeated.bin_thresholds_)

bin_thresholds_weighted = np.asarray(bin_thresholds_weighted)
bin_thresholds_repeated = np.asarray(bin_thresholds_repeated)

# One subplot per (feature, bin edge) pair: 3 features x 4 edges.
fig, axs = plt.subplots(3, 4, figsize=(14, 12))
j = 0  # feature index
for i, ax in enumerate(axs.flatten()):
    if i > 0 and i % 4 == 0:
        j += 1
    ax.hist(bin_thresholds_weighted[:, j, i % 4].flatten())
    ax.hist(bin_thresholds_repeated[:, j, i % 4].flatten(), alpha=0.5)
    # Two-sample KS test on the distribution of this bin edge across
    # the 100 seeds.
    pval = kstest(
        bin_thresholds_weighted[:, j, i % 4].flatten(),
        bin_thresholds_repeated[:, j, i % 4].flatten(),
    ).pvalue
    if pval < (0.05 * BONFERRONI_CORRECTION):
        ax.set_title(f"p-value: {pval:.4f}, failed")
    else:
        ax.set_title(f"p-value: {pval:.4f}, passed")

[Output figure: per-feature, per-edge histograms of the weighted vs. repeated bin thresholds, each annotated with the KS-test p-value and a pass/fail status.]
The ARM lint test is still failing due to the test_find_binning_thresholds_small_regular_data assertion error, but I can't reproduce it locally.
Modified the tests to use a larger number of samples, otherwise the expected results are not attained (i.e., the R2 score threshold etc. is not reached). I was expecting this to go the other way, so I'm not sure if I set off another bug.
This is annoying, but it's probably caused by the fact that we switched from np.percentile(..., method="linear") to np.percentile(..., method="averaged_inverted_cdf") in the sample_weight=None case. Assuming this change is really the cause, and since it's necessary to fix the weight/repetition equivalence, we might consider this behavior change as a bug fix. It should be clearly documented as such in the changelog, possibly with a dedicated note in the "Changed models" section of the release notes to better warn users about this change.

Another concern is that the _averaged_weighted_percentile implementation is currently very naive and will cause a performance regression whenever the user passes sample_weight != None. We might want to postpone the final merge of this PR until #30945 is reviewed and merged first.
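For illustration, a minimal sketch of how the two interpolation methods differ on a toy array (assuming numpy >= 1.22, where both methods are available; the array is illustrative):

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# "linear" (numpy's default) interpolates between order statistics.
print(np.percentile(a, 25, method="linear"))                 # 1.75

# "averaged_inverted_cdf" averages the two nearest order statistics
# when the target rank falls exactly on a data point, which is the
# behavior the weighted/repeated equivalence discussion above relies on.
print(np.percentile(a, 25, method="averaged_inverted_cdf"))  # 1.5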
func = _averaged_weighted_percentile
if len(distinct_values) <= max_bins:
    max_bins = len(distinct_values)
    func = _weighted_percentile
Why would _averaged_weighted_percentile not work when len(distinct_values) <= max_bins?
>>> import numpy as np
>>> from sklearn.utils.stats import _weighted_percentile, _averaged_weighted_percentile
>>> a = np.random.randint(0, 10, size=10000).astype(np.float64)
>>> max_bins = 30
>>> max_bins > np.unique(a).shape[0]
True
>>> percentiles = np.linspace(0, 100, num=max_bins + 1)  # assumed definition, missing from the original snippet
>>> np.array(
...     [_averaged_weighted_percentile(a, np.ones_like(a), percentile) for percentile in percentiles]
... )
array([0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4., 5.,
       5., 5., 6., 6., 6., 7., 7., 7., 8., 8., 8., 9., 9., 9.])
>>> np.array(
...     [_weighted_percentile(a, np.ones_like(a), percentile) for percentile in percentiles]
... )
array([0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4., 5.,
       5., 5., 6., 6., 6., 7., 7., 7., 8., 8., 8., 9., 9., 9.])
When the number of distinct values is less than max_bins and sample weights are given, we get a discrepancy between weighted and repeated under special cases (see test_zero_sample_weights_classification under test_gradient_boosting.py). Need to see what to do here.

The manual midpoints between distinct values in the weighted case are probably responsible for this discrepancy.

We should probably explore whether it's possible to post-process the output of _averaged_weighted_percentile to trim duplicated thresholds. If the number of trimmed thresholds is lower than max_bins, then it might be possible to further post-process to recover thresholds that match the unweighted case by shifting the thresholds to use 0.5 midpoints.
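A rough sketch of that post-processing idea (the function name and recovery rule are illustrative, not the PR's actual code):

import numpy as np

def trim_thresholds(thresholds, distinct_values, max_bins):
    # Drop duplicated thresholds that _averaged_weighted_percentile can
    # produce when there are few distinct values.
    trimmed = np.unique(thresholds)
    if len(trimmed) < max_bins:
        # Fall back to the unweighted rule: thresholds at the 0.5
        # midpoints between consecutive distinct values.
        return (distinct_values[:-1] + distinct_values[1:]) * 0.5
    return trimmed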
@@ -1734,7 +1740,7 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
     >>> X, y = load_diabetes(return_X_y=True)
     >>> est = HistGradientBoostingRegressor().fit(X, y)
     >>> est.score(X, y)
-    0.92...
+    0.93...
Why is that?
We now compute the bin edges using np.percentile(..., method="averaged_inverted_cdf") instead of np.percentile(..., method="linear") (the default in numpy) when sample_weight=None. This is necessary to get the weighted/repetition equivalence semantics for the sample_weight parameter.
if sample_weight is not None:
    subsampling_probabilities = sample_weight / np.sum(sample_weight)
    subset = rng.choice(
        X.shape[0], self.subsample, p=subsampling_probabilities, replace=True
    )
Suggested change:

-        X.shape[0], self.subsample, p=subsampling_probabilities, replace=True
+        X.shape[0], self.subsample, p=subsampling_probabilities, replace=False

Preserve current behavior with sample_weight=None.
Sampling with replacement is necessary to ensure the correct weight semantics: if you have a data point with a large weight, it might need to be resampled several times for the edges computed on the subset to be identically distributed to what you would observe on an equivalent unweighted dataset with repeated values.
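A small illustration of this argument on toy data (illustrative only, not the PR's code):

import numpy as np

rng = np.random.RandomState(0)
x = np.array([0.0, 1.0, 2.0])
w = np.array([1.0, 1.0, 10.0])  # one dominant weight

# Weighted subsampling with replacement lets the heavy point appear
# several times in the subset...
sub_weighted = rng.choice(x, size=6, p=w / w.sum(), replace=True)

# ...mirroring what plain subsampling of the equivalent repeated
# (unweighted) dataset produces.
x_repeated = np.repeat(x, w.astype(int))
sub_repeated = rng.choice(x_repeated, size=6, replace=False)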
But I agree we need a non-regression test for this: #29641 (comment)
Then we should weigh the trade-offs between preserving the old behavior for sample_weight=None and maintainability. I would favor making a case distinction and sampling without replacement when sample_weight=None.
The problem is that keeping the linear interpolation makes it impossible for test_binmapper_weighted_vs_repeated_equivalence to pass, which is the point of this PR. Without changing this, we cannot test that our sample weight handling correctly implements the semantics we are looking for.
We could/should keep replace=False if sample_weight is None.
I really disagree: this would break the statistical equivalence between sample_weight=None and sample_weight=np.ones(n_samples) (and again test_binmapper_weighted_vs_repeated_equivalence).
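For concreteness, a quick check of the equivalence at stake (assuming the sample_weight parameter added by this PR; in the deterministic regime, n_samples <= subsample, the edges should match exactly):

import numpy as np
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))

bm_none = _BinMapper(n_bins=16, random_state=0).fit(X)
bm_ones = _BinMapper(n_bins=16, random_state=0).fit(
    X, sample_weight=np.ones(X.shape[0])
)
for edges_none, edges_ones in zip(bm_none.bin_thresholds_, bm_ones.bin_thresholds_):
    np.testing.assert_allclose(edges_none, edges_ones)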
- :class:`BaseHistGradientBoosting` now uses sample weights when binning data
  either by using sample weights in the `_averaged_weighted_percentile` or
  by subsampling using normalised sample weights when `n_samples` is greater
  than 2e5 within the `_find_binning_thresholds` function.
  By :user:`Shruti Nath <snath-xoc>` and :user:`Olivier Grisel <ogrisel>`
Please provide a better changelog entry:

- What exactly changes, in particular what will produce different estimators compared to the previous (now current) version of scikit-learn. I think we should even list this PR under the "Changed models" section.
- Do not mention private functions like _averaged_weighted_percentile.
@snath-xoc we need to add a dedicated entry in the "Changed models" section to explain that the way HistGradientBoostingClassifier/Regressor compute their bin edges is slightly changed when fit with sample_weight=None.
I suppose this would be in addition to the changelog PR?
Some more feedback:
subset = rng.choice(X.shape[0], self.subsample, replace=False)
subsampling_probabilities = None
if sample_weight is not None:
    subsampling_probabilities = sample_weight / np.sum(sample_weight)
Could you please extend the tests to cover this line? Maybe we could add a simple kstest on the distribution of the bin edges for a smallish value of subsample (e.g. 100 out of a dataset of 300 data points) and a few bins (e.g. 3 or 5), so that we sidestep the non-independent multiple testing problem. We need the test to run fast enough (ideally under 1s), so we have to constrain ourselves in the number of rng repetitions, but since subsampled binning on a single column of data is quite cheap, I believe this is doable.
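A rough sketch of what such a test could look like (assuming the _BinMapper sample_weight support from this PR; sizes and seed counts are illustrative):

import numpy as np
from scipy.stats import kstest
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

rng = np.random.RandomState(0)
n_samples, subsample, n_bins = 300, 100, 5
X = rng.normal(size=(n_samples, 1))
sw = rng.randint(1, 4, size=n_samples).astype(np.float64)
X_repeated = np.repeat(X, sw.astype(int), axis=0)

edges_weighted, edges_repeated = [], []
for seed in range(50):
    bm_w = _BinMapper(n_bins=n_bins, subsample=subsample, random_state=seed)
    edges_weighted.append(bm_w.fit(X, sample_weight=sw).bin_thresholds_[0])
    bm_r = _BinMapper(n_bins=n_bins, subsample=subsample, random_state=seed + 1000)
    edges_repeated.append(bm_r.fit(X_repeated).bin_thresholds_[0])

edges_weighted = np.asarray(edges_weighted)
edges_repeated = np.asarray(edges_repeated)
# Compare the across-seed distribution of each bin edge.
for k in range(edges_weighted.shape[1]):
    assert kstest(edges_weighted[:, k], edges_repeated[:, k]).pvalue > 0.05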
Co-authored-by: Olivier Grisel <[email protected]> Co-authored-by: Christian Lorentzen <[email protected]>
For reference, I tried to run the common test for

After commenting out the matching

Maybe we need to fix the remaining sources of discrepancies (at least on small datasets) before deciding if the change of behavior is justified.
@snath-xoc it seems that the CI of this PR is red because configurations that run an older version of scipy run into a bug of the kstest function that was probably fixed in more recent versions, and that makes the test fail. Could you please skip this test conditionally on the scipy version to get the CI green on this PR?
BTW @ogrisel
Fixes #29640, #27117
Towards #16298
Accepts sample_weight within BinMapper and passes it on to _find_binning_thresholds, where bin midpoints are calculated using the weighted percentile. Sample weights are also passed to the rng.choice() subsampling in BinMapper when the number of samples is larger than 2e5.
NOTE: when n_bins < the number of discrete values, the best workaround was to set the bins as the midpoints. In the future it may be worth getting rid of this altogether, however at the risk of getting an inhomogeneous array from weighted_percentile. We will need to agree on the best method of trimming. A condensed sketch of the thresholding logic is given below.
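For context, here is that sketch (the real _find_binning_thresholds also handles missing values, categorical features, and subsampling; this assumes sample_weight is a 1d float array):

import numpy as np
from sklearn.utils.stats import _averaged_weighted_percentile

def find_binning_thresholds_sketch(col, max_bins, sample_weight):
    distinct_values = np.unique(col)
    if len(distinct_values) <= max_bins:
        # Few distinct values: thresholds at the midpoints between them.
        return (distinct_values[:-1] + distinct_values[1:]) * 0.5
    percentiles = np.linspace(0, 100, num=max_bins + 1)[1:-1]
    return np.array(
        [_averaged_weighted_percentile(col, sample_weight, p) for p in percentiles]
    )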
Major changes proposed:
NOTE: this also allows HGBT to pass weighted tests for more than 256 samples (but still less than 2e6)
TO DO:

- KBinsDiscretizer: #29906.
- Conduct analysis for many different values of random_state to check that the mean bin edges match: done in test_subsampled_weighted_vs_repeated_equivalence under test_binning.py. However, there seems to be a pathological case for global random seed = 5, n_bins = 5, where at least one of the bin edges does not match in distribution between weighted and repeated. I now make the test ascertain that the number of non-matching bin edges is confined to at most one (i.e. sum(p_val < 0.025) < 2).