Added sample weight handling to BinMapper under HGBT #29641
I have not analysed the test failures yet, but here is some early feedback.
For the case n_samples > subsample, we would need to conduct a statistical analysis to check that the repeated/reweighted equivalence holds for the binning procedure in expectation.
Once the tests pass for the deterministic case, we should conduct such an analysis (e.g. using a side notebook, not included in the repo, where we rerun the binning for many different values of random_state and then check that the mean bin edges match).
Please add a TODO item to the description of this PR so that we do not forget about this.
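A rough, dependency-free sketch of the kind of side analysis suggested above. It does not use `_BinMapper`: a single median stands in for one bin edge, and the sampling scheme (weighted draws with replacement on one side, uniform draws from the repeated data on the other) is an assumption made for illustration, as are all names and sizes.

```python
import random
import statistics


def median_threshold(sample):
    """Return the median of a subsample, used here as a stand-in for one bin edge."""
    return statistics.median(sample)


data_rng = random.Random(0)
n_samples, subsample, n_seeds = 2000, 500, 50
values = [data_rng.randrange(30) for _ in range(n_samples)]
weights = [data_rng.randrange(5) for _ in range(n_samples)]
# Repeated dataset: each value appears `weight` times (zero weights drop out).
repeated = [v for v, w in zip(values, weights) for _ in range(w)]

weighted_edges, repeated_edges = [], []
for seed in range(n_seeds):
    # Weighted side: draw proportionally to the weights (with replacement).
    rng_w = random.Random(seed)
    weighted_edges.append(
        median_threshold(rng_w.choices(values, weights=weights, k=subsample))
    )
    # Repeated side: uniform draw from the repeated data, with a different seed.
    rng_r = random.Random(seed + 500)
    repeated_edges.append(median_threshold(rng_r.sample(repeated, subsample)))

mean_weighted = statistics.mean(weighted_edges)
mean_repeated = statistics.mean(repeated_edges)
print(f"mean edge (weighted): {mean_weighted:.3f}")
print(f"mean edge (repeated): {mean_repeated:.3f}")
```

If the equivalence holds in expectation, the two means should agree up to sampling noise; a proper analysis would compare the full distributions per threshold rather than only the means.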
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py
…oosting.py Co-authored-by: Olivier Grisel <[email protected]>
Statistical tests were conducted for n_samples > subsample (subsample=int(2e5)) using the following code:

import numpy as np
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper
from scipy.stats import kstest
import matplotlib.pyplot as plt

BONFERRONI_CORRECTION = 1  # To adjust later according to agreed test dim.
rng = np.random.RandomState(0)
n_samples = int(3e5)
X = rng.randint(0, 30, size=(n_samples, 3))
# Use random integers (including zero) as weights.
sw = rng.randint(0, 5, size=n_samples)
X_repeated = np.repeat(X, sw, axis=0)
assert len(X_repeated) > int(2e5)

bin_thresholds_weighted = []
bin_thresholds_repeated = []
for seed in np.arange(100):
    est_weighted = _BinMapper(n_bins=6, random_state=seed).fit(
        X, sample_weight=sw
    )
    est_repeated = _BinMapper(n_bins=6, random_state=seed + 500).fit(
        X_repeated, sample_weight=None
    )
    bin_thresholds_weighted.append(est_weighted.bin_thresholds_)
    bin_thresholds_repeated.append(est_repeated.bin_thresholds_)
bin_thresholds_weighted = np.asarray(bin_thresholds_weighted)
bin_thresholds_repeated = np.asarray(bin_thresholds_repeated)

# One panel per (feature j, threshold i % 4) pair.
fig, axs = plt.subplots(3, 4, figsize=(14, 12))
j = 0
for i, ax in enumerate(axs.flatten()):
    if i > 0 and i % 4 == 0:
        j += 1
    ax.hist(bin_thresholds_weighted[:, j, i % 4].flatten())
    ax.hist(bin_thresholds_repeated[:, j, i % 4].flatten(), alpha=0.5)
    pval = kstest(
        bin_thresholds_weighted[:, j, i % 4].flatten(),
        bin_thresholds_repeated[:, j, i % 4].flatten(),
    ).pvalue
    if pval < (0.05 * BONFERRONI_CORRECTION):
        ax.set_title(f'p-value: {pval:.4f},failed')
    else:
        ax.set_title(f'p-value: {pval:.4f},passed')

The output is as follows: [figure not captured: 3 x 4 grid of overlaid histograms of the weighted and repeated bin thresholds, each panel titled with its KS-test p-value]
The ARM lint test is still failing because of an assertion error in test_find_binning_thresholds_small_regular_data, but I can't reproduce it locally.
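Regarding the BONFERRONI_CORRECTION placeholder in the script above: a minimal sketch of how the per-test significance level could be set, assuming one KS test per panel of the 3 x 4 grid (12 tests). The p-values below are hypothetical and only illustrate the comparison; they are not results from this PR.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_features, n_thresholds = 3, 4  # one KS test per panel of the 3 x 4 grid
per_test_alpha = bonferroni_alpha(0.05, n_features * n_thresholds)

# Hypothetical p-values, for illustration only (not results from the PR).
p_values = [0.20, 0.03, 0.001]
verdicts = ["failed" if p < per_test_alpha else "passed" for p in p_values]
print(per_test_alpha, verdicts)
```

With this convention, the equivalent value for BONFERRONI_CORRECTION in the script would be 1 / 12, since it multiplies the 0.05 level.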
I modified the tests to use a larger number of samples; otherwise the expected results are not attained (e.g. the r2 threshold is not reached). I was expecting this to be the other way around, so I am not sure whether I set off another bug.
This is annoying, but it's probably caused by the fact that we switched from np.percentile(..., method="linear") to np.percentile(..., method="averaged_inverted_cdf") in the sample_weight=None case. Assuming this change is really the cause, and since it's necessary to fix the weight/repetition equivalence, we might consider this behavior change as a bug fix. It should be clearly documented as such in the changelog, possibly with a dedicated note in the "Changed models" section of the release notes to better warn the users about this change.
Another concern is that the _averaged_weighted_percentile implementation is currently very naive and will cause a performance regression whenever users pass a non-None sample_weight. We might want to postpone the final merge of this PR and wait for the review and merge of #30945 first.
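To make the effect of the method switch concrete, here is a small pure-Python sketch of the two percentile definitions as I understand them (linear interpolation, and the inverted empirical CDF with averaging at exact steps). The helper names are made up for this illustration and are not NumPy or scikit-learn functions.

```python
import math


def percentile_linear(sorted_values, pct):
    """Linear interpolation between order statistics."""
    n = len(sorted_values)
    pos = (n - 1) * pct / 100
    lo = math.floor(pos)
    hi = min(lo + 1, n - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def percentile_averaged_inverted_cdf(sorted_values, pct):
    """Inverted empirical CDF, averaging the two neighbours at exact steps."""
    n = len(sorted_values)
    rank = n * pct / 100
    k = math.ceil(rank)
    if rank == k:  # exactly on a step of the empirical CDF: average both sides
        lo = sorted_values[max(k - 1, 0)]
        hi = sorted_values[min(k, n - 1)]
        return (lo + hi) / 2
    return sorted_values[k - 1]


x = [0, 10, 20, 30]
x_repeated = sorted(x * 2)  # every sample duplicated, i.e. all weights equal to 2

linear_pair = (percentile_linear(x, 30), percentile_linear(x_repeated, 30))
aicdf_pair = (
    percentile_averaged_inverted_cdf(x, 30),
    percentile_averaged_inverted_cdf(x_repeated, 30),
)
print("linear:", linear_pair)
print("averaged_inverted_cdf:", aicdf_pair)
```

On this toy input the linear method gives different 30th percentiles for the original and the duplicated data, while the averaged inverted CDF gives the same value for both, which is why the switch matters for the weight/repetition equivalence.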
func = _averaged_weighted_percentile
if len(distinct_values) <= max_bins:
    max_bins = len(distinct_values)
    func = _weighted_percentile
Why would _averaged_weighted_percentile not work when len(distinct_values) <= max_bins?
>>> import numpy as np
>>> from sklearn.utils.stats import _weighted_percentile, _averaged_weighted_percentile
>>> a = np.random.randint(0, 10, size=10000).astype(np.float64)
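>>> max_bins = 30
>>> # Assumed definition: missing from the original snippet, chosen to match the 31 outputs below.
>>> percentiles = np.linspace(0, 100, num=max_bins + 1)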
>>> max_bins > np.unique(a).shape[0]
True
>>> np.array(
... [_averaged_weighted_percentile(a, np.ones_like(a), percentile) for percentile in percentiles]
... )
array([0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4., 5.,
5., 5., 6., 6., 6., 7., 7., 7., 8., 8., 8., 9., 9., 9.])
>>> np.array(
... [_weighted_percentile(a, np.ones_like(a), percentile) for percentile in percentiles]
... )
array([0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 4., 4., 4., 5.,
5., 5., 6., 6., 6., 7., 7., 7., 8., 8., 8., 9., 9., 9.])
When there are fewer distinct values than max_bins and sample weights are given, we get a discrepancy between the weighted and repeated cases in some special situations (see test_zero_sample_weights_classification in test_gradient_boosting.py). We need to see what to do here.
The manual computation of midpoints between distinct values in the weighted case is probably responsible for this discrepancy.
We should probably explore whether it's possible to post-process the output of _averaged_weighted_percentile to trim duplicated thresholds. If the number of trimmed thresholds is lower than max_bins, then it might be possible to further post-process to recover thresholds that match the unweighted case by shifting the thresholds to use 0.5 midpoints.
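A minimal sketch of the trimming step mentioned above, assuming the thresholds come back in ascending order. `trim_duplicate_thresholds` is a hypothetical helper written for this illustration, not a function from the PR.

```python
def trim_duplicate_thresholds(thresholds):
    """Drop repeated values from an ascending list of thresholds."""
    trimmed = []
    for t in thresholds:
        # Ascending input means duplicates are always adjacent.
        if not trimmed or t != trimmed[-1]:
            trimmed.append(t)
    return trimmed


# Same pattern as percentiles computed on data with few distinct values.
raw = [0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0]
trimmed = trim_duplicate_thresholds(raw)
print(trimmed)  # [0.0, 1.0, 2.0, 3.0]
```

Note that the trimmed values are observed feature values, not midpoints between them, which is the gap the further post-processing would have to close.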
@@ -198,6 +255,25 @@ def test_bin_mapper_repeated_values_invariance(n_distinct):
    assert_array_equal(binned_1, binned_2)


def test_binmapper_weighted_vs_repeated_equivalence(global_random_seed):
Could you expand this test (maybe with @pytest.mark.parametrize) to include a case where n_bins > np.unique(X[:, some_arbitrary_feature_idx]).shape[0]?
It's expected to fail but we need to solve this case anyway.
Yes, it fails, but it is good to have this test in.
# and return different values as compared to when using weights
# since the extra 1 breaks the tie. For now the test is slightly
# modified, in future we need to decide what to do when
# distinct_values<=max_bins but sample weights are supplied
Please do not comment out this case since we need to find a way to make this test pass.
Uncommented; the test fails now. It may also be worth trying a strategy such as:
midpoints = np.array([_averaged_weighted_percentile ...
if np.unique(midpoints).shape[0]!=midpoints.shape[0]:
    midpoints, counts = np.unique(midpoints, return_counts=True)
and then doing some calculations based on that during trimming. I need to think a bit more about it.
UPDATE: test passes now after some fixes
The only remaining failing test is the weighted_vs_repeated equivalence test in test_binning when max_bins > number of distinct values. I think this would need further changes to the _averaged_weighted_percentile implementation, as mentioned before.
midpoints = np.percentile(col_data, percentiles, method="midpoint").astype(
    X_DTYPE

midpoints = np.array(
I think this variable should be renamed bin_thresholds since they are no midpoints when n_unique > max_bins.
Bin thresholds are set to mid_points when there are fewer observed unique feature values than number of bins.
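For readers following the naming discussion, a small sketch of what "midpoints" means in the few-distinct-values case: the thresholds sit halfway between consecutive distinct feature values. `midpoint_thresholds` is an illustrative name, not the function used in the PR.

```python
def midpoint_thresholds(column):
    """Thresholds halfway between consecutive distinct values of a feature."""
    distinct = sorted(set(column))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


thresholds = midpoint_thresholds([1, 3, 3, 7])
print(thresholds)  # [2.0, 5.0]
```

When there are more distinct values than bins, the thresholds come from percentiles instead and are generally not midpoints, hence the suggested rename.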
Some more feedback.
else:
    # We could compute approximate midpoint percentiles using the output of
    # np.unique(col_data, return_counts) instead but this is more
    # work and the performance benefit will be limited because we
    # work on a fixed-size subsample of the full data.
    # TO DO: check if there is a better way to implement this
nitpick: the convention is to use TODO, which is easier to grep consistently and is often syntax highlighted by IDEs.
- # TO DO: check if there is a better way to implement this
+ # TODO: check if there is a better way to implement this
else:
    percentiles = np.linspace(0, 100, num=max_bins + 1)
    percentiles = percentiles[1:-1]
    midpoints = np.percentile(
Same here, those are no longer midpoints since we use method="averaged_inverted_cdf".
# by 0.5 if the unique points are less than distinct
# values
if np.unique(midpoints).shape[0] != midpoints.shape[0]:
    midpoints = np.unique(midpoints)
We should avoid calling np.unique(midpoints) repeatedly. At this stage, the input array has max_bins values so it's not that costly, but still, let's store this in a local variable.
if np.unique(midpoints).shape[0] != midpoints.shape[0]:
    midpoints = np.unique(midpoints)
    if len(distinct_values) <= len(midpoints):
        midpoints *= 0.5
Actually, this cannot correctly match the midpoint-based bin thresholds of the sample_weight=None side of the condition.
Maybe we should just abandon the idea of trying to set midpoint-based thresholds for the len(distinct_values) < max_bins case.
Instead, we should just call np.percentile(..., method="averaged_inverted_cdf") for the unweighted case and _averaged_weighted_percentile() for the weighted case, and trim the redundant thresholds a posteriori.
The code will be simpler and the equivalence guaranteed by construction in all cases.
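The "equivalent by construction" argument can be illustrated with a pure-Python sketch, assuming non-negative integer weights. Both functions below are illustrative stand-ins written for this comment (they are not `np.percentile` or `_averaged_weighted_percentile`): one works on ranks of the repeated data, the other on cumulative weights.

```python
import math


def averaged_inverted_cdf(values, pct):
    """Unweighted percentile: inverted empirical CDF, averaged at exact steps."""
    ordered = sorted(values)
    n = len(ordered)
    rank = n * pct / 100
    k = math.ceil(rank)
    if rank == k:
        return (ordered[max(k - 1, 0)] + ordered[min(k, n - 1)]) / 2
    return float(ordered[k - 1])


def weighted_averaged_inverted_cdf(values, weights, pct):
    """Weighted counterpart based on cumulative weights (zero weights ignored)."""
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    target = sum(w for _, w in pairs) * pct / 100
    lower = upper = None
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if lower is None and cumulative >= target:
            lower = value  # first value whose cumulative weight reaches the target
        if cumulative > target:
            upper = value  # first value whose cumulative weight exceeds it
            break
    if upper is None:  # target equals the total weight
        upper = lower
    return (lower + upper) / 2


values = [3, 1, 2, 5]
weights = [2, 0, 1, 3]
repeated = [v for v, w in zip(values, weights) for _ in range(w)]

mismatches = [
    pct
    for pct in range(0, 101, 10)
    if weighted_averaged_inverted_cdf(values, weights, pct)
    != averaged_inverted_cdf(repeated, pct)
]
print("percentiles where weighted != repeated:", mismatches)
```

Because both sides reduce to the same step function of the cumulative weight, no midpoint special case is needed; duplicates can then be trimmed identically on both sides.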
Currently the PR still uses midpoints. I have tested locally without the if len(distinct_values) <= len(midpoints) branch, and the following tests fail:
- test_bin_mapper_repeated_values_invariance (we seem to get one more bin value than expected, so the shapes don't match)
- test_find_binning_thresholds_small_regular_data: the last test, which expects midpoints, is now failing (we could instead just change the expected output here)
- test_binmapper_weighted_vs_repeated_equivalence[42-None]: we get an inhomogeneous array since each column returns a different number of unique percentile values
- test_bin_mapper_identity_small: I need to dig into this one more; I would not expect it to fail, but it may be due to the np.percentile calculation
amongst others. I think that for some of them changing the expected result may be fine, but I am not sure what to do when inhomogeneous arrays are returned. Let me know your thoughts.
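One possible way around the inhomogeneous-array problem in the equivalence test, sketched under the assumption that thresholds are kept as one list per feature: compare feature by feature instead of stacking everything into a single 2D array. `thresholds_match` and the example values are hypothetical.

```python
def thresholds_match(thresholds_a, thresholds_b, tol=1e-12):
    """Compare per-feature threshold lists that may have different lengths."""
    if len(thresholds_a) != len(thresholds_b):
        return False
    for feat_a, feat_b in zip(thresholds_a, thresholds_b):
        # A different number of thresholds for a feature is a mismatch.
        if len(feat_a) != len(feat_b):
            return False
        if any(abs(a - b) > tol for a, b in zip(feat_a, feat_b)):
            return False
    return True


# Hypothetical ragged thresholds: one list per feature, lengths differ.
weighted = [[0.5, 1.5], [2.0], [0.5, 1.5, 2.5]]
repeated = [[0.5, 1.5], [2.0], [0.5, 1.5, 2.5]]
print(thresholds_match(weighted, repeated))  # True
```

In the actual test the same idea could be expressed as a loop over features with an array comparison per feature.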
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
I can reproduce the test_sample_weight_effect failure locally; it only happens for specific seeds, though, e.g. global_random_seed=3. This may be due to some larger issue, but it is the only remaining test failure. @ogrisel and @antoinebaker, do you have any opinions? Never mind: it seems sensitive to the sample size; I have lowered it to 200 now.
Fixes #29640 . See also #27117.
Passes sample weights to BinMapper, which forwards them to _find_binning_thresholds, where bin midpoints are calculated using a weighted percentile. Sample weights are also passed to rng.choice() when subsampling in BinMapper for datasets larger than 2e5 samples.
NOTE: when the number of distinct values is at most n_bins, the best workaround was to set the bin thresholds to the midpoints. In future, it may be worth getting rid of this altogether, at the risk, however, of getting an inhomogeneous array from weighted_percentile. We will need to agree on the best method of trimming.
Major changes proposed:
NOTE: this also allows HGBT to pass weighted tests for more than 256 samples (but still less than 2e6)
TO DO:
KBinsDiscretizer
#29906. Conduct the analysis over many different values of random_state to check that the mean bin edges match.