ENH add subsample to HGBT #28063
Conversation
Can you adapt or extend the existing stochastic gradient boosting example, https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html, to check that similar results can be obtained with the HGBDT counterpart?
If the above works as expected, we can probably turn a simplified version into a non-regression test (by making assertions on the test loss values, without the plots) while keeping a reference from the test to the example to "explain" what is tested (and also referencing the Friedman 2002 paper).
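A possible starting point for that adaptation (a rough, untested sketch; the subsample keyword is the parameter proposed in this PR and not a released scikit-learn API, and the dataset/settings mirror the linked example only loosely):

```python
# Sketch of an HGBDT analogue of plot_gradient_boosting_regularization.
# `subsample` below is assumed to be the new parameter added by this PR.
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_hastie_10_2(n_samples=4000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

settings = [
    {"learning_rate": 1.0, "subsample": 1.0},
    {"learning_rate": 1.0, "subsample": 0.5},
    {"learning_rate": 0.1, "subsample": 1.0},
    {"learning_rate": 0.1, "subsample": 0.5},
]

for params in settings:
    clf = HistGradientBoostingClassifier(
        max_iter=200, early_stopping=False, random_state=0, **params
    )
    clf.fit(X_train, y_train)
    # Track the test loss per boosting iteration, as the original example
    # does with the deviance of GradientBoostingClassifier.
    test_losses = [
        log_loss(y_test, proba) for proba in clf.staged_predict_proba(X_test)
    ]
    print(params, "final test log-loss:", test_losses[-1])
```

A non-regression test could then assert, for instance, that the subsampled configurations reach a test loss comparable to (or lower than) the full-sample ones, without producing any plots.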
In addition to what I suggested above for testing, maybe you can parametrize existing HGBDT tests whose results should be invariant to whether subsampling is enabled, by adding:
@pytest.mark.parametrize("subsample", [0.5, 1.0])
whenever appropriate, as is done in sklearn/ensemble/tests/test_gradient_boosting.py.
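A minimal sketch of such a parametrized invariance test (the subsample keyword is the parameter proposed in this PR, assumed here; the test body and thresholds are illustrative only):

```python
import pytest
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor


@pytest.mark.parametrize("subsample", [0.5, 1.0])
def test_hgbt_fit_predict_with_subsample(subsample):
    # Checks behaviour that should not depend on whether subsampling
    # is enabled: the model fits, predicts with the right shape, and
    # reaches a reasonable training score.
    X, y = make_regression(n_samples=200, n_features=5, random_state=0)
    model = HistGradientBoostingRegressor(
        max_iter=20, subsample=subsample, random_state=0
    )
    model.fit(X, y)
    assert model.predict(X).shape == (X.shape[0],)
    assert model.score(X, y) > 0.5
```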
# Do out of bag if required
if do_oob:
    self._bagging_subsample_rng.shuffle(sample_mask)
    sample_weight_train = sample_weight_train_original * sample_mask
Using null sample_weight to mathematically simulate subsampling is simple to implement and does not require allocating any extra memory, but it does not produce the expected 2x computational speed-up when using a typical subsample=0.5.
Have you considered row-wise fancy indexing of the training data instead?
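To make the trade-off concrete, here is a small sketch contrasting the two strategies outside of the HGBT code base (the names rng, X_binned, and sample_weight_train are illustrative and not the actual internals touched by this PR):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, subsample = 1000, 0.5
X_binned = rng.integers(0, 255, size=(n_samples, 10), dtype=np.uint8)
sample_weight_train = np.ones(n_samples)

# (a) Weight masking: zero out ~half of the weights. No copy of X, but
#     every row still flows through histogram building, so no ~2x speed-up.
mask = rng.random(n_samples) < subsample
masked_weights = sample_weight_train * mask

# (b) Row-wise fancy indexing: physically keep only the subsampled rows.
#     Costs a memory copy of the kept rows, but shrinks the data actually
#     processed at each boosting iteration.
indices = rng.choice(n_samples, size=int(subsample * n_samples), replace=False)
X_sub = X_binned[indices]
w_sub = sample_weight_train[indices]
```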
The comments from my first review pass were marked as resolved without any pushed code change to address them and without any discussion.
Sorry, I forgot to push. Will do as soon as time permits.
One problem with the current approach of just setting sample weights to zero is that the sample counts in the histograms are wrong. Another solution is to pass a sample mask everywhere, which is quite a massive change that I'm hesitant to implement.
Why not just fancy index to physically resample the training set? To avoid the memory copy?
With zero sample_weight, the count statistics of the histograms are wrong.
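A toy illustration of the count issue (plain NumPy, not the HGBT histogram internals): if the per-bin count tallies rows rather than weight mass, zero-weight rows are still counted, whereas physical resampling yields the counts the splitter should actually see.

```python
import numpy as np

rng = np.random.default_rng(0)
binned_feature = rng.integers(0, 4, size=8)   # bin index per sample
mask = np.array([1, 0, 1, 0, 1, 1, 0, 1])     # subsample via zero weights

# Count per bin when every row is kept but some get zero weight:
count_with_zero_weights = np.bincount(binned_feature, minlength=4)

# Count per bin when the masked rows are physically removed:
count_with_resampling = np.bincount(binned_feature[mask == 1], minlength=4)

print(count_with_zero_weights)  # still includes the zero-weight rows
print(count_with_resampling)    # what a true subsample would report
```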
Reference Issues/PRs
Fixes #16062 (#27139 is already merged).
What does this implement/fix? Explain your changes.
Add subsample to HistGradientBoostingClassifier and HistGradientBoostingRegressor, similar to subsample in the old GradientBoostingClassifier.
Any other comments?
While the implementation is rather easy, suggestions for good tests are welcome.
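For context, intended usage would look roughly like the sketch below (the subsample keyword is the parameter this PR proposes and is not part of a released scikit-learn API; the dataset and values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
reg = HistGradientBoostingRegressor(
    max_iter=100,
    subsample=0.5,   # fit each boosting iteration on ~50% of the rows
    random_state=0,
)
reg.fit(X, y)
print(reg.score(X, y))
```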