ENH add subsample to HGBT #28063
Conversation
Can you adapt or extend the existing stochastic gradient boosting example, https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html, to check that similar results can be obtained with the HGBDT counterpart?
If the above works as expected, we can probably turn a simplified version into a non-regression test (by making assertions on the test loss values, without the plots) while keeping a reference from the test to the example to "explain" what is tested (and also referencing the Friedman 2002 paper).
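A possible starting point for that adaptation (a rough, untested sketch; the subsample keyword is the parameter proposed in this PR and not a released scikit-learn API, and the dataset/settings mirror the linked example only loosely):

```python
# Sketch of an HGBDT analogue of plot_gradient_boosting_regularization.
# `subsample` below is assumed to be the new parameter added by this PR.
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_hastie_10_2(n_samples=4000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

settings = [
    {"learning_rate": 1.0, "subsample": 1.0},
    {"learning_rate": 1.0, "subsample": 0.5},
    {"learning_rate": 0.1, "subsample": 1.0},
    {"learning_rate": 0.1, "subsample": 0.5},
]

for params in settings:
    clf = HistGradientBoostingClassifier(
        max_iter=200, early_stopping=False, random_state=0, **params
    )
    clf.fit(X_train, y_train)
    # Track the test loss per boosting iteration, as the original example
    # does with the deviance of GradientBoostingClassifier.
    test_losses = [
        log_loss(y_test, proba) for proba in clf.staged_predict_proba(X_test)
    ]
    print(params, "final test log-loss:", test_losses[-1])
```

A non-regression test could then assert, for instance, that the subsampled configurations reach a test loss comparable to (or lower than) the full-sample ones, without producing any plots.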
In addition to what I suggested above for testing, maybe you can parametrize existing HGBDT tests whose results should be invariant to whether subsampling is enabled, by adding:
@pytest.mark.parametrize("subsample", [0.5, 1.0])
whenever appropriate, as is done in sklearn/ensemble/tests/test_gradient_boosting.py.
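A minimal sketch of such a parametrized invariance test (the subsample keyword is the parameter proposed in this PR, assumed here; the test body and thresholds are illustrative only):

```python
import pytest
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor


@pytest.mark.parametrize("subsample", [0.5, 1.0])
def test_hgbt_fit_predict_with_subsample(subsample):
    # Checks behaviour that should not depend on whether subsampling
    # is enabled: the model fits, predicts with the right shape, and
    # reaches a reasonable training score.
    X, y = make_regression(n_samples=200, n_features=5, random_state=0)
    model = HistGradientBoostingRegressor(
        max_iter=20, subsample=subsample, random_state=0
    )
    model.fit(X, y)
    assert model.predict(X).shape == (X.shape[0],)
    assert model.score(X, y) > 0.5
```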
# Do out of bag if required
if do_oob:
    self._bagging_subsample_rng.shuffle(sample_mask)
    sample_weight_train = sample_weight_train_original * sample_mask
Using null sample_weight to mathematically simulate subsampling is simple to implement and does not require allocating any extra memory, but it does not produce the expected 2x computational speed-up when using a typical subsample=0.5.
Have you considered row-wise fancy indexing of the training data instead?
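To make the trade-off concrete, here is a small sketch contrasting the two strategies outside of the HGBT code base (the names rng, X_binned, and sample_weight_train are illustrative and not the actual internals touched by this PR):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, subsample = 1000, 0.5
X_binned = rng.integers(0, 255, size=(n_samples, 10), dtype=np.uint8)
sample_weight_train = np.ones(n_samples)

# (a) Weight masking: zero out ~half of the weights. No copy of X, but
#     every row still flows through histogram building, so no ~2x speed-up.
mask = rng.random(n_samples) < subsample
masked_weights = sample_weight_train * mask

# (b) Row-wise fancy indexing: physically keep only the subsampled rows.
#     Costs a memory copy of the kept rows, but shrinks the data actually
#     processed at each boosting iteration.
indices = rng.choice(n_samples, size=int(subsample * n_samples), replace=False)
X_sub = X_binned[indices]
w_sub = sample_weight_train[indices]
```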
The comments from my first review pass were marked as resolved without any pushed code change to address them and without any discussion.
Sorry, I forgot to push. Will do as soon as time permits.
One problem with the current approach of just setting sample weights to zero is that the sample counts in the histograms are wrong. Another solution is to pass a sample mask everywhere, which is quite a massive change that I'm hesitant to implement.
Why not just fancy index to physically resample the training set? To avoid the memory copy?
With zero sample_weight, the count statistics of the histograms are wrong.
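A toy illustration of the count issue (plain NumPy, not the HGBT histogram internals): if the per-bin count tallies rows rather than weight mass, zero-weight rows are still counted, whereas physical resampling yields the counts the splitter should actually see.

```python
import numpy as np

rng = np.random.default_rng(0)
binned_feature = rng.integers(0, 4, size=8)   # bin index per sample
mask = np.array([1, 0, 1, 0, 1, 1, 0, 1])     # subsample via zero weights

# Count per bin when every row is kept but some get zero weight:
count_with_zero_weights = np.bincount(binned_feature, minlength=4)

# Count per bin when the masked rows are physically removed:
count_with_resampling = np.bincount(binned_feature[mask == 1], minlength=4)

print(count_with_zero_weights)  # still includes the zero-weight rows
print(count_with_resampling)    # what a true subsample would report
```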
Reference Issues/PRs
Fixes #16062 (#27139 is already merged).
What does this implement/fix? Explain your changes.
Add subsample to HistGradientBoostingClassifier and HistGradientBoostingRegressor, similar to subsample in the old GradientBoostingClassifier.
Any other comments?
While the implementation is rather easy, suggestions for good tests are welcome.
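For context, intended usage would look roughly like the sketch below (the subsample keyword is the parameter this PR proposes and is not part of a released scikit-learn API; the dataset and values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
reg = HistGradientBoostingRegressor(
    max_iter=100,
    subsample=0.5,   # fit each boosting iteration on ~50% of the rows
    random_state=0,
)
reg.fit(X, y)
print(reg.score(X, y))
```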