
Add subsample and max_features parameters to HistGradientBoostingRegressor #16062

Open
Gitman-code opened this issue Jan 8, 2020 · 15 comments · May be fixed by #28063

Comments

@Gitman-code

The parameters subsample and max_features in GradientBoostingRegressor are useful. Is it possible to add equivalent parameters to HistGradientBoostingRegressor?
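For context, this is roughly how the two requested parameters are used on the existing estimator (the values below are only illustrative):

```python
from sklearn.ensemble import GradientBoostingRegressor

# subsample: fraction of samples drawn (without replacement) to fit each tree
# max_features: number/fraction of features considered when searching each split
gbr = GradientBoostingRegressor(subsample=0.8, max_features="sqrt")
```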

@NicolasHug
Member

Do you use these because of computation time, or for a bias / variance tradeoff?

I suppose you could expect some speed-ups. However, since the HistGradientBoosting estimators first bin the data, these parameters might not have the same effect on the bias / variance tradeoff.

@Gitman-code
Author

It was to prevent overfitting. I have read that LightGBM can have issues with overfitting, so I thought this would be a good solution. If things work differently in this method, please explain.

@NicolasHug
Member

Looks like LightGBM has bagging_fraction for subsampling, and feature_fraction for dealing with overfitting.
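For reference, a minimal LightGBM parameter set using those two knobs might look like this (values are illustrative; note that bagging_fraction only takes effect when bagging_freq > 0):

```python
import lightgbm as lgb

params = {
    "objective": "regression",
    "bagging_fraction": 0.8,  # row subsampling per iteration
    "bagging_freq": 1,        # re-draw the sample at every iteration
    "feature_fraction": 0.8,  # column subsampling per tree
}
# train_set = lgb.Dataset(X_train, y_train)  # X_train, y_train are placeholders
# booster = lgb.train(params, train_set, num_boost_round=100)
```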

I think it's worth implementing, WDYT @ogrisel @adrinjalali?

@ogrisel
Member

ogrisel commented Jan 13, 2020

I believe that if we implement both bagging_fraction and feature_fraction, and set a low learning rate (with a large number of boosting iterations), we can get boosted trees to behave more like random forests, which can probably deal better with overfitting.

This is probably also interesting for getting better calibration of the predict_proba output (in classification problems), which can be useful for reporting meaningful confidence levels when deploying the system.

Having the ability to resample each boosted tree independently could also be a very efficient way to deal with imbalanced classification problems: the majority class could be subsampled. This would warrant additional parameters though to deal with class-wise bagging fractions. Related to: #13227.
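To make the first point concrete, here is a hedged sketch with the classic estimator, which already exposes these knobs (the values are only illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Low learning rate + many iterations + row/column subsampling makes the
# ensemble behave a bit more like a random forest.
clf = GradientBoostingClassifier(
    learning_rate=0.05,   # low learning rate
    n_estimators=1000,    # large number of boosting iterations
    subsample=0.5,        # row subsampling per tree
    max_features="sqrt",  # column subsampling per split
)
```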

@amueller
Member

For the feature_fraction, is that on a per-tree basis or a per-split basis? Our GradientBoosting does max_features, which is on a per-split basis...

@Gitman-code
Author

Doing it on a per-split basis is the preferred method.

@ogrisel
Member

ogrisel commented Jan 29, 2020

I agree. My intuition is that if max_depth / max_leaf_nodes are high enough, the tree still has a chance to recover from a bad feature subset draw at a given split node, whereas if we do it at the tree level, the whole tree is wasted.
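A toy sketch of the distinction (not the actual splitter code; n_features and k are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, k = 10, 3

# Per-tree: one subset is drawn up front and reused for every split in the
# tree, so a bad draw hurts the whole tree.
per_tree_subset = rng.choice(n_features, size=k, replace=False)

# Per-split: a fresh subset is drawn at each split node, so a bad draw only
# affects that node and deeper splits can recover.
def per_split_subset():
    return rng.choice(n_features, size=k, replace=False)
```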

@amueller
Member

@DrEhrfurchtgebietend or anyone: for posterity, is that also what the other libs do?

@NicolasHug
Member

LightGBM uses a per-tree approach for max_features. https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L325

@xuyxu

xuyxu commented May 11, 2020

Is anyone working on the sampling part of HistGradientBoostingClassifier/Regressor? Apart from acceleration, a sampling strategy also regularizes the entire GBDT system and can improve its performance.

Here are some commonly-used per-tree sampling strategies in other GBDT libraries:

@NicolasHug
Member

@gbolmier is currently working on feature and sample subsampling.

However, the GOSS strategy from LightGBM is out of scope (at least for now), as it departs quite significantly from traditional GBDTs.

@lorentzenchr
Member

> Doing it on a per-split basis is the preferred method.

@mayer79 Do you think the same for column/feature subsampling?

@mayer79
Contributor

mayer79 commented Aug 20, 2023

@lorentzenchr: Ideally, we could have both options, but "per split" is also my favourite, given that it does not interfere too much with interaction constraints!

@lorentzenchr
Member

LightGBM has 2: feature_fraction (alias colsample_bytree) and feature_fraction_bynode (alias colsample_bynode). The latter is for split-based subsampling.

XGBoost even has 3 different ones: colsample_bytree, colsample_bylevel, and colsample_bynode. The last one is for split-based subsampling.

I propose we go with colsample_bynode, i.e. per-split subsampling.

@lorentzenchr lorentzenchr changed the title [Feature Request] Add subsample and max_features parameters to HistGradientBoostingRegressor Add subsample and max_features parameters to HistGradientBoostingRegressor Aug 22, 2023
@ogrisel
Member

ogrisel commented Nov 15, 2023

Feature subsampling has been merged. Next step: sample-wise subsampling.
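Assuming the merged feature subsampling follows the classic estimator's max_features naming and takes a float fraction of features drawn at each split, usage would look roughly like this:

```python
from sklearn.ensemble import HistGradientBoostingRegressor

# Hedged usage sketch: max_features is assumed here to be the fraction of
# features drawn at random for each split, per the merged feature-subsampling
# work referenced above.
reg = HistGradientBoostingRegressor(max_features=0.5, random_state=0)
```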

Also, it would be great if someone wanted to open a PR to randomize tie breaking when considering splits with equal Newton steps. This could be done by shuffling the order of the features as suggested in #26428, or by flipping an RNG coin each time an exact tie is detected when pushing to the heap queue.
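A toy sketch of one way to get that effect, using a random tiebreaker in the heap key rather than an explicit coin flip at push time (illustrative only, not the actual splitter code):

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)

# Candidate splits keyed by negated gain; the random second key means that
# candidates with exactly equal gains are popped in a random order.
heap = []
for node_id, gain in [(0, 1.5), (1, 1.5), (2, 0.7)]:
    heapq.heappush(heap, (-gain, rng.random(), node_id))

best_gain, _, best_node = heapq.heappop(heap)  # node 0 or node 1, at random
```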
