
Add subsample and max_features parameters to HistGradientBoostingRegressor #16062

Open
Gitman-code opened this issue Jan 8, 2020 · 15 comments · May be fixed by #28063

Comments

@Gitman-code

The parameters subsample and max_features in GradientBoostingRegressor are useful. Is it possible to add equivalent parameters to HistGradientBoostingRegressor?
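For context, this is roughly how the two requested parameters are used on the existing estimator (the values below are only illustrative):

```python
from sklearn.ensemble import GradientBoostingRegressor

# subsample: fraction of samples drawn (without replacement) to fit each tree
# max_features: number/fraction of features considered when searching each split
gbr = GradientBoostingRegressor(subsample=0.8, max_features="sqrt")
```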

@NicolasHug
Member

Do you use these because of computation time, or for a bias / variance tradeoff?

I suppose you could expect some speed-ups. However, since the HistGradientBoosting estimators first bin the data, these parameters might not have the same effect on the bias / variance tradeoff.

@Gitman-code
Author

It was to prevent overfitting. I have read that LightGBM can have issues with overfitting, so I thought this would be a good solution. If things work differently in this method, please explain.

@NicolasHug
Member

Looks like LightGBM has bagging_fraction for subsampling, and feature_fraction for dealing with overfitting.
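For reference, a minimal LightGBM parameter set using those two knobs might look like this (values are illustrative; note that bagging_fraction only takes effect when bagging_freq > 0):

```python
import lightgbm as lgb

params = {
    "objective": "regression",
    "bagging_fraction": 0.8,  # row subsampling per iteration
    "bagging_freq": 1,        # re-draw the sample at every iteration
    "feature_fraction": 0.8,  # column subsampling per tree
}
# train_set = lgb.Dataset(X_train, y_train)  # X_train, y_train are placeholders
# booster = lgb.train(params, train_set, num_boost_round=100)
```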

I think it's worth implementing, WDYT @ogrisel @adrinjalali?

@ogrisel
Member

ogrisel commented Jan 13, 2020

I believe that if we implement both bagging_fraction and feature_fraction, and set a low learning rate (with a large number of boosting iterations), we can get boosted trees to behave more like random forests, which can probably deal better with overfitting.

This is probably also interesting for getting better calibration of the predict_proba output (in classification problems), which can be useful for reporting meaningful confidence levels when deploying the system.

Having the ability to resample each boosted tree independently could also be a very efficient way to deal with imbalanced classification problems: the majority class could be subsampled. This would warrant additional parameters though to deal with class-wise bagging fractions. Related to: #13227.
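To make the first point concrete, here is a hedged sketch with the classic estimator, which already exposes these knobs (the values are only illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Low learning rate + many iterations + row/column subsampling makes the
# ensemble behave a bit more like a random forest.
clf = GradientBoostingClassifier(
    learning_rate=0.05,   # low learning rate
    n_estimators=1000,    # large number of boosting iterations
    subsample=0.5,        # row subsampling per tree
    max_features="sqrt",  # column subsampling per split
)
```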

@amueller
Member

For the feature_fraction, is that on a per-tree basis or a per-split basis? Our GradientBoosting does max_features, which is on a per-split basis...

@Gitman-code
Author

Doing it on a per-split basis is the preferred method.

@ogrisel
Member

ogrisel commented Jan 29, 2020

I agree. My intuition is that if max_depth / max_leaf_nodes are high enough, the tree still has a chance to recover from a bad feature subset draw at a given split node, whereas if we do it at the tree level, the whole tree is wasted.
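A toy sketch of the distinction (not the actual splitter code; n_features and k are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, k = 10, 3

# Per-tree: one subset is drawn up front and reused for every split in the
# tree, so a bad draw hurts the whole tree.
per_tree_subset = rng.choice(n_features, size=k, replace=False)

# Per-split: a fresh subset is drawn at each split node, so a bad draw only
# affects that node and deeper splits can recover.
def per_split_subset():
    return rng.choice(n_features, size=k, replace=False)
```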

@amueller
Member

@DrEhrfurchtgebietend or anyone: for posterity, is that also what the other libs do?

@NicolasHug
Member

LightGBM uses a per-tree approach for max_features. https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L325

@xuyxu

xuyxu commented May 11, 2020

Is anyone working on the sampling part of HistGradientBoostingClassifier/Regressor? Apart from acceleration, a sampling strategy also regularizes the entire GBDT system and can improve its performance.

Here are some commonly-used per-tree sampling strategies in other GBDT libraries:

@NicolasHug
Member

@gbolmier is currently working on feature and sample subsampling.

However, the GOSS strategy from LightGBM is out of scope (at least for now), as it departs quite significantly from traditional GBDTs.

@lorentzenchr
Member

> Doing it on a per-split basis is the preferred method.

@mayer79 Do you think the same for column/feature subsampling?

@mayer79
Contributor

mayer79 commented Aug 20, 2023

@lorentzenchr: Ideally, we could have both options, but "per split" is also my favourite, given that it does not interfere too much with interaction constraints!

@lorentzenchr
Member

LightGBM has 2: feature_fraction (alias colsample_bytree) and feature_fraction_bynode (alias colsample_bynode). The latter is for split-based subsampling.

XGBoost even has 3 different ones: colsample_bytree, colsample_bylevel, and colsample_bynode. The last one is for split-based subsampling.

I propose we go with colsample_bynode, i.e. per-split subsampling.

@lorentzenchr lorentzenchr changed the title [Feature Request] Add subsample and max_features parameters to HistGradientBoostingRegressor Add subsample and max_features parameters to HistGradientBoostingRegressor Aug 22, 2023
@ogrisel
Member

ogrisel commented Nov 15, 2023

Feature subsampling has been merged. Next step: sample-wise subsampling.
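Assuming the merged feature subsampling follows the classic estimator's max_features naming and takes a float fraction of features drawn at each split, usage would look roughly like this:

```python
from sklearn.ensemble import HistGradientBoostingRegressor

# Hedged usage sketch: max_features is assumed here to be the fraction of
# features drawn at random for each split, per the merged feature-subsampling
# work referenced above.
reg = HistGradientBoostingRegressor(max_features=0.5, random_state=0)
```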

Also, it would be great if someone wanted to open a PR to randomize tie breaking when considering splits with equal Newton steps. This could be done by shuffling the order of the features as suggested in #26428, or by flipping an RNG coin each time an exact tie is detected when pushing to the heap queue.
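A toy sketch of one way to get that effect, using a random tiebreaker in the heap key rather than an explicit coin flip at push time (illustrative only, not the actual splitter code):

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)

# Candidate splits keyed by negated gain; the random second key means that
# candidates with exactly equal gains are popped in a random order.
heap = []
for node_id, gain in [(0, 1.5), (1, 1.5), (2, 0.7)]:
    heapq.heappush(heap, (-gain, rng.random(), node_id))

best_gain, _, best_node = heapq.heappop(heap)  # node 0 or node 1, at random
```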
