Add subsample and max_features parameters to HistGradientBoostingRegressor #16062
Comments
Do you use these because of computation time, or for a bias / variance tradeoff? I suppose you could expect some speed-ups.
It was to prevent overfitting. I have read that LightGBM has issues with overfitting, so I thought this would be a good solution. If things are different in this method, please explain.
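For reference, these are the knobs the issue is asking to mirror; a minimal sketch with the existing GradientBoostingRegressor (the dataset and parameter values here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1_000, n_features=20, random_state=0)

# subsample < 1.0 fits each tree on a random fraction of the rows
# (stochastic gradient boosting); max_features < 1.0 draws a random
# subset of features at each split. Both act as regularizers.
est = GradientBoostingRegressor(subsample=0.8, max_features=0.5, random_state=0)
est.fit(X, y)
```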
Looks like LightGBM has equivalent parameters. I think it's worth implementing, WDYT @ogrisel @adrinjalali?
I believe that if we implement both parameters, this is probably also interesting to get better calibration of the model. Having the ability to resample each boosted tree independently could also be a very efficient way to deal with imbalanced classification problems: the majority class could be subsampled. This would warrant additional parameters, though, to deal with class-wise bagging fractions. Related to: #13227.
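A rough sketch of that last idea, assuming a hypothetical per-iteration row sampler (this is not an existing scikit-learn API): keep every minority-class row and downsample the majority class before fitting each tree.

```python
import numpy as np

def per_tree_indices(y, majority_fraction, rng):
    # Hypothetical helper: keep all minority rows, subsample the
    # majority class to `majority_fraction` of its original size.
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y != majority)
    n_keep = int(majority_fraction * maj_idx.size)
    kept_maj = rng.choice(maj_idx, size=n_keep, replace=False)
    return np.concatenate([min_idx, kept_maj])

# One such call per boosting iteration gives each tree its own rebalanced view.
rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)
idx = per_tree_indices(y, majority_fraction=0.2, rng=rng)
```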
For the feature_fraction, is that on a per-tree basis or a per-split basis? Our GradientBoosting has "max_features", which works on a per-split basis.
Doing it on a per-split basis is the preferred method.
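To make the per-tree vs. per-split distinction concrete, a small illustrative sketch (a hypothetical helper, not library code): per-tree sampling draws one feature subset reused by the whole tree, while per-split sampling redraws it at every node.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, fraction = 10, 0.5
n_sampled = int(fraction * n_features)

# Per-tree: one subset, reused for every split of that tree.
tree_features = rng.choice(n_features, size=n_sampled, replace=False)

# Per-split: a fresh subset drawn each time a node is considered for splitting.
def features_for_split():
    return rng.choice(n_features, size=n_sampled, replace=False)
```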
I agree.
@DrEhrfurchtgebietend or anyone: for posterity, is that also what the other libs do?
LightGBM uses a per-tree approach for max_features: https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L325
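For example, with LightGBM's native API (a sketch; the data and values are arbitrary), feature_fraction samples columns once per tree, while bagging_fraction/bagging_freq subsample rows per iteration:

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 20))
y = X[:, 0] + rng.normal(scale=0.1, size=1_000)

params = {
    "objective": "regression",
    "feature_fraction": 0.8,  # per-tree column subsampling
    "bagging_fraction": 0.8,  # row subsampling...
    "bagging_freq": 1,        # ...re-drawn at every iteration
    "verbosity": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)
```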
Is anyone working on the sampling part of HistGradientBoostingClassifier/Regressor? Apart from acceleration, sampling strategies also provide regularization for the entire GBDT system and improve its performance. Other GBDT libraries offer several commonly used per-tree sampling strategies, such as GOSS in LightGBM.
@gbolmier is currently working on feature and sample subsampling. However, the GOSS strategy from LightGBM is out of scope (at least for now) as it departs quite significantly from traditional GBDTs.
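For context, GOSS (Gradient-based One-Side Sampling) keeps the rows with the largest gradients and randomly subsamples the remaining rows, up-weighting them so the gradient statistics stay roughly unbiased. A minimal numpy sketch of the selection step (an illustration, not LightGBM's actual implementation):

```python
import numpy as np

def goss_select(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Return (indices, weights) for one GOSS-style sampling round."""
    rng = np.random.default_rng() if rng is None else rng
    n = gradients.shape[0]
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(np.abs(gradients))[::-1]
    top_idx = order[:n_top]  # largest-gradient rows, all kept
    sampled_rest = rng.choice(order[n_top:], size=n_other, replace=False)

    idx = np.concatenate([top_idx, sampled_rest])
    weights = np.ones(idx.shape[0])
    # Up-weight the randomly kept small-gradient rows so their total
    # contribution approximates that of the full remainder.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return idx, weights
```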
@mayer79 Do you think the same for column/feature subsampling?
@lorentzenchr: Ideally, we could have both options, but "per split" is also my favourite, provided it does not interfere too much with interaction constraints!
LightGBM has 2 (feature_fraction, applied per tree, and feature_fraction_bynode, applied per split). XGBoost even has 3 different ones: colsample_bytree, colsample_bylevel and colsample_bynode. I propose we go with the per-split variant.
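For comparison, XGBoost's scikit-learn wrapper exposes all three column-sampling levels (they multiply, so subsets are effectively re-drawn per tree, per depth level and per node); a sketch with arbitrary values:

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=200,
    colsample_bytree=0.8,   # columns sampled once per tree
    colsample_bylevel=0.8,  # ...then again per depth level
    colsample_bynode=0.8,   # ...then again per split/node
    subsample=0.8,          # row subsampling per boosting round
    random_state=0,
)
```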
Feature subsampling has been merged. Next step: sample-wise subsampling. Also, it would be great if someone wanted to open a PR to randomize tie breaking when considering splits with equal Newton steps. This could be done by shuffling the order of the features, as suggested in #26428, or by flipping an RNG coin each time an exact tie is detected when pushing to the heap queue.
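A small sketch of the second tie-breaking option (hypothetical, not scikit-learn's actual splitter): when pushing split candidates onto a heap keyed by gain, add a random secondary key so candidates with exactly equal gain are not always resolved in feature order.

```python
import heapq
import random

heap = []  # max-heap emulated by negating the gain

def push_candidate(gain, feature_idx, rng=random):
    # The random secondary key decides the order whenever two candidates
    # have exactly the same gain.
    heapq.heappush(heap, (-gain, rng.random(), feature_idx))

push_candidate(1.5, feature_idx=0)
push_candidate(1.5, feature_idx=1)  # exact tie: winner is chosen at random
best_gain, _, best_feature = heapq.heappop(heap)
```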
The parameters subsample and max_features in GradientBoostingRegressor are useful. Is it possible to add equivalent parameters to HistGradientBoostingRegressor?