RFC Unify old GradientBoosting estimators and HGBT #27873

lorentzenchr opened this issue Nov 29, 2023 · 7 comments

@lorentzenchr
Member

lorentzenchr commented Nov 29, 2023

Current situation

We are in the unfortunate situation of having 2 different versions of gradient boosting: the old estimators (GradientBoostingClassifier and GradientBoostingRegressor) as well as the new ones using binning and histogram strategies similar to LightGBM (HistGradientBoostingClassifier and HistGradientBoostingRegressor).

This makes advertising the new ones harder, e.g. #26826, and also results in a larger feature gap between the two.
Based on discussions in #27139 and during a monthly meeting (maybe not documented), I'd like to call for comments on the following:

Proposition

Unify both types of gradient boosting in a single class, i.e. keep the old names GradientBoostingClassifier and GradientBoostingRegressor and make them switch the underlying implementation based on a parameter value, e.g. max_bins (None -> old classes, integer -> new classes).

Note that binning and histograms are not the only difference.
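To make this concrete, here is a minimal sketch of the dispatching idea (class name and attributes are purely illustrative, not the proposed API):

```python
# Illustrative sketch only -- not scikit-learn's actual implementation.
# It assumes the unified class keeps the old name and forwards to one of
# the two existing implementations depending on `max_bins`.
from sklearn.ensemble import (
    GradientBoostingRegressor as _ExactGBR,
    HistGradientBoostingRegressor as _HistGBR,
)


class UnifiedGradientBoostingRegressor:
    """Hypothetical unified estimator dispatching on max_bins."""

    def __init__(self, max_bins=None, **params):
        self.max_bins = max_bins
        self.params = params

    def fit(self, X, y):
        if self.max_bins is None:
            # None -> old exact splitter (current GradientBoostingRegressor)
            self._impl = _ExactGBR(**self.params)
        else:
            # integer -> new histogram-based estimator (current HGBT)
            self._impl = _HistGBR(max_bins=self.max_bins, **self.params)
        self._impl.fit(X, y)
        return self

    def predict(self, X):
        return self._impl.predict(X)
```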

Comparison

Algorithm

The old GBT uses Friedman gradient boosting with a line search step. (The line search sometimes, e.g. for the log loss, uses a second-order approximation and is therefore sometimes called "hybrid gradient-Newton boosting".) The trees are learned on the gradients. A tree searches for the best split among all (very many) split candidates for all features. After a single tree is fit, the terminal node values are re-computed, which corresponds to a line search step.

The new HGBT uses a second-order approximation of the loss, i.e. gradients and hessians (as in the XGBoost paper, therefore sometimes called Newton boosting). In addition, it bins/discretizes the features X and uses a histogram of gradients/hessians/counts per feature. A tree then searches for the best split candidate, but there are only n_features * n_bins candidates (much fewer than in GBT).
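A rough sketch of the binning/histogram idea for a single feature (illustrative NumPy only, assuming the squared error so the per-sample hessians are constant; not HGBT's actual code):

```python
# Conceptual sketch of the histogram trick for one feature.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)          # one raw feature
gradients = rng.normal(size=1000)  # per-sample gradients of the loss

n_bins = 8
# 1) bin/discretize the feature once, e.g. at quantile edges
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
binned = np.searchsorted(edges, x)  # integer bin index per sample

# 2) accumulate gradient sums and counts per bin (the "histogram")
grad_hist = np.bincount(binned, weights=gradients, minlength=n_bins)
count_hist = np.bincount(binned, minlength=n_bins)

# 3) only n_bins - 1 split candidates per feature need to be evaluated,
#    using cumulative sums of the histogram, instead of one candidate per
#    distinct sample value as in the exact splitter.
left_grad = np.cumsum(grad_hist)[:-1]
left_count = np.cumsum(count_hist)[:-1]
```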

| estimator | trees train on | node values (consequence of tree training) | features X |
|---|---|---|---|
| GBT | gradients | re-computed in line search | used as is |
| HGBT | gradients/hessians | sum(gradient)/sum(hessian) | binned/discretized |

In fact, one could use a second-order loss approximation (gradients and hessians) without binning X, and vice versa, use binning while fitting trees on gradients only (without hessians).
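As a toy illustration of the difference in terminal node values (a sketch assuming the binary log loss; not the code of either estimator):

```python
# Toy illustration of how a single leaf value is obtained in the two
# estimators, assuming the binary log loss.
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)   # targets in one leaf
raw_pred = rng.normal(size=200)                  # current raw predictions
p = 1.0 / (1.0 + np.exp(-raw_pred))              # predicted probabilities

gradients = p - y            # first derivative of the log loss
hessians = p * (1.0 - p)     # second derivative

# Old GBT: the tree is grown on the gradients, so its raw leaf prediction
# would be their mean ...
naive_leaf = gradients.mean()
# ... but it is then re-computed by a line search; for the log loss this is
# a single Newton step ("hybrid gradient-Newton boosting"):
leaf_value_gbt = -gradients.sum() / hessians.sum()

# New HGBT: gradients and hessians drive both split finding and the leaf
# value, which is directly sum(gradient) / (sum(hessian) + l2):
l2_regularization = 0.0
leaf_value_hgbt = -gradients.sum() / (hessians.sum() + l2_regularization)
```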

Parameters

| HistGradientBoostingRegressor | GradientBoostingRegressor | Same | Comment |
|---|---|---|---|
| loss | loss | ✅ | |
| quantile | alpha | ❌ | |
| learning_rate | learning_rate | ✅ | |
| max_iter | n_estimators | ❌ | #12807 (comment) |
| max_leaf_nodes | max_leaf_nodes | ✅ | |
| max_depth | max_depth | ✅ | |
| min_samples_leaf | min_samples_leaf | ✅ | |
| l2_regularization | ⛔ | ❌ | |
| max_features | max_features | ✅ | |
| max_bins | ⛔ (nonsense) | ❌ | |
| categorical_features | ⛔ | ❌ | |
| monotonic_cst | ⛔ | ❌ | #27305 |
| interaction_cst | ⛔ | ❌ | |
| warm_start | warm_start | ✅ | |
| early_stopping | ⛔ | ❌ | |
| scoring | ⛔ | ❌ | |
| validation_fraction | validation_fraction | ✅ | |
| n_iter_no_change | n_iter_no_change | ✅ | |
| tol | tol | ✅ | |
| verbose | verbose | ✅ | |
| random_state | random_state | ✅ | |
| class_weight | ⛔ | ❌ | |
| ⛔ | subsample | ❌ | #16062 |
| ⛔ (nonsense) | criterion | ❌ | |
| ⛔ | min_samples_split | ❌ | |
| ⛔ | min_weight_fraction_leaf | ❌ | |
| ⛔ | min_impurity_decrease | ❌ | |
| ⛔ | init | ❌ | #27109 |
| ⛔ | ccp_alpha | ❌ | |

In fact, only the quantile/alpha and max_iter/n_estimators parameter pairs are conflicting.
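A hypothetical sketch of how those two conflicting pairs could be handled during a deprecation cycle (the mapping and the warning text are illustrative, not an agreed design):

```python
# Hypothetical parameter-renaming shim for the two conflicting pairs.
import warnings

_RENAMED = {"alpha": "quantile", "n_estimators": "max_iter"}


def _resolve_params(params):
    """Map old GradientBoosting* names to the unified (HGBT-style) names."""
    resolved = {}
    for name, value in params.items():
        if name in _RENAMED:
            new_name = _RENAMED[name]
            warnings.warn(
                f"Parameter {name!r} was renamed to {new_name!r} and will be "
                "removed in a future release.",
                FutureWarning,
            )
            name = new_name
        resolved[name] = value
    return resolved


# Example: an old-style call keeps working but warns.
print(_resolve_params({"n_estimators": 100, "learning_rate": 0.1}))
```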

@github-actions github-actions bot added the Needs Triage Issue requires triage label Nov 29, 2023
@lorentzenchr lorentzenchr added RFC and removed Needs Triage Issue requires triage labels Nov 29, 2023
@adrinjalali
Member

This seems like a somewhat painful path (deprecation-wise), but I think we can greatly benefit from it. So I'm overall +1.

@glevv
Contributor

glevv commented Dec 6, 2023

About first- and second-order optimization: I think catboost has an interesting take on it, where you can select the optimizer order by setting leaf_estimation_method to {Newton, Gradient, Exact}.
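For reference, a small example of that CatBoost parameter (requires the catboost package; the accepted values are the ones listed above):

```python
# Small example of selecting the optimizer order in CatBoost.
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    leaf_estimation_method="Newton",  # alternatives: "Gradient", "Exact"
    verbose=False,
)
```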

This will be a huge refactoring and documentation rewrite. From the user standpoint, it is really frustrating when a lot of parameter interactions are not documented: you set one parameter and it actually changes the available options of other parameters, and you will never know this until you try (catboost is actually a great example of such behaviour).

Also about HGBM: I think there are some features missing that make it a bit less appealing than XGBoost or LightGBM, for example subsampling (of samples and features).

@betatim
Member

betatim commented Dec 7, 2023

I support the idea of making HistGradientBoosting* available as GradientBoosting*, and therefore the default. Our best "product" should be available at the prime location of our shop (aka name), not hidden under the counter. I think the proposed method of using the value of n_bins to choose between the two is the best idea so far for how to do this. But it does require significant effort and a few deprecation cycles :-/ c'est la vie.

I can see that (massively?) switching the algorithm you get based on a constructor argument can create confusion. However, I wonder how many people who use gradient boosting actually care about the precise algorithm. I'd include people who can't explain the differences between the two in the "don't care" category (this includes me).


Having both versions available under a dedicated name, say, HistGradientBoosting* and ExactGradientBoosting* (?), could be nice for people who want to be super explicit. You could take this idea further by offering more classes that use an estimator but hardwire some of the constructor arguments, as sketched below. The goal would be to make it simpler for users (dealing with the forest of constructor args) and increase discoverability. A downside would be a proliferation of class names. But this is going off-topic.
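A minimal sketch of such a thin, explicit wrapper (the name ExactGradientBoostingRegressor and the chosen hardwired arguments are hypothetical):

```python
# Purely hypothetical sketch of a dedicated, explicit alias class;
# ExactGradientBoostingRegressor does not exist in scikit-learn.
from sklearn.ensemble import GradientBoostingRegressor


class ExactGradientBoostingRegressor(GradientBoostingRegressor):
    """Alias exposing only a few common knobs and hardwiring the rest."""

    def __init__(self, learning_rate=0.1, n_estimators=100, max_depth=3):
        super().__init__(
            learning_rate=learning_rate,
            n_estimators=n_estimators,
            max_depth=max_depth,
            # everything else stays at the library defaults
        )
```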

@amueller
Member

We switch the algorithm with a constructor parameter in many models, right (for good or bad)?
I think unification would be nice, but painful and laborious.

@glevv the subsampling of features and samples could be achieved with a BaggingClassifier around it, right (see the sketch below)? But I guess that's not entirely the same thing. Both of these would be quite easy to add; I think we just didn't want to explode the number of hyper-parameters.
I think dropping the tree pruning parameters from GradientBoostingRegressor wouldn't be too bad, I don't think anyone in their right mind uses anything but max_depth or max_leaf_nodes.
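For reference, a sketch of that bagging workaround (assuming scikit-learn ≥ 1.2, where the parameter is named estimator); note the subsampling happens per bagged member, not per boosting iteration, so it is indeed not the same thing as LightGBM-style subsampling:

```python
# Row/feature subsampling via bagging around HGBT.
from sklearn.ensemble import BaggingClassifier, HistGradientBoostingClassifier

clf = BaggingClassifier(
    estimator=HistGradientBoostingClassifier(max_iter=100),
    n_estimators=10,
    max_samples=0.8,    # subsample rows for each bagged estimator
    max_features=0.8,   # subsample features for each bagged estimator
    random_state=0,
)
```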

@betatim
Member

betatim commented Mar 11, 2024

An idea I had relating to this issue: what do people think of a small tool that transforms users' code? It could either be an automatic tool that makes the required edits, or more like a linter that points you to things you should investigate. Taking it one step further, a linter for your scikit-learn usage would be cool. It could point out transformations you could make, anti-patterns, etc.

@adrinjalali
Member

If we do such a tool, I think it'd be nice to have it be a collaboration between a bunch of other projects in the pydata/scipy space.

@betatim
Member

betatim commented Apr 9, 2024

Someone just pointed me to https://numpy.org/devdocs/numpy_2_0_migration_guide.html#ruff-plugin

There is nothing new under the sun :D
