ENH add feature subsampling per split for HGBT #27139
Conversation
This will make the HistGradientBoostingClassifier/Regressor class parametrization more different from the other tree ensemble methods. Wouldn't a "feature"-based name be preferable? We rarely use the name "column" in the scikit-learn API but rather "feature". EDIT: I now see that this parameter is only defined in terms of a ratio (float) and does not allow an absolute integer number of features; I suppose this is to avoid the ambiguity of an integer 1 vs a float 1.0. This result might generalize to GBDTs, in which case it could be interesting to have a simple hyper-parameter setting to reproduce it with scikit-learn.
I intentionally named it the same as in LightGBM and XGBoost, which I personally find helpful. I also intentionally only allowed fractions as input, because with an integer "max_features=1" you could instead use stumps (tree depth = 1) to get an equivalent result. BTW, results for random forests usually do not carry over to GBTs: RFs are better with overfitting trees, GBTs with weak learners.
API comparison
Indeed, only the quantile/alpha and max_iter/n_estimators parameters deviate.
Indeed.
Ensembles of decision stumps cannot model feature interactions, while ensembles of deeper trees that consider a single feature per split still can. EDIT: in particular, decision stumps would fail on XOR tasks, while deeper trees with per-split feature subsampling would not.
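To make the stumps-vs-deeper-trees point concrete, here is a small sketch. It assumes a scikit-learn build that already includes the max_features parameter added by this PR; the exact accuracy numbers will depend on the random seed.

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int)

# Boosted stumps form an additive model in the two features, so they cannot
# represent the pure interaction of XOR and stay close to chance level.
stumps = HistGradientBoostingClassifier(max_depth=1, random_state=0).fit(X, y)

# Deeper trees that see only one random feature per split can still combine
# both features along a path, so they recover the interaction.
deep = HistGradientBoostingClassifier(
    max_depth=3, max_features=0.5, random_state=0
).fit(X, y)

print("stumps (training accuracy):", stumps.score(X, y))
print("deeper trees + feature subsampling (training accuracy):", deep.score(X, y))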
Note that this is about local (per observation) explanations (decomposition of the prediction function, independently of the true y), e.g. SHAP values, while https://arxiv.org/abs/2111.02218 is about global (aggregate over dataset/distribution) explanations (decomposition of the per-feature contribution to the aggregate loss): SAGE values, which can be shown to estimate the mutual information between feature and target variables under certain conditions. SHAP and SAGE values are both Shapley values (per-feature decompositions), but not of the same quantity. The feature importance (MDI) of RFs with a single randomly drawn feature per split can be related to mutual information in a similar way. But I agree that the sequential nature of the fit of GBDTs might prevent them from showing the same property as RFs.
You're right. Note that it can be achieved in this PR by a tiny subsample fraction (we could allow 0) because the minimum number of features is set to 1. Let's stop here as this is a more academic discussion; in real use cases, the fraction is likely closer to 1 than to 0. I also checked the run time with the Higgs benchmark and there is no time penalty. @scikit-learn/core-devs Opinions on naming this parameter are very welcome.
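A quick numeric illustration of the "minimum of one feature" point above; the exact rounding in the implementation may differ, this is just the general idea.

n_features = 200
for max_features in (0.001, 0.01, 0.5, 1.0):
    # At least one feature is always kept as a split candidate.
    n_candidates = max(1, int(max_features * n_features))
    print(f"max_features={max_features} -> {n_candidates} candidate feature(s) per split")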
Thinking about the name, I like:
Overall I'm not a huge fan of any of them :-/
If we are to use a different parameter name than in ..., I am not sure how to convey the "fraction"/"ratio"/"relative" vs. absolute integer meaning if we disallow integers.
Based on the discussion in the last dev meeting, I'll change to max_features.
@GaelVaroquaux maybe the lack of feature subsampling in the hyperparameter search grid could explain some of the discrepancy between scikit-learn and xgboost in this diagram (source: https://arxiv.org/abs/2207.08815). According to the appendix, the xgboost parameter grid indeed includes both sample and feature subsampling. Based on appendix A.6, feature subsampling might be even more important than the ...
I am wondering if we should not introduce this parameter under the name max_features_fraction. This would allow the possibility to deprecate passing float values to max_features in the other models, to avoid the usability trap of max_features=1 vs max_features=1.0, while keeping the meaning of that parameter consistent across all our tree implementations in scikit-learn.

Whether or not we decide to also add an absolute integer max_features parameter to HGBDT can be discussed later (if ever); that would be an orthogonal decision.
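As a side illustration of that usability trap with the existing API (nothing to do with this PR's code, just the current RandomForest behaviour):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Integer 1: consider a single feature per split.
rf_int = RandomForestClassifier(max_features=1, random_state=0).fit(X, y)
# Float 1.0: consider all 20 features per split.
rf_float = RandomForestClassifier(max_features=1.0, random_state=0).fit(X, y)

print(rf_int.estimators_[0].max_features_)    # 1
print(rf_float.estimators_[0].max_features_)  # 20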
Also, I am wondering: how should feature subsampling interact with interaction_cst? Shall we subsample among the allowed features once the constraint is enforced? Or subsample before taking the constraint into consideration, and then risk censoring splits quite heavily?

In any case, we need a dedicated test to check that the interaction_cst are respected even when feature subsampling is enabled, and maybe clarify in the docstring what happens when both hyperparameters are active together.
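One possible shape for such a test (a sketch under the assumption that the final parameter is named max_features; the actual test added to test_splitting.py may look quite different): with interaction_cst forbidding features 0 and 1 from interacting, the learned function must remain additive even when per-split feature subsampling is enabled, which can be checked with a second-order finite difference.

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(5000, 2))
y = X[:, 0] * X[:, 1]  # a pure interaction the constrained model cannot represent

model = HistGradientBoostingRegressor(
    interaction_cst=[{0}, {1}],  # features 0 and 1 may never interact
    max_features=0.5,            # new parameter: subsample features per split
    random_state=0,
).fit(X, y)

# For any additive function f, f(a, b) - f(a, d) - f(c, b) + f(c, d) == 0.
a, b, c, d = -0.5, -0.5, 0.5, 0.5
preds = model.predict(np.array([[a, b], [a, d], [c, b], [c, d]]))
second_diff = preds[0] - preds[1] - preds[2] + preds[3]
assert abs(second_diff) < 1e-6, second_diff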
Concerning the naming of this new feature, I see two options:
I would not add another name.
If I understand the logic in https://github.com/microsoft/LightGBM/blob/master/src/treelearner/col_sampler.hpp correctly, feature subsampling is performed after restricting to the features allowed by the interaction constraints.
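For illustration, a rough pseudocode of that ordering as I read it; this is an assumption about col_sampler.hpp, not scikit-learn's final implementation, and whether the fraction applies to all features or only to the allowed ones is a detail I am not certain about.

import numpy as np

def candidate_features(allowed_by_constraint, max_features, rng):
    # 1) restrict to the features allowed by the interaction constraints,
    # 2) only then draw the random per-split subset.
    allowed = sorted(allowed_by_constraint)
    n_keep = max(1, int(max_features * len(allowed)))
    return rng.choice(allowed, size=n_keep, replace=False)

rng = np.random.default_rng(0)
print(candidate_features({0, 2, 4, 6}, 0.5, rng))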
Then I would vote for ...
@scikit-learn/core-devs
If we do not go with option 1 (max_features), then we can add this new name to our old tree-based models. Edit: Based on the discussion, I opened #27347.
I'd vote for 4 (but maybe lobby for changing to ...). Overall I wonder if we should broaden the discussion to "do we want to, and if yes, how do we do it" with respect to moving away from allowing integers for max_features. While catching up with this thread I saw ...
LGTM. It took me a while to understand the test but I am now convinced it's correctly testing what we want.
It would be great to add an example but I am not sure how to do so. But I don't think it should be a requirement to merge this PR.
I plan to conduct a large enough hyper-parameter search to check whether it would select this new option more often than not among the highest performing models.
Related to this paper, https://jmlr.org/papers/v20/18-444.html: sampling features by split is indeed expected to improve accuracy on most datasets. They even recommend a default value of ~0.6 (although this default value is only "optimal" in conjunction with changes made to the other default values). I would rather not change the default value, to minimize disruption, but I found that interesting.
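For reference, opting in to that ~0.6 value would just be an explicit setting; a sketch assuming this PR's parameter, with the default left at 1.0 (no subsampling).

from sklearn.ensemble import HistGradientBoostingRegressor

# Follow the paper's suggestion explicitly instead of the default of 1.0.
model = HistGradientBoostingRegressor(max_features=0.6)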
Quick review from me, I just have one Q regarding the RNG but otherwise LGTM!
I ran the following empirical experiments:

# %%
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X, y = make_regression(
n_samples=1000, n_features=200, n_informative=100, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
# %%
param_grid = {
"learning_rate": [0.01, 0.03, 0.1, 0.3, 1.],
"max_leaf_nodes": [7, 15, 31, 63, 127],
"min_samples_leaf": [5, 10, 20, 50, 100, 200],
"max_features": [0.5, 0.8, 1.0],
"l2_regularization": [0.0, 0.1, 0.3, 1.0],
"max_bins": [63, 127, 255],
}
model = HistGradientBoostingRegressor(early_stopping=True, max_iter=1_000, random_state=0)
search_cv = RandomizedSearchCV(
model,
param_grid,
n_iter=500,
cv=5,
scoring="neg_mean_squared_error",
n_jobs=8,
random_state=0,
verbose=10,
).fit(X_train, y_train)
results_df = pd.DataFrame(search_cv.cv_results_)
results_df.sort_values(by="rank_test_score").head()
# %%
for _, record in results_df.sort_values(by="rank_test_score").head(15).iterrows():
    model.set_params(**record["params"]).fit(X_train, y_train)
    test_mse = round(mean_squared_error(y_test, model.predict(X_test)), 1)
    cv_mse = round(-record["mean_test_score"], 1)
    print(f"Test MSE: {test_mse}, CV MSE: {cv_mse}, {record['params']}")

In this setting (large number of uninformative numerical features), I would have expected models with max_features < 1.0 to come out on top more systematically.
It's not too bad either, but still. Maybe this is because of the shared RNG instance (#27139 (comment))? One way to confirm would be to run xgboost on a similar grid and compare the results.
@NicolasHug @ogrisel Thanks for your thorough and fruitful reviews!
Now I get: all the CV MSEs are slightly better, and most of the test MSEs, too.
Thanks for the updates @lorentzenchr. #27139 (comment) does not seem to show a significant improvement (but no degradation either). Maybe this dataset is not suitable to show the advantage of feature subsampling. Do you know cases where feature subsampling is a must-have? Or maybe it's most useful when combined with sample-wise subsampling?
Code LGTM, thank you @lorentzenchr !
(I'll let you guys merge, as I haven't followed the benchmarking efforts)
I think it does. Observe that on the majority of splits, the best max_features is smaller than 1.
Real datasets might provide stronger cases for feature subsampling, but I don't know a specific one. The literature or the xgboost/lightgbm docs might be a place to look.
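One way to quantify the observation above from the search results; a sketch reusing the results_df defined in the benchmark snippet earlier in this thread, not something that was actually run here.

top15 = results_df.sort_values(by="rank_test_score").head(15)
frac_subsampled = (top15["param_max_features"].astype(float) < 1.0).mean()
print(f"{frac_subsampled:.0%} of the 15 best-ranked configurations use max_features < 1.0")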
I conducted some more experiments to compare against xgboost with a similar grid of hyperparameters. Top 15 configurations for HistGradientBoostingRegressor (this PR):
Test MSE: 186982.8, CV MSE: 214961.2, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 192191.4, CV MSE: 218726.1, {'min_samples_leaf': 50, 'max_leaf_nodes': 31, 'max_features': 0.5, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 175064.7, CV MSE: 221986.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 201463.8, CV MSE: 222520.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 0.0}
Test MSE: 184949.7, CV MSE: 226015.0, {'min_samples_leaf': 50, 'max_leaf_nodes': 127, 'max_features': 0.5, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 0.3}
Test MSE: 182894.5, CV MSE: 226272.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 31, 'max_features': 0.5, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.3}
Test MSE: 177815.7, CV MSE: 227391.2, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 175611.1, CV MSE: 227881.7, {'min_samples_leaf': 50, 'max_leaf_nodes': 127, 'max_features': 1.0, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 175205.2, CV MSE: 227952.1, {'min_samples_leaf': 50, 'max_leaf_nodes': 127, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 174255.5, CV MSE: 232310.7, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 177208.5, CV MSE: 232457.4, {'min_samples_leaf': 50, 'max_leaf_nodes': 63, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 178335.9, CV MSE: 233305.6, {'min_samples_leaf': 50, 'max_leaf_nodes': 63, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 0.3}
Test MSE: 184553.5, CV MSE: 234482.9, {'min_samples_leaf': 100, 'max_leaf_nodes': 31, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 226905.4, CV MSE: 235177.1, {'min_samples_leaf': 100, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.1}
Test MSE: 226896.0, CV MSE: 235245.3, {'min_samples_leaf': 100, 'max_leaf_nodes': 63, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.0}

Top 15 configurations for XGBRegressor:

Test MSE: 205183.5, CV MSE: 216478.7, {'reg_lambda': 0.3, 'min_child_weight': 100, 'max_leaves': 63, 'max_bin': 255, 'learning_rate': 0.3, 'colsample_bynode': 0.8}
Test MSE: 182059.2, CV MSE: 227485.9, {'reg_lambda': 0.0, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 127, 'learning_rate': 0.1, 'colsample_bynode': 0.8}
Test MSE: 213369.6, CV MSE: 227864.4, {'reg_lambda': 1.0, 'min_child_weight': 50, 'max_leaves': 15, 'max_bin': 255, 'learning_rate': 0.1, 'colsample_bynode': 0.8}
Test MSE: 163838.9, CV MSE: 231233.9, {'reg_lambda': 1.0, 'min_child_weight': 100, 'max_leaves': 15, 'max_bin': 127, 'learning_rate': 0.3, 'colsample_bynode': 0.8}
Test MSE: 249533.1, CV MSE: 231305.5, {'reg_lambda': 1.0, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 255, 'learning_rate': 0.3, 'colsample_bynode': 0.8}
Test MSE: 188021.9, CV MSE: 231744.2, {'reg_lambda': 0.0, 'min_child_weight': 50, 'max_leaves': 127, 'max_bin': 63, 'learning_rate': 0.1, 'colsample_bynode': 1.0}
Test MSE: 177587.6, CV MSE: 232212.3, {'reg_lambda': 0.0, 'min_child_weight': 100, 'max_leaves': 63, 'max_bin': 63, 'learning_rate': 0.3, 'colsample_bynode': 1.0}
Test MSE: 204435.0, CV MSE: 233026.8, {'reg_lambda': 0.0, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 255, 'learning_rate': 0.03, 'colsample_bynode': 0.8}
Test MSE: 198921.3, CV MSE: 233539.9, {'reg_lambda': 0.1, 'min_child_weight': 20, 'max_leaves': 7, 'max_bin': 255, 'learning_rate': 0.1, 'colsample_bynode': 0.8}
Test MSE: 177591.7, CV MSE: 234108.5, {'reg_lambda': 0.1, 'min_child_weight': 100, 'max_leaves': 15, 'max_bin': 63, 'learning_rate': 0.3, 'colsample_bynode': 1.0}
Test MSE: 184331.6, CV MSE: 234612.1, {'reg_lambda': 0.3, 'min_child_weight': 50, 'max_leaves': 15, 'max_bin': 255, 'learning_rate': 0.1, 'colsample_bynode': 1.0}
Test MSE: 170036.9, CV MSE: 237298.4, {'reg_lambda': 0.1, 'min_child_weight': 50, 'max_leaves': 7, 'max_bin': 255, 'learning_rate': 0.03, 'colsample_bynode': 0.5}
Test MSE: 220638.8, CV MSE: 237298.4, {'reg_lambda': 0.1, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 255, 'learning_rate': 0.03, 'colsample_bynode': 0.5}
Test MSE: 202371.1, CV MSE: 238190.6, {'reg_lambda': 1.0, 'min_child_weight': 50, 'max_leaves': 63, 'max_bin': 255, 'learning_rate': 0.3, 'colsample_bynode': 1.0}
Test MSE: 201657.6, CV MSE: 238680.6, {'reg_lambda': 1.0, 'min_child_weight': 100, 'max_leaves': 7, 'max_bin': 127, 'learning_rate': 0.3, 'colsample_bynode': 1.0}

The benchmark code is ugly because I cannot get xgboost's early stopping to stop printing to stdout:

# %%
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
X, y = make_regression(
n_samples=1000, n_features=200, n_informative=100, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
# %%
param_grid = {
"learning_rate": [0.01, 0.03, 0.1, 0.3, 1.0],
"max_leaf_nodes": [7, 15, 31, 63, 127],
"min_samples_leaf": [5, 10, 20, 50, 100, 200],
"max_features": [0.5, 0.8, 1.0],
"l2_regularization": [0.0, 0.1, 0.3, 1.0],
"max_bins": [63, 127, 255],
}
model = HistGradientBoostingRegressor(
early_stopping=True,
n_iter_no_change=5,
max_iter=1_000,
validation_fraction=0.2,
random_state=0,
)
search_cv = RandomizedSearchCV(
model,
param_grid,
n_iter=500,
cv=5,
scoring="neg_mean_squared_error",
n_jobs=8,
random_state=0,
verbose=10,
).fit(X_train, y_train)
results_df = pd.DataFrame(search_cv.cv_results_)
results_df.sort_values(by="rank_test_score").head()
# %%
# Let's reuse the same parameter grid but adapted for XGBoost:
param_grid = {
"learning_rate": [0.01, 0.03, 0.1, 0.3, 1.0],
"max_leaves": [7, 15, 31, 63, 127],
"min_child_weight": [5, 10, 20, 50, 100, 200],
"colsample_bynode": [0.5, 0.8, 1.0],
"reg_lambda": [0.0, 0.1, 0.3, 1.0],
"max_bin": [63, 127, 255],
}
xgb_model = XGBRegressor(
tree_method="hist",
n_estimators=1_000,
early_stopping_rounds=5,
random_state=0,
verbosity=0,
)
# Do the validation split ahead of time:
X_train_subset, X_val, y_train_subset, y_val = train_test_split(
X_train, y_train, test_size=0.2, random_state=0
)
xgb_search_cv = RandomizedSearchCV(
xgb_model,
param_grid,
n_iter=500,
cv=5,
scoring="neg_mean_squared_error",
n_jobs=8,
random_state=0,
verbose=10,
).fit(X_train_subset, y_train_subset, eval_set=[(X_val, y_val)])
xgb_results_df = pd.DataFrame(xgb_search_cv.cv_results_)
xgb_results_df.sort_values(by="rank_test_score").head()
# %%
for _, record in results_df.sort_values(by="rank_test_score").head(15).iterrows():
    model.set_params(**record["params"]).fit(X_train, y_train)
    test_mse = round(mean_squared_error(y_test, model.predict(X_test)), 1)
    cv_mse = round(-record["mean_test_score"], 1)
    print(f"Test MSE: {test_mse}, CV MSE: {cv_mse}, {record['params']}")

Anyway, the main point is that for this dataset, xgboost's implementation of feature subsampling does not yield significantly better results, and both libraries have subsampled models among the top performers. That, plus the updated test checking that it is not trivially passing, makes me confident in this PR.
Just for expectation management, I do not intend to run any benchmark.
I may have a look at rerunning the benchmarks; this looks like a good opportunity for me to learn about interesting things.
Co-authored-by: Olivier Grisel <[email protected]>
By the way, I have been told that the latest version of the paper is not the arxiv one but the one on HAL: https://hal.science/hal-03723551v2/file/Tabular_NeurIPS2022%20%2822%29.pdf

Updated image from the HAL version:
Reference Issues/PRs
Solves #16062.
What does this implement/fix? Explain your changes.
This PR adds a max_features parameter (initially proposed as colsample_bynode) to HistGradientBoostingRegressor and HistGradientBoostingClassifier. With this parameter, one can specify the proportion of features that are randomly subsampled at each split/node. The name colsample_bynode is the same as in XGBoost and LightGBM.

Any other comments?
Not yet.
TODO