
ENH add feature subsampling per split for HGBT #27139


Merged: 19 commits into scikit-learn:main on Nov 14, 2023

Conversation

@lorentzenchr (Member) commented Aug 23, 2023

Reference Issues/PRs

Solves #16062.

What does this implement/fix? Explain your changes.

This PR adds a ~~colsample_bynode~~ max_features parameter to HistGradientBoostingRegressor and HistGradientBoostingClassifier. With this parameter, one can specify the proportion of features subsampled at each split/node.

The name colsample_bynode is the same in XGBoost and LightGBM.
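For illustration, a minimal usage sketch with the parameter name as finally merged (the dataset and value are arbitrary):

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=20, random_state=0)
# Consider a random 70% subset of the features at each split.
model = HistGradientBoostingRegressor(max_features=0.7, random_state=0)
model.fit(X, y)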

Any other comments?

Not yet.

TODO

  • make it work
  • tests
  • example/documentation
  • benchmark script (higgs)


@ogrisel (Member) commented Aug 23, 2023

> The name colsample_bynode is the same in XGBoost and LightGBM.

This will make the HistGradientBoostingClassifier/Regressor parametrization diverge further from the other tree ensemble methods. Wouldn't max_features work? Alternatively, maybe max_features_by_node?

We rarely use the name "column" in the scikit-learn API but rather "feature".

EDIT: I now see that this parameter is only defined in terms of a ratio (float) and does not allow an absolute integer number of features...

I suppose this is to avoid the ambiguity of max_features=1 vs max_features=1.0. In practice, max_features=1 has interesting properties in terms of estimating mutual information between features and the target using mean decrease in impurity (loss) in an RF.

This result might generalize to GBDT, in which case it could be interesting to have a simple hyper-parameter setting to do it with scikit-learn.
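As a concrete sketch of that RF setting (standard scikit-learn API, values arbitrary):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=3, random_state=0)
# max_features=1: a single, randomly chosen candidate feature per split.
rf = RandomForestRegressor(n_estimators=300, max_features=1, random_state=0).fit(X, y)
# MDI feature importances; per the result alluded to above, these relate to
# the mutual information between each feature and the target.
print(rf.feature_importances_.round(3))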

@lorentzenchr (Member Author) commented Aug 23, 2023

I intentionally named it the same as in LightGBM and XGBoost. I personally don't find max_features a good name; something like subsample_features_per_split is much clearer.

I also intentionally allowed only fractions as input, because with max_features=1 you could use stumps (tree depth = 1) instead to get an equivalent result.

BTW, results for random forests usually do not carry over to GBTs: RFs do better with overfitting trees, GBTs with weak learners.
In this particular case, we have https://arxiv.org/abs/2207.14490.

@lorentzenchr (Member Author) commented Aug 23, 2023

API comparison

| HistGradientBoostingRegressor | GradientBoostingRegressor | Same |
| --- | --- | --- |
| loss | loss | ✔ |
| quantile | alpha | ❌ |
| learning_rate | learning_rate | ✔ |
| max_iter | n_estimators | ❌ |
| max_leaf_nodes | max_leaf_nodes | ✔ |
| max_depth | max_depth | ✔ |
| min_samples_leaf | min_samples_leaf | ✔ |
| l2_regularization | | |
| max_bins | ⛔ (nonsense) | |
| categorical_features | | |
| monotonic_cst | | |
| interaction_cst | | |
| warm_start | warm_start | ✔ |
| early_stopping | | |
| scoring | | |
| validation_fraction | validation_fraction | ✔ |
| n_iter_no_change | n_iter_no_change | ✔ |
| tol | tol | ✔ |
| verbose | verbose | ✔ |
| random_state | random_state | ✔ |
| | subsample | |
| ⛔ (nonsense) | criterion | |
| | min_samples_split | |
| | min_weight_fraction_leaf | |
| | min_impurity_decrease | |
| | init | |
| ⛔ (this PR) | max_features | |
| | ccp_alpha | |

Indeed, only the quantile/alpha and max_iter/n_estimators parameter names deviate.

@ogrisel (Member) commented Aug 23, 2023

> something like subsample_features_per_split is much clearer.

Indeed.

> I also intentionally allowed only fractions as input, because with max_features=1 you could use stumps (tree depth = 1) instead to get an equivalent result.

Ensembles of decision stumps cannot model feature interactions, while ensembles of deeper trees with max_features=1 / colsample_bynode=1.0 / X_train.shape[1] can, even if the feature selection (the choice of axis to partition over at each node) is totally random. Hence the inductive bias would be very different.

EDIT: in particular, decision stumps would fail on an XOR task while max_features=1 would not, given enough trees.
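To sketch this point concretely, a small illustrative experiment (my construction, not from the PR) with the pre-existing GradientBoostingClassifier, which already accepts an integer max_features:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stumps yield a model that is additive in single features: no XOR.
stumps = GradientBoostingClassifier(max_depth=1, n_estimators=300, random_state=0)
# Deeper trees with one random candidate feature per split can still combine
# both features across levels and model the interaction.
deep = GradientBoostingClassifier(max_depth=3, max_features=1, n_estimators=300, random_state=0)
for name, clf in [("stumps", stumps), ("depth 3, max_features=1", deep)]:
    print(name, clf.fit(X_tr, y_tr).score(X_te, y_te))
# Expected: near-chance accuracy for the stumps, near-perfect for the deeper trees.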

@ogrisel (Member) commented Aug 23, 2023

> In this particular case, we have https://arxiv.org/abs/2207.14490.

Note that this is about local (per-observation) explanations (decomposition of the prediction function, independently of the true y), e.g. SHAP values.

Meanwhile, https://arxiv.org/abs/2111.02218 is about global explanations (aggregated over the dataset/distribution): SAGE values, a decomposition of the per-feature contributions to the aggregate loss, which can be shown to estimate the mutual information between features and the target under certain conditions.

Both SHAP and SAGE values are Shapley values (per-feature decompositions), but not of the same quantity.

The feature importances (MDI) of RFs with max_features=1 are therefore computationally cheap estimates of MI(Y; X_j). Computing SAGE values is typically much more computationally expensive than fitting a single, big enough RF.

But I agree that the sequential nature of the fit of GBDTs might prevent them from showing the same property as RFs.

@lorentzenchr (Member Author)

> in particular, decision stumps would fail on an XOR task while max_features=1 would not, given enough trees.

You're right. Note that this can be achieved in this PR with a tiny subsample fraction (we could even allow 0), because the minimum number of features is set to 1. Let's stop here, as this is a rather academic discussion; in real use cases, the fraction is likely closer to 1 than to 0.

I also checked the run time with the Higgs benchmark; there is no time penalty.
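A rough way to sanity-check the overhead on synthetic data (a stand-in sketch, not the actual Higgs benchmark script):

from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# Synthetic stand-in; the Higgs dataset has 28 features.
X, y = make_classification(n_samples=100_000, n_features=28, random_state=0)
for mf in (1.0, 0.5):
    clf = HistGradientBoostingClassifier(max_features=mf, random_state=0)
    tic = perf_counter()
    clf.fit(X, y)
    print(f"max_features={mf}: fit in {perf_counter() - tic:.1f}s")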

@scikit-learn/core-devs Opinions on naming this parameter are very welcome.

@betatim (Member) commented Aug 28, 2023

Thinking about the name, I like max_features_per_split. I prefer per_split over per_node because somehow "node" feels weird to me. I'm +-0 on only allowing fractions, but if we allow only fractions then I dislike max_features, because it reuses a name people already know from other estimators in scikit-learn, where integers > 1 are allowed. However, I also don't like col, columns, etc., because we don't use them anywhere else in scikit-learn.

feature_sampling_fraction_per_split is super long and verbose; it makes me feel like I took a wrong turn and ended up in a Java library. Maybe feature_sampling_fraction is better? And if there is no global sampling, then it is clear that this refers to the "per split" sampling?

Overall I'm not a huge fan of any of them :-/

@ogrisel (Member) commented Aug 28, 2023

If we are to use a different parameter name than in GradientBoostingClassifier (that is, not reuse max_features), then I too would go for per_split rather than per_node.

Not sure how to convey the "fraction"/"ratio"/"relative" vs absolute integer meaning if we disallow integers.

@lorentzenchr (Member Author)

Based on the discussion in the last dev meeting, I'll change to max_features for consistency with the old GBT.

@ogrisel (Member) commented Sep 11, 2023

@GaelVaroquaux maybe the lack of feature subsampling in the hyperparameter search grid could explain some of the discrepancy between scikit-learn and xgboost in this diagram:

[figure: benchmark comparing tuned tree-based models, including scikit-learn's HistGradientBoosting and XGBoost, from the paper below]

source: https://arxiv.org/abs/2207.08815

According to the appendix, the xgboost parameter grid indeed includes both sample and feature subsampling. Based on appendix A.6, the row-wise subsample parameter might be even more important than the colsample_by* parameters, but both kinds seem to contribute.

@ogrisel (Member) left a review comment:

I am wondering if we should not introduce this parameter under the name max_features_fraction.

This would make it possible to deprecate passing float values to max_features in the other models, avoiding the usability trap of max_features=1 vs max_features=1.0 while keeping the meaning of that parameter consistent across all our tree implementations in scikit-learn.
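For concreteness, the trap in question is visible in the existing tree models (standard scikit-learn semantics, not code from this PR):

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
for mf in (1, 1.0):
    tree = DecisionTreeRegressor(max_features=mf, random_state=0).fit(X, y)
    # max_features_ holds the resolved number of candidate features per split:
    # the integer 1 means one feature, the float 1.0 means all ten.
    print(f"max_features={mf!r} -> {tree.max_features_} candidate features")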

Whether or not we decide to also add an absolute, integer-valued max_features parameter to HGBDT can be discussed later (if ever), but that would be an orthogonal decision.

Also, I am wondering: how should feature subsampling interact with interaction_cst? Shall we subsample among the allowed features once the constraint is enforced? Or subsample before taking the constraint into consideration, at the risk of censoring splits quite heavily?

In any case, we need a dedicated test to check that the interaction_cst are respected even when feature subsampling is enabled. And maybe clarify in the docstring what happens when both hyperparameters are active together.

@lorentzenchr (Member Author)

Concerning the naming of this new feature, I see 2 options:

  • max_features in order to be consistent with our other tree based models
  • colsample_bynode (float in range (0,1]) in order to be consistent with LightGBM and XGBoost

I would not add another name.

@lorentzenchr (Member Author)

If I understand the logic in https://github.com/microsoft/LightGBM/blob/master/src/treelearner/col_sampler.hpp correctly, the feature subsampling is performed after restricting to the features allowed by the interaction constraints.
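In pseudocode, that ordering would look roughly like this (a hypothetical sketch, not actual LightGBM or scikit-learn code; whether the fraction should apply to all features or only to the allowed ones is exactly the open question above):

import numpy as np

def candidate_features(allowed_by_interaction_cst, fraction, rng):
    # Step 1: restrict to the features allowed by the interaction constraints.
    allowed = np.asarray(sorted(allowed_by_interaction_cst))
    # Step 2: subsample a fraction of those features, keeping at least one.
    n_sampled = max(1, int(np.ceil(fraction * len(allowed))))
    return rng.choice(allowed, size=n_sampled, replace=False)

rng = np.random.default_rng(0)
print(candidate_features({0, 2, 3, 7}, fraction=0.5, rng=rng))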

@ogrisel (Member) commented Sep 12, 2023

Then I would vote for feature_sample_per_split or features_fraction_per_split.

@lorentzenchr (Member Author) commented Sep 12, 2023

@scikit-learn/core-devs
We need a name for this feature. Options are:

  1. max_features, the same as in our other tree-based models. The problem there is that they handle floats from 0 to 1 as fractions and integers as numbers of features, so 1.0 and 1 behave differently. In this PR, we could start with allowing only floats between 0 and 1.
  2. colsample_bynode, a float between 0 and 1; the same name as in XGBoost and LightGBM.
  3. feature_(sub)sample_per_split: 3a without, 3b with the prefix sub.
  4. ~~features_fraction_per_split~~ feature_fraction_per_split.

If we do not go with option 1 (max_features), then we can add the new name to our old tree-based models as well.

Edit: Based on

> Overall I wonder if we should broaden the discussion to "do we want to, and if yes, how do we do it" with respect to moving away from allowing integers for max_features.

I opened #27347.

@betatim (Member) commented Sep 12, 2023

I'd vote for 4 (but maybe lobby for changing to feature_fraction_per_split, removing the s from features).

Overall I wonder if we should broaden the discussion to "do we want to, and if yes, how do we do it" with respect to moving away from allowing integers for max_features. Because if we implement that move, I think I'd vote differently, or at least my voting behaviour would depend on the plan.


While catching up with this thread I saw max_features_fraction first and thought it was quite a good solution.

@ogrisel (Member) left a review comment:

LGTM. It took me a while to understand the test, but I am now convinced it's correctly testing what we want.

It would be great to add an example, but I am not sure how to do so; in any case, I don't think it should be a requirement to merge this PR.

I plan to conduct a large enough hyper-parameter search to check whether it would select this new option more often than not among the highest-performing models.

@ogrisel (Member) commented Nov 13, 2023

Related to this, the paper https://jmlr.org/papers/v20/18-444.html finds that sampling features by split is indeed expected to improve accuracy on most datasets. They even recommend a default value of ~0.6 (although this default is only "optimal" in conjunction with changes made to the other default values). I would rather not change the default value, to minimize disruption, but I found that interesting.

@NicolasHug (Member) left a review comment:

Quick review from me; I just have one question regarding the RNG, but otherwise LGTM!

@ogrisel (Member) commented Nov 13, 2023

I ran the following empirical experiments:

# %%
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


X, y = make_regression(
    n_samples=1000, n_features=200, n_informative=100, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
# %%
param_grid = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3, 1.],
    "max_leaf_nodes": [7, 15, 31, 63, 127],
    "min_samples_leaf": [5, 10, 20, 50, 100, 200],
    "max_features": [0.5, 0.8, 1.0],
    "l2_regularization": [0.0, 0.1, 0.3, 1.0],
    "max_bins": [63, 127, 255],
}
model = HistGradientBoostingRegressor(early_stopping=True, max_iter=1_000, random_state=0)
search_cv = RandomizedSearchCV(
    model,
    param_grid,
    n_iter=500,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=8,
    random_state=0,
    verbose=10,
).fit(X_train, y_train)
results_df = pd.DataFrame(search_cv.cv_results_)
results_df.sort_values(by="rank_test_score").head()

# %%
for _, record in results_df.sort_values(by="rank_test_score").head(15).iterrows():
    model.set_params(**record["params"]).fit(X_train, y_train)
    test_mse = round(mean_squared_error(y_test, model.predict(X_test)), 1)
    cv_mse = round(-record["mean_test_score"], 1)
    print(f"Test MSE: {test_mse}, CV MSE: {cv_mse}, {record['params']}")

In this setting (a large number of uninformative numerical features), I would have expected models with max_features < 1.0 to dominate. However, this is not significantly the case. Here is the output on my machine:

Test MSE: 166439.4, CV MSE: 195540.1, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 165737.7, CV MSE: 197619.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 164282.0, CV MSE: 197905.0, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 166514.4, CV MSE: 198770.6, {'min_samples_leaf': 50, 'max_leaf_nodes': 63, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 194007.6, CV MSE: 201834.2, {'min_samples_leaf': 100, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 154372.1, CV MSE: 203111.7, {'min_samples_leaf': 100, 'max_leaf_nodes': 63, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 165784.7, CV MSE: 203437.7, {'min_samples_leaf': 50, 'max_leaf_nodes': 31, 'max_features': 1.0, 'max_bins': 255, 'learning_rate': 0.03, 'l2_regularization': 0.1}
Test MSE: 201307.6, CV MSE: 205819.0, {'min_samples_leaf': 20, 'max_leaf_nodes': 7, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.3}
Test MSE: 162051.9, CV MSE: 205886.5, {'min_samples_leaf': 50, 'max_leaf_nodes': 31, 'max_features': 0.8, 'max_bins': 127, 'learning_rate': 0.03, 'l2_regularization': 0.0}
Test MSE: 179643.1, CV MSE: 205986.4, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 163738.1, CV MSE: 206625.1, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.03, 'l2_regularization': 1.0}
Test MSE: 183112.7, CV MSE: 206675.5, {'min_samples_leaf': 100, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 127, 'learning_rate': 0.3, 'l2_regularization': 0.0}
Test MSE: 163586.6, CV MSE: 208135.2, {'min_samples_leaf': 50, 'max_leaf_nodes': 63, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 183265.1, CV MSE: 208694.6, {'min_samples_leaf': 20, 'max_leaf_nodes': 7, 'max_features': 1.0, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.3}
Test MSE: 175477.5, CV MSE: 209630.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 31, 'max_features': 0.5, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 1.0}

It's not too bad either, but still. Maybe this is because of the shared RNG instance (#27139 (comment))?

One way to confirm would be to run xgboost on a similar grid and compare the results.

@lorentzenchr (Member Author)

@NicolasHug @ogrisel Thanks for your thorough and fruitful reviews!

@lorentzenchr (Member Author)

Now I get

Test MSE: 166492.4, CV MSE: 193456.4, {'min_samples_leaf': 100, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 127, 'learning_rate': 0.3, 'l2_regularization': 0.0}
Test MSE: 152407.1, CV MSE: 196445.8, {'min_samples_leaf': 100, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.3}
Test MSE: 183248.0, CV MSE: 197494.9, {'min_samples_leaf': 100, 'max_leaf_nodes': 63, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.0}
Test MSE: 164282.0, CV MSE: 197905.0, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 182088.7, CV MSE: 198641.8, {'min_samples_leaf': 100, 'max_leaf_nodes': 63, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 0.1}
Test MSE: 163956.9, CV MSE: 200319.4, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 181571.7, CV MSE: 202212.2, {'min_samples_leaf': 100, 'max_leaf_nodes': 31, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 165784.7, CV MSE: 203437.7, {'min_samples_leaf': 50, 'max_leaf_nodes': 31, 'max_features': 1.0, 'max_bins': 255, 'learning_rate': 0.03, 'l2_regularization': 0.1}
Test MSE: 154229.4, CV MSE: 204226.0, {'min_samples_leaf': 100, 'max_leaf_nodes': 7, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.0}
Test MSE: 154229.4, CV MSE: 204226.0, {'min_samples_leaf': 100, 'max_leaf_nodes': 63, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.0}
Test MSE: 154226.0, CV MSE: 204315.6, {'min_samples_leaf': 100, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.1}
Test MSE: 159761.1, CV MSE: 204401.8, {'min_samples_leaf': 100, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 160218.8, CV MSE: 205853.8, {'min_samples_leaf': 100, 'max_leaf_nodes': 127, 'max_features': 0.8, 'max_bins': 127, 'learning_rate': 0.3, 'l2_regularization': 0.1}
Test MSE: 168158.0, CV MSE: 206666.1, {'min_samples_leaf': 100, 'max_leaf_nodes': 63, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 171398.6, CV MSE: 207886.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 63, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.1, 'l2_regularization': 1.0}

All the CV MSEs are slightly better, and most of the test MSEs, too.

| max_features | @ogrisel | @lorentzenchr |
| --- | --- | --- |
| 0.5 | 6 | 5 |
| 0.8 | 6 | 6 |
| 1.0 | 3 | 2 |
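One plausible way to produce such a tally from the search results (a sketch assuming the results_df from the benchmark script above):

top = results_df.sort_values("rank_test_score").head(15)
print(top["param_max_features"].value_counts())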

@ogrisel (Member) commented Nov 14, 2023

Thanks for the updates @lorentzenchr. #27139 (comment) does not seem to show a significant improvement (but no degradation either). Maybe this dataset is not suitable to show the advantage of feature subsampling.

Do you know cases where feature subsampling is a must-have? Or maybe it's most useful when combined with sample-wise subsampling?

@NicolasHug (Member) left a review comment:

Code LGTM, thank you @lorentzenchr !

(I'll let you guys merge, as I haven't followed the benchmarking efforts)

@lorentzenchr (Member Author)

> #27139 (comment) does not seem to show a significant improvement (but no degradation either).

I think it does. Observe that on the majority of splits, the best max_features is smaller than 1.
Also, those numbers do not show the improvement of feature subsampling vs no subsampling.

> Maybe this dataset is not suitable to show the advantage of feature subsampling.

Real datasets might provide stronger cases for feature subsampling, but I don't know a specific one. The literature or the xgboost/lightgbm docs might be places to look.

@GaelVaroquaux (Member)

> Real datasets might provide stronger cases for feature subsampling, but I don't know a specific one. The literature or the xgboost/lightgbm docs might be places to look.

Our large benchmark of a year ago can be useful here, in particular the tables of appendix A9:

[figures: tables from appendix A9 of the benchmark paper]

@ogrisel (Member) commented Nov 14, 2023

I conducted some more experiments to compare against xgboost with a similar grid of hyperparameters:

  • scikit-learn with this PR
Test MSE: 186982.8, CV MSE: 214961.2, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 192191.4, CV MSE: 218726.1, {'min_samples_leaf': 50, 'max_leaf_nodes': 31, 'max_features': 0.5, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 175064.7, CV MSE: 221986.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 201463.8, CV MSE: 222520.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 0.0}
Test MSE: 184949.7, CV MSE: 226015.0, {'min_samples_leaf': 50, 'max_leaf_nodes': 127, 'max_features': 0.5, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 0.3}
Test MSE: 182894.5, CV MSE: 226272.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 31, 'max_features': 0.5, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.3}
Test MSE: 177815.7, CV MSE: 227391.2, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 175611.1, CV MSE: 227881.7, {'min_samples_leaf': 50, 'max_leaf_nodes': 127, 'max_features': 1.0, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 175205.2, CV MSE: 227952.1, {'min_samples_leaf': 50, 'max_leaf_nodes': 127, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 174255.5, CV MSE: 232310.7, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 177208.5, CV MSE: 232457.4, {'min_samples_leaf': 50, 'max_leaf_nodes': 63, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 178335.9, CV MSE: 233305.6, {'min_samples_leaf': 50, 'max_leaf_nodes': 63, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 0.3}
Test MSE: 184553.5, CV MSE: 234482.9, {'min_samples_leaf': 100, 'max_leaf_nodes': 31, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 226905.4, CV MSE: 235177.1, {'min_samples_leaf': 100, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.1}
Test MSE: 226896.0, CV MSE: 235245.3, {'min_samples_leaf': 100, 'max_leaf_nodes': 63, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.0}
  • xgboost
Test MSE: 205183.5, CV MSE: 216478.7, {'reg_lambda': 0.3, 'min_child_weight': 100, 'max_leaves': 63, 'max_bin': 255, 'learning_rate': 0.3, 'colsample_bynode': 0.8}
Test MSE: 182059.2, CV MSE: 227485.9, {'reg_lambda': 0.0, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 127, 'learning_rate': 0.1, 'colsample_bynode': 0.8}
Test MSE: 213369.6, CV MSE: 227864.4, {'reg_lambda': 1.0, 'min_child_weight': 50, 'max_leaves': 15, 'max_bin': 255, 'learning_rate': 0.1, 'colsample_bynode': 0.8}
Test MSE: 163838.9, CV MSE: 231233.9, {'reg_lambda': 1.0, 'min_child_weight': 100, 'max_leaves': 15, 'max_bin': 127, 'learning_rate': 0.3, 'colsample_bynode': 0.8}
Test MSE: 249533.1, CV MSE: 231305.5, {'reg_lambda': 1.0, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 255, 'learning_rate': 0.3, 'colsample_bynode': 0.8}
Test MSE: 188021.9, CV MSE: 231744.2, {'reg_lambda': 0.0, 'min_child_weight': 50, 'max_leaves': 127, 'max_bin': 63, 'learning_rate': 0.1, 'colsample_bynode': 1.0}
Test MSE: 177587.6, CV MSE: 232212.3, {'reg_lambda': 0.0, 'min_child_weight': 100, 'max_leaves': 63, 'max_bin': 63, 'learning_rate': 0.3, 'colsample_bynode': 1.0}
Test MSE: 204435.0, CV MSE: 233026.8, {'reg_lambda': 0.0, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 255, 'learning_rate': 0.03, 'colsample_bynode': 0.8}
Test MSE: 198921.3, CV MSE: 233539.9, {'reg_lambda': 0.1, 'min_child_weight': 20, 'max_leaves': 7, 'max_bin': 255, 'learning_rate': 0.1, 'colsample_bynode': 0.8}
Test MSE: 177591.7, CV MSE: 234108.5, {'reg_lambda': 0.1, 'min_child_weight': 100, 'max_leaves': 15, 'max_bin': 63, 'learning_rate': 0.3, 'colsample_bynode': 1.0}
Test MSE: 184331.6, CV MSE: 234612.1, {'reg_lambda': 0.3, 'min_child_weight': 50, 'max_leaves': 15, 'max_bin': 255, 'learning_rate': 0.1, 'colsample_bynode': 1.0}
Test MSE: 170036.9, CV MSE: 237298.4, {'reg_lambda': 0.1, 'min_child_weight': 50, 'max_leaves': 7, 'max_bin': 255, 'learning_rate': 0.03, 'colsample_bynode': 0.5}
Test MSE: 220638.8, CV MSE: 237298.4, {'reg_lambda': 0.1, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 255, 'learning_rate': 0.03, 'colsample_bynode': 0.5}
Test MSE: 202371.1, CV MSE: 238190.6, {'reg_lambda': 1.0, 'min_child_weight': 50, 'max_leaves': 63, 'max_bin': 255, 'learning_rate': 0.3, 'colsample_bynode': 1.0}
Test MSE: 201657.6, CV MSE: 238680.6, {'reg_lambda': 1.0, 'min_child_weight': 100, 'max_leaves': 7, 'max_bin': 127, 'learning_rate': 0.3, 'colsample_bynode': 1.0}

The benchmark code is ugly, because I cannot get xgboost's early stopping to stop printing to stdout:

# %%
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor


X, y = make_regression(
    n_samples=1000, n_features=200, n_informative=100, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)


# %%
param_grid = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3, 1.0],
    "max_leaf_nodes": [7, 15, 31, 63, 127],
    "min_samples_leaf": [5, 10, 20, 50, 100, 200],
    "max_features": [0.5, 0.8, 1.0],
    "l2_regularization": [0.0, 0.1, 0.3, 1.0],
    "max_bins": [63, 127, 255],
}

model = HistGradientBoostingRegressor(
    early_stopping=True,
    n_iter_no_change=5,
    max_iter=1_000,
    validation_fraction=0.2,
    random_state=0,
)
search_cv = RandomizedSearchCV(
    model,
    param_grid,
    n_iter=500,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=8,
    random_state=0,
    verbose=10,
).fit(X_train, y_train)
results_df = pd.DataFrame(search_cv.cv_results_)
results_df.sort_values(by="rank_test_score").head()

# %%

# Let's reuse the same parameter grid but adapted for XGBoost:
param_grid = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3, 1.0],
    "max_leaves": [7, 15, 31, 63, 127],
    "min_child_weight": [5, 10, 20, 50, 100, 200],
    "colsample_bynode": [0.5, 0.8, 1.0],
    "reg_lambda": [0.0, 0.1, 0.3, 1.0],
    "max_bin": [63, 127, 255],
}

xgb_model = XGBRegressor(
    tree_method="hist",
    n_estimators=1_000,
    early_stopping_rounds=5,
    random_state=0,
    verbosity=0,
)

# Do the validation split ahead of time:
X_train_subset, X_val, y_train_subset, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0
)
xgb_search_cv = RandomizedSearchCV(
    xgb_model,
    param_grid,
    n_iter=500,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=8,
    random_state=0,
    verbose=10,
).fit(X_train_subset, y_train_subset, eval_set=[(X_val, y_val)])
xgb_results_df = pd.DataFrame(xgb_search_cv.cv_results_)
xgb_results_df.sort_values(by="rank_test_score").head()

# %%
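# Report the top scikit-learn configurations; the xgboost table above was
# presumably produced with an analogous loop over xgb_results_df (not shown).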
for _, record in results_df.sort_values(by="rank_test_score").head(15).iterrows():
    model.set_params(**record["params"]).fit(X_train, y_train)
    test_mse = round(mean_squared_error(y_test, model.predict(X_test)), 1)
    cv_mse = round(-record["mean_test_score"], 1)
    print(f"Test MSE: {test_mse}, CV MSE: {cv_mse}, {record['params']}")

Anyway, the main point is that for this dataset, xgboost's implementation of feature subsampling does not yield significantly better results, and both libraries have subsampled models among the top performers.

That, plus the updated test checking that it is not trivially passing, makes me confident in this PR.

ogrisel enabled auto-merge (squash) November 14, 2023 17:17
@lorentzenchr (Member Author)

Just for expectation management, I do not intend to run any benchmark.

ogrisel merged commit fc11dea into scikit-learn:main on Nov 14, 2023
ogrisel deleted the hgbt_col_subsample branch November 14, 2023 21:34
@lesteve (Member) commented Nov 16, 2023

I may have a look at rerunning the benchmarks; this looks like a good opportunity for me to learn about interesting things.

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
@lesteve (Member) commented Nov 17, 2023

By the way, I have been told that the latest version of the paper is not the arxiv one but the one on HAL: https://hal.science/hal-03723551v2/file/Tabular_NeurIPS2022%20%2822%29.pdf

Updated figure from #27139 (comment):

[figure: updated version of the benchmark comparison from the paper]
