ENH add feature subsampling per split for HGBT #27139
Conversation
This will make the HistGradientBoostingClassifier/Regressor class parametrization more different from the other tree ensemble methods. Wouldn't a "feature"-based name be preferable? We rarely use the name "column" in the scikit-learn API but rather "feature". EDIT: I now see that this parameter is only defined in terms of a ratio (float) and does not allow an absolute integer number of features; I suppose this is to avoid the ambiguity of an integer 1 vs a float 1.0. This result might generalize to GBDTs, in which case it could be interesting to have a simple hyper-parameter setting to reproduce it with scikit-learn.
I intentionally named it the same as in LightGBM and XGBoost, which I personally find helpful. I also intentionally only allowed fractions as input, because with an integer "max_features=1" you could instead use stumps (tree depth = 1) to get an equivalent result. BTW, results for random forests usually do not carry over to GBTs: RFs are better with overfitting trees, GBTs with weak learners.
API comparison
Indeed, only the quantile/alpha and max_iter/n_estimators parameters deviate.
Indeed.
Ensembles of decision stumps cannot model feature interactions, while ensembles of deeper trees that consider a single feature per split still can. EDIT: in particular, decision stumps would fail on XOR tasks, while deeper trees with per-split feature subsampling would not.
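To make the stumps-vs-deeper-trees point concrete, here is a small sketch. It assumes a scikit-learn build that already includes the max_features parameter added by this PR; the exact accuracy numbers will depend on the random seed.

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int)

# Boosted stumps form an additive model in the two features, so they cannot
# represent the pure interaction of XOR and stay close to chance level.
stumps = HistGradientBoostingClassifier(max_depth=1, random_state=0).fit(X, y)

# Deeper trees that see only one random feature per split can still combine
# both features along a path, so they recover the interaction.
deep = HistGradientBoostingClassifier(
    max_depth=3, max_features=0.5, random_state=0
).fit(X, y)

print("stumps (training accuracy):", stumps.score(X, y))
print("deeper trees + feature subsampling (training accuracy):", deep.score(X, y))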
Note that this is about local (per observation) explanations (decomposition of the prediction function, independently of the true y), e.g. SHAP values, while https://arxiv.org/abs/2111.02218 is about global (aggregate over dataset/distribution) explanations (decomposition of the per-feature contribution to the aggregate loss): SAGE values, which can be shown to estimate the mutual information between feature and target variables under certain conditions. SHAP and SAGE values are both Shapley values (per-feature decompositions), but not of the same quantity. The feature importance (MDI) of RFs with a single randomly drawn feature per split can be related to mutual information in a similar way. But I agree that the sequential nature of the fit of GBDTs might prevent them from showing the same property as RFs.
You're right. Note that it can be achieved in this PR by a tiny subsample fraction (we could allow 0) because the minimum number of features is set to 1. Let's stop here as this is a more academic discussion; in real use cases, the fraction is likely closer to 1 than to 0. I also checked the run time with the Higgs benchmark and there is no time penalty. @scikit-learn/core-devs Opinions on naming this parameter are very welcome.
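A quick numeric illustration of the "minimum of one feature" point above; the exact rounding in the implementation may differ, this is just the general idea.

n_features = 200
for max_features in (0.001, 0.01, 0.5, 1.0):
    # At least one feature is always kept as a split candidate.
    n_candidates = max(1, int(max_features * n_features))
    print(f"max_features={max_features} -> {n_candidates} candidate feature(s) per split")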
Thinking about the name, I like:
Overall I'm not a huge fan of any of them :-/
If we are to use a different parameter name than in ..., I am not sure how to convey the "fraction"/"ratio"/"relative" vs. absolute integer meaning if we disallow integers.
Based on the discussion in the last dev meeting, I'll change to max_features.
@GaelVaroquaux maybe the lack of feature subsampling in the hyperparameter search grid could explain some of the discrepancy between scikit-learn and xgboost in this diagram (source: https://arxiv.org/abs/2207.08815). According to the appendix, the xgboost parameter grid indeed includes both sample and feature subsampling. Based on appendix A.6, feature subsampling might be even more important than the ...
I am wondering if we should not introduce this parameter under the name max_features_fraction. This would allow the possibility to deprecate passing float values to max_features in the other models, to avoid the usability trap of max_features=1 vs max_features=1.0, while keeping the meaning of that parameter consistent across all our tree implementations in scikit-learn.

Whether or not we decide to also add an absolute integer max_features parameter to HGBDT can be discussed later (if ever); that would be an orthogonal decision.
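As a side illustration of that usability trap with the existing API (nothing to do with this PR's code, just the current RandomForest behaviour):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Integer 1: consider a single feature per split.
rf_int = RandomForestClassifier(max_features=1, random_state=0).fit(X, y)
# Float 1.0: consider all 20 features per split.
rf_float = RandomForestClassifier(max_features=1.0, random_state=0).fit(X, y)

print(rf_int.estimators_[0].max_features_)    # 1
print(rf_float.estimators_[0].max_features_)  # 20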
Also, I am wondering: how should feature subsampling interact with interaction_cst? Shall we subsample among the allowed features once the constraint is enforced? Or subsample before taking the constraint into consideration, and then risk censoring splits quite heavily?

In any case, we need a dedicated test to check that the interaction_cst are respected even when feature subsampling is enabled, and maybe clarify in the docstring what happens when both hyperparameters are active together.
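One possible shape for such a test (a sketch under the assumption that the final parameter is named max_features; the actual test added to test_splitting.py may look quite different): with interaction_cst forbidding features 0 and 1 from interacting, the learned function must remain additive even when per-split feature subsampling is enabled, which can be checked with a second-order finite difference.

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(5000, 2))
y = X[:, 0] * X[:, 1]  # a pure interaction the constrained model cannot represent

model = HistGradientBoostingRegressor(
    interaction_cst=[{0}, {1}],  # features 0 and 1 may never interact
    max_features=0.5,            # new parameter: subsample features per split
    random_state=0,
).fit(X, y)

# For any additive function f, f(a, b) - f(a, d) - f(c, b) + f(c, d) == 0.
a, b, c, d = -0.5, -0.5, 0.5, 0.5
preds = model.predict(np.array([[a, b], [a, d], [c, b], [c, d]]))
second_diff = preds[0] - preds[1] - preds[2] + preds[3]
assert abs(second_diff) < 1e-6, second_diff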
Concerning the naming of this new feature, I see two options:
I would not add another name.
If I understand the logic in https://github.com/microsoft/LightGBM/blob/master/src/treelearner/col_sampler.hpp correctly, feature subsampling is performed after restricting to the features allowed by the interaction constraints.
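For illustration, a rough pseudocode of that ordering as I read it; this is an assumption about col_sampler.hpp, not scikit-learn's final implementation, and whether the fraction applies to all features or only to the allowed ones is a detail I am not certain about.

import numpy as np

def candidate_features(allowed_by_constraint, max_features, rng):
    # 1) restrict to the features allowed by the interaction constraints,
    # 2) only then draw the random per-split subset.
    allowed = sorted(allowed_by_constraint)
    n_keep = max(1, int(max_features * len(allowed)))
    return rng.choice(allowed, size=n_keep, replace=False)

rng = np.random.default_rng(0)
print(candidate_features({0, 2, 4, 6}, 0.5, rng))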
Then I would vote for ...
@scikit-learn/core-devs
If we do not go with option 1 (max_features), then we can add this new name to our old tree-based models. Edit: Based on the discussion, I opened #27347.
I'd vote for 4 (but maybe lobby for changing to ...). Overall I wonder if we should broaden the discussion to "do we want to, and if yes, how do we do it" with respect to moving away from allowing integers for max_features. While catching up with this thread I saw ...
LGTM. It took me a while to understand the test but I am now convinced it's correctly testing what we want.
It would be great to add an example but I am not sure how to do so. But I don't think it should be a requirement to merge this PR.
I plan to conduct a large enough hyper-parameter search to check whether it would select this new option more often than not among the highest performing models.
Related to this paper, https://jmlr.org/papers/v20/18-444.html: sampling features by split is indeed expected to improve accuracy on most datasets. They even recommend a default value of ~0.6 (although this default value is only "optimal" in conjunction with changes made to the other default values). I would rather not change the default value, to minimize disruption, but I found that interesting.
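For reference, opting in to that ~0.6 value would just be an explicit setting; a sketch assuming this PR's parameter, with the default left at 1.0 (no subsampling).

from sklearn.ensemble import HistGradientBoostingRegressor

# Follow the paper's suggestion explicitly instead of the default of 1.0.
model = HistGradientBoostingRegressor(max_features=0.6)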
Quick review from me, I just have one Q regarding the RNG but otherwise LGTM!
I ran the following empirical experiments:

# %%
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X, y = make_regression(
n_samples=1000, n_features=200, n_informative=100, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
# %%
param_grid = {
"learning_rate": [0.01, 0.03, 0.1, 0.3, 1.],
"max_leaf_nodes": [7, 15, 31, 63, 127],
"min_samples_leaf": [5, 10, 20, 50, 100, 200],
"max_features": [0.5, 0.8, 1.0],
"l2_regularization": [0.0, 0.1, 0.3, 1.0],
"max_bins": [63, 127, 255],
}
model = HistGradientBoostingRegressor(early_stopping=True, max_iter=1_000, random_state=0)
search_cv = RandomizedSearchCV(
model,
param_grid,
n_iter=500,
cv=5,
scoring="neg_mean_squared_error",
n_jobs=8,
random_state=0,
verbose=10,
).fit(X_train, y_train)
results_df = pd.DataFrame(search_cv.cv_results_)
results_df.sort_values(by="rank_test_score").head()
# %%
for _, record in results_df.sort_values(by="rank_test_score").head(15).iterrows():
    model.set_params(**record["params"]).fit(X_train, y_train)
    test_mse = round(mean_squared_error(y_test, model.predict(X_test)), 1)
    cv_mse = round(-record["mean_test_score"], 1)
    print(f"Test MSE: {test_mse}, CV MSE: {cv_mse}, {record['params']}")

In this setting (large number of uninformative numerical features), I would have expected models with max_features < 1.0 to come out on top more systematically.
It's not too bad either, but still. Maybe this is because of the shared RNG instance (#27139 (comment))? One way to confirm would be to run xgboost on a similar grid and compare the results.
@NicolasHug @ogrisel Thanks for your thorough and fruitful reviews!
Now I get: all the CV MSEs are slightly better, and most of the test MSEs, too.
Thanks for the updates @lorentzenchr. #27139 (comment) does not seem to show a significant improvement (but no degradation either). Maybe this dataset is not suitable to show the advantage of feature subsampling. Do you know cases where feature subsampling is a must-have? Or maybe it's most useful when combined with sample-wise subsampling?
Code LGTM, thank you @lorentzenchr !
(I'll let you guys merge, as I haven't followed the benchmarking efforts)
I think it does. Observe that on the majority of splits, the best max_features is smaller than 1.
Real datasets might provide stronger cases for feature subsampling, but I don't know a specific one. The literature or the xgboost/lightgbm docs might be a place to look.
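One way to quantify the observation above from the search results; a sketch reusing the results_df defined in the benchmark snippet earlier in this thread, not something that was actually run here.

top15 = results_df.sort_values(by="rank_test_score").head(15)
frac_subsampled = (top15["param_max_features"].astype(float) < 1.0).mean()
print(f"{frac_subsampled:.0%} of the 15 best-ranked configurations use max_features < 1.0")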
I conducted some more experiments to compare against xgboost with a similar grid of hyperparameters. Top 15 configurations for HistGradientBoostingRegressor (this PR):
Test MSE: 186982.8, CV MSE: 214961.2, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 192191.4, CV MSE: 218726.1, {'min_samples_leaf': 50, 'max_leaf_nodes': 31, 'max_features': 0.5, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 175064.7, CV MSE: 221986.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 201463.8, CV MSE: 222520.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 0.0}
Test MSE: 184949.7, CV MSE: 226015.0, {'min_samples_leaf': 50, 'max_leaf_nodes': 127, 'max_features': 0.5, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 0.3}
Test MSE: 182894.5, CV MSE: 226272.9, {'min_samples_leaf': 50, 'max_leaf_nodes': 31, 'max_features': 0.5, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.3}
Test MSE: 177815.7, CV MSE: 227391.2, {'min_samples_leaf': 50, 'max_leaf_nodes': 15, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 175611.1, CV MSE: 227881.7, {'min_samples_leaf': 50, 'max_leaf_nodes': 127, 'max_features': 1.0, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 175205.2, CV MSE: 227952.1, {'min_samples_leaf': 50, 'max_leaf_nodes': 127, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 0.1}
Test MSE: 174255.5, CV MSE: 232310.7, {'min_samples_leaf': 50, 'max_leaf_nodes': 7, 'max_features': 0.8, 'max_bins': 127, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 177208.5, CV MSE: 232457.4, {'min_samples_leaf': 50, 'max_leaf_nodes': 63, 'max_features': 0.8, 'max_bins': 255, 'learning_rate': 0.1, 'l2_regularization': 1.0}
Test MSE: 178335.9, CV MSE: 233305.6, {'min_samples_leaf': 50, 'max_leaf_nodes': 63, 'max_features': 1.0, 'max_bins': 63, 'learning_rate': 0.1, 'l2_regularization': 0.3}
Test MSE: 184553.5, CV MSE: 234482.9, {'min_samples_leaf': 100, 'max_leaf_nodes': 31, 'max_features': 0.8, 'max_bins': 63, 'learning_rate': 0.3, 'l2_regularization': 1.0}
Test MSE: 226905.4, CV MSE: 235177.1, {'min_samples_leaf': 100, 'max_leaf_nodes': 15, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.1}
Test MSE: 226896.0, CV MSE: 235245.3, {'min_samples_leaf': 100, 'max_leaf_nodes': 63, 'max_features': 0.5, 'max_bins': 255, 'learning_rate': 0.3, 'l2_regularization': 0.0}

Top 15 configurations for XGBRegressor:

Test MSE: 205183.5, CV MSE: 216478.7, {'reg_lambda': 0.3, 'min_child_weight': 100, 'max_leaves': 63, 'max_bin': 255, 'learning_rate': 0.3, 'colsample_bynode': 0.8}
Test MSE: 182059.2, CV MSE: 227485.9, {'reg_lambda': 0.0, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 127, 'learning_rate': 0.1, 'colsample_bynode': 0.8}
Test MSE: 213369.6, CV MSE: 227864.4, {'reg_lambda': 1.0, 'min_child_weight': 50, 'max_leaves': 15, 'max_bin': 255, 'learning_rate': 0.1, 'colsample_bynode': 0.8}
Test MSE: 163838.9, CV MSE: 231233.9, {'reg_lambda': 1.0, 'min_child_weight': 100, 'max_leaves': 15, 'max_bin': 127, 'learning_rate': 0.3, 'colsample_bynode': 0.8}
Test MSE: 249533.1, CV MSE: 231305.5, {'reg_lambda': 1.0, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 255, 'learning_rate': 0.3, 'colsample_bynode': 0.8}
Test MSE: 188021.9, CV MSE: 231744.2, {'reg_lambda': 0.0, 'min_child_weight': 50, 'max_leaves': 127, 'max_bin': 63, 'learning_rate': 0.1, 'colsample_bynode': 1.0}
Test MSE: 177587.6, CV MSE: 232212.3, {'reg_lambda': 0.0, 'min_child_weight': 100, 'max_leaves': 63, 'max_bin': 63, 'learning_rate': 0.3, 'colsample_bynode': 1.0}
Test MSE: 204435.0, CV MSE: 233026.8, {'reg_lambda': 0.0, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 255, 'learning_rate': 0.03, 'colsample_bynode': 0.8}
Test MSE: 198921.3, CV MSE: 233539.9, {'reg_lambda': 0.1, 'min_child_weight': 20, 'max_leaves': 7, 'max_bin': 255, 'learning_rate': 0.1, 'colsample_bynode': 0.8}
Test MSE: 177591.7, CV MSE: 234108.5, {'reg_lambda': 0.1, 'min_child_weight': 100, 'max_leaves': 15, 'max_bin': 63, 'learning_rate': 0.3, 'colsample_bynode': 1.0}
Test MSE: 184331.6, CV MSE: 234612.1, {'reg_lambda': 0.3, 'min_child_weight': 50, 'max_leaves': 15, 'max_bin': 255, 'learning_rate': 0.1, 'colsample_bynode': 1.0}
Test MSE: 170036.9, CV MSE: 237298.4, {'reg_lambda': 0.1, 'min_child_weight': 50, 'max_leaves': 7, 'max_bin': 255, 'learning_rate': 0.03, 'colsample_bynode': 0.5}
Test MSE: 220638.8, CV MSE: 237298.4, {'reg_lambda': 0.1, 'min_child_weight': 50, 'max_leaves': 31, 'max_bin': 255, 'learning_rate': 0.03, 'colsample_bynode': 0.5}
Test MSE: 202371.1, CV MSE: 238190.6, {'reg_lambda': 1.0, 'min_child_weight': 50, 'max_leaves': 63, 'max_bin': 255, 'learning_rate': 0.3, 'colsample_bynode': 1.0}
Test MSE: 201657.6, CV MSE: 238680.6, {'reg_lambda': 1.0, 'min_child_weight': 100, 'max_leaves': 7, 'max_bin': 127, 'learning_rate': 0.3, 'colsample_bynode': 1.0}

The benchmark code is ugly because I cannot get xgboost's early stopping to stop printing to stdout:

# %%
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
X, y = make_regression(
n_samples=1000, n_features=200, n_informative=100, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
# %%
param_grid = {
"learning_rate": [0.01, 0.03, 0.1, 0.3, 1.0],
"max_leaf_nodes": [7, 15, 31, 63, 127],
"min_samples_leaf": [5, 10, 20, 50, 100, 200],
"max_features": [0.5, 0.8, 1.0],
"l2_regularization": [0.0, 0.1, 0.3, 1.0],
"max_bins": [63, 127, 255],
}
model = HistGradientBoostingRegressor(
early_stopping=True,
n_iter_no_change=5,
max_iter=1_000,
validation_fraction=0.2,
random_state=0,
)
search_cv = RandomizedSearchCV(
model,
param_grid,
n_iter=500,
cv=5,
scoring="neg_mean_squared_error",
n_jobs=8,
random_state=0,
verbose=10,
).fit(X_train, y_train)
results_df = pd.DataFrame(search_cv.cv_results_)
results_df.sort_values(by="rank_test_score").head()
# %%
# Let's reuse the same parameter grid but adapted for XGBoost:
param_grid = {
"learning_rate": [0.01, 0.03, 0.1, 0.3, 1.0],
"max_leaves": [7, 15, 31, 63, 127],
"min_child_weight": [5, 10, 20, 50, 100, 200],
"colsample_bynode": [0.5, 0.8, 1.0],
"reg_lambda": [0.0, 0.1, 0.3, 1.0],
"max_bin": [63, 127, 255],
}
xgb_model = XGBRegressor(
tree_method="hist",
n_estimators=1_000,
early_stopping_rounds=5,
random_state=0,
verbosity=0,
)
# Do the validation split ahead of time:
X_train_subset, X_val, y_train_subset, y_val = train_test_split(
X_train, y_train, test_size=0.2, random_state=0
)
xgb_search_cv = RandomizedSearchCV(
xgb_model,
param_grid,
n_iter=500,
cv=5,
scoring="neg_mean_squared_error",
n_jobs=8,
random_state=0,
verbose=10,
).fit(X_train_subset, y_train_subset, eval_set=[(X_val, y_val)])
xgb_results_df = pd.DataFrame(xgb_search_cv.cv_results_)
xgb_results_df.sort_values(by="rank_test_score").head()
# %%
for _, record in results_df.sort_values(by="rank_test_score").head(15).iterrows():
    model.set_params(**record["params"]).fit(X_train, y_train)
    test_mse = round(mean_squared_error(y_test, model.predict(X_test)), 1)
    cv_mse = round(-record["mean_test_score"], 1)
    print(f"Test MSE: {test_mse}, CV MSE: {cv_mse}, {record['params']}")

Anyway, the main point is that for this dataset, xgboost's implementation of feature subsampling does not yield significantly better results, and both libraries have subsampled models among the top performers. That, plus the updated test checking that it is not trivially passing, makes me confident in this PR.
Just for expectation management, I do not intend to run any benchmark.
I may have a look at rerunning the benchmarks; this looks like a good opportunity for me to learn about interesting things.
Co-authored-by: Olivier Grisel <[email protected]>
By the way, I have been told that the latest version of the paper is not the arxiv one but the one on HAL: https://hal.science/hal-03723551v2/file/Tabular_NeurIPS2022%20%2822%29.pdf

Updated image from the HAL version:
Reference Issues/PRs
Solves #16062.
What does this implement/fix? Explain your changes.
This PR adds a max_features parameter (initially proposed as colsample_bynode) to HistGradientBoostingRegressor and HistGradientBoostingClassifier. With this parameter, one can specify the proportion of features that are randomly subsampled at each split/node. The name colsample_bynode is the same as in XGBoost and LightGBM.

Any other comments?
Not yet.
TODO