ENH add X_val and y_val to HGBT.fit #27124
Conversation
@ogrisel @thomasjpfan @adrinjalali You may be interested.
⚠️ Linting issues: this PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. You can see the details of the linting issues under the lint job of the CI.
I would also add an example where you pass the eval data in a pipeline.
Do you have a good existing example to add it to? I don't want a new example.
I don't know which one would be best, but we have a few where you could add this:
@adrinjalali I realize that we do not have a proper example of how to train the HGBT. I think the most appropriate place to add this PR's feature is #26991. If you already want it in existing examples, I would go with either Time-related feature engineering or Poisson regression and non-normal loss. BTW, the latter should be placed under real world examples, not linear models.
That seems reasonable, cc @ArturoAmorQ for the example.
For the time-related feature engineering example, I could add this feature. Because there it is used with a CV evaluation, it gets a bit complicated. I came up with:

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline


# HGBT with early_stopping="auto" only uses early stopping for n_samples > 10_000,
# so we set it manually.
class TimeSplittingHGBR(HistGradientBoostingRegressor):
    # For simplicity, we don't deal with sample_weight as it is not used here.
    def fit(self, X, y):
        verbose_original = self.verbose
        self.verbose = max(0, verbose_original - 1)
        # We know that the data is ordered and use the same time gap of 48
        # hours as in the CV splitter.
        n_split = int((1 - self.validation_fraction) * X.shape[0])
        gap = max(min(48, X.shape[0] - n_split - 1), 0)
        X_train, y_train = X[:n_split], y[:n_split]
        X_val, y_val = X[n_split + gap:], y[n_split + gap:]
        # ==== IMPORTANT ============
        # The first call to fit determines the number of boosting rounds via
        # early stopping.
        super().fit(X_train, y_train, X_val=X_val, y_val=y_val)
        # ===========================
        # Print some important information.
        if verbose_original >= 1:
            print(
                f"train = {X_train.shape[0]} val = {X_val.shape[0]}, {gap=} "
                f"n_iter = {self.n_iter_:>3}, "
                f"train loss (RMSE) = {np.sqrt(-self.train_score_[-1]):0.4f}, "
                f"validation loss (RMSE) = {np.sqrt(-self.validation_score_[-1]):0.4f}"
            )
        # ==== IMPORTANT ============
        # The second call to fit uses all available training data, with n_iter_
        # from the first call to fit as max_iter.
        n_iter = self.n_iter_
        early_stopping = self.early_stopping
        max_iter = self.max_iter
        self.early_stopping = False
        self.max_iter = n_iter
        super().fit(X, y)
        # ===========================
        # Restore the original hyperparameters.
        self.early_stopping = early_stopping
        self.max_iter = max_iter
        self.verbose = verbose_original
        return self


# ordinal_encoder and categorical_columns are defined earlier in that example.
time_gbrt_pipeline = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("categorical", ordinal_encoder, categorical_columns),
        ],
        remainder="passthrough",
        # Use short feature names to make it easier to specify the categorical
        # variables in the HistGradientBoostingRegressor in the next
        # step of the pipeline.
        verbose_feature_names_out=False,
    ),
    TimeSplittingHGBR(
        learning_rate=0.05,
        max_iter=300,
        early_stopping=True,
        validation_fraction=0.03,  # 252 validation samples
        categorical_features=categorical_columns,
        random_state=42,
    ),
).set_output(transform="pandas")
What I am not sure about is the integration with the Pipeline. For a pipeline made of transformers and a predictor, we would expect to transform the validation set before providing it to the predictor, so we need a smarter pipeline. On the other side, we also have the work on callbacks that is starting to look good, and it could be worth having the discussion about an early-stopping callback.
Note that HGBT, as of now, still needs an OrdinalEncoder as a preprocessor and therefore a pipeline.
We either need the possibility to produce intermediate results from the pipeline, or a way to pass an already preprocessed validation set directly. Note that, for efficiency reasons, that's also what LightGBM and XGBoost have:
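For comparison, here is a rough sketch (not part of this PR) of how LightGBM's and XGBoost's scikit-learn wrappers accept an explicit validation set; the eval_set and early-stopping parameter names are recalled from their documentation and should be double-checked there:

# Sketch only: passing an explicit, already-preprocessed validation set to the
# scikit-learn wrappers of LightGBM and XGBoost.
import lightgbm as lgb
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = rng.randn(500, 4), rng.randn(500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

lgbm = lgb.LGBMRegressor(n_estimators=1000)
lgbm.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=10)],
)

xgbr = xgb.XGBRegressor(n_estimators=1000, early_stopping_rounds=10)
xgbr.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)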
BTW, should we make those params keyword only? |
This would be my preference. |
It already is a keyword arg.
Other than the small comments and documentation, LGTM.
@pytest.mark.parametrize("sample_weight", [False, True])
def test_X_val_in_fit(GradientBoosting, make_X_y, sample_weight):
    """Test that passing X_val, y_val in fit is same as validation fraction."""
    rng = np.random.RandomState(42)
Should this use the global_random_seed thingy?
The actual data does not really matter for this test. The random seeds further below are much more important. So, honest answer, I don't know.
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py
.. versionadded:: 0.23

X_val : array-like of shape (n_val, n_features)
    Additional sample of features for validation used in early stopping.
The doc here deserves a note on the lack of transformations applied to these parameters when used in a pipeline.
Also probably a mini section in the user guide.
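For illustration, a minimal sketch with hypothetical data of the pitfall being described: the validation set is not passed through any preprocessing, so the user has to transform it manually before calling fit with the X_val/y_val parameters added by this PR.

# Minimal sketch (hypothetical data): X_val is NOT transformed by earlier
# pipeline steps, so the preprocessing has to be applied to it manually.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({"color": ["red", "blue", "green", "red"] * 50,
                        "size": ["S", "M", "L", "M"] * 50})
y_train = [0.1, 0.5, 0.2, 0.9] * 50
X_val = pd.DataFrame({"color": ["blue", "green"] * 10, "size": ["L", "S"] * 10})
y_val = [0.4, 0.8] * 10

enc = OrdinalEncoder()
X_train_enc = enc.fit_transform(X_train)
X_val_enc = enc.transform(X_val)  # manual transformation of the validation set

hgbt = HistGradientBoostingRegressor(
    early_stopping=True, categorical_features=[0, 1], max_iter=50
)
hgbt.fit(X_train_enc, y_train, X_val=X_val_enc, y_val=y_val)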
I added a short note in the docstring.
I would prefer a separate PR for a larger user guide and example improvement on this matter.
Something discussed in the last dev meeting was to organize drafting meetings around this topic. @jeremiedbb, while developing an early-stopping callback, should have a concrete idea of some solutions and of their impact in terms of API. I assume that this could give some momentum to the topic since this is something that we certainly want to have.
@glemaitre I don't understand, does that mean we don't want to have this in its current form?
For reference of (new) parameter names in
I think other than minor nits, LGTM.
Minor nits, overall looks good.
    validation_fraction=0.5,
    random_state=rng_seed,
)
m1.fit(X, y, sample_weight)
What should happen if validation_fraction=0.5 and X_val is passed in during fit? I feel like we should error.
I go with a warning saying that X_val wins over validation_fraction. Ok for you?
I slightly prefer raising an error rather than showing a warning.
Why?
I prefer the warning because the default is validation_fraction=0.1 and an error seems patronizing. A user did not make a terrible mistake and we should not make her/his (machine learning) life miserable.
The default could be "auto" instead of 0.1, and then it makes sense to raise for any real value.
Why?
If I am reading the following without running it, it's hard to tell what the behavior is:
clf = HistGradientBoosting(validation_fraction=0.3)
clf.fit(X_train, y_train, X_val=X_val, y_val=y_val)
the default is validation_fraction=0.1
Ah that is unfortunate. I'm okay with the "auto" suggestion from @adrinjalali
If I am reading the following without running it, it's hard to tell what the behavior is
I don't think so, the docstring of validation_fraction says: "It is ignored if X_val and y_val are passed to fit."
Please note that:

- validation_fraction is ignored if early_stopping=False.
- validation_fraction=None means to use training data for early stopping.

In light of these, I suggest the following (see c43d43b):

- Raise an error if X_val, y_val are passed to fit, but early_stopping=False (default is "auto").
- Remove the warning when X_val, y_val are passed and validation_fraction is not None. Reasoning: validation_fraction is just ignored and will never raise. On top, None already has another meaning.
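A small sketch of how the agreed behavior would look from the user side, assuming the semantics described above (the exact error type is an assumption, not quoted from the PR):

# Sketch of the agreed behavior:
# - passing X_val/y_val with early_stopping=False raises (assumed ValueError),
# - validation_fraction is silently ignored when X_val/y_val are given.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X, X_val = rng.randn(200, 3), rng.randn(50, 3)
y, y_val = rng.randn(200), rng.randn(50)

# OK: early stopping uses the provided validation set; the default
# validation_fraction=0.1 is simply ignored.
HistGradientBoostingRegressor(early_stopping=True).fit(X, y, X_val=X_val, y_val=y_val)

# Error: a validation set makes no sense without early stopping.
try:
    HistGradientBoostingRegressor(early_stopping=False).fit(
        X, y, X_val=X_val, y_val=y_val
    )
except ValueError as exc:
    print(exc)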
I'm okay with this.
@adrinjalali Ok for you?
I approved, but since the API changed since @adrinjalali's review, I'm going to not merge yet.
Yes, yes, I know, time is short...
CI is green.
Thanks @lorentzenchr. As a separate PR, it'd be nice to have an example using it.
Wow, I think this is a great achievement. It required metadata routing (#22083 in v1.4 and #22893) and transformation of metadata (#28901 in v1.6) to finally have this usable within a Pipeline.
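Reconstructing the pieces mentioned here, pipeline usage would look roughly like the sketch below; the set_config(enable_metadata_routing=True), set_fit_request and Pipeline(transform_input=...) spellings are recalled from the metadata-routing documentation, not quoted from this thread:

# Rough sketch of routing a validation set through a Pipeline, assuming the
# metadata-routing API and the Pipeline transform_input parameter from #28901
# behave as recalled here.
import numpy as np
import sklearn
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

sklearn.set_config(enable_metadata_routing=True)

rng = np.random.RandomState(0)
X, X_val = rng.randn(200, 3), rng.randn(50, 3)
y, y_val = rng.randn(200), rng.randn(50)

hgbt = HistGradientBoostingRegressor(early_stopping=True).set_fit_request(
    X_val=True, y_val=True
)
pipe = Pipeline(
    # StandardScaler is just a placeholder preprocessing step.
    [("scale", StandardScaler()), ("hgbt", hgbt)],
    # transform_input asks the pipeline to also pass X_val through the already
    # fitted preprocessing steps before handing it to the last step.
    transform_input=["X_val"],
)
pipe.fit(X, y, X_val=X_val, y_val=y_val)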
Reference Issues/PRs
Partially solves #18748.
What does this implement/fix? Explain your changes.
This PR adds to the fit signature of HistGradientBoostingClassifier and HistGradientBoostingRegressor the possibility to pass validation data X_val, y_val and sample_weight_val:
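A minimal usage sketch of the new signature on synthetic data, using the keyword-only fit parameters X_val, y_val and sample_weight_val described above:

# Minimal sketch of the new fit signature added by this PR.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
X, X_val = rng.randn(300, 4), rng.randn(60, 4)
y, y_val = rng.randint(0, 2, size=300), rng.randint(0, 2, size=60)
sw, sw_val = rng.rand(300), rng.rand(60)

clf = HistGradientBoostingClassifier(early_stopping=True, max_iter=100)
clf.fit(
    X,
    y,
    sample_weight=sw,
    X_val=X_val,
    y_val=y_val,
    sample_weight_val=sw_val,
)
print(clf.n_iter_)  # number of boosting iterations chosen via early stopping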