[MRG] Native support for missing values in GBDTs #13911
Conversation
This comment is for documenting the technical changes made in this PR. Ping @ogrisel when you're back ;)
Hi @NicolasHug! Nice work. I have the following comments but otherwise it looks good.
Furthermore, we should document in the API/estimator changes section of the what's new document that the internal data structure of the fitted model has changed in order to natively support missing values.
Trying to load a pickled model fitted with scikit-learn 0.21 in scikit-learn 0.22
will yield an exception such as:
ValueError: Buffer dtype mismatch, expected 'unsigned char' but got 'unsigned int' in
'node_struct.missing_go_to_left'
Model re-training is required in this case.
Alternatively, we could override the __setstate__ method to detect whether the predictor array has the old dtype and reshape it with some padding to make it match the new one. That would be extra nice to our users and would avoid confusing them with a complex changelog message, but it would imply writing a test.
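For illustration, a minimal sketch of what such a __setstate__ override could look like. The dtype and field names below (apart from missing_go_to_left, taken from the error message above) are hypothetical assumptions, not scikit-learn's actual internals:

import numpy as np

# Hypothetical new node dtype; only 'missing_go_to_left' is confirmed by
# the error message quoted above, the other fields are illustrative.
NEW_NODE_DTYPE = np.dtype([
    ('value', np.float64),
    ('feature_idx', np.uint32),
    ('threshold', np.float64),
    ('missing_go_to_left', np.uint8),  # field added by this PR
])

class TreePredictor:
    def __setstate__(self, state):
        nodes = state.get('nodes')
        if nodes is not None and nodes.dtype != NEW_NODE_DTYPE:
            # Copy the fields both dtypes share into a freshly allocated
            # array; fields absent from the old pickle keep their
            # zero-initialized default (i.e. missing values go right).
            upgraded = np.zeros(nodes.shape, dtype=NEW_NODE_DTYPE)
            for name in set(nodes.dtype.names) & set(NEW_NODE_DTYPE.names):
                upgraded[name] = nodes[name]
            state['nodes'] = upgraded
        self.__dict__.update(state)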
predictions_binned = predictor.predict_binned(
    X_binned, missing_values_bin_idx=bin_mapper.missing_values_bin_idx_)
assert np.all(predictions == -gradients)
assert np.all(predictions_binned == -gradients)
While this is a good test, I wonder if it wouldn't be better to write something equivalent that uses only the public API, to make it easier to understand. E.g. something based on the following:
>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> import numpy as np
>>> X = np.asarray([-np.inf, 0, 1, np.inf, np.nan]).reshape(-1, 1)
>>> X
array([[-inf],
[ 0.],
[ 1.],
[ inf],
[ nan]])
>>> y_isnan = np.isnan(X.ravel())
>>> y_isnan
array([False, False, False, False, True])
>>> y_isinf = X.ravel() == np.inf
>>> y_isinf
array([False, False, False, True, False])
>>> stump_clf = HistGradientBoostingClassifier(min_samples_leaf=1, max_iter=1, learning_rate=1., max_depth=2)
>>> stump_clf.fit(X, y_isinf).score(X, y_isinf)
1.0
>>> stump_clf.fit(X, y_isnan).score(X, y_isnan)
1.0
The issue with the public API is that we can't test the predictions for X_binned, which I think is important too.
I'll add your test as well though, it can't hurt ;)
# Make sure in particular that the +inf sample is mapped to the left child
# Note that lightgbm "fails" here and will assign the inf sample to the
# right child, even though it's a "split on nan" situation.
Are you sure LightGBM fails in this case? Why would they have introduced AvoidInf()
if not for this case?
I opened microsoft/LightGBM#2277; it looks like they just fixed it (I haven't checked again, though).
Since this is still experimental, I would rather not worry about that. Adding sample weights and then categorical data would probably also cause the same issue.
I doubt that.
Thanks Olivier, I have addressed the comments and updated the what's new with a compound entry of all the changes so far. I labeled this as a MajorFeature, but feel free to change it, I'm not sure. Regarding the pickles: we don't even support pickling between major versions, so I'm not sure we should have a special case for these estimators.
Well, it's cheap (and friendly) to warn the users in the changelog when that actually happens.
LGTM but some more comments / questions ;)
Whether this is to be considered a major feature or not can be changed later (either at release time or at the next dev meeting).
@adrinjalali Does this have your +1 as well? Just in case, maybe @thomasjpfan @jnothman @rth @glemaitre @qinhanmin2014 would want to give it a quick look before we merge?
I like the proposal. I don't have time soon to check for correctness, unfortunately.
At some point our docs should have a reference listing of the estimators that support missing values.
The examples in the docstrings still need fixing.
Indeed. This time I ran pytest locally prior to pushing :)
This looks good now, and I guess the three of us are in agreement. Merging; nitpicks can go in other PRs. I'd rather have this in, and there's time to fix the remaining issues before the release.
This PR implements support for missing values in histogram-based GBDTs.
By missing values, I mean NaNs. I haven't implemented anything related to sparse matrices (that's for another PR).
Here's a breakdown of the changes, which follow the LightGBM and XGBoost strategies. It's relatively simple:
When binning, missing values are assigned to a specific bin (the first one*). We ignore the missing values when computing the bin thresholds (i.e. we only use non-missing values, just like before).
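For illustration, a minimal sketch of this binning scheme for a single feature, assuming quantile thresholds computed on the non-missing values. This is not the actual _BinMapper code; the sketch puts the missing-values bin first, as the PR originally did (see the footnote at the end for the later change to the last bin):

import numpy as np

def bin_column(values, n_bins=6):
    """Bin one feature column; NaNs go to a dedicated bin (bin 0 here)."""
    missing = np.isnan(values)
    non_missing = values[~missing]
    n_regular_bins = n_bins - 1  # one bin is reserved for missing values
    # Thresholds are quantiles of the non-missing values only.
    quantiles = np.linspace(0, 1, n_regular_bins + 1)[1:-1]
    thresholds = np.quantile(non_missing, quantiles)
    binned = np.empty(values.shape, dtype=np.uint8)
    # Regular bins are shifted by one so bin 0 stays free for NaNs.
    binned[~missing] = np.searchsorted(thresholds, non_missing) + 1
    binned[missing] = 0  # missing_values_bin_idx
    return binned

bin_column(np.array([0.5, np.nan, 1.2, 3.4, np.nan, 2.0]))
# -> array([1, 0, 2, 5, 0, 4], dtype=uint8); both NaNs land in bin 0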
When training, what changes is the strategy to find the best split: when considering a bin to split on, we compute the gain according to two scenarios: mapping the samples with missing values to the left child, or mapping them to the right child.
Concretely, this is done by scanning the bins from left to right (like we used to) and from right to left. This is what LightGBM does, and it's explained in the XGBoost paper (Algorithm 3).
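A simplified sketch of the idea, using per-bin gradient sums and a toy gain criterion. The real splitter works on gradient and hessian histograms with a proper gain formula, and implements the second scenario as an actual right-to-left scan; seeding the left child with the missing-bin statistics, as done below, is an equivalent formulation:

import numpy as np

def toy_gain(sum_left, n_left, sum_right, n_right):
    # Toy criterion: favors children with large, distinct mean gradients.
    if n_left == 0 or n_right == 0:
        return -np.inf
    return sum_left ** 2 / n_left + sum_right ** 2 / n_right

def find_best_split(grad_sums, counts, missing_bin_idx):
    """Return (split_bin, missing_go_to_left, gain) for one feature.

    grad_sums[b], counts[b]: sum of gradients and sample count in bin b.
    Splitting "at bin b" sends the non-missing bins <= b to the left child.
    """
    regular = [b for b in range(len(counts)) if b != missing_bin_idx]
    total_sum, total_n = grad_sums.sum(), counts.sum()
    best = (None, None, -np.inf)
    for missing_go_to_left in (False, True):
        # Scenario 1: missing values go right (plain left-to-right scan).
        # Scenario 2: missing values go left (equivalent to the
        # right-to-left scan done by LightGBM and the XGBoost paper).
        left_sum = grad_sums[missing_bin_idx] if missing_go_to_left else 0.0
        left_n = counts[missing_bin_idx] if missing_go_to_left else 0
        for b in regular[:-1]:  # keep at least one regular bin on the right
            left_sum += grad_sums[b]
            left_n += counts[b]
            g = toy_gain(left_sum, left_n,
                         total_sum - left_sum, total_n - left_n)
            if g > best[2]:
                best = (b, missing_go_to_left, g)
    return best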
At prediction time, samples with NaNs are mapped to the appropriate child (left or right) that was learned to be the one with the best gain. Note that if there were no missing values during training, then samples with missing values are mapped to whichever child has the most samples. That's what H2O does (I haven't checked the other libraries, but that seems to be the most sensible behavior).
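The corresponding prediction logic, sketched with an explicit node object. The actual predictor stores nodes in a packed numpy struct array (as the node_struct error message earlier suggests) rather than Python objects:

import numpy as np

class Node:
    def __init__(self, feature_idx=None, threshold=None, left=None,
                 right=None, missing_go_to_left=False, value=0.0):
        self.feature_idx = feature_idx
        self.threshold = threshold
        self.left, self.right = left, right
        # Learned during training. If no missing values were seen at this
        # split, this is set so that missing values follow the child that
        # received the most training samples.
        self.missing_go_to_left = missing_go_to_left
        self.value = value  # prediction value if this node is a leaf

def predict_one(node, x):
    while node.left is not None:  # descend until we reach a leaf
        if np.isnan(x[node.feature_idx]):
            node = node.left if node.missing_go_to_left else node.right
        elif x[node.feature_idx] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value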
*LightGBM assigns missing values to the last bin; I haven't found a compelling reason to do so. We assign missing values to the first bin instead, which has the advantage that its index is always the same (whereas the index of the last bin may vary, and thus needs to be passed along as a parameter, which is annoying). EDIT: we also assign missing values to the last bin now; see the list of technical changes at #13911 (comment).