[MRG] add stratify and shuffle variants for GroupKFold #9413
Conversation
What do you feel still needs to be done on this PR (i.e., why WIP)? I assume you need more tests of stratify and shuffle.
sklearn/model_selection/_split.py (Outdated)

```
method : string, default='balance'
    One of 'balance', 'stratify', 'shuffle'.
    By default, try to equalize the sizes of the resulting folds.
    If 'stratify', sort groups according to ``y`` variable and distribute
```
It only sorts groups according to the y variable of the first sample for that group, not, say, the mean or the mode. I think the stratify case is most useful when there is a many-to-one relationship between group and target. And I wonder if we should enforce that relationship (throw an error if there are many y values for some group), just to make this logic explicable, straightforward, and invariant to reordering the samples.
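The suggested check could look roughly like this. This is a minimal sketch; `check_group_target_consistency` is a hypothetical name, not scikit-learn API:

```python
import numpy as np

def check_group_target_consistency(y, groups):
    # Hypothetical helper illustrating the many-to-one check
    # suggested above: every group must map to a single y value,
    # so stratification is invariant to reordering the samples.
    y = np.asarray(y)
    groups = np.asarray(groups)
    for g in np.unique(groups):
        if len(np.unique(y[groups == g])) > 1:
            raise ValueError(
                "Group {!r} has more than one distinct y value".format(g)
            )

check_group_target_consistency([0, 0, 1, 1], ["a", "a", "b", "b"])  # passes
```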
I'm not sure how to make tests for the stratify and shuffle cases. Maybe the example I added for stratify is enough? Other than that, I suppose this PR is ready.

Am I right in saying that you're stratifying on the basis of the first y value in a group? Ideally it would not be so brittle, but at a minimum that needs documenting.
Ideally the shuffling should not alter the balance in the folds too much. Otherwise, what is the difference with GroupShuffleSplit?

The difference is that this ensures you use every group in testing exactly once. That is, in general, the difference between k-fold and shuffle-split.
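To make that k-fold guarantee concrete, here is a minimal numpy sketch of a 'shuffle' assignment (the helper name is assumed; this is not the PR's implementation): the unique groups are permuted and then dealt out round-robin, so every group still lands in exactly one test fold, which is the guarantee GroupShuffleSplit does not give.

```python
import numpy as np

def shuffled_group_fold_ids(groups, n_splits=3, random_state=0):
    # Permute the unique groups, then deal them out round-robin.
    # Each group lands in exactly one fold, so it is used for
    # testing exactly once across the n_splits folds.
    groups = np.asarray(groups)
    uniq = np.unique(groups)
    rng = np.random.RandomState(random_state)
    perm = rng.permutation(len(uniq))
    fold_of_group = {uniq[idx]: i % n_splits for i, idx in enumerate(perm)}
    return np.array([fold_of_group[g] for g in groups])

folds = shuffled_group_fold_ids(["a", "a", "b", "c", "c", "d"], n_splits=2)
```

With four groups and two folds, each fold receives exactly two groups regardless of the permutation, so fold sizes stay roughly balanced even though group order is random.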
I modified "stratify" to use the median value for each group. The thing about balancing the folds versus stratifying or shuffling is that it is a trade-off, and this PR allows one to make that choice explicitly. If your dataset has a small number of groups, they will strongly constrain the possible folds, and balancing may take priority. When there is a sufficient number of groups, or the groups are all the same size, the fold sizes will be approximately balanced either way, and one can stratify or shuffle the folds instead.
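The median-based idea can be sketched as follows (a simplified illustration with an assumed helper name, not the PR's actual code): groups are sorted by the median y of their samples and then distributed round-robin over the folds, so each fold spans the range of the target.

```python
import numpy as np

def stratified_group_fold_ids(y, groups, n_splits=3):
    # Sort the unique groups by the median y of their samples,
    # then deal them out round-robin so each fold covers the
    # full range of target values.
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    uniq = np.unique(groups)
    medians = np.array([np.median(y[groups == g]) for g in uniq])
    order = np.argsort(medians)
    fold_of_group = {uniq[idx]: i % n_splits for i, idx in enumerate(order)}
    return np.array([fold_of_group[g] for g in groups])

folds = stratified_group_fold_ids(
    y=[0.1, 0.2, 0.9, 1.0, 0.5, 0.6],
    groups=[0, 0, 1, 1, 2, 2],
    n_splits=3,
)
# Groups ordered by median y (0, then 2, then 1) get folds 0, 1, 2 in turn.
```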
Tests appear to be failing.

All tests pass now, so I went ahead and renamed this MRG.
It's not clear from the docs what this actually does. Can you maybe give an example in the docs? And can you please add a legend to the plots?
Thanks for the comments. You're right, I didn't think the discrete case through, and the documentation should be improved. What kind of API should I use for distinguishing the discrete and the continuous case? Autodetection is possible but probably not a good idea. The regular StratifiedKFold only supports the discrete case. Maybe I should add support for the continuous case there too? You're right that shuffling is not completely random, but otherwise the folds could be very uneven. Did you mean it should be better documented, or that the implementation should be different?
My comments were mostly on the documentation. I think we should be very clear on what we are doing. Your solution seems good for regression, but it is not the only possible way to stratify, I think. I'm not opposed to adding this stratification strategy, but we should try to describe well what the different strategies are doing. For feature_selection, I think doing regression and classification in the same class has given us a lot of headache, so in some way I'd prefer different classes for regression and classification. On the other hand, I like the way stratification is implemented as an option here: it's somewhat different from the other classes, but we have a real explosion of different classes, and I might prefer to have it as a parameter. But we can't really have different classes for regression and classification and also have stratification as a parameter; that makes no sense. We could auto-detect using type_of_target, though that's a bit dangerous: if there are many different ints, how do you know if they are classes or regression targets? Maybe method="stratify_classification" and method="stratify_regression" would work, so we don't have redundant parameters? There is a PR somewhere for stratified cross-validation for regression. It uses binning, though, and I think sorting is better. I haven't looked at it in a while.

Personally, I think it's more important to be clear on when you need each strategy than to detail how it works. The code should be legible enough for the latter, and if you can't explain why you need a strategy, we probably don't need so many strategies.
I think both are important.

But yeah, we definitely need a motivation for each case. I think the case for having some stratified option is pretty clear, but I feel there are non-obvious choices to make.
Closing this one as it needs a refresh and motivation, but related: #26821 |
Supersedes #5396
Adds an option "method" to GroupKFold to change the way groups are distributed over folds. Current default is to balance the sizes of the folds. This adds the alternative of stratifying on the y variable, or shuffling the groups to randomize the folds they end up in.
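For contrast, the default 'balance' behaviour can also be sketched in a few lines (assumed helper name; GroupKFold's real implementation may differ in detail): visit groups from largest to smallest and put each into the currently smallest fold, a greedy strategy that equalizes fold sizes.

```python
import numpy as np

def balanced_group_fold_ids(groups, n_splits=3):
    # Greedy size balancing: assign each group, largest first,
    # to whichever fold currently has the fewest samples.
    groups = np.asarray(groups)
    uniq, counts = np.unique(groups, return_counts=True)
    fold_sizes = np.zeros(n_splits, dtype=int)
    fold_of_group = {}
    for idx in np.argsort(counts)[::-1]:   # largest group first
        f = int(np.argmin(fold_sizes))     # lightest fold so far
        fold_of_group[uniq[idx]] = f
        fold_sizes[f] += counts[idx]
    return np.array([fold_of_group[g] for g in groups])

# One group of 3 samples and three singleton groups split over 2 folds:
folds = balanced_group_fold_ids(["a", "a", "a", "b", "c", "d"], n_splits=2)
# -> the big group fills one fold, the three singletons fill the other.
```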