[Feature] Adding a Group And Label Kfold Split Method to Model Selection #14524
Conversation
I'm confused by the description here. Is this trying to perform stratification while upholding the grouping constraint? #9413 also implements such a thing. But I'm not sure if that's the goal here.
@jnothman Essentially, this treats the (group, label) combination as something that can only be in a single fold. The main use case is time-series data where you have metadata such as Subject, and each subject performs a number of activities. When you build your model using windowing, you end up with many windows of Subject A performing action 1. Building your validation set so that Subject A performing action 1 is only in a single fold helps you tell whether the model is generalizing, but you also want each fold to contain a similar number of action-1 samples. GroupKFold doesn't provide that. It looks similar to the commit you referenced, but I don't think I could use that to accomplish exactly what this does.
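A toy sketch of the windowing scenario described above (the data and names here are invented for illustration, not taken from the PR): with windows interleaved in time, plain KFold puts windows of the same (subject, action) pair into both train and test, while GroupKFold keyed on the pair keeps each pair intact.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Invented windowed data: 18 windows, 3 subjects, 2 actions.
subjects = np.tile(["A", "B", "C"], 6)   # windows interleaved across subjects
actions = np.repeat([1, 2], 9)           # first half action 1, second half 2
pairs = np.array([f"{s}-{a}" for s, a in zip(subjects, actions)])
X = np.zeros((len(pairs), 1))            # features are irrelevant here

def leaked_pairs(splitter, **kw):
    """Count (subject, action) pairs appearing in both train and test."""
    return sum(len(set(pairs[tr]) & set(pairs[te]))
               for tr, te in splitter.split(X, actions, **kw))

print(leaked_pairs(KFold(n_splits=3)))                     # positive: leakage
print(leaked_pairs(GroupKFold(n_splits=3), groups=pairs))  # 0: pairs intact
```

Keeping each pair in one fold is only half of the request; the other half, balancing the action counts per fold, is what GroupKFold alone does not attempt.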
@cdknorow can you say what that PR doesn't do that you do? You want a
@amueller I could probably use #9413; it looks like a good contribution. The main difference is that it computes the mode or median of the y label in each group and puts all of the group's samples into the same fold. This can still lead to imbalances in the y-label distribution across folds, depending on how the data is distributed. This pull request instead treats the group and label together as the unique unit, then distributes the y labels evenly across folds, guaranteeing that feature vectors with the same y label and the same group will not be in different folds. It does, however, allow feature vectors from the same group with different y labels to be placed across folds.
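A hedged sketch of the #9413-style behavior being described (invented data and names, not #9413's actual code): assign each group the mode of its labels, stratify the groups on that mode, and inspect the realized label mix per fold, which can stay skewed when groups mix labels and differ in size.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold

# Invented data: four groups with mixed labels and unequal sizes.
y = np.array([0]*6 + [0, 0, 1] + [1, 1, 1] + [1, 1, 1, 0, 0])
groups = np.array([1]*6 + [2]*3 + [3]*3 + [4]*5)

uniq = np.unique(groups)
# Modal label per group (ties broken by first occurrence).
modes = np.array([Counter(y[groups == g]).most_common(1)[0][0] for g in uniq])

# Stratify the *groups* on their mode label, as #9413 is described to do.
skf = StratifiedKFold(n_splits=2)
for _, te in skf.split(uniq.reshape(-1, 1), modes):
    fold_counts = Counter(y[np.isin(groups, uniq[te])])
    print(fold_counts)  # realized label mix per fold can still be uneven
```

Treating each (group, label) combination as its own unit, as this PR proposes, gives the splitter finer pieces to balance with.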
@cdknorow can you motivate your suggested behavior? What would the application be?
@amueller I put this in the code docstring.
I still don't think I've clearly understood. If it's just about making sure some combination of labels is excluded from train when present in test, then the existing grouped splitters should be sufficient with the right group labels.
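A quick sketch of that suggestion (invented data): feeding a composite (group, label) key to the existing GroupKFold already enforces the "same combination never in both train and test" constraint, though without any label balancing:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Invented data for illustration.
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
groups = np.array([1, 1, 1, 1, 2, 2, 2, 2])

# One synthetic "group" per (original group, label) combination.
composite = np.array([f"{g}-{label}" for g, label in zip(groups, y)])
X = np.zeros((len(y), 1))

for tr, te in GroupKFold(n_splits=2).split(X, y, groups=composite):
    # A combination present in test never appears in train.
    assert not set(composite[tr]) & set(composite[te])
```

What this does not guarantee is a similar number of each label per fold, which is the gap the PR is aiming at.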
@jnothman Here is an example that shows the difference between GroupAndLabelKFold and GroupKFold. GroupAndLabelKFold does a better job of keeping the classes even across folds. This is only applicable if you don't need to prevent a group from spanning folds when its samples have different classes.
Sorry, I still don't get why this is a good idea, i.e. how those test sets are a good proxy for an independent sample from the population you want to evaluate on.
@jnothman I think it's useful for understanding whether your dataset has enough variance across groups for each class. If you have a large enough dataset, then really any of the methods will do. This method helps you figure out whether you are sampling your data correctly or whether you need to collect more. For a real-world dataset that I have, here is a comparison between GroupKFold and GroupAndLabelKFold. GroupKFold does not do a great job of splitting up classes across folds, making it a fairly bad metric for how the model can perform. By using GroupAndLabelKFold, however, I'm able to get a much more reasonable distribution. Again, this method is only useful when a unique (group, label) combination can be considered distinct, i.e. group 1 / label 2 can be in train and group 1 / label 1 can be in test without data leakage. Setup: 2-class problem, with 2 splits. [Fold-distribution comparison for GroupKFold and GroupAndLabelKFold omitted.]
I think I get your motivation, but am reluctant to consider it further because I think it is hard to explain and would be hard for users to understand how to use it carefully; it tries to construct skewed test sets and that is a very unfamiliar and unintuitive approach.
@jnothman I'm still confused whether this is what I think of as If these have the same goal, then the question is how to implement it - pretty sure it's NP-hard to do exactly right. But probably submodular?
@jnothman why do you think this creates skewed sets? This tries to create balanced sets, right? (Sorry, my brain has been a bit foggy the last couple of days, might be missing something.)
I think @cdknorow is right that #9413 doesn't implement what we want for a StratifiedGroupKFold. I need more coffee before I can understand what this PR does, though. @cdknorow is there a reason you didn't name this
@amueller that's correct, the goal is to create more balanced sets than GroupKFold does. I think the name StratifiedGroupKFold would be a better choice; I'll update the pull request.
@cdknorow could you briefly explain your algorithm as well?
@jnothman @amueller I think the main question I'm not sure about is: in general, would it be better to constrain groups to be within the same fold, or is it OK to relax that constraint and allow a group to span folds, as long as the same (group, class) combination never appears in different folds? I.e. if group 1 / label 2 is in test, then group 1 / label 2 cannot be in train, but group 1 / label 1 could be. Right now this implements the latter, which allows for better stratification of class labels. The only problem is that for some domains/datasets this might lead to leakage; there, the strict constraint forcing groups into the same fold would be better.
Because it intentionally puts, for some group, all samples of some label into one test fold. |
The algorithm boils down to this. At the end, you will have folds with stratified classes, along with the guarantee that each fold will have unique (group, label) combinations.
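The intermediate steps of the algorithm did not survive in this thread, so here is a hedged sketch of one greedy strategy consistent with the stated outcome (an illustration, not the PR's actual implementation; the function name is invented): treat each (group, label) pair as an indivisible unit and assign units, largest first, to whichever fold currently holds the fewest samples of that label.

```python
from collections import Counter, defaultdict

def group_label_kfold(y, groups, n_splits):
    """Hypothetical sketch: fold id per sample; (group, label) pairs stay whole."""
    units = defaultdict(list)                 # (group, label) -> sample indices
    for i, (g, label) in enumerate(zip(groups, y)):
        units[(g, label)].append(i)

    label_per_fold = [Counter() for _ in range(n_splits)]
    fold_of = [None] * len(y)
    # Place large units first so small ones can smooth out imbalances later.
    for (g, label), idx in sorted(units.items(), key=lambda kv: -len(kv[1])):
        fold = min(range(n_splits), key=lambda f: label_per_fold[f][label])
        label_per_fold[fold][label] += len(idx)
        for i in idx:
            fold_of[i] = fold
    return fold_of

y = [0, 0, 1, 1, 0, 0, 1, 1]
groups = [1, 1, 1, 1, 2, 2, 2, 2]
folds = group_label_kfold(y, groups, n_splits=2)
print(folds)  # each (group, label) pair lands in exactly one fold
```

An exact optimum would require solving a balanced-partitioning problem, which is why the thread suggests it is hard to do exactly right; a greedy pass like this only approximates it.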
@jnothman that's kind of the point of GroupKFold, right? That all samples within a group are within the same fold? @cdknorow oh so you don't require a group to be contained within a single fold? You allow group 1 class a to be in fold 1 and group 1 class b to be in fold 2?
@amueller the other one, with a grouping of (labels, groups), would probably work. I updated this pull request to restrict groups to a single fold by default. I still have to think some more about the best way to implement this, though. I think it can be improved by better sampling of which label we stratify over, which I have some ideas for.
[Feature] This commit implements an additional model selection split strategy named LabelAndGroupKFold, which treats the label and group combination as something that cannot be split across multiple folds, while at the same time attempting to distribute labels evenly across folds so that each fold has a similar amount of each label type. This is an extension of GroupKFold, which does not attempt to evenly distribute the classes, only the groups.