
[Feature] Adding a Group And Label Kfold Split Method to Model Selection #14524


Open: wants to merge 1 commit into main
Conversation

@cdknorow commented Jul 30, 2019

[Feature] This commit implements an additional model selection split strategy named LabelAndGroupKFold, which treats the label-and-group combination as something that cannot be split across multiple folds, while at the same time attempting to evenly distribute labels across folds so that each fold has a similar amount of each label type. This is an extension of GroupKFold, which does not attempt to evenly distribute the classes, only the groups.

@jnothman (Member) left a comment

I'm confused by the description here. Is this trying to perform stratification while upholding the grouping constraint? #9413 also implements such a thing. But I'm not sure if that's the goal here.

@cdknorow (Author) commented Aug 2, 2019

@jnothman Essentially, this treats the group and label as something that can only be in a single fold.

The main use case is for time series data where you may have metadata such as Subject, where the subject performs a bunch of activities. When you build your model using windowing, you'll end up with many samples of Subject A performing action 1. It helps in understanding whether the model is generalizing if you build your validation set such that Subject A performing action 1 is only in a single fold. But you also want to make sure your folds have a similar amount of action 1s. The GroupKFold validation method doesn't provide that.

It looks similar to the commit you referenced, but I don't think I would be able to use that to accomplish what this does exactly.

@cdknorow force-pushed the features/label_and_group branch 3 times, most recently from 22919fe to 51f41c4 on August 2, 2019
@amueller (Member) commented Aug 2, 2019

@cdknorow can you say what that PR doesn't do that you do? You want a StratifiedGroupKFold, right?

@cdknorow (Author) commented Aug 5, 2019

@amueller I could probably use #9413; it looks like a good contribution. The main difference is that it computes the mode or median of the y label in each group and puts the whole group into the same fold. This can still lead to imbalances in the y-label distribution across folds, depending on how the data are distributed.

This pull request treats the group and label together as something unique. For each fold it then distributes the y labels evenly, guaranteeing that feature vectors with the same y label and the same group will not end up in different folds. It does, however, allow feature vectors with different y labels but the same group to be placed across folds.

@amueller (Member) commented Aug 5, 2019

@cdknorow can you motivate your suggested behavior? What would the application be?

@cdknorow (Author) commented Aug 5, 2019

@amueller I put this in the code docstring.

The main use case is for time series data where you may have some metadata/group such as Subject, where the subject performs several activities. If you build a model using a sliding window to segment data, you will end up with "Subject A" performing slight variations of "action 1" many times. If you use a validation method that splits up "Subject A" performing "action 1" into different folds, it can often result in data leakage and overfitting. If, however, you build your validation set such that "Subject A" performing "action 1" is only in a single fold, you can be more confident that your model is generalizing to new subjects. This validation will also attempt to ensure you have a similar number of "action 1"s across your folds.
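The leakage risk from windowing can be illustrated with a small synthetic sketch (the `sliding_windows` helper and the data here are hypothetical illustrations, not part of this PR): overlapping windows cut from one recording are nearly identical, so placing one in train and its neighbor in test leaks information.

```python
import numpy as np


def sliding_windows(signal, size=50, step=10):
    """Segment a 1-D signal into overlapping fixed-size windows."""
    return np.array([signal[i:i + size]
                     for i in range(0, len(signal) - size + 1, step)])


rng = np.random.default_rng(0)
recording = rng.normal(size=200)   # "Subject A" performing "action 1"
windows = sliding_windows(recording)

# Adjacent windows share 40 of their 50 samples: near-duplicates that
# must stay within the same fold to avoid leakage.
overlap = np.intersect1d(windows[0], windows[1]).size
print(windows.shape, overlap)
```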

@jnothman (Member) commented Aug 5, 2019 via email

@cdknorow (Author) commented Aug 6, 2019

@jnothman Here is an example that shows the difference between GroupAndLabelKFold and GroupKFold. GroupAndLabelKFold does a better job of keeping the classes even across folds. This is only applicable if you don't need to restrict a group from spanning folds for different classes.

import numpy as np
from sklearn.model_selection import GroupKFold
# GroupAndLabelKFold is the class added by this PR
from sklearn.model_selection import GroupAndLabelKFold

groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y = np.array([1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2])
X = np.zeros(len(y))

cv = GroupAndLabelKFold(n_splits=2)
for train_index, test_index in cv.split(X, y, groups):
    group_test, y_test = groups[test_index], y[test_index]
    group_train, y_train = groups[train_index], y[train_index]
    print("GroupTest", group_test, "LabelTest", y_test)
    print(len(y_train), len(y_test))

cv = GroupKFold(n_splits=2)
for train_index, test_index in cv.split(X, y, groups):
    group_test, y_test = groups[test_index], y[test_index]
    group_train, y_train = groups[train_index], y[train_index]
    print("GroupTest", group_test, "LabelTest", y_test)
    print(len(y_train), len(y_test))


GroupAndLabelKFold
  Fold 1: GroupTest [0 0 0 0 2 2] LabelTest [1 1 1 2 2 2]
          train_size: 6, test_size: 6
  Fold 2: GroupTest [1 1 1 1 2 2] LabelTest [1 1 2 2 1 1]
          train_size: 6, test_size: 6

GroupKFold
  Fold 1: GroupTest [0 0 0 0 2 2 2 2] LabelTest [1 1 1 2 1 2 1 2]
          train_size: 4, test_size: 8
  Fold 2: GroupTest [1 1 1 1] LabelTest [1 1 2 2]
          train_size: 8, test_size: 4

@jnothman (Member) commented Aug 6, 2019 via email

@cdknorow (Author) commented Aug 8, 2019

@jnothman I think it's useful for understanding whether your dataset has enough variance across groups for each class. If you have a large enough dataset, then really any of the methods will do. This method helps you figure out whether you are able to sample your data correctly or whether you need to collect more.

For a real-world dataset that I have, here is a comparison between GroupKFold and GroupAndLabelKFold. GroupKFold does not do a great job of splitting up classes across folds, making it a fairly poor estimate of how the model will perform. By using GroupAndLabelKFold, however, I get a much more reasonable distribution. Again, this method is only useful when a unique (group, label) combination can be treated distinctly, i.e., group 1 / label 2 can be in train and group 1 / label 1 can be in test without data leakage.

Setup: 2-class problem with 2 splits.

GroupKFold
  split 1: class 1 = 95,  class 2 = 566
  split 2: class 1 = 413, class 2 = 248

GroupAndLabelKFold
  split 1: class 1 = 237, class 2 = 412
  split 2: class 1 = 271, class 2 = 402

@jnothman (Member) commented Aug 8, 2019 via email

@amueller mentioned this pull request Aug 8, 2019
@amueller (Member) commented Aug 8, 2019

@jnothman I'm still confused whether this is what I think of as StratifiedGroupKFold or not. It's GroupKFold with a stratification constraint, right? I have to look at #9413 again to see what it implements.

If these have the same goal, then the question is how to implement it. I'm pretty sure it's NP-hard to do exactly right, but probably submodular?

@amueller (Member) commented Aug 8, 2019

@jnothman why do you think this creates skewed sets? This tries to create balanced sets, right? (Sorry my brain has been a bit foggy the last couple of days, might be missing something)

@amueller (Member) commented Aug 8, 2019

I think @cdknorow is right that #9413 doesn't implement what we want for a StratifiedGroupKFold. I need more coffee before I can understand what this PR does, though.

@cdknorow is there a reason you didn't name this StratifiedGroupKFold? Your goal is to combine StratifiedKFold and GroupKFold, right?

@cdknorow (Author) commented Aug 8, 2019

@amueller That's correct, the goal is to create more balanced sets than GroupKFold does. I think the name StratifiedGroupKFold would be a better choice; I'll update the pull request.

@amueller (Member) commented Aug 8, 2019

@cdknorow could you briefly explain your algorithm as well?

@cdknorow force-pushed the features/label_and_group branch from 51f41c4 to e123699 on August 8, 2019
@cdknorow (Author) commented Aug 8, 2019

@jnothman @amueller I think the main question I'm unsure about is whether, in general, it would be better to constrain a group to stay within a single fold, or whether it is OK to relax that constraint and allow a group to span folds as long as the same (group, label) combination never appears in different folds, i.e., if group 1 / label 2 is in test, then group 1 / label 2 cannot be in train, but group 1 / label 1 could be.

Right now this implements the latter, which allows for better stratification of the class labels. The only problem is that for some domains/datasets this might lead to leakage; for those, the strict constraint forcing each group into a single fold would be better.

@cdknorow force-pushed the features/label_and_group branch 2 times, most recently from 10f191c to c41416a on August 8, 2019
@jnothman (Member) commented Aug 8, 2019

> why do you think this creates skewed sets?

Because it intentionally puts, for some group, all samples of some label into one test fold.

@cdknorow (Author) commented Aug 8, 2019

@amueller

The algorithm boils down to this:

  1. Select a unique class in the dataset.
  2. Get all samples that contain that class.
  3. Group those samples by their group, and sort the groups by the number of samples in each.
  4. Starting with fold index 0, place all samples from the group with the largest number of samples into that fold.
  5. If weighting is enabled, increment that fold's weight counter by n_samples in the group; otherwise by 1.
  6. Get the next batch of samples from the next-largest group. Place it in the appropriate fold based on the current fold weights.
  7. Go back to step 1 and repeat for the other unique classes.

At the end, you will have folds that have stratified classes along with the guarantee that each fold will have unique (group, label) combinations.
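The steps above can be sketched as a small greedy assignment in Python (a minimal approximation for illustration, not the PR's actual implementation; the function and variable names here are hypothetical):

```python
from collections import defaultdict

import numpy as np


def label_group_fold_assignment(y, groups, n_splits=2):
    """Greedy sketch: for each class, sort its (group, label) blocks by
    size and place each block into the fold that currently holds the
    fewest samples of that class."""
    y, groups = np.asarray(y), np.asarray(groups)
    fold_of = {}  # (group, label) -> fold index
    for label in np.unique(y):                        # step 1
        blocks = defaultdict(list)
        for idx in np.flatnonzero(y == label):        # steps 2-3
            blocks[groups[idx]].append(idx)
        counts = np.zeros(n_splits)                   # per-class fold weights
        # largest block first (steps 4-6)
        for g, idxs in sorted(blocks.items(), key=lambda kv: -len(kv[1])):
            fold = int(np.argmin(counts))
            counts[fold] += len(idxs)
            fold_of[(g, label)] = fold
    return np.array([fold_of[(g, lbl)] for g, lbl in zip(groups, y)])


groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y = np.array([1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2])
folds = label_group_fold_assignment(y, groups, n_splits=2)
# Each (group, label) pair maps to exactly one fold, and class counts
# per fold are balanced greedily.
print(folds)
```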

@amueller (Member) commented

@jnothman that's kind of the point of GroupKFold, right? That all samples within a group are within the same fold?

@cdknorow oh so you don't require a group to be contained within a single fold? You allow group 1 class a to be in fold 1 and group 1 class b to be in fold 2?
That's not what the other PR allows, and I'm not sure I see the motivation for that.
That would be StratifiedGroupKFold as I think of it applied to labels=y, groups=product(labels, groups).
So I would probably rather implement the other one, and then have your use-case covered by setting groups to the product of groups and labels.
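The workaround described here can be sketched with stock scikit-learn (an illustrative sketch, assuming the goal is only the grouping constraint, not stratification): encode each (group, label) pair as a single combined group id and hand it to plain GroupKFold, so no (group, label) combination spans folds.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y = np.array([1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2])
X = np.zeros((len(y), 1))

# Encode each (group, label) pair as a single combined group id.
combined = np.array([f"{g}-{lbl}" for g, lbl in zip(groups, y)])

cv = GroupKFold(n_splits=2)
for train_idx, test_idx in cv.split(X, y, combined):
    # No (group, label) pair appears on both sides of the split.
    assert set(combined[train_idx]).isdisjoint(combined[test_idx])
```

Note that GroupKFold alone gives no stratification guarantee over y; that is the part this PR adds.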

which attempts to evenly distribute groups and labels across each fold.
@cdknorow force-pushed the features/label_and_group branch from c41416a to 917217a on August 12, 2019
@cdknorow (Author) commented Aug 30, 2019

@amueller The other one, with a grouping over the product of (labels, groups), would probably work. I updated this pull request to restrict groups to a single fold by default. I still have to think some more about the best way to implement this, though. I think it can be improved by better sampling of which label we stratify over, which I have some ideas for.

Base automatically changed from master to main January 22, 2021 10:51