
[Feature] Adding a Group And Label Kfold Split Method to Model Selection #14524


Open: wants to merge 1 commit into main
Conversation

@cdknorow commented Jul 30, 2019

[Feature] This commit implements an additional model selection split strategy named LabelAndGroupKFold, which treats the label-and-group combination as something that cannot be split across multiple folds, while at the same time attempting to evenly distribute labels across folds so that each fold has a similar amount of each label type. This is an extension of GroupKFold, which does not attempt to evenly distribute the classes, only the groups.

@jnothman (Member) left a comment

I'm confused by the description here. Is this trying to perform stratification while upholding the grouping constraint? #9413 also implements such a thing. But I'm not sure if that's the goal here.

@cdknorow (Author) commented Aug 2, 2019

@jnothman Essentially, this treats the group and label as something that can only be in a single fold.

The main use case is for time series data where you may have metadata such as Subject, where the subject performs a bunch of activities. When you build your model using windowing, you'll end up with many samples of Subject A performing action 1. It helps in understanding whether the model is generalizing if you build your validation set such that Subject A performing action 1 is only in a single fold. But you also want to make sure your folds have a similar amount of action 1s. The GroupKFold validation method doesn't provide that.

It looks similar to the commit you referenced, but I don't think I would be able to use that to accomplish what this does exactly.

@cdknorow force-pushed the features/label_and_group branch 3 times, most recently from 22919fe to 51f41c4 on August 2, 2019
@amueller (Member) commented Aug 2, 2019

@cdknorow can you say what that PR doesn't do that you do? You want a StratifiedGroupKFold, right?

@cdknorow (Author) commented Aug 5, 2019

@amueller I could probably use #9413; it looks like a good contribution. The main difference is that it computes the mode or median of the y label in each group and puts the whole group into the same fold. This can still lead to imbalances in the y-label distribution across folds, depending on how the data are distributed.

This pull request treats the group and label together as something unique. For each fold it then distributes the y labels evenly, guaranteeing that feature vectors with the same y label and the same group will not end up in different folds. It does, however, allow feature vectors with different y labels but the same group to be placed across folds.

@amueller (Member) commented Aug 5, 2019

@cdknorow can you motivate your suggested behavior? What would the application be?

@cdknorow (Author) commented Aug 5, 2019

@amueller I put this in the code docstring.

The main use case is for time series data where you may have some metadata/group such as Subject, where the subject performs several activities. If you build a model using a sliding window to segment data, you will end up with "Subject A" performing slight variations of "action 1" many times. If you use a validation method that splits up "Subject A" performing "action 1" into different folds, it can often result in data leakage and overfitting. If, however, you build your validation set such that "Subject A" performing "action 1" is only in a single fold, you can be more confident that your model is generalizing to new subjects. This validation will also attempt to ensure you have a similar number of "action 1"s across your folds.
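The leakage risk from windowing can be illustrated with a small synthetic sketch (the `sliding_windows` helper and the data here are hypothetical illustrations, not part of this PR): overlapping windows cut from one recording are nearly identical, so placing one in train and its neighbor in test leaks information.

```python
import numpy as np


def sliding_windows(signal, size=50, step=10):
    """Segment a 1-D signal into overlapping fixed-size windows."""
    return np.array([signal[i:i + size]
                     for i in range(0, len(signal) - size + 1, step)])


rng = np.random.default_rng(0)
recording = rng.normal(size=200)   # "Subject A" performing "action 1"
windows = sliding_windows(recording)

# Adjacent windows share 40 of their 50 samples: near-duplicates that
# must stay within the same fold to avoid leakage.
overlap = np.intersect1d(windows[0], windows[1]).size
print(windows.shape, overlap)
```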

@jnothman (Member) commented Aug 5, 2019 via email

@cdknorow (Author) commented Aug 6, 2019

@jnothman Here is an example that shows the difference between GroupAndLabelKFold and GroupKFold. GroupAndLabelKFold does a better job of keeping the classes even across folds. This is only applicable if you don't need to restrict a group from spanning folds for different classes.

import numpy as np
from sklearn.model_selection import GroupKFold
# GroupAndLabelKFold is the class added by this PR
from sklearn.model_selection import GroupAndLabelKFold

groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y = np.array([1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2])
X = np.zeros(len(y))

cv = GroupAndLabelKFold(n_splits=2)
for train_index, test_index in cv.split(X, y, groups):
    group_test, y_test = groups[test_index], y[test_index]
    group_train, y_train = groups[train_index], y[train_index]
    print("GroupTest", group_test, "LabelTest", y_test)
    print(len(y_train), len(y_test))

cv = GroupKFold(n_splits=2)
for train_index, test_index in cv.split(X, y, groups):
    group_test, y_test = groups[test_index], y[test_index]
    group_train, y_train = groups[train_index], y[train_index]
    print("GroupTest", group_test, "LabelTest", y_test)
    print(len(y_train), len(y_test))


GroupAndLabelKFold
  Fold 1: GroupTest [0 0 0 0 2 2] LabelTest [1 1 1 2 2 2]
          train_size: 6, test_size: 6
  Fold 2: GroupTest [1 1 1 1 2 2] LabelTest [1 1 2 2 1 1]
          train_size: 6, test_size: 6

GroupKFold
  Fold 1: GroupTest [0 0 0 0 2 2 2 2] LabelTest [1 1 1 2 1 2 1 2]
          train_size: 4, test_size: 8
  Fold 2: GroupTest [1 1 1 1] LabelTest [1 1 2 2]
          train_size: 8, test_size: 4

@jnothman (Member) commented Aug 6, 2019 via email

@cdknorow (Author) commented Aug 8, 2019

@jnothman I think it's useful for understanding whether your dataset has enough variance across groups for each class. If you have a large enough dataset, then really any of the methods will do. This method helps you figure out whether you are able to sample your data correctly or whether you need to collect more.

For a real-world dataset that I have, here is a comparison between GroupKFold and GroupAndLabelKFold. GroupKFold does not do a great job of splitting up classes across folds, making it a fairly poor estimate of how the model will perform. By using GroupAndLabelKFold, however, I get a much more reasonable distribution. Again, this method is only useful when a unique (group, label) combination can be treated distinctly, i.e., group 1 / label 2 can be in train and group 1 / label 1 can be in test without data leakage.

Setup: 2-class problem with 2 splits.

GroupKFold
  split 1: class 1 = 95,  class 2 = 566
  split 2: class 1 = 413, class 2 = 248

GroupAndLabelKFold
  split 1: class 1 = 237, class 2 = 412
  split 2: class 1 = 271, class 2 = 402

@jnothman (Member) commented Aug 8, 2019 via email

@amueller mentioned this pull request Aug 8, 2019
@amueller (Member) commented Aug 8, 2019

@jnothman I'm still confused whether this is what I think of as StratifiedGroupKFold or not. It's GroupKFold with a stratification constraint, right? I have to look at #9413 again to see what it implements.

If these have the same goal, then the question is how to implement it. I'm pretty sure it's NP-hard to do exactly right, but probably submodular?

@amueller (Member) commented Aug 8, 2019

@jnothman why do you think this creates skewed sets? This tries to create balanced sets, right? (Sorry my brain has been a bit foggy the last couple of days, might be missing something)

@amueller (Member) commented Aug 8, 2019

I think @cdknorow is right that #9413 doesn't implement what we want for a StratifiedGroupKFold. I need more coffee before I can understand what this PR does, though.

@cdknorow is there a reason you didn't name this StratifiedGroupKFold? Your goal is to combine StratifiedKFold and GroupKFold, right?

@cdknorow (Author) commented Aug 8, 2019

@amueller That's correct, the goal is to create more balanced sets than GroupKFold does. I think the name StratifiedGroupKFold would be a better choice; I'll update the pull request.

@amueller (Member) commented Aug 8, 2019

@cdknorow could you briefly explain your algorithm as well?

@cdknorow force-pushed the features/label_and_group branch from 51f41c4 to e123699 on August 8, 2019
@cdknorow (Author) commented Aug 8, 2019

@jnothman @amueller I think the main question I'm unsure about is whether, in general, it would be better to constrain a group to stay within a single fold, or whether it is OK to relax that constraint and allow a group to span folds as long as the same (group, label) combination never appears in different folds, i.e., if group 1 / label 2 is in test, then group 1 / label 2 cannot be in train, but group 1 / label 1 could be.

Right now this implements the latter, which allows for better stratification of the class labels. The only problem is that for some domains/datasets this might lead to leakage; for those, the strict constraint forcing each group into a single fold would be better.

@cdknorow force-pushed the features/label_and_group branch 2 times, most recently from 10f191c to c41416a on August 8, 2019
@jnothman (Member) commented Aug 8, 2019

> why do you think this creates skewed sets?

Because it intentionally puts, for some group, all samples of some label into one test fold.

@cdknorow (Author) commented Aug 8, 2019

@amueller

The algorithm boils down to this:

  1. Select a unique class in the dataset.
  2. Get all samples that contain that class.
  3. Group those samples by their group, and sort the groups by the number of samples in each.
  4. Starting with fold index 0, place all samples from the group with the largest number of samples into that fold.
  5. If weighting is enabled, increment that fold's weight counter by n_samples in the group; otherwise by 1.
  6. Get the next batch of samples from the next-largest group. Place it in the appropriate fold based on the current fold weights.
  7. Go back to step 1 and repeat for the other unique classes.

At the end, you will have folds that have stratified classes along with the guarantee that each fold will have unique (group, label) combinations.
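The steps above can be sketched as a small greedy assignment in Python (a minimal approximation for illustration, not the PR's actual implementation; the function and variable names here are hypothetical):

```python
from collections import defaultdict

import numpy as np


def label_group_fold_assignment(y, groups, n_splits=2):
    """Greedy sketch: for each class, sort its (group, label) blocks by
    size and place each block into the fold that currently holds the
    fewest samples of that class."""
    y, groups = np.asarray(y), np.asarray(groups)
    fold_of = {}  # (group, label) -> fold index
    for label in np.unique(y):                        # step 1
        blocks = defaultdict(list)
        for idx in np.flatnonzero(y == label):        # steps 2-3
            blocks[groups[idx]].append(idx)
        counts = np.zeros(n_splits)                   # per-class fold weights
        # largest block first (steps 4-6)
        for g, idxs in sorted(blocks.items(), key=lambda kv: -len(kv[1])):
            fold = int(np.argmin(counts))
            counts[fold] += len(idxs)
            fold_of[(g, label)] = fold
    return np.array([fold_of[(g, lbl)] for g, lbl in zip(groups, y)])


groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y = np.array([1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2])
folds = label_group_fold_assignment(y, groups, n_splits=2)
# Each (group, label) pair maps to exactly one fold, and class counts
# per fold are balanced greedily.
print(folds)
```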

@amueller (Member) commented

@jnothman that's kind of the point of GroupKFold, right? That all samples within a group are within the same fold?

@cdknorow oh so you don't require a group to be contained within a single fold? You allow group 1 class a to be in fold 1 and group 1 class b to be in fold 2?
That's not what the other PR allows, and I'm not sure I see the motivation for that.
That would be StratifiedGroupKFold as I think of it applied to labels=y, groups=product(labels, groups).
So I would probably rather implement the other one, and then have your use-case covered by setting groups to the product of groups and labels.
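The workaround described here can be sketched with stock scikit-learn (an illustrative sketch, assuming the goal is only the grouping constraint, not stratification): encode each (group, label) pair as a single combined group id and hand it to plain GroupKFold, so no (group, label) combination spans folds.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y = np.array([1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2])
X = np.zeros((len(y), 1))

# Encode each (group, label) pair as a single combined group id.
combined = np.array([f"{g}-{lbl}" for g, lbl in zip(groups, y)])

cv = GroupKFold(n_splits=2)
for train_idx, test_idx in cv.split(X, y, combined):
    # No (group, label) pair appears on both sides of the split.
    assert set(combined[train_idx]).isdisjoint(combined[test_idx])
```

Note that GroupKFold alone gives no stratification guarantee over y; that is the part this PR adds.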

which attempts to evenly distribute groups and labels across each fold.
@cdknorow force-pushed the features/label_and_group branch from c41416a to 917217a on August 12, 2019
@cdknorow (Author) commented Aug 30, 2019

@amueller The other one, with a grouping over the product of (labels, groups), would probably work. I updated this pull request to restrict groups to a single fold by default. I still have to think some more about the best way to implement this, though. I think it can be improved by better sampling of which label we stratify over, which I have some ideas for.

Base automatically changed from master to main January 22, 2021 10:51