

FEA Group aware Time-based cross validation #16236


Open · wants to merge 107 commits into base: main

Conversation

getgaurav2
Contributor

getgaurav2 commented Jan 26, 2020

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Fixes #14257
1. Assumes that groups are contiguous.
2. Splits the groups into train and test indices using TimeSeriesSplit.
3. Uses these group indices to get the indices for the original data.
4. Uses the max_train_size parameter to trim the train_array on each iteration of split (a minimal sketch of this approach follows below).
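
A minimal sketch of this approach (illustrative only, not the PR's actual implementation; the helper name group_time_series_split is hypothetical):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def group_time_series_split(groups, n_splits=5, max_train_size=None):
    # Split the ordered, contiguous group labels with TimeSeriesSplit,
    # then expand the selected groups back into sample indices.
    groups = np.asarray(groups)
    # unique group labels, in order of first appearance
    _, first_idx = np.unique(groups, return_index=True)
    unique_groups = groups[np.sort(first_idx)]
    for g_train, g_test in TimeSeriesSplit(n_splits=n_splits).split(unique_groups):
        train = np.flatnonzero(np.isin(groups, unique_groups[g_train]))
        test = np.flatnonzero(np.isin(groups, unique_groups[g_test]))
        if max_train_size is not None:
            # trims *samples* here; whether it should count groups instead
            # is one of the questions raised below
            train = train[-max_train_size:]
        yield train, test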

Any other comments?

Question:

  1. Should we have a separate max_train_size parameter for groups vs. the actual data?
  2. Should we have a separate gap parameter for groups vs. the actual data?

@jnothman
Member

@getgaurav2, tests may be failing. Let us know when you want review.

@jnothman
Member

Linting is failing, actually.

@getgaurav2
Contributor Author

@getgaurav2, tests may be failing. Let us know when you want review.

Sure @jnothman. Will do. Thank you.

@getgaurav2
Contributor Author

getgaurav2 commented Mar 3, 2020

@jnothman @ogrisel Could you please opine on the approach I have taken for this feature?
I can work on performance metrics, documentation, etc. if you are OK with the general direction.
Thanks

@getgaurav2
Contributor Author

@jnothman @ogrisel Could you please opine on the approach I have taken for this feature? ...

@getgaurav2 - gentle reminder @ogrisel


@getgaurav2
Contributor Author

...

@ogrisel @jnothman - can you please review this PR? Thank you


@albertvillanova
Contributor

@getgaurav2 are you still working on this PR?

@getgaurav2
Contributor Author

@getgaurav2 are you still working on this PR?

@albertvillanova - yes, I would like to finish this, please. I have been busy at work lately. I will find some time this week to make some progress. Thanks


@albertvillanova
Contributor

@getgaurav2 Great! Because I do really need this feature in the next release ;)

@PimwipaV

PimwipaV commented Sep 5, 2020

Hi @getgaurav2, I tried pushing here but I got an error: remote: Permission to getgaurav2/scikit-learn.git denied to Pimpwhippa. I think my git remote -v settings are correct. Could you please check whether you allow edits to this PR? Thank you.

@albertvillanova
Contributor

@getgaurav2, as you have seen, it is not a good idea to rebase onto master during a Pull Request. If you would like to sync branches, it is better to merge:

git merge upstream/master

Anyway, it does not seem necessary to do so: see the GitHub message "This branch has no conflicts with the base branch."

@getgaurav2
Contributor Author

getgaurav2 commented Sep 6, 2020

Hi @getgaurav2, I tried pushing here but I got an error: remote: Permission to getgaurav2/scikit-learn.git denied to Pimpwhippa. I think my git remote -v settings are correct. Could you please check whether you allow edits to this PR? Thank you.

[Screenshot attached: Screen Shot 2020-09-06 at 5.33.17 PM]

Let's see if this helps. Please add test cases into test_split.py

@getgaurav2
Contributor Author

@getgaurav2 Great! Because I do really need this feature in the next release ;)

@albertvillanova - would you want to check the latest code base?
I want to talk through the best way to handle cases where the function may yield empty training splits.


@PimwipaV

PimwipaV commented Sep 7, 2020

@getgaurav2 thank you. I just opened a PR.

@albertvillanova
Contributor

Quick check: why the empty file sklearn/model_selection/conftest.py?

@albertvillanova
Contributor

albertvillanova commented Sep 8, 2020

I would like to point out an issue in your implementation. I have created a new test case:

groups = np.array(['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd'])
gtss = GroupTimeSeriesSplit(n_splits=3)

and this is what I get:

In [49]: for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...:     print(train_idx, test_idx)
    ...:     print(groups[train_idx], groups[test_idx])
    ...:     print()
    ...:
[0 1 2 3 4 5] [6 7 8 9]
['a' 'a' 'a' 'a' 'a' 'a'] ['b' 'b' 'b' 'b']

[0 1 2 3 4 5] [10 11 12 13]
['a' 'a' 'a' 'a' 'a' 'a'] ['b' 'c' 'c' 'c']

[ 0  1  2  3  4  5  6  7  8  9 10] [14 15 16 17]
['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b'] ['c' 'd' 'd' 'd']

As you can see, there are gaps between the train and the test indices, i.e. they are not contiguous.

I would expect the following result:

In [49]: for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...:     print(train_idx, test_idx)
    ...:     print(groups[train_idx], groups[test_idx])
    ...:     print()
    ...:
[0 1 2 3 4 5] [6 7 8 9 10]
['a' 'a' 'a' 'a' 'a' 'a'] ['b' 'b' 'b' 'b' 'b']

[0 1 2 3 4 5 6 7 8 9 10] [11 12 13 14]
['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b'] ['c' 'c' 'c' 'c']

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] [15 16 17]
['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'] ['d' 'd' 'd']

By the way, concerning the gaps: if you look at the upcoming v0.24 release of TimeSeriesSplit, there are two new parameters, test_size and gap. Maybe these should also be implemented in GroupTimeSeriesSplit.
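
For reference, the test_size and gap parameters are sample-level in the released TimeSeriesSplit (0.24+); a small example of their behavior, which a group-aware splitter would presumably mirror at the group level:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(18).reshape(-1, 1)
# gap=2 skips two samples between the end of each training set and the
# start of its test set; test_size=3 fixes the length of every test fold
tscv = TimeSeriesSplit(n_splits=3, test_size=3, gap=2)
for train_idx, test_idx in tscv.split(X):
    print(train_idx, test_idx)
# [0 1 2 3 4 5 6] [ 9 10 11]
# [0 1 2 3 4 5 6 7 8 9] [12 13 14]
# [ 0  1  2  3  4  5  6  7  8  9 10 11 12] [15 16 17]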

@PimwipaV

PimwipaV commented Sep 8, 2020

'In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate.'
You have already made the class description clearer, but does that mean I can still shuffle the groups, just not the time, right?

@PimwipaV

PimwipaV commented Sep 8, 2020

I mean, this is allowed:
groups = ['B', 'D', 'D', 'C', 'C', 'A', 'B', 'A']
and not only this:
groups = ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D']
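
For illustration, assumption 1 in the PR description ("groups are contiguous") can be checked mechanically; a minimal sketch, with a hypothetical helper name:

import numpy as np

def groups_are_contiguous(groups):
    # each group label may start a contiguous block only once
    groups = np.asarray(groups)
    block_starts = np.r_[0, np.flatnonzero(groups[1:] != groups[:-1]) + 1]
    block_labels = groups[block_starts]
    return len(block_labels) == len(set(block_labels))

print(groups_are_contiguous(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D']))  # True
print(groups_are_contiguous(['B', 'D', 'D', 'C', 'C', 'A', 'B', 'A']))  # False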

@getgaurav2
Contributor Author

I would like to point out an issue in your implementation. ...

@albertvillanova - Thanks for your review. I have incorporated your feedback in the latest commit. Can you please check again?

@albertvillanova
Contributor

@getgaurav2, thanks for your contribution. Your implementation now gives the expected result in the case I pointed out above.

However, when using the new parameters (gap, test_size), I think I have found some unexpected behavior (commented below); besides, they are not documented in the docstring. Moreover, I think there are some peculiarities when dealing with groups that deserve a discussion to decide the expected behavior, the definition of the corresponding API and its implementation. For example, I am thinking of test_size, which in TimeSeriesSplit corresponds to the number of samples in the test set; when dealing with groups, however, it might be more useful to specify the number of groups in the test set instead.

I would suggest setting these parameters aside for the moment, so that this PR can be finished. You could then open a new PR to implement them. What do you think?

The unexpected behavior for the parameter gap. Current behavior:

In [10]: gtss = GroupTimeSeriesSplit(n_splits=2, test_size=2, gap=1)

In [11]: for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...:     print(train_idx, test_idx)
    ...:     print(groups[train_idx], groups[test_idx])
    ...:     print()
    ...:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] [11, 12]
['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b'] ['c' 'c']

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] [15, 16]
['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'] ['d' 'd']

As you can see, there is no gap between training and test sets.

The unexpected behavior for the parameter test_size. Current behavior:

In [8]: gtss = GroupTimeSeriesSplit(n_splits=2, test_size=5)

In [9]: for train_idx, test_idx in gtss.split(groups, groups=groups):
   ...:     print(train_idx, test_idx)
   ...:     print(groups[train_idx], groups[test_idx])
   ...:     print()
   ...:
scikit-learn\sklearn\model_selection\_split.py:2375: UserWarning: The size=4 of group=c is smaller than test_size=5.
  warnings.warn(("The size=%d of group=%s is smaller"
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] [11, 12, 13, 14]
['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b'] ['c' 'c' 'c' 'c']

scikit-learn\sklearn\model_selection\_split.py:2375: UserWarning: The size=3 of group=d is smaller than test_size=5.
  warnings.warn(("The size=%d of group=%s is smaller"
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] [15, 16, 17]
['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'] ['d' 'd' 'd']

In this case, I think the result should be:

[0 1 2 3 4 5] [6 7 8 9 10]
['a' 'a' 'a' 'a' 'a' 'a'] ['b' 'b' 'b' 'b' 'b']

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] [11, 12, 13, 14, 15]
['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b'] ['c' 'c' 'c' 'c', 'd']

@albertvillanova
Contributor

Maybe I would also add the test case I pointed out above:

In [47]: groups = np.array(['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd'])
In [48]: gtss = GroupTimeSeriesSplit(n_splits=3)

In [49]: for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...:     print(train_idx, test_idx)
    ...:     print(groups[train_idx], groups[test_idx])
    ...:     print()
    ...:
[0 1 2 3 4 5] [6 7 8 9 10]
['a' 'a' 'a' 'a' 'a' 'a'] ['b' 'b' 'b' 'b' 'b']

[0 1 2 3 4 5 6 7 8 9 10] [11 12 13 14]
['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b'] ['c' 'c' 'c' 'c']

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] [15 16 17]
['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'] ['d' 'd' 'd']
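
A hedged sketch of how such a test might look in sklearn/model_selection/tests/test_split.py (the test name is hypothetical, and GroupTimeSeriesSplit exists only on this PR's branch):

import numpy as np
from numpy.testing import assert_array_equal
# available only on this PR's branch, not in released scikit-learn
from sklearn.model_selection import GroupTimeSeriesSplit

def test_group_time_series_split_contiguous_groups():
    groups = np.array(['a'] * 6 + ['b'] * 5 + ['c'] * 4 + ['d'] * 3)
    gtss = GroupTimeSeriesSplit(n_splits=3)
    expected = [
        (np.arange(0, 6), np.arange(6, 11)),
        (np.arange(0, 11), np.arange(11, 15)),
        (np.arange(0, 15), np.arange(15, 18)),
    ]
    splits = list(gtss.split(groups, groups=groups))
    assert len(splits) == len(expected)
    for (train, test), (exp_train, exp_test) in zip(splits, expected):
        assert_array_equal(train, exp_train)
        assert_array_equal(test, exp_test)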

@getgaurav2
Contributor Author

getgaurav2 commented Oct 5, 2020

@getgaurav2, thanks for your contribution. Your implementation now gives the expected result in the case I pointed out above. ...

@albertvillanova - Thanks for your feedback. Can you please review the latest commit?

  • For test_size = 5, the expected output that you have put forward is certainly more desirable. It will be interesting to determine how the 2nd split should behave when there is a conflict between gap and test_size, for example:
groups = np.array(['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'd'])
gtss = GroupTimeSeriesSplit(n_splits=2, test_size=5)

> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] [11, 12, 13, 14, 15]
> ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b'] ['c' 'c' 'c' 'c', 'd']
OR 
> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] [15, 16, 17, 18, 19]
> ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b'] ['d' 'd' 'd' 'd', 'd']
  • Would you recommend that I open another PR here, or raise a new feature request, to keep things separate?

@albertvillanova
Contributor

@getgaurav2, as you have already done an amazing amount of work here (thank you), my suggestion would be to leave the implementation of the test_size and gap parameters for another PR. That way, you would be able to finish this PR now.

Just a minor addition: could you please add GroupTimeSeriesSplit to the test test_yields_constant_splits?

And please add a whatsnew entry to the file v0.24.rst, in the model_selection section, as a |MajorFeature|, and reference this PR along with your name and link. You can take the other entries as an example.

@getgaurav2
Contributor Author

@getgaurav2, as you have already done an amazing amount of work here (thank you), my suggestion would be to leave the implementation of the test_size and gap parameters for another PR. ...

@albertvillanova - Thank you. Can you please check now?

@albertvillanova
Contributor

albertvillanova commented Oct 7, 2020

@getgaurav2 thanks. I think your PR is ready for a review round. Could you please change its name from WIP to MRG?

getgaurav2 requested a review from jnothman on July 20, 2023, 15:21
@adrinjalali
Member

@glemaitre this seems like a good candidate for a "time series related" "project board".

@sluofoss

sluofoss commented Sep 26, 2024

Hi, can I check on the status of this PR?

Would I be correct that this PR would also close issue #6322 and the related old PR #6351 (HeteroTSCV)?

Also, I feel that a gap functionality similar to timegapsplit (#13761) and the current gap parameter in https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html would be useful when it comes to chronological groups.

@adrinjalali
Member

@lu0x1a0 this is up for grabs. Now that we have metadata routing, this can nicely move forward if you're up for it.

@getgaurav2
Contributor Author

getgaurav2 commented Sep 26, 2024

@adrinjalali, @lu0x1a0 -- I would love to continue working on this and see it through to completion, if that's okay.

@adrinjalali
Member

@getgaurav2 that'd be fantastic!

@getgaurav2
Contributor Author

getgaurav2 commented Jan 21, 2025

@getgaurav2 that'd be fantastic!

@adrinjalali - would you please be able to check the issue in the doc build and give some direction? Thank you!

@adrinjalali
Member

Something has probably gone wrong here in your merge with main. You have over 12k lines changed, and some large changes in the rst files. I'd start by fixing those issues if I were you.


Successfully merging this pull request may close these issues.

Feature request: Group aware Time-based cross validation