FEA Group aware Time-based cross validation #16236
Conversation
@getgaurav2, tests may be failing. Let us know when you want review.
Linting is failing, actually.
Sure @jnothman. Will do. Thank you.
@getgaurav2 - gentle reminder @ogrisel Sent with GitHawk
@getgaurav2 are you still working on this PR?
@albertvillanova - yes, I would like to finish this, please. I have been busy at work lately. Will find some time this week to make some progress. Thanks. Sent with GitHawk
@getgaurav2 Great! Because I do really need this feature in the next release ;)
hi @getgaurav2, I tried pushing here but I got an error: remote: Permission to getgaurav2/scikit-learn.git denied to Pimpwhippa. My git remote -v settings are correct, I think. Could you please check whether you allow edits to this PR? Thank you.
@getgaurav2, as you have seen, it is not a good idea to rebase onto master during a Pull Request. If you would like to sync branches, it is better to make a merge.
Anyway, it does not seem necessary to do it: see the GitHub message, "This branch has no conflicts with the base branch."
Let's see if this helps. Please add test cases into test_split.py
@albertvillanova - would you want to check the latest code base? Sent with GitHawk
@getgaurav2 thank you. I just opened a PR.
Fast check: why this empty file? |
I would like to point out an issue in your implementation. I have created a new test case:
and this is what I get:
As you can see, there are gaps between the train and the test indices, i.e. they are not contiguous. I would expect the following result:
By the way, and concerning the gaps, if you look at the next v0.24 release version of TimeSeriesSplit, its documentation states:
"In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate."
I mean, this is allowed.
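To make the contiguity expectation above concrete, here is a minimal made-up illustration (these index arrays are not the reviewer's actual example):

import numpy as np

train = np.array([0, 1, 2, 3])
test = np.array([4, 5, 6])
assert np.all(np.diff(train) == 1)    # train block is contiguous
assert np.all(np.diff(test) == 1)     # test block is contiguous
assert test.min() == train.max() + 1  # no gap between train and test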
@albertvillanova - Thanks for your review. I have incorporated your feedback in the latest commit. Can you please check again?
@getgaurav2, thanks for your contribution. Your implementation now gives the expected result in the case I pointed out above. However, when using the new parameters, I have found some unexpected behavior. Maybe, I would suggest forgetting about these parameters for the moment, so that this PR can already be finished. You could eventually open a new PR to implement those parameters. What do you think?
The unexpected behavior for one of the parameters:
As you can see, there is no gap between training and test sets. The unexpected behavior for the other parameter:
In this case, I think the result should be:
Maybe I would also add the test case I pointed out above.
@albertvillanova - Thanks for your feedback. Can you please review the latest commit?
@getgaurav2, as you have already done an amazing amount of work here (thank you), my suggestion would be to leave the implementation of those parameters for another PR. Just a minor addition: could you please also add a whatsnew entry to the changelog file?
@albertvillanova - Thank you. Can you please check now?
@getgaurav2 thanks. I think your PR is ready for a review round. Could you please change its name from WIP to MRG?
Thanks @getgaurav2 for your patience.
Those two comments should fix the documentation error and rendering.
@ogrisel - Do the attached plots match what you have in mind?
Added a test case with imbalanced groups. Imbalanced splits still pass the test as of now. I can change that if we agree on a function that will give the "closest group-aligned split" (comment).
The tests are currently quite hard to read. Tests need to be easy to read, so that we can see they are testing what we believe to be true.
sklearn/model_selection/_split.py
for idx in np.arange(n_samples):
    if groups[idx] in group_dict:
        if idx - group_dict[groups[idx]][-1] == 1:
            group_dict[groups[idx]].append(idx)
we should not need to store full lists... I think this is sufficient:
seen_groups = set()
prev_group = None
for group in groups:
    if group != prev_group and group in seen_groups:
        raise ValueError("The groups should be contiguous")
    seen_groups.add(group)  # remember groups already seen so a reappearance is detected
    prev_group = group
Alternatively, we could do:
reordered_unique_groups, indices, inverse = np.unique(groups, return_index=True, return_inverse=True)
# note: this check also assumes the groups appear in sorted order
if (np.diff(inverse) < 0).any():
    raise ValueError("The groups should be contiguous")
although this is much more obfuscated!
Used the first variant with enumerate to add the index in the error msg.
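For reference, a sketch of what that first variant with enumerate might look like (the helper name and message wording are illustrative, not the PR's exact code):

def _check_groups_contiguous(groups):
    # Raise if a group value reappears after a different group has started,
    # reporting the offending index in the error message.
    seen_groups = set()
    prev_group = None
    for idx, group in enumerate(groups):
        if group != prev_group:
            if group in seen_groups:
                raise ValueError(
                    f"The groups should be contiguous; group {group!r} "
                    f"reappears at index {idx}."
                )
            seen_groups.add(group)
            prev_group = group


# e.g. _check_groups_contiguous(["a", "a", "b", "a"]) raises, while
# _check_groups_contiguous(["a", "a", "b", "b"]) passes.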
assert_array_equal(groups[test], ["d", "d", "d"])


def test_group_time_series_non_overlap_group_2():
What is the benefit of having each of these test cases? If they test distinct capabilities, consider using pytest.parametrize and expressing your test in a more generalised way that tests the invariances of interest.
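As an illustration of that suggestion, a parametrized invariant test could look roughly like the sketch below (it assumes the splitter introduced by this PR is importable as GroupTimeSeriesSplit; the data and invariants are illustrative):

import numpy as np
import pytest

from sklearn.model_selection import GroupTimeSeriesSplit  # name assumed from this PR


@pytest.mark.parametrize("n_splits", [2, 3])
@pytest.mark.parametrize(
    "groups",
    [
        np.array(["a", "a", "b", "b", "c", "c", "d", "d"]),
        np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4]),
    ],
)
def test_group_time_series_invariants(groups, n_splits):
    X = np.zeros((len(groups), 1))
    cv = GroupTimeSeriesSplit(n_splits=n_splits)
    for train, test in cv.split(X, groups=groups):
        # every test index comes after every train index
        assert train.max() < test.min()
        # indices on each side form a contiguous block
        assert np.all(np.diff(train) == 1)
        assert np.all(np.diff(test) == 1)
        # no group leaks from train into test
        assert set(groups[train]).isdisjoint(groups[test])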
I have renamed the test cases to convey the purpose of each. Also, I removed one of the tests that was somewhat repetitive. Please let me know if you think it needs further clarification.
@glemaitre this seems like a good candidate for a "time series related" "project board".
Hi, can I check what the status of this PR is? Would I be correct that this PR would also close issue #6322 and the related old PR #6351? Also, I feel like a gap functionality similar to timegapsplit (#13761) would fit with the current proposal.
@lu0x1a0 this is up for grabs. Now that we have metadata routing, this can nicely move forward if you're up for it.
@adrinjalali, @lu0x1a0 -- I would love to continue working on it and see this one through to completion, if that's okay.
@getgaurav2 that'd be fantastic!
@adrinjalali - would you please be able to check the doc build issue and give some direction? Thank you!
Something has gone wrong here, probably in your merge with main.
@adrinjalali - sorry for the long gap. Would you please be able to review now?
@adrinjalali - Would you please be able to share a link if this is public? Thank you!
train,
np.array(
    [
        0,
in all these cases, we don't have to write out each individual case; you can use functions such as range, etc., to make these much smaller diffs.
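For instance (illustrative values only), the expected index arrays can be built with np.arange instead of spelling out every element:

import numpy as np

# instead of np.array([0, 1, 2, ..., 11]) written out element by element:
expected_train = np.arange(12)       # 0, 1, ..., 11
expected_test = np.arange(12, 18)    # 12, 13, ..., 17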
Sure, fixed now.
# `np.unique` will reorder the groups. We need to keep the original
# ordering.
reordered_unique_groups, indices = np.unique(groups, return_index=True)
unique_groups = reordered_unique_groups[np.argsort(indices)]
why do we care about the order? groups doesn't have any ordered semantics.
why do we care about the order? groups doesn't have any ordered semantics.

This is done to preserve the sequence in which the group labels appear in the dataset. The split logic will cut the list in the same order as the group labels appear in the dataset.
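A small self-contained example of the reordering being discussed (values are illustrative):

import numpy as np

groups = np.array(["b", "b", "a", "a", "c"])
# np.unique sorts lexicographically: ['a', 'b', 'c'], with
# first-occurrence indices [2, 0, 4].
reordered_unique_groups, indices = np.unique(groups, return_index=True)
# Sorting by first occurrence restores appearance order.
unique_groups = reordered_unique_groups[np.argsort(indices)]
print(unique_groups)  # ['b' 'a' 'c']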
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Fixes #14257
1. Assumes that groups are contiguous.
2. Splits the groups into train and test indices using TimeSeriesSplit.
3. Uses these group indices to get the indices for the original data.
4. Uses the max_train_size parameter to trim the train_array for each iteration of the split (a sketch of this strategy follows the list).
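A rough sketch of the strategy in steps 1-4 (an illustration under the stated assumptions, not the PR's actual implementation; the generator name is made up):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def group_time_series_split(groups, n_splits=5, max_train_size=None):
    # Step 1: groups are assumed to be contiguous.
    groups = np.asarray(groups)
    _, first_idx = np.unique(groups, return_index=True)
    # Unique groups in order of first appearance.
    unique_groups = groups[np.sort(first_idx)]
    # Step 2: split the *groups* with TimeSeriesSplit.
    splitter = TimeSeriesSplit(n_splits=n_splits)
    for group_train, group_test in splitter.split(unique_groups):
        # Step 3: map group indices back to sample indices.
        train = np.flatnonzero(np.isin(groups, unique_groups[group_train]))
        test = np.flatnonzero(np.isin(groups, unique_groups[group_test]))
        # Step 4: trim the training window with max_train_size.
        if max_train_size is not None and len(train) > max_train_size:
            train = train[-max_train_size:]
        yield train, test


# e.g. list(group_time_series_split(["a", "a", "b", "b", "c", "c", "d"], n_splits=3))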
Any other comments?
Question: