[MRG+2?] Add homogeneous time series cross validation #6586
Rename to |
Hi @jnothman, are you serious? |
I'm serious in that that's how I might name it in hindsight, especially since we now have Note, however, that there's also the further inconsistency with the Ordinarily, @GaelVaroquaux righteously champions a conservative position on such change, so it's useful to have his input on whether that piggybacking is at all worth considering.
Any update on this? @GaelVaroquaux, would you please share your opinion?
Sorry to interrupt, @jnothman and @MechCoder. The problem that remains now is that
n_splits : int
Returns the number of splitting iterations in the cross-validator.
"""
return self.n_folds-1
space between operator and args
See #7169, which I should have left to you to post, @yenchenlin ! |
@jnothman thanks! 😄 |
Please add to |
Provides train/test indices to split time series data in train/test sets.

This cross-validation object is a variation of KFold.
Might as well use a :class: link
I wonder whether there's any benefit in allowing the user to sub-sample from the training set so that its size is not variable. You could imagine not wanting to choose model parameters that maximise the mean score, but a mean weighted to those with a larger sample of training data: if the amount of training data has an effect, cross-validation scores may not be normally distributed. Allowing the option to sub-sample to a fixed training set size does not fix this, though. It may be worth leaving a note for the user that due to learning curve effects the mean over CV scores might not be as informative as otherwise... |
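The sub-sampling idea above can be sketched roughly as follows. This is only an illustration under assumptions: `fix_train_size` is a hypothetical helper, not part of this PR, and it shows two possible ways of reducing an expanding training set to a fixed size (keeping the most recent samples, or drawing a random sub-sample that is returned in time order).

```python
import random


def fix_train_size(train, size, rng=None):
    """Reduce a list of training indices to at most `size` entries.

    Hypothetical helper for illustration only.
    With rng=None the most recent `size` indices are kept;
    otherwise `size` indices are drawn at random and returned in time order.
    """
    if len(train) <= size:
        return list(train)
    if rng is None:
        return list(train[-size:])
    return sorted(rng.sample(list(train), size))


# Expanding training sets of size 3, 6 and 9, as forward-chaining CV produces.
trains = [list(range(n)) for n in (3, 6, 9)]
print([fix_train_size(tr, 3) for tr in trains])
# [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

As the comment notes, fixing the training set size does not by itself make the per-split scores comparable, so this is a sketch of the option being discussed rather than a recommendation.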
This needs some narrative documentation (i.e. content in |
Test failures pending rebase on #7187, I hope. |
Ideally, this PR would also be supported by an example. Do we know of / have datasets where this would be appropriate? |
class HomogeneousTimeSeriesCV(_BaseKFold):
"""Homogeneous Time Series cross-validator

Provides train/test indices to split time series data in train/test sets.
Please clarify to the user that we assume samples from later times have higher indices (and thus shuffling in CV is inappropriate).
done
Before this is merged, I think that it is very important that there is a section / subsection added in the model selection documentation on time series data. It doesn't have to be long, but it should point out the particularities of cross-validation on time series, and point to this object. |
I have to admit that I'm not familiar with time-series data, and the only time-series data I've played with is the private property of a startup. I remember that @MechCoder has some experience with this; would you please suggest some toy dataset?
You don't really need to be familiar with time-series data explicitly, as long as you understand why something like KFold is problematic and why this solution is better. "A note on shuffling" already introduces problems related to non-iid data. Similarly describe the time series case where samples are not identically distributed, rather the distribution changes as a function of the dataset ordering. |
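To make the point about KFold concrete, here is a minimal sketch in plain Python. It assumes samples are ordered in time (later samples have higher indices). `kfold_indices` is a simplified, unshuffled k-fold written for illustration and is not scikit-learn's implementation; `trains_on_future` is a hypothetical helper name.

```python
def kfold_indices(n_samples, n_folds):
    """Simplified unshuffled k-fold (illustration only).

    Each fold is the test set once; every other sample, whether earlier
    or later in time, goes into the training set.
    """
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # The last fold takes any leftover samples.
        stop = n_samples if k == n_folds - 1 else start + fold_size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test


def trains_on_future(train, test):
    """True if some training sample is later than the earliest test sample."""
    return max(train) > min(test)


leaks = [trains_on_future(tr, te) for tr, te in kfold_indices(8, 4)]
print(leaks)  # [True, True, True, False]
```

Only the last fold is evaluated on data that is strictly later than everything the model was trained on; a time series splitter makes that true for every split.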
I think we can merge after those minor changes, and a New Feature entry in whats_new. We should also create an issue quoting @amueller's suggestion of a restructure of the docs. You are welcome to do this. |
The first fold has size
``n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``,
and the other folds all have size ``n_samples // (n_splits + 1)``.
The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``
You've used a tab character here. And gone over 79 characters. You can break lines within double-backticks if necessary.
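For readers checking the sizes in that docstring, here is a small sketch of the split arithmetic in plain Python. It is an illustration, not the PR's code, and `time_series_split` is a hypothetical name. It reads the training-set formula as `i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1)`, which is what the fold sizes stated in the docstring imply. Evaluated literally with Python's precedence, `i * n_samples // (n_splits + 1)` computes `(i * n_samples) // (n_splits + 1)` and can differ: with `n_samples=10`, `n_splits=3`, `i=2` it gives 7 rather than 6.

```python
def time_series_split(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each forward-chaining split.

    Illustrative helper only; assumes later samples have higher indices.
    """
    n_folds = n_splits + 1
    if n_folds > n_samples:
        raise ValueError("need at least n_splits + 1 samples")
    fold_size = n_samples // n_folds
    remainder = n_samples % n_folds
    for i in range(1, n_splits + 1):
        # The first fold absorbs the remainder, so the i-th test set
        # starts after i folds plus the remainder.
        test_start = i * fold_size + remainder
        train = list(range(test_start))
        test = list(range(test_start, test_start + fold_size))
        yield train, test


sizes = [(len(tr), len(te)) for tr, te in time_series_split(10, 3)]
print(sizes)  # [(4, 2), (6, 2), (8, 2)]
```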
@@ -675,8 +675,9 @@ class TimeSeriesCV(_BaseKFold):

Notes
-----
The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``
in the ``i``th split, with a test set of size ``n_samples//(n_splits + 1)``,
The training set has size ``i * n_samples // (n_splits + 1)
still a tab character
@yenchenlin , can I suggest you install a pep8 checker? |
@jnothman I feel embarrassed, I do have one ... |
I had no intention to embarrass you. Here's a further hint:
use |
Hurrah! 🍻 |
Ugh. forgot to add to classes.rst |
Now that this is merged, I've realised that it's a bit unconventional for us to have the suffix |
True. I'd say |
Wanna add to classes and do the rename? |
Thanks a lot @yenchenlin . Yes, I agree that it should be |
@yenchenlin are you able to make a patch for these afterthoughts? |
@jnothman sure! |
I'm not sure I understand what HeterogeneousTimeSeries means. It was mentioned here and in #6322. I think it means that the time stamps are not equally spaced.
I've implemented what I'm calling MultiTimeSeriesSplit for the case of multiple time-series with different time stamps present in different time-series (or, said differently, with missing data). I compute splits by calculating equally spaced quantiles of the distribution of time stamps.
Here's an illustration for a small dataset. There are four time-series (four countries), and not all countries have data in every year (e.g., it's common to have countries be "born" in the middle of a dataset).
Would this be of interest?
@yenchenlin what's the status of HeterogeneousTimeSeriesSplit (or whatever is the latest name)?
Update (March 2019): The MultiTimeSeriesSplitter I wrote is here. It ended up using quantiles, computed with this function, to decide where to make splits, with the option of making the first split larger (e.g., "I want the first fold to be the earliest 60% of the data, the second fold to be the earliest 70% of the data, etc."). This class has the possibly undesirable behavior of putting some records at time t in one fold and other records at time t in another fold. The benefit, however, was that the numbers of records in the folds are more similar. The code got a bit complicated and more tied to my use case than I would have liked.
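A rough sketch of the quantile idea, for discussion of what an API might look like. This is not the code referred to above: the function names, the `(series_id, timestamp)` record format, and the country/year data are all made up for illustration. Unlike the later version described in the update, this variant keeps all records that share a time stamp on the same side of a split, at the cost of less even fold sizes.

```python
def quantile_cut_points(timestamps, n_splits):
    """Time stamps at equally spaced (nearest-rank) quantiles of the pooled data."""
    ordered = sorted(timestamps)
    n = len(ordered)
    cuts = set()
    for i in range(1, n_splits + 1):
        rank = -(-i * n // (n_splits + 1))  # ceil(i * n / (n_splits + 1))
        cuts.add(ordered[rank - 1])
    # Duplicate cut points collapse, so fewer splits may be produced.
    return sorted(cuts)


def multi_series_splits(records, n_splits):
    """Yield (train_idx, test_idx) for records given as (series_id, timestamp)."""
    times = [t for _, t in records]
    bounds = quantile_cut_points(times, n_splits) + [max(times)]
    for lo, hi in zip(bounds, bounds[1:]):
        train = [i for i, t in enumerate(times) if t <= lo]
        test = [i for i, t in enumerate(times) if lo < t <= hi]
        if test:  # skip empty test sets
            yield train, test


# Four made-up series; not every series has data in every year.
coverage = {
    "A": range(2000, 2006),
    "B": range(2002, 2006),
    "C": range(2000, 2004),
    "D": range(2004, 2006),
}
records = [(c, y) for c, years in coverage.items() for y in years]
print([(len(tr), len(te)) for tr, te in multi_series_splits(records, 3)])
# [(4, 6), (10, 3), (13, 3)]
```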
This doesn't sound far from Hetero TSS. I think if we had an idea of the
API, i.e. how this information is specified and how general this approach
is, there might be some interest.
…On 1 December 2016 at 06:53, Charlie Brummitt ***@***.***> wrote:
I'm not sure I understand what HeterogeneousTimeSeries means. It was
mentioned here and in #6322
<#6322>. I think it
means that the time stamps are not equally spaced.
I've implemented what I'm calling MultiTimeSeriesSplit for the case of *multiple
time-series with different time stamps present in different time-series*
(or, said differently, *with missing data*). I compute splits by
calculating equally spaced quantiles of the distribution of time stamps.
Here's an illustration for a small dataset. There are four time-series
(four countries), and not all countries have data in every year (e.g., it's
common to have countries be "born" in the middle of a dataset).
[image: image]
<https://cloud.githubusercontent.com/assets/1268375/20768337/8a29d242-b70b-11e6-970d-720773b2328e.png>
Would this be of interest?
@yenchenlin <https://github.com/yenchenlin> what's the status of
HeterogeneousTimeSeriesSplit (or whatever is the latest name)?
This is a separate PR split off from #6351 that tries to solve #6322.
It implements homogeneous time series cross validation and adds the corresponding test.
However, there's a discussion on the meaning of n_folds here; could reviewers please give me some suggestions?
ping @MechCoder @rvraghav93 @amueller @jnothman 🙏