[MRG+2?] Add homogeneous time series cross validation #6586

Merged

yenchenlin
Contributor

This PR is split out from #6351, which tries to solve #6322.
It implements homogeneous time series cross-validation and adds the corresponding test.

However, there is an ongoing discussion here about the meaning of n_folds. Could reviewers please give me some suggestions?

ping @MechCoder @rvraghav93 @amueller @jnothman 🙏

@yenchenlin yenchenlin force-pushed the add-homogeneous-time-series-cv branch 2 times, most recently from 4b54fac to 045b4b5 Compare March 24, 2016 05:52
@jnothman
Member

Rename to n_splits across the board? 😛

@yenchenlin
Contributor Author

Hi @jnothman, are you serious?
For me, it sounds like a good solution 😄

@yenchenlin yenchenlin force-pushed the add-homogeneous-time-series-cv branch from 045b4b5 to fe49fa7 Compare March 24, 2016 06:18
@jnothman
Member

I'm serious in that that's how I might name it in hindsight, especially since we now have get_n_splits. Such change and deprecation should generally be avoided so as not to force users to change. However, if we were to make that change, now would be the time, given that users of this part of the codebase need to update their code anyway (sklearn.model_selection doesn't exist in master).

Note however, that there's also the further inconsistency with the ShuffleSplit family's use of n_iter for the same notion as n_folds. Indeed, for a consistent and unambiguous interface, [get_]n_iter may be preferable to [get_]n_splits. Unless there is wholesale agreement among devs that we should piggyback further consistency improvements on the model_selection change, I think the choice of parameter name should be limited to the new splitter.

Ordinarily, @GaelVaroquaux righteously champions a conservative position on such change, so it's useful to have his input on whether that piggybacking is at all worth considering.

@yenchenlin yenchenlin force-pushed the add-homogeneous-time-series-cv branch from fe49fa7 to 61c5a15 Compare March 24, 2016 06:44
@yenchenlin
Contributor Author

Any update on this?

@GaelVaroquaux, would you please share your opinion?
Thanks!

@yenchenlin yenchenlin force-pushed the add-homogeneous-time-series-cv branch from 61c5a15 to 1348e4a Compare August 10, 2016 00:50
@yenchenlin
Contributor Author

yenchenlin commented Aug 10, 2016

Sorry to interrupt, @jnothman and @MechCoder .
I think the functionality of time series cross validation is done.

The remaining problem is that get_n_splits() will return self.n_folds - 1, which may be confusing since users are used to reading self.n_folds as the number of test folds.
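To make the off-by-one concrete, here is a minimal sketch in plain Python. It is not the PR's code: `forward_chaining_splits` is a hypothetical helper, and the choice to let the first chunk absorb the remainder is an assumption. The indices are cut into `n_folds` contiguous chunks, and because the first chunk can only ever serve as training data, `n_folds` chunks yield `n_folds - 1` train/test splits.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train, test) index lists. Hypothetical helper, for illustration only."""
    fold_size = n_samples // n_folds
    if fold_size == 0:
        raise ValueError("n_folds cannot exceed n_samples")
    remainder = n_samples % n_folds
    # Assumption: the first chunk absorbs the remainder and is never a test set.
    first_test_start = fold_size + remainder
    for start in range(first_test_start, n_samples, fold_size):
        train = list(range(0, start))                  # everything before the test chunk
        test = list(range(start, start + fold_size))   # the next contiguous chunk
        yield train, test


splits = list(forward_chaining_splits(n_samples=10, n_folds=4))
for train, test in splits:
    print(train, test)
```

With 10 samples and 4 chunks, the loop produces 3 splits, which is the count that `get_n_splits()` would have to report.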

n_splits : int
Returns the number of splitting iterations in the cross-validator.
"""
return self.n_folds-1
Member

space between operator and args

@jnothman
Member

See #7169, which I should have left to you to post, @yenchenlin !

@yenchenlin
Contributor Author

@jnothman thanks! 😄

@jnothman
Member

Please add to classes.rst


Provides train/test indices to split time series data in train/test sets.

This cross-validation object is a variation of KFold.
Member

Might as well use a :class: link

@jnothman
Member

I wonder whether there's any benefit in allowing the user to sub-sample from the training set so that its size is not variable. You could imagine not wanting to choose model parameters that maximise the mean score, but a mean weighted to those with a larger sample of training data: if the amount of training data has an effect, cross-validation scores may not be normally distributed. Allowing the option to sub-sample to a fixed training set size does not fix this, though. It may be worth leaving a note for the user that due to learning curve effects the mean over CV scores might not be as informative as otherwise...
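One way to picture the fixed-size option is the sketch below. It rests on assumptions: `fixed_train_size_splits` and its `train_size` parameter are hypothetical names, and "sub-sampling" is read here as keeping only the most recent `train_size` samples before each test chunk, which is just one of several possible strategies.

```python
def fixed_train_size_splits(n_samples, n_splits, train_size):
    """Yield (train, test) index lists with a capped training window.

    Hypothetical helper, for illustration only.
    """
    if train_size < 1:
        raise ValueError("train_size must be at least 1")
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too many splits for this number of samples")
    first_test_start = n_samples - n_splits * test_size
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        # Keep only the most recent `train_size` samples before the test chunk.
        train_start = max(0, test_start - train_size)
        train = list(range(train_start, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


windows = list(fixed_train_size_splits(n_samples=10, n_splits=3, train_size=3))
for train, test in windows:
    print(train, test)
```

Every split then trains on the same number of samples, so differences between scores are less likely to reflect learning curve effects alone. As noted above, this does not by itself make the scores identically distributed.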

@jnothman
Member

This needs some narrative documentation (i.e. content in cross_validation.rst), whose role is to guide the user to the kinds of problems involved in choosing a cross-validation strategy, and the available solutions.

@jnothman
Member

Test failures pending rebase on #7187, I hope.

@jnothman
Member

Ideally, this PR would also be supported by an example. Do we know of / have datasets where this would be appropriate?

class HomogeneousTimeSeriesCV(_BaseKFold):
"""Homogeneous Time Series cross-validator

Provides train/test indices to split time series data in train/test sets.
Member

Please clarify to the user that we assume samples from later times have higher indices (and thus shuffling in CV is inappropriate).

Contributor Author

done
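The ordering assumption means that data arriving out of order should be sorted by time before it reaches the splitter. A minimal sketch in plain Python, with made-up `timestamps` and `observations` used purely for illustration:

```python
# Hypothetical data: observations recorded out of chronological order.
timestamps = [3, 1, 4, 2]
observations = ["c", "a", "d", "b"]

# Indices that put the samples in chronological order.
order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])

# After reordering, a higher index always means a later time.
sorted_timestamps = [timestamps[i] for i in order]
sorted_observations = [observations[i] for i in order]

print(order)
print(sorted_observations)
```

Shuffling afterwards would undo this ordering, which is why it is inappropriate for this splitter.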

@GaelVaroquaux
Member

Before this is merged, I think that it is very important that there is a section / subsection added in the model selection documentation on time series data. It doesn't have to be long, but it should point out the particularities of cross-validation on time series, and point to this object.

@yenchenlin yenchenlin force-pushed the add-homogeneous-time-series-cv branch from e87babe to e0a3865 Compare August 15, 2016 13:48
@yenchenlin
Contributor Author

yenchenlin commented Aug 15, 2016

I'm temporarily pushing commits rebased on #7187 to make sure CI passes.
I'll clean up these commits once #7187 is merged.

@yenchenlin yenchenlin force-pushed the add-homogeneous-time-series-cv branch from e0a3865 to 9cdfe2f Compare August 15, 2016 14:21
@yenchenlin
Contributor Author

I have to admit that I'm not familiar with time-series data, and the only time-series data I've played with is a private property of a startup.

I remember that @MechCoder has some experience with this. Could you please suggest a toy dataset?

@jnothman
Member

> I have to admit that I'm not familiar with time-series data, and the only time-series data I've played with is a private property of a startup.

You don't really need to be familiar with time-series data explicitly, as long as you understand why something like KFold is problematic and why this solution is better. "A note on shuffling" already introduces problems related to non-iid data. Similarly describe the time series case where samples are not identically distributed, rather the distribution changes as a function of the dataset ordering.
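The problem with a KFold-style split can be shown without any real time series. The sketch below is plain Python, not scikit-learn's `KFold`: `contiguous_kfold` and `trains_on_future` are hypothetical helpers, and samples are assumed to be indexed in time order. It checks, for each fold, whether the training set contains samples that come after the start of the test set.

```python
def contiguous_kfold(n_samples, n_folds):
    """Unshuffled k-fold over contiguous chunks. Hypothetical helper.

    Assumption: any leftover samples (when n_samples is not divisible by
    n_folds) simply stay in every training set.
    """
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        test = list(range(k * fold_size, (k + 1) * fold_size))
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test


def trains_on_future(train, test):
    """True if some training sample is later than the earliest test sample."""
    return max(train) > min(test)


leaky = [trains_on_future(train, test) for train, test in contiguous_kfold(8, 4)]
print(leaky)
```

Only the last fold avoids training on the future. A forward-chaining splitter keeps every training index below every test index, so it never leaks in this way.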

@jnothman
Member

I think we can merge after those minor changes, and a New Feature entry in whats_new.

We should also create an issue quoting @amueller's suggestion of a restructure of the docs. You are welcome to do this.

The first fold has size
``n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``,
and the other folds all have size ``n_samples // (n_splits + 1)``.
The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``
Member

You've used a tab character here. And gone over 79 characters. You can break lines within double-backticks if necessary.
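As a worked check of the sizes quoted in the docstring, the sketch below evaluates them for a small case. It makes one assumption: `i * n_samples // (n_splits + 1)` is read as `i * (n_samples // (n_splits + 1))`. Python's own precedence would evaluate it as `(i * n_samples) // (n_splits + 1)`, which gives a different number. `split_sizes` is a hypothetical helper, not part of the PR.

```python
def split_sizes(n_samples, n_splits):
    """Return (train_size, test_size) for splits i = 1..n_splits.

    Hypothetical helper, for illustration only.
    """
    test_size = n_samples // (n_splits + 1)
    remainder = n_samples % (n_splits + 1)
    # Assumed reading of the docstring: i * (n_samples // (n_splits + 1)) + remainder
    return [(i * test_size + remainder, test_size) for i in range(1, n_splits + 1)]


sizes = split_sizes(n_samples=10, n_splits=3)
print(sizes)

# The two readings of the unparenthesised formula differ, e.g. for i = 2:
as_written = 2 * 10 // 4        # evaluated left to right: (2 * 10) // 4
as_intended = 2 * (10 // 4)     # i times the test-set size
print(as_written, as_intended)
```

Under this reading the last split uses 8 training and 2 test samples, which accounts for all 10 samples.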

@@ -675,8 +675,9 @@ class TimeSeriesCV(_BaseKFold):

Notes
-----
The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``
in the ``i``th split, with a test set of size ``n_samples//(n_splits + 1)``,
The training set has size ``i * n_samples // (n_splits + 1)
Member

still a tab character

@jnothman
Member

@yenchenlin , can I suggest you install a pep8 checker?
flake8 tends to work well.

@yenchenlin
Contributor Author

@jnothman I feel embarrassed, I do have one ...

@jnothman
Member

jnothman commented Aug 24, 2016

I had no intention to embarrass you. Here's a further hint:

~/repos/scikit-learn$ cat .git/hooks/pre-commit
#!/bin/sh

FILES=$(git diff --cached --name-status | grep -v ^D | awk '$1 $2 { print $2}' | grep -e .py$)
if [ -n "$FILES" ]; then
    pep8 -r $FILES
fi

Use git commit -n to bypass its warnings. (Though I'm sure that could be done without the greps, and I can't be bothered to change that now...)

@jnothman
Member

Hurrah! 🍻

@jnothman jnothman merged commit 234d256 into scikit-learn:master Aug 24, 2016
@jnothman
Member

Ugh. forgot to add to classes.rst

@jnothman
Member

Now that this is merged, I've realised that it's a bit unconventional for us to have the suffix CV on a splitter, and it has an altogether different conventional meaning. Should this be TimeSeriesSplit or TimeSeriesFold? @amueller??

@amueller
Member

True. I'd say TimeSeriesSplit

@amueller
Member

Wanna add to classes and do the rename?

@MechCoder
Member

Thanks a lot @yenchenlin . Yes, I agree that it should be TimeSeriesSplit

@jnothman
Member

@yenchenlin are you able to make a patch for these afterthoughts?

@yenchenlin
Contributor Author

@jnothman sure!

@raghavrv
Member

TimeSeriesSplitter maybe?

@cbrummitt
Contributor

cbrummitt commented Nov 30, 2016

I'm not sure I understand what HeterogeneousTimeSeries means. It was mentioned here and in #6322. I think it means that the time stamps are not equally spaced.

I've implemented what I'm calling MultiTimeSeriesSplit for the case of multiple time-series with different time stamps present in different time-series (or, said differently, with missing data). I compute splits by calculating equally spaced quantiles of the distribution of time stamps.

Here's an illustration for a small dataset. There are four time-series (four countries), and not all countries have data in every year (e.g., it's common to have countries be "born" in the middle of a dataset).

[Image: illustration of the splits for the small four-country dataset]

Would this be of interest?

@yenchenlin what's the status of HeterogeneousTimeSeriesSplit (or whatever is the latest name)?

Update (March 2019): The MultiTimeSeriesSplitter I wrote is here. It ended up using quantiles using this function to decide where to make splits, with the option of making the first split larger (e.g., "I want the first fold to be the earliest 60% of the data, the second fold to be the earliest 70% of the data, etc."). This class has the possibly undesirable behavior of putting some records at time t in one fold and other records at time t in another fold. The benefit, however, was that the numbers of records in the folds are more similar. The code got a bit complicated and more tied to my use case than I would have liked.
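The quantile idea can be sketched roughly as follows. This is not the MultiTimeSeriesSplitter code referred to above: `quantile_time_splits` is a hypothetical helper, the rank-based quantile rule is an assumption, and this variant deliberately keeps all records that share a time stamp in the same fold, so fold sizes may be less balanced than in the original.

```python
def quantile_time_splits(timestamps, n_splits):
    """Yield (train, test) record indices, cut at quantiles of the time stamps.

    Hypothetical helper, for illustration only. Records from several series
    are pooled; `timestamps[i]` is the time stamp of record i.
    """
    ordered = sorted(timestamps)
    n = len(ordered)
    # Cut times at equally spaced ranks of the pooled time-stamp distribution.
    cuts = [ordered[(k * n) // (n_splits + 1)] for k in range(1, n_splits + 1)]
    cuts.append(float("inf"))
    for low, high in zip(cuts[:-1], cuts[1:]):
        train = [i for i, t in enumerate(timestamps) if t < low]
        test = [i for i, t in enumerate(timestamps) if low <= t < high]
        yield train, test


# Made-up example: three series with different coverage, pooled into one list.
years = [2000, 2001, 2002, 2003,   # series A
         2002, 2003,               # series B, "born" in 2002
         2001, 2002, 2003]         # series C
time_splits = list(quantile_time_splits(years, n_splits=2))
for train, test in time_splits:
    print(train, test)
```

If many records share a time stamp, two cut times can coincide and a test set can come out empty. A real implementation would need to handle that case.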

@jnothman
Member

jnothman commented Nov 30, 2016 via email

@amueller
Member

amueller commented Dec 1, 2016

@jnothman also check out the discussion here: #6322


8 participants