[MRG+2?] Add homogeneous time series cross validation #6586

yenchenlin · 2016-03-24T05:46:52Z

This PR is split out from #6351 and tries to solve #6322.
It implements homogeneous time series cross validation and adds the corresponding test.

However, there's an open discussion about the meaning of n_folds here. Could reviewers please give me some suggestions?

ping @MechCoder @rvraghav93 @amueller @jnothman 🙏

jnothman · 2016-03-24T05:52:29Z

Rename to n_splits across the board? 😛

yenchenlin · 2016-03-24T06:09:23Z

Hi @jnothman, are you serious?
For me, it sounds like a good solution 😄

jnothman · 2016-03-24T06:38:14Z

I'm serious in that that's how I might name it in hindsight, especially since we now have get_n_splits. Such change and deprecation should generally be avoided to not force users to change. However, if we were to make that change, now would be the time, given that users of this part of the codebase need to update their code anyway (sklearn.model_selection doesn't exist in master).

Note however, that there's also the further inconsistency with the ShuffleSplit family's use of n_iter for the same notion as n_folds. Indeed, for a consistent and unambiguous interface, [get_]n_iter may be preferable to [get_]n_splits. Unless there is wholesale agreement among devs that we should piggyback further consistency improvements on the model_selection change, I think the choice of parameter name should be limited to the new splitter.

Ordinarily, @GaelVaroquaux righteously champions a conservative position on such change, so it's useful to have his input on whether that piggybacking is at all worth considering.

yenchenlin · 2016-04-05T15:34:31Z

Any update on this?

@GaelVaroquaux, could you please share your opinion?
Thanks!

yenchenlin · 2016-08-10T00:53:49Z

Sorry to interrupt, @jnothman and @MechCoder .
I think the functionality of time series cross validation is done.

The remaining problem is that get_n_splits() returns self.n_folds - 1, which may be confusing since users are used to reading self.n_folds as the number of test folds.
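To make the mismatch concrete, here is a minimal sketch (not the code in this PR). `ForwardChainingSketch` is an invented name for a hypothetical forward-chaining splitter that takes `n_folds` but can only yield `n_folds - 1` train/test pairs, because the earliest fold is never used as a test set. It assumes `n_samples` is divisible by `n_folds`.

```python
class ForwardChainingSketch:
    """Hypothetical splitter, for illustration only (not the PR's class)."""

    def __init__(self, n_folds=3):
        self.n_folds = n_folds

    def get_n_splits(self):
        # The earliest fold only ever serves as training data,
        # so there is one fewer split than there are folds.
        return self.n_folds - 1

    def split(self, n_samples):
        # Assumes n_samples is divisible by n_folds to keep the sketch short.
        fold_size = n_samples // self.n_folds
        for k in range(1, self.n_folds):
            train = list(range(k * fold_size))
            test = list(range(k * fold_size, (k + 1) * fold_size))
            yield train, test


cv = ForwardChainingSketch(n_folds=4)
splits = list(cv.split(8))
print(cv.get_n_splits(), len(splits))  # 3 3
```

A user who passes `n_folds=4` and expects four test folds gets three, which is the confusion a parameter named after the number of splits would avoid.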

jnothman · 2016-08-10T05:40:37Z

sklearn/model_selection/_split.py

+        n_splits : int
+            Returns the number of splitting iterations in the cross-validator.
+        """
+        return self.n_folds-1


space between operator and args

jnothman · 2016-08-10T05:49:44Z

See #7169, which I should have left to you to post, @yenchenlin !

yenchenlin · 2016-08-10T15:52:02Z

@jnothman thanks! 😄

jnothman · 2016-08-14T14:09:34Z

Please add to classes.rst

jnothman · 2016-08-14T14:11:03Z

sklearn/model_selection/_split.py

+
+    Provides train/test indices to split time series data in train/test sets.
+
+    This cross-validation object is a variation of KFold.


Might as well use a :class: link

jnothman · 2016-08-14T14:18:09Z

I wonder whether there's any benefit in allowing the user to sub-sample from the training set so that its size is not variable. You could imagine not wanting to choose model parameters that maximise the mean score, but a mean weighted to those with a larger sample of training data: if the amount of training data has an effect, cross-validation scores may not be normally distributed. Allowing the option to sub-sample to a fixed training set size does not fix this, though. It may be worth leaving a note for the user that due to learning curve effects the mean over CV scores might not be as informative as otherwise...
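The two ideas in this comment can be sketched in a few lines. This is only an illustration under stated assumptions: `cap_train_size` and `weighted_mean_score` are hypothetical helper names, the cap keeps the most recent samples (one possible choice of sub-sample), and the scores are made-up numbers, not results from any model.

```python
def cap_train_size(train_indices, max_train_size):
    """Keep only the most recent `max_train_size` training indices."""
    if max_train_size < 1:
        raise ValueError("max_train_size must be a positive integer")
    return train_indices[-max_train_size:]


def weighted_mean_score(scores, train_sizes):
    """Mean of per-split scores, weighted by training-set size."""
    total = sum(train_sizes)
    return sum(s * n for s, n in zip(scores, train_sizes)) / total


# Made-up scores from three splits with growing training sets.
scores = [0.6, 0.7, 0.8]
train_sizes = [4, 6, 8]

plain = sum(scores) / len(scores)
weighted = weighted_mean_score(scores, train_sizes)

# Sub-sampling: trim an 8-sample training window to its last 4 samples.
capped = cap_train_size(list(range(8)), max_train_size=4)
```

With these numbers the weighted mean is higher than the plain mean, because the splits with more training data scored better; that gap is the learning-curve effect the comment warns about.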

jnothman · 2016-08-14T14:19:57Z

This needs some narrative documentation (i.e. content in cross_validation.rst), whose role is to guide the user to the kinds of problems involved in choosing a cross-validation strategy, and the available solutions.

jnothman · 2016-08-14T14:22:04Z

Test failures pending rebase on #7187, I hope.

jnothman · 2016-08-14T14:22:35Z

Ideally, this PR would also be supported by an example. Do we know of / have datasets where this would be appropriate?

jnothman · 2016-08-14T14:23:55Z

sklearn/model_selection/_split.py

+class HomogeneousTimeSeriesCV(_BaseKFold):
+    """Homogeneous Time Series cross-validator
+
+    Provides train/test indices to split time series data in train/test sets.


Please clarify to the user that we assume samples from later times have higher indices (and thus shuffling in CV is inappropriate).
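If the data does not arrive in time order, the assumption above can be satisfied by sorting before splitting. A minimal sketch, assuming a separate list of timestamps is available; `time_order` is a hypothetical helper, not part of the PR:

```python
def time_order(timestamps):
    """Return indices that sort the samples from earliest to latest."""
    return sorted(range(len(timestamps)), key=lambda i: timestamps[i])


# Toy data recorded out of order.
timestamps = [2016, 2014, 2015]
X = [[30.0], [10.0], [20.0]]
y = [3, 1, 2]

order = time_order(timestamps)
X_sorted = [X[i] for i in order]  # earliest sample first
y_sorted = [y[i] for i in order]
```

After this reordering, a higher index always means a later sample, so the splitter's "train on the past, test on the future" logic holds.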

GaelVaroquaux · 2016-08-14T14:55:55Z

Before this is merged, I think that it is very important that there is a section / subsection added in the model selection documentation on time series data. It doesn't have to be long, but it should point out the particularities of cross-validation on time series, and point to this object.

yenchenlin · 2016-08-15T13:52:19Z

I'm temporarily pushing commits that rebase on #7187 to make sure CI looks good.
I will clean up these commits after #7187 is merged.

yenchenlin · 2016-08-15T14:40:59Z

I have to admit that I'm not familiar with time-series data, and the only time-series data I've played with is a private property of a startup.

I remember that @MechCoder has some experience with this. Could you please suggest a toy dataset?

jnothman · 2016-08-16T04:48:10Z

> I have to admit that I'm not familiar with time-series data, and the only time-series data I've played with is a private property of a startup.

You don't really need to be familiar with time-series data explicitly, as long as you understand why something like KFold is problematic and why this solution is better. "A note on shuffling" already introduces problems related to non-iid data. Similarly describe the time series case where samples are not identically distributed, rather the distribution changes as a function of the dataset ordering.
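The contrast can be shown without any real time-series data. The sketch below uses simplified, hypothetical stand-ins (`kfold_splits`, `forward_splits`, `trains_on_future`), not scikit-learn's classes, and assumes `n_samples` is divisible by `n_folds`. It checks whether a split trains on samples that come after the test samples.

```python
def kfold_splits(n_samples, n_folds):
    """Plain contiguous K-fold: every fold is the test set once."""
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        test = list(range(k * fold_size, (k + 1) * fold_size))
        train = [i for i in range(n_samples) if i not in test]
        yield train, test


def forward_splits(n_samples, n_folds):
    """Forward chaining: train on everything before the test fold."""
    fold_size = n_samples // n_folds
    for k in range(1, n_folds):
        train = list(range(k * fold_size))
        test = list(range(k * fold_size, (k + 1) * fold_size))
        yield train, test


def trains_on_future(train, test):
    """True if some training sample is later than the earliest test sample."""
    return max(train) > min(test)


kfold_leaks = [trains_on_future(tr, te) for tr, te in kfold_splits(6, 3)]
forward_leaks = [trains_on_future(tr, te) for tr, te in forward_splits(6, 3)]
print(kfold_leaks)    # [True, True, False]
print(forward_leaks)  # [False, False]
```

When the distribution drifts with the ordering, the `True` entries are splits where the model has already seen the regime it is being tested on, so the scores are optimistic.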

jnothman · 2016-08-24T10:59:30Z

I think we can merge after those minor changes, and a New Feature entry in whats_new.

We should also create an issue quoting @amueller's suggestion of a restructure of the docs. You are welcome to do this.

jnothman · 2016-08-24T12:43:10Z

sklearn/model_selection/_split.py

-    The first fold has size
-    ``n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``,
-    and the other folds all have size ``n_samples // (n_splits + 1)``.
+	The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``


You've used a tab character here. And gone over 79 characters. You can break lines within double-backticks if necessary.
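As an aside on the docstring under review, the sizes it describes can be sketched as a small generator. This is an illustration, not the code in `_split.py`: `time_series_split_sketch` is a hypothetical name, and the formula is read as `i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1)`, with `i` counting splits from 1.

```python
def time_series_split_sketch(n_samples, n_splits):
    """Yield (train, test) index lists with the sizes from the docstring."""
    n_folds = n_splits + 1
    if n_folds > n_samples:
        raise ValueError("need at least n_splits + 1 samples")
    test_size = n_samples // n_folds
    remainder = n_samples % n_folds
    for i in range(1, n_splits + 1):
        # Training set: the first i folds plus the leftover samples.
        train_end = i * test_size + remainder
        train = list(range(train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test


for train, test in time_series_split_sketch(10, 3):
    print(len(train), test)
# 4 [4, 5]
# 6 [6, 7]
# 8 [8, 9]
```

Under this reading, the leftover `n_samples % (n_splits + 1)` samples are absorbed into the earliest training set, so every test set has the same size.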

jnothman · 2016-08-24T13:02:05Z

sklearn/model_selection/_split.py

@@ -675,8 +675,9 @@ class TimeSeriesCV(_BaseKFold):

    Notes
    -----
-	The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``
-	in the ``i``th split, with a test set of size ``n_samples//(n_splits + 1)``,
+	The training set has size ``i * n_samples // (n_splits + 1)


still a tab character

jnothman · 2016-08-24T13:02:51Z

@yenchenlin , can I suggest you install a pep8 checker?
flake8 tends to work well.

yenchenlin · 2016-08-24T13:08:26Z

@jnothman I feel embarrassed, I do have one ...

jnothman · 2016-08-24T13:11:23Z

I had no intention to embarrass you. Here's a further hint:

~/repos/scikit-learn$ cat .git/hooks/pre-commit
#!/bin/sh

# Collect staged Python files, skipping deleted ones.
FILES=$(git diff --cached --name-status | grep -v ^D | awk '$1 $2 { print $2}' | grep -e '\.py$')
if [ -n "$FILES" ]; then
    pep8 -r $FILES
fi

Use git commit -n to bypass its warnings. (Though I'm sure that could be done without the greps and I can't be bothered to change that now...)

jnothman · 2016-08-24T13:15:05Z

Hurrah! 🍻

jnothman · 2016-08-24T13:18:28Z

Ugh. forgot to add to classes.rst

jnothman · 2016-08-24T13:20:07Z

Now that this is merged, I've realised that it's a bit unconventional for us to have the suffix CV on a splitter, and it has an altogether different conventional meaning. Should this be TimeSeriesSplit or TimeSeriesFold? @amueller??

amueller · 2016-08-24T14:31:28Z

True. I'd say TimeSeriesSplit

amueller · 2016-08-24T14:31:46Z

Wanna add to classes and do the rename?

MechCoder · 2016-08-24T17:56:14Z

Thanks a lot @yenchenlin . Yes, I agree that it should be TimeSeriesSplit

jnothman · 2016-08-25T15:22:19Z

@yenchenlin are you able to make a patch for these afterthoughts?

yenchenlin · 2016-08-25T15:45:42Z

@jnothman sure!

raghavrv · 2016-08-25T19:45:53Z

TimeSeriesSplitter maybe?

cbrummitt · 2016-11-30T19:53:11Z

I'm not sure I understand what HeterogeneousTimeSeries means. It was mentioned here and in #6322. I think it means that the time stamps are not equally spaced.

I've implemented what I'm calling MultiTimeSeriesSplit for the case of multiple time-series with different time stamps present in different time-series (or, said differently, with missing data). I compute splits by calculating equally spaced quantiles of the distribution of time stamps.

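Here's an illustration for a small dataset (image: https://cloud.githubusercontent.com/assets/1268375/20768337/8a29d242-b70b-11e6-970d-720773b2328e.png). There are four time-series (four countries), and not all countries have data in every year (e.g., it's common to have countries be "born" in the middle of a dataset).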

Would this be of interest?

@yenchenlin what's the status of HeterogeneousTimeSeriesSplit (or whatever is the latest name)?

Update (March 2019): The MultiTimeSeriesSplitter I wrote is here. It ended up using quantiles using this function to decide where to make splits, with the option of making the first split larger (e.g., "I want the first fold to be the earliest 60% of the data, the second fold to be the earliest 70% of the data, etc."). This class has the possibly undesirable behavior of putting some records at time t in one fold and other records at time t in another fold. The benefit, however, was that the numbers of records in the folds are more similar. The code got a bit complicated and more tied to my use case than I would have liked.
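For readers trying to picture the quantile idea, here is one possible sketch. It is not the MultiTimeSeriesSplitter described above: `quantile_time_splits` is a hypothetical function, the cut points are taken from the sorted timestamps at equally spaced ranks, and, unlike the class described above, records that share a timestamp always land in the same fold (so a test set can come out empty when many timestamps repeat).

```python
def quantile_time_splits(timestamps, n_splits):
    """Yield (train, test) record indices split at timestamp quantiles."""
    ordered = sorted(timestamps)
    n = len(ordered)
    n_folds = n_splits + 1
    if n_folds > n:
        raise ValueError("need at least n_splits + 1 records")
    # cuts[k - 1] is the latest timestamp within the earliest k/n_folds
    # share of the records.
    cuts = [ordered[(k * n) // n_folds - 1] for k in range(1, n_folds + 1)]
    for i in range(n_splits):
        lo, hi = cuts[i], cuts[i + 1]
        train = [j for j, t in enumerate(timestamps) if t <= lo]
        test = [j for j, t in enumerate(timestamps) if lo < t <= hi]
        yield train, test


# Records pooled from several series; some years appear more than once.
years = [2000, 2001, 2001, 2002, 2003, 2003, 2004, 2005]
for train, test in quantile_time_splits(years, n_splits=3):
    print(train, test)
# [0, 1, 2] [3]
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
```

Keeping tied timestamps together avoids leaking same-time records across folds, at the cost of less evenly sized folds, which is the trade-off the update above describes from the other side.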

jnothman · 2016-11-30T22:47:42Z

This doesn't sound far from Hetero TSS. I think if we had an idea of the API, i.e. how this information is specified and how general this approach is, there might be some interest.


amueller · 2016-12-01T21:30:58Z

@jnothman also check out the discussion here: #6322

yenchenlin force-pushed the add-homogeneous-time-series-cv branch 2 times, most recently from 4b54fac to 045b4b5 Compare March 24, 2016 05:52

yenchenlin force-pushed the add-homogeneous-time-series-cv branch from 045b4b5 to fe49fa7 Compare March 24, 2016 06:18

yenchenlin force-pushed the add-homogeneous-time-series-cv branch from fe49fa7 to 61c5a15 Compare March 24, 2016 06:44

yenchenlin force-pushed the add-homogeneous-time-series-cv branch from 61c5a15 to 1348e4a Compare August 10, 2016 00:50

jnothman reviewed Aug 10, 2016

jnothman mentioned this pull request Aug 10, 2016

Rename CV params n_{folds,iter} to n_splits #7169

Closed

jnothman reviewed Aug 14, 2016

yenchenlin force-pushed the add-homogeneous-time-series-cv branch from e87babe to e0a3865 Compare August 15, 2016 13:48

yenchenlin force-pushed the add-homogeneous-time-series-cv branch from e0a3865 to 9cdfe2f Compare August 15, 2016 14:21

yenchenlin added 3 commits August 24, 2016 20:16

PEP8

0e09d4a

Enhance doc

2e4bc4b

Make doc clear

d6bea48

jnothman reviewed Aug 24, 2016

Fix pep8

c855f89

jnothman reviewed Aug 24, 2016

PEP8

0de40bb

jnothman merged commit 234d256 into scikit-learn:master Aug 24, 2016

amueller mentioned this pull request Aug 24, 2016

Add TimeSeriesCV and HomogeneousTimeSeriesCV #6322

Open

yenchenlin mentioned this pull request Aug 25, 2016

Rename TimeSeriesCV to TimeSeriesSplit #7245

Merged

jnothman mentioned this pull request Sep 1, 2016

[WIP] RollingWindow cross-validation #3638

Closed

kingjr mentioned this pull request Sep 15, 2016

WIP: sklearn-style encoding / modularizing encoding pipelines mne-tools/mne-python#3310

Closed

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG] Add homogeneous time series cross validation (scikit-learn#6586)

50e5302

