
Add TimeSeriesCV and HomogeneousTimeSeriesCV #6322


Open
amueller opened this issue Feb 9, 2016 · 23 comments
Labels
help wanted · Moderate · module:model_selection · New Feature

Comments

@amueller (Member)

amueller commented Feb 9, 2016

I get this asked about once a day, so I think we should just add it.
Many people work with time series, and adding cross-validation for them would be really easy.
The standard strategy is described for example here

There are basically two cases: homogeneous time series (one sample every X seconds / days), or heterogeneous time series, where each sample has a time stamp.

For the homogeneous case, we can just put the first n_samples // n_folds in the first fold etc, so it's a very simple variation of KFold. Fixed in #6586.
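A minimal numpy-only sketch of that homogeneous strategy (an expanding training window over equal-size blocks; the function name here is illustrative, not the merged API):

```python
import numpy as np

def homogeneous_time_series_splits(n_samples, n_folds):
    """Yield (train, test) index arrays: fold k tests on block k + 1
    and trains on every earlier block."""
    indices = np.arange(n_samples)
    fold_sizes = np.full(n_folds, n_samples // n_folds)
    fold_sizes[: n_samples % n_folds] += 1  # spread the remainder over early folds
    boundaries = np.cumsum(fold_sizes)
    for start, stop in zip(boundaries[:-1], boundaries[1:]):
        yield indices[:start], indices[start:stop]

for train, test in homogeneous_time_series_splits(10, 5):
    print("train:", train, "test:", test)
```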

For the heterogeneous case, we need to take a labels array and split accordingly. If we cast that to integers, people could actually provide pandas time series and they would be handled correctly (they would be converted to nanoseconds).
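A quick sketch of that casting step, using numpy's datetime64 rather than pandas so it stays dependency-light (pandas datetimes would arrive as datetime64[ns] arrays too; the data here is illustrative):

```python
import numpy as np

# per-sample time stamps with nanosecond resolution
stamps = np.array(["2016-01-01", "2016-01-02", "2016-02-01"],
                  dtype="datetime64[ns]")
labels = stamps.astype("int64")  # nanoseconds since the Unix epoch
print(labels)  # consecutive days differ by 86400 * 10**9
```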

I remember arguing against this addition, but I changed my mind ;)

@yenchenlin (Contributor)

Hello @amueller ,
May I take this issue?

Recently I've been playing with time series data and found cross-validation for it problematic.
I think it would be really useful if scikit-learn added this new feature.

Thanks!

@raghavrv (Member)

Please go ahead :)

@raghavrv (Member)

cc: @agramfort and @jasmainak for suggestions as this would be immensely useful for mne-python too.

@jasmainak (Contributor)

Yes, I agree this can be useful if you want to tune parameters in a BCI setting. This way you also avoid the problem of testing on data correlated with your training samples. Although changing the size of the training set for each fold looks a bit weird to me.

@yenchenlin (Contributor)

Hello @amueller and @rvraghav93 ,
In the homogeneous case, are we expecting to see something that works like below?

cv = HomogeneousTimeSeriesCV(5, n_folds=5)
for train, test in cv:
    print("train:")
    print(train)
    print("test:")
    print(test)

which will output

train:
[0]
test:
[1]
train:
[0 1]
test:
[2]
train:
[0 1 2]
test:
[3]
train:
[0 1 2 3]
test:
[4]

Sorry if this is a stupid question.

@raghavrv (Member)

htcv = HomogeneousTimeSeriesCV(5, n_folds=5)

(nitpick) You should be developing this new CV as per our new cross-validator API, which is data-independent (see the model_selection module...)

train:
...
[4]

Yes, I think you are headed in the right direction! Ping @agramfort for more suggestions.

Sorry if this is a stupid question.

No question is stupid unless it remains unasked. At least, that's my philosophy towards asking questions :p

@lesshaste

Here are some links which I hope are helpful.

http://robjhyndman.com/hyndsight/tscvexample/ has a worked example in R.

For those with university subscriptions, "On the use of cross-validation for time series predictor evaluation" http://www.sciencedirect.com/science/article/pii/S0020025511006773 contains experimental results for different time series cross-validation approaches with a particular emphasis on machine learning.

The data and conclusions of that paper are available at http://dicits.ugr.es/papers/CV-TS/ .

@yenchenlin (Contributor)

@lesshaste Thanks for the information!

@yenchenlin (Contributor)

Hello @amueller , sorry to bother you.
Can you elaborate a bit more on heterogeneous time series CV?
An example would definitely help me a lot!

Thanks!

@yenchenlin (Contributor)

Hello @amueller ,
Do you think we can separate HomogeneousTimeSeriesCV and HeterogeneousTimeSeriesCV into two PRs?
So far, I've already completed the HomogeneousTimeSeriesCV case in #6351 .

@amueller (Member, Author)

amueller commented Mar 3, 2016

Well, imagine you have time stamps with each data point, and you want to use time ranges instead of numbers of samples for the cross-validation. The time stamps would be passed as the labels argument (maybe it could be called something else, but probably not).

Imagine you have 5 days of data, and each day has a different number of samples (each with a measurement time in seconds). You want to do CV over the days (train on day 1, test on day 2; train on days 1-2, test on day 3; etc.).
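A minimal sketch of that day-wise scheme, assuming a per-sample `days` label array (the data here is hypothetical):

```python
import numpy as np

# hypothetical: 5 days of data with a different number of samples per day
days = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5])

for test_day in np.unique(days)[1:]:
    train = np.flatnonzero(days < test_day)
    test = np.flatnonzero(days == test_day)
    print(f"train on days < {test_day}: {train}  test on day {test_day}: {test}")
```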

@amueller (Member, Author)

amueller commented Nov 30, 2016

Replying to @cbrummitt who commented here: #6586 (comment)

(we should keep the discussion in the open issue, not in a closed and merged pull request)

We wanted to deal with a single time series. Processing of the data is separate from splitting it into training and test sets (in the scikit-learn API / conventions).
It looks like you resampled your data to evenly spaced times (like every year). After you do that, you can just use the HomogeneousTimeSeries.

The idea of the HeterogeneousTimeSeries is to deal with event data, say at second resolution over multiple years. You could resample that to even intervals, but if your events are rare, that means your data is mostly missing everywhere, and that's not a very good representation. Instead, you can represent each measurement as having a timestamp. Now I want to do cross-validation over years: training on data from year 1 and testing on year 2, then training on years 1 and 2 and testing on year 3. I need to know which data points are in which year to do that.

Maybe this case is too specific. You could always use HomogeneousTimeSeries to do that, so maybe that was a bad name :( - the issue is that you wouldn't have semantic cut-off dates then; instead you'd train on the first 1k events and test on the next 1k events, etc.
If your data has strong long-term periodicity, this might be a bad idea because the distribution of the data over the different splits might be very different.

Maybe that's too specific a use-case for sklearn, though...

@cbrummitt (Contributor)

In this post, I'm going to consider two dimensions along which time-series data can differ:

  1. Samples are evenly spaced in time (I'll denote this by E) or samples are heterogeneously spaced in time (I'll denote this by H).
  2. There is a single "trajectory" (I'll denote this by S for single) or there are multiple trajectories (I'll denote this by M for multiple). Calling this "multiple trajectories" is probably not standard; it's also called panel data or longitudinal cohort data.

Here are examples of the four possible types of time-series, just to clarify and fix the ideas:

  • ES: The number of hours that Alice slept each night for 40 nights.
  • HS: The number of hours that Alice slept on some of the past 40 nights (i.e., some days are missing data).
  • EM: Number of hours slept by Alice and by Bob every night during the past 40 nights.
  • HM: Number of hours slept by Alice and by Bob on some of the past 40 nights (i.e., they are missing data on potentially different days; e.g., Bob is missing data on January 1 but Alice has data on January 1.)

(To clarify, by "Multiple trajectories" I have in mind a pandas Panel, with items = ['Alice', 'Bob'], time on the major_axis, and features on the minor_axis. In the above example there's only one feature, 'num_hours_slept', but I have in mind the general case of multiple features. By "Single trajectory" I have in mind a panel with one item, 'Alice', and potentially many features on the minor_axis, which you can also consider simply a two-dimensional array.)


Now let's try to split all four kinds of time-series data into cross-validation splits (or "folds") that are successively nested in time.

I'm thinking out loud here, still fleshing out these ideas, so please consider the proposal below a rough draft. Comments are welcome!

One may want to create cross-validation splits in one of two ways:

  1. by='samples': the folds have (roughly) the same number of samples, but not necessarily the same amount of time in each fold.

(By "roughly" I mean that the number of samples in the folds may differ from each another by at most 1 depending on n_samples % n_folds.)

Here's how TimeSeriesSplit(by='samples') can be implemented with existing tools and with tools under development:

  • ES: We can do this with the existing TimeSeriesSplit because no data is missing. ✅
  • HS: We can do this with the existing TimeSeriesSplit if the user first removes samples whose features are all missing (e.g., dataframe.dropna(how='all')). ✅ (Would it be innocuous to have TimeSeriesSplit drop all samples that have every feature missing?)
  • EM: If no data is missing, then use the existing TimeSeriesSplit. ✅ But if some trajectories are missing data at times when other trajectories are not, then you'd want to pass in a list of time-stamps in a times keyword argument; compute equally spaced quantiles of the time-stamps; and use those quantiles to make the nested splits. For a visualization of this idea, please see my comment [MRG+2?] Add homogeneous time series cross validation #6586 (comment)
  • HM: We cannot use the existing TimeSeriesSplit because there are different numbers of samples at different time-stamps. My current implementation is to convert the pandas Panel to a MultiIndex DataFrame; remove rows that are missing all their features (in the example above, we'd drop (January 1, na) from Bob's time-series); and then split the data based on the quantiles of the distribution of time-stamps (as described in the previous bullet point).

In general, TimeSeriesSplit(by='samples') could take a 2D array X and a list of time-stamps times (which contains the time-stamps of the rows); drop rows in X that are missing data in every column (and drop the corresponding time-stamps); and then split the data based on quantiles of the time-stamp distribution. This approach would be backward compatible with the current behavior of TimeSeriesSplit for all data matrices X without any rows that are missing all their columns. Splitting on the quantiles of equally spaced times is the same as the current behavior of TimeSeriesSplit, which splits on the indices of the rows of X.
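A rough numpy sketch of the quantile step described above (the time-stamps are illustrative; this is not an actual implementation of the proposal):

```python
import numpy as np

times = np.array([0.0, 0.5, 1.0, 4.0, 4.5, 9.0, 9.5, 9.9])  # sample time-stamps
n_folds = 4

# equally spaced quantiles of the time-stamp distribution as fold cut points
cuts = np.quantile(times, np.linspace(0, 1, n_folds + 1))[1:-1]
fold_of_sample = np.searchsorted(cuts, times, side="right")

# nested splits: train on all earlier folds, test on fold k
for k in range(1, n_folds):
    train = np.flatnonzero(fold_of_sample < k)
    test = np.flatnonzero(fold_of_sample == k)
    print("train:", train, "test:", test)
```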

  2. by='time': the folds have (roughly) the same amount of time in each, but not necessarily the same number of samples.

Here's how TimeSeriesSplit(by='time') can be implemented with existing tools and with tools under development:

  • ES: We can do this with the existing TimeSeriesSplit. ✅
  • HS: I think this is the case that @amueller has in mind when he discusses HeterogeneousTimeSeries. I wrote a rough draft of an implementation below.
  • EM: We can do this with the existing TimeSeriesSplit. It doesn't matter if some trajectories are missing data at times when other trajectories are not. ✅
  • HM: As for HS, we would need @amueller's HeterogeneousTimeSeries.

Here's a simple implementation of @amueller's HeterogeneousTimeSeries, or what I've been calling TimeSeriesSplit(by='time').

import numpy as np

times = np.array([0, 1, 5, 19, 20, 30, 31, 33])
values = np.array([2, 3, 1, 3, 4, 5, 1, 3])
n_folds = 4
bin_size = max(times) // n_folds

bin_right_endpoints = np.arange(
    # Make the first bin as large as possible
    bin_size + max(times) % n_folds,
    max(times) + 1,
    bin_size)
# bin_right_endpoints in this example is array([ 9, 17, 25, 33])

successive_train_test_time_endpoints = zip(
    bin_right_endpoints[:-1], bin_right_endpoints[1:])

for i, (train_max_time, test_max_time) in enumerate(successive_train_test_time_endpoints):
    mask_train = times <= train_max_time
    mask_test = ~mask_train & (times <= test_max_time)
    template = "Fold {i}: time span {timespan:>8}. Train: {train:<20} Test: {test}"
    print(template.format(
        i=i,
        timespan=str((train_max_time, test_max_time)),
        train=str(times[mask_train]),
        test=str(times[mask_test])))

Output:

Fold 0: time span  (9, 17). Train: [0 1 5]              Test: []
Fold 1: time span (17, 25). Train: [0 1 5]              Test: [19 20]
Fold 2: time span (25, 33). Train: [ 0  1  5 19 20]     Test: [30 31 33]

As we see in this example, for TimeSeriesSplit(by='time') the user may choose a sufficiently large number of folds that one of the train or test sets is empty 😟


I hope this post helps to fix ideas a bit. In summary, I think TimeSeriesSplit could be improved in a mostly backward compatible way, with new optional parameters by and times, so that it could handle missing data, heterogeneous time-stamps, and panel data.

@cbrummitt (Contributor)

Another feature to consider adding to TimeSeriesSplit is an optional argument drop (or drop_first?) that drops the first k samples from the train and test sets. The reason is that in time-series analysis one often regresses the time-series at time t on the time-series at times t - 1, t - 2, ..., t - n_lags. For cross-validation you want the training and test sets to be independent. If, for example, the time-series is stationary with significant autocorrelation up to 4 lags and negligible autocorrelation beyond 4 lags, then one would create a model with n_lags set to 4; moreover, the first 4 values in the test set are not independent of the training set, so you'd want to drop the first 4 values from the test set.

This example is illustrated in Figure 3(b) of Ref. [1], a paper that is behind a paywall. I copied Fig. 3(b) below. The train, validation, and test sets are marked by blue triangles, red squares, and green squares, respectively. Notice how each of these sets has 4 unmarked points to the left of them; these 12 unmarked points have been dropped from those sets.

(Figure 3(b) of Ref. [1]; image not reproduced here.)

[1] Bergmeir, C., & Benítez, J. M. (2012). On the use of cross-validation for time series predictor evaluation. Information Sciences, 191, 192–213. http://doi.org/10.1016/j.ins.2011.12.028

This drop keyword argument could be 0 by default so that it's backward compatible.
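(For the record, later scikit-learn releases added a `gap` parameter to `TimeSeriesSplit` that serves this purpose by trimming the end of each training set.) A numpy-only sketch of the variant described here, which instead drops the first `n_lags` points of each test set; the sizes are illustrative:

```python
import numpy as np

n_samples, n_folds, n_lags = 30, 4, 4
indices = np.arange(n_samples)
test_size = n_samples // (n_folds + 1)  # same sizing rule as TimeSeriesSplit

for test_start in range(n_samples - n_folds * test_size, n_samples, test_size):
    train = indices[:test_start]
    # drop the first n_lags test points: they are correlated with the train set
    test = indices[test_start + n_lags : test_start + test_size]
    print("train size:", len(train), "test:", test)
```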

@jnothman (Member)

jnothman commented Dec 3, 2016 via email

@amueller (Member, Author)

amueller commented Dec 3, 2016

I'm not sure the concept of multiple trajectories makes sense within scikit-learn. We don't even know about indices right now, and I don't think we'll support panels any time soon. This only matters in the HM case, though; EM is basically "a single time-series with multiple features". Actually, EM seems the most natural case for scikit-learn.

Apart from the fact that panels don't really fit into the scikit-learn framework, it's pretty hard to even pass around a single time-series through the scikit-learn API. We could abuse "groups". For a real solution we might need #4497. But even with that, in the HM case there is really no concept of a sample, is there? (Unless you consider the time-series themselves samples and you want to learn from some time-series and generalize to new time-series, but that sounds like a structured prediction problem.)

@amueller (Member, Author)

amueller commented Dec 3, 2016

Or did you want to transform HM into EM by creating "missing value" entries so that it's kind of like HS?
I'm OK with adding by and times, though I'm not sure whether times needs to be called groups. We cannot use any pandas data structures internally, though (though we can accept them).

@cbrummitt (Contributor)

I would reserve groups for the names of multiple time-series (i.e., the items of a pandas Panel). For example, in a longitudinal cohort dataset on 1000 patients with 10 measurements taken every year for 20 years, the groups would be the 1000 patient IDs, and the times would be the 20 years (expressed as integers, floats, or maybe as datetime objects?). An example close to this one is used in the user guide 3.1.5. Cross-validation iterators for grouped data:

An example would be when there is medical data collected from multiple patients, with multiple samples taken from each patient. And such data is likely to be dependent on the individual group. In our example, the patient id for each sample will be its group identifier.

The guide mentions "multiple samples taken from each patient" but doesn't focus on time. If I understand correctly, the existing grouped cross-validation iterators (GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit) are used in situations where samples within the same group are correlated with each other, but the outcome probabilities don't vary in time. It seems natural to go beyond splitting by just groups and to create CV iterators for situations in which groups and time both matter.

I think the most elegant solution would be for users to put their (panel or time-series) dataset into long format: a 2D array with features on the columns and with each row corresponding to an observation (of some group at some time). Let (n_observations, n_features) be the shape of this 2D array. The user would need to compute as many as two 1D arrays of shape (n_observations,): groups and times. These two arrays could be passed into the k-fold and shuffle-split split methods. For many such methods, times would be ignored and would exist only for compatibility (like groups is now for KFold, StratifiedKFold, and so on). Only in TimeSeriesSplit.split() would the times parameter be used. In the future there may be more time-series CV splitters (e.g., with a rolling window rather than nested splits?) that use the times parameter.
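A toy illustration of the long format and a times-based nested split (the patient/year data is hypothetical; `groups` is unused by the time split, exactly as proposed for splitters that ignore it):

```python
import numpy as np

# long format: one row per (patient, year) observation
groups = np.array([101, 101, 101, 202, 202, 303, 303, 303])  # patient IDs
times = np.array([2000, 2001, 2002, 2000, 2002, 2001, 2002, 2003])

# nested split on times: train on years <= y, test on year y + 1
for y in np.unique(times)[:-1]:
    train = np.flatnonzero(times <= y)
    test = np.flatnonzero(times == y + 1)
    print(f"train years <= {y}: {train}  test year {y + 1}: {test}")
```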

It would be great to be able to do nested cross-validation with, say, a groups-based split in the outer loop and a times-based split in the inner loop. (Getting this to work may require fixing #7646 .) If we used groups to pass in times in TimeSeriesSplit.split(), then the user couldn't simply write groups=patient_ids in the outer and inner loop. Instead the user would have to write groups=patient_ids in one loop and groups=years in the other, which seems a bit cognitively incongruent. I don't know whether there are cross-validation methods for panel datasets that use both groups and times simultaneously.

@jnothman (Member)

jnothman commented Dec 4, 2016 via email

@amueller (Member, Author)

amueller commented Dec 6, 2016

Yeah @cbrummitt, I was not asking whether this is different from groups, but what will break if we add something that is not called groups. And I think the answer is "everything". So yeah, we might want to revisit #4497 before we can properly implement this.

@howardbandy

For models of financial time series, as in stock market prediction, there are two techniques for defining the training period (referred to as "in-sample" in the trading-system world). One is "anchored", where the initial date of the training period is held constant and the length of the training period expands. The second is "rolling", where the length of the training period is held constant and both the start and end dates move forward. The out-of-sample data includes both validation (if used) and test data and is always more recent than the in-sample data.
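A numpy sketch contrasting the two schemes (the sizes are illustrative; scikit-learn's TimeSeriesSplit covers the rolling case via its max_train_size parameter):

```python
import numpy as np

n_samples, test_size, window = 12, 2, 4
indices = np.arange(n_samples)

for test_start in range(window, n_samples, test_size):
    test = indices[test_start : test_start + test_size]
    anchored = indices[:test_start]  # expanding: start date fixed at 0
    rolling = indices[test_start - window : test_start]  # fixed-length window
    print("anchored:", anchored, " rolling:", rolling, " test:", test)
```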

@mitar (Contributor)

mitar commented Apr 18, 2019

Just to see if I understand current state correctly. So we now have TimeSeriesSplit for homogeneous time-series, but there is nothing yet for heterogeneous?

Moreover, if my pandas DataFrame contains multiple time-series in one DataFrame, the current TimeSeriesSplit works well if all time-series span the same time range (for example, the last 30 years of stock market data), but not if they come from different time ranges (one time-series from 2015 and one from 2016, where I would want the first 10 months of each time-series as training data).

@adrinjalali (Member)

Note that now that we have SLEP006 / metadata routing, we can easily implement these if there's an interest.

@adrinjalali added the help wanted and Moderate labels Aug 4, 2023