Add TimeSeriesCV and HomogeneousTimeSeriesCV #6322
Hello @amueller, recently I've been playing with time series data and found that it is problematic for standard cross-validation. Thanks!
Please go ahead :)
cc: @agramfort and @jasmainak for suggestions, as this would be immensely useful for mne-python too.
Yes, I agree this can be useful if you want to tune parameters in a BCI setting. This way, you also avoid the problem of testing on data correlated with your training samples. Although changing the size of the train set for each fold looks a bit weird to me.
Hello @amueller and @rvraghav93,

```python
cv = HomogeneousTimeSeriesCV(5, n_folds=5)
for train, test in cv:
    print("train:", train)
    print("test:", test)
```

which will output:
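Assuming the expanding-window behaviour described in the issue (each of the 5 samples forms its own fold, train on all earlier folds, test on the next one), the output would presumably look something like this; the exact formatting of the proposed class is an assumption:

```
train: [0]
test: [1]
train: [0 1]
test: [2]
train: [0 1 2]
test: [3]
train: [0 1 2 3]
test: [4]
```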
Sorry if this is a stupid question.
(nitpick) You should be developing this new CV as per our new cross-validator API, which is data-independent (see the `model_selection` module).
Yes, I think you are headed in the right direction! Pinging @agramfort for more suggestions.
No question is stupid unless it remains unasked. At least, this is my philosophy towards asking questions :p
Here are some links which I hope are helpful. http://robjhyndman.com/hyndsight/tscvexample/ has a worked example in R. For those with university subscriptions, "On the use of cross-validation for time series predictor evaluation" (http://www.sciencedirect.com/science/article/pii/S0020025511006773) contains experimental results for different time series cross-validation approaches, with a particular emphasis on machine learning. The data and conclusions of that paper are available at http://dicits.ugr.es/papers/CV-TS/.
@lesshaste Thanks for the information!
Hello @amueller, sorry to disturb you. Thanks!
Well, imagine you have time stamps with each data point, and you want to use time ranges instead of numbers of samples for the cross-validation. The time stamps would be passed in as the `labels` array. Imagine you have 5 days of data, and each day has a different number of samples (each with a measurement time in seconds). You want to do CV over the days (train on day 1, test on day 2; train on days 1-2, test on day 3; etc.).
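As a rough illustration of that heterogeneous case, here is a minimal sketch (the helper `split_by_day` is hypothetical, not part of scikit-learn) that derives days from per-sample timestamps in seconds and yields expanding-window splits over the days:

```python
import numpy as np

# Hypothetical sketch: expanding-window CV over days, where each sample has
# a time stamp in seconds and days contain unequal numbers of samples.
def split_by_day(timestamps, seconds_per_day=86400):
    days = np.asarray(timestamps) // seconds_per_day
    unique_days = np.unique(days)
    for i in range(1, len(unique_days)):
        train = np.where(np.isin(days, unique_days[:i]))[0]
        test = np.where(days == unique_days[i])[0]
        yield train, test
```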
Replying to @cbrummitt, who commented here: #6586 (comment). (We should keep the discussion in the open issue, not in a closed and merged pull request.) We wanted to deal with a single time series. Processing of the data is separate from splitting it into training and test sets (in the scikit-learn API / conventions). The idea of the … Maybe this case is too specific. You could always use … Maybe that's too specific a use-case for sklearn, though...
In this post, I'm going to consider two dimensions along which time-series data can differ:

1. whether the samples are evenly spaced in time (homogeneous) or each sample carries its own time stamp (heterogeneous);
2. whether the data consist of a single trajectory or multiple trajectories.
Here are the four possible types of time-series that result from combining these two dimensions, just to clarify and fix the ideas: evenly-spaced single (ES), evenly-spaced multiple (EM), heterogeneous single (HS), and heterogeneous multiple (HM).
(To clarify, by "multiple trajectories" I have in mind something like a pandas Panel.) Now let's try to split all four kinds of time-series data into cross-validation splits (or "folds") that are successively nested in time. I'm thinking out loud here, still fleshing out these ideas, so please consider the proposal below a rough draft. Comments are welcome! One may want to create cross-validation splits in one of two ways:

1. by number of samples, so that each fold contains roughly the same number of samples;
2. by time, so that each fold spans roughly the same amount of time.
(By "roughly" I mean that the number of samples in the folds may differ from each another by at most 1 depending on Here's how
In general,
Here's how
Here's a simple implementation of @amueller's proposal.
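The following is a minimal sketch of that idea, assuming the expanding-window scheme from the issue description (the first `n_samples // n_folds` samples form the first fold, and each split trains on all earlier folds); the function name is made up for illustration:

```python
import numpy as np

# Minimal sketch: the first n_samples // n_folds samples form fold 1, the
# next chunk fold 2, and so on; any remainder is dropped for simplicity.
# Each split trains on all earlier folds and tests on the next one.
def homogeneous_time_series_cv(n_samples, n_folds):
    fold_size = n_samples // n_folds
    indices = np.arange(n_samples)
    for i in range(1, n_folds):
        yield indices[:i * fold_size], indices[i * fold_size:(i + 1) * fold_size]

for train, test in homogeneous_time_series_cv(10, n_folds=5):
    print("train:", train, "test:", test)
```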
Output:
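Running the sketch above prints:

```
train: [0 1] test: [2 3]
train: [0 1 2 3] test: [4 5]
train: [0 1 2 3 4 5] test: [6 7]
train: [0 1 2 3 4 5 6 7] test: [8 9]
```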
As we see in this example, … I hope this post helps to fix ideas a bit. In summary, I think …
Another feature to consider adding to TimeSeriesSplit is an optional argument drop (or drop_first?) that drops the first k samples from the train and test sets. The reason to do this is that in time-series analysis one often does a regression of the time-series at time t against the time-series at times t - 1, t - 2, ..., t - n_lags. For cross-validation you want the training and test sets to be independent. If, for example, the time-series is stationary with significant autocorrelation up to 4 lags and negligible autocorrelation beyond 4 lags, then one would create a model with n_lags set equal to 4; moreover, *the first 4 values in the test set are not independent of the training set*, so you'd want to drop the first 4 values from the test set.

This example is illustrated in Figure 3(b) of Ref. [1], a paper that is behind a paywall. I copied Fig. 3(b) below. The train, validation, and test sets are marked by blue triangles, red squares, and green squares, respectively. Notice how each of these sets has 4 unmarked points to the left of them; these 12 unmarked points have been dropped from those sets.

[Figure 3(b) of Ref. [1]: a time series with 4 dropped points immediately preceding each of the train, validation, and test sets.]

[1] Bergmeir, C., & Benítez, J. M. (2012). On the use of cross-validation for time series predictor evaluation. Information Sciences, 191, 192–213. http://doi.org/10.1016/j.ins.2011.12.028

This drop keyword argument could be 0 by default so that it's backward compatible.
Yes, we've somewhere discussed some kind of gap/lag parameter to distance train from test. I'll certainly review a PR for it.
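For concreteness, here is a minimal sketch of what such a gap parameter might look like on top of an expanding-window splitter; this is an assumption about the eventual API, not scikit-learn code:

```python
import numpy as np

# Hypothetical sketch of a gap/lag parameter: leave out the `gap` samples
# immediately before each test fold so that autocorrelated neighbours do
# not leak from the training set into the test set.
def time_series_split_with_gap(n_samples, n_splits, gap=0):
    fold_size = n_samples // (n_splits + 1)
    indices = np.arange(n_samples)
    for i in range(1, n_splits + 1):
        test_start = i * fold_size
        train_end = max(test_start - gap, 0)
        yield indices[:train_end], indices[test_start:test_start + fold_size]

for train, test in time_series_split_with_gap(12, n_splits=3, gap=2):
    print("train:", train, "test:", test)
```

Dropping the first k points of each test fold instead, as in Fig. 3(b) above, would be the mirror-image design choice; either way, the goal is to keep the training and test sets approximately independent.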
I'm not sure the concept of multiple trajectories makes sense within scikit-learn. We don't even know about indices right now; I don't think we'll support panels any time soon. This only matters in the HM case, though; the EM case is basically "a single time-series with multiple features". Actually, EM seems the most natural case for scikit-learn. Apart from the fact that panels don't really fit into the scikit-learn framework, it's pretty hard to even pass around a single time-series through the scikit-learn API. We could abuse "groups". For a real solution we might need #4497. But even with that, in the HM case, there is really no concept of a sample, is there? (Unless you consider the time-series themselves samples and you want to learn from some time-series and generalize to new time-series, but that sounds like a structured prediction problem.)
Or did you want to transform HM into EM by creating "missing value" entries, so that it's kinda like HS?
I would reserve groups for the names of multiple time-series (i.e., the items of a pandas Panel). For example, in a longitudinal cohort dataset on 1000 patients with 10 measurements taken every year for 20 years, the groups would be the 1000 patient IDs, and the times would be the 20 years (expressed as integers, floats, or maybe as datetime objects?). An example close to this one is used in the user guide, 3.1.5. Cross-validation iterators for grouped data (http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data):

> An example would be when there is medical data collected from multiple patients, with multiple samples taken from each patient. And such data is likely to be dependent on the individual group. In our example, the patient id for each sample will be its group identifier.

The guide mentions "multiple samples taken from each patient" but doesn't focus on time. If I understand correctly, the existing grouped cross-validation iterators (GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit) are used in situations where samples within the same group are correlated with each other but the outcome probabilities don't vary in time. It seems natural to go beyond splitting by just groups and to create CV iterators for situations in which groups and time both matter.

I think the most elegant solution would be for users to put their (panel or time-series) dataset into long format: a 2D array with features on the columns and with each row corresponding to an observation (of some group at some time). Let (n_observations, n_features) be the shape of this 2D array. The user would need to compute as many as two 1D arrays of shape (n_observations,): groups and times. These two arrays could be passed into the k-fold and shuffle-split split methods. For many such methods, times would be ignored and would exist only for compatibility (like groups is now for KFold, StratifiedKFold, and so on). Only in TimeSeriesSplit.split() would the times parameter be used. In the future there may be more time-series CV splitters (e.g., with a rolling window rather than nested splits?) that use the times parameter.

It would be great to be able to do nested cross-validation with, say, a groups-based split in the outer loop and a times-based split in the inner loop. (Getting this to work may require fixing #7646.) If we used groups to pass in times to TimeSeriesSplit.split(), then the user couldn't simply write groups=patient_ids in the outer and inner loops. Instead the user would have to write groups=patient_ids in one loop and groups=years in the other, which seems a bit cognitively incongruent. I don't know whether there are cross-validation methods for panel datasets that use both groups and times simultaneously.
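A minimal sketch of the long-format idea: a hypothetical splitter driven by a per-row times array (no such parameter exists in scikit-learn; the name and behaviour here are assumptions):

```python
import numpy as np

# Hypothetical sketch: expanding-window splits driven by a per-row `times`
# array rather than by row order, so rows may arrive in any order and
# several rows (e.g. different patients) may share a time stamp.
def split_by_times(times, n_splits):
    times = np.asarray(times)
    blocks = np.array_split(np.unique(times), n_splits + 1)  # sorted time blocks
    for i in range(1, n_splits + 1):
        train_times = np.concatenate(blocks[:i])
        yield (np.where(np.isin(times, train_times))[0],
               np.where(np.isin(times, blocks[i]))[0])

times = np.repeat([2000, 2001, 2002, 2003], 3)  # e.g. 3 patients x 4 years
for train, test in split_by_times(times, n_splits=3):
    print("train:", train, "test:", test)
```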
It's clear that these cases involve attaching more properties to samples. I am uncomfortable about adding specific new params to handle these cases. Instead, we need to add them to the use-cases for generic approaches like #4497, where substantial design work is still needed.

I also note that all these things can be hacked by storing the full dataset globally and setting `X = arange(...)`, where nested estimators have a way to dereference these indexes within the original dataset.
Yeah, @cbrummitt, I was not asking whether this is different from groups, but what will break if we add something that is not called groups. And I think the answer is "everything". So yeah, we might want to revisit #4497 before we can properly implement this.
For models of financial time series, as in stock market predictions, there are two techniques for defining the training period (referred to as "in-sample" in the trading-system world). One is "anchored", where the initial date of the training period is held constant and the length of the training period expands. The second is "rolling", where the length of the training period is held constant and both the start and end dates move forward. The out-of-sample data includes both validation (if used) and test data and is always more recent than the in-sample data.
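A minimal sketch contrasting the two schemes (the function and its `window` parameter are illustrative only; scikit-learn's TimeSeriesSplit exposes comparable rolling behaviour via its max_train_size parameter):

```python
import numpy as np

# "Anchored": window=None, the training set always starts at sample 0.
# "Rolling": window=k, the training set keeps a fixed length of k samples.
def walk_forward(n_samples, n_splits, window=None):
    fold_size = n_samples // (n_splits + 1)
    indices = np.arange(n_samples)
    for i in range(1, n_splits + 1):
        test_start = i * fold_size
        train_start = 0 if window is None else max(test_start - window, 0)
        yield (indices[train_start:test_start],
               indices[test_start:test_start + fold_size])
```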
Just to see if I understand the current state correctly: so we now have TimeSeriesSplit, … Moreover, if my pandas DataFrame contains multiple time-series in one DataFrame, the current …
Note that now that we have SLEP006 / metadata routing, we can easily implement these if there's an interest. |
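As a rough illustration (syntax as of scikit-learn 1.4+; whether a future times-aware splitter would be wired up the same way is an assumption), metadata routing already lets a splitter consume per-sample metadata such as groups:

```python
import numpy as np
from sklearn import set_config
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_validate

set_config(enable_metadata_routing=True)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(12, 3)), rng.normal(size=12)
groups = np.repeat([0, 1, 2, 3], 3)  # four groups of three samples

# `params` is routed to every consumer that requests it; GroupKFold
# requests `groups` by default. A hypothetical times-aware splitter
# could receive a `times` array through the same mechanism.
results = cross_validate(Ridge(), X, y, cv=GroupKFold(n_splits=4),
                         params={"groups": groups})
print(results["test_score"])
```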
I get this asked about once a day, so I think we should just add it. Many people work with time series, and adding cross-validation for them would be really easy. The standard strategy is described, for example, here.

There are basically two cases: homogeneous time series (one sample every X seconds / days), or heterogeneous time series, where each sample has a time stamp.

For the homogeneous case, we can just put the first `n_samples // n_folds` samples in the first fold etc., so it's a very simple variation of KFold. Fixed in #6586.

For the heterogeneous case, we need to get a `labels` array and split accordingly. If we cast that to integers, people could actually provide pandas time series, and they would be handled correctly (they will be converted to nanoseconds).

I remember arguing against this addition, but I changed my mind ;)
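A quick illustration of the casting idea (plain pandas/NumPy, nothing scikit-learn-specific): datetime64[ns] values viewed as 64-bit integers are nanosecond counts, so an integer-based `labels` splitter would handle pandas time stamps for free.

```python
import numpy as np
import pandas as pd

stamps = pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-04"])
labels = np.asarray(stamps).astype("int64")  # nanoseconds since the epoch
print(labels)
# [1451606400000000000 1451692800000000000 1451865600000000000]
```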