Implement WalkForward cross-validator for time series data. #14376
Comments
@saninstein Why the gap? |
@clstaudt The gap is a useful feature for model evaluation in stock trading. In price prediction it is very common for a model to perform very well on the data right after the training set and then degrade over time. So in some cases it can be useful to skip a small amount of "good" data that occurred right after the training set. |
I think our usual position on time-series related features is that they're out of scope for sklearn (at least for now). And to me, it would make sense to revisit the matter once we have sample properties such as timestamps attached to the data. I'd like to have at least one other opinion from @scikit-learn/core-devs on this, but my vote is a "won't fix" resolution for now. |
@adrinjalali sklearn already has many good models for time series,
Which ones?
|
@adrinjalali there is TimeSeriesSplit, which is already implemented in sklearn :) TimeSeriesSplit is a special case of the proposed WalkForwardCV (can be achieved with ...). I'm part of a stock trading ML team and we are successfully using various sklearn features and sklearn-compatible libraries. We lacked a proper CV splitter, so we decided to contribute one back to sklearn; if it doesn't fit here, we will keep using it internally :) |
Fair, but then I'd probably try to patch |
Time series prediction is not the only use case for WalkForwardCV. It is useful whenever the observations in a dataset are ordered in time and validation needs to respect that order. For example, one of our tasks is filtering losing trades: a classification problem in which we try to improve an existing trading strategy with an ML model that will "give permission" to trade.
I believe it would be the best solution. |
I am a data scientist working on time series forecasting models. I assumed time series-specific things were out of scope for scikit-learn, so I started to implement my own validation code, slightly different from the method described in this thread. Later I learned of the existence of TimeSeriesSplit and wondered whether I had started to reinvent the wheel. I would prefer to contribute the validation code to an existing, established project. Since there is clearly a demand for this kind of model evaluation, I still wonder where it fits. |
There are many patterns that are needed for prediction on time series. I
would think, for instance, that a transformer creating wavelet features
would be very useful.
However, these are outside the scope of scikit-learn. It would be useful
to create a package that implements all these in a consistent way. It
would probably pick up momentum.
|
Yes. Sign me up. |
We'd be happy to have such a package in https://github.com/scikit-learn-contrib/ Closing this one then :) |
I think adding a better time-series cross-validation is in scope. |
Re scope, I agree with @amueller that we should be open to extending this to common use-cases. Basically, we generally assume in scikit-learn estimators (i.e. sklearn package) that the model should be more-or-less invariant to sample order and feature order. This excludes time series estimators. However, we do not have this constraint in cross validation splitters where we have long considered sample order something to pay attention to; ultimately, cross validation is where the core assumptions around ML lie. But as @amueller also points out, really the conversation should be continued in the existing pull requests, moving them towards an agreeable state. |
Sure, but I'm wary of cases where the actual timestamp of the data should matter in the split (which IMO it should), and not the mere count of the rows. I don't think we're going to handle the timestamps anytime soon, are we? |
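To illustrate the distinction raised here, the following is a minimal sketch of a split driven by timestamps rather than row counts. The function `split_by_time` and its parameters are hypothetical names invented for this illustration; they are not part of scikit-learn or of any pull request mentioned in this thread.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def split_by_time(
    timestamps: Sequence[datetime],
    test_start: datetime,
    test_length: timedelta,
    gap: timedelta = timedelta(0),
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: pick train/test rows by timestamp, not by row count.

    Rows strictly earlier than (test_start - gap) go to the training set;
    rows in [test_start, test_start + test_length) go to the test set.
    Rows that fall inside the gap are dropped.
    """
    train_cutoff = test_start - gap
    test_end = test_start + test_length
    train = [i for i, t in enumerate(timestamps) if t < train_cutoff]
    test = [i for i, t in enumerate(timestamps) if test_start <= t < test_end]
    return train, test


# Irregularly spaced observations: a row-count gap would not map to a fixed duration.
stamps = [datetime(2019, 7, day) for day in (1, 2, 3, 10, 11, 12, 20)]
train_idx, test_idx = split_by_time(
    stamps,
    test_start=datetime(2019, 7, 11),
    test_length=timedelta(days=5),
    gap=timedelta(days=2),
)
print(train_idx, test_idx)
```

With irregular sampling like this, a two-day gap removes one row here but could remove many rows or none elsewhere, which a gap expressed as a row count cannot capture.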
In the 5 years since I first proposed this in #3202, this question has come up at least 50 times in conversations while teaching or applying ML. @saninstein, did you make a decision about whether to push for inclusion here or in -contrib? I would love to help if there is anything you need assistance with to get this over the line (somewhere). |
I would also like to contribute. Wrote in another issue about this and would like to expand on TimeSeriesSplit or collaborate on creating another package for that. I feel this is something that is related to splits in the CV domain and should be in sklearn. |
@mjbommar As @jnothman and I say above, we are quite open to moving forward, and there are some existing PRs, in particular #13761 and #13204; feedback on the two APIs would be much appreciated. I think #13204 looks to be the most mature, so maybe going from there makes the most sense? I'm not sure if @kykosic is still working on it, given the delay in our response? |
@kykosic awesome, thanks! |
#13204 added |
Description
Implement walk-forward CV for time series data, with a gap between the training set and the test set.
Expected Results
Expanding
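The expanding scheme with a gap, as described above, can be sketched roughly as follows. This is an illustrative sketch only, not the implementation proposed in this issue: the function name `walk_forward_splits` and its parameters (`n_splits`, `test_size`, `gap`) are assumptions made for the example.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int,
    n_splits: int,
    test_size: int,
    gap: int = 0,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical sketch of expanding walk-forward splits with a gap.

    Test windows are consecutive blocks of `test_size` rows taken from the
    end of the data. Each training set starts at row 0 and ends `gap` rows
    before its test window, so it grows from one split to the next.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError("not enough samples for the requested splits and gap")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_end = test_start - gap  # rows in [train_end, test_start) are skipped
        train = list(range(train_end))
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(walk_forward_splits(n_samples=10, n_splits=3, test_size=2, gap=1))
for train, test in splits:
    print("train:", train, "test:", test)
# train: [0, 1, 2] test: [4, 5]
# train: [0, 1, 2, 3, 4] test: [6, 7]
# train: [0, 1, 2, 3, 4, 5, 6] test: [8, 9]
```

A sliding (fixed-length) variant would additionally drop the oldest training rows on each step; that option is left out here to keep the sketch short.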
