Implement WalkForward cross-validator for time series data. #14376

Closed
saninstein opened this issue Jul 15, 2019 · 23 comments · Fixed by #13204

@saninstein

saninstein commented Jul 15, 2019

Description

Implement the walk forward cv for time series data with gap between the train set and the test set.

Expected Results

[Image: expected splits]

Expanding
[Image: expected splits with an expanding training window]
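
The splitting scheme described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the splitter proposed in this issue: the function name `walk_forward_splits` and its parameters `train_size`, `test_size`, `gap` and `expanding` are hypothetical, and the window is assumed to advance by one test block per split.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int,
    train_size: int,
    test_size: int,
    gap: int = 0,
    expanding: bool = False,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs in time order.

    Hypothetical helper for illustration only.
    """
    if min(train_size, test_size) < 1 or gap < 0:
        raise ValueError("train_size and test_size must be >= 1 and gap >= 0")
    train_end = train_size  # exclusive end of the first training window
    while train_end + gap + test_size <= n_samples:
        # Expanding windows always start at 0; sliding windows keep a fixed length.
        train_start = 0 if expanding else train_end - train_size
        test_start = train_end + gap  # skip `gap` samples after the training window
        yield (
            list(range(train_start, train_end)),
            list(range(test_start, test_start + test_size)),
        )
        train_end += test_size  # assumption: step forward by one test block


for train, test in walk_forward_splits(10, train_size=4, test_size=2, gap=1):
    print(train, test)
```

With `expanding=True` the training indices always start at 0, so each training set contains all of the earlier ones.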

@saninstein changed the title from "Implement WalkForward cross-validator for time series with gap." to "Implement WalkForward cross-validator for time series data." on Jul 15, 2019
@clstaudt

@saninstein Why the gap?

@ksanderer

@clstaudt A gap is a useful feature for model evaluation in stock trading. In price prediction, it is very common for a model to perform very well on the data right after the training set and then degrade over time. So in some cases it can be useful to skip a small amount of "good" data that occurs right after the training set.

@adrinjalali
Member

I think our usual position on time-series related features is that it's out of scope for sklearn (at least for now). And to me, it would make sense to revisit the matter once we have sample properties such as timestamp attached to the data.

I'd like to have at least one other opinion from @scikit-learn/core-devs on this, but my vote is a "won't fix" resolution for now.

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 22, 2019 via email

@ksanderer

@adrinjalali there is TimeSeriesSplit, which is already implemented in sklearn :)

TimeSeriesSplit is a special case of the proposed WalkForwardCV (it can be achieved with the gap=0, expanding=True args). In fact, TimeSeriesSplit could easily be replaced with WalkForwardCV (just set the default args as mentioned above).

I'm part of a stock trading ML team and we successfully use various sklearn features and sklearn-compatible libraries. A proper CV splitter was lacking, so we decided to contribute one back to sklearn, but if it is not in scope we will use it internally :)

@adrinjalali
Member

> TimeSeriesSplit is a special case of the proposed WalkForwardCV (it can be achieved with the gap=0, expanding=True args). In fact, TimeSeriesSplit could easily be replaced with WalkForwardCV (just set the default args as mentioned above).

Fair, but then I'd probably try to patch TimeSeriesSplit in a backward compatible way to add the feature you need. That also has a higher chance of being accepted by the community.

@ksanderer

Time series prediction is not the only use case for WalkForwardCV. It is useful whenever the observations in a dataset are ordered in time and validation needs to respect that order.

For example, one of our tasks is filtering losing trades: a classification problem where we try to improve an existing trading strategy with an ML model which will "give permission" to trade.


> Fair, but then I'd probably try to patch TimeSeriesSplit in a backward compatible way to add the feature you need. That also has a higher chance of being accepted by the community.

I believe it would be the best solution.

@clstaudt

clstaudt commented Jul 22, 2019

I am a data scientist working on time series forecasting models. I assumed time series-specific things are out of scope for scikit-learn, so I started to implement my own validation code, slightly different from the method described in this thread.

Later I learned of the existence of TimeSeriesSplit and wondered whether I had started to reinvent the wheel. I would prefer to contribute the validation code to an existing, established project.

Since there is clearly a demand for this kind of model evaluation, I still wonder where it fits.

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 22, 2019 via email

@clstaudt

> It would be useful to create a package that implements all these in a consistent way. It would probably pick up momentum.

Yes. Sign me up.

@adrinjalali
Member

We'd be happy to have such a package in https://github.com/scikit-learn-contrib/

Closing this one then :)

@amueller
Member

I think adding a better time-series cross-validation is in scope.

@amueller reopened this on Jul 23, 2019
@amueller
Member

also see #13666 #13204 #6322 #13761

@jnothman
Member

Re scope, I agree with @amueller that we should be open to extending this to common use-cases. Basically, we generally assume in scikit-learn estimators (i.e. sklearn package) that the model should be more-or-less invariant to sample order and feature order. This excludes time series estimators. However, we do not have this constraint in cross validation splitters where we have long considered sample order something to pay attention to; ultimately, cross validation is where the core assumptions around ML lie.

But as @amueller also points out, really the conversation should be continued in the existing pull requests, moving them towards an agreeable state.

@adrinjalali
Member

Sure, but I'm wary of cases where the actual timestamp of the data should matter in the split (which IMO it should), and not the mere count of the rows. I don't think we're going to handle the timestamps anytime soon, are we?
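
To make the distinction concrete, here is a rough sketch of a split driven by timestamps instead of row counts. It uses only the standard library; the helper `split_by_time` and its parameters are hypothetical and are not part of scikit-learn or of any PR mentioned in this thread.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def split_by_time(
    timestamps: Sequence[datetime],
    train_end: datetime,
    gap: timedelta,
    test_span: timedelta,
) -> Tuple[List[int], List[int]]:
    """Return (train, test) row indices chosen by timestamp, not by row count.

    Hypothetical helper for illustration only.
    """
    test_start = train_end + gap
    test_end = test_start + test_span
    train = [i for i, t in enumerate(timestamps) if t < train_end]
    test = [i for i, t in enumerate(timestamps) if test_start <= t < test_end]
    return train, test


# Irregularly spaced observations: three on Jan 1, then one per day.
stamps = [
    datetime(2019, 1, 1, 9), datetime(2019, 1, 1, 12), datetime(2019, 1, 1, 15),
    datetime(2019, 1, 2, 9), datetime(2019, 1, 3, 9), datetime(2019, 1, 4, 9),
]
train, test = split_by_time(
    stamps,
    train_end=datetime(2019, 1, 2),
    gap=timedelta(days=1),
    test_span=timedelta(days=2),
)
print(train, test)
```

Here a one-day gap drops a single row, whereas a gap of one row placed right after the first training row would cover only three hours. A gap counted in rows therefore corresponds to no fixed amount of time when sampling is irregular.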

@mjbommar
Contributor

mjbommar commented Nov 3, 2019

In the 5 years since I first proposed this in #3202, this question has come up at least 50 times in conversations teaching or applying. @saninstein, did you make a decision about whether to push for inclusion here or in -contrib? I would love to help if there is anything you need assistance with to get this over the line (somewhere).

@svenstehle
Contributor

I would also like to contribute. I wrote about this in another issue and would like to expand on TimeSeriesSplit or collaborate on creating another package for it. I feel this is related to splits in the CV domain and should be in sklearn.
To be honest, though, I am completely confused as to where to go and what to do now that I want to contribute. I am mindful of my time and would like to use it in the right way for the community.

@amueller
Member

@mjbommar As @jnothman and I said above, we are quite open to moving forward, and there are some existing PRs, in particular #13761 and #13204. Feedback on the two APIs would be much appreciated.

I think #13204 looks to be the most mature so maybe going from there makes the most sense? I'm not sure if @kykosic is still working on it, given the delay in our response?

@amueller
Member

Hm, though, #13204 doesn't implement WalkForward... Do we want to merge #13204 first and then implement WalkForward later?
Should that be a separate CV object?

@kykosic
Contributor

kykosic commented Dec 23, 2019

@amueller I had forgotten about #13204 until this post came up. I will address the reviews on it over the next week and see if it still fits in.

@amueller
Member

@kykosic awesome, thanks!

@ManuelZ

ManuelZ commented Jul 24, 2020

Curious why this issue was closed when #13204 was finished. I thought that #13204 was a prerequisite for this one.

@thomasjpfan
Member

#13204 added gap to TimeSeriesSplit, which was the feature requested by this issue.
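
A minimal usage sketch, assuming a scikit-learn release whose `TimeSeriesSplit` accepts the `gap` parameter (believed to be 0.24 or later). The toy data and the use of `max_train_size` to approximate a sliding (non-expanding) window are illustrative assumptions, not something stated in this thread.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples (toy data)

# Expanding training window, with 2 samples skipped before each test block.
expanding = TimeSeriesSplit(n_splits=3, gap=2)
for train_idx, test_idx in expanding.split(X):
    print("expanding:", train_idx, test_idx)

# Capping the training window with max_train_size gives a sliding-window variant.
sliding = TimeSeriesSplit(n_splits=3, gap=2, max_train_size=3)
for train_idx, test_idx in sliding.split(X):
    print("sliding:  ", train_idx, test_idx)
```

In both cases the gap is counted in rows, so it corresponds to a fixed amount of time only when the samples are evenly spaced.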
