data-independent CV iterators #2904

In many situations you don't have a test set, so you would like to use CV for both evaluation and hyper-parameter tuning. Therefore, you need to do nested cross-validation.

This is very difficult to implement in a generic way with our current API, because CV iterators are tied to particular data. For example, after `cv = KFold(n_samples)`, `cv` will only work with a dataset of the specified size. Ideally, we would need something closer to the estimator API: use constructor parameters for data-independent options (`n_folds`, `shuffle`, `random_state`, train/test proportion, etc.) and a `run` method that takes `y` as argument (the reason to take `y` is for stratified schemes). This would look something like this:
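A minimal sketch of the proposed usage. Only the `run` method name comes from the proposal above; the toy implementation and all other names are illustrative, added so the example actually runs:

```python
import numpy as np

class KFold(object):
    """Sketch of a data-independent CV iterator."""

    def __init__(self, n_folds=3, shuffle=False, random_state=None):
        # Only data-independent options live in the constructor.
        self.n_folds = n_folds
        self.shuffle = shuffle
        self.random_state = random_state

    def run(self, y):
        # The data enters only here, so the same object can be reused
        # on datasets of any size, e.g. on every training fold of an
        # outer loop when doing nested cross-validation.
        indices = np.arange(len(y))
        if self.shuffle:
            np.random.RandomState(self.random_state).shuffle(indices)
        for test in np.array_split(indices, self.n_folds):
            yield np.setdiff1d(indices, test), test

cv = KFold(n_folds=3, shuffle=True, random_state=0)
for train, test in cv.run(np.zeros(9)):
    print(train, test)
```

A stratified variant would inspect the values of `y` inside `run`, which is why `run` takes `y` at all.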
Comments
+1e6!!! I would argue for a different name for the method (maybe 'split'), but as this will be a major API change, I would think that it would be good …

+1 We still have a question of how to handle the …

I would vote for `iter_splits`.

`KFold.split(arrays)` with specific names on the arrays? Thus, if one array is 'labels', this would work.

Naming labels for the CV generator is fine. Getting GridSearchCV to pass …
Yes, that's part of the bigger picture problem. You are right that they … @ogrisel and myself have been brainstorming on the idea of allowing y to …

Please share / elaborate on the ideas developed :)

Nothing really much more than what I said above. y could be a pandas data frame … They would be useful to add any meta-information that describes samples, …

Ping myself. This is a good idea.

I don't know about pandas, but FWIW numpy recarrays / struct arrays can deal with such structures, if need be:
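For instance (an illustrative reconstruction, since the original snippet was not preserved; these field names are invented):

```python
import numpy as np

# One record per sample: the target plus per-sample metadata.
y = np.zeros(4, dtype=[('target', int), ('subject', int), ('weight', float)])
y['target'] = [0, 1, 0, 1]
y['subject'] = [1, 1, 2, 2]
y['weight'] = [1.0, 0.5, 1.0, 0.5]

print(y['target'])  # fields are addressed by name
```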
But this is not extremely intuitive to set up; and it destroys the plan to use sparse matrices, while welcoming back the idea of sequences of sequences …

Structured arrays are a dead end, I believe. Dictionary of arrays would … I don't care that much; I find that pandas is too limited for these …

I would be +1 on supporting both dictionary of arrays and pandas for y. Possible standard columns for y: …
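A sketch of what such a y could look like (the column names here are invented for illustration, not the standard set the comment alludes to):

```python
import numpy as np

# y as a dictionary of aligned per-sample arrays.
y = {
    'target': np.array([0, 1, 0, 1]),                # what the estimator predicts
    'sample_weight': np.array([1.0, 0.5, 1.0, 0.5]),
    'labels': np.array([1, 1, 2, 2]),                # group ids for label-aware CV
}

# The same content as a pandas DataFrame:
#   import pandas as pd
#   y = pd.DataFrame(y)
```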
Of course we would be backward compatible: if …

My above definition of the …

Is this proposal only for CV iterators or also for estimators?

That would be for both, but we could start with the CV iterators first, keeping in mind that we could generalize the approach.
A more minimal criterion might be support for callables … I think this would have to apply to CV iterators, estimators and scorers...
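One possible reading of "callable" support, sketched with an invented helper: cv could be any callable that takes y and yields (train, test) index arrays:

```python
import numpy as np

def odd_even_cv(y):
    # An invented cv callable: y in, (train, test) splits out.
    idx = np.arange(len(y))
    yield idx[idx % 2 == 0], idx[idx % 2 == 1]
    yield idx[idx % 2 == 1], idx[idx % 2 == 0]

for train, test in odd_even_cv(np.zeros(6)):
    print(train, test)
```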
Agreed. Since this is a fairly ambitious change, we could split the effort into two parts. First, remove data-dependent parameters from the constructors of CV iterators (except for labels in LOLO). This part is localized to …

I'm trying to work out what …
You raise good points :$. I don't have an answer to them, but it is true …

I don't believe so. For me, 'X' is what you are given in the 'predict' …

No. I would never put biggish data in a pandas data frame, or a … Also, I'd be worried about the asymmetry between fit and predict. And finally, I'd be worried about the ease with which leaks could be …
It looks like you can define this constraint by first expressing each constraint individually using a multi-output multi-class …

What would be the use-case for weight?

I don't think I understand this proposal. Standard CV tools (e.g. stratified k-fold) tend to split groups with the same label, so I'm not sure how this helps. Also, you would presumably want to decode this operation after CV splitting, which tools like …

The constraint is not to split samples from the same subject and the same month of the year, for instance. Thus, you could use, for instance, …
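A sketch of one way to encode that constraint (variable names invented): collapse the two factors into a single group id, then hand it to a label-based splitter such as LeaveOneLabelOut:

```python
import numpy as np

subjects = np.array([1, 1, 2, 2, 3, 3])
months = np.array([1, 2, 1, 2, 1, 2])

# One unique id per (subject, month) pair.
group_ids = subjects * (months.max() + 1) + months
print(group_ids)

# group_ids can now drive a label-based splitter, e.g.
# LeaveOneLabelOut(group_ids), so samples sharing a
# (subject, month) pair never straddle a train/test split.
```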
Right. Obviously we can assign a unique id to each group and use LOLO, but …

This would correspond to a leave-one-out approach. But you could use/implement other label (or id?) based validation such as …

What would be the problem with passing those at construction?

LeavePLabelOut only makes more folds. It's a combinatorial explosion of … Passing labels at construction is the sort of problem that we have with the …

One advantage of the data-independent CV iterators is that they are actually given the data. This means that we can include validation to ensure the …

I'm facing this situation in one of my pull requests: #2862 (comment). The problem is when … So I need to compute the new y (and hence the folds) outside the parallel loop? Is there any work-around?
Is this resolved?

I think the doc is not yet merged.

Can this be closed now?

Yes!

Hurray!