
data-independent CV iterators #2904


Closed
mblondel opened this issue Feb 26, 2014 · 35 comments

@mblondel
Member

In many situations you don't have a test set, so you would like to use CV for both evaluation and hyper-parameter tuning. Therefore, you need nested cross-validation:

for train, test in cv1:
    # Find the best hyper-parameters for this split
    for train, val in cv2:
        [...]
    # Retrain using the best hyper-parameters
    [...]
# Return best scores for each split

This is very difficult to implement in a generic way with our current API because CV iterators are tied to a particular dataset. For example, when doing cv = KFold(n_samples), cv will only work with a dataset of the specified size.

Ideally, we would need something closer to the estimator API: use constructor parameters for data-independent options (n_folds, shuffle, random_state, train / test proportion, etc.) and a run method that takes y as an argument (the reason for taking y is to support stratified schemes). This would look something like this:

# deprecated usage
for train, test in KFold(n, n_folds):
    print train, test

# new usage
for train, test in KFold(n_folds).run(y):
    print train, test
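
To make the nested loop from the top of this issue concrete, here is a minimal sketch of how it could look once iterators only carry data-independent parameters (the run method is the proposal above, not existing API; fit_and_score and candidate_params are hypothetical placeholders, and the iterators are assumed to yield index arrays):

import numpy as np

def nested_cv(X, y, candidate_params, fit_and_score, outer_cv, inner_cv):
    # Outer loop: unbiased evaluation; inner loop: hyper-parameter selection.
    outer_scores = []
    for train, test in outer_cv.run(y):
        best_score, best_params = -np.inf, None
        for params in candidate_params:
            # Inner indices are relative to the outer training set.
            scores = [fit_and_score(X[train][tr], y[train][tr],
                                    X[train][va], y[train][va], params)
                      for tr, va in inner_cv.run(y[train])]
            if np.mean(scores) > best_score:
                best_score, best_params = np.mean(scores), params
        # Retrain on the full outer training split with the best parameters.
        outer_scores.append(fit_and_score(X[train], y[train],
                                          X[test], y[test], best_params))
    return outer_scores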
@mblondel mblondel added API and removed API labels Feb 26, 2014
@GaelVaroquaux
Member

# deprecated usage
for train, test in KFold(n, n_folds):
    print train, test

# new usage
for train, test in KFold(n_folds).run(y):
    print train, test

+1e6!!!

I would argue for a different name for the method (maybe 'split'), but
the spirit is really the right one.

As this will be a major API change, I would think that it would be good
to do it together with #2055. (FIXME: wrong issue number. I am trying to find the one I have in mind).

@GaelVaroquaux
Member

As this will be a major API change, I would think that it would be good
to do it together with #2055.

I actually meant: #1848

@jnothman
Member

+1

We still have a question of how to handle the labels parameter for LOLO (LeaveOneLabelOut) in cross_val_score or GridSearchCV, but in general I think this is a much better interface.

@mblondel
Member Author

split is fine with me.

@agramfort
Member

I would vote for iter_splits

@GaelVaroquaux
Member

We still have a question of how to handle the labels parameter for LOLO in
cross_val_score or GridSearchCV,

KFold.split(arrays)

with specific names on arrays? Thus, if one array is 'labels', this works.
The danger is that a typo in an array name would not be caught. But it's very
generic.
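
A minimal sketch of what such a keyword-based split could look like, using LeaveOneLabelOut as an example (hypothetical signature, written only to illustrate the idea and the silent-typo danger mentioned above):

import numpy as np

class LeaveOneLabelOut:
    def split(self, labels=None, **arrays):
        # Any extra named array (e.g. a misspelled 'lables') would simply be
        # ignored here, which is the typo danger mentioned above.
        labels = np.asarray(labels)
        for label in np.unique(labels):
            yield np.flatnonzero(labels != label), np.flatnonzero(labels == label)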

@jnothman
Member

Naming labels for the CV generator is fine. Getting GridSearchCV to pass
labels onto the CV generator's .split is another matter.


@GaelVaroquaux
Member

Naming labels for the CV generator is fine. Getting GridSearchCV to pass
labels onto the CV generator's .split is another matter.

Yes, that's part of the bigger picture problem. You are right that they
need to be tackled together.

@ogrisel and myself have been brainstorming on the idea of allowing y to
be a dictionary of arrays or a pandas data frame.

@mblondel
Member Author

Please share / elaborate the ideas developed :)

@GaelVaroquaux
Member

Please share / elaborate the ideas developed :)

Nothing much more than what I said above. y could be a pandas data
frame or a dict of arrays (we would still accept a simple array). These
arrays would all be of length n_samples, and would be sliced and diced
during cross-validation.

They would be useful for adding any meta-information that describes samples,
such as sample weights or labels for stratification. An open question is:
how to deal with multi-output or multi-label targets?

@larsmans
Member

Ping myself. This is a good idea.

@jnothman
Member

Open questions are: how to deal with multi output or multi-label?

I don't know about pandas, but FWIW numpy recarrays / struct arrays can deal with such structures, if need be:

>>> import numpy as np
>>> a = np.array([([0, 0, 1], .5), ([1, 0, 0], .25)],
...              dtype=[('y', 'i', 3), ('weight', 'f')])
>>> a
array([([0, 0, 1], 0.5), ([1, 0, 0], 0.25)],
      dtype=[('y', '<i4', (3,)), ('weight', '<f4')])
>>> a['y']
array([[0, 0, 1],
       [1, 0, 0]], dtype=int32)
>>> a['weight']
array([ 0.5 ,  0.25], dtype=float32)

But this is not extremely intuitive to set up; and it destroys the plan to use sparse matrices, while welcoming back the idea of sequences of sequences:

>>> a = np.array([([1], .5), ([0, 1], .25)], dtype=[('y', 'O'), ('weight', 'f')])
>>> a
array([([1], 0.5), ([0, 1], 0.25)],
      dtype=[('y', 'O'), ('weight', '<f4')])

@GaelVaroquaux
Member

I don't know about pandas, but FWIW numpy recarrays / struct arrays can deal
with such structures, if need be:

Structured arrays are a dead end, I believe. A dictionary of arrays would
work here, but it means that pandas couldn't be used.

I don't care that much; I find that pandas is too limited for these
use cases.

@ogrisel
Member

ogrisel commented Feb 28, 2014

I would be +1 on supporting both dictionary of arrays and pandas for y. Possible standard columns for y:

  • target (the regular target variable for supervised learning: could be a 1D or 2D array of floats for regression (multi-output or not), a 1D array of integers for classification, or a 2D sparse indicator matrix for multi-label classification).
  • group (an array of integers identifying groups of samples that should not be split across a train / test split, and that could have a special use for some estimators and scoring functions that consider groups of samples at once, such as learning to rank).
  • weight (an array of floats, as the current sample_weight array).
  • uid (a unique identifier, e.g. a unique integer or a unique string with dtype=object for each sample in the original training set, to trace the provenance of the sample and make debugging easier).

y could also contain additional domain-specific metadata that would not have any special meaning to scikit-learn but would be preserved alongside the matching samples when doing CV splits, pipeline transforms, resampling and so on.
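
Roughly, such a y and the per-split slicing it implies could look like this (purely a sketch following the column names proposed above; slice_y is a made-up helper, not scikit-learn API):

import numpy as np

y = {
    "target": np.array([0, 1, 1, 0, 1, 0]),                   # supervised target
    "group":  np.array([0, 0, 1, 1, 2, 2]),                   # unsplittable sample groups
    "weight": np.array([1.0, 0.5, 1.0, 2.0, 1.0, 1.0]),       # sample weights
    "uid":    np.array(["a1", "a2", "b1", "b2", "c1", "c2"],  # provenance identifiers
                       dtype=object),
}

def slice_y(y, indices):
    # Every per-sample column is sliced with the same indices, so the
    # metadata stays aligned with the samples after a CV split.
    return {name: column[indices] for name, column in y.items()}

train = np.array([0, 1, 2, 3])
print(slice_y(y, train)["uid"])   # ['a1' 'a2' 'b1' 'b2']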

@ogrisel
Member

ogrisel commented Feb 28, 2014

Of course we would stay backward compatible: if y is not a dict of arrays or a pandas data frame, we would consider it to be the target variable, as is currently the case.

@ogrisel
Member

ogrisel commented Feb 28, 2014

My above definition of the group column might be too simplistic though. In some cases you want to be able to support more complex types of cross-validation: for instance, do not split samples that come from the same subject and the same month of the year. So CV schemes could probably be parametrized by several y column names to handle multi-dimensional sample grouping constraints.

@mblondel
Member Author

I would be +1 on supporting both dictionary of arrays and pandas for y

Is this proposal only for CV iterators or also for estimators?

@ogrisel
Member

ogrisel commented Feb 28, 2014

Is this proposal only for CV iterators or also for estimators?

That would be for both, but we could start with the CV iterators first, keeping in mind that we could generalize the approach later.

@jnothman
Member

jnothman commented Mar 1, 2014

I would be +1 on supporting both dictionary of arrays and pandas for y.

A more minimal criterion might be support for a callable keys() and
__getitem__ (note this excludes struct arrays). You can then treat both
the same, and turn a DataFrame into a dict of arrays (although slicing rows
of a DataFrame may be more efficient and returns a DataFrame).

I think this would have to apply to CV iterators, estimators and scorers...
if not metrics. It seems a fairly substantial departure from numpy/scipy
API conventions, and their means of documentation. Maintaining API clarity
will be an interesting challenge.
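
A sketch of that minimal criterion (as_column_dict is a made-up helper name; it only assumes the object exposes a callable keys() and __getitem__, which both a dict of arrays and a pandas DataFrame do):

import numpy as np

def as_column_dict(y):
    # Dict-like or DataFrame-like: normalize to a dict of numpy columns.
    if hasattr(y, "keys") and callable(y.keys) and hasattr(y, "__getitem__"):
        return {key: np.asarray(y[key]) for key in y.keys()}
    # Plain array: keep backward compatibility by treating it as the target.
    return {"target": np.asarray(y)}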


@mblondel
Member Author

mblondel commented Mar 3, 2014

I think this would have to apply to CV iterators, estimators and scorers... if not metrics

Agreed. Since this is a fairly ambitious change, we could split the effort into two parts. First, remove data-dependent parameters from the constructors of the CV iterators (except for labels in LOLO). This part is localized to cross_validation.py and grid_search.py. Second, support a dictionary for y. This part affects the whole code base.

@jnothman
Member

jnothman commented Mar 3, 2014

I'm trying to work out what X and y now mean. Is "y is None iff no
supervision" still the case? Or can one train an unsupervised classifier
with y={'weight': [...]}? Should 'weight' really be part of X? Should
X and y then be merged? Should the two variables distinguish between
something tangible, like predict-time-observed (though not always used) and
unobserved data?


@GaelVaroquaux
Member

I'm trying to work out what X and y now mean. Is "y is None iff no
supervision" still the case? Or can one train an unsupervised classifier
with y={'weight': [...]}?

You raise good points :$. I don't have an answer to them, but it is true
that I do a lot of unsupervised learning with a label structure.

Should 'weight' really be part of X?

I don't believe so. For me, 'X' is what you are given in the 'predict'
problem: once you have the classifier and want to work on new data.
Chances are that you do not have weights in such a situation.

Should X and y then be merged?

No. I would never put biggish data in a pandas data frame or a
dictionary of arrays. These things don't scale terribly well in the
number of features. For instance, how would you put sparse data in there?

Also, I'd be worried about the asymmetry between fit and predict.

And finally, I'd be worried about the ease with which leaks could be
introduced between training and testing: it would be too easy to stick
labeling information into X.

@arjoly
Member

arjoly commented Apr 28, 2014

In some cases you want to be able to support more complex types of cross-validation: do not split samples that come from the same subject and the same month-of-the-year for instance.

It looks like you can define this constraint by first expressing each constraint individually using a multi-output multi-class y and then transforming those class sets using a label power-set-like transformation (as in #2461, which does this for multilabel). The obtained y would be one-dimensional and you could apply standard cross-validation tools.
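
For instance (a minimal sketch of such an encoding, assuming each grouping constraint comes as one array of per-sample labels):

import numpy as np

subject = np.array([1, 1, 1, 2, 2, 2])
month   = np.array([1, 1, 2, 1, 2, 2])

# Each unique (subject, month) combination becomes a single group label,
# so a label-based CV scheme keeps those samples in the same fold.
_, combined = np.unique(np.column_stack([subject, month]),
                        axis=0, return_inverse=True)
print(combined)   # [0 0 1 2 3 3]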

I would be +1 on supporting both dictionary of arrays and pandas for y. Possible standard columns for y:

  • target (the regular target variable for supervised learning: could be a 1D or 2D array of floats for regression (multi-output or not), a 1D array of integers for classification, or a 2D sparse indicator matrix for multi-label classification).
  • group (an array of integers identifying groups of samples that should not be split across a train / test split, and that could have a special use for some estimators and scoring functions that consider groups of samples at once, such as learning to rank).
  • weight (an array of floats, as the current sample_weight array).
  • uid (a unique identifier, e.g. a unique integer or a unique string with dtype=object for each sample in the original training set, to trace the provenance of the sample and make debugging easier).

What would be the use-case for weight?

@jnothman
Member

In some cases you want to be able to support more complex types of cross-validation: do not split samples that come from the same subject and the same month-of-the-year for instance.

It looks like you can define this constraint by first expressing each constraint individually using a multi-output multi-class y and then transforming those class sets using a label power-set-like transformation (as in #2461, which does this for multilabel). The obtained y would be one-dimensional and you could apply standard cross-validation tools.

I don't think I understand this proposal. Standard CV tools (e.g. stratified k fold) tend to split groups with the same label, so I'm not sure how this helps. Also, you would presumably want to decode this operation after CV splitting, which tools like cross_val_score certainly do not handle nicely.

@arjoly
Member

arjoly commented Apr 29, 2014

The constraint is not to split samples from the same subject and the same month of the year. Thus, you could use LeaveOneLabelOut, for instance.

@jnothman
Member

Right. Obviously we can assign a unique id to each group and use LOLO, but
that's a lot of folds! And with data-independent CV iterators, we still
have no way to pass labels for LOLO, except perhaps at construction time.


@arjoly
Member

arjoly commented Apr 29, 2014

... , but that's a lot of folds!

This would correspond to a leave-one-out approach. But you could use/implement other label- (or id-) based validation schemes such as LeavePLabelOut.

And with data-independent CV iterators, we still
have no way to pass labels for LOLO, except perhaps at construction time.

What would be the problem with passing those at construction?

@jnothman
Member

LeavePLabelOut only makes more folds. It's a combinatorial explosion of
LeaveOneLabelOut. LeaveOneLabelOut isn't such a problem here really; one
just needs to arbitrarily merge some labels and hope the folds are still
reasonable.

Passing labels at construction is the sort of problem that we have with the
current CV generators, which, for instance, can't be nested. We'd like to be
able to do a grid search within a cross_val_score. If we pass something
like MyCVGenerator(unsplittable_group_ids) to cross_val_score, which
then calls .get_splits(X, y), then we can't similarly exploit
unsplittable_group_ids for the grid search cross-validation... Or we could
use MyCVGenerator().get_splits(X, y, unsplittable_group_ids), which is fine
as long as there's a nice way to communicate additional arguments to
cross_val_score and from there to the grid search. I don't think that's
very clear, but it's the issue that is being considered here.
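
A small runnable sketch of the two options being contrasted (MyCVGenerator and get_splits are the hypothetical names used above, not existing scikit-learn API):

import numpy as np

class MyCVGenerator:
    def __init__(self, group_ids=None):
        self.group_ids = group_ids      # option 1: ids fixed at construction

    def get_splits(self, X, y, group_ids=None):
        # Option 2: ids passed at split time take precedence.
        ids = np.asarray(group_ids if group_ids is not None else self.group_ids)
        for g in np.unique(ids):
            yield np.flatnonzero(ids != g), np.flatnonzero(ids == g)

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

# Construction-time ids (option 1) cannot follow a nested split: an inner
# split over X[train] would still see the full-length ids.
# Split-time ids (option 2) can be sliced alongside the data, so nesting
# works, provided cross_val_score / GridSearchCV forward and slice them.
cv = MyCVGenerator()
for train, test in cv.get_splits(X, y, groups):
    inner = list(cv.get_splits(X[train], y[train], groups[train]))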


@jnothman
Member

jnothman commented May 3, 2014

One advantage of the data-independent CV iterators is that, at split time, they are actually given the data. This means that we can include validation to ensure the y passed to StratifiedKFold makes sense (i.e. is not multilabel), to avoid issues like #3128.
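
For example, a check along these lines becomes possible inside split (a sketch; type_of_target is an existing scikit-learn utility, the rest is illustrative):

from sklearn.utils.multiclass import type_of_target

def check_stratification_target(y):
    # Stratification only makes sense for binary or multiclass targets.
    target_type = type_of_target(y)
    if target_type not in ("binary", "multiclass"):
        raise ValueError("StratifiedKFold expects a binary or multiclass "
                         "target, got %r" % target_type)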

@MechCoder
Member

I'm facing this situation in one of my Pull Requests #2862 (comment) .

The problem is that when y is multi-class in a one-vs-all setting and I need to cross-validate (with the default StratifiedKFold), the folds generated are different for each class.

So do I need to compute the new y (and hence the folds) outside the Parallel loop? Is there any workaround?
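
One possible workaround (a sketch only, using the cross_validation module that existed at the time of this thread): compute the folds once from the original multiclass y, outside the per-class loop, and reuse the same indices for every one-vs-all subproblem.

import numpy as np
from sklearn.cross_validation import StratifiedKFold

y = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2])
folds = list(StratifiedKFold(y, n_folds=3))        # computed once on the full y

for klass in np.unique(y):
    y_binary = (y == klass).astype(int)            # one-vs-all target
    for train, test in folds:                      # identical folds for every class
        pass  # fit / score the binary subproblem here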

@jnothman
Member

Is this resolved?

@arjoly
Member

arjoly commented Oct 24, 2015

I think the documentation is not yet merged.

@mblondel
Member Author

Can this be closed now?

@raghavrv
Member

Yes!

@GaelVaroquaux
Member

Hurray!
