
data-independent CV iterators #2904


Closed
mblondel opened this issue Feb 26, 2014 · 35 comments

@mblondel
Member

In many situations you don't have a test set, so you would like to use CV for both evaluation and hyper-parameter tuning. Therefore, you need nested cross-validation:

for train, test in cv1:
    # Find the best hyper-parameters for this split
    for train, val in cv2:
        [...]
    # Retrain using the best hyper-parameters
    [...]
# Return best scores for each split

This is very difficult to implement in a generic way with our current API because CV iterators are tied to a particular dataset. For example, when doing cv = KFold(n_samples), cv will only work with a dataset of the specified size.

Ideally, we would need something closer to the estimator API: use constructor parameters for data-independent options (n_folds, shuffle, random_state, train / test proportion, etc.) and a run method that takes y as an argument (the reason for taking y is to support stratified schemes). This would look something like this:

# deprecated usage
for train, test in KFold(n, n_folds):
    print train, test

# new usage
for train, test in KFold(n_folds).run(y):
    print train, test
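
To make the nested loop from the top of this issue concrete, here is a minimal sketch of how it could look once iterators only carry data-independent parameters (the run method is the proposal above, not existing API; fit_and_score and candidate_params are hypothetical placeholders, and the iterators are assumed to yield index arrays):

import numpy as np

def nested_cv(X, y, candidate_params, fit_and_score, outer_cv, inner_cv):
    # Outer loop: unbiased evaluation; inner loop: hyper-parameter selection.
    outer_scores = []
    for train, test in outer_cv.run(y):
        best_score, best_params = -np.inf, None
        for params in candidate_params:
            # Inner indices are relative to the outer training set.
            scores = [fit_and_score(X[train][tr], y[train][tr],
                                    X[train][va], y[train][va], params)
                      for tr, va in inner_cv.run(y[train])]
            if np.mean(scores) > best_score:
                best_score, best_params = np.mean(scores), params
        # Retrain on the full outer training split with the best parameters.
        outer_scores.append(fit_and_score(X[train], y[train],
                                          X[test], y[test], best_params))
    return outer_scores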
@mblondel mblondel added API and removed API labels Feb 26, 2014
@GaelVaroquaux
Member

# deprecated usage
for train, test in KFold(n, n_folds):
    print train, test

# new usage
for train, test in KFold(n_folds).run(y):
    print train, test

+1e6!!!

I would argue for a different name for the method (maybe 'split'), but
the spirit is really the right one.

As this will be a major API change, I would think that it would be good
to do it together with #2055. (FIXME: wrong issue number. I am trying to find the one I have in mind).

@GaelVaroquaux
Member

As this will be a major API change, I would think that it would be good
to do it together with #2055.

I actually meant: #1848

@jnothman
Member

+1

We still have a question of how to handle the labels parameter for LOLO (LeaveOneLabelOut) in cross_val_score or GridSearchCV, but in general I think this is a much better interface.

@mblondel
Member Author

split is fine with me.

@agramfort
Member

I would vote for iter_splits

@GaelVaroquaux
Member

We still have a question of how to handle the labels parameter for LOLO in
cross_val_score or GridSearchCV,

KFold.split(arrays)

with specific names on arrays? Thus, if one array is 'labels', this works.
The danger is that a typo in an array name would not be caught. But it's very
generic.
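
A minimal sketch of what such a keyword-based split could look like, using LeaveOneLabelOut as an example (hypothetical signature, written only to illustrate the idea and the silent-typo danger mentioned above):

import numpy as np

class LeaveOneLabelOut:
    def split(self, labels=None, **arrays):
        # Any extra named array (e.g. a misspelled 'lables') would simply be
        # ignored here, which is the typo danger mentioned above.
        labels = np.asarray(labels)
        for label in np.unique(labels):
            yield np.flatnonzero(labels != label), np.flatnonzero(labels == label)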

@jnothman
Member

Naming labels for the CV generator is fine. Getting GridSearchCV to pass
labels onto the CV generator's .split is another matter.


@GaelVaroquaux
Member

Naming labels for the CV generator is fine. Getting GridSearchCV to pass
labels onto the CV generator's .split is another matter.

Yes, that's part of the bigger picture problem. You are right that they
need to be tackled together.

@ogrisel and myself have been brainstorming on the idea of allowing y to
be a dictionary of arrays or a pandas data frame.

@mblondel
Member Author

Please share / elaborate the ideas developed :)

@GaelVaroquaux
Member

Please share / elaborate the ideas developed :)

Nothing much more than what I said above. y could be a pandas data
frame or a dict of arrays (we would still accept a simple array). These
arrays would all be of length n_samples, and would be sliced and diced
during cross-validation.

They would be useful for adding any meta-information that describes samples,
such as sample weights or labels for stratification. An open question is:
how to deal with multi-output or multi-label targets?

@larsmans
Member

Ping myself. This is a good idea.

@jnothman
Member

Open questions are: how to deal with multi output or multi-label?

I don't know about pandas, but FWIW numpy recarrays / struct arrays can deal with such structures, if need be:

>>> import numpy as np
>>> a = np.array([([0, 0, 1], .5), ([1, 0, 0], .25)],
...              dtype=[('y', 'i', 3), ('weight', 'f')])
>>> a
array([([0, 0, 1], 0.5), ([1, 0, 0], 0.25)],
      dtype=[('y', '<i4', (3,)), ('weight', '<f4')])
>>> a['y']
array([[0, 0, 1],
       [1, 0, 0]], dtype=int32)
>>> a['weight']
array([ 0.5 ,  0.25], dtype=float32)

But this is not extremely intuitive to set up; and it destroys the plan to use sparse matrices, while welcoming back the idea of sequences of sequences:

>>> a = np.array([([1], .5), ([0, 1], .25)], dtype=[('y', 'O'), ('weight', 'f')])
>>> a
array([([1], 0.5), ([0, 1], 0.25)],
      dtype=[('y', 'O'), ('weight', '<f4')])

@GaelVaroquaux
Member

I don't know about pandas, but FWIW numpy recarrays / struct arrays can deal
with such structures, if need be:

Structured arrays are a dead end, I believe. A dictionary of arrays would
work here, but it means that pandas couldn't be used.

I don't care that much; I find that pandas is too limited for these
use cases.

@ogrisel
Member

ogrisel commented Feb 28, 2014

I would be +1 on supporting both dictionary of arrays and pandas for y. Possible standard columns for y:

  • target (the regular target variable for supervised learning: could be a 1D or 2D array of floats for regression (multi-output or not), a 1D array of integers for classification, or a 2D sparse indicator matrix for multi-label classification).
  • group (an array of integers identifying groups of samples that should not be split across a train / test split, and that could have a special use for some estimators and scoring functions that consider groups of samples at once, such as learning to rank).
  • weight (an array of floats, as the current sample_weight array).
  • uid (a unique identifier, e.g. a unique integer or a unique string with dtype=object for each sample in the original training set, to trace the provenance of the sample and make debugging easier).

y could also contain additional domain-specific metadata that would not have any special meaning to scikit-learn but would be preserved alongside the matching samples when doing CV splits, pipeline transforms, resampling and so on.
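
Roughly, such a y and the per-split slicing it implies could look like this (purely a sketch following the column names proposed above; slice_y is a made-up helper, not scikit-learn API):

import numpy as np

y = {
    "target": np.array([0, 1, 1, 0, 1, 0]),                   # supervised target
    "group":  np.array([0, 0, 1, 1, 2, 2]),                   # unsplittable sample groups
    "weight": np.array([1.0, 0.5, 1.0, 2.0, 1.0, 1.0]),       # sample weights
    "uid":    np.array(["a1", "a2", "b1", "b2", "c1", "c2"],  # provenance identifiers
                       dtype=object),
}

def slice_y(y, indices):
    # Every per-sample column is sliced with the same indices, so the
    # metadata stays aligned with the samples after a CV split.
    return {name: column[indices] for name, column in y.items()}

train = np.array([0, 1, 2, 3])
print(slice_y(y, train)["uid"])   # ['a1' 'a2' 'b1' 'b2']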

@ogrisel
Member

ogrisel commented Feb 28, 2014

Of course we would stay backward compatible: if y is not a dict of arrays or a pandas data frame, we would consider it to be the target variable, as is currently the case.

@ogrisel
Member

ogrisel commented Feb 28, 2014

My above definition of the group column might be too simplistic though. In some cases you want to be able to support more complex types of cross-validation: for instance, do not split samples that come from the same subject and the same month of the year. So CV schemes could probably be parametrized by several y column names to handle multi-dimensional sample grouping constraints.

@mblondel
Member Author

I would be +1 on supporting both dictionary of arrays and pandas for y

Is this proposal only for CV iterators or also for estimators?

@ogrisel
Member

ogrisel commented Feb 28, 2014

Is this proposal only for CV iterators or also for estimators?

That would be for both, but we could start with the CV iterators first, keeping in mind that we could generalize the approach later.

@jnothman
Member

jnothman commented Mar 1, 2014

I would be +1 on supporting both dictionary of arrays and pandas for y.

A more minimal criterion might be support for a callable keys() and
__getitem__ (note this excludes struct arrays). You can then treat both
the same, and turn a DataFrame into a dict of arrays (although slicing rows
of a DataFrame may be more efficient and returns a DataFrame).

I think this would have to apply to CV iterators, estimators and scorers...
if not metrics. It seems a fairly substantial departure from numpy/scipy
API conventions, and their means of documentation. Maintaining API clarity
will be an interesting challenge.
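
A sketch of that minimal criterion (as_column_dict is a made-up helper name; it only assumes the object exposes a callable keys() and __getitem__, which both a dict of arrays and a pandas DataFrame do):

import numpy as np

def as_column_dict(y):
    # Dict-like or DataFrame-like: normalize to a dict of numpy columns.
    if hasattr(y, "keys") and callable(y.keys) and hasattr(y, "__getitem__"):
        return {key: np.asarray(y[key]) for key in y.keys()}
    # Plain array: keep backward compatibility by treating it as the target.
    return {"target": np.asarray(y)}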


@mblondel
Member Author

mblondel commented Mar 3, 2014

I think this would have to apply to CV iterators, estimators and scorers... if not metrics

Agreed. Since this is a fairly ambitious change, we could split the effort into two parts. First, remove data-dependent parameters from the constructors of the CV iterators (except for labels in LOLO). This part is localized to cross_validation.py and grid_search.py. Second, support a dictionary for y. This part affects the whole code base.

@jnothman
Member

jnothman commented Mar 3, 2014

I'm trying to work out what X and y now mean. Is "y is None iff no
supervision" still the case? Or can one train an unsupervised classifier
with y={'weight': [...]}? Should 'weight' really be part of X? Should
X and y then be merged? Should the two variables distinguish between
something tangible, like predict-time-observed (though not always used) and
unobserved data?


@GaelVaroquaux
Member

I'm trying to work out what X and y now mean. Is "y is None iff no
supervision" still the case? Or can one train an unsupervised classifier
with y={'weight': [...]}?

You raise good points :$. I don't have an answer to them, but it is true
that I do a lot of unsupervised learning with a label structure.

Should 'weight' really be part of X?

I don't believe so. For me, 'X' is what you are given in the 'predict'
problem: once you have the classifier and want to work on new data.
Chances are that you do not have weights in such a situation.

Should X and y then be merged?

No. I would never put biggish data in a pandas data frame or a
dictionary of arrays. These things don't scale terribly well in the
number of features. For instance, how would you put sparse data in there?

Also, I'd be worried about the asymmetry between fit and predict.

And finally, I'd be worried about the ease with which leaks could be
introduced between training and testing: it would be too easy to stick
labeling information into X.

@arjoly
Member

arjoly commented Apr 28, 2014

In some cases you want to be able to support more complex types of cross-validation: do not split samples that come from the same subject and the same month-of-the-year for instance.

It looks like you can define this constraint by first expressing each constraint individually using a multi-output multi-class y and then transforming those class sets using a label power-set-like transformation (as in #2461, which does this for multilabel). The obtained y would be one-dimensional and you could apply standard cross-validation tools.
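
For instance (a minimal sketch of such an encoding, assuming each grouping constraint comes as one array of per-sample labels):

import numpy as np

subject = np.array([1, 1, 1, 2, 2, 2])
month   = np.array([1, 1, 2, 1, 2, 2])

# Each unique (subject, month) combination becomes a single group label,
# so a label-based CV scheme keeps those samples in the same fold.
_, combined = np.unique(np.column_stack([subject, month]),
                        axis=0, return_inverse=True)
print(combined)   # [0 0 1 2 3 3]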

I would be +1 on supporting both dictionary of arrays and pandas for y. Possible standard columns for y:

  • target (the regular target variable for supervised learning: could be a 1D or 2D array of floats for regression (multi-output or not), a 1D array of integers for classification, or a 2D sparse indicator matrix for multi-label classification).
  • group (an array of integers identifying groups of samples that should not be split across a train / test split, and that could have a special use for some estimators and scoring functions that consider groups of samples at once, such as learning to rank).
  • weight (an array of floats, as the current sample_weight array).
  • uid (a unique identifier, e.g. a unique integer or a unique string with dtype=object for each sample in the original training set, to trace the provenance of the sample and make debugging easier).

What would be the use-case for weight?

@jnothman
Member

In some cases you want to be able to support more complex types of cross-validation: do not split samples that come from the same subject and the same month-of-the-year for instance.

It looks like you can define this constraint by first expressing each constraint individually using a multi-output multi-class y and then transforming those class sets using a label power-set-like transformation (as in #2461, which does this for multilabel). The obtained y would be one-dimensional and you could apply standard cross-validation tools.

I don't think I understand this proposal. Standard CV tools (e.g. stratified k fold) tend to split groups with the same label, so I'm not sure how this helps. Also, you would presumably want to decode this operation after CV splitting, which tools like cross_val_score certainly do not handle nicely.

@arjoly
Member

arjoly commented Apr 29, 2014

The constraint is not to split samples from the same subject and the same month of the year. Thus, you could use LeaveOneLabelOut, for instance.

@jnothman
Member

Right. Obviously we can assign a unique id to each group and use LOLO, but
that's a lot of folds! And with data-independent CV iterators, we still
have no way to pass labels for LOLO, except perhaps at construction time.


@arjoly
Member

arjoly commented Apr 29, 2014

... , but that's a lot of folds!

This would correspond to a leave-one-out approach. But you could use/implement other label- (or id-) based validation schemes such as LeavePLabelOut.

And with data-independent CV iterators, we still
have no way to pass labels for LOLO, except perhaps at construction time.

What would be the problem with passing those at construction?

@jnothman
Member

LeavePLabelOut only makes more folds. It's a combinatorial explosion of
LeaveOneLabelOut. LeaveOneLabelOut isn't such a problem here really; one
just needs to arbitrarily merge some labels and hope the folds are still
reasonable.

Passing labels at construction is the sort of problem that we have with the
current CV generators, which, for instance, can't be nested. We'd like to be
able to do a grid search within a cross_val_score. If we pass something
like MyCVGenerator(unsplittable_group_ids) to cross_val_score, which
then calls .get_splits(X, y), then we can't similarly exploit
unsplittable_group_ids for the grid search cross-validation... Or we could
use MyCVGenerator().get_splits(X, y, unsplittable_group_ids), which is fine
as long as there's a nice way to communicate additional arguments to
cross_val_score and from there to the grid search. I don't think that's
very clear, but it's the issue that is being considered here.
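
A small runnable sketch of the two options being contrasted (MyCVGenerator and get_splits are the hypothetical names used above, not existing scikit-learn API):

import numpy as np

class MyCVGenerator:
    def __init__(self, group_ids=None):
        self.group_ids = group_ids      # option 1: ids fixed at construction

    def get_splits(self, X, y, group_ids=None):
        # Option 2: ids passed at split time take precedence.
        ids = np.asarray(group_ids if group_ids is not None else self.group_ids)
        for g in np.unique(ids):
            yield np.flatnonzero(ids != g), np.flatnonzero(ids == g)

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

# Construction-time ids (option 1) cannot follow a nested split: an inner
# split over X[train] would still see the full-length ids.
# Split-time ids (option 2) can be sliced alongside the data, so nesting
# works, provided cross_val_score / GridSearchCV forward and slice them.
cv = MyCVGenerator()
for train, test in cv.get_splits(X, y, groups):
    inner = list(cv.get_splits(X[train], y[train], groups[train]))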


@jnothman
Member

jnothman commented May 3, 2014

One advantage of the data-independent CV iterators is that, at split time, they are actually given the data. This means that we can include validation to ensure the y passed to StratifiedKFold makes sense (i.e. is not multilabel), to avoid issues like #3128.
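
For example, a check along these lines becomes possible inside split (a sketch; type_of_target is an existing scikit-learn utility, the rest is illustrative):

from sklearn.utils.multiclass import type_of_target

def check_stratification_target(y):
    # Stratification only makes sense for binary or multiclass targets.
    target_type = type_of_target(y)
    if target_type not in ("binary", "multiclass"):
        raise ValueError("StratifiedKFold expects a binary or multiclass "
                         "target, got %r" % target_type)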

@MechCoder
Member

I'm facing this situation in one of my Pull Requests #2862 (comment) .

The problem is that when y is multi-class in a one-vs-all setting and I need to cross-validate (with the default StratifiedKFold), the folds generated are different for each class.

So do I need to compute the new y (and hence the folds) outside the Parallel loop? Is there any workaround?
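
One possible workaround (a sketch only, using the cross_validation module that existed at the time of this thread): compute the folds once from the original multiclass y, outside the per-class loop, and reuse the same indices for every one-vs-all subproblem.

import numpy as np
from sklearn.cross_validation import StratifiedKFold

y = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2])
folds = list(StratifiedKFold(y, n_folds=3))        # computed once on the full y

for klass in np.unique(y):
    y_binary = (y == klass).astype(int)            # one-vs-all target
    for train, test in folds:                      # identical folds for every class
        pass  # fit / score the binary subproblem here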

@jnothman
Member

Is this resolved?

@arjoly
Member

arjoly commented Oct 24, 2015

I think the documentation is not yet merged.

@mblondel
Member Author

Can this be closed now?

@raghavrv
Member

Yes!

@GaelVaroquaux
Member

Hurray!
