model_selection.StratifiedKFold should not require the data array to simply return split indices. #7126


Closed
Erotemic opened this issue Aug 1, 2016 · 28 comments
Labels
Documentation, Easy (Well-defined and straightforward way to resolve)

Comments

@Erotemic (Contributor)

Erotemic commented Aug 1, 2016

When importing sklearn.cross_validation I was shown a DeprecationWarning
saying I should use sklearn.model_selection instead. To keep my code up to date, I switched to the new module, but then I encountered an odd behavior.

The place in my code where I generate the cross validation indices does not have access to the data array. However, in the new version you must supply the entire dataset simply to generate the indices of the test/train split. Previously all that was needed was the labels.

Forcing the developer to supply a data array is a problem when you have large amounts of high-dimensional data and you want to load only the subset needed by the current cross validation run. Furthermore, I cannot think of a reason why X would be required by this process, nor can I see one in the scikit-learn code.

Here is a small piece of code demonstrating the issue.

    import numpy as np
    import sklearn.cross_validation
    import sklearn.model_selection
    y = np.array([0, 0, 1, 1, 1, 0, 0, 1])
    X = y.reshape(len(y), 1)

    # In the old version all that is needed is the labels
    skf_old = sklearn.cross_validation.StratifiedKFold(y, random_state=0)
    indices_old = list(skf_old)

    # The new version seems to require a data array for some reason
    skf_new = sklearn.model_selection.StratifiedKFold(random_state=0)
    indices_new = list(skf_new.split(X, y))

    # Causes an error, but there is no reason why X must be specified
    indices_new2 = list(skf_new.split(None, y))

Even if it is nice for the split signature to contain an X for compatibility reasons, I think you should at least be able to specify X as None. However, trying to set X=None results in a TypeError.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-7995a67b2df2> in <module>()
----> 1 indices_new2 = list(skf_new.split(None, y))

/home/joncrall/code/scikit-learn/sklearn/model_selection/_split.pyc in split(self, X, y, labels)
    312         """
    313         X, y, labels = indexable(X, y, labels)
--> 314         n_samples = _num_samples(X)
    315         if self.n_folds > n_samples:
    316             raise ValueError(

/home/joncrall/code/scikit-learn/sklearn/utils/validation.pyc in _num_samples(x)
    120         else:
    121             raise TypeError("Expected sequence or array-like, got %s" %
--> 122                             type(x))
    123     if hasattr(x, 'shape'):
    124         if len(x.shape) == 0:

TypeError: Expected sequence or array-like, got <type 'NoneType'>

It would be nice if there were either an alternative method like "split_indices(y)" that generates the indices using only the labels, or if the developer were able to specify X=None when calling split.

Version Info:

Linux-3.13.0-92-generic-x86_64-with-Ubuntu-14.04-trusty
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2]
NumPy 1.11.1
SciPy 0.18.0
Scikit-Learn 0.18.dev0

@agramfort (Member)

agramfort commented Aug 1, 2016 via email

@GaelVaroquaux (Member)

GaelVaroquaux commented Aug 1, 2016 via email

@agramfort (Member)

agramfort commented Aug 1, 2016 via email

@GaelVaroquaux (Member)

GaelVaroquaux commented Aug 1, 2016 via email

@agramfort (Member)

@raghavrv, since you did this refactoring, any thoughts?

@raghavrv (Member)

raghavrv commented Aug 1, 2016

Why can't you just set X = np.empty((y.shape[0], 1))?

Maybe we can document this in the docstring of X to guide people.
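The workaround suggested above can be sketched as follows. This is a minimal example with made-up labels, written against a current scikit-learn where the constructor argument is n_splits; the dummy X only needs the right number of rows, since its values are never inspected by StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels; only their length matters for the dummy X.
y = np.array([0, 0, 1, 1, 1, 0, 0, 1])
X_dummy = np.empty((y.shape[0], 1))  # placeholder; its contents are never read

skf = StratifiedKFold(n_splits=2)
for train_idx, test_idx in skf.split(X_dummy, y):
    print(train_idx, test_idx)
```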

@agramfort (Member)

agramfort commented Aug 1, 2016 via email

@amueller (Member)

amueller commented Aug 1, 2016

Forcing the developer to specify a data array is a problem when you have large amounts of high dimensional data and you want to wait to load only the subset of it needed by the current cross validation run.

Can you give more details on that? How did you do that?
This should be possible by having X be the indices to the data, and having a transformer that loads the data in a pipeline.
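That pattern might look roughly like the sketch below. All names here are hypothetical (data_store stands in for an on-disk source), and FunctionTransformer is used as a minimal loading transformer; the point is that X passed to the splitter is just an array of row keys, and the data is only materialized per fold inside the pipeline.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical external store; imagine this lives on disk or in a database.
rng = np.random.RandomState(0)
data_store = rng.randn(100, 5)
y = rng.randint(0, 2, size=100)

def load_rows(idx):
    # "X" flowing through the pipeline is an (n, 1) array of row keys;
    # only the rows needed by the current fold are loaded here.
    return data_store[idx.ravel()]

pipe = Pipeline([
    ("load", FunctionTransformer(load_rows)),
    ("clf", LogisticRegression()),
])

X_keys = np.arange(len(y)).reshape(-1, 1)  # indices play the role of X
scores = cross_val_score(pipe, X_keys, y, cv=3)
```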

@vene (Member)

vene commented Aug 1, 2016

There are definitely workarounds possible in user code, like the one @raghavrv suggested. If this is a common enough use case, however, I think we should not put the burden on the user, and allow X=None.

I'd argue that requiring X gives the illusion that X is somehow used.

@raghavrv (Member)

raghavrv commented Aug 1, 2016

I think we decided to go with split(X, y, labels) as it can be used for unsupervised tasks too.

Now if you want X=None to be set in StratifiedKFold alone, the signature must be split(y, X=None, labels=None) which is not consistent with the other CV splitters.

Or, if you are suggesting that we have (X=None, y=None, labels=None) and later make sure y is not None, users would be expected to know that X can be None, possibly by reading the docstring. In that case we can tell them to use a dummy X... and we'll have a more informative signature (X, y, labels=None) denoting that y cannot be None.

WDYT?

Also ping @jnothman

@vene (Member)

vene commented Aug 1, 2016

Yeah, that's a good point... You changed my mind; now I think we should keep the current signature. When X doesn't fit in memory, users are likely to apply the trick that @amueller described. If not, we can suggest the np.empty trick. I don't think this use case is frequent enough to warrant jumping through hoops in our input validation...

@Erotemic (Contributor Author)

Erotemic commented Aug 1, 2016

@raghavrv The first case in the second paragraph seems the most reasonable to me.

def split(self, X=None, y=None, labels=None):
    if X is None:
        assert y is not None, 'must specify at least X or y'
        X = np.empty((y.shape[0], 1))

I think X coming first in the signature is fine, but that doesn't mean it can't be a default argument.
Doing it this way, a user could simply use skf.split(y=y) or skf.split(None, y). This wouldn't break the unsupervised case, and it would be much more reasonable in the supervised case.

I think this solution is much cleaner than requiring a user to know that they need to create a dummy object to mirror X just so input validation works. It also requires much less effort in user code and makes the use of sklearn visually more elegant.

     # either
     for train_idx, test_idx in skf.split(y=y):
         pass
     # or
     for train_idx, test_idx in skf.split(None, y):
         pass
     # looks much more elegant than 
     for train_idx, test_idx in skf.split(np.empty((y.shape[0], 1)), y):
         pass

@raghavrv (Member)

raghavrv commented Aug 1, 2016

Doing it this way a user could simply use skf.split(y=y) or skf.split(None, y).

Thanks for the comment. But I have a feeling that people will then be motivated to try skf.split(y) and complain that it doesn't work while kfold.split(y) works.

I feel your use case of being unable to pass X is really rare. People would happily pass X and y, especially since X is neither copied nor modified. And this would continue to be the case as long as X is in the current namespace and has a shape of (n_samples, n_features).

People who don't wish to pass in X have somewhat hackish use cases, and they can be burdened a tiny bit with putting in an empty placeholder X as a compromise for a consistent and clear API with an informative signature.

But I accept your point on skf.split(y=y) being more elegant than skf.split(X=np.empty((y.shape[0], 1)), y=y) or skf.split(y, y). So I'd defer the judgement on what should be done to @jnothman, @vene, @amueller, @agramfort and @GaelVaroquaux.

Whatever the case, now would be the time to decide and make any change, before the API is released.

@agramfort (Member)

@Erotemic can you live for now with

X = np.empty((y.shape[0], 0))

?

I am not opposed to allowing X=None, but it's another round of dev / review etc. and I think we have other priorities.

@jnothman (Member)

jnothman commented Aug 2, 2016

@agramfort:

I am not opposed to X=None allowed but it's another round of dev / review etc. and I think we have other priorities.

But we are effectively piloting some substantial changes to a user interface that didn't require this, albeit broken in other ways. I think it's fair to carefully re-evaluate our design decisions at this point.

@raghavrv:

people will then be motivated to try skf.split(y) and complain that it doesn't work while kfold.split(y) works.

I think kfold.split(y) could be allowed to work on the basis of the unsupervised case anyway...?
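For comparison, plain KFold already needs only a single array, since it ignores labels entirely. A quick sketch with made-up data (assuming a current scikit-learn with the n_splits argument):

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.array([0, 0, 1, 1, 1, 0, 0, 1])
# KFold never looks at class labels, so passing y as the sole argument works.
for train_idx, test_idx in KFold(n_splits=2).split(y):
    print(train_idx, test_idx)
```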

I don't mind @Erotemic's suggestion.

@agramfort (Member)

But we are effectively piloting some substantial changes to a user
interface that didn't require this, albeit broken in other ways. I think
it's fair to carefully re-evaluate our design decisions at this point.

fair enough but we "just" need someone to re-evaluate. Let's give it a try
then. Are you on it?

@Erotemic (Contributor Author)

Erotemic commented Aug 2, 2016

@agramfort

I'm certainly able to live with it. The only reason I bring it up is because I feel like it would be an improvement to the API. I completely understand that this might not be a high priority, and I recognize the enormous amount of effort that it takes to methodically make these changes in such a widely used open source API.

If this change can make it into the next release that would be great, but if it needs to be tabled I can understand that as well. However, I still think this would be a good change to implement at some point in the future.

I'm not familiar with your process of making API changes with dev/review, but if all you need is a pull request with the implementation, docstring update, and associated tests, I can do the work of throwing that together.

@raghavrv (Member)

raghavrv commented Aug 2, 2016

Please go ahead and raise a PR!

@agramfort (Member)

I'm not familiar with your process of making API changes with dev/review,
but if all you need is a pull request with the implementation, docstring
update, and associated tests, I can do the work of throwing that together.

that would help. The model_selection module has not yet been released AFAIK,
so you can do it without the API deprecation mess

@amueller (Member)

amueller commented Aug 2, 2016

@Erotemic I think you are using the cv object in a way that's different from what we (I?) had in mind.
To me, it doesn't make sense to start splitting without having X.

I think this solution is much more clean than requiring a user to know that they need to create a dummy object to mirror X, just so input validation works.

That is true for your code, but I'd say that your code could easily be redesigned so as not to need that and will probably be cleaner.

Can you provide a full example of code where you would want to create the split without having X?

@Erotemic (Contributor Author)

Erotemic commented Aug 2, 2016

@amueller

Sure, but let me preface: this code isn't extremely clean and was put together in an interactive environment. It demonstrates the problem, but not as nicely as it could. As the code currently sits I could pass in the dataset as well, but my ultimate point is that I shouldn't need to, because generating cross validation indices does not depend on the data at all, especially when the same information can be introspected from the labels.

Here is a link to the script that caused me to run into the issue.
https://github.com/Erotemic/ibeis/blob/next/ibeis/scripts/classify_shark.py

Mirror in case I change anything:
https://gist.github.com/Erotemic/b694158a7637de42208d5b86852b4f9e

A more intuitive example (that I don't have full code for atm) would be the case where I can load individual data vectors from an SQLite database. Say I don't have enough room in memory to load everything, but I do have enough room to load just the training set or just the testing set. Without having the dataset loaded, I should simply be able to generate the cross validation indices and use those to SELECT the appropriate rows from my database. In this instance the entire dataset never exists as a single numpy array, but there is a need for generating cross validation indices.
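A rough sketch of that scenario, with an in-memory SQLite table standing in for the real database (the table name, schema, and the dummy-X workaround are all illustrative, not from the original code): only the labels are held in memory, the split indices are mapped to rowids, and each fold's rows are fetched with a SELECT.

```python
import sqlite3
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical database holding the feature rows; only labels stay in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (feat REAL, label INTEGER)")
rng = np.random.RandomState(0)
labels = rng.randint(0, 2, size=20)
conn.executemany("INSERT INTO samples (feat, label) VALUES (?, ?)",
                 [(float(rng.randn()), int(lab)) for lab in labels])

skf = StratifiedKFold(n_splits=2)
dummy_X = np.empty((len(labels), 1))  # placeholder to satisfy split's signature
for train_idx, test_idx in skf.split(dummy_X, labels):
    # SQLite rowids are 1-based and follow insertion order here,
    # so split index i maps to rowid i + 1.
    marks = ",".join("?" * len(train_idx))
    train_rows = conn.execute(
        "SELECT feat FROM samples WHERE rowid IN (%s)" % marks,
        [int(i) + 1 for i in train_idx]).fetchall()
    # train_rows now holds only this fold's training subset
```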

@raghavrv and @agramfort
I submitted a PR #7128

@amueller (Member)

amueller commented Aug 2, 2016

A more intuitive example (that I don't have full code for atm) would be the case where I can load individual data vectors from an SQLite database. Say I don't have enough room in memory to load everything, but I do have enough room to load just the training set or just the testing set. Without having the dataset loaded, I should simply be able to generate the cross validation indices and use those to SELECT the appropriate rows from my database. In this instance the entire dataset never exists as a single numpy array, but there is a need for generating cross validation indices.

How do you construct the SELECT statements, and how do you make sure they respect the split?
Why don't you pass the possible keys as X? That would simplify the logic; you wouldn't need to touch the grid-search or cross-validation code at all, and would just write a transformer.

I agree, it is not necessary to have X to get the indices, but requiring the user to have X (arguably) encourages better usage patterns and less rewriting of existing functionality.

@amueller (Member)

amueller commented Aug 2, 2016

For the example you linked to above, I don't see why you didn't just use cross_val_score.

@Erotemic (Contributor Author)

Erotemic commented Aug 2, 2016

Unfortunately, I haven't had too many occasions to use scikit-learn before and I'm not familiar with most of its functionality. Perhaps I'm rewriting more than I need to. I admit I'm not entirely sure what a transformer is or how to use it effectively. I will definitely take a look though now that I've been made aware of it.

On the other hand, doing a little bit of rewriting allows for a deeper understanding of what's happening, which is especially valuable to someone who is learning. It's also helpful to remove a little bit of the abstraction in order to extend the functionality. In this particular example I'm using HoG vectors with an SVM to classify whether an image of a whale shark shows an injury. This is a simple baseline test to see how difficult the problem is, as well as to clean our dataset (the bounding boxes were placed on the sharks by another computer vision algorithm, so we need to fix bad bounding boxes if they come up). Ultimately we are going to move away from HoG / SVM towards a convolutional neural network approach using Lasagne and Theano.

Another factor that influenced the way I wrote that file is that I'm constantly copying and pasting into IPython from my gvim editor. It's useful to have to copy only a small amount of code to test the part I'm interested in, so IPython's autoreload limitations are taken into account.

About the SELECT statements: if you load an array of labels (which takes much less memory than features with thousands of dimensions), then you can get the cross validation indices. You might also have a list of rowids that corresponds to the data that you want to load. You can take the rowids that belong to the subset and then simply index into the SQL table on those rowids. I guess the argument could be made that the array of rowids represents pointers to your data and thus should be passed in, but I find that non-intuitive.

@GaelVaroquaux (Member)

I haven't seen so far a compelling case against X=np.zeros(n_samples) (I purposely chose "zeros", and not "empty"; see below).

In my opinion, enforcing an X makes an explicitly uniform API across all CV objects. This is in keeping with scikit-learn's philosophy. A CV object might look at statistics of X to compute the splitting. That's legitimate, and maybe even good.

For now, I think that the cost of giving X=np.zeros is small. We should add a note in the documentation to guide users to do that, and keep an eye on this problem to see how big a burden it is in the long run. It's always possible to make an argument optional in the future; it's much harder to make an optional argument mandatory.

@raghavrv (Member)

raghavrv commented Oct 3, 2016

We should add a note in the documentation to guide the users to do that

+1

Contributors welcome.

@TomDLT added the Documentation and Easy (Well-defined and straightforward way to resolve) labels on Oct 3, 2016
@raghavrv (Member)

I've tried to address this at #7593 by adding this to the doc.

@jnothman (Member)

Let's close it, then?

8 participants