model_selection.StratifiedKFold should not require the data array to simply return split indices. #7126


Closed
Erotemic opened this issue Aug 1, 2016 · 28 comments
Labels
Documentation, Easy (Well-defined and straightforward way to resolve)

Comments

@Erotemic (Contributor)

Erotemic commented Aug 1, 2016

When importing sklearn.cross_validation I was shown a DeprecationWarning
saying I should use sklearn.model_selection instead. To keep my code up to date, I switched to the new module, but then I encountered an odd behavior.

The place in my code where I generate the cross validation indices does not have access to the data array. However, in the new version you must supply the entire dataset simply to generate the indices of the test/train split. Previously all that was needed was the labels.

Forcing the developer to supply a data array is a problem when you have large amounts of high-dimensional data and you want to load only the subset needed by the current cross validation run. Furthermore, I cannot think of a reason why X would be required by this process, nor can I see one in the scikit-learn code.

Here is a small piece of code demonstrating the issue.

    import numpy as np
    import sklearn.cross_validation
    import sklearn.model_selection
    y = np.array([0, 0, 1, 1, 1, 0, 0, 1])
    X = y.reshape(len(y), 1)

    # In the old version all that is needed is the labels
    skf_old = sklearn.cross_validation.StratifiedKFold(y, random_state=0)
    indices_old = list(skf_old)

    # The new version seems to require a data array for some reason
    skf_new = sklearn.model_selection.StratifiedKFold(random_state=0)
    indices_new = list(skf_new.split(X, y))

    # Causes an error, but there is no reason why X must be specified
    indices_new2 = list(skf_new.split(None, y))

Even if it is nice for the split signature to contain an X for compatibility reasons, I think you should at least be able to specify X as None. However, trying to set X=None results in a TypeError.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-7995a67b2df2> in <module>()
----> 1 indices_new2 = list(skf_new.split(None, y))

/home/joncrall/code/scikit-learn/sklearn/model_selection/_split.pyc in split(self, X, y, labels)
    312         """
    313         X, y, labels = indexable(X, y, labels)
--> 314         n_samples = _num_samples(X)
    315         if self.n_folds > n_samples:
    316             raise ValueError(

/home/joncrall/code/scikit-learn/sklearn/utils/validation.pyc in _num_samples(x)
    120         else:
    121             raise TypeError("Expected sequence or array-like, got %s" %
--> 122                             type(x))
    123     if hasattr(x, 'shape'):
    124         if len(x.shape) == 0:

TypeError: Expected sequence or array-like, got <type 'NoneType'>

It would be nice if there were either an alternative method like "split_indices(y)" that generates the indices using only the labels, or if the developer were able to specify X=None when calling split.

Version Info:

Linux-3.13.0-92-generic-x86_64-with-Ubuntu-14.04-trusty
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2]
NumPy 1.11.1
SciPy 0.18.0
Scikit-Learn 0.18.dev0

@agramfort (Member)

agramfort commented Aug 1, 2016 via email

@GaelVaroquaux (Member)

GaelVaroquaux commented Aug 1, 2016 via email

@agramfort (Member)

agramfort commented Aug 1, 2016 via email

@GaelVaroquaux (Member)

GaelVaroquaux commented Aug 1, 2016 via email

@agramfort (Member)

@raghavrv, since you did this refactoring, any thoughts?

@raghavrv (Member)

raghavrv commented Aug 1, 2016

Why can't you just set X = np.empty((y.shape[0], 1))?

Maybe we can document this in the docstring of X to guide people.
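The workaround suggested above can be sketched as follows. This is a minimal example with made-up labels, written against a current scikit-learn where the constructor argument is n_splits; the dummy X only needs the right number of rows, since its values are never inspected by StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels; only their length matters for the dummy X.
y = np.array([0, 0, 1, 1, 1, 0, 0, 1])
X_dummy = np.empty((y.shape[0], 1))  # placeholder; its contents are never read

skf = StratifiedKFold(n_splits=2)
for train_idx, test_idx in skf.split(X_dummy, y):
    print(train_idx, test_idx)
```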

@agramfort (Member)

agramfort commented Aug 1, 2016 via email

@amueller (Member)

amueller commented Aug 1, 2016

Forcing the developer to specify a data array is a problem when you have large amounts of high dimensional data and you want to wait to load only the subset of it needed by the current cross validation run.

Can you give more details on that? How did you do that?
This should be possible by having X be the indices to the data, and having a transformer that loads the data in a pipeline.
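That pattern might look roughly like the sketch below. All names here are hypothetical (data_store stands in for an on-disk source), and FunctionTransformer is used as a minimal loading transformer; the point is that X passed to the splitter is just an array of row keys, and the data is only materialized per fold inside the pipeline.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical external store; imagine this lives on disk or in a database.
rng = np.random.RandomState(0)
data_store = rng.randn(100, 5)
y = rng.randint(0, 2, size=100)

def load_rows(idx):
    # "X" flowing through the pipeline is an (n, 1) array of row keys;
    # only the rows needed by the current fold are loaded here.
    return data_store[idx.ravel()]

pipe = Pipeline([
    ("load", FunctionTransformer(load_rows)),
    ("clf", LogisticRegression()),
])

X_keys = np.arange(len(y)).reshape(-1, 1)  # indices play the role of X
scores = cross_val_score(pipe, X_keys, y, cv=3)
```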

@vene (Member)

vene commented Aug 1, 2016

There are definitely workarounds possible in user code, like the one @raghavrv suggested. If this is a common enough use case, however, I think we should not put the burden on the user, and allow X=None.

I'd argue that requiring X gives the illusion that X is somehow used.

@raghavrv (Member)

raghavrv commented Aug 1, 2016

I think we decided to go with split(X, y, labels) as it can be used for unsupervised tasks too.

Now if you want X=None to be set in StratifiedKFold alone, the signature must be split(y, X=None, labels=None) which is not consistent with the other CV splitters.

Or, if you are suggesting that we have (X=None, y=None, labels=None) and later make sure y is not None, users would be expected to know that X can be None, possibly by reading the docstring. In that case we can tell them to use a dummy X... and we'll have a more informative signature (X, y, labels=None) denoting that y cannot be None.

WDYT?

Also ping @jnothman

@vene (Member)

vene commented Aug 1, 2016

Yeah, that's a good point... You changed my mind; now I think we should keep the current signature. When X doesn't fit in memory, users are likely to apply the trick that @amueller described. If not, we can suggest the np.empty trick. I don't think this use case is frequent enough to warrant jumping through hoops in our input validation...

@Erotemic (Contributor Author)

Erotemic commented Aug 1, 2016

@raghavrv The first case in the second paragraph seems the most reasonable to me.

def split(self, X=None, y=None, labels=None):
    if X is None:
        assert y is not None, 'must specify at least X or y'
        X = np.empty((y.shape[0], 1))

I think X coming first in the signature is fine, but that doesn't mean it can't be a default argument.
Doing it this way, a user could simply use skf.split(y=y) or skf.split(None, y). This wouldn't break the unsupervised case, and it would be much more reasonable in the supervised case.

I think this solution is much cleaner than requiring a user to know that they need to create a dummy object to mirror X just so input validation works. It also requires much less effort in user code and makes the use of sklearn visually more elegant.

     # either
     for train_idx, test_idx in skf.split(y=y):
         pass
     # or
     for train_idx, test_idx in skf.split(None, y):
         pass
     # looks much more elegant than 
     for train_idx, test_idx in skf.split(np.empty((y.shape[0], 1)), y):
         pass

@raghavrv (Member)

raghavrv commented Aug 1, 2016

Doing it this way a user could simply use skf.split(y=y) or skf.split(None, y).

Thanks for the comment. But I have a feeling that people will then be motivated to try skf.split(y) and complain that it doesn't work while kfold.split(y) works.

I feel your use case of being unable to pass X is really rare. People would happily pass X and y, especially since X is neither copied nor modified. And this would continue to be the case as long as X is in the current namespace and has a shape of (n_samples, n_features).

People who don't wish to pass in X have somewhat hackish use cases, and they can be burdened a tiny bit with putting in an empty placeholder X as a compromise for a consistent and clear API with an informative signature.

But I accept your point on skf.split(y=y) being more elegant than skf.split(X=np.empty((y.shape[0], 1)), y=y) or skf.split(y, y). So I'd defer the judgement on what should be done to @jnothman, @vene, @amueller, @agramfort and @GaelVaroquaux.

Whatever the case, now would be the time to decide and make any change, before the API is released.

@agramfort (Member)

@Erotemic can you live for now with

X = np.empty((y.shape[0], 0))

?

I am not opposed to allowing X=None, but it's another round of dev / review etc. and I think we have other priorities.

@jnothman (Member)

jnothman commented Aug 2, 2016

@agramfort:

I am not opposed to X=None allowed but it's another round of dev / review etc. and I think we have other priorities.

But we are effectively piloting some substantial changes to a user interface that didn't require this, albeit broken in other ways. I think it's fair to carefully re-evaluate our design decisions at this point.

@raghavrv:

people will then be motivated to try skf.split(y) and complain that it doesn't work while kfold.split(y) works.

I think kfold.split(y) could be allowed to work on the basis of the unsupervised case anyway...?
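For comparison, plain KFold already needs only a single array, since it ignores labels entirely. A quick sketch with made-up data (assuming a current scikit-learn with the n_splits argument):

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.array([0, 0, 1, 1, 1, 0, 0, 1])
# KFold never looks at class labels, so passing y as the sole argument works.
for train_idx, test_idx in KFold(n_splits=2).split(y):
    print(train_idx, test_idx)
```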

I don't mind @Erotemic's suggestion.

@agramfort (Member)

But we are effectively piloting some substantial changes to a user
interface that didn't require this, albeit broken in other ways. I think
it's fair to carefully re-evaluate our design decisions at this point.

fair enough but we "just" need someone to re-evaluate. Let's give it a try
then. Are you on it?

@Erotemic (Contributor Author)

Erotemic commented Aug 2, 2016

@agramfort

I'm certainly able to live with it. The only reason I bring it up is because I feel like it would be an improvement to the API. I completely understand that this might not be a high priority, and I recognize the enormous amount of effort that it takes to methodically make these changes in such a widely used open source API.

If this change can make it into the next release that would be great, but if it needs to be tabled I can understand that as well. However, I still think this would be a good change to implement at some point in the future.

I'm not familiar with your process of making API changes with dev/review, but if all you need is a pull request with the implementation, docstring update, and associated tests, I can do the work of throwing that together.

@raghavrv (Member)

raghavrv commented Aug 2, 2016

Please go ahead and raise a PR!

@agramfort (Member)

I'm not familiar with your process of making API changes with dev/review,
but if all you need is a pull request with the implementation, docstring
update, and associated tests, I can do the work of throwing that together.

that would help. The model_selection module has not yet been released AFAIK,
so you can do it without the API deprecation mess

@amueller (Member)

amueller commented Aug 2, 2016

@Erotemic I think you are using the cv object in a way that's different from what we (I?) had in mind.
To me, it doesn't make sense to start splitting without having X.

I think this solution is much more clean than requiring a user to know that they need to create a dummy object to mirror X, just so input validation works.

That is true for your code, but I'd say that your code could easily be redesigned so as not to need that and will probably be cleaner.

Can you provide a full example of code where you would want to create the split without having X?

@Erotemic (Contributor Author)

Erotemic commented Aug 2, 2016

@amueller

Sure, but let me preface: this code isn't extremely clean and was put together in an interactive environment. It demonstrates the problem, but not as nicely as it could. As the code currently sits I could pass in the dataset as well, but my ultimate point is that I shouldn't need to, because generating cross validation indices does not depend on the data at all, especially when the same information can be introspected from the labels.

Here is a link to the script that caused me to run into the issue.
https://github.com/Erotemic/ibeis/blob/next/ibeis/scripts/classify_shark.py

Mirror in case I change anything:
https://gist.github.com/Erotemic/b694158a7637de42208d5b86852b4f9e

A more intuitive example (that I don't have full code for atm) would be the case where I can load individual data vectors from an SQLite database. Say I don't have enough room in memory to load everything, but I do have enough room to load just the training set or just the testing set. Without having the dataset loaded, I should simply be able to generate the cross validation indices and use those to SELECT the appropriate rows from my database. In this instance the entire dataset never exists as a single numpy array, but there is a need for generating cross validation indices.
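A rough sketch of that scenario, with an in-memory SQLite table standing in for the real database (the table name, schema, and the dummy-X workaround are all illustrative, not from the original code): only the labels are held in memory, the split indices are mapped to rowids, and each fold's rows are fetched with a SELECT.

```python
import sqlite3
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical database holding the feature rows; only labels stay in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (feat REAL, label INTEGER)")
rng = np.random.RandomState(0)
labels = rng.randint(0, 2, size=20)
conn.executemany("INSERT INTO samples (feat, label) VALUES (?, ?)",
                 [(float(rng.randn()), int(lab)) for lab in labels])

skf = StratifiedKFold(n_splits=2)
dummy_X = np.empty((len(labels), 1))  # placeholder to satisfy split's signature
for train_idx, test_idx in skf.split(dummy_X, labels):
    # SQLite rowids are 1-based and follow insertion order here,
    # so split index i maps to rowid i + 1.
    marks = ",".join("?" * len(train_idx))
    train_rows = conn.execute(
        "SELECT feat FROM samples WHERE rowid IN (%s)" % marks,
        [int(i) + 1 for i in train_idx]).fetchall()
    # train_rows now holds only this fold's training subset
```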

@raghavrv and @agramfort
I submitted a PR #7128

@amueller (Member)

amueller commented Aug 2, 2016

A more intuitive example (that I don't have full code for atm) would be the case where I can load individual data vectors from an SQLite database. Say I don't have enough room in memory to load everything, but I do have enough room to load just the training set or just the testing set. Without having the dataset loaded, I should simply be able to generate the cross validation indices and use those to SELECT the appropriate rows from my database. In this instance the entire dataset never exists as a single numpy array, but there is a need for generating cross validation indices.

How do you construct the SELECT statements, and how do you make sure they respect the split?
Why don't you pass the possible keys as X? That would simplify the logic; you wouldn't need to touch the grid-search or cross-validation code at all, and would just write a transformer.

I agree, it is not necessary to have X to get the indices, but requiring the user to have X (arguably) encourages better usage patterns and less rewriting of existing functionality.

@amueller (Member)

amueller commented Aug 2, 2016

For the example you linked to above, I don't see why you didn't just use cross_val_score.

@Erotemic (Contributor Author)

Erotemic commented Aug 2, 2016

Unfortunately, I haven't had too many occasions to use scikit-learn before and I'm not familiar with most of its functionality. Perhaps I'm rewriting more than I need to. I admit I'm not entirely sure what a transformer is or how to use it effectively. I will definitely take a look though now that I've been made aware of it.

On the other hand, doing a little bit of rewriting allows for a deeper understanding of what's happening, which is especially valuable to someone who is learning. It's also helpful to remove a little bit of the abstraction in order to extend the functionality. In this particular example I'm using HoG vectors with an SVM to classify whether an image of a whale shark shows an injury. This is a simple baseline test to see how difficult the problem is, as well as to clean our dataset (the bounding boxes were placed on the sharks by another computer vision algorithm, so we need to fix bad bounding boxes if they come up). Ultimately we are going to move away from HoG / SVM towards a convolutional neural network approach using Lasagne and Theano.

Another factor that influenced the way I wrote that file is that I'm constantly copying and pasting into IPython from my gvim editor. It's useful to have to copy only a small amount of code to test the part I'm interested in, so IPython's autoreload limitations are taken into account.

About the SELECT statements: if you load an array of labels (which takes much less memory than features with thousands of dimensions), then you can get the cross validation indices. You might also have a list of rowids that corresponds to the data that you want to load. You can take the rowids that belong to the subset and then simply index into the SQL table on those rowids. I guess the argument could be made that the array of rowids represents pointers to your data and thus should be passed in, but I find that non-intuitive.

@GaelVaroquaux (Member)

I haven't seen so far a compelling case against X=np.zeros(n_samples) (I purposely chose "zeros", and not "empty"; see below).

In my opinion, enforcing an X makes an explicitly uniform API across all CV objects. This is in keeping with scikit-learn's philosophy. A CV object might look at statistics of X to compute the splitting. That's legitimate, and maybe even good.

For now, I think that the cost of giving X=np.zeros is small. We should add a note in the documentation to guide users to do that, and keep an eye on this problem to see how big a burden it is in the long run. It's always possible to make an argument optional in the future; it's much harder to make an optional argument mandatory.

@raghavrv (Member)

raghavrv commented Oct 3, 2016

We should add a note in the documentation to guide the users to do that

+1

Contributors welcome.

@TomDLT added the Documentation and Easy (Well-defined and straightforward way to resolve) labels on Oct 3, 2016
@raghavrv (Member)

I've tried to address this at #7593 by adding this to the doc.

@jnothman (Member)

Let's close it, then?

8 participants