model_selection.StratifiedKFold should not require the data array to simply return split indices. #7126
Comments
The request seems legit. Any objection to allowing X=None here? We would have to be more permissive in other CV objects as well, if needed.
|
> Any objection to allow X=None here?
It's impossible: the code introspects a variety of things on X to generate the indices, such as the number of samples.
|
yes but here you get this info from len(y)
|
> yes but here you get this info from len(y)
Good point, most (all?) CV objects only need y.
+1 for X=None, then
|
@raghavrv since you did this refactoring any thoughts? |
Why can't you just set X=np.zeros(n_samples)? Maybe we can document this in the docstring to guide people. |
that works too :)
|
Can you give more details on that? How did you do that? |
There are definitely workarounds possible in user code, like the one @raghavrv suggested. If this is a common enough use case, however, I think we should not put the burden on the user, and allow X=None. I'd argue that requiring X gives the illusion that X is somehow used. |
I think we decided to go with [...]. Now if you want X=None to be set in [...]. Or if you are suggesting that we have [...]. WDYT? Also ping @jnothman |
Yeah, that's a good point... You changed my mind; now I think we should keep the current signature. When X doesn't fit, users are likely to apply the trick that @amueller described. If not, we can suggest the [...] |
@raghavrv The first case in the second paragraph seems the most reasonable to me.

```python
def split(X=None, y=None, labels=None):
    if X is None:
        assert y is not None, 'must specify at least X or y'
        X = np.empty((y.shape[0], 1))
```

I think X coming first is fine in the signature, but that doesn't mean it can't be a default argument. I think this solution is much cleaner than requiring a user to know that they need to create a dummy object to mirror X, just so input validation works. It also requires much less effort in user code and makes the use of sklearn visually more elegant.

```python
# either
for train_idx, test_idx in skf.split(y=y):
    pass
# or
for train_idx, test_idx in skf.split(None, y):
    pass
# looks much more elegant than
for train_idx, test_idx in skf.split(np.empty((y.shape[0], 1)), y):
    pass
```
|
Thanks for the comment. But I have a feeling that people will then be motivated to try [...]. I feel your use case of being unable to pass [...]. People who don't wish to pass in [...]. But I accept your point on [...]. Whatever the case, now would be the time to decide and make any change before the API is released. |
@Erotemic can you live for now with X = np.empty((y.shape[0], 0))? I am not opposed to allowing X=None, but it's another round of dev / review etc., and I think we have other priorities. |
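As a sketch of that workaround (assuming the 0.18-style `model_selection` API), a zero-width placeholder has the right number of samples but holds no feature data, so nothing large ever needs to be materialized:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0, 0, 0, 1, 1, 1])
# Zero-width placeholder: correct length, zero feature columns.
X_dummy = np.empty((y.shape[0], 0))

skf = StratifiedKFold(n_splits=2)
# split() only inspects X for its number of samples here.
folds = list(skf.split(X_dummy, y))
```

Each element of `folds` is a `(train_indices, test_indices)` pair that can be used on its own, without the real feature matrix.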
But we are effectively piloting some substantial changes to a user interface that didn't require this, albeit broken in other ways. I think it's fair to carefully re-evaluate our design decisions at this point.
I think I don't mind @Erotemic's suggestion. |
fair enough but we "just" need someone to re-evaluate. Let's give it a try |
I'm certainly able to live with it. The only reason I bring it up is because I feel like it would be an improvement to the API. I completely understand that this might not be a high priority, and I recognize the enormous amount of effort that it takes to methodically make these changes in such a widely used open source API. If this change can make it into the next release that would be great, but if it needs to be tabled I can understand that as well. However, I still think this would be a good change to implement at some point in the future. I'm not familiar with your process of making API changes with dev/review, but if all you need is a pull request with the implementation, docstring update, and associated tests, I can do the work of throwing that together. |
Please go ahead and raise a PR! |
that would help. The model_selection module has not yet been released AFAIK |
@Erotemic I think you are using the cv object in a way that's different from what we (I?) had in mind.
That is true for your code, but I'd say that your code could easily be redesigned so as not to need that and will probably be cleaner. Can you provide a full example of code where you would want to create the split without having X? |
Sure, but let me preface: this code isn't extremely clean and has been put together in an interactive environment. It demonstrates the problem, but not as nicely as it could. As the code currently sits, I could pass in the dataset as well, but my ultimate point is that I shouldn't need to do that, because generating cross validation indices does not depend on that data at all, especially when the same information can be introspected from the labels.

Here is a link to the script that caused me to run into the issue. Mirror in case I change anything:

A more intuitive example (that I don't have full code for atm) would be the case where I can load individual data vectors from an SQLite database. Say I don't have enough room in memory to load everything, but I do have enough room to load just the training set or just the testing set. Without having the dataset loaded, I should simply be able to generate the cross validation indices and use those to SELECT the appropriate rows from my database. In this instance the entire dataset never exists as a single numpy array, but there is a need for generating cross validation indices.

@raghavrv and @agramfort |
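The SQLite scenario described above can be sketched as follows. The schema (`samples` table with `id`, `label`, and `features` columns) is hypothetical, and the stratification is a simplified per-class round-robin, not scikit-learn's exact scheme — the point is only that fold membership is computed from labels, while features are fetched per-split:

```python
import sqlite3

# Hypothetical schema: each row holds a label and a (potentially large) feature blob.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, label INTEGER, features BLOB)")
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                 [(i, i % 2, b"\x00" * 16) for i in range(6)])

# Step 1: load only the small (id, label) pairs, never the feature blobs.
ids, labels = zip(*conn.execute("SELECT id, label FROM samples ORDER BY id"))

# Step 2: assign each sample to a fold using the labels alone
# (round-robin within each class keeps class proportions balanced).
n_folds = 2
assignments, counters = {}, {}
for sample_id, label in zip(ids, labels):
    counters[label] = counters.get(label, 0)
    assignments[sample_id] = counters[label] % n_folds
    counters[label] += 1

# Step 3: SELECT only the rows of one training split; the full feature
# matrix never exists in memory at once.
test_fold = 0
train_ids = [i for i in ids if assignments[i] != test_fold]
placeholders = ",".join("?" for _ in train_ids)
train_rows = conn.execute(
    "SELECT id, features FROM samples WHERE id IN (%s)" % placeholders,
    train_ids).fetchall()
```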
How do you construct the SELECT statements, and how do you make sure they respect the split? I agree, it is not necessary to have X to get the indices, but requiring the user to have X (arguably) encourages better usage patterns and less rewriting of existing functionality. |
For the example you linked to above, I don't see why you didn't just use |
Unfortunately, I haven't had too many occasions to use scikit-learn before and I'm not familiar with most of its functionality. Perhaps I'm rewriting more than I need to. I admit I'm not entirely sure what a transformer is or how to use it effectively. I will definitely take a look now that I've been made aware of it. On the other hand, doing a little bit of rewriting allows for a deeper understanding of what's happening, which is especially valuable to someone who is learning. It's also helpful to remove a little bit of the abstraction in order to extend the functionality.

In this particular example I'm using HoG vectors with an SVM to classify whether an image of a whale shark is injured or not. This is a simple baseline test to see how difficult the problem is, as well as to clean our dataset (the bounding boxes were placed on the sharks using another computer vision algorithm, so we need to fix bad bounding boxes if they come up). Ultimately we are going to move away from HoG / SVM towards a convolutional neural network approach using Lasagne and Theano.

Another factor that went into the way I wrote that file is that I'm constantly copying and pasting into IPython from my gvim editor. It's useful to have to copy only a small amount of code to test the part I'm interested in. Therefore IPython's autoreload limitations are taken into account.

About the SELECT statements: if you load in an array of labels (which takes much less memory than features with thousands of dimensions), then you can get the cross validation indices. You might also have a list of rowids that corresponds to the data that you want to load. You can take the rowids that belong to the subset and then simply index into the SQL table on those rowids. I guess the argument could be made that the array of rowids represents pointers to your data, and thus should be passed in, but I find that to be non-intuitive. |
I haven't seen so far a compelling case against X=np.zeros(n_samples). In my opinion, enforcing giving an X makes an explicitly uniform API. For now, I think that the cost of giving X=np.zeros is small. We should
+1 Contributors welcome. |
Let's close it, then? |
When importing sklearn.cross_validation I was prompted with a DeprecationWarning
saying I should use sklearn.model_selection instead. To ensure my code is up to date, I switched to the new version, but then I encountered an odd behavior.
The place in my code where I was generating the cross validation indices does not have access to the data array. However, in the new version to simply generate the indices of the test/train split you must supply the entire dataset. Previously all that was needed was the labels.
Forcing the developer to specify a data array is a problem when you have large amounts of high dimensional data and you want to wait to load only the subset of it needed by the current cross validation run. Furthermore, I cannot think of a reason why X would be required by this process, nor can I see a reason in the scikit-learn code.
Here is a small piece of code demonstrating the issue.
Even if it is nice to have the split signature contain an X for compatibility reasons, I think you should at least be able to specify X as None. However, trying to set X=None results in a TypeError.
It would be nice if there were either an alternative method like "split_indices(y)" that generated the indices using only the labels, or if the developer were able to specify X=None when calling split.
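The claim that only the labels are needed can be illustrated with a small self-contained sketch (this is not scikit-learn's implementation; it uses a simple per-class round-robin assignment):

```python
from collections import defaultdict

def stratified_fold_indices(y, n_folds):
    """Yield (train_idx, test_idx) pairs computed from the labels alone,
    keeping class proportions roughly equal across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Distribute each class's samples round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k
                           for i in fold)
        yield train_idx, test_idx
```

Note that the data array X never appears: the number of samples, the class structure, and the resulting indices all come from y.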
Version Info:
Linux-3.13.0-92-generic-x86_64-with-Ubuntu-14.04-trusty
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2]
NumPy 1.11.1
SciPy 0.18.0
Scikit-Learn 0.18.dev0