
StratifiedShuffleSplit still buggy #6471


Closed
amueller opened this issue Mar 1, 2016 · 4 comments

amueller (Member) commented Mar 1, 2016

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from numpy.testing import assert_array_equal
from sklearn.utils.testing import assert_equal

y = [0, 1, 2, 3] * 3 + [4, 5] * 5
X = np.ones_like(y)

splits = StratifiedShuffleSplit(n_iter=1, train_size=11, test_size=11, random_state=0)

train, test = next(iter(splits.split(X=X, y=y)))

assert_array_equal(np.intersect1d(train, test), [])
assert_equal(len(train), 11)
assert_equal(len(test), 11)  # fails: raise self.failureException('12 != 11')

This is a follow-up to #6379 and a sign that the logic is too complex. I'm doing a rewrite to fix this bug, but it is not as clean as I'd like.

mattilyra (Contributor) commented May 30, 2016

I can add a few more test cases to this. With the following data

import sklearn.cross_validation
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=1000, n_features=2, random_state=42, cluster_std=5.0)

All manner of crazy things happen

split = sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=0.09, train_size=0.1)
train_idx, test_idx = next(iter(split))
assert_equal(train_idx.shape[0], split.n_train) # pass
assert_equal(test_idx.shape[0], split.n_test) # 901 != 90

AssertionError:
Items are not equal:
ACTUAL: 901
DESIRED: 90

split = sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=0.11, train_size=0.11)
train_idx, test_idx = next(iter(split))
assert_equal(train_idx.shape[0], split.n_train) # 900 != 110
assert_equal(test_idx.shape[0], split.n_test) # 111 != 110

AssertionError:
Items are not equal:
ACTUAL: 900
DESIRED: 110

AssertionError:
Items are not equal:
ACTUAL: 111
DESIRED: 110

I'm not that bothered about getting 1 extra instance in the test or training set, but getting 10x the amount of data that was requested is not optimal.

mattilyra (Contributor) commented:

Right, so the problem is this line

test.extend(missing_idx[-(split.n_test - len(test)):])

https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/cross_validation.py#L1005

If split.n_test - len(test) is 0, meaning that we already have the correct number of test indices, we end up adding all of the missing indices to the test set, including those that went to train. The simple fix is pretty ugly, so I'm not sure if you're still wanting to rewrite the whole thing.
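The underlying slicing pitfall can be demonstrated in isolation. This is a minimal sketch independent of scikit-learn; the list contents and the `take_last` helper are illustrative, not part of the library:

```python
# The buggy pattern: taking "the last n items" with a negative slice.
missing_idx = [10, 11, 12, 13, 14]

n_needed = 2
assert missing_idx[-n_needed:] == [13, 14]  # last two, as intended

n_needed = 0
# -0 is just 0, so the slice returns the WHOLE list instead of nothing:
assert missing_idx[-n_needed:] == [10, 11, 12, 13, 14]

# Guarding the zero case explicitly avoids the surprise:
def take_last(seq, n):
    return seq[len(seq) - n:] if n > 0 else []

assert take_last(missing_idx, 2) == [13, 14]
assert take_last(missing_idx, 0) == []
```

This is why the test set only blows up when the rounding of train_size/test_size leaves zero missing test indices: in every other case the negative slice does the right thing.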

This seems to have already been fixed on master though.

lesteve (Member) commented May 31, 2016

This seems to have already been fixed on master though.

Sorry, I may have missed something here. Have you found a bug that has not already been fixed in master or that was not the one for which the issue was created?

mattilyra (Contributor) commented:

@lesteve no, after adding the comments I noticed that on the upstream master the issue has been fixed, so this should probably be closed.

The link above is to the 0.17 sources (linked to from the API docs). Sorry for the confusion.
