
GroupShuffleSplit does not work as described in the documentation. #13369


Closed
burak43 opened this issue Mar 1, 2019 · 10 comments
Labels
Documentation good first issue Easy with clear instructions to resolve help wanted

Comments

@burak43

burak43 commented Mar 1, 2019

Description

So, I need to produce test/train/validation splits with predefined groups. I don't want to use LeavePGroupsOut since I need to separate data according to my desired percentages into training and validation sets. In the documentation of GroupShuffleSplit, for the test_size parameter, it's said that:

test_size : float, int, None, optional
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. By default, the value is set to 0.2. The default will change in version 0.21. It will remain 0.2 only if train_size is unspecified, otherwise it will complement the specified train_size.

However, this is indeed not the case as in the following code:

Steps/Code to Reproduce

  • (1)
    from sklearn.model_selection import GroupShuffleSplit
    tr, ts = next(GroupShuffleSplit(n_splits=1, test_size=3).split(TR_set, groups=tr_groups))
    print(tr)
    print(ts)
  • (2)
    tr, ts = next(GroupShuffleSplit(n_splits=1, test_size=0.1).split(TR_set, groups=tr_groups))
    print(len(tr))
    print(len(ts))

Actual Results

The code above prints out, for instance:

  • (1)
    [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 91 92 93 99 101 102 103 104 105 106 107] [ 26 27 89 90 94 95 96 97 98 100]

  • (2)
    70 38

As you can see above from (1), the test size is not 3 but more than 3. This is almost always the case. I checked the groups of the indices. Apparently, if test_size is an integer, it represents the absolute number of test groups, not samples. I think you need to fix the documentation since it's misleading.

Also, when test_size is a float, it mostly does not respect the ratio specified. It may be due to unequal sample sizes in the groups, but then there should be a note/warning specifying what kind of behaviour it follows under unequal group sizes combined with a test_size ratio. From (2), the test size is 35% of the whole set where it is supposed to be 10%.
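The group-counting behaviour from (1) can be reproduced with synthetic data. A minimal, self-contained sketch (the group sizes here are made up for illustration, not taken from my data):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Four groups with deliberately unequal sizes: 2, 3, 5 and 10 samples.
groups = np.repeat([0, 1, 2, 3], [2, 3, 5, 10])
X = np.arange(len(groups)).reshape(-1, 1)

# test_size=1 selects one *group*, not one sample, so the number of
# test samples depends entirely on which group happens to be drawn.
tr, ts = next(GroupShuffleSplit(n_splits=1, test_size=1,
                                random_state=0).split(X, groups=groups))
print(np.unique(groups[ts]))  # exactly one group label
print(len(ts))                # that group's size: 2, 3, 5 or 10 samples
```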

So, either I'm missing something or the documentation is nothing but erroneous descriptions.

Thanks.

Versions

System:
python: 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 18:15:35) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
executable: /home/burak/anaconda3/bin/python
machine: Linux-4.15.0-45-generic-x86_64-with-debian-buster-sid

BLAS:
macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
lib_dirs: /home/burak/anaconda3/lib
cblas_libs: mkl_rt, pthread

Python deps:
pip: 18.1
setuptools: 40.2.0
sklearn: 0.20.1
numpy: 1.15.4
scipy: 1.1.0
Cython: 0.28.5
pandas: 0.23.4

@amueller
Member

amueller commented Mar 1, 2019

Thanks for the report.
From a cursory glance it looks like train_size and test_size are indeed in terms of groups, not in terms of samples.
Does that match your observations? In this case, a fix to the documentation is very welcome.
It should also be possible to implement the other variant, though only approximate (otherwise you'll end up with a bin packing problem, I think).
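A sample-based variant could only hit the requested fraction approximately, e.g. with a greedy pass over shuffled groups. A rough sketch of that idea (not scikit-learn API; the function name and signature are hypothetical):

```python
import numpy as np

def approx_group_test_split(groups, test_fraction, rng=None):
    """Hypothetical sketch: assign whole groups to the test set until the
    *sample* count reaches roughly test_fraction of the data. Hitting the
    target exactly would be a bin-packing-like problem, so this is greedy."""
    rng = np.random.default_rng(rng)
    groups = np.asarray(groups)
    target = test_fraction * len(groups)
    test_groups, n_test = [], 0
    for g in rng.permutation(np.unique(groups)):
        if n_test >= target:
            break
        test_groups.append(g)
        n_test += int(np.sum(groups == g))
    test_mask = np.isin(groups, test_groups)
    return np.where(~test_mask)[0], np.where(test_mask)[0]

groups = np.repeat(["a", "b", "c", "d"], [2, 3, 5, 10])
tr, ts = approx_group_test_split(groups, test_fraction=0.25, rng=0)
```

The test set overshoots the requested fraction by at most one group's worth of samples, which is the "only approximate" caveat above.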

@jnothman jnothman added Documentation good first issue Easy with clear instructions to resolve help wanted labels Mar 4, 2019
@OFlanagan
Contributor

I can fix the documentation; this is my first time contributing to scikit-learn.

If I have understood correctly, the text should be replaced with

test_size : float, int, None, optional
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test groups. If None, the value is set to the complement of the train size. By default, the value is set to 0.2. The default will change in version 0.21. It will remain 0.2 only if train_size is unspecified, otherwise it will complement the specified train_size.

@jnothman
Member

jnothman commented Mar 7, 2019 via email

@burak43
Author

burak43 commented Mar 7, 2019

What do you think about the second issue (2) I pointed out? I still believe that even if the behaviour is normal, there should be a warning or informative text about it.

@OFlanagan
Contributor

I have made a pull request to resolve the first issue.
#13414

@pierretallotte
Contributor

pierretallotte commented Mar 10, 2019

Also, when test_size is a float, it mostly does not consider the ratio specified. It may be due to unequal sample sizes in the groups but then there must be a note/warning to specify what kind of behaviour it follows under unequal group sizes combined with test_size ratio. From (2), test size is 35% of the whole set where it supposed to be 10%.

The parameter test_size is not ignored when it is a float. The ratio specified corresponds to the ratio between the number of groups in the test set and the total number of groups in the dataset. So, if you have 3 groups (no matter the size of each of them) and test_size is set to 0.1, GroupShuffleSplit.split will generate a test set with at least 10% of the number of groups in the dataset: 3 * 0.1 = 0.3, rounded up, so the test set will contain 1 group.

So, I agree the behavior is normal, and there is no need for a warning because it's actually the expected behavior as explained in the doc:

Note: The parameters test_size and train_size refer to groups, and not to samples, as in ShuffleSplit.

Maybe we should be more explicit in the description of the parameter (based on train_size parameter):

test_size : float, int, or None, default is None
If float, should be between 0.0 and 1.0 and represent the proportion of the groups to include in the test split (rounded up). If int, represents the absolute number of test groups. If None, the value is automatically set to the complement of the train size.

I notice the ratio is rounded up for test_size and rounded down for train_size.
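This rounding can be checked directly. A minimal sketch using 38 one-sample groups (a made-up layout chosen so that group counts and sample counts coincide):

```python
from math import ceil
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# 38 groups of one sample each, so the group count equals the sample count.
groups = np.arange(38)
X = np.zeros((38, 1))

tr, ts = next(GroupShuffleSplit(n_splits=1, test_size=0.1,
                                random_state=0).split(X, groups=groups))
print(len(ts))         # 4 == ceil(38 * 0.1): test groups are rounded *up*
print(len(tr))         # 34: the train set gets the complement
```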

@burak43
Author

burak43 commented Mar 12, 2019

Thanks for the explanation @pierretallotte. I finally had a chance to check what you said, and it's exactly as you described. I still believe the documentation is misleading despite the note you quoted. The description of the test_size parameter should reflect that it represents the proportion of groups, not of the dataset, as you emphasized. So, I believe we need another pull request for that; @OFlanagan, could you do it?

@OFlanagan
Contributor

OFlanagan commented Mar 12, 2019 via email

@burak43
Author

burak43 commented Mar 18, 2019

Since the problems have been resolved, I'm closing this issue. Thank you everyone.

@burak43 burak43 closed this as completed Mar 18, 2019
@jnothman
Member

Thanks for raising the issue @burak43, and for closing it. Although we usually would only close it after the pull requests that fix it are merged :)
