GroupShuffleSplit does not work as how it's described in the documentation. #13369
Comments
Thanks for the report. |
I can fix the documentation; this is my first time contributing to scikit-learn. If I have understood correctly, the text should be replaced with test_size : float, int, None, optional |
That looks about right. But it's easier to review fully in the context of a
pull request. Go ahead and submit one, and let us know if you need more
help doing so! Thanks.
|
What do you think about the second (2) issue I pointed out? I still believe that if the example looks normal, there must be a warning or informative text about its behaviour. |
I have made a pull request to resolve the first issue. |
The parameter So, I agree the behavior is normal but there is no need of a warning because it's actually the expected behavior as explained in the doc:
Maybe we should be more explicit in the description of the parameter (based on
I notice the ratio is rounded up for |
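The effect of applying the ratio to groups rather than samples can be checked with a little arithmetic. The sketch below is only an illustration: the group sizes are hypothetical (the real groups from the report are not shown in this thread), and it assumes the float is multiplied by the number of groups and rounded up.

```python
import math

# Hypothetical group sizes, chosen only for illustration; they sum to 108 samples.
group_sizes = [38, 8, 8, 8, 8, 8, 8, 8, 7, 7]
n_groups = len(group_sizes)
n_samples = sum(group_sizes)

test_size = 0.1
# Assumption: the ratio is applied to the number of groups and rounded up.
n_test_groups = math.ceil(test_size * n_groups)

# The sample share of the test set depends on which groups happen to be drawn.
smallest_share = sum(sorted(group_sizes)[:n_test_groups]) / n_samples
largest_share = sum(sorted(group_sizes, reverse=True)[:n_test_groups]) / n_samples

print(n_test_groups, round(smallest_share, 3), round(largest_share, 3))
```

With unequal groups, the share of samples in the test set can land well below or well above the requested ratio, even though the share of groups is exactly what was asked for.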
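Thanks for the description @pierretallotte. I finally had a chance to check what you say, and it's exactly as you described. I still believe that the documentation is misleading despite the note you quoted. The description of the test_size parameter should reflect that it represents the proportion of groups, not of the dataset, as you emphasized. So, I believe we need another pull request for that; @OFlanagan, could you do it? |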
I can do that
On Wed, 13 Mar 2019 at 4:20 AM, Burak Mandıra ***@***.***> wrote:
Thanks for the description @pierretallotte
<https://github.com/pierretallotte>. I finally had a chance to check what
you say, and it's exactly as you described. I do still believe that the
documentation is misleading albeit the note you quoted. The description of
test_size parameter should reflect that it represents the *proportion of
groups*, not the dataset, as you emphasized. So, I believe we need
another pull request for that, @OFlanagan <https://github.com/OFlanagan>
could you do it?
--
Regards
Owen Flanagan
|
Since the problems have been resolved, I'm closing this issue. Thank you everyone. |
Thanks for raising the issue @burak43, and for closing it. Although we usually would only close it after the pull requests that fix it are merged :) |
Description
So, I need to produce test/train/validation splits with predefined groups. I don't want to use LeavePGroupsOut since I need to separate data according to my desired percentages into training and validation sets. In the documentation of GroupShuffleSplit, for the test_size parameter, it's said that:
However, this is indeed not the case, as in the following code:
Steps/Code to Reproduce
Actual Results
which prints out for instance:
(1)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 91 92 93 99 101 102 103 104 105 106 107] [ 26 27 89 90 94 95 96 97 98 100]
(2)
70 38
As you see above from (1), test size is not 3 but more than 3. This is almost always the case. I checked the groups of the indices. Apparently, if test_size is an integer, it represents the absolute number of test groups, not samples. I think you need to fix the documentation since it's misleading.
Also, when test_size is a float, it mostly does not respect the ratio specified. It may be due to unequal sample sizes in the groups, but then there must be a note/warning specifying what kind of behaviour it follows under unequal group sizes combined with a test_size ratio. From (2), the test size is 35% of the whole set, where it is supposed to be 10%.
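For illustration only, here is a minimal sketch of the behaviour described above. The group sizes are hypothetical and are not the data from this report; the sketch assumes a scikit-learn version in which test_size is applied to groups.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 10 groups of unequal size (55 samples in total).
group_sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
groups = np.repeat(np.arange(len(group_sizes)), group_sizes)
X = np.zeros((len(groups), 1))

# Integer test_size: count the distinct groups that end up in the test set.
gss_int = GroupShuffleSplit(n_splits=1, test_size=3, random_state=0)
train_int, test_int = next(gss_int.split(X, groups=groups))
n_test_groups_int = len(np.unique(groups[test_int]))
print("int  :", n_test_groups_int, "test groups,", len(test_int), "test samples")

# Float test_size: compare the share of groups with the share of samples.
gss_float = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_float, test_float = next(gss_float.split(X, groups=groups))
n_test_groups_float = len(np.unique(groups[test_float]))
print("float:", n_test_groups_float, "test groups,",
      round(len(test_float) / len(groups), 3), "share of samples")
```

In this sketch the integer case yields three test groups, so the number of test samples is larger than 3, and the float case yields one test group out of ten regardless of how many samples that group contains.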
So, either I'm missing something or the documentation is nothing but erroneous descriptions.
Thanks.
Versions
System:
python: 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 18:15:35) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
executable: /home/burak/anaconda3/bin/python
machine: Linux-4.15.0-45-generic-x86_64-with-debian-buster-sid
BLAS:
macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
lib_dirs: /home/burak/anaconda3/lib
cblas_libs: mkl_rt, pthread
Python deps:
pip: 18.1
setuptools: 40.2.0
sklearn: 0.20.1
numpy: 1.15.4
scipy: 1.1.0
Cython: 0.28.5
pandas: 0.23.4