[MRG+1] ENH/MNT Rename labels --> groups #6660

raghavrv · 2016-04-14T12:23:55Z

Partially fixes #5053, fixes #7210

Vote

Gael : -0
Joel : +0.5?
Alex : ~~-1~~ +1
Vene : +1
Andy : ~~-1~~ +1

TODO

Renames Leave*LabelOut --> Leave*GroupOut
Renames labels parameter to groups.
Renames LabelKFold --> GroupKFold
Renames LabelShuffleSplit --> GroupShuffleSplit
Fix doctests
Make all the tests pass
Fix examples
Fix docs
Split cv section into two, one for IID data and one for grouped data...

Please take a look at this @jnothman @amueller @MechCoder @vene

REF

[MRG+1] Make cross-validators data independent + Reorganize grid_search, cross_validation and learning_curve into model_selection #4294 (comment)

I raised this as a separate PR as I felt it warrants its own discussion.

raghavrv · 2016-04-14T12:24:19Z

Also ping @GaelVaroquaux @ogrisel @larsmans @mblondel

GaelVaroquaux · 2016-04-14T13:30:06Z

-0

I cringe at breaking backward compatibility for that. People rely on us. Docs and books are written.

jnothman · 2016-04-14T13:32:37Z

We're already renaming the module... but yes, searchability is valuable too.

raghavrv · 2016-04-14T13:48:19Z

Copying my comment from the other PR -

I think the motivation for renaming labels --> groups is to demarcate clearly what we mean by labels. (As labels can also mean the target or class labels).

GaelVaroquaux · 2016-04-14T13:53:38Z

I think the motivation for renaming labels --> groups is to demarcate clearly what we mean by labels. (As labels can also mean the target or class labels).

I understand the motivation, and I agree with it. Yet, is it worth breaking the dozens of books that have been written using scikit-learn, and user code? Note that the breakage is more subtle than the change that we have just made. Anyhow, I am only -0, not -1.

raghavrv · 2016-04-14T13:55:03Z

Quoting a previous discussion at #4294

@vene wrote -

Based on what the code does, the labels in permutation_test_score is a "group label" (observable at runtime, sample property, yadda yadda).

Permutation tests are a kind of non-parametric test where you approximate the data distribution by shuffling the ~~"labels"~~ class labels (ys, argh!) of the observed data. If the data is grouped (e.g. multiple measurements of the same fMRI subject, multiple documents for the same query), the labels param is used to only shuffle y within each group. I guess a reason would be so that class probabilities keep the same distribution.

So I think labels should be passed to the inner CV. Prior to this PR, this function was probably used like permutation_test_score(X, y, labels, cv=LeavePLabelOut(len(y), labels...)). Does this pose any problems?

I think the person to ping about this is indeed @agramfort, based on some online threads I've found about permutation tests :)

raghavrv · 2016-04-14T13:55:33Z

We had this discussion on whether we should use the labels as group labels for the cross-validator of permutation_test_score as well as for the permutation. That also needs to be addressed. Currently we use the labels for both permuting and as the group labels for the cv.
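To make the "shuffle y only within each group" behaviour discussed above concrete, here is a minimal standard-library sketch. The helper name `shuffle_within_groups` and its `seed` argument are hypothetical and chosen for illustration; this is not scikit-learn's implementation, only one way the constraint could be expressed.

```python
import random
from collections import defaultdict


def shuffle_within_groups(y, groups=None, seed=0):
    """Return a copy of y permuted only among samples sharing a group id.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    rng = random.Random(seed)
    y = list(y)
    if groups is None:
        # No grouping given: permute y across all samples.
        permuted = y[:]
        rng.shuffle(permuted)
        return permuted
    # Collect the sample positions that belong to each group.
    positions = defaultdict(list)
    for i, g in enumerate(groups):
        positions[g].append(i)
    permuted = y[:]
    for idx in positions.values():
        shuffled = idx[:]
        rng.shuffle(shuffled)
        # Values only move between positions of the same group.
        for src, dst in zip(idx, shuffled):
            permuted[dst] = y[src]
    return permuted


y = [0, 1, 0, 1, 1, 0]
groups = ["a", "a", "a", "b", "b", "b"]
y_perm = shuffle_within_groups(y, groups, seed=42)
```

Under this sketch, each group keeps exactly the same multiset of `y` values after the permutation, which is the property the quoted comment attributes to the parameter.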

raghavrv · 2016-04-14T14:12:20Z

@GaelVaroquaux I'm just populating this PR with relevant discussions for us to refer to :)

And

Yet, is it worth breaking the dozens of books that have been written using scikit-learn, and user code? Note that the breakage is more subtle than the change that we have just made.

But, as Joel said, model_selection is not yet released. Maybe we could use this opportunity to make these kinds of changes? Once it is released, these issues become more pressing, and we will be more compelled not to make such changes. However, if the change brought about by this PR is not that useful, maybe we could leave things as they are...

agramfort · 2016-04-14T14:13:10Z

I agree with Gael. I don't think it's worth it.

raghavrv · 2016-04-15T11:59:43Z

@vene at #5053 wrote -

That's my view as well. labels was a weird, overloaded word in this context. I was personally very confused about what LeavePLabelOut did for a while, because I was thinking of y as labels. I thought it's some sort of zero-shot learning evaluation.

Groups is much better; though there could be confusion with, say, GroupLasso.

Such breakage is annoying on two accounts. (a) user code will need to change: but this will happen anyway because of the move to model_selection. (b) books becoming out of date, as @GaelVaroquaux points out.

Now, my first impulse is to say that (b) is also moot because of model_selection, but it's not that simple. It's easier to realize that cross_validation.LeavePLabelOut became model_selection.LeavePLabelOut than to realize that it became model_selection.LeavePGroupOut.

Still, I think that if you at least know what your intention is, and understand the word "Group", the latter can be easily identified as relevant when tabbing through the imports.

(In light of this, I would have preferred LabelKFold to be KFoldLabels so that it would be easy to type KFold[tab] and find KFoldGroups instead...)

raghavrv · 2016-05-11T13:15:38Z

I'm closing as we have agreed this is not quite useful. Feel free to comment/reopen.

raghavrv · 2016-08-20T13:26:40Z

I'm reopening this in light of #7210

amueller · 2016-08-22T22:43:38Z

looks like @agramfort changed his mind between this issue and #7210 ;)

agramfort · 2016-08-23T08:04:46Z

seems like it ... :)

raghavrv · 2016-08-23T13:44:29Z

This is ready for reviews @jnothman @vene @amueller @agramfort

agramfort · 2016-08-23T13:55:58Z

LGTM

did you "git grep Label" in model_selection?

amueller · 2016-08-23T21:28:59Z

doc/tutorial/statistical_inference/model_selection.rst


-    - :class:`LeavePLabelOut`  **(p)**
+    - :class:`LeavePGroupOut`  **(p)**


ogrisel · 2016-09-09T17:05:49Z

doc/modules/cross_validation.rst

+medical data collected from multiple patients, with multiple samples taken from
+each patient. And such data is likely to be dependent on the individual group
+(generative process). In our example, the patient id for each sample will be
+it's group identifier.


raghavrv · 2016-09-09T21:35:22Z

I think I've addressed all the comments. @amueller @jnothman @ogrisel @vene

ogrisel · 2016-09-10T13:18:25Z

doc/modules/cross_validation.rst

+group is not in both testing and training sets. This is necessary for example
+if you obtained data from different subjects and you want to avoid over-fitting
+(i.e., learning person specific features) by testing and training on different
+subjects.


Group k-fold cross-validation will not make the model avoid over-fitting, but it makes it possible to detect over-fitting when it happens. Here is a suggested rephrasing:

:class:`GroupKFold` is a variation of k-fold which ensures that the same group is not represented in both testing and training sets. For example, if the data is obtained from different subjects with several samples per subject, and if the model is flexible enough to learn from highly person-specific features, it could fail to generalize to new subjects. :class:`GroupKFold` makes it possible to detect this kind of overfitting situation.
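The invariant described above (no group appears on both sides of a split) can be sketched in a few lines of plain Python. The function `group_k_fold` and its greedy balancing rule are assumptions made for illustration; this is a simplified stand-in and not the code scikit-learn uses.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits=3):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides.

    Hypothetical, simplified illustration; not scikit-learn's GroupKFold.
    """
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    if n_splits > len(members):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    # Greedy balancing: assign the largest remaining group to the smallest fold.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda g: -len(members[g])):
        smallest = min(folds, key=len)
        smallest.extend(members[g])
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, sorted(test)


# Several samples per patient; the patient id is the group identifier.
patients = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4"]
splits = list(group_k_fold(patients, n_splits=2))
```

Because whole groups are assigned to folds, a model that memorizes patient-specific features gets no help from the training set when scored on the held-out patients, so the gap shows up in the test score.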

ogrisel · 2016-09-10T13:20:40Z

This was not addressed: https://github.com/scikit-learn/scikit-learn/pull/6660/files#r78210904

raghavrv · 2016-09-11T08:30:21Z

@ogrisel Done! Thanks for the patience!

ogrisel · 2016-09-11T12:13:00Z

sklearn/model_selection/_validation.py

-        train/test set.
+    groups : array-like, with shape (n_samples,), optional
+        Group labels used to constrain the permutation to specific subsets of
+        data.


This could be improved as suggested by @jnothman in https://github.com/scikit-learn/scikit-learn/pull/6660/files#r77563448

This was the improved version :p

Based on your comment, I'm guessing it's not clear? The first line says it's used for permutation and the NOTE which follows it clarifies that it is also used for label-cvs...

ogrisel · 2016-09-11T12:13:41Z

Besides the remaining unaddressed comment by @jnothman, LGTM.

jnothman · 2016-09-11T13:24:49Z

sklearn/model_selection/_validation.py

-        data.
+        Labels to constrain permutation within groups, i.e. ``y`` values
+        are permuted among samples with the same group identifier.
+        When not specified, ``y`` values are permuted among all samples.


It'd be good to add that it's also passed to Group CV splitters...

I added a NOTE below specially for that ;)

Ah, sorry. Still not used to github's "viewing a subset of changes". I don't see why this is a NOTE as distinct from a parameter description, though.

I felt the usage of GroupCV in permutation_test_score was not a very common use case... Would you rather have the NOTE heading removed?

raghavrv · 2016-09-11T14:29:49Z

@jnothman Done.

ogrisel · 2016-09-11T16:19:26Z

One of the Travis jobs failed randomly (a timed joblib doctest took 24s to execute instead of the usual less than 1s). I restarted it.

ogrisel · 2016-09-11T17:14:54Z

Squash merged! 🍻

raghavrv · 2016-09-11T17:15:58Z

🍻

jnothman · 2016-09-11T23:09:03Z

Thanks, Raghav!

raghavrv · 2016-09-12T12:01:11Z

Thank YOU for the patient reviews!

raghavrv changed the title ~~[MRG] ENH/MNT Rename labels --> groups~~ [WIP] ENH/MNT Rename labels --> groups Apr 14, 2016

raghavrv mentioned this pull request Apr 14, 2016

[RFC] Changes to model_selection? #5053

Closed

raghavrv changed the title ~~[WIP] ENH/MNT Rename labels --> groups~~ [RFC/WIP] ENH/MNT Rename labels --> groups Apr 14, 2016

raghavrv closed this May 11, 2016

raghavrv deleted the rename_labels_to_groups branch May 11, 2016 13:15

raghavrv mentioned this pull request Aug 20, 2016

LabelKFold -> GroupKFold? #7210

Closed

raghavrv restored the rename_labels_to_groups branch August 20, 2016 13:26

raghavrv reopened this Aug 20, 2016

amueller added this to the 0.18 milestone Aug 22, 2016

raghavrv force-pushed the rename_labels_to_groups branch 2 times, most recently from a9ba660 to a84042c Compare August 23, 2016 10:37

raghavrv changed the title ~~[RFC/WIP] ENH/MNT Rename labels --> groups~~ [MRG] ENH/MNT Rename labels --> groups Aug 23, 2016

amueller reviewed Aug 23, 2016

ogrisel reviewed Sep 9, 2016

raghavrv force-pushed the rename_labels_to_groups branch from b7fb29b to c99c6de Compare September 9, 2016 21:32

ENH labels --> groups

e79095d

raghavrv force-pushed the rename_labels_to_groups branch from c99c6de to e79095d Compare September 9, 2016 21:35

ogrisel reviewed Sep 10, 2016

raghavrv added 2 commits September 11, 2016 10:28

Modify group kfold docs

2d3c362

IID doc

b794354

ogrisel reviewed Sep 11, 2016

DOC Use Joel's wording

8820e93

raghavrv force-pushed the rename_labels_to_groups branch from 1b50bb1 to 8820e93 Compare September 11, 2016 13:08

jnothman reviewed Sep 11, 2016

Remove NOTE

048b640

ogrisel merged commit 9a12555 into scikit-learn:master Sep 11, 2016

raghavrv deleted the rename_labels_to_groups branch September 11, 2016 17:15

rsmith54 pushed a commit to rsmith54/scikit-learn that referenced this pull request Sep 14, 2016

[MRG+1] ENH/MNT Rename labels --> groups in CV tools (scikit-learn#6660)

5e60a48

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG+1] ENH/MNT Rename labels --> groups in CV tools (scikit-learn#6660)

01cc4ec

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] ENH/MNT Rename labels --> groups in CV tools (scikit-learn#6660)

c1f49a7

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] ENH/MNT Rename labels --> groups in CV tools (scikit-learn#6660)

22775c9

