Thanks to visit codestin.com
Credit goes to github.com

Skip to content

[MRG+1] ENH/MNT Rename labels --> groups #6660

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Sep 11, 2016

Conversation

raghavrv
Copy link
Member

@raghavrv raghavrv commented Apr 14, 2016

Partially fixes #5053, fixes #7210

Vote

Gael : -0
Joel : +0.5?
Alex : -1 +1
Vene : +1
Andy : -1+1

TODO

  • Renames Leave*LabelOut --> Leave*GroupOut
  • Renames labels parameter to groups.
  • Renames LabelKFold --> GroupKFold
  • Renames LabelShuffleSplit --> GroupShuffleSplit
  • Fix doctests
  • Make all the tests pass
  • Fix examples
  • Fix docs
  • Split cv section into two, one for IID data and one for grouped data...

Please take a look at this @jnothman @amueller @MechCoder @vene

REF

I raised this as a separate PR as I felt it warrants its own discussions.

@raghavrv
Copy link
Member Author

Also ping @GaelVaroquaux @ogrisel @larsmans @mblondel

@raghavrv raghavrv changed the title [MRG] ENH/MNT Rename labels --> groups [WIP] ENH/MNT Rename labels --> groups Apr 14, 2016
@GaelVaroquaux
Copy link
Member

-0

I cringe at breaking backward compatibility for that. People rely on us. Docs and books are written.

@jnothman
Copy link
Member

We're already renaming the module... but yes, searchability is valuable too.

@raghavrv
Copy link
Member Author

Copying my comment from the other PR -

I think the motivation for renaming labels --> groups is to demarcate clearly what we mean by labels. (As labels can also mean the target or class labels).

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Apr 14, 2016 via email

@raghavrv
Copy link
Member Author

Quoting a previous discussion at #4294

@vene wrote -

Based on what the code does, the labels in permutation_test_score is a "group label" (observable at runtime, sample property, yadda yadda).

Permutation tests are a kind of non-parametric test where you approximate the data distribution by shuffling the "labels" class labels (ys, argh!) of the observed data. If the data is grouped (e.g. multiple measurements of the same fMRI subject, multiple documents for the same query), the labels param is used to only shuffle y within each group. I guess a reason would be so that class probabilities keep the same distribution.

So I think labels should be passed to the inner CV. Prior to this PR, this function was probably used like permutation_test_score(X, y, labels, cv=LeavePLabelOut(len(y), labels...)). Does this pose any problems?

I think the person to ping about this is indeed @agramfort, based on some online threads I've found about permutation tests :)

@raghavrv
Copy link
Member Author

We had this discussion on whether we should use the labels as group labels for the cross-validator of permutation_test_score as well as for the permutation. That also needs to be addressed. Currently we use the labels for both permuting and as the group labels for the cv.

@raghavrv
Copy link
Member Author

@GaelVaroquaux I'm just populating this PR with relevant discussions for us to refer to :)

And

Yet, is it worth breaking the dozen of books that have been written using scikit-learn, and user code? Note that the breakage is more subttle than the change that we have just done.

But like Joel said model_selection is not yet released. May we could use this opportunity to make these kind of changes? Once released these issues become more pressing. We will be more compelled to not make these changes. However if the change brought about by this PR is not so useful, maybe we could leave things as such...

@agramfort
Copy link
Member

agramfort commented Apr 14, 2016 via email

@raghavrv raghavrv changed the title [WIP] ENH/MNT Rename labels --> groups [RFC/WIP] ENH/MNT Rename labels --> groups Apr 14, 2016
@raghavrv
Copy link
Member Author

@vene at #5053 wrote -

That's my view as well. labels was a weird, overloaded word in this context. I was personally very confused about what LeavePLabelOut did for a while, because I was thinking of y as labels. I thought it's some sort of zero-shot learning evaluation.

Groups is much better; though there could be confusion with, say, GroupLasso.

Such breakage is annoying on two accounts. (a) user code will need to change: but this will happen anyway because of the move to model_selection. (b) books becoming out of date, as @GaelVaroquaux points out.

Now, my first impulse is to say that (b) is also moot because of model_selection but it's not that simple. It's easier to realize that cross_validation.LeavePLabelOut became model_selection.LeavePLabelOut rather than if it becomes model_selection.LeavePGroupOut.

Still, I think that if you at least know what your intention is, and understand the word "Group", the latter can be easily identified as relevant when tabbing through the imports.

(In light of this, I would have preferred LabelKFold to be KFoldLabels so that it would be easy to type KFold[tab] and find KFoldGroups instead...)

@raghavrv
Copy link
Member Author

I'm closing as we have agreed this is not quite useful. Feel free to comment/reopen.

@raghavrv raghavrv closed this May 11, 2016
@raghavrv raghavrv deleted the rename_labels_to_groups branch May 11, 2016 13:15
@raghavrv raghavrv restored the rename_labels_to_groups branch August 20, 2016 13:26
@raghavrv raghavrv reopened this Aug 20, 2016
@raghavrv
Copy link
Member Author

I'm reopening this in light of #7210

@amueller amueller added this to the 0.18 milestone Aug 22, 2016
@amueller
Copy link
Member

looks like @agramfort changed his mind between this issue and #7210 ;)

@agramfort
Copy link
Member

agramfort commented Aug 23, 2016 via email

@raghavrv raghavrv force-pushed the rename_labels_to_groups branch 2 times, most recently from a9ba660 to a84042c Compare August 23, 2016 10:37
@raghavrv raghavrv changed the title [RFC/WIP] ENH/MNT Rename labels --> groups [MRG] ENH/MNT Rename labels --> groups Aug 23, 2016
@raghavrv
Copy link
Member Author

This is ready for reviews @jnothman @vene @amueller @agramfort

@agramfort
Copy link
Member

LGTM

did you

"git grep Label" in model_selection

?


- :class:`LeavePLabelOut` **(p)**
- :class:`LeavePGroupOut` **(p)**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Groups?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

medical data collected from multiple patients, with multiple samples taken from
each patient. And such data is likely to be dependent on the individual group
(generative process). In our example, the patient id for each sample will be
it's group identifier.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"its"

@raghavrv raghavrv force-pushed the rename_labels_to_groups branch from b7fb29b to c99c6de Compare September 9, 2016 21:32
@raghavrv raghavrv force-pushed the rename_labels_to_groups branch from c99c6de to e79095d Compare September 9, 2016 21:35
@raghavrv
Copy link
Member Author

raghavrv commented Sep 9, 2016

I think I've addressed all the comments. @amueller @jnothman @ogrisel @vene

group is not in both testing and training sets. This is necessary for example
if you obtained data from different subjects and you want to avoid over-fitting
(i.e., learning person specific features) by testing and training on different
subjects.
Copy link
Member

@ogrisel ogrisel Sep 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Group k-fold cross-validation will not avoid make the model avoid over-fitting but makes it possible to detect it when it happens. Here is a suggested rephrasing:

class:GroupKFold is a variation of k-fold which ensures that the same group is not represented in both testing and training sets. For example if the data is obtained from different subjects with several samples per-subject and if the model is flexible enough to learn from highly person specific features it could fail to generalize to new subjects. class:GroupKFold makes it possible to detect this kind of overfitting situations.

@ogrisel
Copy link
Member

ogrisel commented Sep 10, 2016

@raghavrv
Copy link
Member Author

raghavrv commented Sep 11, 2016

@ogrisel Done! Thanks for the patience!

train/test set.
groups : array-like, with shape (n_samples,), optional
Group labels used to constrain the permutation to specific subsets of
data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the improved version :p

Based on your comment, I'm guessing it's not clear? The first line says it's used for permutation and the NOTE which follows it clarifies that it is also used for label-cvs...

@ogrisel
Copy link
Member

ogrisel commented Sep 11, 2016

Besides the remaining unaddressed comment by @jnothman, LGTM.

@raghavrv raghavrv force-pushed the rename_labels_to_groups branch from 1b50bb1 to 8820e93 Compare September 11, 2016 13:08
data.
Labels to constrain permutation within groups, i.e. ``y`` values
are permuted among samples with the same group identifier.
When not specified, ``y`` values are permuted among all samples.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to add that it's also passed to Group CV splitters...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a NOTE below specially for that ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry. Still not used to github's "viewing a subset of changes". I don't see why this is a NOTE as distinct from a parameter description, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt the usage of GroupCV in permutation_test_score to be not a very common usecase... Would you rather have the NOTE heading removed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@raghavrv
Copy link
Member Author

@jnothman Done.

@ogrisel
Copy link
Member

ogrisel commented Sep 11, 2016

One of the travis job failed randomly (the execution of a timed joblib doctest took 24s instead of less than 1s usually). I restarted it.

@ogrisel ogrisel merged commit 9a12555 into scikit-learn:master Sep 11, 2016
@ogrisel
Copy link
Member

ogrisel commented Sep 11, 2016

Squash merged! 🍻

@raghavrv raghavrv deleted the rename_labels_to_groups branch September 11, 2016 17:15
@raghavrv
Copy link
Member Author

🍻

@jnothman
Copy link
Member

Thanks, Raghav!

On 12 September 2016 at 03:16, Raghav RV [email protected] wrote:

🍻


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#6660 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz64dJjU4zsouN0uOVq8yADwJNKKN9ks5qpDdQgaJpZM4IHUHL
.

@raghavrv
Copy link
Member Author

Thank YOU for the patient reviews!

rsmith54 pushed a commit to rsmith54/scikit-learn that referenced this pull request Sep 14, 2016
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LabelKFold -> GroupKFold? [RFC] Changes to model_selection?
9 participants