[MRG+1] ENH/MNT Rename labels --> groups #6660
Also ping @GaelVaroquaux @ogrisel @larsmans @mblondel |
-0 I cringe at breaking backward compatibility for that. People rely on us. Docs and books are written. |
We're already renaming the module... but yes, searchability is valuable too. |
Copying my comment from the other PR - I think the motivation for renaming |
I think the motivation for renaming labels --> groups is to demarcate clearly
what we mean by labels. (As labels can also mean the target or class labels).
I understand the motivation, and I agree with it. Yet, is it worth
breaking the dozens of books that have been written using scikit-learn,
and user code? Note that the breakage is more subtle than the change
that we have just done.
Anyhow, I am only -0, not -1.
|
Quoting a previous discussion at #4294 @vene wrote -
|
We had this discussion on whether we should use the |
@GaelVaroquaux I'm just populating this PR with relevant discussions for us to refer to :) And
But like Joel said |
I agree with Gael. I don't think it's worth it.
|
That's my view as well. labels was a weird, overloaded word in this context. I was personally very confused about what Groups is much better; though there could be confusion with, say, Such breakage is annoying on two accounts. (a) user code will need to change: but this will happen anyway because of the move to model_selection. (b) books becoming out of date, as @GaelVaroquaux points out. Now, my first impulse is to say that (b) is also moot because of model_selection but it's not that simple. It's easier to realize that Still, I think that if you at least know what your intention is, and understand the word "Group", the latter can be easily identified as relevant when tabbing through the imports. (In light of this, I would have preferred |
I'm closing as we have agreed this is not quite useful. Feel free to comment/reopen. |
I'm reopening this in light of #7210 |
looks like @agramfort changed his mind between this issue and #7210 ;) |
seems like it ... :)
|
a9ba660 to a84042c
This is ready for reviews @jnothman @vene @amueller @agramfort |
LGTM did you "git grep Label" in model_selection ? |
|
||
- :class:`LeavePLabelOut` **(p)** | ||
- :class:`LeavePGroupOut` **(p)** |
Groups?
+1
medical data collected from multiple patients, with multiple samples taken from | ||
each patient. And such data is likely to be dependent on the individual group | ||
(generative process). In our example, the patient id for each sample will be | ||
it's group identifier. |
"its"
b7fb29b to c99c6de
c99c6de to e79095d
group is not in both testing and training sets. This is necessary for example | ||
if you obtained data from different subjects and you want to avoid over-fitting | ||
(i.e., learning person specific features) by testing and training on different | ||
subjects. |
Group k-fold cross-validation will not make the model avoid over-fitting, but it makes it possible to detect over-fitting when it happens. Here is a suggested rephrasing:
:class:`GroupKFold` is a variation of k-fold which ensures that the same group is not represented in both testing and training sets. For example, if the data is obtained from different subjects with several samples per subject, and if the model is flexible enough to learn from highly person-specific features, it could fail to generalize to new subjects. :class:`GroupKFold` makes it possible to detect this kind of overfitting situation.
This was not addressed: https://github.com/scikit-learn/scikit-learn/pull/6660/files#r78210904 |
@ogrisel Done! Thanks for the patience! |
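For readers following the thread, the behaviour described in the suggested rephrasing can be illustrated with a small sketch. It uses `GroupKFold` from `sklearn.model_selection` (the renamed class this PR introduces); the toy arrays and the patient ids are assumptions made up for illustration and are not taken from the PR.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 8 samples taken from 4 patients (2 samples each).
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # patient id for each sample

cv = GroupKFold(n_splits=2)
folds = []
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # Collect the patient ids that ended up on each side of the split.
    train_groups = set(groups[train_idx].tolist())
    test_groups = set(groups[test_idx].tolist())
    folds.append((train_groups, test_groups))
    print("train patients:", sorted(train_groups),
          "| test patients:", sorted(test_groups))
```

In every fold the two sets of patient ids are disjoint, so a score computed on the test side reflects generalization to unseen patients rather than memorized person-specific features.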
train/test set. | ||
groups : array-like, with shape (n_samples,), optional | ||
Group labels used to constrain the permutation to specific subsets of | ||
data. |
This could be improved as suggested by @jnothman in https://github.com/scikit-learn/scikit-learn/pull/6660/files#r77563448
This was the improved version :p
Based on your comment, I'm guessing it's not clear? The first line says it's used for permutation and the NOTE which follows it clarifies that it is also used for label-cvs...
Besides the remaining unaddressed comment by @jnothman, LGTM. |
1b50bb1 to 8820e93
data. | ||
Labels to constrain permutation within groups, i.e. ``y`` values | ||
are permuted among samples with the same group identifier. | ||
When not specified, ``y`` values are permuted among all samples. |
It'd be good to add that it's also passed to Group CV splitters...
I added a NOTE below specially for that ;)
Ah, sorry. Still not used to github's "viewing a subset of changes". I don't see why this is a NOTE as distinct from a parameter description, though.
I felt the usage of GroupCV in permutation_test_score to be not a very common use case... Would you rather have the NOTE heading removed?
Yes.
@jnothman Done. |
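As an illustration of the `groups` docstring wording discussed in this thread ("``y`` values are permuted among samples with the same group identifier"), here is a minimal pure-Python sketch. The helper `permute_within_groups` is hypothetical and written only for this note; it is not scikit-learn's internal implementation, and the sample values are made up.

```python
import random
from collections import defaultdict


def permute_within_groups(y, groups=None, seed=0):
    """Return a shuffled copy of y.

    Hypothetical helper: with ``groups`` given, values are shuffled only
    among samples sharing a group id; otherwise among all samples.
    """
    rng = random.Random(seed)
    y = list(y)
    if groups is None:
        shuffled = y[:]
        rng.shuffle(shuffled)
        return shuffled

    # Map each group id to the positions of its samples.
    positions = defaultdict(list)
    for i, g in enumerate(groups):
        positions[g].append(i)

    out = y[:]
    for idx in positions.values():
        values = [y[i] for i in idx]
        rng.shuffle(values)
        # Write the shuffled values back into the same positions.
        for i, v in zip(idx, values):
            out[i] = v
    return out


y = ["a1", "a2", "a3", "b1", "b2", "b3"]
groups = ["A", "A", "A", "B", "B", "B"]
permuted = permute_within_groups(y, groups)
print(permuted)
```

Values starting with "a" stay in the first three positions and values starting with "b" in the last three, which is the constraint the docstring describes; without `groups`, any value may land anywhere.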
One of the Travis jobs failed randomly (the execution of a timed joblib doctest took 24s instead of the usual less than 1s). I restarted it. |
Squash merged! 🍻 |
🍻 |
Thanks, Raghav! On 12 September 2016 at 03:16, Raghav RV wrote:
|
Thank YOU for the patient reviews! |
Partially fixes #5053, fixes #7210
Vote
Gael : -0
Joel : +0.5?
Alex : -1 +1
Vene : +1
Andy : -1 +1
TODO
Leave*LabelOut --> Leave*GroupOut
labels parameter to groups.
LabelKFold --> GroupKFold
LabelShuffleSplit --> GroupShuffleSplit
Please take a look at this @jnothman @amueller @MechCoder @vene
REF
I raised this as a separate PR as I felt it warrants its own discussions.