[MRG+1] GridSearchCV iid #9379

Merged
merged 19 commits into from
Dec 11, 2017

Member

@amueller amueller commented Jul 16, 2017

Continuation of #9103. Fixes #9085.
Closes #9103

@amueller amueller changed the title from "Iid mehss" to "GridSearchCV iid" Jul 16, 2017
@agramfort
Member

Does this make cross_val_score(...).mean() identical to the mean CV scores from GridSearchCV?

@amueller
Member Author

yes

@amueller
Member Author

well in version 0.21 by default

@amueller
Member Author

I think I misunderstood the iid parameter (again). It only reweights the mean computation but does not change the scoring. So it will now warn whenever the test set sizes are unequal.
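
To make the "reweights the mean" point concrete, here is a minimal sketch with made-up numbers. The per-fold scores and test-set sizes are hypothetical, and the helper names are not part of scikit-learn; the size-weighted variant illustrates the aggregation attributed to `iid=True` in this thread, while the plain mean is what `cross_val_score(...).mean()` computes.

```python
# Hypothetical per-fold scores and test-set sizes (illustrative numbers only).
fold_scores = [0.90, 0.80, 0.50]
test_sizes = [10, 10, 80]


def unweighted_mean(scores):
    """Plain mean over folds (what cross_val_score(...).mean() computes)."""
    return sum(scores) / len(scores)


def size_weighted_mean(scores, sizes):
    """Mean over folds, weighted by the number of test samples per fold."""
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


plain = unweighted_mean(fold_scores)
weighted = size_weighted_mean(fold_scores, test_sizes)
print(f"unweighted: {plain:.3f}  size-weighted: {weighted:.3f}")
```

The per-fold scores are the same in both cases; only the way they are averaged differs, so the two numbers diverge only when the test-set sizes are unequal.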

Member

@jnothman jnothman left a comment

otherwise LGTM.

- The ``iid`` parameter of :class:`model_selection.GridSearchCV` and
:class:`model_selection.RandomizedSearchCV` has been deprecated and will
be removed in version 0.21. Future behavior will be the current default
behavior (equivalent to ``iid=True``).
Member

This is incorrect.

Member

I'd also appreciate a note just explaining that weighting the average in CV is not appropriate (and we don't do it in cross_val_score either).

@jnothman jnothman changed the title from "GridSearchCV iid" to "[MRG+1] GridSearchCV iid" Aug 20, 2017
@jnothman
Member

Resolve conflicts, change version numbers. LGTM.

It's okay if this happens slowly. Let's just make it happen.

Another review?

..deprecated:: 0.19
Parameter ``iid`` has been deprecated in version 0.19 and
will be removed in 0.21.
Future (and default) behavior is equivalent to `iid=true`.
Member

these version numbers are not consistent with what's new

@agramfort
Member

besides LGTM

@amueller you need to rebase

@amueller
Member Author

Thanks for the review @agramfort, will fix it next week, afk camping.

@amueller
Member Author

amueller commented Sep 8, 2017

should be good now.

@@ -59,6 +59,11 @@ Model evaluation and meta-estimators
- A scorer based on :func:`metrics.brier_score_loss` is also available.
:issue:`9521` by :user:`Hanmin Qin <qinhanmin2014>`.

- The default of the ``iid`` parameter of :class:`model_selection.GridSearchCV` and
:class:`model_selection.RandomizedSearchCV` will change from ``True`` to ``False``
in version 0.22, and will be removed in version 0.24.
Member

"the parameter will be removed altogether"

If True, the data is assumed to be identically distributed across
the folds, and the loss minimized is the total loss per sample,
- and not the mean loss across the folds.
+ and not the mean loss across the folds. Default is True,
+ but will change to False in version 0.21.
Member

I think (particularly given that we tend to get complaints about deprecations) that it's worth adding a few words on why, or basically saying "We now consider the iid=True formulation to be optimising an incorrect cross validation objective."

@@ -1143,11 +1159,15 @@ class RandomizedSearchCV(BaseSearchCV):
- A string, giving an expression as a function of n_jobs,
as in '2*n_jobs'

- iid : boolean, default=True
+ iid : boolean, default=None
If True, the data is assumed to be identically distributed across
the folds, and the loss minimized is the total loss per sample,
and not the mean loss across the folds.
Member

Please make identical to above.

@@ -833,10 +844,15 @@ class GridSearchCV(BaseSearchCV):
- A string, giving an expression as a function of n_jobs,
as in '2*n_jobs'

- iid : boolean, default=True
+ iid : boolean, default=None
Member

maybe should be default='warn'

Member Author

hm... that's not entirely clear, is it? though I don't really have a better idea. Usually we use None for deprecated parameters.

Member

Generally, I prefer describing the defaults than stating them here anyway, particularly when they are something semantically underspecified like "None". But seeing default='warn' or default='deprecated' in the signature is quite user-friendly, IMO. None is not.
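
As an illustration of the `default='warn'` idea discussed here, the following is a minimal sketch of a sentinel default. `resolve_iid` is a hypothetical helper, not scikit-learn code, and the warning text is an assumption made for the example.

```python
import warnings


def resolve_iid(iid="warn"):
    """Hypothetical helper: map the user-facing ``iid`` argument to a bool."""
    if iid == "warn":  # the sentinel means "the user did not choose a value"
        warnings.warn(
            "The default of `iid` will change from True to False; "
            "set it explicitly to silence this warning.",
            DeprecationWarning,
            stacklevel=2,
        )
        return True  # keep the old default during the deprecation period
    return bool(iid)


# An explicit choice is passed through and emits no warning.
explicit = resolve_iid(False)
```

A string sentinel shows up in the signature as `iid='warn'`, which tells the reader that leaving it unset has a consequence; `None` would not convey that.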

Member

@qinhanmin2014 qinhanmin2014 left a comment

LGTM. I've fixed the conflict. Already +3 so merge? @jnothman @amueller @agramfort

@qinhanmin2014
Member

Seems that we now need to modify the test (introduced in #9677) to make CIs green.

@jnothman
Member

jnothman commented Dec 6, 2017

This pull request introduces 1 alert - view on lgtm.com

new alerts:

  • 1 for Comparison using is when operands support eq

Comment posted by lgtm.com

@qinhanmin2014
Member

I've resolved the conflict, the test error and the warning from lgtm. @amueller hope you won't mind.
ping @jnothman @amueller @agramfort already +3 so I think it's now ready for merge.

@@ -847,10 +858,16 @@ class GridSearchCV(BaseSearchCV):
- A string, giving an expression as a function of n_jobs,
as in '2*n_jobs'

- iid : boolean, default=True
+ iid : boolean, default='warn'
If True, the data is assumed to be identically distributed across
the folds, and the loss minimized is the total loss per sample,
Member

This has always been a weird description. Could we clarify that it is an average score across folds, weighted by the number of samples in each test set...?

- The default of the ``iid`` parameter of :class:`model_selection.GridSearchCV` and
:class:`model_selection.RandomizedSearchCV` will change from ``True`` to ``False``
in version 0.22, and the parameter will be removed in version 0.24 altogether.
:issue:`9085` by :user:`Laurent Direr <ldirer>` and `Andreas Müller`_.
Member

Add a note that: This parameter is of greatest practical significance where the sizes of different test sets in cross-validation are very unequal, e.g. in group-based CV strategies.
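
A small sketch of why unequal test-set sizes matter. The scores and fold sizes below are invented for illustration (the unequal sizes stand in for a group-based split), and `fold_means` is a hypothetical helper rather than a scikit-learn function.

```python
def fold_means(scores, sizes):
    """Return (plain mean, size-weighted mean) of per-fold scores."""
    plain = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)
    return plain, weighted


# Hypothetical per-fold scores, reused for both scenarios.
scores = [0.50, 0.60, 0.90]

# Equal test-set sizes: the two aggregations coincide.
equal_plain, equal_weighted = fold_means(scores, [20, 20, 20])

# Very unequal sizes, as can happen when each fold is one group.
group_plain, group_weighted = fold_means(scores, [2, 3, 45])

print(f"equal sizes:   {equal_plain:.3f} vs {equal_weighted:.3f}")
print(f"unequal sizes: {group_plain:.3f} vs {group_weighted:.3f}")
```

With equal sizes the choice of aggregation is irrelevant; with the unequal sizes the largest fold dominates the weighted mean.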

@qinhanmin2014
Member

ping @jnothman I've updated the documentation and what's new accordingly. I hope this doesn't make anyone unhappy :)

@jnothman
Member

jnothman commented Dec 7, 2017 via email

@qinhanmin2014
Member

It seems I mistakenly assumed there was already consensus, since this is marked as a blocker and was already +2 before I came.
I went through the discussions in #9103 and #9085 carefully before giving my approval, and it seems reasonable from my side. But I'm not confident enough to say it's definitely the right solution, which is why I'm still confirming the +3 PR.
So I'll wait. Ping @amueller

@jnothman
Member

jnothman commented Dec 7, 2017

Yes, I think we should merge this. But I think there had been some confusion, where some thought that iid=True was doing something like metric(cross_val_predict(est, X, y), y); it wasn't.
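
The distinction can be sketched with a metric that does not decompose over samples, such as precision. The labels and predictions below are hypothetical and are written out by hand rather than produced by an estimator; the "pooled" number corresponds to the `metric(cross_val_predict(est, X, y), y)` style of computation, while the other two are averages of per-fold scores.

```python
def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    hits = [t for t, p in zip(y_true, y_pred) if p == 1]
    return sum(hits) / len(hits)


# Hypothetical (y_true, y_pred) pairs for two test folds.
folds = [
    ([1, 0, 0, 0], [1, 0, 0, 0]),              # 1 predicted positive, 1 correct
    ([1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0]),  # 4 predicted positives, 2 correct
]

fold_scores = [precision(t, p) for t, p in folds]
fold_sizes = [len(t) for t, _ in folds]

# Plain mean of per-fold scores.
mean_of_folds = sum(fold_scores) / len(fold_scores)

# Per-fold scores weighted by test-set size.
weighted_mean = sum(s * n for s, n in zip(fold_scores, fold_sizes)) / sum(fold_sizes)

# Metric computed once on all held-out predictions pooled together.
all_true = [v for t, _ in folds for v in t]
all_pred = [v for _, p in folds for v in p]
pooled = precision(all_true, all_pred)

print(mean_of_folds, weighted_mean, pooled)
```

All three numbers differ here, which shows that weighting per-fold scores by test-set size is not the same thing as scoring the pooled predictions.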

@qinhanmin2014
Member

qinhanmin2014 commented Dec 7, 2017

@jnothman I think the doc is enough: we now have a warning if users do not set iid or set it to True. In the doc, we also explain the behaviour of iid=True/iid=False and why we are changing the default value to False. I'm not sure what the confusion is or what can be improved. Could you please provide more detail? Thanks :)

@jnothman jnothman merged commit 4321002 into scikit-learn:master Dec 11, 2017
@amueller
Member Author

@qinhanmin2014 thanks for finishing up. I guess the main issue with this was that it will add a lot of warnings. How many warnings do we get on master now?

@amueller
Member Author

(answer: a bunch :-/)

@jnothman
Member

jnothman commented Dec 11, 2017 via email

@amueller
Member Author

Not saying it was a bad idea to merge, but that had been a concern (I haven't been following).

@glemaitre
Member

Just stumbled into this in the SciPy tutorial. I get a warning and am then tempted to set iid=True. The issue is that the parameter will be removed, so there is no way to avoid the warning, is there?

@jnothman
Member

> Just stumbled into this in the SciPy tutorial. I get a warning and am then tempted to set iid=True. The issue is that the parameter will be removed, so there is no way to avoid the warning, is there?

That's why the deprecation cycle is extra long. You can silence the warning either with the warnings module or by setting iid=True/False explicitly, and it won't be removed for another 4 years or something.
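
For the "warnings module" route, a minimal standard-library sketch follows. `fit_with_default_iid` is a stand-in for a search fitted without setting `iid`, and the warning text and category are assumptions made for the example; the filter pattern should be adapted to the message actually emitted.

```python
import warnings


def fit_with_default_iid():
    """Stand-in for a search whose unset ``iid`` triggers a warning."""
    warnings.warn(
        "The default of the `iid` parameter will change from True to False",
        DeprecationWarning,
    )
    return "fitted"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from a state where warnings are shown
    # Ignore only warnings whose message starts with this text.
    warnings.filterwarnings(
        "ignore",
        message="The default of the `iid`",
        category=DeprecationWarning,
    )
    result = fit_with_default_iid()
    warnings.warn("unrelated notice", UserWarning)  # still reported

print(result, [str(w.message) for w in caught])
```

Filtering on the message prefix keeps other deprecation warnings visible, which is the "ignore this warning specifically" approach.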

@glemaitre
Member

> it won't be removed for another 4 years or something.

True. I think that I will ignore this warning specifically.

hristog added a commit to hristog/dask-examples that referenced this pull request Mar 11, 2021
tl;dr `iid` has been deprecated and the updated behavior corresponds
to `iid=True`.

References:
- Deprecation warning and explanation of associated behavior:
  scikit-learn/scikit-learn#9379.
- Deprecation: scikit-learn/scikit-learn#13834.
arnavs added a commit to QuantEcon/lecture-datascience.myst that referenced this pull request Apr 21, 2021