MRG remove unreachable code from grid_search, test unsupervised setting #1210
Conversation
LGTM
sklearn/grid_search.py
I am not sure why you removed these lines. They seem useful to me.
Hm... you mean the case where the estimator doesn't have a score method, but a loss function (rather than a score function) was specified?
That is true...
"Unfortunately" all estimators have a score function... I'll have another look. I'm not sure how it is possible to use the two lines below.
Maybe I should have waited for #1198... the current GridSearchCV contains some magic...
Ok, so actually, I should remove even more. As all estimators define score, this code doesn't make any sense. How would you use it?
It used to be the case that score was not part of the required API: if you just implement your own fit / predict estimator you might have expected to be able to use the grid search tools on it.
On 10/14/2012 08:25 PM, Olivier Grisel wrote:
> In sklearn/grid_search.py:
> @@ -449,9 +442,5 @@ def fit(self, X, y):
>     def score(self, X, y=None):
>         if hasattr(self.best_estimator, 'score'):
>             return self.best_estimator_.score(X, y)
>         if self.score_func is None:
>             raise ValueError("No score function explicitly defined, "
>                              "and the estimator doesn't provide one %s"
>                              % self.best_estimator_)
>
> It used to be the case that score was not part of the required API: if
> you just implement your own fit / predict estimator you might have
> expected to be able to use the grid search tools on it.

But this is only the case if you didn't inherit from ClassifierMixin or
RegressorMixin. I thought that was the minimum requirement for the API.
If this is not the case, then I can add some dummy classes to test this
case.
The question is: do we really want to support it.
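For concreteness, a minimal sketch (my own illustration, not the PR's actual test code) of the kind of fit/predict-only dummy estimator that has no score method, i.e. the case the fallback was meant to cover. The class name and its dummy_param are made up; get_params/set_params are included only so that clone() works without inheriting from BaseEstimator.

import numpy as np

class MeanPredictor(object):
    """Predicts the training mean of y; deliberately defines no score method."""

    def __init__(self, dummy_param=1.0):
        self.dummy_param = dummy_param

    def fit(self, X, y):
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        return self.mean_ * np.ones(len(X))

    def get_params(self, deep=True):
        return {"dummy_param": self.dummy_param}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

# Hypothetical usage: with no score method on the estimator, an explicit
# loss or score function would have to be passed to the grid search, e.g.
#   GridSearchCV(MeanPredictor(), {"dummy_param": [0.1, 1.0]},
#                loss_func=mean_squared_error)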
On Sun, Oct 14, 2012 at 11:43:32AM -0700, Andreas Mueller wrote:
> "unfortunately" all estimators have a score-function...

Awesome! I am surprised, but happily surprised. Even the clustering ones?
But I guess that our code should also work with estimators that we did
not design. These might not have a score method.
On Sun, Oct 14, 2012 at 12:25:50PM -0700, Olivier Grisel wrote:
> It used to be the case that score was not part of the required API: if you just
> implement your own fit / predict estimator you might have expected to be able
> to use the grid search tools on it.

In my mind, it still is the case.
On Sun, Oct 14, 2012 at 12:44:38PM -0700, Andreas Mueller wrote:
> But this is only the case if you didn't inherit from ClassifierMixin or
> RegressorMixin. I thought that was the minimum requirement for the API.

In my mind, it is very important not to require inheritance to use
scikit-learn. The reason being that it should be possible to be
scikit-learn compliant without having a dependency on it.
On 10/15/2012 06:32 AM, Gael Varoquaux wrote:
> In sklearn/grid_search.py:
> @@ -449,9 +442,5 @@ def fit(self, X, y):
>     def score(self, X, y=None):
>         if hasattr(self.best_estimator, 'score'):
>             return self.best_estimator_.score(X, y)
>         if self.score_func is None:
>             raise ValueError("No score function explicitly defined, "
>                              "and the estimator doesn't provide one %s"
>                              % self.best_estimator_)
>
> On Sun, Oct 14, 2012 at 11:43:32AM -0700, Andreas Mueller wrote:
>> "unfortunately" all estimators have a score-function...
>
> Awesome! I am surprised, but happily surprised. Even the clustering
> ones? But I guess that our code should also work with estimators that
> we did not design. These might not have a score method.

I can't find these comments anywhere on GitHub... so now an unordered
email reply.

Most clustering algorithms can't have a score, because they don't
support predict.
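As a quick check of that claim (my own snippet, run against a current scikit-learn rather than the 2012 code), an agglomerative clustering estimator exposes fit_predict but neither predict nor score:

from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=3)
print(hasattr(agg, "predict"), hasattr(agg, "score"))  # False False
print(hasattr(agg, "fit_predict"))                     # True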
Apart from the comment on the removed lines, this looks good to me.
Ok, so one more comment on the unreachable code: if we want to support that, I can add a dummy estimator that has these properties and test it. Somewhat related: I think it would be good to have a separate class to do model selection for unsupervised models, as I think people usually wouldn't use cross-validation there (most unsupervised models in sklearn don't have a ...). Wdyt @GaelVaroquaux @ogrisel?
I have no strong opinion on supporting estimators without a score function. For the unsupervised case, I agree this would be better dealt with in dedicated classes. For instance, clustering could be evaluated with the stability selection / consensus index method, but that requires a special kind of CV iterator with overlapping folds, and we should not make the existing GS class more complex to support it.
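A rough sketch of that stability/consensus idea (my own illustration, not an API from this PR or from #1198): fit the same clustering model on two overlapping subsamples and compare the two label assignments on the shared points with the adjusted Rand index.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
rng = np.random.RandomState(0)

def stability_score(n_clusters, n_pairs=10, subsample=0.8):
    # Average agreement between clusterings fitted on overlapping subsamples.
    n = X.shape[0]
    size = int(subsample * n)
    scores = []
    for _ in range(n_pairs):
        a = rng.choice(n, size=size, replace=False)
        b = rng.choice(n, size=size, replace=False)
        shared = np.intersect1d(a, b)
        km_a = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X[a])
        km_b = KMeans(n_clusters=n_clusters, n_init=10, random_state=1).fit(X[b])
        scores.append(adjusted_rand_score(km_a.predict(X[shared]),
                                          km_b.predict(X[shared])))
    return np.mean(scores)

for k in (2, 3, 5, 8):
    print(k, stability_score(k))

As Olivier notes, wrapping something like this in a model-selection class would need an iterator over overlapping index pairs rather than the usual disjoint CV folds.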
I don't think it makes sense to use the inertia (the default score method for KMeans) as a way to select the number of clusters. Inertia is to be minimized for an a priori fixed number of clusters. If the number of clusters increases, inertia will always decrease, hence the best model will always be n_clusters=5 in this case, whatever the data.
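That point is easy to check with a short snippet (mine, not part of the PR): inertia keeps shrinking as n_clusters grows, so selecting n_clusters by inertia alone always favors the largest value in the grid.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ decreases monotonically (in expectation) as k grows,
    # even though the data only contains 3 blobs.
    print(k, km.inertia_)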
I know. See the discussion with @larsmans that GitHub doesn't show here.
This is just a smoke test. I could remove the case of unsupervised grid search altogether; I don't see a better way of testing it.
I just noticed that estimators have to define ...
Sorry for the confusion. I thought it would be easy to merge some parts of #1198 but apparently that is not the case.
On Sun, Oct 14, 2012 at 01:06:58PM -0700, Andreas Mueller wrote:
> ...

Yes!

I'd rather not remove this feature, as it creates heavier requirements on ...

I agree with you that for a different model selection scheme, a separate ...
I think the API requirements for doing grid search are quite strong already, and no one should attempt to use it without inheriting from the sklearn estimators. It relies heavily on a working ... In my mind, to be able to grid-search, you inherit. How else?
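To make that point concrete, here is a paraphrased sketch (my own, not the actual grid_search.py code) of what evaluating a single parameter setting needs from the estimator:

from sklearn.base import clone

def evaluate_grid_point(base_estimator, params, X_train, y_train,
                        X_test, y_test, score_func=None):
    est = clone(base_estimator)      # needs a working get_params
    est.set_params(**params)         # needs a working set_params
    est.fit(X_train, y_train)        # needs fit
    if score_func is not None:
        return score_func(y_test, est.predict(X_test))  # needs predict
    return est.score(X_test, y_test)                    # otherwise needs score

None of this strictly requires inheriting from BaseEstimator, but duck-typing all of it by hand is exactly the heavy requirement being discussed.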
About requiring a score function: in #1198, the requirement is to have a ... IIRC, it was your idea to handle the score function in the estimator. I could make it such that the standard scores, like ...
Based on my experience with clustering, I would have done the ...
This significantly increases test coverage.
Motivated by comments on my randomized search PR.
We should try to get 100% test coverage here. This is really the core of sklearn.