Conversation

@amueller (Member) commented Oct 6, 2012

This significantly increases test-coverage.
Motivated by comments on my randomized search PR.

We should try to get 100% test coverage here. This is really the core of sklearn.

@agramfort (Member)

LGTM

Member

I am not sure: why did you remove these lines? They seem useful to me.

Member Author

hm... you mean the case where the estimator doesn't have a score function but a loss function was specified?
That is true...
"unfortunately" all estimators have a score function... I'll have another look. I'm not sure how it is possible to use the two lines below...

Member Author

Maybe I should have waited for #1198... the current GridSearchCV contains some magic...

Member Author

Ok, so actually, I should remove even more.
As all estimators define score, this code doesn't make any sense.
How would you use it?

@ogrisel (Member)

It used to be the case that score was not part of the required API: if you just implement your own fit / predict estimator you might have expected to be able to use the grid search tools on it.
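To make that use case concrete, here is a minimal sketch (not code from this PR) of a fit/predict-only estimator run through grid search, assuming the 2012-era sklearn.grid_search module path and its score_func argument; all names below are illustrative:

```python
import numpy as np
from sklearn.grid_search import GridSearchCV  # 2012-era module path
from sklearn.metrics import zero_one_score    # era-appropriate accuracy


class ThresholdClassifier(object):
    """Hypothetical estimator: only fit/predict, no score, no mixins."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    # clone() still needs these two, even without inheriting BaseEstimator.
    def get_params(self, deep=True):
        return {'threshold': self.threshold}

    def set_params(self, **params):
        self.threshold = params['threshold']
        return self

    def fit(self, X, y):
        return self

    def predict(self, X):
        return (X[:, 0] > self.threshold).astype(int)


X = np.random.randn(40, 2)
y = (X[:, 0] > 0).astype(int)
grid = GridSearchCV(ThresholdClassifier(),
                    {'threshold': [-0.5, 0.0, 0.5]},
                    score_func=zero_one_score)
grid.fit(X, y)
# grid.score(X, y) is where the lines under discussion mattered: the
# estimator has no score method, so scoring must fall back to score_func.
```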

Member Author

On 10/14/2012 08:25 PM, Olivier Grisel wrote:

In sklearn/grid_search.py:

```diff
@@ -449,9 +442,5 @@ def fit(self, X, y):
     def score(self, X, y=None):
         if hasattr(self.best_estimator_, 'score'):
             return self.best_estimator_.score(X, y)
-        if self.score_func is None:
-            raise ValueError("No score function explicitly defined, "
-                             "and the estimator doesn't provide one %s"
-                             % self.best_estimator_)
```

> It used to be the case that score was not part of the required API: if
> you just implement your own fit / predict estimator you might have
> expected to be able to use the grid search tools on it.

But this is only the case if you didn't inherit from ClassifierMixin or RegressorMixin.
I thought that was the minimum requirement for the API.

If this is not the case, then I can add some dummy classes to test this
case.
The question is: do we really want to support it.

@GaelVaroquaux (Member)

On Sun, Oct 14, 2012 at 11:43:32AM -0700, Andreas Mueller wrote:

> "unfortunately" all estimators have a score function...

Awesome! I am surprised, but happily surprised. Even the clustering ones?

But I guess that our code should also work with estimators that we did
not design. These might not have a score method.

Member

On Sun, Oct 14, 2012 at 12:25:50PM -0700, Olivier Grisel wrote:

> It used to be the case that score was not part of the required API: if you just
> implement your own fit / predict estimator you might have expected to be able
> to use the grid search tools on it.

In my mind, it still is the case.

Member

On Sun, Oct 14, 2012 at 12:44:38PM -0700, Andreas Mueller wrote:

> But this is only the case if you didn't inherit from ClassifierMixin or
> RegressorMixin. I thought that was the minimum requirement for the API.

In my mind, it is very important not to require inheritance to use
scikit-learn. The reason is that it should be possible to be
scikit-learn compliant without having a dependency on it.
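As a hedged illustration of this design goal (not code from the thread), the following estimator imports nothing from scikit-learn, yet duck-types the surface the grid-search machinery relies on; all names are made up:

```python
class MeanRegressor(object):
    """Predicts the (shrunken) training mean; no sklearn import anywhere."""

    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage

    # The grid-search machinery only needs this duck-typed surface:
    def get_params(self, deep=True):
        return {'shrinkage': self.shrinkage}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        self.mean_ = (1.0 - self.shrinkage) * sum(y) / float(len(y))
        return self

    def predict(self, X):
        return [self.mean_] * len(X)
```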

Member Author

On 10/15/2012 06:32 AM, Gael Varoquaux wrote:

> On Sun, Oct 14, 2012 at 11:43:32AM -0700, Andreas Mueller wrote:
> > "unfortunately" all estimators have a score function...
>
> Awesome! I am surprised, but happily surprised. Even the clustering
> ones? But I guess that our code should also work with estimators that
> we did not design. These might not have a score method.

I can't find these comments anywhere on GitHub, so this is now an out-of-order email reply.
Most clustering algorithms can't have a score, because they don't
support predict.

@GaelVaroquaux (Member)

Apart from the comment on the removed lines, this looks good to me.

@amueller (Member Author)

Ok so one more comment on the unreachable code:
The code I removed would be useful for an estimator that defines predict, but not score, and doesn't inherit from ClassifierMixin or RegressorMixin.

If we want to support that, I can add a dummy estimator that has these properties and test it.
I'd rather remove this "feature" if no-one has a use-case.

Somewhat related: I think it would be good to have a separate class to do model selection for unsupervised models, as I think people usually wouldn't use cross-validation there (most unsupervised models in sklearn don't have a predict method, which makes cross-validation impossible).

wdyt @GaelVaroquaux @ogrisel ?

@ogrisel (Member) commented Oct 14, 2012

I have no strong opinion for supporting estimators without a score function.

For the unsupervised case I agree this would be better handled by dedicated classes. For instance, clustering could be evaluated with the stability selection / consensus index method, but this requires a special kind of CV iterator with overlapping folds, and we should not make the existing GS class more complex to support this.

Member

I don't think it makes sense to use the inertia (the default score method for KMeans) as a way to select the number of clusters. Inertia is to be minimized for an a priori fixed number of clusters. If the number of clusters increases, inertia will always decrease, hence the best model will always be n_clusters=5 in this case, whatever the data.
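A quick sketch illustrating this point (modern KMeans API, not code from the PR): even on pure noise, inertia keeps dropping as n_clusters grows, so "maximize the score" always picks the largest n_clusters in the grid:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.randn(200, 2)  # pure noise: no cluster structure at all

for k in (2, 3, 5):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    # inertia_ (sum of squared distances to closest centroid) shrinks
    # as k grows, regardless of the data.
    print(k, km.inertia_)
```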

Member Author

I know. See the discussion with @larsmans that github doesn't show here.

This is just a smoke test.
I could remove the case of unsupervised grid-search altogether.
I don't see a better way of testing it.

@amueller (Member Author)

I just noticed that, for #1198, estimators have to define score for grid search.
I'll cherry pick the simple stuff from here into master and maybe we can move the discussion there.

@amueller (Member Author)

Sorry for the confusion. I thought it would be easy to merge some parts of #1198 but apparently that is not the case.

@GaelVaroquaux (Member)

On Sun, Oct 14, 2012 at 01:06:58PM -0700, Andreas Mueller wrote:

> The code I removed would be useful for an estimator that defines predict, but
> not score, and doesn't inherit from ClassifierMixin or RegressorMixin.

Yes!

> If we want to support that, I can add a dummy estimator that has these
> properties and test it.
> I'd rather remove this "feature" if no-one has a use-case.

I'd rather not remove this feature, as it creates heavier requirements on
the API and tighter coupling, both of which I believe are bad things from
a design point of view.

> Somewhat related: I think it would be good to have a separate class to
> do model selection for unsupervised models, as no cross-validation is
> needed there....

I agree with you that for a different model selection scheme, a separate
class would be good. I am just wondering: which model selection scheme do
you think would be good? I am not sure that I know of a universal one.

@amueller (Member Author)

I think the API requirements for doing grid search are quite strong already, and no-one should attempt to use it without inheriting from the sklearn estimators.

It relies heavily on a working clone, get_params and set_params. Do you expect that people will reimplement these themselves? This is not really documented, btw.
Maybe we should write some documentation for how to write your own estimators?

In my mind, to be able to grid-search, you inherit. How else?
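For concreteness, a small sketch of the machinery in question (standard sklearn.base.clone; the estimator choice is arbitrary): clone rebuilds an estimator purely from its class and get_params(), which is why a correct get_params/set_params pair is the real requirement, inherited or not:

```python
from sklearn.base import clone
from sklearn.svm import SVC

est = SVC(C=10.0)
fresh = clone(est)  # rebuilt from type(est) and est.get_params()

assert fresh is not est                   # a new, unfitted object
assert fresh.get_params()['C'] == 10.0    # same hyperparameters
# Any object exposing a compatible get_params works here, whether or not
# it inherits from BaseEstimator.
```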

@amueller (Member Author)

About requiring a score function: in #1198, the requirement is that estimators have a score method and that the score method supports a score_func argument.

IIRC, it was your idea to handle the score function in the estimator. I could make it so that the standard scores, like zero_one_score, don't need a score function while auc_score does. That wouldn't be hard, though it would bloat the code a bit.
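A hedged guess at the shape of that requirement (the estimator, the signature, and the default are all hypothetical; the actual #1198 interface may differ, and zero_one_loss is used as the modern stand-in for the zero-one metric):

```python
from sklearn.metrics import zero_one_loss


class MajorityClassifier(object):
    """Hypothetical estimator meeting a #1198-style score requirement."""

    def fit(self, X, y):
        labels = list(y)
        self.majority_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.majority_] * len(X)

    def score(self, X, y, score_func=None):
        y_pred = self.predict(X)
        if score_func is not None:
            return score_func(y, y_pred)        # caller-supplied metric
        return 1.0 - zero_one_loss(y, y_pred)   # default: accuracy
```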

@amueller (Member Author)

> > Somewhat related: I think it would be good to have a separate class to
> > do model selection for unsupervised models, as no cross-validation is
> > needed there....
>
> I agree with you that for a different model selection scheme, a separate
> class would be good. I am just wondering: which model selection scheme do
> you think would be good? I am not sure that I know of a universal one.

Based on my experience with clustering, I would do the fit and the score
on the same dataset and assume some ground truth for the scoring.
Basically, GridSearchUnsupervised would just call fit and score, without
splitting up the dataset.
What the score does is basically up to the estimator / user.
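A minimal sketch of that idea (GridSearchUnsupervised is hypothetical, not sklearn API): fit and score every candidate on the full dataset, with no train/test split:

```python
from sklearn.base import clone


class GridSearchUnsupervised(object):
    """Hypothetical: fit and score each candidate on the whole dataset."""

    def __init__(self, estimator, param_list, score_func):
        self.estimator = estimator
        self.param_list = param_list   # e.g. [{'n_clusters': 2}, ...]
        self.score_func = score_func   # score_func(estimator, X, y_true)

    def fit(self, X, y_true=None):
        best_score = None
        for params in self.param_list:
            est = clone(self.estimator).set_params(**params).fit(X)
            score = self.score_func(est, X, y_true)
            if best_score is None or score > best_score:
                best_score = score
                self.best_estimator_, self.best_params_ = est, params
        self.best_score_ = best_score
        return self
```

With a known ground truth, score_func could, for instance, compare est.labels_ to it with adjusted_rand_score; what it computes is left entirely to the user.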
