[MRG+1] Check estimator pairwise #9701
…into check_estimator_pairwise pull in upstream changes
…into check_estimator_pairwise pull in upstream changes
I've not checked how complete this is, but I like the direction it's heading in!
sklearn/base.py
Outdated
def is_pairwise(estimator):
    """Returns True if the given estimator has a _pairwise attribute
PEP 257: the summary should be one line only. More description can follow after a blank line.
sklearn/base.py
Outdated
    """Returns True if the given estimator has a _pairwise attribute
    set to True.
One blank line only please
sklearn/utils/estimator_checks.py
Outdated
def check_estimator_sparse_data(name, estimator_orig):

    # Sparse precomputed kernels aren't supported
    if getattr(estimator_orig, 'kernel', None) == 'precomputed':
Shouldn't you use is_pairwise here?
Actually I think we should be testing that an appropriate error is raised in that case
sklearn/utils/estimator_checks.py
Outdated
@@ -1194,6 +1223,7 @@ def check_estimators_fit_returns_self(name, estimator_orig):
    X, y = make_blobs(random_state=0, n_samples=9, n_features=4)
    # some want non-negative input
    X -= X.min()
    X = gram_matrix_if_pairwise(X, estimator_orig)
Hmm. Sometimes pairwise is for affinities, sometimes for distances. Estimators requiring distances may not play nicely with affinities and vice-versa. This may not be something we need to deal with now, but @amueller should probably consider an estimator tag which selects between distances and affinities when pairwise. Or we can use the presence of a metric parameter as a heuristic (for now)
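For illustration, the "presence of a metric parameter" heuristic could look roughly like the sketch below. `pairwise_kind` and the three `Fake*` classes are hypothetical names made up for this example; nothing here is part of scikit-learn or of this PR.

```python
def pairwise_kind(estimator):
    """Guess what kind of precomputed input a pairwise estimator expects.

    Hypothetical helper: returns "distance", "affinity", or None when the
    estimator is not pairwise at all.
    """
    if not getattr(estimator, "_pairwise", False):
        return None
    # Heuristic from the review comment: a `metric` parameter suggests the
    # estimator wants distances; otherwise assume affinities (kernels).
    return "distance" if hasattr(estimator, "metric") else "affinity"


class FakeKernelEstimator:
    # Hypothetical stand-in for something like SVC(kernel='precomputed').
    _pairwise = True
    kernel = "precomputed"


class FakeMetricEstimator:
    # Hypothetical stand-in for a neighbors model with metric='precomputed'.
    _pairwise = True
    metric = "precomputed"


class FakePlainEstimator:
    # Hypothetical estimator that takes ordinary feature matrices.
    pass


for est in (FakeKernelEstimator(), FakeMetricEstimator(), FakePlainEstimator()):
    print(type(est).__name__, pairwise_kind(est))
```

An explicit estimator tag would be more robust than this attribute sniffing, which is why the heuristic is only suggested "for now".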
sklearn/base.py
Outdated
    out : bool
        True if _pairwise is set to True and False otherwise.
    """
    return getattr(estimator, "_pairwise", False)
perhaps wrap this in bool, just to be sure?
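A minimal sketch of what the `bool` wrapping buys, assuming some estimator exposes a truthy but non-boolean `_pairwise`. The `TruthyPairwise` and `NoPairwise` classes are hypothetical and exist only for this example.

```python
def is_pairwise(estimator):
    """Return True if the estimator has a truthy ``_pairwise`` attribute."""
    # bool() normalizes truthy non-bool values (1, "yes", ...) to True/False,
    # so callers can rely on getting an actual bool back.
    return bool(getattr(estimator, "_pairwise", False))


class TruthyPairwise:
    # Hypothetical estimator whose attribute is truthy but not a bool.
    _pairwise = 1


class NoPairwise:
    # Hypothetical estimator without the attribute at all.
    pass


print(is_pairwise(TruthyPairwise()))
print(is_pairwise(NoPairwise()))
```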
sklearn/utils/estimator_checks.py
Outdated
X_train = gram_matrix_if_pairwise(X_train, classifier_orig,
                                  kernel=rbf_kernel)
X_test = gram_matrix_if_pairwise(X_test, classifier_orig,
This surely can't work. The test data needs to be the kernel applied on X_test with respect to the training.
It's a wonder that tests are passing
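To make the objection concrete, here is a dependency-free sketch. It uses a hand-written linear kernel rather than scikit-learn's `rbf_kernel`, and the toy data is invented for the example. The test-time matrix has to be computed between the test rows and the training rows, so that it has one column per training sample.

```python
def linear_kernel(A, B):
    """Return the matrix of dot products between rows of A and rows of B."""
    return [[sum(x * y for x, y in zip(a, b)) for b in B] for a in A]


X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 training samples
X_test = [[2.0, 3.0], [4.0, 5.0]]               # 2 test samples

# Fit-time input: training samples against themselves, shape (3, 3).
K_train = linear_kernel(X_train, X_train)

# Predict-time input: test samples against the TRAINING samples, shape (2, 3).
K_test = linear_kernel(X_test, X_train)

# What the outdated diff effectively computed: test against test, shape (2, 2).
# Its column count no longer matches the number of training samples.
K_wrong = linear_kernel(X_test, X_test)

print(len(K_test), len(K_test[0]))
print(len(K_wrong), len(K_wrong[0]))
```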
@@ -251,3 +252,9 @@ def __init__(self):
                        check_no_fit_attributes_set_in_init,
                        'estimator_name',
                        NonConformantEstimator)
PEP8: extra blank line required
sklearn/utils/estimator_checks.py
Outdated
@@ -1795,3 +1834,8 @@ def check_decision_proba_consistency(name, estimator_orig):
    a = estimator.predict_proba(X_test)[:, 1]
    b = estimator.decision_function(X_test)
    assert_array_equal(rankdata(a), rankdata(b))


def check_pairwise_estimator():
remove this please
…into example_LW_shrinkage merge upstream
…into check_estimator_pairwise Pull in upstream changes
Codecov Report
@@ Coverage Diff @@
## master #9701 +/- ##
==========================================
- Coverage 96.19% 96.19% -0.01%
==========================================
Files 336 336
Lines 62739 62781 +42
==========================================
+ Hits 60353 60392 +39
- Misses 2386 2389 +3
Only flake8 is failing now ...
sklearn/utils/estimator_checks.py
Outdated
@@ -404,6 +419,7 @@ def check_sample_weights_pandas_series(name, estimator_orig):
    try:
        import pandas as pd
        X = pd.DataFrame([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3]])
        X = gram_matrix_if_pairwise(X, estimator_orig)
X output here will still not be a DataFrame, will it? There's not much point in doing this unless we make the gram matrix a DataFrame, which we might as well do even if it's a bit of a weird use of a DataFrame
sklearn/utils/estimator_checks.py
Outdated
@@ -1795,3 +1835,4 @@ def check_decision_proba_consistency(name, estimator_orig):
    a = estimator.predict_proba(X_test)[:, 1]
    b = estimator.decision_function(X_test)
    assert_array_equal(rankdata(a), rankdata(b))
Why the extra blank line?
    # check that check_estimator() works on estimator with _pairwise
    # attribute set
    est = SVC(kernel='precomputed')
    check_estimator(est)
Can we do this for an estimator based on a metric as well as a kernel? It's very possible that doing so will break.
Got it. Can you give a quick example of an estimator based on a metric as well as a kernel?
KNeighborsRegressor or AgglomerativeClustering
… dataframes with pairwise kernel
…into check_estimator_pairwise pull in upstream changes
    check_estimator(est)


def test_check_estimator_metric_and_kernel():
There's no kernel here. But you also need metric=precomputed for this to pertain.
Perhaps to make sure these tests are doing what they're meant to, you should assert that the estimator is pairwise
…airwise estimator
…into check_estimator_pairwise pull in upstream changes
…into check_estimator_pairwise
…into check_estimator_pairwise
@amueller fixed, knn works for sparse
thanks, lgtm. @jnothman, still good?
Otherwise looks good.
Could you please add a new heading in the changelog called "Changes to estimator checks" and note this change there. I'll add a blurb there eventually.
sklearn/utils/estimator_checks.py
Outdated
                            "different from the number of features"
                            " in fit.".format(name)):
        classifier.decision_function(X.T)
    if not _is_pairwise(classifier):
Remind me why we don't have the decision_function pairwise case?
Or predict_proba
Looking at it now. I think my initial reaction was that transposing the pairwise matrix won't raise an error. I'll get it up and running 👍🏽
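A small sketch of why the transpose trick cannot trigger a shape error for pairwise input, using plain Python lists and a hand-written linear Gram matrix. The `shape` and `transpose` helpers and the toy data are assumptions made for this example only.

```python
def shape(matrix):
    """Return (n_rows, n_columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))


def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 samples, 2 features

# Linear Gram matrix of X with itself: always square, here (3, 3).
K = [[sum(x * y for x, y in zip(a, b)) for b in X] for a in X]

# Transposing a feature matrix changes its shape, so a shape check fires.
print(shape(X), shape(transpose(X)))

# Transposing the Gram matrix keeps its shape, so the same check passes.
print(shape(K), shape(transpose(K)))
```

For a symmetric kernel such as this one, `K` and its transpose are even equal entry by entry, so the check needs a differently shaped input to provoke the error.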
…r decision_function and predict_proba
…into check_estimator_pairwise
I'm happy to merge once you've updated the changelog in doc/whats_new/v0.20.rst
Sorry my last look has uncovered a few minor things
doc/whats_new/v0.20.rst
Outdated
Changes to estimator checks
---------------------------

- Pairwise Estimators
Please include a full description of what you're now checking, with reference to the PR and attribution, just like the other changelog entries. Thanks
doc/whats_new/v0.20.rst
Outdated
- Allow tests in :func:`estimator_checks.check_estimator` to test functions
  that accept pairwise data.
  :issue:`9701` by :user:`Andreas Mueller <amueller>`
This conventionally mentions the contributor of the fix, not the person who raised the issue.
sklearn/neighbors/regression.py
Outdated
@@ -139,6 +140,11 @@ def predict(self, X):
        y : array of int, shape = [n_samples] or [n_samples, n_outputs]
            Target values
        """
        if issparse(X) and self.metric == 'precomputed':
            raise ValueError(
                "Sparse matricies not supported for prediction with "
matricies -> matrices
    assert_true(np.mean(knn.predict(X2).round() == y) > 0.95)
    # sparse precomputed distance matrices not supported for prediction
    if knn.metric == 'precomputed':
        assert_raises(ValueError, knn.predict, csr_matrix(X2))
this is never actually run it seems...
sklearn/utils/estimator_checks.py
Outdated
def pairwise_estimator_convert_X(X, estimator, kernel=linear_kernel):

    if len(X.shape) == 1:
        X = X.reshape(-1, 1)
when is this needed? It seems the line is not currently covered by tests.
Added it in case for some reason X is a 1-D array. I'll remove it
…essor_sparse(). Already checked using test_check_estimator_pairwise()
It may still be a good idea to assert that predicting on a sparse precomputed matrix in knn raises a ValueError, but the test where you put that assertion didn't run it with metric=precomputed.
…into check_estimator_pairwise
Happy to merge when green
@jnothman is there anything else I should do before you merge?
I'd sort of expected someone in a different timezone would hit the green button! Thanks for the ping, and for your work!
Congrats @GKjohns :)
Reference Issue
Fixes issue #9580.
What does this implement/fix? Explain your changes.
This allows check_estimator() to work on estimators that have the _pairwise attribute set to True. test_check_estimator_pairwise() does this by calling it on an SVC with a precomputed kernel. I created a function that checks for the attribute in the estimator being tested and creates a precomputed Gram matrix if the estimator accepts pairwise input. In all of the applicable estimator checks, I wrap the data X in this function.
Any other comments?
Some of the checks either a) can't accept precomputed kernels or b) are set to fail in cases that don't apply to precomputed kernels. In those cases I skipped the test.
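As a rough illustration of the wrapping described above, the sketch below converts `X` to a Gram matrix only when the estimator is flagged as pairwise. It is not the code from this PR: `convert_X_if_pairwise`, the pure-Python `linear_kernel`, and the two `Fake*` classes are hypothetical stand-ins chosen so the example runs without scikit-learn.

```python
def linear_kernel(A, B):
    """Return the matrix of dot products between rows of A and rows of B."""
    return [[sum(x * y for x, y in zip(a, b)) for b in B] for a in A]


def convert_X_if_pairwise(X, estimator, kernel=linear_kernel):
    """Return kernel(X, X) for pairwise estimators, otherwise X unchanged.

    Hypothetical stand-in for the helper the description refers to.
    """
    if getattr(estimator, "_pairwise", False):
        return kernel(X, X)
    return X


class FakePairwiseEstimator:
    # Hypothetical estimator that expects precomputed pairwise input.
    _pairwise = True


class FakeFeatureEstimator:
    # Hypothetical estimator that expects an ordinary feature matrix.
    _pairwise = False


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 samples, 2 features

X_pairwise = convert_X_if_pairwise(X, FakePairwiseEstimator())
X_features = convert_X_if_pairwise(X, FakeFeatureEstimator())

print(len(X_pairwise), len(X_pairwise[0]))  # square: n_samples x n_samples
print(len(X_features), len(X_features[0]))  # unchanged: n_samples x n_features
```

Wrapping the data at the top of each check keeps the rest of the check body identical for pairwise and non-pairwise estimators.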