
[MRG+1] Check estimator pairwise #9701


Merged: 71 commits merged into scikit-learn:master on Nov 13, 2017

Conversation

@GKjohns (Contributor) commented Sep 7, 2017

Reference Issue

Fixes issue #9580.

What does this implement/fix? Explain your changes.

This allows check_estimator() to work on estimators that have the _pairwise attribute set to True. test_check_estimator_pairwise() verifies this by calling it on an SVC with a precomputed kernel.

I created a function that checks for the _pairwise attribute on the estimator being tested and builds a precomputed Gram matrix if the estimator accepts pairwise input. In all of the applicable estimator checks, I wrap the data X in this function.
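A minimal sketch of the kind of helper described above, assuming linear_kernel from sklearn.metrics.pairwise as the default kernel (the helper added in this PR is gram_matrix_if_pairwise, later renamed pairwise_estimator_convert_X):

from sklearn.metrics.pairwise import linear_kernel


def gram_matrix_if_pairwise(X, estimator, kernel=linear_kernel):
    # If the estimator expects precomputed pairwise input, replace the raw
    # feature matrix with a Gram matrix; otherwise pass X through unchanged.
    if getattr(estimator, "_pairwise", False):
        return kernel(X, X)
    return X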

Any other comments?

Some of the checks either a) can't accept precomputed kernels or b) are set to fail in cases that don't apply to precomputed kernels. In those cases I skipped the test.

@jnothman (Member) left a comment:

I've not checked how complete this is, but I like the direction it's heading in!

sklearn/base.py Outdated


def is_pairwise(estimator):
"""Returns True if the given estimator has a _pairwise attribute
Member:

PEP 257: the summary should be one line only. More description can follow after a blank line.

sklearn/base.py Outdated
"""Returns True if the given estimator has a _pairwise attribute
set to True.


Member:

One blank line only please

def check_estimator_sparse_data(name, estimator_orig):

    # Sparse precomputed kernels aren't supported
    if getattr(estimator_orig, 'kernel', None) == 'precomputed':
Member:

Shouldn't you use is_pairwise here?

Actually, I think we should be testing that an appropriate error is raised in this case.

@@ -1194,6 +1223,7 @@ def check_estimators_fit_returns_self(name, estimator_orig):
X, y = make_blobs(random_state=0, n_samples=9, n_features=4)
# some want non-negative input
X -= X.min()
X = gram_matrix_if_pairwise(X, estimator_orig)
Member:

Hmm. Sometimes pairwise is for affinities, sometimes for distances. Estimators requiring distances may not play nicely with affinities and vice-versa. This may not be something we need to deal with now, but @amueller should probably consider an estimator tag which selects between distances and affinities when pairwise. Or we can use the presence of a metric parameter as a heuristic (for now)
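A possible sketch of that heuristic (illustrative only; the helper name _uses_distances is hypothetical and not part of this PR):

def _uses_distances(estimator):
    # Heuristic suggested above: a pairwise estimator that exposes a `metric`
    # parameter is assumed to expect distances rather than kernels/affinities.
    return getattr(estimator, "_pairwise", False) and hasattr(estimator, "metric")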

sklearn/base.py Outdated
    out : bool
        True if _pairwise is set to True and False otherwise.
    """
    return getattr(estimator, "_pairwise", False)
Member:

perhaps wrap this in bool, just to be sure?
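Putting the two suggestions together (one-line PEP 257 summary, bool-wrapped return), a hedged sketch of the helper might look like:

def is_pairwise(estimator):
    """Return True if the estimator's _pairwise attribute is set to True.

    Parameters
    ----------
    estimator : object
        Estimator object to check.

    Returns
    -------
    out : bool
        True if _pairwise is set to True, False otherwise.
    """
    return bool(getattr(estimator, "_pairwise", False))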


X_train = gram_matrix_if_pairwise(X_train, classifier_orig,
                                  kernel=rbf_kernel)
X_test = gram_matrix_if_pairwise(X_test, classifier_orig,
Member:

This surely can't work. The test data needs to be the kernel applied on X_test with respect to the training.

Member:

It's a wonder that tests are passing
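A minimal sketch of the point being made, assuming rbf_kernel from sklearn.metrics.pairwise: the test Gram matrix has to be computed between X_test and X_train, not between X_test and itself.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train, X_test = rng.rand(20, 4), rng.rand(5, 4)
y_train = rng.randint(0, 2, 20)

K_train = rbf_kernel(X_train, X_train)  # shape (20, 20)
K_test = rbf_kernel(X_test, X_train)    # shape (5, 20): test samples vs. training samples

clf = SVC(kernel='precomputed').fit(K_train, y_train)
clf.predict(K_test)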

@@ -251,3 +252,9 @@ def __init__(self):
check_no_fit_attributes_set_in_init,
'estimator_name',
NonConformantEstimator)

Member:

PEP8: extra blank line required

@@ -1795,3 +1834,8 @@ def check_decision_proba_consistency(name, estimator_orig):
a = estimator.predict_proba(X_test)[:, 1]
b = estimator.decision_function(X_test)
assert_array_equal(rankdata(a), rankdata(b))


def check_pairwise_estimator():
Member:

remove this please

@codecov (bot) commented Sep 18, 2017

Codecov Report

Merging #9701 into master will decrease coverage by <.01%.
The diff coverage is 94.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9701      +/-   ##
==========================================
- Coverage   96.19%   96.19%   -0.01%     
==========================================
  Files         336      336              
  Lines       62739    62781      +42     
==========================================
+ Hits        60353    60392      +39     
- Misses       2386     2389       +3
Impacted Files Coverage Δ
sklearn/utils/tests/test_estimator_checks.py 96.83% <100%> (+0.14%) ⬆️
sklearn/neighbors/regression.py 100% <100%> (ø) ⬆️
sklearn/neighbors/tests/test_neighbors.py 99.43% <66.66%> (-0.15%) ⬇️
sklearn/utils/estimator_checks.py 93.21% <94.82%> (-0.1%) ⬇️
sklearn/ensemble/gradient_boosting.py 95.76% <0%> (-0.45%) ⬇️
sklearn/decomposition/pca.py 95.04% <0%> (-0.15%) ⬇️
sklearn/ensemble/tests/test_gradient_boosting.py 96.27% <0%> (-0.04%) ⬇️
sklearn/linear_model/stochastic_gradient.py 98.17% <0%> (ø) ⬆️
sklearn/feature_selection/base.py 94.79% <0%> (ø) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update abb43c1...3116c23.

@jnothman (Member)

Only flake8 is failing now ...

@@ -404,6 +419,7 @@ def check_sample_weights_pandas_series(name, estimator_orig):
try:
    import pandas as pd
    X = pd.DataFrame([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3]])
    X = gram_matrix_if_pairwise(X, estimator_orig)
Member:

X output here will still not be a DataFrame, will it? There's not much point in doing this unless we make the gram matrix a DataFrame, which we might as well do even if it's a bit of a weird use of a DataFrame
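A hedged sketch of that suggestion, assuming linear_kernel as the kernel: wrap the precomputed Gram matrix back into a DataFrame so the check still exercises pandas input.

import pandas as pd
from sklearn.metrics.pairwise import linear_kernel

X = pd.DataFrame([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3]])
# keep the pandas container: pairwise estimators then see a DataFrame-wrapped Gram matrix
X = pd.DataFrame(linear_kernel(X.values, X.values))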

@@ -1795,3 +1835,4 @@ def check_decision_proba_consistency(name, estimator_orig):
a = estimator.predict_proba(X_test)[:, 1]
b = estimator.decision_function(X_test)
assert_array_equal(rankdata(a), rankdata(b))

Member:

Why the extra blank line?

# check that check_estimator() works on estimator with _pairwise
# attribute set
est = SVC(kernel='precomputed')
check_estimator(est)
Member:

Can we do this for an estimator based on a metric as well as a kernel? It's very possible that doing so will break.

@GKjohns (Contributor Author):

Got it. Can you give a quick example of an estimator based on a metric as well as a kernel?

Member:

KNeighborsRegressor or AgglomerativeClustering
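A hedged sketch of such a test, mirroring the SVC case above but with a distance-based pairwise estimator:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils.estimator_checks import check_estimator

# distance-based pairwise estimator, as opposed to the kernel-based
# SVC(kernel='precomputed') tested above
est = KNeighborsRegressor(metric='precomputed')
check_estimator(est)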

check_estimator(est)


def test_check_estimator_metric_and_kernel():
Member:

There's no kernel here. But you also need metric=precomputed for this to pertain.

Perhaps to make sure these tests are doing what they're meant to, you should assert that the estimator is pairwise
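For example, a minimal sketch using the est defined in the test above:

# sanity check that the estimator under test really is pairwise
assert est._pairwise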

@GKjohns (Contributor Author) commented Oct 30, 2017

@amueller fixed, knn now works for sparse X where metric != 'precomputed'.

@amueller (Member)

thanks, lgtm. @jnothman, still good?

@jnothman (Member) left a comment:

Otherwise looks good.

Could you please add a new heading in the changelog called "Changes to estimator checks" and note this change there. I'll add a blurb there eventually.

"different from the number of features"
" in fit.".format(name)):
classifier.decision_function(X.T)
if not _is_pairwise(classifier):
Member:

Remind me why we don't have the decision_function pairwise case?

Member:

Or predict_proba

@GKjohns (Contributor Author):

Looking at it now. I think my initial reaction was that transposing the pairwise matrix won't raise an error. I'll get it up and running 👍🏽
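A small sketch of that reasoning: a precomputed Gram matrix is square, so transposing it cannot change the apparent number of features.

import numpy as np
from sklearn.metrics.pairwise import linear_kernel

X = np.random.RandomState(0).rand(9, 4)
K = linear_kernel(X, X)       # shape (9, 9)
# K.T has the same shape as K, so a "wrong number of features" error never triggers
assert K.T.shape == K.shape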

@jnothman (Member) commented Nov 1, 2017

I'm happy to merge once you've updated the changelog in doc/whats_new/v0.20.rst

@jnothman (Member) left a comment:

Sorry, my last look has uncovered a few minor things.

Changes to estimator checks
---------------------------

- Pairwise Estimators
Member:

Please include a full description of what you're now checking, with reference to the PR and attribution, just like the other changelog entries. Thanks


- Allow tests in :func:`estimator_checks.check_estimator` to test functions
that accept pairwise data.
:issue:`9701` by :user:`Andreas Mueller <amueller>`
Member:

This conventionally mentions the contributor of the fix, not the person who raised the issue.

@@ -139,6 +140,11 @@ def predict(self, X):
y : array of int, shape = [n_samples] or [n_samples, n_outputs]
    Target values
"""
if issparse(X) and self.metric == 'precomputed':
    raise ValueError(
        "Sparse matricies not supported for prediction with "
Member:

matricies -> matrices

assert_true(np.mean(knn.predict(X2).round() == y) > 0.95)
# sparse precomputed distance matrices not supported for prediction
if knn.metric == 'precomputed':
    assert_raises(ValueError, knn.predict, csr_matrix(X2))
Member:

this is never actually run it seems...

def pairwise_estimator_convert_X(X, estimator, kernel=linear_kernel):

    if len(X.shape) == 1:
        X = X.reshape(-1, 1)
Member:

when is this needed? It seems the line is not currently covered by tests.

@GKjohns (Contributor Author):

I added it in case X is, for some reason, a 1-D array. I'll remove it.

@jnothman (Member) left a comment:

It may still be a good idea to assert that predicting on a sparse precomputed matrix in knn raises a ValueError, but the test where you put that assertion didn't run it with metric=precomputed.
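A hedged sketch of a test that would exercise that error path directly (assuming the ValueError added in this PR for sparse precomputed input to predict):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils.testing import assert_raises

rng = np.random.RandomState(0)
X, y = rng.rand(10, 3), rng.rand(10)
D = euclidean_distances(X)  # precomputed (dense) distance matrix

knn = KNeighborsRegressor(n_neighbors=3, metric='precomputed').fit(D, y)
# predicting on a *sparse* precomputed matrix should raise ValueError
assert_raises(ValueError, knn.predict, csr_matrix(D))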

@jnothman (Member) commented Nov 9, 2017

Happy to merge when green

@GKjohns (Contributor Author) commented Nov 13, 2017

@jnothman is there anything else I should do before you merge?

@jnothman (Member)

I'd sort of expected someone in a different timezone would hit the green button!

Thanks for the ping, and for your work!

@jnothman jnothman merged commit de0581a into scikit-learn:master Nov 13, 2017
@GKjohns GKjohns deleted the check_estimator_pairwise branch November 13, 2017 23:19
@amueller (Member)

Congrats @GKjohns :)

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017