Conversation

@rsmith54 (Contributor) commented Sep 6, 2016

Reference Issue

Fixes #7306.

What does this implement/fix? Explain your changes.

Adds a _pairwise property to OneVsOneClassifier and OneVsRestClassifier, and adds a test to check that it is properly set.
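In sketch form, the added property just delegates to the wrapped estimator (the class name and the `getattr` fallback below are illustrative, not the exact PR diff):

```python
# Minimal sketch of delegating _pairwise to the wrapped estimator, so the
# meta-estimator reports whether it expects a precomputed Gram matrix.
class MetaClassifierSketch:
    def __init__(self, estimator):
        self.estimator = estimator

    @property
    def _pairwise(self):
        """Indicate if wrapped estimator is using a precomputed Gram matrix."""
        return getattr(self.estimator, "_pairwise", False)
```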


@property
def _pairwise(self):
    '''Indicate if wrapped estimator is using a precomputed Gram matrix'''
Member:

Use double quotes """ please.

@rsmith54 (Contributor Author) commented Sep 6, 2016

Just changed the single quotes to double quotes.

clf_precomputed = svm.SVC(kernel='precomputed')
clf_notprecomputed = svm.SVC()

ovrFalse = OneVsRestClassifier(clf_notprecomputed)
Member:

for MultiClassClassifier in [OneVsRestClassifier, OneVsOneClassifier]:

?

@amueller (Member) commented Sep 6, 2016

Is it clear what I mean with the test for cross_val_score?

@rsmith54 (Contributor Author) commented Sep 6, 2016

Not exactly; we cross-posted, though. Something similar to the test I linked to?

@rsmith54 (Contributor Author) commented Sep 6, 2016

Okay, so I added another test which checks cross_val_score with precomputed vs. linear kernels, but this only works for OneVsRestClassifier, not OneVsOneClassifier. When running with a OneVsOneClassifier, I get

nose.proxy.ValueError: X.shape[0] should be equal to X.shape[1]

although when I print the shapes they naively seem the same. Is this expected (it's possible I'm missing the reason)?


# for MultiClassClassifier in [OneVsRestClassifier, OneVsOneClassifier]:
for MultiClassClassifier in [OneVsRestClassifier]:
    ovrFalse = MultiClassClassifier(clf_notprecomputed)
Member:

Please avoid camel case.

Member:

(unless it's a class name)

@jnothman (Member) commented Sep 6, 2016

Not sure where you're saying tests fail: it looks like tests are passing except the PEP8 check, which may be merely due to your print statements.

@rsmith54 (Contributor Author) commented Sep 6, 2016

They all succeed as is, but I have commented out
# for MultiClassClassifier in [OneVsRestClassifier, OneVsOneClassifier]:

The failure happens with the OneVsOneClassifier.

@jnothman (Member) commented Sep 6, 2016

oh, right.

@jnothman (Member) commented Sep 6, 2016

Running flake8 on the diff in the range 5305861..6aee368 (5 commit(s)):
--------------------------------------------------------------------------------
./sklearn/tests/test_multiclass.py:615:1: E302 expected 2 blank lines, found 1
def test_pairwise_attribute():
^
./sklearn/tests/test_multiclass.py:626:1: E302 expected 2 blank lines, found 1
def test_pairwise_cross_val_score():
^
./sklearn/tests/test_multiclass.py:641:9: F841 local variable 'clf' is assigned to but never used
        clf = clf_notprecomputed
        ^

Regarding the error, could you leave the broken test in, or else report the full traceback, so we don't necessarily need to go run it to help you debug?

@rsmith54 (Contributor Author) commented Sep 7, 2016

Hi, yes, I have re-included it.


for MultiClassClassifier in [OneVsRestClassifier, OneVsOneClassifier]:
    ovrFalse = MultiClassClassifier(clf_notprecomputed)
    assert_false(ovrFalse._pairwise)
Member:

camelCase

@jnothman (Member) commented Sep 7, 2016

Right. I've never touched the 1v1 code before. Here 1v1 pulls out only some samples from X without respect to _pairwise. It should be using _safe_split. We should probably move _safe_split to utils and use it there in 1v1.
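A numpy-only illustration of the problem being described (variable names are made up): slicing a precomputed Gram matrix by rows alone leaves columns that still refer to every training sample, which is why the square-matrix check fails.

```python
import numpy as np

# Why plain row indexing breaks with a precomputed kernel: a Gram matrix has
# samples on *both* axes, so selecting the samples of one binary problem must
# slice rows and columns together -- which is what _safe_split takes care of.
rng = np.random.RandomState(0)
X = rng.rand(6, 3)
K = X @ X.T                      # precomputed kernel, shape (6, 6)
ind = np.array([0, 2, 5])        # samples belonging to one class pair

K_rows_only = K[ind]             # (3, 6): columns still refer to all 6 samples
K_safe = K[np.ix_(ind, ind)]     # (3, 3): the square kernel a pairwise fit needs
```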

@rsmith54 (Contributor Author) commented Sep 7, 2016

I tried to play around with _safe_split, but I'm not exactly sure how to change that code to use it properly. Any hints? Thanks!

@jnothman jnothman changed the title fix for #7306 Add OneVs{One,All}Classifier._pairwise: fix for #7306 Sep 8, 2016
@jnothman (Member) commented Sep 8, 2016

https://github.com/scikit-learn/scikit-learn/blob/5305861/sklearn/multiclass.py#L403 should be changed to be something like

return _fit_binary(estimator, _safe_split(estimator, X, y, indices=ind[cond])[0], y_binary, classes=[i, j])

ideally we wouldn't duplicate work with y, but that's minor.

@rsmith54 (Contributor Author) commented Sep 8, 2016

Okay, so I had tried something similar (at least it gave me the same error as we now see). It breaks a bunch of the other tests as well, always in these lines:

  File "scikit-learn/sklearn/cross_validation.py", line 1638, in _safe_split
    y_subset = safe_indexing(y, indices)
  File "scikit-learn/sklearn/utils/__init__.py", line 110, in safe_indexing
    return X.take(indices, axis=0)
IndexError: index 66 is out of bounds for size 66

@jnothman (Member) commented Sep 8, 2016

Error type and message?

@rsmith54 (Contributor Author) commented Sep 8, 2016

Sorry, I cut it out; see above.

ind = np.arange(X.shape[0])
return _fit_binary(estimator, X[ind[cond]], y_binary, classes=[i, j])

return _fit_binary(estimator, _safe_split(estimator, X, y, indices=ind[cond])[0], y_binary, classes=[i, j])
Member:

sorry, my mistake. you shouldn't be passing y into _safe_split. Rather, _safe_split should probably allow y=None.
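A hedged sketch of what a _safe_split-style helper with an optional y could look like (the name and signature are illustrative, not the actual scikit-learn code):

```python
import numpy as np

# Sketch of a _safe_split-style helper: y may be None, and for pairwise
# estimators the precomputed kernel is sliced on both axes. train_indices
# selects the columns at test time, when rows and columns differ.
def safe_split_sketch(estimator, X, y, indices, train_indices=None):
    if getattr(estimator, "_pairwise", False):
        cols = indices if train_indices is None else train_indices
        X_subset = X[np.ix_(indices, cols)]
    else:
        X_subset = X[indices]
    y_subset = None if y is None else y[indices]
    return X_subset, y_subset
```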

@rsmith54 (Contributor Author) commented Sep 8, 2016

Okay, so that fixes the other tests, but our new test fails, again only on the OneVsOneClassifier, even with _safe_split:

ValueError: X.shape[1] = 99 should be equal to 66, the number of samples at training time

@rsmith54 (Contributor Author) commented Sep 8, 2016

Looking at it a bit more, I guess we need to add the same _safe_split call somewhere in the prediction logic as well, but I'm not sure where.

@jnothman (Member)

By not moving _safe_split to sklearn.utils, you've created a circular dependency between multiclass and cross_validation modules.

@jnothman (Member)

I don't think it belongs in utils.multiclass. It might be worth just sticking it in __init__ for now until someone comes up with something better.

@jnothman (Member)

Arguably, utils.metaestimators might be appropriate.

Recall, however, that sklearn.cross_validation is deprecated. You should be moving this from model_selection/_validation.py, and ideally don't touch cross_validation unless we decide to back-port the fix to the deprecated version.

@rsmith54 (Contributor Author)

Yeah, okay that makes a lot of sense. I'm happy to move everything over to model_selection/_validation.py. I'm still not exactly sure where the fix should be to properly use _safe_split in the prediction logic. Any clues where that should go?

@jnothman (Member)

I don't think it's relevant to the prediction logic.


@jnothman jnothman added this to the 0.18 milestone Sep 19, 2016
@jnothman (Member)

Not at this stage, @rsivapr. @amueller or someone else may review this soon, but it might wait until after the 0.18.0 release. I'll label it 0.18 because it's so close to completion it might make it in, but I'd consider this a non-urgent bug fix.

@jnothman (Member)

If you're lucky, the next review will just be a double-check with no work from you.

@ogrisel (Member) left a comment

Apart from the following comments, this LGTM.

K = np.dot(X, X.T)

cv = ShuffleSplit(test_size=0.25, random_state=0)
tr, te = list(cv.split(X))[0]
Member:

Could you please use more explicit variable names? E.g. train_indices and test_indices.

Member:

Just to clarify, this is moved code, not new code. Still, it might be a good idea to improve it, as it's easy to do so.


X_tr, y_tr = _safe_split(clf, X, y, tr)
K_tr, y_tr2 = _safe_split(clfp, K, y, tr)
assert_array_almost_equal(K_tr, np.dot(X_tr, X_tr.T))
Member:

Please also check that y_tr (to be renamed y_train) and y_tr2 (to be renamed y_train2) are equal.


X_te, y_te = _safe_split(clf, X, y, te, tr)
K_te, y_te2 = _safe_split(clfp, K, y, te, tr)
assert_array_almost_equal(K_te, np.dot(X_te, X_tr.T))
Member:

Similar update required here.

    else:
        y_subset = None

    return X_subset, y_subset
Member:

Why has this been moved to metaestimators.py instead of keeping it in sklearn/model_selection/_split.py ?

@jnothman (Member) commented Sep 19, 2016

It is now used in sklearn.multiclass and sklearn.model_selection. Do you think it belongs in model_selection? Do you think it would be better off in sklearn.utils.__init__ than sklearn.utils.metaestimators?

Member:

sklearn.model_selection._split seemed like a good module to host a private helper function named _safe_split. But I don't care that much.

Member:

sklearn.model_selection._split seemed like a good module to host a private helper function named _safe_split. But I don't care that much.

It was, but I think the name is not quite right. It's just harder to come up with a better one: _pairwise_friendly_indexing?

def __init__(self, estimator, n_jobs=1):
    self.estimator = estimator
    self.n_jobs = n_jobs
    self.pairwise_indices_ = None
Member:

The constructor of a scikit-learn estimator should never set attributes with a trailing _; it should only store hyperparameters as attributes. Attributes with a trailing _ should only be set by the fit method, or by a private submethod called only at fit time.

At test time, methods that need access to that attribute can check its presence with the _check_fitted_model helper.
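In sketch form (the class name below is made up), the convention looks like this:

```python
# Illustration of the convention being described: hyperparameters are stored
# verbatim in __init__; fitted state, marked by a trailing underscore,
# appears only once fit() has run.
class WrapperSketch:
    def __init__(self, estimator, n_jobs=1):
        self.estimator = estimator      # hyperparameters only, stored as-is
        self.n_jobs = n_jobs

    def fit(self, X, y):
        self.pairwise_indices_ = None   # fitted attribute, set at fit time
        return self
```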

Member:

I thought we had a consistency check in test_common for this kind of thing, but maybe it's not applied to meta-estimators.

Member:

Somehow I missed this, sorry.

y : numpy array of shape [n_samples]
Predicted multi-class targets.
"""

Member:

Why the new line here?

if indices is None:
    Xs = [X] * len(self.estimators_)
else:
    Xs = [X[:, idx] for idx in indices]
Member:

Is this case tested? If not please add a dedicated test in test_multiclass.py with SVC(kernel='precomputed') and check the expected shape of the output.
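A numpy-only illustration of what the else branch above does at predict time (here `indices` stands in for the fitted `pairwise_indices_`):

```python
import numpy as np

# Each binary estimator was fit on a subset of the training samples, so the
# test kernel's *columns* must be restricted to that subset before predicting.
rng = np.random.RandomState(0)
K_test = rng.rand(5, 8)          # kernel between 5 test and 8 train samples
indices = [np.array([0, 1, 4]), np.array([2, 3])]   # per-estimator subsets
Xs = [K_test[:, idx] for idx in indices]
```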

Contributor Author:

Let me know if the test test_pairwise_indices is what you are looking for here.

@ogrisel (Member) commented Sep 20, 2016

I would like to have a test that checks the call to the decision_function method on OvO & OvR wrapped models fit on a precomputed kernel.
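A rough sketch of such a test (assuming a current scikit-learn install; this is not the PR's actual test code):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Fit OvO/OvR wrappers on a precomputed kernel and check that
# decision_function runs and returns one column per class.
X, y = load_iris(return_X_y=True)
K = np.dot(X, X.T)               # square (n_samples, n_samples) Gram matrix
shapes = []
for Wrapper in (OneVsRestClassifier, OneVsOneClassifier):
    clf = Wrapper(SVC(kernel='precomputed')).fit(K, y)
    shapes.append(clf.decision_function(K).shape)
```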


- Cross-validation of :class:`OneVsOneClassifier` and
:class:`OneVsRestClassifier` now works with precomputed kernels.
(`#7350 <https://github.com/scikit-learn/scikit-learn/pull/7350/>`_)
Member:

Indentation



@raghavrv (Member)

It is now used in sklearn.multiclass and sklearn.model_selection. Do you think it belongs in model_selection?

Even I feel this should reside inside model_selection... But as you pointed out the cyclic import error, I think we have to go with sklearn.utils.__init__?

@jnothman (Member)

Even I feel this should reside inside model_selection... But as you pointed out the cyclic import error, I think we have to go with sklearn.utils.__init__?

I don't get why it should belong exclusively in model selection. It pertains to anything that indexes on X and has to respect the _pairwise attribute of a sub-estimator, i.e. any metaestimator with scissors.

@raghavrv (Member)

I don't get why it should belong exclusively in model selection.

I think the naming, signature, and intended use are not generic enough. At the least, a generic version of such a function should maybe be called _safe_subset and should not use the training_indices arg...

@rsmith54 (Contributor Author)

I made the changes requested by @ogrisel, other than those related to model_selection vs. metaestimators for safe_split.

:class:`OneVsRestClassifier` now works with precomputed kernels.
(`#7350 <https://github.com/scikit-learn/scikit-learn/pull/7350/>`_)
(`#7350 <https://github.com/scikit-learn/scikit-learn/pull/7350/>`_)
by `Russell Smith <https://github.com/rsmith54>`_.
Member:

You could also add your name to the bottom as we are expecting more amazing pull requests like these from you ;)

Member:

Now that you have added it there you can remove the link here... :)

@ogrisel (Member) commented Sep 20, 2016

I don't get why it should belong exclusively in model selection. It pertains to anything that indexes on X and has to respect the _pairwise attribute of a sub-estimator, i.e. any metaestimator with scissors.

As you wish. You can leave it where it is.

@ogrisel ogrisel merged commit 54b0e4b into scikit-learn:master Sep 20, 2016
@raghavrv (Member)

Thanks @rsmith54!

@jnothman (Member)

This proved somewhat trickier than was first thought, so big thanks
@rsmith54!


@rsmith54 (Contributor Author)

Thanks for all the help!

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
@amueller amueller mentioned this pull request Oct 13, 2016
jnothman referenced this pull request Nov 23, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

Successfully merging this pull request may close these issues.

_pairwise not available in OneVs{One,Rest}Classifier

5 participants