[MRG+1] Raise error when SparseSeries is passed into classification metrics #7373

Merged
merged 20 commits into from
Oct 6, 2017

Conversation

nielsenmarkus11
Contributor

Reference Issue

Fixes #7352

What does this implement/fix? Explain your changes.

This change raises an error when either the y_true or the y_score input is a pandas SparseSeries.

Any other comments?

@@ -22,6 +22,7 @@
import warnings
import numpy as np
from scipy.sparse import csr_matrix
import pandas
Contributor

Hi, thanks for the PR. But AFAIR, the general convention is to import pandas inside a try/except block (like here). I suppose this is necessary since pandas is not a compulsory requirement for installing scikit-learn, and this import would raise an error if pandas isn't installed. Other than that, LGTM.

Member

Actually, you cannot import pandas AT ALL. What you refer to is a test, and you can't really do the same thing here. You can check the class name, though, if you like.
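A minimal sketch of the class-name check suggested here, which avoids importing pandas in library code. The helper name `is_sparse_series` and the stand-in class are hypothetical and exist only for illustration; the stand-in is not the pandas class.

```python
def is_sparse_series(y):
    """Return True if y's class is named 'SparseSeries'.

    Only the class name is inspected, so pandas never has to be imported.
    """
    return y.__class__.__name__ == 'SparseSeries'


# Hypothetical stand-in used only to exercise the check without pandas.
class SparseSeries(list):
    pass


print(is_sparse_series(SparseSeries([1, 0, 0, 1, 0])))
print(is_sparse_series([1, 0, 0, 1, 0]))
```

The trade-off is that any unrelated class that happens to share the name would also match, which is the price of keeping pandas an optional dependency.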

Contributor

@amueller sorry for my ignorance. Will keep it in mind from next time.

Member

@maniteja123 don't sweat it, any help is appreciated :)

Contributor Author

Okay. I will be removing this line. Thanks.

@jnothman
Member

This shouldn't just be in roc_* but should fix all metrics by putting in type_of_target. Needs tests.

@nielsenmarkus11
Contributor Author

nielsenmarkus11 commented Sep 12, 2016

@jnothman, just for clarity: since roc_* doesn't currently call type_of_target, would you recommend adding the type_of_target call to the functions within roc_* or to _binary_clf_curve?

Thanks

@nielsenmarkus11
Contributor Author

I've removed the import pandas code and incorporated a check within the type_of_target code. Tested both in Python 3.5.2 and 2.7.12. Are there any additional tests that I need to do?

@amueller
Member

you should add a test. I'm not sure if we have a mock for a sparse series but there are several pandas mocks that you could use.

@nielsenmarkus11
Contributor Author

Just to make sure I understand correctly... I should add a new test to this code in sklearn/utils/tests/test_multiclass.py, correct?

def test_type_of_target():
    for group, group_examples in iteritems(EXAMPLES):
        for example in group_examples:
            assert_equal(type_of_target(example), group,
                         msg=('type_of_target(%r) should be %r, got %r'
                              % (example, group, type_of_target(example))))

    for example in NON_ARRAY_LIKE_EXAMPLES:
        msg_regex = r'Expected array-like \(array or non-string sequence\).*'
        assert_raises_regex(ValueError, msg_regex, type_of_target, example)

    for example in MULTILABEL_SEQUENCES:
        msg = ('You appear to be using a legacy multi-label data '
               'representation. Sequence of sequences are no longer supported;'
               ' use a binary array or sparse matrix instead.')
        assert_raises_regex(ValueError, msg, type_of_target, example)

@amueller
Member

yeah that sounds like a reasonable place

@@ -294,6 +294,11 @@ def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):
    thresholds : array, shape = [n_thresholds]
        Decreasing score values.
    """
    # Check to make sure y_true is valid
    y_type = type_of_target(y_true)
    if y_type != "binary":
Contributor Author

Also, in some of the other testing I'm getting the error: ValueError: multiclass format is not supported. I was under the impression that the _binary_clf_curve required 'binary' data. Should it also be allowed to accept 'multiclass' data?

        from pandas import SparseSeries
    except ImportError:
        pass
    y = SparseSeries([1, 0, 0, 1, 0])
Contributor Author

So I'm seeing an error in the automatic checks that states: UnboundLocalError: local variable 'SparseSeries' referenced before assignment. Do I need to put all of the test code within the try block?
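A small sketch of why this error appears, assuming the import fails. An import statement inside a function binds a local name; if the import raises and the exception is swallowed, the name is never bound, and the later use fails. The module name below is a hypothetical placeholder chosen so that the import is guaranteed to fail.

```python
def broken_test():
    try:
        # Hypothetical module name; it stands in for a missing pandas install.
        from a_module_that_is_not_installed import SparseSeries
    except ImportError:
        # Swallowing the error leaves the local name SparseSeries unbound.
        pass
    # This line runs even though the import failed.
    return SparseSeries([1, 0, 0, 1, 0])


try:
    broken_test()
except UnboundLocalError as exc:
    print("caught:", type(exc).__name__)
```

Moving all the test code into the try block would hide the failure, but it would also silently pass when pandas is missing instead of reporting the test as skipped.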

@nielsenmarkus11
Contributor Author

Okay. I think it should be good now. I understand the supposed issue with test_precision_recall_curve_pos_label and put this back in. Perhaps the documentation can be updated to include the pos_label exception for the y_true input. I've added the test back in as well as a test_binary_clf_curve function.

@amueller amueller added this to the 0.19 milestone Oct 27, 2016
Member

@jnothman jnothman left a comment

Otherwise, this LGTM.

        msg = "y cannot be class 'SparseSeries'."
        assert_raises_regex(ValueError, msg, type_of_target, y)
    except ImportError:
        pass
Member

Please wrap only the import statement and use raise SkipTest("Pandas not found") as elsewhere
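A sketch of the requested pattern, in which only the import is guarded and a missing pandas is reported as a skip. To keep the example self-contained, `reject_sparse_series` is a hypothetical stand-in for the check in `type_of_target`, and `SkipTest` is taken from the standard library's `unittest`; the actual test in the PR may import these differently.

```python
from unittest import SkipTest


def reject_sparse_series(y):
    # Hypothetical stand-in for the check added to type_of_target.
    if y.__class__.__name__ == 'SparseSeries':
        raise ValueError("y cannot be class 'SparseSeries'.")


def test_sparse_series_rejected():
    # Guard only the import; the assertions stay outside the try block.
    try:
        from pandas import SparseSeries
    except ImportError:
        raise SkipTest("Pandas not found")

    y = SparseSeries([1, 0, 0, 1, 0])
    try:
        reject_sparse_series(y)
    except ValueError as exc:
        assert "SparseSeries" in str(exc)
    else:
        raise AssertionError("expected ValueError")
```

With this layout a failure in the assertions is never mistaken for a missing dependency, and test runners show the test as skipped rather than passed when the import fails.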

@jnothman jnothman changed the title from "Raise error when SparseSeries is passed into roc_curve" to "[MRG+1] Raise error when SparseSeries is passed into roc_curve" May 28, 2017

    y = SparseSeries([1, 0, 0, 1, 0])
    msg = "y cannot be class 'SparseSeries'."
    assert_raises_regex(ValueError, msg, type_of_target, y)

Contributor Author

Per @jnothman 's request, I've only wrapped the import and added raise SkipTest("Pandas not found") otherwise.

Member

@jnothman jnothman left a comment

LGTM

    y_type = type_of_target(y_true)
    if not (y_type == "binary" or
            (y_type == "multiclass" and pos_label is not None)):
        raise ValueError("{0} format is not supported".format(y_type))
Member

It's a nitpick, but it would help the user to give a different error message if y_type == "multiclass" and pos_label is None.

Besides, I am surprised: is it really the case that in multiclass settings we require pos_label not to be specified? I would have thought the opposite. Is there an error in the condition above, or in my assumptions about our code?

Member

As @jnothman pointed out: the code is correct, I was confused by the double negation.

Still, a different error message would help.
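One way the condition under review could be unrolled so that the multiclass case without pos_label gets its own message. This is only a sketch: the helper name `check_binary_clf_target` and the wording of the extra message are assumptions, not what the PR merged.

```python
def check_binary_clf_target(y_type, pos_label=None):
    """Hypothetical helper that accepts the same inputs as the reviewed condition."""
    if y_type == "binary":
        return
    if y_type == "multiclass":
        if pos_label is not None:
            # Multiclass targets are accepted once a positive label is named.
            return
        # Assumed wording for the more specific message.
        raise ValueError(
            "multiclass format is not supported unless pos_label is given")
    raise ValueError("{0} format is not supported".format(y_type))


check_binary_clf_target("binary")
check_binary_clf_target("multiclass", pos_label=2)
```

Written without the outer negation, the three outcomes (accepted, multiclass without pos_label, any other format) are each visible on their own branch.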

Member

Still, a different error message would help.

Agreed. The error message template I generally try to follow is something like:

Allowed values for parameter_name are ['value1', 'value2', 'value3']. Instead you provided 'parameter_name={parameter_value}'
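The template could be produced by a small formatting helper such as the sketch below. The function name `invalid_value_message` and the `average` parameter used in the example call are hypothetical and serve only to illustrate the template.

```python
def invalid_value_message(name, value, allowed):
    """Build an error message that follows the template quoted above."""
    return ("Allowed values for {name} are {allowed}. "
            "Instead you provided '{name}={value}'"
            .format(name=name, allowed=list(allowed), value=value))


# Hypothetical parameter name and values, for illustration only.
print(invalid_value_message("average", "foo", ["micro", "macro"]))
```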

Contributor Author

@lesteve, my choice of error message was copy pasted from other portions of this code. I chose the language to be consistent with the other instances of similar errors in ranking.py.

@lesteve
Member

lesteve commented Jun 1, 2017

Hmmm I am a bit confused on this one. I commented on the issue, see #7352 (comment).

@jnothman
Member

jnothman commented Jun 1, 2017 via email

@@ -234,6 +234,10 @@ def type_of_target(y):
        raise ValueError('Expected array-like (array or non-string sequence), '
                         'got %r' % y)

    sparseseries = (y.__class__.__name__ == 'SparseSeries')
Member

Just testing the name of the class is a bit dodgy, I think it would be better to use an isinstance.

Contributor Author

@nielsenmarkus11 nielsenmarkus11 Jun 1, 2017

So, I've gone back to using the name of the class, per prior comments from @amueller on commit d21c7e38674388f97e146aef67f42bef2fe5d2d2, pandas should not be imported at all except in test.

@jnothman
Member

jnothman commented Jun 1, 2017 via email

@jnothman jnothman changed the title from "[MRG+1] Raise error when SparseSeries is passed into roc_curve" to "[MRG+1] Raise error when SparseSeries is passed into classification metrics" Jun 18, 2017
@jnothman
Member

jnothman commented Oct 3, 2017

Another review here?

@nielsenmarkus11
Contributor Author

Ping @amueller

@GaelVaroquaux
Member

LGTM. Merging given @jnothman 's +1

@GaelVaroquaux GaelVaroquaux merged commit 3a48f0a into scikit-learn:master Oct 6, 2017
@GaelVaroquaux
Member

Thanks!

@nielsenmarkus11
Contributor Author

Thank you!

@nielsenmarkus11 nielsenmarkus11 deleted the sparse branch October 6, 2017 22:14
@nielsenmarkus11 nielsenmarkus11 restored the sparse branch October 6, 2017 22:14
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 21, 2017
…etrics (scikit-learn#7373)

* Raise error when SparseSeries is passed into roc_curve

* Changed "y_true" in second if block to "y_score"

* Remove code to import pandas and add sparseseries check to 'type_of_target' function. Finally, add 'type_of_target' call to _binary_clf_curve

* Remove pandas import and old comparison in roc_curve.

* Add test for 'type_of_target' function

* Add white space after commas

* Correct other white space issues

* Move type_of_target test into try clause, remove test_precision_recall_curve_pos_label since as multiclass it doesn't make sense

* Add test_precision_recall_curve_pos_label back in and also add test_binary_clf_curve to test new logic in _binary_clf_curve function

* Correct syntax and formatting.

* Remove trailing white space

* Correct validation logic

* Update test_multiclass.py per @jnothman 's request.

* Import SkipTest function.

* Remove extra white space from line 303
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Oct 24, 2017
* tag '0.19.1': (117 commits)
  TST Improve SelectFromModel tests (scikit-learn#9733)
  Name in what's new
  [MRG+1] Raise error when SparseSeries is passed into classification metrics (scikit-learn#7373)
  Fix LogisticRegressionCV default solver value in docstring (scikit-learn#9962)
  [MRG+1] DOC fix sign in GBRT mathematical formulation (scikit-learn#9885)
  [MRG+1] DOC fix sign in GBRT mathematical formulation (scikit-learn#9885)
  DOC fix a typo (scikit-learn#9892)
  [MRG+1] Ledoit-Wolf behavior explanation (scikit-learn#9500)
  [MRG+1] Fix typos in documentation (scikit-learn#9878)
  DOC: Use setattr(self, ...) instead of self.setattr(...) (scikit-learn#9866)
  DOC Removed a duplicate occurrence of a word in 'sklearn.neighbors.KNeighborsRegressor' docs (scikit-learn#9862)
  FIX docstring of negative_outlier_factor_ in LOF (scikit-learn#9809)
  [MRG+1] Fix scikit-learn#9743: Adding parameter information to docstring. (scikit-learn#9757)
  DOC: fix docstring of Imputer.fit (scikit-learn#9769)
  various minor spelling tweaks (scikit-learn#9783)
  MAINT comment on apparent inconsistency
  [MRG+1] DOC fix headers level in cross_validation.rst (scikit-learn#9679)
  Fix mailmap format (scikit-learn#9620)
  DOC Fix typos (scikit-learn#9577)
  Typo (scikit-learn#9571)
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Oct 24, 2017
* releases: (117 commits)
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Oct 24, 2017
* dfsg: (117 commits)
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…etrics (scikit-learn#7373)
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
…etrics (scikit-learn#7373)
Development

Successfully merging this pull request may close these issues.

Problems accepting pandas.SparseSeries as target
7 participants