-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] Raise error when SparseSeries is passed into classification metrics #7373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sklearn/metrics/ranking.py
Outdated
@@ -22,6 +22,7 @@ | |||
import warnings | |||
import numpy as np | |||
from scipy.sparse import csr_matrix | |||
import pandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for the PR. But AFAIR, it is a convention in general to import pandas and also put it in a try catch block ( like here ). I suppose this is necessary since Pandas is not a compulsory requirement for installing scikit-learn and this import would raise an error in case Pandas isn't installed. Other than that, LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, you can not import pandas AT ALL. What you refer to is a test, and you can't really do that same thing here. You can check the class name, though, if you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amueller sorry for my ignorance. Will keep it in mind from next time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maniteja123 don't sweat it, any help is appreciated :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I will be removing this line. Thanks.
This shouldn't just be in |
@jnothman, just for clarity, since Thanks |
…arget' function. Finally, add 'type_of_target' call to _binary_clf_curve
Update from Original
Update from Original
I've removed the |
you should add a test. I'm not sure if we have a mock for a sparse series but there are several pandas mocks that you could use. |
Just to make sure I understand correctly... I should add a new test to this code in sklean/utils/tests/test_multiclass.py, correct?
|
yeah that sounds like a reasonable place |
sklearn/metrics/ranking.py
Outdated
@@ -294,6 +294,11 @@ def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None): | |||
thresholds : array, shape = [n_thresholds] | |||
Decreasing score values. | |||
""" | |||
# Check to make sure y_true is valid | |||
y_type = type_of_target(y_true) | |||
if y_type != "binary": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in some of the other testing I'm getting the error: ValueError: multiclass format is not supported
. I was under the impression that the _binary_clf_curve
required 'binary' data. Should it also be allowed to accept 'multiclass' data?
from pandas import SparseSeries | ||
except ImportError: | ||
pass | ||
y = SparseSeries([1, 0, 0, 1, 0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm seeing the error in the automatic checks that states: UnboundLocalError: local variable 'SparseSeries' referenced before assignment
Do I need to put all of the test code within the try block?
…l_curve_pos_label since as multiclass it doesn't make sense
…inary_clf_curve to test new logic in _binary_clf_curve function
Okay. I think it should be good now. I understand the supposed issue with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, this LGTM.
msg = "y cannot be class 'SparseSeries'." | ||
assert_raises_regex(ValueError, msg, type_of_target, y) | ||
except ImportError: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please wrap only the import statement and use raise SkipTest("Pandas not found")
as elsewhere
|
||
y = SparseSeries([1, 0, 0, 1, 0]) | ||
msg = "y cannot be class 'SparseSeries'." | ||
assert_raises_regex(ValueError, msg, type_of_target, y) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per @jnothman 's request, I've only wrapped the import and added raise SkipTest("Pandas not found")
otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
y_type = type_of_target(y_true) | ||
if not (y_type == "binary" or | ||
(y_type == "multiclass" and pos_label is not None)): | ||
raise ValueError("{0} format is not supported".format(y_type)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a nitpick, but it would help the user to give a different error message if y_type == "multiclass" and pos_label is None.
Beside, I am surprised, but it is really the case that in multiclass settings we require the pos_label not to be specified? I would have though the opposite. Is there an error in the condition above, or in my assumptions on our code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @jnothman pointed out: the code is correct, I was confused by the double negation.
Still, a different error message would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, a different error message would help.
Agreed. The error message template I generally try to follow is something like:
Allowed values for parameter_name are ['value1', 'value2', 'value3']. Instead you provided 'parameter_name={parameter_value}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lesteve, my choice of error message was copy pasted from other portions of this code. I chose the language to be consistent with the other instances of similar errors in ranking.py
.
Hmmm I am a bit confused on this one. I commented on the issue, see #7352 (comment). |
Gael i think the confusion is yours in reading a double negation
…On 1 Jun 2017 9:59 pm, "Loïc Estève" ***@***.***> wrote:
Hmmm I am a bit confused on this one. I commented on the issue, see #7352
(comment)
<#7352 (comment)>
.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7373 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz62XaeqMndbeD52eOJWf6AAP9osh8ks5r_qejgaJpZM4J4jvQ>
.
|
@@ -234,6 +234,10 @@ def type_of_target(y): | |||
raise ValueError('Expected array-like (array or non-string sequence), ' | |||
'got %r' % y) | |||
|
|||
sparseseries = (y.__class__.__name__ == 'SparseSeries') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just testing the name of the class is a bit dodgy, I think it would be better to use an isinstance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I've gone back to using the name of the class, per prior comments from @amueller on commit d21c7e38674388f97e146aef67f42bef2fe5d2d2, pandas
should not be imported at all except in test.
You could be more specific, if that's a concern, by checking the type's
module name starts with 'pandas.'. you could also avoid __class__ with
type() etc. You could check the object's mro to see if it is a subclass of
SparseSeries.
, as I've said, I think this is fine
…On 2 Jun 2017 7:31 am, "nielsenmarkus11" ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In sklearn/utils/multiclass.py
<#7373 (comment)>
:
> @@ -234,6 +234,10 @@ def type_of_target(y):
raise ValueError('Expected array-like (array or non-string sequence), '
'got %r' % y)
+ sparseseries = (y.__class__.__name__ == 'SparseSeries')
So, I've gone back to using the name of the class, per prior comments from
@amueller <https://github.com/amueller> on commit
d21c7e3
<d21c7e3>.
Per his comments pandas should not be imported at all except in test.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7373 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz67_7fTexHGOhdpageYjs3sFNq_shks5r_y2ZgaJpZM4J4jvQ>
.
|
Another review here? |
Ping @amueller |
LGTM. Merging given @jnothman 's +1 |
Thanks! |
Thank you! |
…etrics (scikit-learn#7373) * Raise error when SparseSeries is passed into roc_curve * Changed "y_true" in second if block to "y_score" * Remove code to import pandas and add sparseseries check to 'type_of_target' function. Finally, add 'type_of_target' call to _binary_clf_curve * Remove pandas import and old comparison in roc_curve. * Add test for 'type_of_target' function * Add white space after commas * Correct other white space issues * Move type_of_target test into try clause, remove test_precision_recall_curve_pos_label since as multiclass it doesn't make sense * Add test_precision_recall_curve_pos_label back in and also add test_binary_clf_curve to test new logic in _binary_clf_curve function * Correct syntax and formatting. * Remove trailing white space * Correct validation logic * Update test_multiclass.py per @jnothman 's request. * Import SkipTest function. * Remove extra white space from line 303
* tag '0.19.1': (117 commits) TST Improve SelectFromModel tests (scikit-learn#9733) Name in what's new [MRG+1] Raise error when SparseSeries is passed into classification metrics (scikit-learn#7373) Fix LogisticRegressionCV default solver value in docstring (scikit-learn#9962) [MRG+1] DOC fix sign in GBRT mathematical formulation (scikit-learn#9885) [MRG+1] DOC fix sign in GBRT mathematical formulation (scikit-learn#9885) DOC fix a typo (scikit-learn#9892) [MRG+1] Ledoit-Wolf behavior explanation (scikit-learn#9500) [MRG+1] Fix typos in documentation (scikit-learn#9878) DOC: Use setattr(self, ...) instead of self.setattr(...) (scikit-learn#9866) DOC Removed a duplicate occurrence of a word in 'sklearn.neighbors.KNeighborsRegressor' docs (scikit-learn#9862) FIX docstring of negative_outlier_factor_ in LOF (scikit-learn#9809) [MRG+1] Fix scikit-learn#9743: Adding parameter information to docstring. (scikit-learn#9757) DOC: fix docstring of Imputer.fit (scikit-learn#9769) various minor spelling tweaks (scikit-learn#9783) MAINT comment on apparent inconsistency [MRG+1] DOC fix headers level in cross_validation.rst (scikit-learn#9679) Fix mailmap format (scikit-learn#9620) DOC Fix typos (scikit-learn#9577) Typo (scikit-learn#9571) ...
* releases: (117 commits) TST Improve SelectFromModel tests (scikit-learn#9733) Name in what's new [MRG+1] Raise error when SparseSeries is passed into classification metrics (scikit-learn#7373) Fix LogisticRegressionCV default solver value in docstring (scikit-learn#9962) [MRG+1] DOC fix sign in GBRT mathematical formulation (scikit-learn#9885) [MRG+1] DOC fix sign in GBRT mathematical formulation (scikit-learn#9885) DOC fix a typo (scikit-learn#9892) [MRG+1] Ledoit-Wolf behavior explanation (scikit-learn#9500) [MRG+1] Fix typos in documentation (scikit-learn#9878) DOC: Use setattr(self, ...) instead of self.setattr(...) (scikit-learn#9866) DOC Removed a duplicate occurrence of a word in 'sklearn.neighbors.KNeighborsRegressor' docs (scikit-learn#9862) FIX docstring of negative_outlier_factor_ in LOF (scikit-learn#9809) [MRG+1] Fix scikit-learn#9743: Adding parameter information to docstring. (scikit-learn#9757) DOC: fix docstring of Imputer.fit (scikit-learn#9769) various minor spelling tweaks (scikit-learn#9783) MAINT comment on apparent inconsistency [MRG+1] DOC fix headers level in cross_validation.rst (scikit-learn#9679) Fix mailmap format (scikit-learn#9620) DOC Fix typos (scikit-learn#9577) Typo (scikit-learn#9571) ...
* dfsg: (117 commits) TST Improve SelectFromModel tests (scikit-learn#9733) Name in what's new [MRG+1] Raise error when SparseSeries is passed into classification metrics (scikit-learn#7373) Fix LogisticRegressionCV default solver value in docstring (scikit-learn#9962) [MRG+1] DOC fix sign in GBRT mathematical formulation (scikit-learn#9885) [MRG+1] DOC fix sign in GBRT mathematical formulation (scikit-learn#9885) DOC fix a typo (scikit-learn#9892) [MRG+1] Ledoit-Wolf behavior explanation (scikit-learn#9500) [MRG+1] Fix typos in documentation (scikit-learn#9878) DOC: Use setattr(self, ...) instead of self.setattr(...) (scikit-learn#9866) DOC Removed a duplicate occurrence of a word in 'sklearn.neighbors.KNeighborsRegressor' docs (scikit-learn#9862) FIX docstring of negative_outlier_factor_ in LOF (scikit-learn#9809) [MRG+1] Fix scikit-learn#9743: Adding parameter information to docstring. (scikit-learn#9757) DOC: fix docstring of Imputer.fit (scikit-learn#9769) various minor spelling tweaks (scikit-learn#9783) MAINT comment on apparent inconsistency [MRG+1] DOC fix headers level in cross_validation.rst (scikit-learn#9679) Fix mailmap format (scikit-learn#9620) DOC Fix typos (scikit-learn#9577) Typo (scikit-learn#9571) ...
…etrics (scikit-learn#7373) * Raise error when SparseSeries is passed into roc_curve * Changed "y_true" in second if block to "y_score" * Remove code to import pandas and add sparseseries check to 'type_of_target' function. Finally, add 'type_of_target' call to _binary_clf_curve * Remove pandas import and old comparison in roc_curve. * Add test for 'type_of_target' function * Add white space after commas * Correct other white space issues * Move type_of_target test into try clause, remove test_precision_recall_curve_pos_label since as multiclass it doesn't make sense * Add test_precision_recall_curve_pos_label back in and also add test_binary_clf_curve to test new logic in _binary_clf_curve function * Correct syntax and formatting. * Remove trailing white space * Correct validation logic * Update test_multiclass.py per @jnothman 's request. * Import SkipTest function. * Remove extra white space from line 303
…etrics (scikit-learn#7373) * Raise error when SparseSeries is passed into roc_curve * Changed "y_true" in second if block to "y_score" * Remove code to import pandas and add sparseseries check to 'type_of_target' function. Finally, add 'type_of_target' call to _binary_clf_curve * Remove pandas import and old comparison in roc_curve. * Add test for 'type_of_target' function * Add white space after commas * Correct other white space issues * Move type_of_target test into try clause, remove test_precision_recall_curve_pos_label since as multiclass it doesn't make sense * Add test_precision_recall_curve_pos_label back in and also add test_binary_clf_curve to test new logic in _binary_clf_curve function * Correct syntax and formatting. * Remove trailing white space * Correct validation logic * Update test_multiclass.py per @jnothman 's request. * Import SkipTest function. * Remove extra white space from line 303
Reference Issue
Fixes #7352
What does this implement/fix? Explain your changes.
This change raises an error when the type is a pandas SparseSeries of either the
y_true
ory_score
input.Any other comments?