[MRG+2] Fix log loss bug #7239
Conversation
The logarithm used is the natural logarithm (base-e).
"""
lb = LabelBinarizer()
T = lb.fit_transform(y_true)
This block was moved to line 1620 in the diff. The variable T was renamed to transformed_labels and the variable Y was renamed to y_pred.
sklearn/metrics/classification.py
Outdated
else:
    lb.fit(y_true)

if labels is None and len(lb.classes_) == 1:
You could reorganize this into:

if len(lb.classes_) == 1:
    if labels is None:
        raise
    else:
        raise
Definitely, good catch
Could you also add a test for when there is more than one label in

Also, it is possible to move all the

Thanks for wrapping this up. Seems like changing the names of variables has given us the opportunity to do an unintentional cleanup ;)

The check for
sklearn/metrics/classification.py
Outdated
raise ValueError("Unable to automatically cast y_pred to "
                 "float. y_pred should be an array of floats.")
# sanity check
if y_pred.dtype != float:
So the correct way to do this is to pass a dtype argument to check_array that raises an error if it is unable to be cast to the provided dtype. But if I pass dtype=np.float32, it fails this test (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/tests/test_classification.py#L1392) because check_array raises an error with a MockDataFrame and dtype=np.float32.
For now, I would suggest either just removing these float checks (note that nothing useful was checked previously, and np.clip should take care of string dtypes) or figuring out what is going on with the MockDataFrame and fixing that (which is beyond the scope of this PR).
Wdyt?
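To illustrate the dtype-coercion behavior being discussed, here is a small sketch using check_array from sklearn.utils; the input values are made up for the example.

```python
import numpy as np
from sklearn.utils import check_array

# check_array coerces the input to the requested dtype...
y_pred = check_array([[0.1, 0.9]], dtype=np.float64)
print(y_pred.dtype)

# ...and raises a ValueError when the cast is impossible,
# which is what the manual float check above was approximating
try:
    check_array([["0.1", "oops"]], dtype=np.float64)
except ValueError as exc:
    print("not castable:", exc)
```

Passing dtype to check_array centralizes the validation instead of re-checking y_pred.dtype by hand after the fact.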
Seems reasonable. I agree that figuring out what's going on with MockDataFrame is beyond the scope of the PR. Do you want to raise an issue for that?
@nelson-liu That should be my last pass. LGTM pending comments.

@MechCoder addressed your comments, let me know if I got them right (in particular the non-regression test).
assert_almost_equal(loss, 1.0383217, decimal=6)

# case when len(np.unique(y_true)) != y_pred.shape[1]
y_true = [1,2,2]
Space this out, and space out labels two lines below.
Still PEP8.
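The non-regression case referenced above can be sketched like this; the probability values are illustrative, and the exact error message is whatever the merged code raises.

```python
from sklearn.metrics import log_loss

# y_true has 2 distinct labels but y_pred has 3 probability columns;
# after this fix, log_loss should raise instead of binarizing inconsistently
y_true = [1, 2, 2]
y_pred = [[0.2, 0.7, 0.1], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
try:
    log_loss(y_true, y_pred)
except ValueError as exc:
    print("raised:", exc)
```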
Thanks! You have my +1.
sklearn/metrics/classification.py
Outdated
Sample weights.
labels : array-like, optional (default=None)
    If not provided, labels will be inferred from y_true
full stop at the end.
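To show what the labels parameter documented above enables, here is a small usage sketch; the class names and probabilities are made up for the example. Column order follows the sorted labels ("no" first, "yes" second).

```python
from sklearn.metrics import log_loss

# y_true contains a single class; `labels` supplies the full class list
# so the probability columns can be interpreted
y_true = ["yes", "yes", "yes"]
y_pred = [[0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
loss = log_loss(y_true, y_pred, labels=["no", "yes"])
# mean of -log([0.8, 0.7, 0.6]) ≈ 0.3635
print(loss)
```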
Looks good apart from minor comments.

Rebase?

@amueller addressed your comments, can you look over them (in particular the docstrings clarifying assumptions made) to make sure I got what you wanted before I rebase and lose the history? I'll rebase after you verify it's ok.
Commits:
- fixes as per existing pull request scikit-learn#6714
- fixed log_loss bug
- enhance log_loss labels option
- feature log_loss
- changed test log_loss case u
- add ValueError in log_loss
- fixes as per existing pull request scikit-learn#6714
- fixed error message when y_pred and y_test labels don't match
- fixed error message when y_pred and y_test labels don't match
- corrected doc/whats_new.rst for syntax and with correct formatting of credits
- additional formatting fixes for doc/whats_new.rst
- fixed versionadded comment
- removed superfluous line
- removed superflous line
LGTM with additional sentence to docstring.
Commits:
- fix a typo in whatsnew
- refactor conditional and move dtype check before np.clip
- general cleanup of log_loss
- remove dtype checks
- edit non-regression test and wordings
- fix non-regression test
- misc doc fixes / clarifications + final touches
- fix naming of y_score2 variable
- specify log loss is only valid for 2 labels or more
Force-pushed from 4ced594 to d97a25f.
Squashed my commits and rebased. Do we want to merge all of the commits on this PR, seeing as there are 3 authors, instead of squashing?

One commit per author is fine.

Perfect, just how I squashed it. Assuming CI passes, this is ready for merge on my side; let me know if anything else is needed.

No, I think it's good to go.

@amueller merge?

Thanks!
Hey guys, I'm new to GitHub and coding and wondering: how do I use the fix that you guys seem to have created above? I am getting the same issue using log_loss:

Thanks for working on this!
@marconoe what version of scikit-learn are you using? This should be fixed in 0.18 and 0.18.1.

Hi @amueller, now I get a new error, but at least it's different so I can work on that. Cheers.
Upgrade scikit-learn (e.g. pip install -U scikit-learn) and check if it's still a problem.
On 9 December 2016 at 08:36, Marco wrote:

> hey guys im new to github and coding and wondering - how do i use the fix that you guys seem to have created above? I am getting the same issue using log_loss:
> ValueError: y_true and y_pred have different number of classes 2, 3
> thanks
> marco
Reference Issue
Original PRs at #6714 and #7166.
Fixes:
metrics.log_loss fails when any classes are missing in y_true #4033
Fix a bug, the result is wrong when use sklearn.metrics.log_loss with one class, #4546
Log_loss is calculated incorrectly when only 1 class present #6703
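The fixed behavior for these issues can be sketched as follows; the probability values are illustrative, and the exact error message may differ from what the merged code raises.

```python
from sklearn.metrics import log_loss

# Before the fix, a single-class y_true silently produced a wrong score;
# after the fix, log_loss raises unless `labels` is given
y_pred = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
try:
    log_loss([2, 2, 2], y_pred)
except ValueError as exc:
    print("raised:", exc)

# supplying the full label set makes the call well-defined
print(log_loss([2, 2, 2], y_pred, labels=[1, 2]))
```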
What does this implement/fix? Explain your changes.
This PR is a cherry-picked, rebased, and squashed version of #7166. I addressed the comments in there, namely by renaming the single-letter variables, adding another ValueError saying that labels should have more than one unique label if len(lb.classes_) == 1 and labels is not None, and removing a commented-out code block.

Any other comments?
@MechCoder @amueller anything else that needs to be done?