Conversation

@nelson-liu
Contributor

@nelson-liu nelson-liu commented Aug 24, 2016

Reference Issue

Original PRs at #6714 and #7166.
Fixes:
metrics.log_loss fails when any classes are missing in y_true #4033
Fix a bug, the result is wrong when use sklearn.metrics.log_loss with one class, #4546
Log_loss is calculated incorrectly when only 1 class present #6703

What does this implement/fix? Explain your changes.

This PR is a cherry-picked, rebased, and squashed version of #7166. I addressed the comments there, namely by renaming the single-letter variables, adding another ValueError saying that labels should have more than one unique label if len(lb.classes_) == 1 and labels is not None, and removing a commented-out code block.
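
For illustration, here is a rough usage sketch of the behaviour this PR targets (the exact error wording is not quoted; the snippet assumes a scikit-learn build that includes the new labels handling):

from sklearn.metrics import log_loss

y_true = [2, 2]                      # only one class present in y_true
y_pred = [[0.2, 0.8], [0.3, 0.7]]    # predicted probabilities for classes 1 and 2

# With labels omitted, the class set cannot be inferred from y_true alone and a
# ValueError asks for the true labels; with labels given, the score is computed.
loss = log_loss(y_true, y_pred, labels=[1, 2])
print(loss)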

Any other comments?

@MechCoder @amueller anything else that needs to be done?

The logarithm used is the natural logarithm (base-e).
"""
lb = LabelBinarizer()
T = lb.fit_transform(y_true)
Contributor Author


This block was moved to line 1620 in the diff. The variable T was renamed to transformed_labels, and the variable Y was renamed to y_pred.
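
For reference, a loose sketch of the renamed block, wrapped in a hypothetical helper for readability (not the exact merged lines):

from sklearn.preprocessing import LabelBinarizer

def _binarize_true_labels(y_true, labels=None):
    # Sketch only: `T` became `transformed_labels`. The binarizer fits on the
    # explicit labels when they are given, otherwise on y_true itself.
    lb = LabelBinarizer()
    if labels is not None:
        lb.fit(labels)
    else:
        lb.fit(y_true)
    transformed_labels = lb.transform(y_true)
    return transformed_labels, lb.classes_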

@nelson-liu nelson-liu changed the title Fix log loss bug [MRG] Fix log loss bug Aug 24, 2016
@MechCoder MechCoder added this to the 0.18 milestone Aug 24, 2016
else:
    lb.fit(y_true)

if labels is None and len(lb.classes_) == 1:
Member


You could reorganize this into:

if len(lb.classes_) == 1:
    if labels is None:
        raise
    else:
        raise
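
For illustration, the reorganized check could look roughly like this; the error messages below are placeholders, not necessarily the wording that was merged:

if len(lb.classes_) == 1:
    if labels is None:
        raise ValueError("y_true contains only one label ({0}). Please provide "
                         "the true labels explicitly through the labels "
                         "argument.".format(lb.classes_[0]))
    else:
        raise ValueError("The labels array needs to contain at least two labels "
                         "for log_loss, got {0}.".format(lb.classes_))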

Contributor Author


Definitely, good catch

@MechCoder
Member

Could you also add a test for the case where there is more than one label in y_true but len(np.unique(y_true)) != y_pred.shape[1], as a non-regression test for this (#4033) and as a sanity check?
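
A sketch of such a non-regression test (names and values here are illustrative, not necessarily what was merged):

from numpy.testing import assert_raises
from sklearn.metrics import log_loss

# y_true has two distinct labels, but y_pred has columns for three classes.
y_true = [1, 2, 2]
y_pred = [[0.2, 0.7, 0.1], [0.6, 0.2, 0.2], [0.6, 0.1, 0.3]]

# Without labels, the mismatch cannot be resolved, so a ValueError is expected.
assert_raises(ValueError, log_loss, y_true, y_pred)

# With the full label set provided, the loss can be computed.
loss = log_loss(y_true, y_pred, labels=[1, 2, 3])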

@MechCoder
Member

Also, it is possible to move all the check_array and check_consistent_length checks to the top of the function. It is not clear why all those checks are necessary. (For instance, the fit and transform in the LabelBinarizer should internally call check_array as well.)

@MechCoder
Member

Thanks for wrapping this up. Seems like changing the names of variables has given us the opportunity to do an unintentional cleanup ;)

@nelson-liu
Contributor Author

Also, it is possible to move all the check_array and check_consistent_length checks to the top of the function. It is not clear why all those checks are necessary. (For instance, the fit and transform in the LabelBinarizer should internally call check_array as well.)

The check for transformed_labels that I didn't move to the top seems necessary, considering that the error needs to be thrown at all (it isn't picked up by LabelBinarizer). I've addressed your comments above (I kept track of what I had finished with the 👍 emoji). If you can't find where I put a change, I'd be happy to point you to the associated place, as this diff is pretty big.

raise ValueError("Unable to automatically cast y_pred to "
"float. y_pred should be an array of floats.")
# sanity check
if y_pred.dtype != float:
Member


So the correct way to do this is to pass a dtype argument to check_array that raises an error if it is unable to be cast to the provided dtype. But if I pass dtype=np.float32, it fails this test (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/tests/test_classification.py#L1392) because check_array raises an error with a MockDataFrame and dtype=np.float32.

For now, I would suggest either just removing these float checks (note that nothing useful was checked previously, and np.clip should take care of string dtypes) or figuring out what is going on with the MockDataFrame and fixing that (which is beyond the scope of this PR).
Wdyt?
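
A small sketch of the two options being weighed (illustrative only, not taken from the diff):

import numpy as np
from sklearn.utils import check_array

y_pred = [[0.9, 0.1], [0.2, 0.8]]

# Option 1: let check_array enforce a float dtype; input that cannot be cast
# raises a ValueError inside check_array itself.
y_pred = check_array(y_pred, dtype=np.float64)

# Option 2: drop the manual float check and rely on the later clipping, which
# fails anyway for arrays that cannot be treated as numbers.
eps = 1e-15
y_pred = np.clip(y_pred, eps, 1 - eps)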

Contributor Author


Seems reasonable, I agree that figuring out what's going on with MockDataFrame is beyond the scope of the PR. Do you want to raise an issue for that?

@MechCoder
Member

@nelson-liu That should be my last pass. LGTM pending comments.

@nelson-liu
Contributor Author

nelson-liu commented Aug 25, 2016

@MechCoder addressed your comments, let me know if I got them right (in particular the non-regression test).

assert_almost_equal(loss, 1.0383217, decimal=6)

# case when len(np.unique(y_true)) != y_pred.shape[1]
y_true = [1,2,2]
Member

@MechCoder MechCoder Aug 25, 2016


Space this out, and space out labels two lines below.

Member


still pep8

@MechCoder
Member

Thanks! You have my +1.

    Sample weights.

labels : array-like, optional (default=None)
    If not provided, labels will be inferred from y_true
Member


full stop at the end.

@amueller
Member

looks good apart from minor comments.

@amueller
Member

rebase?

@nelson-liu
Contributor Author

@amueller addressed your comments, can you look over them (in particular the docstrings clarifying assumptions made) to make sure I got what you wanted before I rebase and lose the history? I'll rebase after you verify it's ok.

Harry040 and others added 2 commits August 25, 2016 09:52
enhance log_loss labels option feature

log_loss

changed test log_loss case

u

add ValueError in log_loss
fixes as per existing pull request scikit-learn#6714

fixed log_loss bug

enhance log_loss labels option feature

log_loss

changed test log_loss case

u

add ValueError in log_loss

fixes as per existing pull request scikit-learn#6714

fixed error message when y_pred and y_test labels don't match

fixed error message when y_pred and y_test labels don't match

corrected doc/whats_new.rst for syntax and with correct formatting of credits

additional formatting fixes for doc/whats_new.rst

fixed versionadded comment

removed superfluous line

removed superflous line
@amueller
Member

lgtm with additional sentence to docstring.

fix a typo in whatsnew

refactor conditional and move dtype check before np.clip

general cleanup of log_loss

remove dtype checks

edit non-regression test and wordings

fix non-regression test

misc doc fixes / clarifications + final touches

fix naming of y_score2 variable

specify log loss is only valid for 2 labels or more
@nelson-liu
Contributor Author

nelson-liu commented Aug 25, 2016

Squashed my commits and rebased. Do we want to merge all of the commits on this PR, seeing as there are 3 authors, instead of squashing?

@amueller
Member

one commit per author is fine.

@nelson-liu
Contributor Author

nelson-liu commented Aug 25, 2016

Perfect, just how I squashed it. Assuming CI passes, this is ready for merge on my side; let me know if anything else is needed.

@amueller
Member

no, I think it's good to go.

@nelson-liu nelson-liu changed the title [MRG+1] Fix log loss bug [MRG+2] Fix log loss bug Aug 25, 2016
@nelson-liu
Contributor Author

@amueller merge?

@MechCoder MechCoder merged commit 104e09a into scikit-learn:master Aug 25, 2016
@MechCoder
Member

Thanks!

@marconoe

marconoe commented Dec 8, 2016

Hey guys, I'm new to GitHub and coding and wondering - how do I use the fix that you seem to have created above? I am getting the same issue using log_loss:

ValueError: y_true and y_pred have different number of classes 2, 3

Thanks for working on this!
marco

@amueller
Member

amueller commented Dec 8, 2016

@marconoe what version of scikit-learn are you using? This should be fixed in 0.18 and 0.18.1.
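
A quick way to check which version is installed (illustrative):

import sklearn
print(sklearn.__version__)   # the labels handling discussed above shipped with 0.18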

@marconoe

marconoe commented Dec 9, 2016

Hi @amueller,
Thanks, that makes sense - I was using 0.17.1 and just updated to 0.18.1.

Now I get a new error, but at least it's different, so I can work on that.

cheers
marco

@jnothman
Member

jnothman commented Dec 9, 2016 via email
