[MRG+1] Implements Multiclass hinge loss #3607
Conversation
I apologise for the failing test. Can anyone please help me with this error? I will include the complete traceback: Traceback (most recent call last):
Are the variables hinge_loss_by_hand_part1, hinge_loss_by_hand_part2 and hinge_loss_by_hand_part3 references to primitives or numpy arrays? Based on a quick look at the test, hinge_loss_by_hand_part* refer to floats and thus the numpy method is giving you an error. If I understand correctly, numpy.mean (or np.mean) accepts an array as an argument. (See the docs at http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html).
and see if the test passes?
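To illustrate the reviewer's point, here is an assumed reconstruction of the failure (the actual traceback is not shown above, and the variable values are invented): calling a NumPy method on a plain Python float fails, while passing the parts to np.mean as a list works.

```python
import numpy as np

# Assumed reconstruction: hinge_loss_by_hand_part1 etc. refer to plain
# Python floats, so calling a NumPy method on one of them fails.
part1, part2, part3 = 0.75, 1.25, 1.0
# part1.mean()  # AttributeError: 'float' object has no attribute 'mean'

# np.mean accepts an array-like, so collect the parts into a list first.
result = np.mean([part1, part2, part3])
print(result)
```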
margins_array = []
if label_vector is None:
    raise ValueError("label_vector "
                     "required in multilabel classification")
Sorry for being dumb, but what exactly is label_vector? Isn't it simply np.unique(y)?
Also we need a multi_class option that calculates the ovr or crammer-singer hinge loss as @mblondel has suggested.
+1 for @Winterflower's suggestion. But I'm not sure the code even goes there since you have ravelled
Can anyone please review the new commit? I have tried to address the comments.
@SaurabhJha I'm trying to have a look now.
pred_decision = column_or_1d(pred_decision)
pred_decision = np.array(pred_decision)
lb = LabelEncoder()
encoded_labels = lb.fit(y_true)
You need not store the value of lb.fit(y_true) since it returns lb. You can simply do

lb = LabelEncoder()
lb.fit(y_true)

and use lb.classes_ below.
or even better
le = LabelEncoder()
y_bin = le.fit_transform(y)
so that it can be used later on.
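A short sketch of the difference between the two suggestions above (the label values here are illustrative, not from the PR's tests):

```python
from sklearn.preprocessing import LabelEncoder

y = ['red', 'blue', 'red', 'green']

# First suggestion: fit() returns the encoder itself, so storing its
# return value just re-stores the encoder.
lb = LabelEncoder()
lb.fit(y)
print(lb.classes_)       # sorted unique labels: ['blue' 'green' 'red']

# Second suggestion: fit_transform both fits and encodes in one call,
# giving integer codes that index into le.classes_.
le = LabelEncoder()
y_bin = le.fit_transform(y)
print(y_bin)             # [2 0 2 1]
```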
@SaurabhJha Please update your code according to this vectorized implementation. After that we can look at further clean ups. #3607 (comment)
Do you think we need the optional parameter "label_vector"? I tried but cannot think of any other way to do it.
No, I do not think so. I have done it without the label_vector; you can just use the transformed classes of the LabelEncoder and probably raise an error
Does this seem ok to you?
I think I got the issue. For a four-label problem, a typical pred_decision row looks like this: with classes_ = [0, 1, 2, 3], a row of pred_decision is [1.27272363, 0.0342046, -0.68379465, -1.40169096]. The true label that you pass is matched to its position among these classes, so you need all the labels to match each decision value to its respective label. In fact, your first test will fail if len(classes_) != pred_decision.shape[1]. What do you think? Please correct me if I am wrong.
Just in case I am wrong, @arjoly can you please verify that this comment seems to be correct, #3607 (comment), or can you think of a better way?
labels are provided in multiclass metrics:
@jnothman Thanks for the clarification! |
I was thinking that we were not going to do
pred_decision = column_or_1d(pred_decision)
pred_decision = np.array(pred_decision)
le = LabelEncoder()
le.fit(y_true)
The label encoding needs to account for labels not included in y_true, if provided in label_vector (which should be called labels).
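A sketch of the case the reviewer describes, under the assumption that a labels argument lists every class with a column in pred_decision (names illustrative): fitting the encoder on labels rather than y_true keeps indices aligned even when some classes never occur in y_true.

```python
from sklearn.preprocessing import LabelEncoder

labels = [0, 1, 2, 3]     # all classes that pred_decision has columns for
y_true = [0, 2, 2]        # classes 1 and 3 never appear in y_true

le = LabelEncoder()
le.fit(labels)            # fit on the full label set, not just y_true
codes = le.transform(y_true)
print(codes)              # indices still match the four columns
```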
@SaurabhJha, I don't think this code is correct. Could you please describe the algorithm in words (pseudocode) for clarification? Then we can help you get the code to match its description.
@jnothman The idea is that for each sample, we compute 1 - the pred decision corresponding to the sample's true class + max(pred decision over the other classes), then take the mean across all samples. I had come up with this version.
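A minimal NumPy sketch of that description (the function name and signature are hypothetical, not the PR's final code; negative losses are clipped at zero as in the usual hinge formulation):

```python
import numpy as np

def multiclass_hinge_loss(y_true, pred_decision, labels=None):
    """Per sample: max(0, 1 - d[true class] + max over other classes of d),
    averaged across samples. Illustrative sketch only."""
    pred_decision = np.asarray(pred_decision, dtype=float)
    classes = np.unique(y_true) if labels is None else np.asarray(labels)
    class_index = {c: i for i, c in enumerate(classes)}
    n_samples = pred_decision.shape[0]
    true_idx = np.array([class_index[c] for c in y_true])

    # Decision value for each sample's true class.
    true_scores = pred_decision[np.arange(n_samples), true_idx]

    # Mask out the true class, then take the max over the remaining classes.
    masked = pred_decision.copy()
    masked[np.arange(n_samples), true_idx] = -np.inf
    other_max = masked.max(axis=1)

    losses = np.clip(1 - true_scores + other_max, 0, None)
    return losses.mean()
```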
raise TypeError("pred_decision should be an array of floats.")
y_true_unique = np.unique(y_true)
if np.size(y_true_unique) > 2:
    if (labels is None and len(pred_decision.shape) > 1 and
At present this will fail if pred_decision is a list (I think). You need to do check_array(pred_decision, ensure_2d=False) somewhere, and replace len(pred_decision.shape) with pred_decision.ndim.
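A small illustration of the concern (np.asarray is used here as a stand-in; the actual fix would use scikit-learn's check_array(pred_decision, ensure_2d=False), which also validates dtype and finiteness):

```python
import numpy as np

# A plain Python list has no .shape attribute, so
# len(pred_decision.shape) would raise AttributeError on it.
pred_decision = [[1.3, 0.0, -0.7], [-0.3, 0.9, 0.1]]

# Converting to an ndarray first avoids the failure; .ndim is then the
# idiomatic way to ask for the number of dimensions.
pred_decision = np.asarray(pred_decision)
print(pred_decision.ndim)  # 2
```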
Also please write a test for this. You can replace one of the pred_decision values that is a numpy array in one of your tests with a list.
Please correct me if I am wrong. So your concern is the calling of shape on pred_decision when it's not a numpy array?
yes.
That's it from my side.
I wonder if I should squash the upcoming commit also into the previous commit. I think I should.
you could just do
hinge_loss, y_true, pred_decision)

def test_hinge_loss_multiclass_reptition_of_labels_with_missing_labels():
reptition -> repetition
Actually, don't mention repetition. Repetition should be what's normal in multiclass evaluation, and is included in all tests.
In case @MechCoder threatens to merge again, I'm happy with this PR (except for the test name issue), but we should have better documentation of how it fits into the invariance tests, and should probably explicitly exclude it from the current
That was a really "unethical" way of getting your attention, but hey it worked ;)
Force-pushed 76fae17 to eaaf80f
# Currently, invariance of string and integer labels cannot be tested
# in common invariance tests because invariance tests for multiclass
# decision functions are not implemented yet.
y_true = ['blue', 'green', 'red',
@MechCoder Please see if this message is what was intended?
@jnothman I cherry-picked Saurabh's commit, made a few minor cosmetic changes, and I want to push it directly from the command line. Sorry, but how do I do it? I do not want to screw anything up.
I did this. Do I just do
Okay, my git squashing/merging workflow in brief. I do something like:

$ git checkout master
$ git pull upstream master
$ git fetch upstream              # fetches all PRs
$ git checkout upstream/pr/3607
$ git squash master               # uses git rebase -i; from https://github.com/jnothman/git-squash
$ git checkout -                  # back to master
$ git merge -                     # cherry-picks in squashed commit
$ git push upstream master

(And in practice this looks like the above, with some bits obviously still needing an alias.)
I think I made a mistake before: I removed your latest commit from master (2f275de). I think I somehow replaced origin with upstream somewhere while typing. Extremely sorry :/
Surely you can't have done that without a force push. Never ever force push.
Anyway, I've restored the head to what it was recently.
Sorry. I've learnt from my mistakes. Now trying out this comment (#3607 (comment)).
@SaurabhJha There were some minor typos in the docs and I committed the fix as 8eee4bc after removing it from the
@jnothman Thanks for your tips! I think I did it correctly. On a side note, this (https://twitter.com/heathercmiller/status/526770571728531456) ;)
Thank you @jnothman @MechCoder for your review. Good to see my first contribution :-)
Implements multiclass hinge loss. Fixes #3451