
[MRG+1] Implements Multiclass hinge loss #3607


Closed · wants to merge 1 commit
94 changes: 62 additions & 32 deletions doc/modules/model_evaluation.rst
@@ -48,10 +48,10 @@ Common cases: predefined values

For the most common use cases, you can designate a scorer object with the
``scoring`` parameter; the table below shows all possible values.
All scorer objects follow the convention that higher return values are better
than lower return values. Thus the returns from mean_absolute_error
and mean_squared_error, which measure the distance between the model
and the data, are negated.
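For instance, a scorer built from a loss yields non-positive values in cross
validation; a minimal sketch, assuming the ``'mean_squared_error'`` scoring
string (output omitted, as it depends on the data)::

>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.linear_model import LinearRegression
>>> X = [[0.], [1.], [2.], [3.]]
>>> y = [0., 1., 2., 3.1]
>>> # the mean squared errors come back negated, so higher is better
>>> cross_val_score(LinearRegression(), X, y, scoring='mean_squared_error')  # doctest: +SKIP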


====================== ======================================= ==================================
@@ -60,7 +60,7 @@ Scoring Function
**Classification**
'accuracy' :func:`metrics.accuracy_score`
'average_precision' :func:`metrics.average_precision_score`
'f1' :func:`metrics.f1_score`
'log_loss' :func:`metrics.log_loss` requires ``predict_proba`` support
'precision' :func:`metrics.precision_score`
'recall' :func:`metrics.recall_score`
@@ -91,10 +91,10 @@ Usage examples:

.. note::

The values listed by the ValueError exception correspond to the functions measuring
prediction accuracy described in the following sections.
The scorer objects for those functions are stored in the dictionary
``sklearn.metrics.SCORERS``.
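A minimal sketch of inspecting that dictionary (the exact set of keys depends
on the scikit-learn version)::

>>> from sklearn.metrics import SCORERS
>>> sorted(SCORERS.keys())  # doctest: +SKIP
['accuracy', 'average_precision', 'f1', ...]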

.. currentmodule:: sklearn.metrics

@@ -112,8 +112,8 @@ measuring a prediction error given ground truth and prediction:
- functions ending with ``_error`` or ``_loss`` return a
value to minimize, the lower the better. When converting
into a scorer object using :func:`make_scorer`, set
the ``greater_is_better`` parameter to False (True by default; see the
parameter description below).

Metrics available for various machine learning tasks are detailed in sections
below.
@@ -136,33 +136,33 @@ the :func:`fbeta_score` function::
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)

The second use case is to build a completely custom scorer object
from a simple python function using :func:`make_scorer`, which can
take several parameters:

* the python function you want to use (``my_custom_loss_func``
in the example below)

* whether the python function returns a score (``greater_is_better=True``,
the default) or a loss (``greater_is_better=False``). If a loss, the output
of the python function is negated by the scorer object, conforming to
the cross validation convention that scorers return higher values for better models.

* for classification metrics only: whether the python function you provided requires continuous decision
certainties (``needs_threshold=True``). The default value is
False.

* any additional parameters, such as ``beta`` in :func:`fbeta_score`.

Here is an example of building custom scorers, and of using the
``greater_is_better`` parameter::

>>> import numpy as np
>>> def my_custom_loss_func(ground_truth, predictions):
... diff = np.abs(ground_truth - predictions).max()
... return np.log(1 + diff)
...
>>> # the scorer object `loss` will negate the return value of my_custom_loss_func,
>>> # which will be np.log(2), 0.693, given the values for ground_truth
>>> # and predictions defined below.
>>> loss = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> score = make_scorer(my_custom_loss_func, greater_is_better=True)
@@ -175,7 +175,7 @@ Here is an example of building custom scorers, and of using the
-0.69...
>>> score(clf, ground_truth, predictions) # doctest: +ELLIPSIS
0.69...


.. _diy_scoring:

@@ -193,7 +193,7 @@ the following two rules:

- It returns a floating point number that quantifies the
``estimator`` prediction quality on ``X``, with reference to ``y``.
Again, by convention higher numbers are better, so if your scorer
returns loss, that value should be negated.
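As an illustration, a hand-rolled scorer obeying both rules might look like
the following (a minimal sketch; ``my_scorer`` is a hypothetical name, not
part of scikit-learn)::

>>> import numpy as np
>>> def my_scorer(estimator, X, y):
...     # called as (estimator, X, y), per the first rule
...     pred = estimator.predict(X)
...     # the mean absolute error is a loss, so it is negated to make
...     # higher values better, per the second rule
...     return -np.mean(np.abs(np.asarray(y) - pred))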


@@ -214,7 +214,6 @@ Some of these are restricted to the binary classification case:
.. autosummary::
:template: function.rst

matthews_corrcoef
precision_recall_curve
roc_curve
@@ -226,6 +225,7 @@ Others also work in the multiclass case:
:template: function.rst

confusion_matrix
hinge_loss


Some also work in the multilabel case:
@@ -307,7 +307,7 @@ The :func:`confusion_matrix` function evaluates
classification accuracy by computing the `confusion matrix
<http://en.wikipedia.org/wiki/Confusion_matrix>`_.

By definition, entry :math:`i, j` in a confusion matrix is
the number of observations actually in group :math:`i`, but
predicted to be in group :math:`j`. Here is an example::

@@ -330,7 +330,7 @@ from the :ref:`example_model_selection_plot_confusion_matrix.py` example):
.. topic:: Example:

* See :ref:`example_model_selection_plot_confusion_matrix.py`
for an example of using a confusion matrix to evaluate classifier output
quality.

* See :ref:`example_classification_plot_digits_classification.py`
@@ -661,11 +661,11 @@ Then the metrics are defined as:
(array([ 0.66..., 0. , 0. ]), array([ 1., 0., 0.]), array([ 0.71..., 0. , 0. ]), array([2, 2, 2]...))


Hinge loss
----------

The :func:`hinge_loss` function computes the average distance between
the model and the data using
`hinge loss <http://en.wikipedia.org/wiki/Hinge_loss>`_, a one-sided metric
that considers only prediction errors. (Hinge
loss is used in maximal margin classifiers such as support vector machines.)
@@ -678,8 +678,22 @@ value, and :math:`w` is the predicted decisions as output by

L_\text{Hinge}(y, w) = \max\left\{1 - wy, 0\right\} = \left|1 - wy\right|_+
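Spelled out in NumPy, the formula above amounts to the following (a minimal
sketch with illustrative labels and decision values)::

>>> import numpy as np
>>> y = np.array([-1, 1, 1])
>>> w = np.array([-2.18, 2.36, 0.09])  # illustrative decision values
>>> # the two confident, correct predictions contribute 0; the last one
>>> # lies inside the margin and contributes 1 - 0.09 = 0.91
>>> np.mean(np.maximum(1 - y * w, 0))  # doctest: +ELLIPSIS
0.30...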

If there are more than two labels, :func:`hinge_loss` uses a multiclass variant
due to Crammer & Singer.
[Review comments]
Member: due to -> by?
Contributor Author: I thought this is how you do it in scientific references. Does it matter much?
Member: ok

`Here <http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf>`_ is
the paper describing it.

If :math:`y_w` is the predicted decision for the true label and :math:`y_t` is
the maximum of the predicted decisions for all other labels, where predicted
decisions are output by the decision function, then the multiclass hinge loss
is defined by:

.. math::

L_\text{Hinge}(y_w, y_t) = \max\left\{1 + y_t - y_w, 0\right\}
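In NumPy terms, for a single sample this reads (a minimal sketch, assuming
``decisions`` holds the decision values for all classes and ``true_idx`` is
the index of the true label)::

>>> import numpy as np
>>> decisions = np.array([0.5, 2.0, -0.3])  # illustrative decision values
>>> true_idx = 1
>>> y_w = decisions[true_idx]                     # decision for the true label
>>> y_t = np.max(np.delete(decisions, true_idx))  # best of the other labels
>>> max(1 + y_t - y_w, 0)
0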

Here is a small example demonstrating the use of the :func:`hinge_loss` function
with an svm classifier in a binary class problem::

>>> from sklearn import svm
>>> from sklearn.metrics import hinge_loss
@@ -696,6 +710,22 @@ with a svm classifier::
>>> hinge_loss([-1, 1, 1], pred_decision) # doctest: +ELLIPSIS
0.3...

Here is an example demonstrating the use of the :func:`hinge_loss` function
with an svm classifier in a multiclass problem::

>>> X = np.array([[0], [1], [2], [3]])
>>> Y = np.array([0, 1, 2, 3])
>>> labels = np.array([0, 1, 2, 3])
>>> est = svm.LinearSVC()
>>> est.fit(X, Y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='l2', max_iter=1000, multi_class='ovr',
penalty='l2', random_state=None, tol=0.0001, verbose=0)
>>> pred_decision = est.decision_function([[-1], [2], [3]])
>>> y_true = [0, 2, 3]
>>> hinge_loss(y_true, pred_decision, labels) # doctest: +ELLIPSIS
0.56...
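Note that ``labels`` is needed above because ``y_true`` does not contain label
``1``. Judging from the implementation, omitting it would raise an error (a
sketch, not run here)::

>>> hinge_loss(y_true, pred_decision)  # doctest: +SKIP
Traceback (most recent call last):
    ...
ValueError: Please include all labels in y_true or pass labels as third argument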


Log loss
--------
@@ -752,7 +782,7 @@ sample has label 0. The log loss is non-negative.
Matthews correlation coefficient
---------------------------------

The :func:`matthews_corrcoef` function computes the
`Matthews correlation coefficient (MCC) <http://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_
for binary classes. Quoting Wikipedia:

@@ -788,7 +818,7 @@ function:
Receiver operating characteristic (ROC)
---------------------------------------

The function :func:`roc_curve` computes the
`receiver operating characteristic curve, or ROC curve <http://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_.
Quoting Wikipedia:

6 changes: 6 additions & 0 deletions doc/whats_new.rst
@@ -64,6 +64,12 @@ Enhancements
to `Rohit Sivaprasad`_), as well as evaluation metrics (by
`Joel Nothman`_).

- Add ``sample_weight`` parameter to `metrics.jaccard_similarity_score`.
By `Jatin Shah`_.
[Review comments]
Member: indent this back to where it was, please.
Contributor Author: This new indentation is more in line with the whole file. Can you have a look at the source, please?
Member: The new indentation is correct. (The indentation of the whole file is a bit awkward, IMO)


- Add support for multiclass in `metrics.hinge_loss`. Added ``labels=None``
as optional parameter. By `Saurabh Jha`_.

- Add ``multi_class="multinomial"`` option in
:class:`linear_model.LogisticRegression` to implement a Logistic
Regression solver that minimizes the cross-entropy or multinomial loss
97 changes: 73 additions & 24 deletions sklearn/metrics/classification.py
@@ -16,6 +16,7 @@
# Joel Nothman <[email protected]>
# Noel Dawe <[email protected]>
# Jatin Shah <[email protected]>
# Saurabh Jha <[email protected]>
# License: BSD 3 clause

from __future__ import division
@@ -1376,14 +1377,20 @@ def log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None):
return _weighted_sum(loss, sample_weight, normalize)


def hinge_loss(y_true, pred_decision, labels=None):
"""Average hinge loss (non-regularized)
[Review comments]
Member: labels is undocumented.
Contributor Author: Isn't it a convention to not document optional parameters? For example, pos_label and neg_label are not documented here.
Member: I do not know why pos_label and neg_label are undocumented here. But all parameters should be documented. For example, in sklearn.linear_model.lasso_path almost all parameters are documented.
Member: There is no such convention, whereas pos_label and neg_label should not exist here.
Member: "pos_label and neg_label should not exist here." Yes! I did not notice that.
Contributor Author: Should I remove them?
Member: git blame points me to @arjoly. Maybe he can confirm.
Contributor Author: Okay, I will leave this as it is for now.
Member: It seems pos_label and neg_label should have been removed in deprecation, but were not. See 6736a21. Please remove them.

In the binary case, assuming labels in y_true are encoded with +1 and -1,
when a prediction mistake is made, ``margin = y_true * pred_decision`` is
always negative (since the signs disagree), implying ``1 - margin`` is
always greater than 1. The cumulated hinge loss is therefore an upper
bound of the number of mistakes made by the classifier.

In the multiclass case, the function expects that either all the labels are
included in y_true or an optional labels argument is provided which
contains all the labels. The multiclass margin is calculated according
to Crammer-Singer's method. As in the binary case, the cumulated hinge loss
is an upper bound of the number of mistakes made by the classifier.

[Review comment]
Member: Some description like this belongs in model_evaluation.rst

Parameters
----------
@@ -1394,6 +1401,9 @@ def hinge_loss(y_true, pred_decision, pos_label=None, neg_label=None):
pred_decision : array, shape = [n_samples] or [n_samples, n_classes]
Predicted decisions, as output by decision_function (floats).

labels : array, optional, default None
Contains all the labels for the problem. Used in multiclass hinge loss.

Returns
-------
loss : float
@@ -1403,6 +1413,16 @@ def hinge_loss(y_true, pred_decision, pos_label=None, neg_label=None):
.. [1] `Wikipedia entry on the Hinge loss
<http://en.wikipedia.org/wiki/Hinge_loss>`_

.. [2] Koby Crammer, Yoram Singer. On the Algorithmic
Implementation of Multiclass Kernel-based Vector
Machines. Journal of Machine Learning Research 2,
(2001), 265-292

.. [3] `L1 AND L2 Regularization for Multiclass Hinge Loss Models
by Robert C. Moore, John DeNero.
<http://www.ttic.edu/sigml/symposium2011/papers/Moore+DeNero_Regularization.pdf>`_

Examples
--------
>>> from sklearn import svm
@@ -1420,27 +1440,56 @@ def hinge_loss(y_true, pred_decision, pos_label=None, neg_label=None):
>>> hinge_loss([-1, 1, 1], pred_decision) # doctest: +ELLIPSIS
0.30...

In the multiclass case:
>>> X = np.array([[0], [1], [2], [3]])
[Review comment]
Member: nitpick: space after #
>>> Y = np.array([0, 1, 2, 3])
>>> labels = np.array([0, 1, 2, 3])
>>> est = svm.LinearSVC()
>>> est.fit(X, Y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='l2', max_iter=1000, multi_class='ovr',
penalty='l2', random_state=None, tol=0.0001, verbose=0)
>>> pred_decision = est.decision_function([[-1], [2], [3]])
>>> y_true = [0, 2, 3]
>>> hinge_loss(y_true, pred_decision, labels) # doctest: +ELLIPSIS
0.56...
"""
check_consistent_length(y_true, pred_decision)
pred_decision = check_array(pred_decision, ensure_2d=False)
y_true = column_or_1d(y_true)

y_true_unique = np.unique(y_true)
if y_true_unique.size > 2:
if (labels is None and pred_decision.ndim > 1 and
(np.size(y_true_unique) != pred_decision.shape[1])):
raise ValueError("Please include all labels in y_true "
"or pass labels as third argument")
if labels is None:
labels = y_true_unique
le = LabelEncoder()
le.fit(labels)
y_true = le.transform(y_true)
mask = np.ones_like(pred_decision, dtype=bool)
mask[np.arange(y_true.shape[0]), y_true] = False
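# pred_decision[~mask] picks, for each sample, the decision value of its
# true class; the remaining entries are reshaped back to
# (n_samples, n_classes - 1) so the row-wise maximum over the other
# classes can be subtracted (the Crammer-Singer margin)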
margin = pred_decision[~mask]
margin -= np.max(pred_decision[mask].reshape(y_true.shape[0], -1),
axis=1)

[Review comment]
Member: maybe a comment here to explain the reshape

else:
# Handles binary class case
# this code assumes that positive and negative labels
# are encoded as +1 and -1 respectively
pred_decision = column_or_1d(pred_decision)
pred_decision = np.ravel(pred_decision)

lbin = LabelBinarizer(neg_label=-1)
y_true = lbin.fit_transform(y_true)[:, 0]

try:
margin = y_true * pred_decision
except TypeError:
raise TypeError("pred_decision should be an array of floats.")

losses = 1 - margin
# The hinge_loss doesn't penalize good enough predictions.
losses[losses <= 0] = 0
return np.mean(losses)