SGDClassifier -- class_weights & sample_weights #3928

@trevorstephens

Easy one first: there is an unused class_weight parameter in the fit method signature. The class_weight that is actually used flows in through the constructor:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/stochastic_gradient.py#L527

Just to prove it:

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
import numpy as np

X, y = make_classification(n_features=5, weights=[0.7, 0.3],
                           n_clusters_per_class=1, random_state=415)

# Baseline
clf = SGDClassifier()
clf.fit(X, y)
print(clf.coef_)
# [[ 2.13434174  1.8734288   2.12685039  5.08116123  2.89369872]]

# With unused fit class_weight parameter
clf = SGDClassifier()
clf.fit(X, y, class_weight='auto')
print(clf.coef_)
# [[ 2.13434174  1.8734288   2.12685039  5.08116123  2.89369872]]

Now weighting the samples in different (equivalent) ways:

# With auto-weights
clf = SGDClassifier(class_weight='auto')
clf.fit(X, y)
print(clf.coef_)
# [[ 10.10838607  -4.29529238  13.14026606  -5.99728163   7.65887541]]

# Expand the per-class auto-weights into one weight per sample
weights = compute_class_weight('auto', clf.classes_, y)
weights = dict(zip(clf.classes_, weights))
mapper = np.vectorize(lambda c: weights[c])
weights = mapper(y)

# With manual auto-weights
clf = SGDClassifier()
clf.fit(X, y, sample_weight=weights)
print(clf.coef_)
# [[ 10.10838607  -4.29529238  13.14026606  -5.99728163   7.65887541]]

# With manual auto-weights & unused fit class_weight parameter
clf = SGDClassifier()
clf.fit(X, y, sample_weight=weights, class_weight='auto')
print(clf.coef_)
# [[ 10.10838607  -4.29529238  13.14026606  -5.99728163   7.65887541]]

All fine so far, but if you pass both class_weight in the constructor and sample_weight to fit, the two sets of weights appear to be multiplied together.

# With manual auto-weights, squared
clf = SGDClassifier()
clf.fit(X, y, sample_weight=weights**2)
print(clf.coef_)
# [[  3.22495438  14.11510502   0.58504094   6.38631993   9.55338404]]

# With auto-weights and manual auto-weights -- multiplicative
clf = SGDClassifier(class_weight='auto')
clf.fit(X, y, sample_weight=weights)
print(clf.coef_)
# [[  3.22495438  14.11510502   0.58504094   6.38631993   9.55338404]]
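
To spell out what "multiplicative" means here, a minimal pure-Python sketch follows. It assumes the effective weight of each sample is its class weight times its sample weight, which is what the matching coefficients above suggest but is not taken from the scikit-learn source. The helper names and the inverse-frequency formula are illustrative only and are not necessarily the exact 'auto' heuristic.

```python
from collections import Counter


def inverse_frequency_class_weights(y):
    """Illustrative class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def effective_weights(y, class_weight, sample_weight):
    """Assumed combination rule: class weight times per-sample weight."""
    return [class_weight[c] * w for c, w in zip(y, sample_weight)]


# Toy imbalanced labels: three samples of class 0, one of class 1
y = [0, 0, 0, 1]
cw = inverse_frequency_class_weights(y)

# Expand the class weights to one weight per sample (the "manual" route)
per_sample = [cw[c] for c in y]

# Passing both weightings at once, under the assumed multiplicative rule
combined = effective_weights(y, cw, per_sample)

# Passing only the squared per-sample weights
squared = [w ** 2 for w in per_sample]

# The two routes agree, mirroring the identical coef_ arrays above
assert all(abs(a - b) < 1e-12 for a, b in zip(combined, squared))
```

Under that assumption, anyone who passes the same weighting through both routes ends up weighting each sample by the square of what they intended.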

Whether this is desirable or not is one thing, but it does not appear to be documented anywhere, i.e. neither the class_weight nor the sample_weight docstring refers to the other. I feel that a warning or error should perhaps be raised when both are supplied, or at least the interaction should be mentioned in the docstrings.
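
As a rough illustration of the warning option, here is a sketch of a standalone guard. The function check_weight_interaction is hypothetical and not part of scikit-learn; it only shows the kind of check that could run at the start of fit, assuming that None means "not supplied" for both arguments.

```python
import warnings


def check_weight_interaction(class_weight, sample_weight):
    """Hypothetical guard: warn when both weightings are supplied.

    Returns True if a warning was issued, False otherwise.
    """
    if class_weight is not None and sample_weight is not None:
        warnings.warn(
            "Both class_weight and sample_weight were given; "
            "their effects are multiplied together.",
            UserWarning,
        )
        return True
    return False


# Only one weighting supplied: no warning
check_weight_interaction(None, [1.0, 2.0])

# Both supplied: a UserWarning is emitted
check_weight_interaction("auto", [1.0, 2.0])
```

A warning rather than an error keeps the multiplicative behaviour available to anyone relying on it, while making the interaction visible.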
