SGDClassifier -- class_weights & sample_weights #3928

Closed
@trevorstephens

Description

Easy one first: there is an unused class_weight parameter in the fit method signature. class_weight actually flows in through the constructor:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/stochastic_gradient.py#L527

Just to prove it:

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
import numpy as np

X, y = make_classification(n_features=5, weights=[0.7, 0.3],
                           n_clusters_per_class=1, random_state=415)

# Baseline
clf = SGDClassifier()
clf.fit(X, y)
print(clf.coef_)
# [[ 2.13434174  1.8734288   2.12685039  5.08116123  2.89369872]]

# With the unused class_weight parameter of fit
clf = SGDClassifier()
clf.fit(X, y, class_weight='auto')
print(clf.coef_)
# [[ 2.13434174  1.8734288   2.12685039  5.08116123  2.89369872]]

Now weighting the samples in different (equivalent) ways:

# With auto-weights
clf = SGDClassifier(class_weight='auto')
clf.fit(X, y)
print(clf.coef_)
# [[ 10.10838607  -4.29529238  13.14026606  -5.99728163   7.65887541]]

weights = compute_class_weight('auto', clf.classes_, y)
weights = dict(zip(clf.classes_, weights))
mapper = np.vectorize(lambda c: weights[c])
weights = mapper(y)

# With manual auto-weights
clf = SGDClassifier()
clf.fit(X, y, sample_weight=weights)
print(clf.coef_)
# [[ 10.10838607  -4.29529238  13.14026606  -5.99728163   7.65887541]]

# With manual auto-weights & the unused class_weight parameter of fit
clf = SGDClassifier()
clf.fit(X, y, sample_weight=weights, class_weight='auto')
print(clf.coef_)
# [[ 10.10838607  -4.29529238  13.14026606  -5.99728163   7.65887541]]

All fine so far, but if you set both class_weight in the constructor and sample_weight in fit, the two sets of weights appear to be multiplied together.

# With manual auto-weights, squared
clf = SGDClassifier()
clf.fit(X, y, sample_weight=weights**2)
print(clf.coef_)
# [[  3.22495438  14.11510502   0.58504094   6.38631993   9.55338404]]

# With auto-weights and manual auto-weights -- multiplicative
clf = SGDClassifier(class_weight='auto')
clf.fit(X, y, sample_weight=weights)
print(clf.coef_)
# [[  3.22495438  14.11510502   0.58504094   6.38631993   9.55338404]]
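
The arithmetic behind that match can be sketched without scikit-learn. This is only an illustration of the suspected behaviour, not the library's actual code: the labels and class weights below are made-up values, and `effective` is a hypothetical name for the per-sample weight that would result if the constructor's class_weight and fit's sample_weight are multiplied.

```python
import numpy as np

# Hypothetical labels and class weights, chosen only for illustration.
y = np.array([0, 0, 0, 1, 1])
class_weight = {0: 0.8, 1: 1.5}

# Per-sample weights that mirror the class weights (the "manual" route).
sample_weight = np.array([class_weight[c] for c in y])

# Assumption: when both are supplied, each sample's weight is the product
# of its class weight and its sample weight.
per_sample_class_weight = np.array([class_weight[c] for c in y])
effective = per_sample_class_weight * sample_weight

# Under that assumption, supplying the same weights twice squares them.
print(np.allclose(effective, sample_weight ** 2))
```

If the product assumption holds, passing the class weights through both routes is the same as fitting with sample_weight=weights**2, which is what the two identical coefficient vectors above suggest.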

Whether this is desirable or not is one thing, but it does not appear to be documented anywhere, i.e. neither the class_weight nor the sample_weight docstring refers to the other. I feel like perhaps a warning or error should be raised, or at least the interaction should be mentioned in the docstrings.
