SGDClassifier -- class_weights & sample_weights

Easy one first: there is an unused `class_weight` parameter in the `fit` method signature; `class_weight` actually flows in through the constructor:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/stochastic_gradient.py#L527

Just to prove it:

```
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
import numpy as np

X, y = make_classification(n_features=5, weights=[0.7, 0.3],
                           n_clusters_per_class=1, random_state=415)

# Baseline
clf = SGDClassifier()
clf.fit(X, y)
print(clf.coef_)
# [[ 2.13434174  1.8734288   2.12685039  5.08116123  2.89369872]]

# With unused fit class_weight parameter
clf = SGDClassifier()
clf.fit(X, y, class_weight='auto')
print(clf.coef_)
# [[ 2.13434174  1.8734288   2.12685039  5.08116123  2.89369872]]
```

Now weighting the samples in different (equivalent) ways:

```
# With auto-weights
clf = SGDClassifier(class_weight='auto')
clf.fit(X, y)
print(clf.coef_)
# [[ 10.10838607  -4.29529238  13.14026606  -5.99728163   7.65887541]]

weights = compute_class_weight('auto', clf.classes_, y)
weights = dict(zip(clf.classes_, weights))
mapper = np.vectorize(lambda c: weights[c])
weights = mapper(y)

# With manual auto-weights
clf = SGDClassifier()
clf.fit(X, y, sample_weight=weights)
print(clf.coef_)
# [[ 10.10838607  -4.29529238  13.14026606  -5.99728163   7.65887541]]

# With manual auto-weights & unused fit class_weight parameter
clf = SGDClassifier()
clf.fit(X, y, sample_weight=weights, class_weight='auto')
print(clf.coef_)
# [[ 10.10838607  -4.29529238  13.14026606  -5.99728163   7.65887541]]
```

All fine so far, but if you set both `class_weight` in the constructor and `sample_weight` in `fit`, the resulting weights appear to be multiplied together.

```
# With manual auto-weights, squared
clf = SGDClassifier()
clf.fit(X, y, sample_weight=weights**2)
print(clf.coef_)
# [[  3.22495438  14.11510502   0.58504094   6.38631993   9.55338404]]

# With auto-weights and manual auto-weights -- multiplicative
clf = SGDClassifier(class_weight='auto')
clf.fit(X, y, sample_weight=weights)
print(clf.coef_)
# [[  3.22495438  14.11510502   0.58504094   6.38631993   9.55338404]]
```
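The multiplicative behaviour observed above can be illustrated without scikit-learn at all. The following is only a sketch of the model the experiment suggests, not the library's actual implementation: `inverse_frequency_weights` and `effective_weights` are hypothetical helpers, and the inverse-frequency formula is an assumed stand-in for the `'auto'` heuristic, whose exact formula may differ.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Hypothetical stand-in for class_weight='auto':
    weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def effective_weights(labels, class_weight=None, sample_weight=None):
    """Per-sample weight under the assumed multiplicative model:
    class weight of the sample's label times its sample weight."""
    cw = class_weight or {}
    sw = sample_weight if sample_weight is not None else [1.0] * len(labels)
    return [cw.get(c, 1.0) * w for c, w in zip(labels, sw)]


# Toy labels with a 70/30 imbalance, mirroring weights=[0.7, 0.3] above.
y_toy = [0] * 7 + [1] * 3
cw = inverse_frequency_weights(y_toy)
per_sample = [cw[c] for c in y_toy]  # class weights expanded per sample

class_only = effective_weights(y_toy, class_weight=cw)
sample_only = effective_weights(y_toy, sample_weight=per_sample)
both = effective_weights(y_toy, class_weight=cw, sample_weight=per_sample)
squared = [w ** 2 for w in per_sample]
```

Under this model `class_only` and `sample_only` agree, while `both` matches `squared`, which is the same pattern the coefficients show.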

Whether this is desirable or not is one thing, but it does not appear to be documented anywhere, i.e. neither the `class_weight` nor the `sample_weight` docstring refers to the other. Perhaps a warning or error should be raised, or the interaction should at least be mentioned in the docstrings.
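To make the warning suggestion concrete, here is one possible shape for such a check. It is purely illustrative: `warn_if_weights_combine` is a hypothetical helper, not an existing scikit-learn function, and the message wording is an assumption.

```python
import warnings


def warn_if_weights_combine(class_weight, sample_weight):
    """Hypothetical check: warn when both weightings are active at once.

    Returns True if a warning was issued, False otherwise.
    """
    if class_weight is not None and sample_weight is not None:
        warnings.warn(
            "class_weight and sample_weight are both set; "
            "their effects appear to be multiplied together.",
            UserWarning,
        )
        return True
    return False
```

A call such as `warn_if_weights_combine('auto', weights)` at the start of `fit` would flag the combination without changing the fitted result.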

