Closed
Description
The code which does clamping in sklearn.semi_supervised.LabelSpreading
appears to be incorrect:
clamp_weights = np.ones((n_samples, 1))
clamp_weights[unlabeled, 0] = self.alpha
# ...
y_static = np.copy(self.label_distributions_)
if self.alpha > 0.:
    y_static *= 1 - self.alpha
y_static[unlabeled] = 0
# ...
while ...:
    ...
    self.label_distributions_ = safe_sparse_dot(
        graph_matrix, self.label_distributions_)
    # clamp
    self.label_distributions_ = np.multiply(
        clamp_weights, self.label_distributions_) + y_static
This does the following:
- If the i-th sample is labeled, then: y_new[i] = 1.0 * M * y_old[i] + (1 - alpha) * y_init[i]
- If the i-th sample is unlabeled, then: y_new[i] = alpha * M * y_old[i] + 0.0
This is backwards: labeled samples are propagated at full weight while unlabeled samples are damped by alpha. The correct way to do this is:
- If the i-th sample is labeled, then: y_new[i] = alpha * M * y_old[i] + (1 - alpha) * y_init[i]
- If the i-th sample is unlabeled, then: y_new[i] = 1.0 * M * y_old[i] + 0.0
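To make the two sets of update rules concrete, here is a minimal NumPy sketch of a single propagate-then-clamp step. The helper `clamp_step`, the 3x3 matrix `M`, and the starting state `y_old` are made up for illustration (`M` is row-stochastic here, which is not necessarily what LabelSpreading builds); only the weighting logic mirrors the snippet in this report.

```python
import numpy as np

def clamp_step(M, y_old, y_init, labeled, alpha, damp_labeled):
    """One propagate-then-clamp step (hypothetical helper, not sklearn API)."""
    weights = np.ones((y_old.shape[0], 1))
    # damp_labeled=True is the proposed rule; False mimics the current code.
    damped = labeled if damp_labeled else ~labeled
    weights[damped, 0] = alpha
    # Labeled rows keep (1 - alpha) of their initial labels; unlabeled rows get 0.
    y_static = (1 - alpha) * y_init
    y_static[~labeled] = 0
    return weights * (M @ y_old) + y_static

# Toy inputs (assumptions): two labeled samples, one unlabeled.
M = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
y_init = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
y_old = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
labeled = np.array([True, True, False])
alpha = 0.2

current = clamp_step(M, y_old, y_init, labeled, alpha, damp_labeled=False)
proposed = clamp_step(M, y_old, y_init, labeled, alpha, damp_labeled=True)
print(current.sum(axis=1))   # row sums under the current weighting
print(proposed.sum(axis=1))  # row sums under the proposed weighting
```

Because every row of `y_old` sums to 1 in this toy setup, the proposed weighting keeps each row summing to 1, while the current weighting yields 2 - alpha for labeled rows and alpha for the unlabeled row.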
The fix is relatively simple:
-clamp_weights[unlabeled, 0] = self.alpha
+clamp_weights[~unlabeled, 0] = self.alpha
I can create a PR for this but am not sure what kind of test cases I should add to avoid a regression, if any.
Test case:
from sklearn.semi_supervised import LabelSpreading

samples = [[1., 0.], [0., 1.], [1., 2.5]]
labels = [0, 1, -1]  # -1 marks the unlabeled sample
mdl = LabelSpreading(kernel='rbf', max_iter=5000)
mdl.fit(samples, labels)  # This will use up all 5000 iterations without converging
With the fix in place, it converges in only 6 iterations.
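As one possible starting point for a regression test, the sketch below counts iterations under both weightings on a toy problem using NumPy only. Everything in it is an assumption for illustration: `count_iterations` is a hypothetical helper, the matrix is a hand-picked row-stochastic one rather than the graph LabelSpreading builds, and the stopping rule (summed absolute change below `tol`) is a guess at a reasonable criterion. It does not reproduce the 5000 versus 6 figures reported here; it only shows the direction of the effect.

```python
import numpy as np

def count_iterations(M, y_init, labeled, alpha, damp_labeled,
                     tol=1e-3, max_iter=5000):
    """Iterate propagate-then-clamp until the change drops below tol.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    weights = np.ones((len(y_init), 1))
    # damp_labeled=True is the proposed rule; False mimics the current code.
    damped = labeled if damp_labeled else ~labeled
    weights[damped, 0] = alpha
    y_static = (1 - alpha) * y_init
    y_static[~labeled] = 0
    y = y_init.copy()
    for n_iter in range(1, max_iter + 1):
        y_new = weights * (M @ y) + y_static
        if np.abs(y_new - y).sum() < tol:
            return n_iter
        y = y_new
    return max_iter

# Toy inputs (assumptions): two labeled samples, one unlabeled.
M = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
y_init = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
labeled = np.array([True, True, False])

n_current = count_iterations(M, y_init, labeled, 0.2, damp_labeled=False)
n_proposed = count_iterations(M, y_init, labeled, 0.2, damp_labeled=True)
print(n_current, n_proposed)
```

On this toy input both variants converge, but the proposed weighting needs fewer iterations. A real regression test would more likely fit LabelSpreading on the three samples from this report and assert that the iteration count stays well below `max_iter`.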