FIX normalization in semi_supervised label_propagation #31924
Conversation
@snath-xoc @antoinebaker could you have a look here please?

I can take a look, thank you for the ping
@pytest.mark.parametrize("constructor", CONSTRUCTOR_TYPES)
@pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[1:2])
Is there a reason this is tested only for ESTIMATORS[1:2]? I tried it with the LabelSpreading estimators as well and it fails; something to perhaps investigate further (and for now mark as XFAIL), unless there are any insights as to why it's expected to fail?
The first and third ESTIMATORS use the rbf method, which creates a dense affinity_matrix.
The fourth and following ESTIMATORS use the LabelSpreading class, which constructs a laplacian_matrix instead of an affinity_matrix, so the normalization is different.
It might be cleaner to inline the Estimator and parameters here instead of fixtures, since we are only testing one case. I agree that it looks strange to use only one of the list of Estimators, but that's all we want to test.
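To illustrate why the LabelSpreading cases fail this particular row-sum check, here is a toy sketch. The matrix W below is made up, and the D^-1/2 W D^-1/2 form is only shorthand for what LabelSpreading builds via the normalized graph laplacian:

```python
import numpy as np

# LabelPropagation row-normalizes the affinity matrix, so rows sum to 1.
# LabelSpreading uses a symmetrically normalized laplacian-style graph,
# whose rows do not in general sum to 1.
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
d = W.sum(axis=1)  # degrees

row_normed = W / d[:, np.newaxis]            # LabelPropagation-style graph
assert np.allclose(row_normed.sum(axis=1), 1.0)

D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
lap_normed = D_inv_sqrt @ W @ D_inv_sqrt     # LabelSpreading-style graph
assert not np.allclose(lap_normed.sum(axis=1), 1.0)
```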
Upon reflection, I think it is worthwhile to test both dense and sparse cases for LabelPropagation._build_graph. So I've included the suggestion to use ESTIMATORS[:2].
I haven't added any tests for LabelSpreading._build_graph because the rows are not supposed to sum to 1 there. [The result there is not normalized beyond the normalization done while computing the laplacian from the affinity matrix].
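As a sanity sketch of what the dense/sparse agreement test amounts to (the matrix A here is made up, asymmetric, with unequal row sums; the sparse path uses a diagonal-matrix multiply as one way to row-normalize):

```python
import numpy as np
from scipy import sparse

# Asymmetric affinity matrix with unequal row sums, as a user-written
# kernel might produce.
A = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [5.0, 1.0, 0.0]])

# Dense path: divide each row by its own sum.
dense_normed = A / A.sum(axis=1)[:, np.newaxis]

# Sparse path: multiply by a diagonal matrix of inverse row sums.
A_sp = sparse.csr_matrix(A)
normalizer = np.ravel(A_sp.sum(axis=1))
sparse_normed = (sparse.diags(1.0 / normalizer) @ A_sp).toarray()

assert np.allclose(dense_normed, sparse_normed)
assert np.allclose(dense_normed.sum(axis=1), 1.0)
```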
Great, thank you very much @dschult. I think we now need to add a changelog entry, i.e. describe the fix under the PR number (you will see some examples under doc/whats_new). Would you be able to add something? I can also help initially if you're unsure.
Thanks @snath-xoc

Hey @dschult, looks good to me. We need one more reviewer now, perhaps @antoinebaker?

Just a gentle bump here.
antoinebaker
left a comment
Thanks @dschult for the PR!
It does seem that affinity_matrix was not properly row-normalized in the old code. Here are a few suggestions and comments.
Resolved (outdated) review thread on doc/whats_new/upcoming_changes/sklearn.semi_supervised/31924.fix.rst
  normalizer = affinity_matrix.sum(axis=1)
  if sparse.issparse(affinity_matrix):
-     affinity_matrix.data /= np.diag(np.array(normalizer))
- else:
+     # handle spmatrix (make 1D)
+     if sparse.isspmatrix(affinity_matrix):
+         normalizer = np.ravel(normalizer)
+     # common case: knn method gives row sum k for all rows but
+     # a user-kernel may have varied row sums so "else" handles that
+     if np.all(normalizer == normalizer[0]):
+         affinity_matrix.data /= normalizer[0]
+     else:
+         if affinity_matrix.format == "csr":
+             repeats = np.diff(affinity_matrix.indptr)
+             rows = np.repeat(np.arange(affinity_matrix.shape[0]), repeats)
+         else:  # CSC format
+             rows = affinity_matrix.indices
+         affinity_matrix.data /= normalizer[rows]
+ else:  # Dense affinity_matrix
I wonder if we could use sparse multiplication instead. The code is much easier to read, and I believe it's pretty efficient for row normalization (to be confirmed). Have you tried comparing sparse multiplication with your alternatives for the csr/csc formats?
normalizer = affinity_matrix.sum(axis=1)
# handle spmatrix (make 1D)
if sparse.isspmatrix(affinity_matrix):
    normalizer = np.ravel(normalizer)
if sparse.issparse(affinity_matrix):
    inv_normalizer = sparse.diags(1.0 / normalizer)
    affinity_matrix = inv_normalizer @ affinity_matrix
else:
    affinity_matrix /= normalizer[:, np.newaxis]
The matmul-by-diagonal approach is typically slower than multiplying the guts of the sparse representation. For small graphs, matmul takes about 10 times longer, but it does better with larger matrices, leveling off at about 1.8 times longer for CSR format and 4 times longer for CSC format. That pattern holds at least until my memory gets full. But the absolute times aren't long (a second or so at most), so readability can take precedence. I'll do whatever seems best. (Timing details below.)
Timing results using %timeit (mean per loop over 7 runs)

DENSITY = 1.0

shape             format   guts      matmul
(3, 3)            csr      8.35 μs   93.7 μs
(3, 3)            csc      6.44 μs   111 μs
(2500, 2500)      csr      15.6 ms   26.2 ms
(2500, 2500)      csc      15.8 ms   66.3 ms
(5000, 5000)      csr      61.5 ms   109 ms
(5000, 5000)      csc      70 ms     317 ms
(10000, 10000)    csr      297 ms    453 ms
(10000, 10000)    csc      281 ms    1.3 s

DENSITY = 0.01

shape             nnz       format   guts      matmul
(2500, 2500)      62500     csr      178 μs    389 μs
(2500, 2500)      62500     csc      139 μs    536 μs
(5000, 5000)      250000    csr      728 μs    1.26 ms
(5000, 5000)      250000    csc      607 μs    2.06 ms
(10000, 10000)    1000000   csr      2.68 ms   4.87 ms
(10000, 10000)    1000000   csc      2.61 ms   8.55 ms
timing code

import numpy as np
rng = np.random.default_rng()
import scipy as sp
import sklearn

affinities = [
    np.array([[1.0, 1.0, 0.0], [2.0, 1.0, 1.0], [0.0, 1.0, 3.0]]),
    # rng.random((500, 500)),
    rng.random((2500, 2500)),
    rng.random((5000, 5000)),
    rng.random((10_000, 10_000)),
    sp.sparse.random_array((2500, 2500), density=0.01, rng=rng),
    sp.sparse.random_array((5000, 5000), density=0.01, rng=rng),
    sp.sparse.random_array((10000, 10000), density=0.01, rng=rng),
    # rng.random((20_000, 20_000)),
    # rng.random((40_000, 40_000)),
]

def setup(affinity_matrix):
    affinity_csr = sp.sparse.csr_array(affinity_matrix)
    affinity_csc = sp.sparse.csc_array(affinity_matrix)
    return affinity_csr, affinity_csc

def guts_normalization(affinity_matrix):
    normalizer = affinity_matrix.sum(axis=1)
    if affinity_matrix.format == "csr":
        repeats = np.diff(affinity_matrix.indptr)
        rows = np.repeat(np.arange(affinity_matrix.shape[0]), repeats)
    else:  # CSC format
        rows = affinity_matrix.indices
    affinity_matrix.data /= normalizer[rows]
    return affinity_matrix

def matmul_normalization(affinity_matrix):
    normalizer = affinity_matrix.sum(axis=1)
    inv_normalizer = sp.sparse.diags(1.0 / normalizer)
    affinity_matrix = inv_normalizer @ affinity_matrix
    return affinity_matrix

# To run the timing (in IPython, since %timeit is a line magic):
for affinity in affinities:
    for affinity_matrix in setup(affinity):
        guts = guts_normalization(affinity_matrix)
        matmul = matmul_normalization(affinity_matrix)
        np.testing.assert_allclose(guts.toarray(), matmul.toarray())
        print("\nTiming guts then matmul for ", end="")
        print(f"shape: {affinity_matrix.shape} ", end="")
        print(f"format: {affinity_matrix.format!r} ", end="")
        print(f"nnz: {affinity_matrix.nnz}")
        %timeit guts_normalization(affinity_matrix)
        %timeit matmul_normalization(affinity_matrix)
Thanks for the timing analysis!
How does the overhead compare to the overall fit time?
As _build_graph is only called once, maybe the overhead is negligible compared to the many iterations in fit.
Using a 2500x2500 affinity matrix, I time fit at 477 ms,
while building the graph takes 27 ms for csr.
Another way to put it into perspective: self.X_.tocsr() takes 36 ms.
Since timing isn't critical here, in the long run we can just use
affinity_matrix /= normalizer[:, np.newaxis] for both dense and sparse cases.
But the minimum SciPy dependency has to be v1.12 before we can do that.
So I changed this to use the matrix multiply for sparse, with a comment to update when SciPy 1.12+ is required.
I think this should be ready to go.
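As background on the spmatrix special-casing mentioned above, `.sum(axis=1)` returns different shapes for sparse matrices and sparse arrays, which is why the code ravels the normalizer. A quick illustration (the 3x3 matrix is arbitrary):

```python
import numpy as np
from scipy import sparse

dense = np.array([[1.0, 1.0, 0.0],
                  [2.0, 1.0, 1.0],
                  [0.0, 1.0, 3.0]])

# spmatrix .sum(axis=1) returns a 2D np.matrix; sparse array returns a 1D
# ndarray. np.ravel makes both usable as a 1D normalizer.
m_sum = sparse.csr_matrix(dense).sum(axis=1)
a_sum = sparse.csr_array(dense).sum(axis=1)

assert m_sum.shape == (3, 1)
assert a_sum.shape == (3,)
assert np.allclose(np.ravel(m_sum), a_sum)
```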
Force-pushed c9a4a88 to ebf27ab
antoinebaker
left a comment
Thanks again for the PR @dschult. A couple of nitpicks, otherwise LGTM!
# handle spmatrix (make normalizer 1D)
if sparse.isspmatrix(affinity_matrix):
    normalizer = np.ravel(normalizer)
# Todo: when SciPy 1.12+ is min dependence, replace up to ---- with:
Suggested change:
- # Todo: when SciPy 1.12+ is min dependence, replace up to ---- with:
+ # TODO: when SciPy 1.12+ is min dependence, replace up to ---- with:
clf = Estimator(kernel=kernel_affinity_matrix).fit(X, labels)
graph = clf._build_graph()
assert_allclose(graph.sum(axis=1), 1)  # normalized
Suggested change:
- assert_allclose(graph.sum(axis=1), 1)  # normalized
+ assert_allclose(graph.sum(axis=1), 1)  # normalized rows
Thanks @antoinebaker! I picked the nits. @snath-xoc I think you're now "it" in this game of tag.
Thank you @dschult, LGTM as well. Shall we mark as ready to merge, @adrinjalali?
We've got two approvals with reviewer read access. |
@@ -0,0 +1,4 @@
- User written kernel results are now normalized in
  :class:`semi-supervized._label_propagation.LabelPropagation`
Suggested change:
- :class:`semi-supervized._label_propagation.LabelPropagation`
+ :class:`~sklearn.semi_supervized.LabelPropagation`
@pytest.mark.parametrize("constructor", CONSTRUCTOR_TYPES)
@pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[:2])
this ESTIMATORS[:2] is brittle. We might change that list in the future, and I'm not sure why not all of them are tested here.
I think it's testing LabelPropagation instances only. Maybe creating a new constant:

LP_ESTIMATORS = [est for est in ESTIMATORS if isinstance(est, LabelPropagation)]

Co-authored-by: antoinebaker <[email protected]>
Force-pushed 7c33f42 to bed1545
Thanks @adrinjalali!
@@ -0,0 +1,4 @@
- User written kernel results are now normalized in
  :class:`semi-supervized.LabelPropagation`
Suggested change:
- :class:`semi-supervized.LabelPropagation`
+ :class:`semi_supervised.LabelPropagation`
[I think so, you can check if it renders properly in rendered docs / what's new / link for LabelPropagation should redirect to the API doc]
I think

pytest -v sklearn/semi_supervised -k test_label_propagation_build_graph_normalized

gives

test_label_propagation_build_graph_normalized[NOTSET-array] SKIPPED (got empty parameter set for (Estimator, parame...)

That would explain the drop in coverage.
LP_ESTIMATORS = [
    est for est in ESTIMATORS if isinstance(est, label_propagation.LabelPropagation)
]
Suggested change:
- LP_ESTIMATORS = [
-     est for est in ESTIMATORS if isinstance(est, label_propagation.LabelPropagation)
- ]
+ LP_ESTIMATORS = [
+     (klass, params)
+     for (klass, params) in ESTIMATORS
+     if klass == label_propagation.LabelPropagation
+ ]
My bad :) it wasn't the proper filter.
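For context, here is a minimal sketch of why the first filter produced an empty parameter set. The ESTIMATORS list below is a made-up stand-in mirroring the (class, params) tuples in the test module, not the real one:

```python
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

# Hypothetical stand-in for the test module's ESTIMATORS constant:
# tuples of (estimator class, constructor parameters).
ESTIMATORS = [
    (LabelPropagation, {"kernel": "rbf"}),
    (LabelPropagation, {"kernel": "knn", "n_neighbors": 2}),
    (LabelSpreading, {"kernel": "rbf"}),
]

# The first filter checks isinstance on the tuples themselves, so it
# matches nothing and pytest reports "got empty parameter set".
wrong = [est for est in ESTIMATORS if isinstance(est, LabelPropagation)]
assert wrong == []

# Filtering on the class element keeps the intended entries.
LP_ESTIMATORS = [
    (klass, params) for (klass, params) in ESTIMATORS if klass is LabelPropagation
]
assert len(LP_ESTIMATORS) == 2
```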
Resolved (outdated) review thread on doc/whats_new/upcoming_changes/sklearn.semi_supervised/31924.fix.rst
…ix.rst Co-authored-by: antoinebaker <[email protected]>
Thanks @antoinebaker!
…31924) Co-authored-by: antoinebaker <[email protected]>
Fixes #31872: strange normalization in semi-supervised label propagation
The trouble briefly:

- The normalization in semi_supervised should use axis=1 (row sums). Using the wrong axis does not cause errors so long as we have a symmetric affinity_matrix.
- The dense case arises for kernel "rbf", which provides symmetric matrices. But if someone provides their own kernel, the normalization could be incorrect.
- The sparse case arises for kernel "knn", which has all rows sum to k. But if someone provides their own kernel, the normalization could be incorrect.

This PR adds tests of proper normalization that agrees between sparse and dense.
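A minimal sketch of this failure mode (the matrix A is made up; it stands in for a user-written kernel result):

```python
import numpy as np

# A hypothetical user-supplied kernel result: asymmetric, unequal row sums.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 1.0]])

# Correct: normalize each row by its own sum (axis=1).
right_normed = A / A.sum(axis=1)[:, np.newaxis]
assert np.allclose(right_normed.sum(axis=1), 1.0)

# Wrong: using column sums (axis=0) as the normalizer breaks row sums.
wrong_normed = A / A.sum(axis=0)[:, np.newaxis]
assert not np.allclose(wrong_normed.sum(axis=1), 1.0)

# For a symmetric matrix both axis sums coincide, which hides the bug.
S = (A + A.T) / 2
assert np.allclose(S.sum(axis=0), S.sum(axis=1))
assert np.allclose((S / S.sum(axis=0)[:, np.newaxis]).sum(axis=1), 1.0)
```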
It also adjusts the code so it works with either sparse arrays or sparse matrices.
The tests check that normalization agrees between the dense and sparse cases even if the affinity_matrix is not symmetric and does not have equal row sums. The errors corrected here do not arise for users who use the built-in sklearn kernel options.
I discovered this while working on making sure sparse arrays and sparse matrices result in the same values (#31177). This PR splits the fix out of that PR because it corrects/changes the current code and adds a test; separating it from the large number of changes in the other PR is prudent and eases review.