Conversation

@dschult
Contributor

@dschult dschult commented Aug 11, 2025

Fixes #31872 : strange normalization in semi-supervised label propagation

The trouble briefly:

  • In the dense affinity_matrix case, the current code sums over axis=0 and then divides the rows by these sums. Other normalizations in semi_supervised use axis=1. This causes no errors as long as the affinity_matrix is symmetric. The dense case arises for the "rbf" kernel, which produces symmetric matrices, but if someone provides their own kernel the normalization can be incorrect.
  • In the sparse affinity_matrix case, the current code divides all rows by the sum of the first row. This causes no errors as long as the row sums are all equal. The sparse case arises for the "knn" kernel, where every row sums to k, but if someone provides their own kernel the normalization can be incorrect.
  • The normalization differs between the dense and sparse cases, which could confuse someone writing their own kernel.
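A small sketch of the first two bullets (the asymmetric affinity matrix with unequal row sums is hypothetical, i.e. the kind a user-supplied kernel might produce; the built-in kernels never produce one):

```python
import numpy as np

# Hypothetical asymmetric affinity matrix with unequal row sums,
# as a user-supplied kernel might produce.
W = np.array([[1.0, 1.0, 0.0],
              [2.0, 1.0, 1.0],
              [0.0, 1.0, 3.0]])

# Correct normalization: divide each row by its own sum (axis=1).
row_normalized = W / W.sum(axis=1, keepdims=True)
print(row_normalized.sum(axis=1))  # -> [1. 1. 1.]

# Old dense path: divide rows by the *column* sums (axis=0).
# This only coincides with the above when W is symmetric.
col_normalized = W / W.sum(axis=0)
print(col_normalized.sum(axis=1))  # rows no longer sum to 1

# Old sparse path: divide everything by the first row's sum.
# Only correct when all row sums are equal (e.g. knn rows sum to k).
first_row_normalized = W / W.sum(axis=1)[0]
print(first_row_normalized.sum(axis=1))  # -> [1. 2. 2.]
```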

This PR adds tests of proper normalization that agrees between the sparse and dense cases.
It also adjusts the code to work with both sparse arrays and sparse matrices.

The tests check that normalization agrees between dense and sparse cases even if the affinity_matrix is not symmetric and does not have equal row sums. The errors corrected here do not arise for users who use the sklearn kernel options.

I discovered this when working on making sure sparse arrays and sparse matrices result in the same values (#31177). This PR splits it out of the other PR because it corrects/changes the current code and adds a test. Separating it from the large number of changes in the other PR is prudent, and eases review.

@github-actions

github-actions bot commented Aug 11, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e0063f6.

@dschult dschult changed the title Sparse normalizer MAINT: Fix normalization in semi_supervised label_propagation Aug 11, 2025
@dschult dschult changed the title MAINT: Fix normalization in semi_supervised label_propagation FIX normalization in semi_supervised label_propagation Aug 11, 2025
@adrinjalali
Member

@snath-xoc @antoinebaker could you have a look here please?

@snath-xoc
Contributor

I can take a look, thank you for the ping ☺️



@pytest.mark.parametrize("constructor", CONSTRUCTOR_TYPES)
@pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[1:2])
Contributor

@snath-xoc snath-xoc Aug 15, 2025

Is there a reason this is tested only for ESTIMATORS[1:2]? I tried it with the LabelSpreading estimators as well and it fails... something to perhaps investigate further (and for now mark as XFAIL), unless there are insights as to why it's expected to fail?

Contributor Author

The first and third ESTIMATORS use method rbf which creates a dense affinity_matrix.
And the fourth and following ESTIMATORS use the LabelSpreading class that constructs a laplacian_matrix instead of an affinity_matrix, so the normalization is different.

It might be cleaner to inline the Estimator and parameters here instead of fixtures since we are only testing one case. I agree that it looks strange to use only one of the list of Estimators, but that's all we want to test.

Contributor Author

Upon reflection, I think it is worthwhile to test both dense and sparse cases for LabelPropagation._build_graph. So I've included the suggestion to use ESTIMATORS[:2].

I haven't added any tests for LabelSpreading._build_graph because the rows are not supposed to sum to 1 there. [The result there is not normalized beyond the normalization done while computing the laplacian from the affinity matrix].
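To illustrate the difference (a sketch, assuming the usual symmetric normalization S = D^{-1/2} W D^{-1/2} underlying the normalized laplacian; not the actual scikit-learn code):

```python
import numpy as np

# Affinity matrix with zero diagonal, viewed as a graph adjacency.
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
d = W.sum(axis=1)  # degrees: [2., 1., 1.]

# LabelPropagation-style: row-stochastic D^{-1} W, rows sum to 1.
P = W / d[:, np.newaxis]
print(P.sum(axis=1))  # -> [1. 1. 1.]

# LabelSpreading-style: symmetric D^{-1/2} W D^{-1/2};
# rows generally do NOT sum to 1.
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
S = D_inv_sqrt @ W @ D_inv_sqrt
print(S.sum(axis=1))  # approximately [1.414, 0.707, 0.707]
```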

@snath-xoc
Contributor

Great, thank you very much @dschult. I think we now need to add a changelog entry, i.e. describe the fix under the PR number (you will see examples under doc/whats_new). Would you be able to add something? I can also help initially if you're unsure.

@dschult
Contributor Author

dschult commented Aug 29, 2025

Thanks @snath-xoc
I added some text -- suggestions please. :) Or if it should go somewhere else, etc.

@snath-xoc
Contributor

Hey @dschult, looks good to me. We need one more reviewer now, perhaps @antoinebaker?

@dschult
Contributor Author

dschult commented Sep 8, 2025

Just a gentle bump here.
I believe #31177 is also paused while we solve this issue, so it'd be great to get this worked out. :) Thanks

Contributor

@antoinebaker antoinebaker left a comment

Thanks @dschult for the PR!

It seems indeed that affinity_matrix was not properly row normalized in the old code. Here a few suggestions / comments.

Comment on lines 460 to 469
 normalizer = affinity_matrix.sum(axis=1)
 if sparse.issparse(affinity_matrix):
-    affinity_matrix.data /= np.diag(np.array(normalizer))
-else:
+    # handle spmatrix (make 1D)
+    if sparse.isspmatrix(affinity_matrix):
+        normalizer = np.ravel(normalizer)
+    # common case: knn method gives row sum k for all rows but
+    # a user-kernel may have varied row sums so "else" handles that
+    if np.all(normalizer == normalizer[0]):
+        affinity_matrix.data /= normalizer[0]
+    else:
+        if affinity_matrix.format == "csr":
+            repeats = np.diff(affinity_matrix.indptr)
+            rows = np.repeat(np.arange(affinity_matrix.shape[0]), repeats)
+        else:  # CSC format
+            rows = affinity_matrix.indices
+        affinity_matrix.data /= normalizer[rows]
+else:  # Dense affinity_matrix
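The CSR branch above recovers a row index for every stored entry from indptr. A standalone sketch of that trick on a small hypothetical matrix:

```python
import numpy as np
from scipy import sparse

A = sparse.csr_array(np.array([[1.0, 0.0, 2.0],
                               [0.0, 0.0, 0.0],
                               [3.0, 4.0, 5.0]]))

# indptr marks where each row's entries start in A.data;
# consecutive differences give the entry count per row.
repeats = np.diff(A.indptr)  # -> [2 0 3]
# One row index per stored entry of A.data.
rows = np.repeat(np.arange(A.shape[0]), repeats)  # -> [0 0 2 2 2]

# Row-normalize in place on the .data array; the empty row has
# no stored entries, so its zero sum never causes a division.
normalizer = A.sum(axis=1)  # -> [3. 0. 12.]
A.data /= normalizer[rows]
print(A.toarray().sum(axis=1))  # -> [1. 0. 1.]
```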
Contributor

I wonder if we could use sparse multiplication instead. The code is much easier to read, and I believe it's pretty efficient for row normalization (to be confirmed). Have you tried and compared sparse multiplication with your alternatives for the CSR/CSC formats?

normalizer = affinity_matrix.sum(axis=1)
# handle spmatrix (make 1D)
if sparse.isspmatrix(affinity_matrix):
    normalizer = np.ravel(normalizer)
if sparse.issparse(affinity_matrix):
    inv_normalizer = sparse.diags(1.0 / normalizer)
    affinity_matrix = inv_normalizer @ affinity_matrix
else:
    affinity_matrix /= normalizer[:, np.newaxis]

Contributor Author

@dschult dschult Sep 19, 2025

Typical timings are longer for the matmul-by-diagonal approach than for multiplying the guts of the sparse representation. For small matrices, matmul takes about 10 times longer, but it does better with larger matrices, leveling off at about 1.8x longer for CSR format and 4x longer for CSC format. That pattern holds at least until my memory fills up. But the times aren't very long -- a max of a second or so -- so readability can take precedence. I'll do whatever seems best. (timing details below)

Timing results using %timeit
DENSITY = 1.0

Timing guts then matmul for shape: (3, 3), format: 'csr'
8.35 μs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
93.7 μs ± 466 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Timing guts then matmul for shape: (3, 3), format: 'csc'
6.44 μs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
111 μs ± 339 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Timing guts then matmul for shape: (2500, 2500), format: 'csr'
15.6 ms ± 63.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
26.2 ms ± 549 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing guts then matmul for shape: (2500, 2500), format: 'csc'
15.8 ms ± 44.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
66.3 ms ± 366 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing guts then matmul for shape: (5000, 5000), format: 'csr'
61.5 ms ± 248 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
109 ms ± 252 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing guts then matmul for shape: (5000, 5000), format: 'csc'
70 ms ± 3.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
317 ms ± 794 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing guts then matmul for shape: (10000, 10000), format: 'csr'
297 ms ± 749 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
453 ms ± 908 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing guts then matmul for shape: (10000, 10000), format: 'csc'
281 ms ± 439 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.3 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
===================

DENSITY = 0.01
Timing guts then matmul for shape: (2500, 2500), nnz: 62500, format: 'csr'
178 μs ± 6.21 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
389 μs ± 9.25 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Timing guts then matmul for shape: (2500, 2500), nnz: 62500, format: 'csc'
139 μs ± 606 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
536 μs ± 8.74 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Timing guts then matmul for shape: (5000, 5000), nnz: 250000, format: 'csr'
728 μs ± 38.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.26 ms ± 17.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Timing guts then matmul for shape: (5000, 5000), nnz: 250000, format: 'csc'
607 μs ± 37.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
2.06 ms ± 95.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timing guts then matmul for shape: (10000, 10000), nnz: 1000000, format: 'csr'
2.68 ms ± 89.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.87 ms ± 87.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timing guts then matmul for shape: (10000, 10000), nnz: 1000000, format: 'csc'
2.61 ms ± 40.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.55 ms ± 76.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timing code
import numpy as np
rng = np.random.default_rng()
import scipy as sp
import sklearn

affinities = [
    np.array([[1.0, 1.0, 0.0], [2.0, 1.0, 1.0], [0.0, 1.0, 3.0]]),
#    rng.random((500, 500)),
    rng.random((2500, 2500)),
    rng.random((5000, 5000)),
    rng.random((10_000, 10_000)),
    sp.sparse.random_array((2500, 2500), density=0.01, rng=rng),
    sp.sparse.random_array((5000, 5000), density=0.01, rng=rng),
    sp.sparse.random_array((10000, 10000), density=0.01, rng=rng),
#    rng.random((20_000, 20_000)),
#    rng.random((40_000, 40_000)),
]

def setup(affinity_matrix):
    affinity_csr = sp.sparse.csr_array(affinity_matrix)
    affinity_csc = sp.sparse.csc_array(affinity_matrix)
    return affinity_csr, affinity_csc

def guts_normalization(affinity_matrix):
    normalizer = affinity_matrix.sum(axis=1)
    if affinity_matrix.format == "csr":
        repeats = np.diff(affinity_matrix.indptr)
        rows = np.repeat(np.arange(affinity_matrix.shape[0]), repeats)
    else:  # CSC format
        rows = affinity_matrix.indices
    affinity_matrix.data /= normalizer[rows]
    return affinity_matrix

def matmul_normalization(affinity_matrix):
    normalizer = affinity_matrix.sum(axis=1)
    inv_normalizer = sp.sparse.diags(1.0 / normalizer)
    affinity_matrix = inv_normalizer @ affinity_matrix
    return affinity_matrix

print('To run the timing, use:')
print('for affinity in affinities:')
print('  for affinity_matrix in setup(affinity):')
print('    guts = guts_normalization(affinity_matrix)')
print('    matmul = matmul_normalization(affinity_matrix)')
print('    np.testing.assert_allclose(guts.toarray(), matmul.toarray())')
print('    print("\\nTiming guts then matmul for ", end="")')
print('    print(f"shape: {affinity_matrix.shape} ", end="")')
print('    print(f"format: {affinity_matrix.format=}", end="")')
print('    print(f"nnz: {affinity_matrix.nnz=}")')
print('    %timeit guts_normalization(affinity_matrix)')
print('    %timeit matmul_normalization(affinity_matrix)')

Contributor

Thanks for the timing analysis!

How does the overhead compare to the overall fit time?
As _build_graph is only called once, maybe the overhead is negligible compared to the many iterations in fit.

Contributor Author

Using a 2500x2500 affinity matrix, I time fit at 477 ms, while building the graph takes 27 ms for CSR.
Another way to put it in perspective: self.X_.tocsr() takes 36 ms.

Contributor Author

Since timing isn't critical here, in the long run we can just use
affinity_matrix /= normalizer[:, np.newaxis] for both the dense and sparse cases.
But the minimum SciPy dependency has to be v1.12 before we can do that.

So I changed this to use the matrix multiply for sparse, with a comment to update when SciPy 1.12+ is required.

I think this should be ready to go.

Contributor

@antoinebaker antoinebaker left a comment

Thanks again for the PR @dschult. A couple of nitpicks, otherwise LGTM!

# handle spmatrix (make normalizer 1D)
if sparse.isspmatrix(affinity_matrix):
    normalizer = np.ravel(normalizer)
# Todo: when SciPy 1.12+ is min dependence, replace up to ---- with:
Contributor

Suggested change
# Todo: when SciPy 1.12+ is min dependence, replace up to ---- with:
# TODO: when SciPy 1.12+ is min dependence, replace up to ---- with:


clf = Estimator(kernel=kernel_affinity_matrix).fit(X, labels)
graph = clf._build_graph()
assert_allclose(graph.sum(axis=1), 1) # normalized
Contributor

Suggested change
assert_allclose(graph.sum(axis=1), 1) # normalized
assert_allclose(graph.sum(axis=1), 1) # normalized rows

@dschult
Contributor Author

dschult commented Sep 22, 2025

Thanks @antoinebaker! I picked the nits.
This should be ready to go.

@snath-xoc I think you're now "it" in this game of tag.
What do you think about the changes?

@snath-xoc
Contributor

Thank you @dschult, LGTM as well. Shall we mark as ready to merge, @adrinjalali?

@dschult
Contributor Author

dschult commented Oct 1, 2025

We've got two approvals with reviewer read access.
Can someone with write access take a look? @adrinjalali ?

@@ -0,0 +1,4 @@
- User written kernel results are now normalized in
:class:`semi-supervized._label_propagation.LabelPropagation`
Member

Suggested change
:class:`semi-supervized._label_propagation.LabelPropagation`
:class:`~sklearn.semi_supervized.LabelPropagation`



@pytest.mark.parametrize("constructor", CONSTRUCTOR_TYPES)
@pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[:2])
Member

this ESTIMATORS[:2] is brittle. We might change that list in the future, and I'm not sure why not all of them are tested here.

Contributor

I think it's testing LabelPropagation instances only. Maybe creating a new constant:

LP_ESTIMATORS = [est for est in ESTIMATORS if isinstance(est, LabelPropagation)]

@dschult dschult force-pushed the sparse_normalizer branch from 7c33f42 to bed1545 Compare October 2, 2025 19:29
@dschult
Contributor Author

dschult commented Oct 2, 2025

Thanks @adrinjalali!
I've implemented those two changes.

@@ -0,0 +1,4 @@
- User written kernel results are now normalized in
:class:`semi-supervized.LabelPropagation`
Contributor

Suggested change
:class:`semi-supervized.LabelPropagation`
:class:`semi_supervised.LabelPropagation`

Contributor

@antoinebaker antoinebaker Oct 3, 2025

[I think so, you can check if it renders properly in rendered docs / what's new / link for LabelPropagation should redirect to the API doc]

@antoinebaker
Contributor

I think test_label_propagation_build_graph_normalized is now skipped:

pytest -v sklearn/semi_supervised  -k test_label_propagation_build_graph_normalized
test_label_propagation_build_graph_normalized[NOTSET-array] SKIPPED (got empty parameter set for (Estimator, parame...)

That would explain the drop in coverage.

Comment on lines 39 to 42
LP_ESTIMATORS = [
    est for est in ESTIMATORS if isinstance(est, label_propagation.LabelPropagation)
]

Contributor

Suggested change
LP_ESTIMATORS = [
    est for est in ESTIMATORS if isinstance(est, label_propagation.LabelPropagation)
]
LP_ESTIMATORS = [
    (klass, params)
    for (klass, params) in ESTIMATORS
    if klass == label_propagation.LabelPropagation
]

Contributor

My bad :) it wasn't the proper filter.

@dschult
Contributor Author

dschult commented Oct 6, 2025

Thanks @antoinebaker !
Looks like the docs now render the link correctly.

@adrinjalali adrinjalali merged commit 0a96fcb into scikit-learn:main Oct 7, 2025
36 checks passed
@dschult dschult deleted the sparse_normalizer branch October 9, 2025 02:45
Tunahanyrd pushed a commit to Tunahanyrd/scikit-learn that referenced this pull request Oct 28, 2025