FIX normalization in semi_supervised label_propagation #31924
Conversation
@snath-xoc @antoinebaker could you have a look here please?

I can take a look, thank you for the ping
@pytest.mark.parametrize("constructor", CONSTRUCTOR_TYPES)
@pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[1:2])
Is there a reason this is tested only for ESTIMATORS[1:2]? I tried it with the LabelSpreading estimators as well and it fails; something to perhaps investigate further (and for now mark as XFAIL), unless there are any insights as to why it's expected to fail?
The first and third ESTIMATORS use the rbf method, which creates a dense affinity_matrix.
The fourth and following ESTIMATORS use the LabelSpreading class, which constructs a laplacian_matrix instead of an affinity_matrix, so the normalization is different.
It might be cleaner to inline the Estimator and parameters here instead of fixtures, since we are only testing one case. I agree that it looks strange to use only one of the list of Estimators, but that's all we want to test.
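To illustrate why the LabelSpreading cases fail this particular row-sum check, here is a toy sketch. The matrix W below is made up, and the D^-1/2 W D^-1/2 form is only shorthand for what LabelSpreading builds via the normalized graph laplacian:

```python
import numpy as np

# LabelPropagation row-normalizes the affinity matrix, so rows sum to 1.
# LabelSpreading uses a symmetrically normalized laplacian-style graph,
# whose rows do not in general sum to 1.
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
d = W.sum(axis=1)  # degrees

row_normed = W / d[:, np.newaxis]            # LabelPropagation-style graph
assert np.allclose(row_normed.sum(axis=1), 1.0)

D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
lap_normed = D_inv_sqrt @ W @ D_inv_sqrt     # LabelSpreading-style graph
assert not np.allclose(lap_normed.sum(axis=1), 1.0)
```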
Upon reflection, I think it is worthwhile to test both dense and sparse cases for LabelPropagation._build_graph. So I've included the suggestion to use ESTIMATORS[:2].
I haven't added any tests for LabelSpreading._build_graph because the rows are not supposed to sum to 1 there. [The result there is not normalized beyond the normalization done while computing the laplacian from the affinity matrix].
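As a sanity sketch of what the dense/sparse agreement test amounts to (the matrix A here is made up, asymmetric, with unequal row sums; the sparse path uses a diagonal-matrix multiply as one way to row-normalize):

```python
import numpy as np
from scipy import sparse

# Asymmetric affinity matrix with unequal row sums, as a user-written
# kernel might produce.
A = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [5.0, 1.0, 0.0]])

# Dense path: divide each row by its own sum.
dense_normed = A / A.sum(axis=1)[:, np.newaxis]

# Sparse path: multiply by a diagonal matrix of inverse row sums.
A_sp = sparse.csr_matrix(A)
normalizer = np.ravel(A_sp.sum(axis=1))
sparse_normed = (sparse.diags(1.0 / normalizer) @ A_sp).toarray()

assert np.allclose(dense_normed, sparse_normed)
assert np.allclose(dense_normed.sum(axis=1), 1.0)
```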
Great, thank you very much @dschult. I think we now need to add a changelog entry, i.e. describe the fix under the PR number (you will see some examples under doc/whats_new). Would you be able to add something? I can also help initially if you're unsure.
Thanks @snath-xoc

Hey @dschult, looks good to me. We need one more reviewer now, perhaps @antoinebaker?

Just a gentle bump here.
antoinebaker
left a comment
Thanks @dschult for the PR!
It does seem that affinity_matrix was not properly row-normalized in the old code. Here are a few suggestions and comments.
Resolved (outdated) review thread on doc/whats_new/upcoming_changes/sklearn.semi_supervised/31924.fix.rst
  normalizer = affinity_matrix.sum(axis=1)
  if sparse.issparse(affinity_matrix):
-     affinity_matrix.data /= np.diag(np.array(normalizer))
- else:
+     # handle spmatrix (make 1D)
+     if sparse.isspmatrix(affinity_matrix):
+         normalizer = np.ravel(normalizer)
+     # common case: knn method gives row sum k for all rows but
+     # a user-kernel may have varied row sums so "else" handles that
+     if np.all(normalizer == normalizer[0]):
+         affinity_matrix.data /= normalizer[0]
+     else:
+         if affinity_matrix.format == "csr":
+             repeats = np.diff(affinity_matrix.indptr)
+             rows = np.repeat(np.arange(affinity_matrix.shape[0]), repeats)
+         else:  # CSC format
+             rows = affinity_matrix.indices
+         affinity_matrix.data /= normalizer[rows]
+ else:  # Dense affinity_matrix
I wonder if we could use sparse multiplication instead. The code is much easier to read, and I believe it's pretty efficient for row normalization (to be confirmed). Have you tried comparing sparse multiplication with your alternatives for the csr/csc formats?
normalizer = affinity_matrix.sum(axis=1)
# handle spmatrix (make 1D)
if sparse.isspmatrix(affinity_matrix):
    normalizer = np.ravel(normalizer)
if sparse.issparse(affinity_matrix):
    inv_normalizer = sparse.diags(1.0 / normalizer)
    affinity_matrix = inv_normalizer @ affinity_matrix
else:
    affinity_matrix /= normalizer[:, np.newaxis]
The matmul-by-diagonal approach is typically slower than multiplying the guts of the sparse representation. For small graphs, matmul takes about 10 times longer, but it does better with larger matrices, leveling off at about 1.8 times longer for CSR format and 4 times longer for CSC format. That pattern holds at least until my memory gets full. But the absolute times aren't long (a second or so at most), so readability can take precedence. I'll do whatever seems best. (Timing details below.)
Timing results using %timeit (mean per loop over 7 runs)

DENSITY = 1.0

shape             format   guts      matmul
(3, 3)            csr      8.35 μs   93.7 μs
(3, 3)            csc      6.44 μs   111 μs
(2500, 2500)      csr      15.6 ms   26.2 ms
(2500, 2500)      csc      15.8 ms   66.3 ms
(5000, 5000)      csr      61.5 ms   109 ms
(5000, 5000)      csc      70 ms     317 ms
(10000, 10000)    csr      297 ms    453 ms
(10000, 10000)    csc      281 ms    1.3 s

DENSITY = 0.01

shape             nnz       format   guts      matmul
(2500, 2500)      62500     csr      178 μs    389 μs
(2500, 2500)      62500     csc      139 μs    536 μs
(5000, 5000)      250000    csr      728 μs    1.26 ms
(5000, 5000)      250000    csc      607 μs    2.06 ms
(10000, 10000)    1000000   csr      2.68 ms   4.87 ms
(10000, 10000)    1000000   csc      2.61 ms   8.55 ms
timing code

import numpy as np
rng = np.random.default_rng()
import scipy as sp
import sklearn

affinities = [
    np.array([[1.0, 1.0, 0.0], [2.0, 1.0, 1.0], [0.0, 1.0, 3.0]]),
    # rng.random((500, 500)),
    rng.random((2500, 2500)),
    rng.random((5000, 5000)),
    rng.random((10_000, 10_000)),
    sp.sparse.random_array((2500, 2500), density=0.01, rng=rng),
    sp.sparse.random_array((5000, 5000), density=0.01, rng=rng),
    sp.sparse.random_array((10000, 10000), density=0.01, rng=rng),
    # rng.random((20_000, 20_000)),
    # rng.random((40_000, 40_000)),
]

def setup(affinity_matrix):
    affinity_csr = sp.sparse.csr_array(affinity_matrix)
    affinity_csc = sp.sparse.csc_array(affinity_matrix)
    return affinity_csr, affinity_csc

def guts_normalization(affinity_matrix):
    normalizer = affinity_matrix.sum(axis=1)
    if affinity_matrix.format == "csr":
        repeats = np.diff(affinity_matrix.indptr)
        rows = np.repeat(np.arange(affinity_matrix.shape[0]), repeats)
    else:  # CSC format
        rows = affinity_matrix.indices
    affinity_matrix.data /= normalizer[rows]
    return affinity_matrix

def matmul_normalization(affinity_matrix):
    normalizer = affinity_matrix.sum(axis=1)
    inv_normalizer = sp.sparse.diags(1.0 / normalizer)
    affinity_matrix = inv_normalizer @ affinity_matrix
    return affinity_matrix

# To run the timing (in IPython, since %timeit is a line magic):
for affinity in affinities:
    for affinity_matrix in setup(affinity):
        guts = guts_normalization(affinity_matrix)
        matmul = matmul_normalization(affinity_matrix)
        np.testing.assert_allclose(guts.toarray(), matmul.toarray())
        print("\nTiming guts then matmul for ", end="")
        print(f"shape: {affinity_matrix.shape} ", end="")
        print(f"format: {affinity_matrix.format!r} ", end="")
        print(f"nnz: {affinity_matrix.nnz}")
        %timeit guts_normalization(affinity_matrix)
        %timeit matmul_normalization(affinity_matrix)
Thanks for the timing analysis!
How does the overhead compare to the overall fit time?
As _build_graph is only called once, maybe the overhead is negligible compared to the many iterations in fit.
Using a 2500x2500 affinity matrix, I time fit at 477 ms,
while building the graph takes 27 ms for csr.
Another way to put it into perspective: self.X_.tocsr() takes 36 ms.
Since timing isn't critical here, in the long run we can just use
affinity_matrix /= normalizer[:, np.newaxis] for both dense and sparse cases.
But the minimum SciPy dependency has to be v1.12 before we can do that.
So I changed this to use the matrix multiply for sparse, with a comment to update when SciPy 1.12+ is required.
I think this should be ready to go.
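As background on the spmatrix special-casing mentioned above, `.sum(axis=1)` returns different shapes for sparse matrices and sparse arrays, which is why the code ravels the normalizer. A quick illustration (the 3x3 matrix is arbitrary):

```python
import numpy as np
from scipy import sparse

dense = np.array([[1.0, 1.0, 0.0],
                  [2.0, 1.0, 1.0],
                  [0.0, 1.0, 3.0]])

# spmatrix .sum(axis=1) returns a 2D np.matrix; sparse array returns a 1D
# ndarray. np.ravel makes both usable as a 1D normalizer.
m_sum = sparse.csr_matrix(dense).sum(axis=1)
a_sum = sparse.csr_array(dense).sum(axis=1)

assert m_sum.shape == (3, 1)
assert a_sum.shape == (3,)
assert np.allclose(np.ravel(m_sum), a_sum)
```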
Force-pushed c9a4a88 to ebf27ab
antoinebaker
left a comment
Thanks again for the PR @dschult. A couple of nitpicks, otherwise LGTM!
# handle spmatrix (make normalizer 1D)
if sparse.isspmatrix(affinity_matrix):
    normalizer = np.ravel(normalizer)
# Todo: when SciPy 1.12+ is min dependence, replace up to ---- with:
Suggested change:
- # Todo: when SciPy 1.12+ is min dependence, replace up to ---- with:
+ # TODO: when SciPy 1.12+ is min dependence, replace up to ---- with:
clf = Estimator(kernel=kernel_affinity_matrix).fit(X, labels)
graph = clf._build_graph()
assert_allclose(graph.sum(axis=1), 1)  # normalized
Suggested change:
- assert_allclose(graph.sum(axis=1), 1)  # normalized
+ assert_allclose(graph.sum(axis=1), 1)  # normalized rows
Thanks @antoinebaker! I picked the nits. @snath-xoc I think you're now "it" in this game of tag.
Thank you @dschult, LGTM as well. Shall we mark as ready to merge, @adrinjalali?
We've got two approvals with reviewer read access. |
@@ -0,0 +1,4 @@
- User written kernel results are now normalized in
  :class:`semi-supervized._label_propagation.LabelPropagation`
Suggested change:
- :class:`semi-supervized._label_propagation.LabelPropagation`
+ :class:`~sklearn.semi_supervized.LabelPropagation`
@pytest.mark.parametrize("constructor", CONSTRUCTOR_TYPES)
@pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[:2])
this ESTIMATORS[:2] is brittle. We might change that list in the future, and I'm not sure why not all of them are tested here.
I think it's testing LabelPropagation instances only. Maybe creating a new constant:

LP_ESTIMATORS = [est for est in ESTIMATORS if isinstance(est, LabelPropagation)]

Co-authored-by: antoinebaker <[email protected]>
Force-pushed 7c33f42 to bed1545
Thanks @adrinjalali!
@@ -0,0 +1,4 @@
- User written kernel results are now normalized in
  :class:`semi-supervized.LabelPropagation`
Suggested change:
- :class:`semi-supervized.LabelPropagation`
+ :class:`semi_supervised.LabelPropagation`
[I think so, you can check if it renders properly in rendered docs / what's new / link for LabelPropagation should redirect to the API doc]
I think

pytest -v sklearn/semi_supervised -k test_label_propagation_build_graph_normalized

gives

test_label_propagation_build_graph_normalized[NOTSET-array] SKIPPED (got empty parameter set for (Estimator, parame...)

That would explain the drop in coverage.
LP_ESTIMATORS = [
    est for est in ESTIMATORS if isinstance(est, label_propagation.LabelPropagation)
]
Suggested change:
- LP_ESTIMATORS = [
-     est for est in ESTIMATORS if isinstance(est, label_propagation.LabelPropagation)
- ]
+ LP_ESTIMATORS = [
+     (klass, params)
+     for (klass, params) in ESTIMATORS
+     if klass == label_propagation.LabelPropagation
+ ]
My bad :) it wasn't the proper filter.
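For context, here is a minimal sketch of why the first filter produced an empty parameter set. The ESTIMATORS list below is a made-up stand-in mirroring the (class, params) tuples in the test module, not the real one:

```python
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

# Hypothetical stand-in for the test module's ESTIMATORS constant:
# tuples of (estimator class, constructor parameters).
ESTIMATORS = [
    (LabelPropagation, {"kernel": "rbf"}),
    (LabelPropagation, {"kernel": "knn", "n_neighbors": 2}),
    (LabelSpreading, {"kernel": "rbf"}),
]

# The first filter checks isinstance on the tuples themselves, so it
# matches nothing and pytest reports "got empty parameter set".
wrong = [est for est in ESTIMATORS if isinstance(est, LabelPropagation)]
assert wrong == []

# Filtering on the class element keeps the intended entries.
LP_ESTIMATORS = [
    (klass, params) for (klass, params) in ESTIMATORS if klass is LabelPropagation
]
assert len(LP_ESTIMATORS) == 2
```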
Resolved (outdated) review thread on doc/whats_new/upcoming_changes/sklearn.semi_supervised/31924.fix.rst
…ix.rst Co-authored-by: antoinebaker <[email protected]>
Thanks @antoinebaker!
…31924) Co-authored-by: antoinebaker <[email protected]>
Fixes #31872: strange normalization in semi-supervised label propagation
The trouble briefly:

- The normalization in semi_supervised should use axis=1 (row sums). Using the wrong axis does not cause errors so long as we have a symmetric affinity_matrix.
- The dense case arises for kernel "rbf", which provides symmetric matrices. But if someone provides their own kernel, the normalization could be incorrect.
- The sparse case arises for kernel "knn", which has all rows sum to k. But if someone provides their own kernel, the normalization could be incorrect.

This PR adds tests of proper normalization that agrees between sparse and dense.
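A minimal sketch of this failure mode (the matrix A is made up; it stands in for a user-written kernel result):

```python
import numpy as np

# A hypothetical user-supplied kernel result: asymmetric, unequal row sums.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 1.0]])

# Correct: normalize each row by its own sum (axis=1).
right_normed = A / A.sum(axis=1)[:, np.newaxis]
assert np.allclose(right_normed.sum(axis=1), 1.0)

# Wrong: using column sums (axis=0) as the normalizer breaks row sums.
wrong_normed = A / A.sum(axis=0)[:, np.newaxis]
assert not np.allclose(wrong_normed.sum(axis=1), 1.0)

# For a symmetric matrix both axis sums coincide, which hides the bug.
S = (A + A.T) / 2
assert np.allclose(S.sum(axis=0), S.sum(axis=1))
assert np.allclose((S / S.sum(axis=0)[:, np.newaxis]).sum(axis=1), 1.0)
```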
It also adjusts the code so it works with either sparse arrays or sparse matrices.
The tests check that normalization agrees between the dense and sparse cases even if the affinity_matrix is not symmetric and does not have equal row sums. The errors corrected here do not arise for users who use the built-in sklearn kernel options.
I discovered this while working on making sure sparse arrays and sparse matrices result in the same values (#31177). This PR splits the fix out of that PR because it corrects/changes the current code and adds a test; separating it from the large number of changes in the other PR is prudent and eases review.