-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
TST Extend tests for scipy.sparse.*array in sklearn/cluster/tests/test_optics.py
#27104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
scipy.sparse.*array in sklearn/cluster/tests/test_optics.py
|
@marenwestermann do you need help with this? Are the above information enough for you to investigate? Otherwise I can try to dig further to give more specific guidance. |
|
One thing that I could see is that our |
|
Thanks @ogrisel for asking. I actually do need help. I applied the suggested change, i.e. and this resulted in the following error: DetailsI then tried the following: and equally and this fixed the index errors but resulted in the following error: DetailsI noticed that the shape of So this is where I currently don't know how to proceed. I had a look at @glemaitre 's last comment but from my understanding that bug doesn't cause the problem described here. I pulled the recent changes from upstream main but the problem described here persists. |
|
PS: I finally have time again to be active on scikit-learn and I'm excited to be back. 🎉 |
|
Looking closer to the code, we need to change the following: dists = X[point_index, unproc]
if issparse(dists):
dists.sort_indices()
dists = dists.datato dists = X[[point_index], unproc]
if isinstance(dists, np.matrix):
dists = dists.A1So:
I tried locally and the tests are passing. |
|
I see, that makes sense. Thanks a lot for the detailed explanation @glemaitre. I applied the suggested changes. |
glemaitre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments to optimize the number of tests to the minimum.
|
Good catch, I wouldn't have seen this myself. I had a careful look through the changes and tested them and everything looks good. |
glemaitre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @marenwestermann LGTM.
ogrisel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR fixes the code of the scikit-learn library (not just the test suite) and therefore needs a dedicated entry in sklearn/doc/whats_new/v1.4.rst to document the fact that sklearn.cluster.OPTICS now accepts scipy sparse arrays as inputs.
Take example on the log entry for NMF as a reference to write this entry for the changelog.
Other than that here is a suggestion for further consistency in the code (even if it does not change the behavior seen by the tests):
|
I don't think the CI failures are due to my changes (unless it's a problem that this PR which is mentioned in the changelog isn't merged yet). |
|
Uhm. Could you solve the conflict of the changelog. |
|
Ah, I didn't see that it was a merge conflict, I fixed it. Please note that somewhere in the restructuring process of the changelog the changelog entry for #26602 seems to have gotten lost. |
|
Looks like the test |
sklearn/cluster/_optics.py
Outdated
| dists = np.asarray(dists) | ||
| elif hasattr(dists, "toarray"): | ||
| # X is a scipy sparse array. | ||
| dists = dists.toarray() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marenwestermann do you understand why this branch of the condition is not already covered by test_precomputed_dists?
I am a bit confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ogrisel The issue is that previously dists was a sparse matrix for the test case "is_sparse"=True. Now, when we want to test for the case csr_matrix, dists becomes a numpy matrix (which is currently covered in the code) and in the case where we want to test for csr_array, dists becomes a numpy array. So we never test for the case sparse array.
Now the question is should dists be sparse when X is sparse (as it was preciously the case) or is the current implementation ok? I don't have a deep enough understanding of the OPTICS estimator to make a judgement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minimal example to understand what happens here:
In [5]: from scipy import sparse
In [6]: X = sparse.random(10, 10, format="csr")
In [7]: X[0, :]
Out[7]:
<1x10 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
In [8]: X[0, [1, 2]]
Out[8]:
<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
In [9]: X[[0], [1, 2]]
Out[9]: matrix([[0., 0.]])Since we index both dimension, we get this np.matrix that is the difference with the previous code.
For me, we don't need to cover anymore the case where we would eventually have a sparse matrix because we will get already a np.matrix or a np.array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marenwestermann do you think that you can reduce to only having np.matrix or np.array (not anymore sparse matrices) for the condition. Once done, I think that we good for merging.
|
This one was waiting for merge. Thanks @marenwestermann |
Reference Issues/PRs
#27090
What does this implement/fix? Explain your changes.
Makes use of the CSR_CONTAINERS fix to test sparse arrays.
Any other comments?
I included a
NotImplementedErrorexception because I got the following test failure:Details
Not sure if this is the best solution.