TST Extend tests for scipy.sparse.*array in sklearn/tests/test_random_projection.py
#27314
Conversation
The CI is failing because of my attempt to fix the independent function call like this. I explain this to myself as the CI testing this file with an older scikit-learn version that depends on a SciPy version older than 1.8 (where sparse arrays were introduced). Please let me know if I should call this function differently.
```python
data, data_csr = make_sparse_random_data(
    sp.coo_array, n_samples, n_features, n_nonzeros
)
```
Let's remove this data generation. It means that we need to modify the following tests:
- test_try_to_transform_before_fit
- test_SparseRandomProj_output_representation
- test_correct_RandomProjection_dimensions_embedding
- test_random_projection_feature_names_out

You can call make_sparse_random_data in the test itself and we will parametrize outside. For the first test (i.e., test_try_to_transform_before_fit), we don't have the parametrization. I think we can also avoid the parametrization on feature_names_out.
Yes, now I see that the generated data is used in some tests, which I had missed before.
I have implemented it as you hinted, @glemaitre, not parametrizing the first and the last test, but I couldn't figure out why testing the sparse matrix on them is unnecessary.
glemaitre
left a comment
Otherwise, the rest is fine.
@KartikeyBartwal This is already a pull request and not an issue. Please refer to the original issue #27090 for further information and check the remaining work to be done.
Co-authored-by: Guillaume Lemaitre <[email protected]>
Our CI always tests the development code. The oldest supported versions of our dependencies are defined in the project's minimum-dependencies configuration.
```python
# Make some random data with uniformly located non zero entries with
# Gaussian distributed values
def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=0):
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
```
This is not a test function: it does not start with test_, so pytest.mark.* annotations do nothing on this kind of helper function.
You have to instead parametrize the test functions that call make_sparse_random_data and pass the coo_container as an argument to this helper.
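A minimal sketch of that pattern (the names COO_CONTAINERS, the helper signature, and the shapes below are illustrative assumptions, not the actual file contents):

```python
import numpy as np
import pytest
import scipy.sparse as sp

# Containers to parametrize over; sp.coo_array could be appended on SciPy >= 1.8.
COO_CONTAINERS = [sp.coo_matrix]


def make_sparse_random_data(coo_container, n_samples, n_features, n_nonzeros,
                            random_state=0):
    # Plain helper, not a test: pytest.mark.* decorators would be ignored here.
    rng = np.random.RandomState(random_state)
    rows = rng.randint(n_samples, size=n_nonzeros)
    cols = rng.randint(n_features, size=n_nonzeros)
    vals = rng.normal(size=n_nonzeros)
    data_coo = coo_container((vals, (rows, cols)), shape=(n_samples, n_features))
    return data_coo.toarray(), data_coo.tocsr()


# The parametrization lives on the test function, which passes the
# container down to the helper.
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
def test_make_sparse_random_data_shapes(coo_container):
    dense, data_csr = make_sparse_random_data(coo_container, 10, 100, 50)
    assert dense.shape == (10, 100)
    assert sp.issparse(data_csr)
```

This way each container in COO_CONTAINERS produces its own test case, while the helper itself stays decorator-free.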
I see. Thanks for the hint.
I believe I made a mistake in test_SparseRandomProj_output_representation before, which I have now fixed. This test makes data using the datatypes from the coo_container and then converts the data into the datatypes from the csr_container to also do some checks.
```python
@pytest.mark.parametrize("random_projection_cls", all_RandomProjection)
def test_random_projection_feature_names_out(random_projection_cls):
    data, _ = make_sparse_random_data(sp.coo_array, n_samples, n_features, n_nonzeros)
```
In particular this call needs to be updated.
I did parametrize this test now. In this test, data is made with both datatypes from the COO containers, fitted, and then the _feature_names_out are compared to the expected feature names.
Though, @glemaitre was suggesting maybe not to parametrize here; did I get this right? If so, should I put make_sparse_random_data(sp.coo_matrix, n_samples, n_features, n_nonzeros) instead?
```python
assert johnson_lindenstrauss_min_dim(100, eps=1e-5) == 368416070986

@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
```
Indeed, I was thinking not to parametrize, since the test is not really intended for that. But since not all the CIs support sparse arrays, we cannot use them by default, so it is best to parametrize as you did. Sorry for the confusion.
No worries, I'm grateful for your support.
```python
def test_try_to_transform_before_fit():
    data, _ = make_sparse_random_data(sp.coo_matrix, n_samples, n_features, n_nonzeros)
```
Here, this is a bit the same as for get_feature_names_out: we should probably parametrize even if it is not really needed.
Got it. There will be a time when sp.coo_matrix won't be supported by all the dependencies in the CI tests, even though sparse matrices are currently still supported by all of them.
```python
data, _ = make_sparse_random_data(coo_container, 5, n_features, int(n_features / 4))

for RandomProjection in all_RandomProjection:
    rp_dense = RandomProjection(n_components=3, random_state=1).fit(data)
```
This is weird here since data is not dense. I would expect instead:

```python
rp_dense = RandomProjection(n_components=3, random_state=1).fit(data.toarray())
rp_sparse = RandomProjection(n_components=3, random_state=1).fit(data)
```

This would be in line with what the names of the variables mean. If this is the case, then we don't need to have both CSR and COO but only one of them.
data here is dense.
This is because make_sparse_random_data returns the data in two types (dense, and sparse in CSR format). And of all the 9 tests where it is used, only test_inverse_transform uses the sparse data option, too.
This part in make_sparse_random_data:

```python
data_coo = sp.coo_matrix((a, (b, c)), shape=(n_samples, n_features))
return data_coo.toarray(), data_coo.tocsr()
```

I wonder why it turns a coo_matrix into a csr_matrix. Is there a reason for this? If not, can I simplify the function?
I'd also like to rename data, _ = make_sparse_random_data(...) into something like what was used in test_inverse_transform:

```python
X_dense, X_csr = make_sparse_random_data(...)
```

It's clearer to read. What do you think, @glemaitre?
Indeed, the name is a bit ambiguous for this function.

> I wonder why it turns a coo_matrix into a csr_matrix. Is there a reason for this? If not, can I simplify the function?

Most of the algorithms in scikit-learn will use CSR matrices and will convert COO -> CSR. So returning a CSR matrix avoids this conversion.

> X_dense, X_csr = make_sparse_random_data(...)

Indeed, this is already better because it informs the reader that the function returns both dense and sparse formats.
Another option (but I don't know how much refactoring it requires) is to always return only a single type. Instead of coo_container, we could have a sparse_format: if None, we return a dense matrix; otherwise we pass the type of container that we desire. A bit similar to the changes in #27438.
However, if it changes many lines, then your original proposal is already good enough, @StefanieSenger.
> Another option (but I don't know how much refactoring it requires) is to always return only a single type. Instead of coo_container, we can have a sparse_format: if None, we return a dense matrix, otherwise we pass the type of container that we desire. A bit similar to the changes in #27438.

@glemaitre
I will push a proposal in a minute, but it doesn't get rid of coo_container as a parameter. I have tried to substitute coo_container with sparse_format, but now I believe that we have to keep coo_container, because we need to parametrize those datatypes in all of the tests that use make_sparse_random_data. Please let me know if this makes sense.
At least, the function now is much more explicit about what type it returns.
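For illustration, a self-contained sketch of that sparse_format idea; the signature, names, and shapes here are assumptions, not the merged code:

```python
import numpy as np
import scipy.sparse as sp


def make_sparse_random_data(coo_container, n_samples, n_features, n_nonzeros,
                            random_state=0, sparse_format="csr"):
    # Random data with n_nonzeros Gaussian-distributed entries. If
    # sparse_format is None a dense ndarray is returned; otherwise the COO
    # data is converted to the requested sparse format (e.g. "csr").
    rng = np.random.RandomState(random_state)
    data_coo = coo_container(
        (rng.normal(size=n_nonzeros),
         (rng.randint(n_samples, size=n_nonzeros),
          rng.randint(n_features, size=n_nonzeros))),
        shape=(n_samples, n_features),
    )
    if sparse_format is None:
        return data_coo.toarray()
    return data_coo.asformat(sparse_format)


# Same seed in both calls, so both return the same underlying data.
X_dense = make_sparse_random_data(sp.coo_matrix, 10, 100, 50, sparse_format=None)
X_csr = make_sparse_random_data(sp.coo_matrix, 10, 100, 50, sparse_format="csr")
```

The caller still decides the container via coo_container, so the tests can keep parametrizing over COO containers while choosing the returned format explicitly.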
Ah, and there's a make_sparse_random_data function in benchmarks/bench_random_projections.py that is exactly the same. Should that one be changed the same way?
> Ah, and there's a function def make_sparse_random_data that is exactly equal in benchmarks/bench_random_projections.py. Should this one be changed the same way?
You can leave the benchmark as it is.
glemaitre
left a comment
Thanks. I have only some last changes following your improvements. After that, it looks fine to me.
```python
    n_samples,
    n_features,
    n_nonzeros,
    random_state=0,
```
This is also unusual in our tests:

```diff
-    random_state=0,
+    random_state=None,
```

Could you then pass random_state=0 in the call from each test function?
I've tried to use global_random_seed for this, but this led to failing tests for test_random_projection_embedding_quality. Does this mean this test is sensitive to the random seed it had used before?
```diff
 assert isinstance(rp.transform(data), np.ndarray)

-sparse_data = sp.csr_matrix(data)
+sparse_data = csr_container(data)
```
Instead of calling csr_container here, could you create sparse_data at the beginning by calling make_sparse_random_data twice: once with sparse_format=None and once with sparse_format="csr"? We don't need to parametrize for both COO and CSR containers.
Using the same random_state=0 in both calls should lead to the same data.
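The reproducibility claim can be sketched with a stand-in generator (make_data below is hypothetical; it only mirrors the helper's seeding behaviour):

```python
import numpy as np
import scipy.sparse as sp


def make_data(sparse_format=None, random_state=0):
    # Stand-in for make_sparse_random_data: a RandomState seeded with the
    # same value replays the exact same draws on every call.
    rng = np.random.RandomState(random_state)
    dense = rng.normal(size=(5, 8))
    if sparse_format is None:
        return dense
    return sp.csr_matrix(dense)


data = make_data(sparse_format=None, random_state=0)
sparse_data = make_data(sparse_format="csr", random_state=0)
```

Because both calls seed identically, sparse_data.toarray() matches data, so the dense and sparse variants exercise the estimator on the same values.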
Thanks for the suggestion, I did it. I have used global_random_seed here, because I believe that assert isinstance(...) or sp.issparse would pass even if the data was different?
Should I change that to use random_state=0?
Yes, let's keep the random_state at 0. We can revisit this test in a later PR to see how flaky it is by changing the random_state.
Thank you, @glemaitre, I have changed these two data generations to use random_state=0.
```diff
 rp_dense = RandomProjection(n_components=3, random_state=1).fit(data)
 rp_sparse = RandomProjection(n_components=3, random_state=1).fit(
-    sp.csr_matrix(data)
+    csr_container(data)
 )
```
Here, I would as well call make_sparse_random_data twice and create the same data and data_sparse from the beginning. No need for the product of all possible arrays/matrices.
Good that you found that, thanks for the hint. Fewer test cases again. :)
glemaitre
left a comment
LGTM. Thanks @StefanieSenger
jjerphan
left a comment
LGTM!
Thank you, @StefanieSenger!
Just minor nitpicks for getting to self-documenting code (this modifies the state of the file from before your contribution).
Co-authored-by: Julien Jerphanion <[email protected]>
Thank you, @jjerphan, it looks much cleaner now.
Hi Stefanie, one last thing: could you add an entry for this in the 1.4 changelog (scikit-learn/doc/whats_new/v1.4.rst, lines 132 to 182 at 8db3aac)? :)
Hey @jjerphan, just did it. Thanks for your support. :)
Reference Issues/PRs
Towards #27090.
What does this implement/fix? Explain your changes.
This PR substitutes scipy sparse matrices with the scipy containers introduced in #27095 in the sklearn/tests/test_random_projection.py test file.

Comment

It was a bit tricky, because one of the parametrized functions was re-used outside of the test functions (see below). I am not sure why, since it doesn't seem to be re-used afterwards. I came up with a solution.
```python
n_samples, n_features = (10, 1000)
n_nonzeros = int(n_samples * n_features / 100.0)
data, data_csr = make_sparse_random_data(n_samples, n_features, n_nonzeros)
```