issue-11643: Fix reify to handle sparse arrays and other objects without __len__ #12103
base: main
Conversation
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
9 files ±0  9 suites ±0  3h 15m 55s ⏱️ +4m 49s
Results for commit 24e0a92. ± Comparison against base commit b7ba831.
♻️ This comment has been updated with latest results.
Given that our CI environment has scipy installed, could you add some tests that use actual sparse arrays, rather than mocking everything?
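A minimal sketch of what such a test could look like, assuming scipy is available in CI; the test name and the "only assert it doesn't raise" approach are my own guesses, not the PR's actual tests:

import numpy as np
import pytest

import dask.bag as db
from dask import delayed

scipy_sparse = pytest.importorskip("scipy.sparse")


def test_fold_from_delayed_scipy_sparse():
    # Real sparse partitions instead of mocks; csr_array lacks __len__, which is
    # what the linked issue is about. Only asserts that compute() no longer raises.
    part_a = delayed(scipy_sparse.csr_array)(np.random.random((10, 10)))
    part_b = delayed(scipy_sparse.csr_array)(np.random.random((10, 10)))
    result = db.from_delayed([part_a, part_b]).fold(lambda x, y: x + y).compute()
    assert result is not None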
This seems fine to me, but I'd appreciate an additional review from another maintainer before merging this.
cc @TomAugspurger @jrbourbeau @quasiben if you're around
The latest test failures are unrelated to this PR, btw.
dask/utils.py
Outdated
# Sparse-like objects
if hasattr(obj, "nnz"):
    return obj.nnz == 0
if hasattr(obj, "shape"):
    return 0 in obj.shape
I dislike that these can fail with an exception. We don't know that an object's .nnz is comparable to an int just because the attribute exists, and we don't know that obj.shape is a sequence. I'd feel better if these also caught exceptions (probably broad ones) so that we fall through to the fallback.
Makes sense. Fixed in commit b3d818d.
I'm always looking for ways to reduce the amount of guessing / duck typing we do, but I'm not sure if there's a way around that here. Just one comment about a couple of checks in is_empty that might fail. Otherwise I think this is fine.
if hasattr(obj, "nnz"):
    try:
        return obj.nnz == 0
    except Exception:
        pass
I wonder if contextlib.suppress would be a more readable solution for these? Or is that too magic? @TomAugspurger
Suggested change:

if hasattr(obj, "nnz"):
    try:
        return obj.nnz == 0
    except Exception:
        pass

becomes:

if hasattr(obj, "nnz"):
    with contextlib.suppress(Exception):
        return obj.nnz == 0
+1, this does seem cleaner since we're only suppressing the exception vs. trying to recover from it.
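Putting the thread's snippets together, a minimal sketch of how the checks in is_empty could read with contextlib.suppress; the non-sparse fallback at the end is my assumption, not necessarily what the PR ships:

import contextlib


def is_empty(obj):
    # Sparse-like objects: nnz counts stored entries, so zero means empty
    if hasattr(obj, "nnz"):
        with contextlib.suppress(Exception):
            return obj.nnz == 0
    # Array-like objects: any zero-length dimension means empty
    if hasattr(obj, "shape"):
        with contextlib.suppress(Exception):
            return 0 in obj.shape
    # Fallback for ordinary containers (assumption: mirrors a plain len() check)
    try:
        return len(obj) == 0
    except TypeError:
        return False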
Tests for the bags module pass.
Tests for the utils module pass.
pre-commit run --all-files passes.

How did I verify that the fix works:
I recreated the test mentioned in the issue above; here's the code:
if __name__ == "__main__":
    import dask.bag as db
    import numpy as np
    from dask import delayed
    from scipy.sparse import csr_array

    def add(x, y):
        return x + y

    @delayed
    def create_sparse_array_delayed():
        return csr_array(np.random.random((10, 10)))

    @delayed
    def create_array_delayed():
        return np.random.random((10, 10))

    # works with sparse arrays when created from a sequence
    db.from_sequence(
        [csr_array(np.random.random((10, 10))), csr_array(np.random.random((10, 10)))]
    ).fold(add).compute()

    # works with numpy arrays
    db.from_delayed([create_array_delayed(), create_array_delayed()]).fold(add).compute()

    # previously raised; now prints the folded sparse result
    print(
        db.from_delayed(
            [create_sparse_array_delayed(), create_sparse_array_delayed()]
        ).fold(add).compute()
    )

This now returns the result instead of raising the error reported in the issue.
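For completeness, one way to sanity-check a folded sparse result against a plain NumPy sum; this check is my own addition, not part of the PR, and it uses the from_sequence path where each sparse array is a single bag item:

import numpy as np

import dask.bag as db
from scipy.sparse import csr_array


def add(x, y):
    return x + y


# Fixed dense inputs so the expected sum is known exactly
dense_a = np.random.random((10, 10))
dense_b = np.random.random((10, 10))

# Fold the sparse versions through a bag and compare with the dense sum
folded = db.from_sequence([csr_array(dense_a), csr_array(dense_b)]).fold(add).compute()
assert np.allclose(folded.toarray(), dense_a + dense_b)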