Omit chunks with no elements in slice selection with step #1154
This stops chunks being read unnecessarily when a slice selection with a step was used. Previously all chunks spanning the start-end range would be used regardless of whether they contained any elements. Fixes zarr-developers#843.
|
Here's a short demonstration of this:
import numpy as np
import zarr

def track_chunk_calls(z: zarr.Array):
    """Wrap a zarr.Array such that it prints out the details of the requested chunk."""
    f = z._chunk_getitem

    def _track(chunk_coords, chunk_sel, out, out_sel, **kwargs):
        print(f"{chunk_coords=}, {chunk_sel=}, {out_sel=}")
        return f(chunk_coords, chunk_sel, out, out_sel, **kwargs)

    z._chunk_getitem = _track

# Create a chunked array
N = 50
C = 5
z = zarr.ones((N,), chunks=(C,))

# Define a sparse slice selection
sl = slice(4, 41, 12)

# Calculate the chunks the requested elements are in
chunk_inds = np.arange(N)[sl] // C
print(f"## Chunks that must be read: {chunk_inds}")
print("## Chunks that are actually read:")
track_chunk_calls(z)
z[sl]
Prior to the fix in this PR, this gives the output:
demonstrating that it is reading all the chunks spanning the range, not just the minimal set of
After applying the fix only the minimal chunks are actually read: |
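For reference, the chunk sets involved in the demonstration can be worked out with NumPy alone. This is only an illustrative sketch: it reuses N, C and sl from the demo, and the names `needed` and `spanned` are introduced here for illustration. It assumes, following the PR description, that the pre-fix behaviour touches every chunk overlapping the start-stop range.

```python
import numpy as np

N, C = 50, 5                  # array length and chunk length, as in the demo
sl = slice(4, 41, 12)         # the sparse, strided selection from the demo

start, stop, step = sl.indices(N)

# Chunks that hold at least one selected element
selected = np.arange(N)[sl]
needed = np.unique(selected // C)

# Chunks overlapping the start-stop range when the step is ignored
spanned = np.arange(start // C, (stop - 1) // C + 1)

print(f"selected elements: {selected}")
print(f"needed chunks:     {needed}")
print(f"spanned chunks:    {spanned}")
```

For this selection, four elements are requested and each one sits in a different chunk, while the start-stop range overlaps nine chunks.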
Codecov Report
@@ Coverage Diff @@
## main #1154 +/- ##
=======================================
Coverage 99.95% 99.95%
=======================================
Files 36 36
Lines 14117 14142 +25
=======================================
+ Hits 14110 14135 +25
Misses 7 7
|
|
Thanks so much for this PR @jrs65! It looks like a simple solution with a clear improvement.
I'm not sure it's an implementation detail. Not accessing data unnecessarily is an important feature, particularly when the access might be very expensive (think cloud storage, large chunks, etc.). From my point of view, it would be great to have tests for this sort of thing. One thing we do in Xarray's test suite is to define a special kind of store which errors if you try to access any of its data. This allows us to guard against undesirable data access. We could take a similar approach here in Zarr. |
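A rough sketch of that testing idea, written in plain Python so it runs without zarr or Xarray. `AccessTrackingStore` is a hypothetical name, not a class from either library, and it records reads rather than raising on them, which is a variation on the approach described above. The loop over keys is only a stand-in for a strided array read.

```python
from collections import Counter
from collections.abc import MutableMapping


class AccessTrackingStore(MutableMapping):
    """Hypothetical dict-backed store that counts how often each key is read."""

    def __init__(self):
        self._data = {}
        self.reads = Counter()

    def __getitem__(self, key):
        self.reads[key] += 1          # record every read before serving it
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __contains__(self, key):
        # Membership checks should not be counted as data reads
        return key in self._data


store = AccessTrackingStore()
for i in range(10):
    store[str(i)] = bytes(5)          # one entry per chunk, keyed by chunk index

# Stand-in for a strided read that should only touch chunks 0, 3, 5 and 8
for key in ("0", "3", "5", "8"):
    store[key]

# The test then asserts that nothing else was read
assert set(store.reads) == {"0", "3", "5", "8"}
```

A test built this way fails as soon as an indexer reads a chunk it does not need, without depending on how the indexer is implemented internally.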
Great. That's along the lines I was thinking of anyway, so I will put together a set of unit tests along those lines, and update this PR with them. I'll add them to |
|
That sounds great! Thanks so much for volunteering to take that on. It's a lot more work than just fixing the bug! We welcome your contributions to zarr-python and will do our best to help with your PR. |
This tests that only the expected set of chunks are accessed during basic slice selection operations for both reads and writes to an array.
|
@rabernat very happy to do so! Compared to hacking around in the internals of HDF5 and h5py, the I've implemented tests for this case. In the end I used the existing Let me know what you think, and anything else that needs to be done to the PR. |
|
@jrs65: I'll second @rabernat's thanks! I've triggered the GHA actions. Let's see how the tests fare. (Also looking forward to what the asv benchmarks look like before & after. cc: @MSanKeys963) |
|
Tests are all green! @jrs65, do you want to add yourself to docs/release.rst? Otherwise, unless there are objections, I could see getting this out for a quick patch release. |
|
Agree this is a nice improvement. Thank you! Are there other indexers that would benefit from the same kind of fix? |
Sure! I'll add it down in a provisional 2.13.3 patch release, presuming you'll edit that around when you decide when to get it out. |
I've just read over the code for the other indexers. I'm unsure about Regardless, I think it would be very straightforward to extend the tests to ensure that going forward the other selection types continue to only access what they need. If I find a bit of time to spare in the next week I might give that a go and see what happens. |
|
Adding to 2.13.3 sounds good. Thanks! Also thanks for taking the time to look into other indexers. No worries if you don't have time to look further. Figured this was a good opportunity to revisit since we are already looking at performance here. Though happy to spin off into a new issue for a future date if that's preferable. |
|
Errrmm... ooops. Sorry. I typo'd my own name in the Release Notes. I'm just going to push a quick fix to that one. |
|
No worries. Thanks again Richard! Planning on merging after CI passes. |
|
Re-launched the job after an unrelated failure. I'll keep an eye on it for release. |
|
Thanks Josh! Looks like it passed. |
|
Thanks, @jakirkham. I created the GH release.
This stops chunks being read unnecessarily when a slice selection with a step was used (reported in #843).
Previously all chunks spanning the start-end range would be used regardless of whether they contained any elements. The changes in this PR are a small modification to zarr.indexing.SliceDimIndexer to stop it yielding empty selections up to zarr.Array._get_selection. Previously the chunk would be read and have no elements extracted from it; this stops it ever looking at the empty chunk.
I'm unsure whether a unit test should be added for this. It would be feasible to have a test asserting that only the minimal set of chunks is read, but that feels like it's testing an implementation detail. I'm happy to construct a test if that would be useful. Regardless, I'll add some test code in a comment that succinctly shows the issue and the fix.
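To illustrate the idea of not yielding empty selections, here is a simplified one-dimensional sketch. It is not the actual zarr.indexing.SliceDimIndexer code: the function name `iter_chunk_selections` and the helper `ceildiv` are hypothetical, and only positive steps with an already-normalised start and stop are handled.

```python
def ceildiv(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def iter_chunk_selections(start, stop, step, chunk_len):
    """Yield (chunk index, slice within chunk, slice within output).

    Chunks in the start-stop range that contain no selected element
    are skipped instead of being yielded with an empty selection.
    """
    if step <= 0 or not 0 <= start <= stop:
        raise ValueError("sketch only handles positive steps and 0 <= start <= stop")

    for chunk in range(start // chunk_len, ceildiv(stop, chunk_len)):
        lo = chunk * chunk_len                 # first array index in this chunk
        hi = min(lo + chunk_len, stop)         # end of the usable part of this chunk

        # First selected array index that is not before this chunk
        if lo <= start:
            first = start
        else:
            first = start + ceildiv(lo - start, step) * step

        if first >= hi:
            continue                           # nothing selected here: skip the chunk

        n = ceildiv(hi - first, step)          # number of elements taken from this chunk
        out_start = (first - start) // step    # position of the first one in the output
        yield chunk, slice(first - lo, hi - lo, step), slice(out_start, out_start + n)


# Same selection as the demonstration: 50 elements, chunks of 5, slice(4, 41, 12)
for chunk, chunk_sel, out_sel in iter_chunk_selections(4, 41, 12, 5):
    print(chunk, chunk_sel, out_sel)
```

Because the empty chunks never leave the generator, the caller has no opportunity to load them, which is the effect the PR description attributes to the change.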