Conversation

@jrs65 (Contributor) commented Oct 1, 2022

This stops chunks from being read unnecessarily when a slice selection with a step is used (reported in #843).

Previously, all chunks spanning the start-end range would be read regardless of whether they contained any selected elements. This PR makes a small modification to zarr.indexing.SliceDimIndexer so that it stops yielding empty selections up to zarr.Array._get_selection. Previously an empty chunk would be read and no elements extracted from it; now the empty chunk is never touched.
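The core of the idea can be illustrated in isolation. The sketch below is not the actual SliceDimIndexer code (the helper name is invented); it just shows the arithmetic for deciding whether a chunk holds any element of a stepped slice:

```python
def chunk_contains_selection(chunk_start, chunk_len, start, stop, step):
    """Return True if any index selected by slice(start, stop, step) falls
    inside the chunk covering [chunk_start, chunk_start + chunk_len)."""
    if chunk_start <= start:
        first = start
    else:
        # Round up to the first selected index at or after the chunk's start
        first = start + ((chunk_start - start + step - 1) // step) * step
    return first < min(stop, chunk_start + chunk_len)


# With 50 elements in chunks of 5 and slice(4, 41, 12), the selected
# indices are 4, 16, 28, 40, which live in chunks 0, 3, 5, and 8 only.
needed = [c for c in range(10) if chunk_contains_selection(c * 5, 5, 4, 41, 12)]
print(needed)  # [0, 3, 5, 8]
```

Chunks for which this test is false previously still reached the read path with an empty selection; skipping them at the indexer level avoids the read entirely.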

I'm unsure whether a unit test should be added for this. It would be feasible to add a test asserting that only the minimal set of chunks is read, but that feels like testing an implementation detail. I'm happy to construct a test if that would be useful. Regardless, I'll add some test code in a comment that succinctly shows the issue and the fix.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@jrs65 (Contributor, Author) commented Oct 1, 2022

Here's a short demonstration of this:

import numpy as np
import zarr


def track_chunk_calls(z: zarr.Array):
    """Wrap a zarr.Array such that it prints out the details of the requested chunk."""

    f = z._chunk_getitem

    def _track(chunk_coords, chunk_sel, out, out_sel, **kwargs):
        print(f"{chunk_coords=}, {chunk_sel=}, {out_sel=}")
        return f(chunk_coords, chunk_sel, out, out_sel, **kwargs)

    z._chunk_getitem = _track


# Create a chunked array
N = 50
C = 5
z = zarr.ones((N,), chunks=(C,))

# Define a sparse slice selection
sl = slice(4, 41, 12)

# Calculate the chunks the requested elements are in
chunk_inds = np.arange(N)[sl] // C
print(f"## Chunks that must be read: {chunk_inds}")

print("## Chunks that are actually read:")
track_chunk_calls(z)
z[sl]

Prior to the fix in this PR, this gives the output:

## Chunks that must be read: [0 3 5 8]
## Chunks that are actually read:
chunk_coords=(0,), chunk_sel=(slice(4, 5, 12),), out_sel=(slice(0, 1, None),)
chunk_coords=(1,), chunk_sel=(slice(11, 5, 12),), out_sel=(slice(1, 1, None),)
chunk_coords=(2,), chunk_sel=(slice(6, 5, 12),), out_sel=(slice(1, 1, None),)
chunk_coords=(3,), chunk_sel=(slice(1, 5, 12),), out_sel=(slice(1, 2, None),)
chunk_coords=(4,), chunk_sel=(slice(8, 5, 12),), out_sel=(slice(2, 2, None),)
chunk_coords=(5,), chunk_sel=(slice(3, 5, 12),), out_sel=(slice(2, 3, None),)
chunk_coords=(6,), chunk_sel=(slice(10, 5, 12),), out_sel=(slice(3, 3, None),)
chunk_coords=(7,), chunk_sel=(slice(5, 5, 12),), out_sel=(slice(3, 3, None),)
chunk_coords=(8,), chunk_sel=(slice(0, 1, 12),), out_sel=(slice(3, 4, None),)

demonstrating that all nine chunks spanning the range are read, not just the minimal set [0, 3, 5, 8].

After applying the fix, only the minimal set of chunks is read:

## Chunks that must be read: [0 3 5 8]
## Chunks that are actually read:
chunk_coords=(0,), chunk_sel=(slice(4, 5, 12),), out_sel=(slice(0, 1, None),)
chunk_coords=(3,), chunk_sel=(slice(1, 5, 12),), out_sel=(slice(1, 2, None),)
chunk_coords=(5,), chunk_sel=(slice(3, 5, 12),), out_sel=(slice(2, 3, None),)
chunk_coords=(8,), chunk_sel=(slice(0, 1, 12),), out_sel=(slice(3, 4, None),)

@codecov (bot) commented Oct 1, 2022

Codecov Report

Merging #1154 (b7b25cb) into main (2dcffcd) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1154   +/-   ##
=======================================
  Coverage   99.95%   99.95%           
=======================================
  Files          36       36           
  Lines       14117    14142   +25     
=======================================
+ Hits        14110    14135   +25     
  Misses          7        7           
Impacted Files               Coverage Δ
zarr/indexing.py             100.00% <100.00%> (ø)
zarr/tests/test_indexing.py  100.00% <100.00%> (ø)
zarr/util.py                 100.00% <0.00%> (ø)

@rabernat (Contributor) commented Oct 1, 2022

Thanks so much for this PR @jrs65! It looks like a simple solution with a clear improvement.

> It would be feasible to have a test asserting that only the minimal set of chunks are read, but that feels like it's testing an implementation detail.

I'm not sure it's an implementation detail. Not accessing data unnecessarily is an important feature, particularly when the access might be very expensive (think cloud storage, large chunks, etc.). From my point of view, it would be great to have tests for this sort of thing.

One thing we do in Xarray's test suite is to define a special kind of store which errors if you try to access any of its data. This allows us to guard against undesirable data access. We could take a similar approach here in Zarr.
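As a sketch of that idea (a hypothetical wrapper written for this discussion, not Xarray's actual helper), a guard store could raise whenever a key that should never be read is accessed:

```python
from collections.abc import MutableMapping


class NoReadStore(MutableMapping):
    """Hypothetical store wrapper that raises if a guarded key is ever read."""

    def __init__(self, store, guarded=()):
        self._store = store
        self._guarded = set(guarded)

    def __getitem__(self, key):
        if key in self._guarded:
            raise AssertionError(f"unexpected read of key {key!r}")
        return self._store[key]

    def __setitem__(self, key, value):
        self._store[key] = value

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


# Guard the keys that a correct indexer should never touch
store = NoReadStore({"0": b"a", "1": b"b"}, guarded={"1"})
print(store["0"])   # reads of unguarded keys work normally
# store["1"] would raise AssertionError
```

A test would back a zarr array with such a store, guard the chunk keys outside the minimal set, and simply perform the selection: any stray read fails the test immediately.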

@jrs65 (Contributor, Author) commented Oct 1, 2022

> Thanks so much for this PR @jrs65! It looks like a simple solution with a clear improvement.
>
> I'm not sure it's an implementation detail. Not accessing data unnecessarily is an important feature, particularly when the access might be very expensive (think cloud storage, large chunks, etc.). From my point of view, it would be great to have tests for this sort of thing.
>
> One thing we do in Xarray's test suite is to define a special kind of store which errors if you try to access any of its data. This allows us to guard against undesirable data access. We could take a similar approach here in Zarr.

Great, that's along the lines I was thinking anyway, so I'll put together a set of unit tests on that basis and update this PR with them. I'll add them to zarr/tests/test_indexing.py unless you have another suggestion.

@rabernat (Contributor) commented Oct 1, 2022

That sounds great! Thanks so much for volunteering to take that on. It's a lot more work than just fixing the bug! πŸ™ We welcome your contributions to zarr-python and will do our best to help with your PR. πŸ™

This tests that only the expected set of chunks are accessed during
basic slice selection operations for both reads and writes to an array.
@jrs65 (Contributor, Author) commented Oct 2, 2022

@rabernat very happy to do so! Compared to hacking around in the internals of HDF5 and h5py, the zarr codebase is a joy to understand and modify.

I've implemented tests for this case. In the end I used the existing zarr.tests.util.CountingDict implementation rather than reinventing the wheel. I use that as a backing store, and then after each operation I check that the expected chunks have been accessed once and that no other chunks have been touched.
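For readers unfamiliar with CountingDict, the test pattern is roughly the following (a simplified stand-in written for illustration, not the actual zarr.tests.util helper, whose interface may differ):

```python
from collections import Counter
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Simplified stand-in for zarr.tests.util.CountingDict: counts reads per key."""

    def __init__(self):
        self._data = {}
        self.reads = Counter()

    def __getitem__(self, key):
        self.reads[key] += 1
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


store = CountingStore()
for i in range(10):
    store[f"chunk/{i}"] = b"..."

# Simulate a read that touches only the minimal set of chunks
for i in (0, 3, 5, 8):
    _ = store[f"chunk/{i}"]

# The test asserts each expected chunk was read exactly once and no others
assert all(store.reads[f"chunk/{i}"] == 1 for i in (0, 3, 5, 8))
assert all(store.reads[f"chunk/{i}"] == 0 for i in (1, 2, 4, 6, 7, 9))
```

In the real tests the reads are driven by indexing a zarr array backed by the counting store, rather than by touching keys directly as above.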

Let me know what you think, and anything else that needs to be done to the PR.

@joshmoore (Member) commented:
@jrs65: I'll second @rabernat's thanks! I've triggered the GitHub Actions runs. Let's see how the tests fare. (Also looking forward to what the asv benchmarks look like before & after. cc: @MSanKeys963)

@joshmoore (Member) commented:
Tests are all green! πŸŽ‰ @jrs65, do you want to add yourself to docs/release.rst?

Otherwise, unless there are objections, I could see getting this out for a quick patch release.

@jakirkham (Member) commented:
Agree this is a nice improvement. Thank you! πŸ™

Are there other indexers that would benefit from the same kind of fix?

@jrs65 (Contributor, Author) commented Oct 4, 2022

> Tests are all green! 🎉 @jrs65, do you want to add yourself to docs/release.rst?

Sure! I'll add it under a provisional 2.13.3 patch release heading, presuming you'll adjust that when you decide to get it out.

@jrs65 (Contributor, Author) commented Oct 4, 2022

> Agree this is a nice improvement. Thank you! 🙏
>
> Are there other indexers that would benefit from the same kind of fix?

I've just read over the code for the other indexers. BoolArrayDimIndexer, IntArrayDimIndexer, CoordinateIndexer, and MaskIndexer (a subclass of CoordinateIndexer) all look like they explicitly calculate which chunks contain the requested elements and then ignore those that don't. So I believe those are all fine.

I'm unsure about OrthogonalIndexer; it does not seem to do that explicitly, so I'd need to read it in more detail.

Regardless, I think it would be very straightforward to extend the tests to ensure that, going forward, the other selection types continue to access only what they need. If I find a bit of time to spare in the next week I might give that a go and see what happens.
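For the slice and integer-array cases, the expected minimal chunk set can be computed directly with NumPy, which would make such tests easy to extend. A sketch with an invented helper name (not part of the zarr test suite):

```python
import numpy as np


def expected_chunks(selection, n, chunk_len):
    """Chunk indices holding at least one selected element of a 1-D array
    of length n. `selection` may be a slice or an integer array."""
    indices = np.arange(n)[selection]
    return np.unique(indices // chunk_len)


print(expected_chunks(slice(4, 41, 12), 50, 5))         # stepped slice
print(expected_chunks(np.array([3, 7, 8, 44]), 50, 5))  # integer-array selection
```

A test for any indexer type could compare this expected set against the chunks a counting store reports as actually read.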

@jakirkham (Member) commented:
Adding to 2.13.3 sounds good. Thanks! πŸ™

Also thanks for taking the time to look into other indexers. No worries if you don't have time to look further. Figured this was a good opportunity to revisit since we are already looking at performance here. Though happy to spin off into a new issue for a future date if that's preferable πŸ™‚

@jrs65 (Contributor, Author) commented Oct 6, 2022

Errrmm... ooops. Sorry. I typo'd my own name in the Release Notes πŸ˜† I'm just going to push a quick fix to that one.

jakirkham enabled auto-merge (squash) on October 6, 2022 at 18:28
@jakirkham (Member) commented:
No worries. Thanks again Richard! πŸ™

Planning on merging after CI passes.

@joshmoore (Member) commented:
Re-launched the job after an unrelated failure. I'll keep an eye on it for release.

@jakirkham (Member) commented:
Thanks Josh! πŸ™ Looks like it passed 🟒

jakirkham merged commit d6e35a5 into zarr-developers:main on Oct 7, 2022
@joshmoore (Member) commented:
Thanks, @jakirkham. I created the GH release. πŸ‘
