Conversation

@jrs65 (Contributor) commented Oct 1, 2022

This stops chunks from being read unnecessarily when a slice selection with a step is used (reported in #843).

Previously, all chunks spanning the start-end range would be read regardless of whether they contained any selected elements. This PR makes a small modification to zarr.indexing.SliceDimIndexer so that it stops yielding empty selections up to zarr.Array._get_selection. Previously an empty chunk would be read and no elements extracted from it; now the empty chunk is never touched.
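The core of the idea can be illustrated in isolation. The sketch below is not the actual SliceDimIndexer code (the helper name is invented); it just shows the arithmetic for deciding whether a chunk holds any element of a stepped slice:

```python
def chunk_contains_selection(chunk_start, chunk_len, start, stop, step):
    """Return True if any index selected by slice(start, stop, step) falls
    inside the chunk covering [chunk_start, chunk_start + chunk_len)."""
    if chunk_start <= start:
        first = start
    else:
        # Round up to the first selected index at or after the chunk's start
        first = start + ((chunk_start - start + step - 1) // step) * step
    return first < min(stop, chunk_start + chunk_len)


# With 50 elements in chunks of 5 and slice(4, 41, 12), the selected
# indices are 4, 16, 28, 40, which live in chunks 0, 3, 5, and 8 only.
needed = [c for c in range(10) if chunk_contains_selection(c * 5, 5, 4, 41, 12)]
print(needed)  # [0, 3, 5, 8]
```

Chunks for which this test is false previously still reached the read path with an empty selection; skipping them at the indexer level avoids the read entirely.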

I'm unsure whether a unit test should be added for this. It would be feasible to add a test asserting that only the minimal set of chunks is read, but that feels like testing an implementation detail. I'm happy to construct a test if that would be useful. Regardless, I'll add some test code in a comment that succinctly shows the issue and the fix.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@jrs65 (Contributor, Author) commented Oct 1, 2022

Here's a short demonstration of this:

import numpy as np
import zarr


def track_chunk_calls(z: zarr.Array):
    """Wrap a zarr.Array such that it prints out the details of the requested chunk."""

    f = z._chunk_getitem

    def _track(chunk_coords, chunk_sel, out, out_sel, **kwargs):
        print(f"{chunk_coords=}, {chunk_sel=}, {out_sel=}")
        return f(chunk_coords, chunk_sel, out, out_sel, **kwargs)

    z._chunk_getitem = _track


# Create a chunked array
N = 50
C = 5
z = zarr.ones((N,), chunks=(C,))

# Define a sparse slice selection
sl = slice(4, 41, 12)

# Calculate the chunks the requested elements are in
chunk_inds = np.arange(N)[sl] // C
print(f"## Chunks that must be read: {chunk_inds}")

print("## Chunks that are actually read:")
track_chunk_calls(z)
z[sl]

Prior to the fix in this PR, this gives the output:

## Chunks that must be read: [0 3 5 8]
## Chunks that are actually read:
chunk_coords=(0,), chunk_sel=(slice(4, 5, 12),), out_sel=(slice(0, 1, None),)
chunk_coords=(1,), chunk_sel=(slice(11, 5, 12),), out_sel=(slice(1, 1, None),)
chunk_coords=(2,), chunk_sel=(slice(6, 5, 12),), out_sel=(slice(1, 1, None),)
chunk_coords=(3,), chunk_sel=(slice(1, 5, 12),), out_sel=(slice(1, 2, None),)
chunk_coords=(4,), chunk_sel=(slice(8, 5, 12),), out_sel=(slice(2, 2, None),)
chunk_coords=(5,), chunk_sel=(slice(3, 5, 12),), out_sel=(slice(2, 3, None),)
chunk_coords=(6,), chunk_sel=(slice(10, 5, 12),), out_sel=(slice(3, 3, None),)
chunk_coords=(7,), chunk_sel=(slice(5, 5, 12),), out_sel=(slice(3, 3, None),)
chunk_coords=(8,), chunk_sel=(slice(0, 1, 12),), out_sel=(slice(3, 4, None),)

demonstrating that all nine chunks spanning the range are read, not just the minimal set [0, 3, 5, 8].

After applying the fix, only the minimal set of chunks is read:

## Chunks that must be read: [0 3 5 8]
## Chunks that are actually read:
chunk_coords=(0,), chunk_sel=(slice(4, 5, 12),), out_sel=(slice(0, 1, None),)
chunk_coords=(3,), chunk_sel=(slice(1, 5, 12),), out_sel=(slice(1, 2, None),)
chunk_coords=(5,), chunk_sel=(slice(3, 5, 12),), out_sel=(slice(2, 3, None),)
chunk_coords=(8,), chunk_sel=(slice(0, 1, 12),), out_sel=(slice(3, 4, None),)

@codecov (bot) commented Oct 1, 2022

Codecov Report

Merging #1154 (b7b25cb) into main (2dcffcd) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1154   +/-   ##
=======================================
  Coverage   99.95%   99.95%           
=======================================
  Files          36       36           
  Lines       14117    14142   +25     
=======================================
+ Hits        14110    14135   +25     
  Misses          7        7           
Impacted Files               Coverage Δ
zarr/indexing.py             100.00% <100.00%> (ø)
zarr/tests/test_indexing.py  100.00% <100.00%> (ø)
zarr/util.py                 100.00% <0.00%> (ø)

@rabernat (Contributor) commented Oct 1, 2022

Thanks so much for this PR @jrs65! It looks like a simple solution with a clear improvement.

> It would be feasible to have a test asserting that only the minimal set of chunks are read, but that feels like it's testing an implementation detail.

I'm not sure it's an implementation detail. Not accessing data unnecessarily is an important feature, particularly when the access might be very expensive (think cloud storage, large chunks, etc.). From my point of view, it would be great to have tests for this sort of thing.

One thing we do in Xarray's test suite is to define a special kind of store which errors if you try to access any of its data. This allows us to guard against undesirable data access. We could take a similar approach here in Zarr.
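As a sketch of that idea (a hypothetical wrapper written for this discussion, not Xarray's actual helper), a guard store could raise whenever a key that should never be read is accessed:

```python
from collections.abc import MutableMapping


class NoReadStore(MutableMapping):
    """Hypothetical store wrapper that raises if a guarded key is ever read."""

    def __init__(self, store, guarded=()):
        self._store = store
        self._guarded = set(guarded)

    def __getitem__(self, key):
        if key in self._guarded:
            raise AssertionError(f"unexpected read of key {key!r}")
        return self._store[key]

    def __setitem__(self, key, value):
        self._store[key] = value

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


# Guard the keys that a correct indexer should never touch
store = NoReadStore({"0": b"a", "1": b"b"}, guarded={"1"})
print(store["0"])   # reads of unguarded keys work normally
# store["1"] would raise AssertionError
```

A test would back a zarr array with such a store, guard the chunk keys outside the minimal set, and simply perform the selection: any stray read fails the test immediately.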

@jrs65 (Contributor, Author) commented Oct 1, 2022

> Thanks so much for this PR @jrs65! It looks like a simple solution with a clear improvement.
>
> I'm not sure it's an implementation detail. Not accessing data unnecessarily is an important feature, particularly when the access might be very expensive (think cloud storage, large chunks, etc.). From my point of view, it would be great to have tests for this sort of thing.
>
> One thing we do in Xarray's test suite is to define a special kind of store which errors if you try to access any of its data. This allows us to guard against undesirable data access. We could take a similar approach here in Zarr.

Great, that's along the lines I was thinking anyway, so I'll put together a set of unit tests on that basis and update this PR with them. I'll add them to zarr/tests/test_indexing.py unless you have another suggestion.

@rabernat (Contributor) commented Oct 1, 2022

That sounds great! Thanks so much for volunteering to take that on. It's a lot more work than just fixing the bug! πŸ™ We welcome your contributions to zarr-python and will do our best to help with your PR. πŸ™

This tests that only the expected set of chunks are accessed during
basic slice selection operations for both reads and writes to an array.
@jrs65 (Contributor, Author) commented Oct 2, 2022

@rabernat very happy to do so! Compared to hacking around in the internals of HDF5 and h5py, the zarr codebase is a joy to understand and modify.

I've implemented tests for this case. In the end I used the existing zarr.tests.util.CountingDict implementation rather than reinventing the wheel. I use that as a backing store, and then after each operation I check that the expected chunks have been accessed once and that no other chunks have been touched.
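For readers unfamiliar with CountingDict, the test pattern is roughly the following (a simplified stand-in written for illustration, not the actual zarr.tests.util helper, whose interface may differ):

```python
from collections import Counter
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Simplified stand-in for zarr.tests.util.CountingDict: counts reads per key."""

    def __init__(self):
        self._data = {}
        self.reads = Counter()

    def __getitem__(self, key):
        self.reads[key] += 1
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


store = CountingStore()
for i in range(10):
    store[f"chunk/{i}"] = b"..."

# Simulate a read that touches only the minimal set of chunks
for i in (0, 3, 5, 8):
    _ = store[f"chunk/{i}"]

# The test asserts each expected chunk was read exactly once and no others
assert all(store.reads[f"chunk/{i}"] == 1 for i in (0, 3, 5, 8))
assert all(store.reads[f"chunk/{i}"] == 0 for i in (1, 2, 4, 6, 7, 9))
```

In the real tests the reads are driven by indexing a zarr array backed by the counting store, rather than by touching keys directly as above.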

Let me know what you think, and anything else that needs to be done to the PR.

@joshmoore (Member) commented:
@jrs65: I'll second @rabernat's thanks! I've triggered the GitHub Actions runs. Let's see how the tests fare. (Also looking forward to what the asv benchmarks look like before & after. cc: @MSanKeys963)

@joshmoore (Member) commented:
Tests are all green! πŸŽ‰ @jrs65, do you want to add yourself to docs/release.rst?

Otherwise, unless there are objections, I could see getting this out for a quick patch release.

@jakirkham (Member) commented:
Agree this is a nice improvement. Thank you! πŸ™

Are there other indexers that would benefit from the same kind of fix?

@jrs65 (Contributor, Author) commented Oct 4, 2022

> Tests are all green! 🎉 @jrs65, do you want to add yourself to docs/release.rst?

Sure! I'll add it under a provisional 2.13.3 patch release heading, presuming you'll adjust that when you decide to get it out.

@jrs65 (Contributor, Author) commented Oct 4, 2022

> Agree this is a nice improvement. Thank you! 🙏
>
> Are there other indexers that would benefit from the same kind of fix?

I've just read over the code for the other indexers. BoolArrayDimIndexer, IntArrayDimIndexer, CoordinateIndexer, and MaskIndexer (a subclass of CoordinateIndexer) all look like they explicitly calculate which chunks contain the requested elements and then ignore those that don't. So I believe those are all fine.

I'm unsure about OrthogonalIndexer; it does not seem to do that explicitly, so I'd need to read it in more detail.

Regardless, I think it would be very straightforward to extend the tests to ensure that, going forward, the other selection types continue to access only what they need. If I find a bit of time to spare in the next week I might give that a go and see what happens.
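For the slice and integer-array cases, the expected minimal chunk set can be computed directly with NumPy, which would make such tests easy to extend. A sketch with an invented helper name (not part of the zarr test suite):

```python
import numpy as np


def expected_chunks(selection, n, chunk_len):
    """Chunk indices holding at least one selected element of a 1-D array
    of length n. `selection` may be a slice or an integer array."""
    indices = np.arange(n)[selection]
    return np.unique(indices // chunk_len)


print(expected_chunks(slice(4, 41, 12), 50, 5))         # stepped slice
print(expected_chunks(np.array([3, 7, 8, 44]), 50, 5))  # integer-array selection
```

A test for any indexer type could compare this expected set against the chunks a counting store reports as actually read.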

@jakirkham (Member) commented:
Adding to 2.13.3 sounds good. Thanks! πŸ™

Also thanks for taking the time to look into other indexers. No worries if you don't have time to look further. Figured this was a good opportunity to revisit since we are already looking at performance here. Though happy to spin off into a new issue for a future date if that's preferable πŸ™‚

@jrs65 (Contributor, Author) commented Oct 6, 2022

Errrmm... ooops. Sorry. I typo'd my own name in the Release Notes πŸ˜† I'm just going to push a quick fix to that one.

jakirkham enabled auto-merge (squash) on October 6, 2022 at 18:28
@jakirkham (Member) commented:
No worries. Thanks again Richard! πŸ™

Planning on merging after CI passes.

@joshmoore (Member) commented:
Re-launched the job after an unrelated failure. I'll keep an eye on it for release.

@jakirkham (Member) commented:
Thanks Josh! πŸ™ Looks like it passed 🟒

jakirkham merged commit d6e35a5 into zarr-developers:main on Oct 7, 2022
@joshmoore (Member) commented:
Thanks, @jakirkham. I created the GH release. πŸ‘
