Support rechunking to seasonal frequency with SeasonalResampler #10519

Open: 12 commits to be merged into base: main

dhruvak001

Users could not use SeasonResampler for chunking operations in xarray, despite it being a natural fit for seasonal data analysis. Calling ds.chunk(time=SeasonResampler(["DJF", "MAMJ", "JAS", "ON"])) raised obscure errors because the chunking logic was hardcoded to work only with TimeResampler objects. This forced users to fall back on workarounds or manual chunking strategies.

This PR generalizes chunking by adding a resolve_chunks method to the Resampler base class and updating the chunking logic to work with all Resampler objects, not just TimeResampler. It also adds a _for_chunking method to SeasonResampler that forces drop_incomplete=False during chunking, so that no data is silently dropped. Existing TimeResampler behavior is unchanged, so the change is fully backward compatible.
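To make the intended behaviour concrete, here is a minimal pure-Python sketch of what chunking to a seasonal frequency means: consecutive time steps that fall in the same season end up in the same chunk. It does not use xarray. The names `MONTH_INITIALS`, `season_months` and `season_chunks` are hypothetical, and the implementation in this PR may work differently.

```python
from datetime import date, timedelta
from itertools import groupby

# Hypothetical helper: initials of January..December.
MONTH_INITIALS = "JFMAMJJASOND"


def season_months(season: str) -> list[int]:
    """Return the month numbers (1-12) covered by a season string such as "DJF"."""
    # Search a doubled string so that seasons wrapping the year end are found.
    start = (MONTH_INITIALS * 2).find(season)
    if start < 0:
        raise ValueError(f"{season!r} is not a run of consecutive months")
    return [(start + i) % 12 + 1 for i in range(len(season))]


def season_chunks(dates: list[date], seasons: list[str]) -> tuple[int, ...]:
    """Count consecutive dates per season (assumes the seasons cover all months)."""
    month_to_season = {
        month: idx for idx, season in enumerate(seasons) for month in season_months(season)
    }
    labels = [month_to_season[d.month] for d in dates]
    # Each run of identical labels becomes one chunk.
    return tuple(len(list(run)) for _, run in groupby(labels))


days = [date(2001, 1, 1) + timedelta(days=i) for i in range(365)]
print(season_chunks(days, ["DJF", "MAMJ", "JAS", "ON"]))
# (59, 122, 92, 61, 31)
```

The first and last chunks are the incomplete DJF seasons at the edges of the data, which is why chunking has to keep them rather than drop them.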

@dhruvak001 dhruvak001 changed the title from "Support chunking" to "Support rechunking to seasonal frequency with SeasonalResampler" on Jul 9, 2025
DHRUVA KUMAR KAUSHAL added 2 commits July 10, 2025 00:06
@dhruvak001 dhruvak001 requested a review from dcherian July 18, 2025 08:16
Comment on lines +1035 to +1043
def _for_chunking(self) -> Self:
    """
    Return a version of this resampler suitable for chunking.

    For SeasonResampler, this returns a version with drop_incomplete=False
    to prevent data from being silently dropped during chunking operations.
    """
    return type(self)(seasons=self.seasons, drop_incomplete=False)

Suggested change
- def _for_chunking(self) -> Self:
-     """
-     Return a version of this resampler suitable for chunking.
-     For SeasonResampler, this returns a version with drop_incomplete=False
-     to prevent data from being silently dropped during chunking operations.
-     """
-     return type(self)(seasons=self.seasons, drop_incomplete=False)


# Create a temporary resampler that ignores drop_incomplete for chunking
# This prevents data from being silently dropped during chunking
resampler_for_chunking = self._for_chunking()

Suggested change
- resampler_for_chunking = self._for_chunking()
+ resampler_for_chunking = type(self)(seasons=self.seasons, drop_incomplete=False)

Comment on lines +1140 to +1158
data = create_test_data()
for chunks in [1, 2, 3, 4, 5]:
    rechunked = data.chunk({"dim1": chunks})
    assert rechunked.chunks["dim1"] == (chunks,) * (8 // chunks) + (
        (8 % chunks,) if 8 % chunks else ()
    )

    rechunked = data.chunk({"dim2": chunks})
    assert rechunked.chunks["dim2"] == (chunks,) * (9 // chunks) + (
        (9 % chunks,) if 9 % chunks else ()
    )

    rechunked = data.chunk({"dim1": chunks, "dim2": chunks})
    assert rechunked.chunks["dim1"] == (chunks,) * (8 // chunks) + (
        (8 % chunks,) if 8 % chunks else ()
    )
    assert rechunked.chunks["dim2"] == (chunks,) * (9 // chunks) + (
        (9 % chunks,) if 9 % chunks else ()
    )

Suggested change
- data = create_test_data()
- for chunks in [1, 2, 3, 4, 5]:
-     rechunked = data.chunk({"dim1": chunks})
-     assert rechunked.chunks["dim1"] == (chunks,) * (8 // chunks) + (
-         (8 % chunks,) if 8 % chunks else ()
-     )
-     rechunked = data.chunk({"dim2": chunks})
-     assert rechunked.chunks["dim2"] == (chunks,) * (9 // chunks) + (
-         (9 % chunks,) if 9 % chunks else ()
-     )
-     rechunked = data.chunk({"dim1": chunks, "dim2": chunks})
-     assert rechunked.chunks["dim1"] == (chunks,) * (8 // chunks) + (
-         (8 % chunks,) if 8 % chunks else ()
-     )
-     assert rechunked.chunks["dim2"] == (chunks,) * (9 // chunks) + (
-         (9 % chunks,) if 9 % chunks else ()
-     )

)

# Test standard seasons
rechunked = ds.chunk(x=2, time=SeasonResampler(["DJF", "MAM", "JJA", "SON"]))

@dcherian dcherian Jul 18, 2025

We'll need to raise an error when a season is missing, like this:

        rechunked = ds.chunk(x=2, time=SeasonResampler(["DJF", "MAM", "SON"]))
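One way such a check could look is sketched below in plain Python, without xarray. It verifies that the season strings together cover all twelve months and raises otherwise. The names `MONTH_INITIALS` and `check_seasons_cover_year` are hypothetical, and the error message is illustrative only; the check adopted in the PR may differ.

```python
# Hypothetical helper: initials of January..December.
MONTH_INITIALS = "JFMAMJJASOND"


def check_seasons_cover_year(seasons: list[str]) -> None:
    """Raise ValueError if the given seasons leave any month uncovered."""
    covered: set[int] = set()
    for season in seasons:
        # Search a doubled string so that seasons like "DJF" are found.
        start = (MONTH_INITIALS * 2).find(season)
        if start < 0:
            raise ValueError(f"{season!r} is not a run of consecutive months")
        covered.update((start + i) % 12 + 1 for i in range(len(season)))

    missing = sorted(set(range(1, 13)) - covered)
    if missing:
        raise ValueError(
            f"Cannot rechunk with seasons {seasons!r}: months {missing} are not covered."
        )


check_seasons_cover_year(["DJF", "MAM", "JJA", "SON"])  # passes silently

try:
    check_seasons_cover_year(["DJF", "MAM", "SON"])
except ValueError as err:
    print(err)  # reports months [6, 7, 8] as not covered
```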

chunks = chunks.dropna(name).astype(int)
chunks_tuple: tuple[int, ...] = tuple(chunks.data.tolist())
return chunks_tuple
return resampler.compute_chunks(name, variable)

Suggested change
- return resampler.compute_chunks(name, variable)
+ newchunks = resampler.compute_chunks(name, variable)
+ if sum(newchunks) != variable.shape[0]:
+     raise ValueError(f"Logic bug in rechunking using {resampler!r}. New chunks tuple does not match size of data. Please open an issue.")
+ return newchunks

Let's protect ourselves a bit from logic bugs in the resampler
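As an illustration of the kind of bug this guard would catch, the standalone sketch below applies the same sum check to two hand-written chunk tuples for a 365-element dimension. `validate_chunks` is a hypothetical name, and the second tuple imitates a resampler that dropped the incomplete seasons at both ends of the data.

```python
def validate_chunks(newchunks: tuple[int, ...], size: int) -> tuple[int, ...]:
    """Return newchunks unchanged if they add up to size, else raise ValueError."""
    if sum(newchunks) != size:
        raise ValueError(
            f"New chunks {newchunks} sum to {sum(newchunks)}, but the dimension has size {size}."
        )
    return newchunks


# All 365 elements are accounted for, so the tuple is returned as-is.
print(validate_chunks((59, 122, 92, 61, 31), 365))

# Dropping the partial seasons at the edges loses 90 elements, which the check detects.
try:
    validate_chunks((122, 92, 61), 365)
except ValueError as err:
    print(err)
```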

# Test standard seasons
rechunked = ds.chunk(x=2, time=SeasonResampler(["DJF", "MAM", "JJA", "SON"]))
# Should have multiple chunks along time dimension
assert len(rechunked.chunksizes["time"]) > 1

let's assert an actual value here.

Comment on lines +1169 to +1173
N = 365 * 2  # 2 years
if use_cftime:
    time = xr.date_range("2001-01-01", periods=N, freq="D", use_cftime=True)
else:
    time = xr.date_range("2001-01-01", periods=N, freq="D")

Suggested change
- N = 365 * 2  # 2 years
- if use_cftime:
-     time = xr.date_range("2001-01-01", periods=N, freq="D", use_cftime=True)
- else:
-     time = xr.date_range("2001-01-01", periods=N, freq="D")
+ N = 366 + 365  # 2 years
+ time = xr.date_range("2000-01-01", periods=N, freq="D", use_cftime=use_cftime)

By starting in 2000, we can check leap year logic

Can we parameterize this over calendars too? 360_day, noleap and standard should be good enough.
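To get concrete expected values for such a parameterized test, the chunk sizes can be worked out by hand from month lengths. The sketch below does this in plain Python for daily data spanning whole years, starting on 1 January 2000. It is an assumption-laden illustration: it does not use xarray or cftime, `days_in_month` and `expected_season_chunks` are hypothetical names, and the "standard" branch relies on Python's proleptic Gregorian calendar, which agrees with the standard calendar for these years.

```python
import calendar as pycalendar
from itertools import groupby

MONTH_INITIALS = "JFMAMJJASOND"  # hypothetical helper: initials of Jan..Dec
NOLEAP_LENGTHS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def days_in_month(cal: str, year: int, month: int) -> int:
    """Number of days in a month for the three calendars discussed above."""
    if cal == "standard":
        return pycalendar.monthrange(year, month)[1]
    if cal == "noleap":
        return NOLEAP_LENGTHS[month - 1]
    if cal == "360_day":
        return 30
    raise ValueError(f"unsupported calendar {cal!r}")


def expected_season_chunks(cal, start_year, n_years, seasons):
    """Seasonal chunk sizes for daily data covering n_years whole years."""
    month_to_season = {}
    for idx, season in enumerate(seasons):
        start = (MONTH_INITIALS * 2).find(season)
        if start < 0:
            raise ValueError(f"{season!r} is not a run of consecutive months")
        for i in range(len(season)):
            month_to_season[(start + i) % 12 + 1] = idx
    # One (season index, number of days) pair per month, in time order.
    months = [
        (month_to_season[m], days_in_month(cal, y, m))
        for y in range(start_year, start_year + n_years)
        for m in range(1, 13)
    ]
    # Sum the days of consecutive months that belong to the same season.
    return tuple(sum(n for _, n in run) for _, run in groupby(months, key=lambda t: t[0]))


for cal in ["standard", "noleap", "360_day"]:
    print(cal, expected_season_chunks(cal, 2000, 2, ["DJF", "MAM", "JJA", "SON"]))
# standard (60, 92, 92, 91, 90, 92, 92, 91, 31)
# noleap (59, 92, 92, 91, 90, 92, 92, 91, 31)
# 360_day (60, 90, 90, 90, 90, 90, 90, 90, 30)
```

Note that two whole years contain 731, 730 and 720 days respectively, so a test parameterized over calendars cannot use a single fixed N if it is meant to end on a year boundary.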

{"x": 2, "time": SeasonResampler(["DJFM", "AM", "JJA", "SON"])}
)
# Should have multiple chunks along time dimension
assert len(rechunked.chunksizes["time"]) > 1

Here too, let's assert the actual chunks tuple.

Development

Successfully merging this pull request may close these issues.

Support rechunking to seasonal frequency with SeasonalResampler