Implement lazy loading to defer metadata RPCs until data access time … #253
Conversation
xee/ext.py (Outdated)

```python
self._info_cache = {}

# Perform minimal RPCs if lazy loading is enabled
if getattr(self, 'lazy_load', False):
```
Instead of this eager getattr, I think we should just pass lazy_load as an argument to this function.
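The refactor suggested above could look like the following minimal sketch. The class and method names (`CollectionStore`, `get_info`, the `_fetch_*` helpers) are invented for illustration and are not Xee's actual API; the point is only that the caller passes `lazy_load` in explicitly instead of the method probing the instance with `getattr`.

```python
# Hypothetical sketch: `lazy_load` as an explicit parameter rather than
# an eagerly probed attribute. All names here are illustrative.


class CollectionStore:
    def __init__(self, lazy_load: bool = False):
        self.lazy_load = lazy_load
        self._info_cache = {}

    def get_info(self, lazy_load: bool):
        # The caller decides; no getattr(self, 'lazy_load', False) needed.
        if lazy_load:
            return self._fetch_minimal_info()
        return self._fetch_full_info()

    def _fetch_minimal_info(self):
        # Stand-in for the cheap, deferred-metadata path.
        return {'mode': 'minimal'}

    def _fetch_full_info(self):
        # Stand-in for the eager, full-metadata path.
        return {'mode': 'full'}


store = CollectionStore(lazy_load=True)
print(store.get_info(store.lazy_load))  # {'mode': 'minimal'}
```

This keeps the function's behavior visible at every call site and makes it testable without constructing an object in a particular state.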
xee/ext.py (Outdated)

```python
return self._info_cache

# Full metadata loading if not lazy
if not self._info_cache or len(self._info_cache) < 5:  # Check if we have full metadata
```
I don't like this magic number. Is there a more principled way to check if the cache needs updating?
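One more principled alternative to the `len(cache) < 5` check is to make completeness explicit: declare the keys that a full metadata load populates and test for their presence. This is a hedged sketch; the key names are invented, not Xee's real cache keys.

```python
# Sketch: replace the magic-number cache check with an explicit
# required-keys set. Key names below are illustrative only.

REQUIRED_KEYS = frozenset({'bands', 'properties', 'features'})


def cache_is_complete(cache: dict) -> bool:
    # Complete iff every key that full metadata loading would
    # populate is already present in the cache.
    return REQUIRED_KEYS <= cache.keys()


print(cache_is_complete({'bands': []}))                                    # False
print(cache_is_complete({'bands': [], 'properties': {}, 'features': []}))  # True
```

Adding a key to a full load then only requires updating `REQUIRED_KEYS`, rather than remembering to bump a threshold.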
xee/ext.py (Outdated)

```python
columns = ['system:id', self.primary_dim_property]
properties = (
    self.image_collection.reduceColumns(
        ee.Reducer.toList().repeat(len(columns)), columns
    ).get('list')
).getInfo()
```
Idea: maybe we can break this out into a function and call it here and in get_info as needed.
```python
# Verify that lazy opening is faster than regular opening
self.assertLess(
    lazy_open_time, regular_open_time,
    f'Lazy loading ({lazy_open_time:.2f}s) should be faster than '
    f'regular loading ({regular_open_time:.2f}s)')
```
Thanks, I'm really happy to see this test.
xee/ext_integration_test.py (Outdated)

```python
    '1992-10-05', '1992-10-06')  # Using a smaller date range for the test

# Open dataset with lazy loading
start_time = time.time()
```
Let's use time.perf_counter() to capture time instead (and at the places below).
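The reason for the suggestion: `time.perf_counter()` is a monotonic, high-resolution clock intended for measuring intervals, whereas `time.time()` reads the wall clock, which can jump if the system clock is adjusted mid-measurement. A minimal illustration of the recommended pattern:

```python
import time

# Measure an interval with perf_counter(); time.sleep() here is just a
# stand-in for the dataset-open call being timed in the test.
start = time.perf_counter()
time.sleep(0.05)
elapsed = time.perf_counter() - start
print(f'open took {elapsed:.3f}s')
```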
```python
# Verify that both datasets have the same structure
self.assertEqual(lazy_ds.dims, regular_ds.dims)
self.assertEqual(list(lazy_ds.data_vars), list(regular_ds.data_vars))

# Access data and verify it's the same
var_name = list(lazy_ds.data_vars)[0]
lazy_data = lazy_ds[var_name].isel(time=0).values
regular_data = regular_ds[var_name].isel(time=0).values

# Both should have same shape and data should not be all zeros or NaNs
self.assertEqual(lazy_data.shape, regular_data.shape)
self.assertTrue(np.allclose(lazy_data, regular_data, equal_nan=True))
```
Thanks, this is good to have in the test.
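Worth noting why `equal_nan=True` matters in the comparison above: masked or fill pixels are typically NaN in both arrays, and NaN compares unequal to itself by default, so a plain `np.allclose` would fail even on identical rasters. A small self-contained illustration:

```python
import numpy as np

# Two identical arrays containing a NaN (e.g., a masked pixel).
a = np.array([1.0, np.nan, 3.0])
b = np.array([1.0, np.nan, 3.0])

print(np.allclose(a, b))                  # False: NaN != NaN by default
print(np.allclose(a, b, equal_nan=True))  # True: co-located NaNs match
```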
```python
if not hasattr(self, '_info_cache'):
    self._info_cache = {}
```
I'm concerned that we have two levels of caching: this private dictionary and the functools.cached_property level. Is there any way we can use one system instead of both? I anticipate cache invalidation problems down the line.
One way we could clean this system up is by breaking out the info we fetch into multiple functions that are also cached: e.g., a helper function gets the first-line essential stuff below. Since each helper is cached, we'd need less if...else logic to manage state; that bookkeeping would happen in Python's memoizer decorator, and we'd make use of the cache just by calling the helper functions.
WDYT about this approach?
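The single-cache approach proposed above might look like this sketch: each piece of metadata sits behind its own `functools.cached_property`, so the decorator owns all memoization and the manual `_info_cache` dict disappears. The class, method names, and fake RPC bodies are illustrative, not Xee's real code; `rpc_calls` is a counter added purely to demonstrate that each fetch runs at most once.

```python
import functools


class Store:
    def __init__(self):
        self.rpc_calls = 0  # counts simulated RPCs, to show caching works

    @functools.cached_property
    def essential_info(self):
        # The cheap, first-line metadata; fetched at most once.
        self.rpc_calls += 1
        return {'ids': ['a', 'b']}

    @functools.cached_property
    def full_info(self):
        # Heavier metadata, fetched at most once and only on demand.
        # Reuses essential_info through its own cache.
        self.rpc_calls += 1
        return {**self.essential_info, 'bands': ['B1']}


s = Store()
s.essential_info
s.essential_info        # second access hits the cache, no extra RPC
print(s.rpc_calls)      # 1
s.full_info             # one more RPC; essential_info is already cached
print(s.rpc_calls)      # 2
```

With this shape there is no `if not hasattr(...)` bookkeeping and no partial-vs-full state to track; lazy loading falls out of simply not touching `full_info` until data access time.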
alxmrs left a comment
Provided specific and high-level feedback.
In addition, I have a general warning: I don't have much control over whether PRs like this will actually get merged into Xee; patches like this require secondary review to ensure that the code is compatible with Google's internal use of Xee, and I'm not sure if there is capacity on the relevant teams to do these types of reviews right now.
Hi @alxmrs, I've addressed all the feedback from the review. Let me know if there's anything else I should adjust!
Hi @alxmrs,
Added lazy loading (issue #44):
- lazy_load=True option
- Defers metadata RPCs → faster large dataset opens
- Backward compatible + tested
Can you review my PR?