
Conversation

@pvrraju commented Sep 5, 2025

Hi @alxmrs,

Added lazy loading (issue #44):

- New `lazy_load=True` option
- Defers metadata RPCs, so large datasets open faster
- Backward compatible and tested

Could you review my PR?

xee/ext.py Outdated
self._info_cache = {}

# Perform minimal RPCs if lazy loading is enabled
if getattr(self, 'lazy_load', False):
Contributor:

Instead of this eager getattr, I think we should just pass lazy_load as an argument to this function.
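A minimal sketch of the suggested change: `get_info` takes `lazy_load` as an explicit parameter instead of probing the instance with `getattr`. The `Store` class and its helper methods below are illustrative stand-ins, not Xee's actual `EarthEngineStore` API:

```python
# Hypothetical sketch: thread lazy_load through the call explicitly
# rather than reading it off the instance with getattr.
class Store:
    def __init__(self, lazy_load: bool = False):
        self.lazy_load = lazy_load

    def get_info(self, lazy_load: bool = False) -> dict:
        # Perform minimal RPCs only when lazy loading is requested.
        if lazy_load:
            return self._minimal_info()
        return self._full_info()

    def _minimal_info(self) -> dict:
        # Placeholder for the few RPCs a lazy open actually needs.
        return {'id': 'example-collection'}

    def _full_info(self) -> dict:
        # Placeholder for the complete metadata fetch.
        return {'id': 'example-collection', 'bands': [], 'properties': {}}


store = Store(lazy_load=True)
info = store.get_info(lazy_load=store.lazy_load)
```

Passing the flag through the signature keeps the dependency visible at the call site instead of hiding it in instance state.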

xee/ext.py Outdated
return self._info_cache

# Full metadata loading if not lazy
if not self._info_cache or len(self._info_cache) < 5: # Check if we have full metadata
Contributor:

I don't like this magic number. Is there a more principled way to check if the cache needs updating?
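One principled alternative to the magic number, sketched under the assumption that the keys full metadata must contain can be named explicitly (the key names below are illustrative, not Xee's actual metadata schema):

```python
# Hypothetical sketch: instead of `len(cache) < 5`, compare the cached
# keys against an explicit set of required full-metadata keys.
# FULL_METADATA_KEYS is an assumption, not Xee's real schema.
FULL_METADATA_KEYS = frozenset({'id', 'type', 'bands', 'properties', 'version'})


def cache_is_complete(info_cache: dict) -> bool:
    """True when every required full-metadata key is present in the cache."""
    return FULL_METADATA_KEYS.issubset(info_cache)
```

This mirrors the fix the author describes later in the thread ("replaced the magic number with proper key sets").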

xee/ext.py Outdated
Comment on lines 383 to 388
columns = ['system:id', self.primary_dim_property]
properties = (
self.image_collection.reduceColumns(
ee.Reducer.toList().repeat(len(columns)), columns
).get('list')
).getInfo()
Contributor:

Idea: maybe we can break this out into a function and call it here and in get_info as needed.
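The extraction the reviewer proposes might look like the sketch below. `_fetch_collection_properties` is the name the author adopts in a later comment; the `Fake*` classes exist only so the sketch runs without an Earth Engine session, mimicking the `reduceColumns` / `get` / `getInfo` call chain from the quoted snippet:

```python
class FakeReducer:
    """Stands in for ee.Reducer.toList(); repeat() chains like the real API."""
    def repeat(self, count):
        return self


class FakeComputedObject:
    """Stands in for an ee computed object with get() and getInfo()."""
    def __init__(self, data):
        self._data = data

    def get(self, key):
        return FakeComputedObject(self._data[key])

    def getInfo(self):
        return self._data


class FakeImageCollection:
    """Returns one list of values per requested column, like reduceColumns."""
    def reduceColumns(self, reducer, columns):
        return FakeComputedObject({'list': [[f'{c}-value'] for c in columns]})


def fetch_collection_properties(image_collection, columns, reducer):
    """One server round trip that fetches the named columns for every image."""
    return (
        image_collection.reduceColumns(
            reducer.repeat(len(columns)), columns
        ).get('list')
    ).getInfo()


props = fetch_collection_properties(
    FakeImageCollection(), ['system:id', 'system:time_start'], FakeReducer())
```

With the RPC isolated in one helper, both this call site and `get_info` can share it instead of duplicating the reducer plumbing.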

Comment on lines +583 to +586

# Verify that lazy opening is faster than regular opening
self.assertLess(
    lazy_open_time, regular_open_time,
    f"Lazy loading ({lazy_open_time:.2f}s) should be faster than "
    f"regular loading ({regular_open_time:.2f}s)")
Contributor:

Thanks, I'm really happy to see this test.

'1992-10-05', '1992-10-06') # Using a smaller date range for the test

# Open dataset with lazy loading
start_time = time.time()
Contributor:

Let's use time.perf_counter() to capture time instead (and at the places below).
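For context on this suggestion: `time.perf_counter()` is a monotonic, high-resolution clock intended for interval timing, whereas `time.time()` reports wall-clock time and can jump if the system clock is adjusted. A minimal sketch, where the timed work is a stand-in for the dataset open:

```python
import time

start = time.perf_counter()
total = sum(range(100_000))  # stand-in for the xr.open_dataset call being timed
elapsed = time.perf_counter() - start
```

`elapsed` is guaranteed non-negative because `perf_counter` never runs backwards, which is exactly the property the `assertLess` timing comparison relies on.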

Comment on lines +588 to +599
# Verify that both datasets have the same structure
self.assertEqual(lazy_ds.dims, regular_ds.dims)
self.assertEqual(list(lazy_ds.data_vars), list(regular_ds.data_vars))

# Access data and verify it's the same
var_name = list(lazy_ds.data_vars)[0]
lazy_data = lazy_ds[var_name].isel(time=0).values
regular_data = regular_ds[var_name].isel(time=0).values

# Both should have same shape and data should not be all zeros or NaNs
self.assertEqual(lazy_data.shape, regular_data.shape)
self.assertTrue(np.allclose(lazy_data, regular_data, equal_nan=True))
Contributor:

Thanks this is good to have in the test.

Comment on lines +297 to +298
if not hasattr(self, '_info_cache'):
self._info_cache = {}
Contributor:

I'm concerned that we have two levels of caching: this private dictionary and the functools.cached_property level. Is there any way we can use one system instead of both? I anticipate cache invalidation problems down the line.

One way we could clean this system up is by breaking out what info we get into multiple functions (that are also cached): e.g. a helper fn gets the first-line essential stuff below. Since it's cached, we'd need less if...else logic to manage state; that would happen in Python's memoizer decorator -- we'd make use of the cache just by calling the helper functions.

WDYT about this approach?
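A minimal sketch of the single-cache approach described above, assuming the metadata can be split into cached helper properties; the class and property names are illustrative, not Xee's API:

```python
import functools


class CachedStore:
    """Illustrative stand-in: each metadata piece behind its own cache."""

    @functools.cached_property
    def essential_info(self) -> dict:
        # First-line essentials; computed on first access, then memoized.
        return {'id': 'example-collection'}

    @functools.cached_property
    def full_info(self) -> dict:
        # Builds on the cached essentials; no hand-rolled _info_cache dict
        # and no if/else state checks, since the decorator owns the cache.
        return {**self.essential_info, 'bands': [], 'properties': {}}
```

Calling the helpers is the cache: repeated access returns the memoized value, so there is one caching mechanism to reason about instead of two.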

@alxmrs (Contributor) left a comment:

Provided specific and high-level feedback.

In addition, I have a general warning: I don't have much control over whether PRs like this will actually get merged into Xee; patches like this require secondary review to ensure that the code is compatible with Google's internal use of Xee, and I'm not sure if there is capacity on the relevant teams to do these types of reviews right now.

@pvrraju (Author) commented Sep 12, 2025

Hi @alxmrs, I've addressed all the feedback from the review:

  1. Passed lazy_load as an argument to get_info() instead of using getattr
  2. Replaced the magic number with proper key sets to check cache status
  3. Extracted property fetching into a separate helper method _fetch_collection_properties()
  4. Used time.perf_counter() instead of time.time() in tests for more accurate timing
  5. Refactored the caching system to use a more consistent approach

Let me know if there's anything else I should adjust!

@pvrraju requested a review from alxmrs September 12, 2025 16:26
@jdbcode (Member) commented Dec 18, 2025

Thanks @pvrraju and @alxmrs!
Let's revisit this after #275 is in a pre or stable release.

@jdbcode mentioned this pull request Dec 18, 2025