Implement lazy loading to defer metadata RPCs until data access time … #253
Conversation
xee/ext.py (Outdated)

```python
self._info_cache = {}

# Perform minimal RPCs if lazy loading is enabled
if getattr(self, 'lazy_load', False):
```
Instead of this eager getattr, I think we should just pass lazy_load as an argument to this function.
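The refactor suggested above could look like the following minimal sketch. The class and method names (`CollectionStore`, `get_info`, the `_fetch_*` helpers) are invented for illustration and are not Xee's actual API; the point is only that the caller passes `lazy_load` in explicitly instead of the method probing the instance with `getattr`.

```python
# Hypothetical sketch: `lazy_load` as an explicit parameter rather than
# an eagerly probed attribute. All names here are illustrative.


class CollectionStore:
    def __init__(self, lazy_load: bool = False):
        self.lazy_load = lazy_load
        self._info_cache = {}

    def get_info(self, lazy_load: bool):
        # The caller decides; no getattr(self, 'lazy_load', False) needed.
        if lazy_load:
            return self._fetch_minimal_info()
        return self._fetch_full_info()

    def _fetch_minimal_info(self):
        # Stand-in for the cheap, deferred-metadata path.
        return {'mode': 'minimal'}

    def _fetch_full_info(self):
        # Stand-in for the eager, full-metadata path.
        return {'mode': 'full'}


store = CollectionStore(lazy_load=True)
print(store.get_info(store.lazy_load))  # {'mode': 'minimal'}
```

This keeps the function's behavior visible at every call site and makes it testable without constructing an object in a particular state.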
xee/ext.py (Outdated)

```python
return self._info_cache

# Full metadata loading if not lazy
if not self._info_cache or len(self._info_cache) < 5:  # Check if we have full metadata
```
I don't like this magic number. Is there a more principled way to check if the cache needs updating?
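One more principled alternative to the `len(cache) < 5` check is to make completeness explicit: declare the keys that a full metadata load populates and test for their presence. This is a hedged sketch; the key names are invented, not Xee's real cache keys.

```python
# Sketch: replace the magic-number cache check with an explicit
# required-keys set. Key names below are illustrative only.

REQUIRED_KEYS = frozenset({'bands', 'properties', 'features'})


def cache_is_complete(cache: dict) -> bool:
    # Complete iff every key that full metadata loading would
    # populate is already present in the cache.
    return REQUIRED_KEYS <= cache.keys()


print(cache_is_complete({'bands': []}))                                    # False
print(cache_is_complete({'bands': [], 'properties': {}, 'features': []}))  # True
```

Adding a key to a full load then only requires updating `REQUIRED_KEYS`, rather than remembering to bump a threshold.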
xee/ext.py (Outdated)

```python
columns = ['system:id', self.primary_dim_property]
properties = (
    self.image_collection.reduceColumns(
        ee.Reducer.toList().repeat(len(columns)), columns
    ).get('list')
).getInfo()
```
Idea: maybe we can break this out into a function and call it here and in get_info as needed.
```python
# Verify that lazy opening is faster than regular opening
self.assertLess(
    lazy_open_time, regular_open_time,
    f'Lazy loading ({lazy_open_time:.2f}s) should be faster than '
    f'regular loading ({regular_open_time:.2f}s)')
```
Thanks, I'm really happy to see this test.
xee/ext_integration_test.py (Outdated)

```python
    '1992-10-05', '1992-10-06')  # Using a smaller date range for the test

# Open dataset with lazy loading
start_time = time.time()
```
Let's use time.perf_counter() to capture time instead (and at the places below).
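The reason for the suggestion: `time.perf_counter()` is a monotonic, high-resolution clock intended for measuring intervals, whereas `time.time()` reads the wall clock, which can jump if the system clock is adjusted mid-measurement. A minimal illustration of the recommended pattern:

```python
import time

# Measure an interval with perf_counter(); time.sleep() here is just a
# stand-in for the dataset-open call being timed in the test.
start = time.perf_counter()
time.sleep(0.05)
elapsed = time.perf_counter() - start
print(f'open took {elapsed:.3f}s')
```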
```python
# Verify that both datasets have the same structure
self.assertEqual(lazy_ds.dims, regular_ds.dims)
self.assertEqual(list(lazy_ds.data_vars), list(regular_ds.data_vars))

# Access data and verify it's the same
var_name = list(lazy_ds.data_vars)[0]
lazy_data = lazy_ds[var_name].isel(time=0).values
regular_data = regular_ds[var_name].isel(time=0).values

# Both should have same shape and data should not be all zeros or NaNs
self.assertEqual(lazy_data.shape, regular_data.shape)
self.assertTrue(np.allclose(lazy_data, regular_data, equal_nan=True))
```
Thanks, this is good to have in the test.
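Worth noting why `equal_nan=True` matters in the comparison above: masked or fill pixels are typically NaN in both arrays, and NaN compares unequal to itself by default, so a plain `np.allclose` would fail even on identical rasters. A small self-contained illustration:

```python
import numpy as np

# Two identical arrays containing a NaN (e.g., a masked pixel).
a = np.array([1.0, np.nan, 3.0])
b = np.array([1.0, np.nan, 3.0])

print(np.allclose(a, b))                  # False: NaN != NaN by default
print(np.allclose(a, b, equal_nan=True))  # True: co-located NaNs match
```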
```python
if not hasattr(self, '_info_cache'):
    self._info_cache = {}
```
I'm concerned that we have two levels of caching: this private dictionary and the functools.cached_property level. Is there any way we can use one system instead of both? I anticipate cache invalidation problems down the line.
One way we could clean this system up is by breaking out the info we fetch into multiple functions that are also cached: e.g., a helper function gets the first-line essential stuff below. Since each helper is cached, we'd need less if...else logic to manage state; that bookkeeping would happen in Python's memoizer decorator, and we'd make use of the cache just by calling the helper functions.
WDYT about this approach?
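The single-cache approach proposed above might look like this sketch: each piece of metadata sits behind its own `functools.cached_property`, so the decorator owns all memoization and the manual `_info_cache` dict disappears. The class, method names, and fake RPC bodies are illustrative, not Xee's real code; `rpc_calls` is a counter added purely to demonstrate that each fetch runs at most once.

```python
import functools


class Store:
    def __init__(self):
        self.rpc_calls = 0  # counts simulated RPCs, to show caching works

    @functools.cached_property
    def essential_info(self):
        # The cheap, first-line metadata; fetched at most once.
        self.rpc_calls += 1
        return {'ids': ['a', 'b']}

    @functools.cached_property
    def full_info(self):
        # Heavier metadata, fetched at most once and only on demand.
        # Reuses essential_info through its own cache.
        self.rpc_calls += 1
        return {**self.essential_info, 'bands': ['B1']}


s = Store()
s.essential_info
s.essential_info        # second access hits the cache, no extra RPC
print(s.rpc_calls)      # 1
s.full_info             # one more RPC; essential_info is already cached
print(s.rpc_calls)      # 2
```

With this shape there is no `if not hasattr(...)` bookkeeping and no partial-vs-full state to track; lazy loading falls out of simply not touching `full_info` until data access time.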
alxmrs left a comment
Provided specific and high-level feedback.
In addition, I have a general warning: I don't have much control over whether PRs like this will actually get merged into Xee; patches like this require secondary review to ensure that the code is compatible with Google's internal use of Xee, and I'm not sure if there is capacity on the relevant teams to do these types of reviews right now.
Hi @alxmrs, I've addressed all the feedback from the review. Let me know if there's anything else I should adjust!
Hi @alxmrs,
Added lazy loading (issue #44):
- lazy_load=True option
- Defers metadata RPCs → faster large dataset opens
- Backward compatible + tested
Can you review my PR?