#1161 WIP to vectorize isel_points #1162
Conversation
WIP to use dask vindex for point-based selection
@@ -996,7 +996,7 @@ def sel(self, method=None, tolerance=None, **indexers):
        )
        return self.isel(**pos_indexers)._replace_indexes(new_indexes)

-    def isel_points(self, dim='points', **indexers):
+    def isel_points_old(self, dim='points', **indexers):
Can you remove the old function instead? That will make GitHub show a more informative diff.
Done. Sorry, it's a bit of a mess doing this under deadlines.
So it seems to work fine in the dask case, but I don't have a deep understanding of how DataArrays are constructed from arrays and dims, so it fails in the non-dask case. Also not sure how you feel about making a special case for the dask backend here (since up till now it was all backend-agnostic).
I'm OK with adding a special case for dask here, especially since the general logic will flow the same in both cases.
            name=name)

        return merge([v for k, v in variables.items() if k not in self.dims])
I think you can just use the normal `Dataset` constructor instead of passing these into `merge`.
slc = [indexers_dict[k] if k in indexers_dict else slice(None, None) for k in self.dims]
coords = [self.variables[k] for k in non_indexed]

# TODO need to figure out how to make sure we get the indexed vs non indexed dimensions in the right order
If you transpose the arrays so that the dimensions being indexed are always leading dimensions (something like `reordered = self.transpose(*(indexed_dims + non_indexed_dims))`), indexing `.vindex` on a dask array and indexing directly on a NumPy array will give the same result.

You really don't want to allow indexed dimensions in the middle somewhere, because NumPy has unintuitive results for reordering axes in some edge cases.
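To illustrate the transpose advice with a hedged sketch (the dimension names and array shapes here are made up, not xarray's actual code): once the indexed dimensions are moved to the front, plain NumPy fancy indexing collapses them into a single leading "points" axis, which matches what dask's `.vindex` produces.

```python
import numpy as np

# Hypothetical variable with dims ('time', 'lat', 'lon'); we index
# pointwise along 'lat' and 'lon'. Transposing so the indexed dims lead
# makes the result unambiguous: the broadcast "points" axis comes first.
arr = np.arange(24).reshape(2, 3, 4)      # ('time', 'lat', 'lon')
reordered = arr.transpose(1, 2, 0)        # ('lat', 'lon', 'time')

lat_idx = np.array([0, 2, 1])
lon_idx = np.array([3, 0, 2])
selection = reordered[lat_idx, lon_idx]   # shape (points, time) == (3, 2)
print(selection.shape)
```

With the indexed dims leading, each row of `selection` is one requested point carried across the remaining (non-indexed) dimensions.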
Work to get tests passing - completely re-worked logic to allow for wide range of Dataset formats and normal behaviour
@shoyer I'm down to 1 test failing locally in
instead of
But here I'm not sure if my code is wrong or the test. It seems that the test requires I've updated the test according to how I think it should be working, but please correct me if I misunderstood.
The development version of xarray includes a change that makes indexes optional. So if you use the Dataset/DataArray constructor, it no longer adds new coordinates for each dimension by default.
Datasets no longer require/generate index coordinate for every dimension
OK, I adjusted for the new behaviour and all tests pass locally; hopefully Travis agrees... Edit: Looks like it's green.
# Transpose the var to ensure that the indexed dims come first
# These dims will be collapsed in the output.
# To avoid edge cases in numpy want to transpose to ensure the indexed dimensions are first
# However transpose is not lazy, so want to avoid using it for dask case (??)
This is a potential concern if the data is not loaded in dask but is still pointing to a lazily loaded array of values on disk in a netCDF file -- calling `.transpose` will load the entire array into memory.
That said, I think this change is still worth doing, and we shouldn't add special cases for different data types. It is not unreasonable to encourage users to use dask if they are concerned about memory usage and lazy loading.
# These dims will be collapsed in the output.
# To avoid edge cases in numpy want to transpose to ensure the indexed dimensions are first
# However transpose is not lazy, so want to avoid using it for dask case (??)
var = var.transpose(*(list(d for d in indexer_dims if d in var.dims) +
You might transpose the entire dataset once, instead of transposing each variable in turn. It should do basically the same thing.
I did that originally but then I thought that if there are variables that are not indexed at all we can skip applying transpose to them this way (and triggering a data load on them too). Does that make sense?
variables = OrderedDict()

for name in data_vars:
Is it possible to use this loop for all elements in `self.variables`, instead of adding special cases for `coords` and `data_vars`?
I will look into it. It's originally this way to preserve the behaviour expected in the tests, but now that I understand that behaviour better I might be able to simplify.
I believe Dataset.transpose already skips variables that don't need to be transposed.
@shoyer Tidied up based on recommendations; now everything is done in a single loop (still need to make the distinction between variables and coordinates for the output, but it's a lot neater).
I have a few minor suggestions but generally this looks very nice!
Do you have any benchmarks you can share? It would be nice to know the magnitude of the improvement here. (It would also be a nice line to include in the release notes.)
# Add the var to variables or coords accordingly
if name in coords:
    sel_coords[name] = (var_dims, selection)
Rather than conditionally adding variables to `sel_coords` or `variables`, consider putting everything in `variables` and using the private constructor `self._replace_vars_and_dims` to return a new dataset with the updated variables. See elsewhere in this file for examples.
Also note: recreating variables with a tuple drops `attrs`. It would be better to write `(var_dims, selection, var.attrs)`, or better yet `type(var)(var_dims, selection, var.attrs)`, to be explicit about constructing variable objects (you would need to do this if you switch to using `_replace_vars_and_dims`).
OK, I will try that. I had originally found that the tests weren't happy if I didn't make that distinction. I think that's because the tests expect variables to be classified into vars or coords, but since both vars and coords can depend on other coords, my code didn't seem to infer correctly when to consider an element a 'var' and when to consider it a 'coord'. I'll see if the private API fixes that.
    # Special case for dask backed arrays to use vectorised list indexing
    sel = variable.data.vindex[slices]
else:
    # Otherwise assume backend is numpy array with 'fancy' indexing
Add a note to remove this helper function after NumPy supports `vindex`, with a reference to numpy/numpy#6075.
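A hedged sketch of the helper under discussion (the function name `_pointwise_index` is hypothetical; only the NumPy branch is exercised below): dask arrays expose vectorized pointwise indexing through their `.vindex` attribute, while NumPy arrays handle the same tuple of index arrays via fancy indexing directly.

```python
import numpy as np

def _pointwise_index(data, slices):
    # TODO: remove this helper once NumPy supports vindex directly,
    # see numpy/numpy#6075
    if hasattr(data, 'vindex'):
        # dask.array.Array supports vectorized indexing via .vindex
        return data.vindex[slices]
    # NumPy arrays support the equivalent 'fancy' indexing directly
    return data[slices]

arr = np.arange(12).reshape(3, 4)
picked = _pointwise_index(arr, (np.array([0, 2]), np.array([1, 3])))
print(picked)  # -> [ 1 11]
```

Dispatching on the presence of `.vindex` keeps the caller backend-agnostic, which is the concern raised earlier in this thread.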
@@ -1034,6 +1036,7 @@ def sel(self, method=None, tolerance=None, drop=False, **indexers):
        return result._replace_indexes(new_indexes)

    def isel_points(self, dim='points', **indexers):
        # type: () -> Dataset
Should be `# type: (...) -> Dataset`.
dim = as_variable(dim, name='points')
name = 'points' if not hasattr(dim, 'name') else dim.name
dim_coord = as_variable(dim, name=name)
dim = name
It's not immediately obvious to me what this is doing. That said, this is forgivable, given that I'm pretty sure you just duplicated it from `concat`.
This is just to make sure we fall back to the default name 'points' in case the supplied index doesn't have its own name.
use _replace_vars_and_dims, simplify new dataset creation, preserve attributes, clarify dim vs dim_name (don’t re-use variable name to reduce confusion)
Completed changes based on recommendations and cleaned up old code and comments. As for benchmarks, I don't have anything rigorous, but I do have the following example:

%%time
data = dataset.isel_points(time=np.arange(0, 1000), lat=np.ones(1000, dtype=int), lon=np.ones(1000, dtype=int))
data.load()

Results:
This looks great to me. I'll merge this shortly after releasing 0.9.0 -- could you please add a brief release note in a new section for 0.9.1? (It won't be a long wait.)
OK, will wait for 0.9.0 to be released.
"""Returns a new dataset with each array indexed pointwise along the
specified dimension(s).

This method selects pointwise values from each array and is akin to
the NumPy indexing behavior of `arr[[0, 1], [0, 1]]`, except this
method does not require knowing the order of each array's dimensions.

Will use dask vectorised operation if available
Let's not add this comment -- implementation details usually don't make it into docstrings
righto
Note - waiting for 0.9.0 to be released before updating what's new; don't want to end up with conflicts in docs.
Actually, if you want to write that now, I think we can squeeze this in. We are still a day or two away from the release.
OK, added a performance improvements section to the docs.
Looks good. Unfortunately there was a merge conflict with the quantile PR, so you need to merge master again.
Crikey. Fixed the merge; hopefully it works (I hate merge conflicts).
OK, build passed, so I'm merging. Thanks!