#1161 WIP to vectorize isel_points #1162

Merged: 37 commits into pydata:master on Jan 23, 2017

Conversation

mangecoeur (Contributor)

WIP to use dask vindex for point-based selection

@@ -996,7 +996,7 @@ def sel(self, method=None, tolerance=None, **indexers):
)
return self.isel(**pos_indexers)._replace_indexes(new_indexes)

- def isel_points(self, dim='points', **indexers):
+ def isel_points_old(self, dim='points', **indexers):
shoyer (Member)

Can you remove the old function instead? That will make GitHub show a more informative diff.

mangecoeur (Contributor, author)

Done. Sorry, it's a bit of a mess doing this under deadlines.

@mangecoeur commented Dec 14, 2016

So it seems to work fine in the dask case, but I don't have a deep understanding of how DataArrays are constructed from arrays and dims, so it fails in the non-dask case. Also not sure how you feel about making a special case for the dask backend here (since up till now it was all backend-agnostic).

@shoyer left a comment

I'm OK with adding a special case for dask here, especially since the general logic will flow the same in both cases.

name=name)


return merge([v for k, v in variables.items() if k not in self.dims])
shoyer (Member)

I think you can just use the normal Dataset constructor instead of passing these into merge.

slc = [indexers_dict[k] if k in indexers_dict else slice(None, None) for k in self.dims]
coords = [self.variables[k] for k in non_indexed]

# TODO need to figure out how to make sure we get the indexed vs non indexed dimensions in the right order
shoyer (Member)

If you transpose the arrays so that the dimensions being indexed are always the leading dimensions (something like reordered = self.transpose(*(indexed_dims + non_indexed_dims))), then indexing with .vindex on a dask array and indexing directly on a numpy array will give the same result.

You really don't want to allow indexed dimensions in the middle somewhere because NumPy has unintuitive results for reordering axes in some edge cases.

@mangecoeur

@shoyer I'm down to 1 test failing locally in sel_points, but I'm not sure what the desired behaviour is. I get:

<xarray.Dataset>
Dimensions:  (points: 3)
Coordinates:
  * points   (points) int64 0 1 2
Data variables:
    foo      (points) int64 0 4 8

instead of

AssertionError: <xarray.Dataset>
Dimensions:  (points: 3)
Coordinates:
  o points   (points) -
Data variables:
    foo      (points) int64 0 4 8

But here I'm not sure if my code is wrong or the test. It seems that the test requires sel_points NOT to generate new coordinate values for points -- however, I'm pretty sure isel_points does require this (it passes, in any case). I don't really see a way in my code to generate subsets without having a matching coordinate array (I don't know how to use the Dataset constructors without one, for instance).

I've updated the test according to how I think it should work, but please correct me if I misunderstood.

@shoyer commented Dec 23, 2016

The development version of xarray includes a change that makes indexes optional. So if you use the Dataset/DataArray constructor it no longer adds new coordinates for each dimension by default.

@mangecoeur commented Dec 23, 2016

OK, I adjusted for the new behaviour and all tests pass locally; hopefully Travis agrees...

Edit: Looks like it's green

# Transpose the var to ensure that the indexed dims come first
# These dims will be collapsed in the output.
# To avoid edge cases in numpy want to transpose to ensure the indexed dimensions are first
# However transpose is not lazy, so want to avoid using it for dask case (??)
shoyer (Member)

This is a potential concern if the data is not loaded in dask but is still pointing to a lazily loaded array of values on disk in a netCDF file -- calling .transpose will load the entire array into memory.

That said, I think this change is still worth doing, and we shouldn't add special cases for different data types. It is not unreasonable to encourage users to use dask if they are concerned about memory usage and lazy loading.

# These dims will be collapsed in the output.
# To avoid edge cases in numpy want to transpose to ensure the indexed dimensions are first
# However transpose is not lazy, so want to avoid using it for dask case (??)
var = var.transpose(*(list(d for d in indexer_dims if d in var.dims) +
shoyer (Member)

You might transpose the entire dataset once, instead of transposing each variable in turn. It should do basically the same thing.

mangecoeur (Contributor, author)

I did that originally, but then I thought that this way, if there are variables that are not indexed at all, we can skip applying transpose to them (and avoid triggering a data load on them too). Does that make sense?


variables = OrderedDict()

for name in data_vars:
shoyer (Member)

Is it possible to use this loop for all elements in self.variables, instead of adding special cases for coords and data_vars?

mangecoeur (Contributor, author)

I will look into it. It's originally this way to preserve the behaviour expected in the tests, but now that I understand that behaviour better I might be able to simplify.

@shoyer commented Dec 24, 2016 via email

@mangecoeur

@shoyer Tidied up based on recommendations; everything is now done in a single loop (I still need to distinguish between variables and coordinates for the output, but it's a lot neater).

@shoyer left a comment

I have a few minor suggestions but generally this looks very nice!

Do you have any benchmarks you can share? It would be nice to know the magnitude of the improvement here. (It would also be a nice line to include in the release notes.)


# Add the var to variables or coords accordingly
if name in coords:
sel_coords[name] = (var_dims, selection)
shoyer (Member)

Rather than conditionally adding variables to sel_coords or variables, consider putting everything in variables and using the private constructor self._replace_vars_and_dims to return a new dataset with new updated variables. See elsewhere in this file for examples.

shoyer (Member)

Also note: recreating variables with a tuple drops attrs. It would be better to write (var_dims, selection, var.attrs) or better yet type(var)(var_dims, selection, var.attrs) to be explicit about constructing variable objects (you would need to do this if you switch to use _replace_vars_and_dims).

mangecoeur (Contributor, author)

OK, I will try that. I had originally found that the tests weren't happy if I didn't make that distinction. I think that's because the test expects variables to be classified into vars or coords, but since both vars and coords can depend on other coords, it didn't seem to infer correctly when to consider an element a 'var' and when to consider it a 'coord'. Will see if the private API fixes that.

# Special case for dask backed arrays to use vectorised list indexing
sel = variable.data.vindex[slices]
else:
# Otherwise assume backend is numpy array with 'fancy' indexing
shoyer (Member)

Add a note to remove this helper function after numpy supports vindex, with a reference to numpy/numpy#6075

@@ -1034,6 +1036,7 @@ def sel(self, method=None, tolerance=None, drop=False, **indexers):
return result._replace_indexes(new_indexes)

def isel_points(self, dim='points', **indexers):
# type: () -> Dataset
shoyer (Member)

should be # type: (...) -> Dataset

dim = as_variable(dim, name='points')
name = 'points' if not hasattr(dim, 'name') else dim.name
dim_coord = as_variable(dim, name=name)
dim = name
shoyer (Member)

It's not immediately obvious to me what this is doing. That said, this is forgivable, given that I'm pretty sure you just duplicated it from concat.

mangecoeur (Contributor, author)

This is just to make sure we fall back to the default name 'points' in case the supplied index doesn't have its own name.

Commit: use _replace_vars_and_dims, simplify new dataset creation, preserve attributes, clarify dim vs dim_name (don't re-use variable name, to reduce confusion)
@mangecoeur

Completed changes based on recommendations and cleaned up old code and comments.

As for benchmarks, I don't have anything rigorous, but I do have the following example: weather data from the CFSR dataset, 7 variables at hourly resolution, collected in one netCDF3 file per variable per month. In this particular case the difference is striking!

%%time
data = dataset.isel_points(time=np.arange(0,1000), lat=np.ones(1000, dtype=int), lon=np.ones(1000, dtype=int))
data.load()

Results:

xarray 0.8.2
CPU times: user 1min 21s, sys: 41.5 s, total: 2min 2s
Wall time: 47.8 s

master
CPU times: user 385 ms, sys: 238 ms, total: 623 ms
Wall time: 288 ms

@shoyer commented Jan 15, 2017

This looks great to me. I'll merge this shortly after releasing 0.9.0 -- could you please add a brief release note in a new section for 0.9.1? (It won't be a long wait.)

@mangecoeur

OK, will wait for 0.9.0 to be released.

"""Returns a new dataset with each array indexed pointwise along the
specified dimension(s).

This method selects pointwise values from each array and is akin to
the NumPy indexing behavior of `arr[[0, 1], [0, 1]]`, except this
method does not require knowing the order of each array's dimensions.

Will use dask vectorised operation if available
shoyer (Member)

Let's not add this comment -- implementation details usually don't make it into docstrings

mangecoeur (Contributor, author)

righto

@mangecoeur

Note: waiting for 0.9.0 to be released before updating what's new; I don't want to end up with conflicts in the docs.

@shoyer commented Jan 23, 2017 via email

@mangecoeur

OK, added a performance improvements section to the docs.

@shoyer commented Jan 23, 2017

Looks good. Unfortunately there was a merge conflict with the quantile PR so you need to merge master again.

@mangecoeur

Crikey. Fixed the merge; hopefully it works (I hate merge conflicts).

@shoyer merged commit 4bb630f into pydata:master on Jan 23, 2017
@shoyer commented Jan 23, 2017

OK, build passed so I'm merging. Thanks!
