#1161 WIP to vectorize isel_points #1162

Merged: 37 commits into pydata:master on Jan 23, 2017

Conversation

mangecoeur (Contributor)

WIP to use dask vindex for point-based selection

@@ -996,7 +996,7 @@ def sel(self, method=None, tolerance=None, **indexers):
)
return self.isel(**pos_indexers)._replace_indexes(new_indexes)

- def isel_points(self, dim='points', **indexers):
+ def isel_points_old(self, dim='points', **indexers):
shoyer (Member)

Can you remove the old function instead? That will make GitHub show a more informative diff.

mangecoeur (Contributor, author)

Done. Sorry, it's a bit of a mess doing this under deadlines.

@mangecoeur commented Dec 14, 2016

So it seems to work fine in the dask case, but I don't have a deep understanding of how DataArrays are constructed from arrays and dims, so it fails in the non-dask case. Also not sure how you feel about making a special case for the dask backend here (since up till now it was all backend-agnostic).

@shoyer left a comment

I'm OK with adding a special case for dask here, especially since the general logic will flow the same in both cases.

name=name)


return merge([v for k, v in variables.items() if k not in self.dims])
shoyer (Member)

I think you can just use the normal Dataset constructor instead of passing these into merge.

slc = [indexers_dict[k] if k in indexers_dict else slice(None, None) for k in self.dims]
coords = [self.variables[k] for k in non_indexed]

# TODO need to figure out how to make sure we get the indexed vs non indexed dimensions in the right order
shoyer (Member)

If you transpose the arrays so that the dimensions being indexed are always the leading dimensions (something like reordered = self.transpose(*(indexed_dims + non_indexed_dims))), then indexing with .vindex on a dask array and indexing directly on a numpy array will give the same result.

You really don't want to allow indexed dimensions in the middle somewhere because NumPy has unintuitive results for reordering axes in some edge cases.

@mangecoeur

@shoyer I'm down to 1 test failing locally in sel_points, but I'm not sure what the desired behaviour is. I get:

<xarray.Dataset>
Dimensions:  (points: 3)
Coordinates:
  * points   (points) int64 0 1 2
Data variables:
    foo      (points) int64 0 4 8

instead of

AssertionError: <xarray.Dataset>
Dimensions:  (points: 3)
Coordinates:
  o points   (points) -
Data variables:
    foo      (points) int64 0 4 8

But here I'm not sure if my code is wrong or the test. It seems that the test requires sel_points NOT to generate new coordinate values for points -- however, I'm pretty sure isel_points does require this (it passes, in any case). I don't really see a way in my code to generate subsets without having a matching coordinate array (I don't know how to use the Dataset constructors without one, for instance).

I've updated the test according to how I think it should work, but please correct me if I misunderstood.

@shoyer commented Dec 23, 2016

The development version of xarray includes a change that makes indexes optional. So if you use the Dataset/DataArray constructor it no longer adds new coordinates for each dimension by default.

@mangecoeur commented Dec 23, 2016

OK, I adjusted for the new behaviour and all tests pass locally; hopefully Travis agrees...

Edit: Looks like it's green

# Transpose the var to ensure that the indexed dims come first
# These dims will be collapsed in the output.
# To avoid edge cases in numpy want to transpose to ensure the indexed dimensions are first
# However transpose is not lazy, so want to avoid using it for dask case (??)
shoyer (Member)

This is a potential concern if the data is not loaded in dask but is still pointing to a lazily loaded array of values on disk in a netCDF file -- calling .transpose will load the entire array into memory.

That said, I think this change is still worth doing, and we shouldn't add special cases for different data types. It is not unreasonable to encourage users to use dask if they are concerned about memory usage and lazy loading.

# These dims will be collapsed in the output.
# To avoid edge cases in numpy want to transpose to ensure the indexed dimensions are first
# However transpose is not lazy, so want to avoid using it for dask case (??)
var = var.transpose(*(list(d for d in indexer_dims if d in var.dims) +
shoyer (Member)

You might transpose the entire dataset once, instead of transposing each variable in turn. It should do basically the same thing.

mangecoeur (Contributor, author)

I did that originally, but then I thought that this way, if there are variables that are not indexed at all, we can skip applying transpose to them (and avoid triggering a data load on them too). Does that make sense?


variables = OrderedDict()

for name in data_vars:
shoyer (Member)

Is it possible to use this loop for all elements in self.variables, instead of adding special cases for coords and data_vars?

mangecoeur (Contributor, author)

I will look into it. It's originally this way to preserve the behaviour expected in the tests, but now that I understand that behaviour better I might be able to simplify.

@shoyer commented Dec 24, 2016 via email

@mangecoeur

@shoyer Tidied up based on recommendations; everything is now done in a single loop (I still need to distinguish between variables and coordinates for the output, but it's a lot neater).

@shoyer left a comment

I have a few minor suggestions but generally this looks very nice!

Do you have any benchmarks you can share? It would be nice to know the magnitude of the improvement here. (It would also be a nice line to include in the release notes.)


# Add the var to variables or coords accordingly
if name in coords:
sel_coords[name] = (var_dims, selection)
shoyer (Member)

Rather than conditionally adding variables to sel_coords or variables, consider putting everything in variables and using the private constructor self._replace_vars_and_dims to return a new dataset with new updated variables. See elsewhere in this file for examples.

shoyer (Member)

Also note: recreating variables with a tuple drops attrs. It would be better to write (var_dims, selection, var.attrs) or better yet type(var)(var_dims, selection, var.attrs) to be explicit about constructing variable objects (you would need to do this if you switch to use _replace_vars_and_dims).

mangecoeur (Contributor, author)

OK, I will try that. I had originally found that the tests weren't happy if I didn't make that distinction. I think that's because the test expects variables to be classified into vars or coords, but since both vars and coords can depend on other coords, it didn't seem to infer correctly when to consider an element a 'var' and when to consider it a 'coord'. Will see if the private API fixes that.

# Special case for dask backed arrays to use vectorised list indexing
sel = variable.data.vindex[slices]
else:
# Otherwise assume backend is numpy array with 'fancy' indexing
shoyer (Member)

Add a note to remove this helper function after numpy supports vindex, with a reference to numpy/numpy#6075

@@ -1034,6 +1036,7 @@ def sel(self, method=None, tolerance=None, drop=False, **indexers):
return result._replace_indexes(new_indexes)

def isel_points(self, dim='points', **indexers):
# type: () -> Dataset
shoyer (Member)

should be # type: (...) -> Dataset

dim = as_variable(dim, name='points')
name = 'points' if not hasattr(dim, 'name') else dim.name
dim_coord = as_variable(dim, name=name)
dim = name
shoyer (Member)

It's not immediately obvious to me what this is doing. That said, this is forgivable, given that I'm pretty sure you just duplicated it from concat.

mangecoeur (Contributor, author)

This is just to make sure we fall back to the default name 'points' in case the supplied index doesn't have its own name.

Commit: use _replace_vars_and_dims, simplify new dataset creation, preserve attributes, clarify dim vs dim_name (don't re-use variable name, to reduce confusion)
@mangecoeur

Completed changes based on recommendations and cleaned up old code and comments.

As for benchmarks, I don't have anything rigorous, but I do have the following example: weather data from the CFSR dataset, 7 variables at hourly resolution, collected in one netCDF3 file per variable per month. In this particular case the difference is striking!

%%time
data = dataset.isel_points(time=np.arange(0,1000), lat=np.ones(1000, dtype=int), lon=np.ones(1000, dtype=int))
data.load()

Results:

xarray 0.8.2
CPU times: user 1min 21s, sys: 41.5 s, total: 2min 2s
Wall time: 47.8 s

master
CPU times: user 385 ms, sys: 238 ms, total: 623 ms
Wall time: 288 ms

@shoyer commented Jan 15, 2017

This looks great to me. I'll merge this shortly after releasing 0.9.0 -- could you please add a brief release note in a new section for 0.9.1? (It won't be a long wait.)

@mangecoeur

OK, will wait for 0.9.0 to be released.

"""Returns a new dataset with each array indexed pointwise along the
specified dimension(s).

This method selects pointwise values from each array and is akin to
the NumPy indexing behavior of `arr[[0, 1], [0, 1]]`, except this
method does not require knowing the order of each array's dimensions.

Will use dask vectorised operation if available
shoyer (Member)

Let's not add this comment -- implementation details usually don't make it into docstrings

mangecoeur (Contributor, author)

righto

@mangecoeur

Note: waiting for 0.9.0 to be released before updating what's new; I don't want to end up with conflicts in the docs.

@shoyer commented Jan 23, 2017 via email

@mangecoeur

OK, added a performance improvements section to the docs.

@shoyer commented Jan 23, 2017

Looks good. Unfortunately there was a merge conflict with the quantile PR so you need to merge master again.

@mangecoeur

Crikey. Fixed the merge; hopefully it works (I hate merge conflicts).

@shoyer merged commit 4bb630f into pydata:master on Jan 23, 2017
@shoyer commented Jan 23, 2017

OK, build passed so I'm merging. Thanks!
