WIP: Implement oindex #6075

Closed · wants to merge 21 commits

Conversation

@seberg (Member) commented Jul 14, 2015

Seems I was productive during travel.

Should be good enough to try out. If someone can contribute, please do; it will probably be a while before I look at it again seriously.

@seberg force-pushed the oindex branch 2 times, most recently from 0562b2f to 7145b07 on July 14, 2015 11:31
@seberg (Member Author) commented Jul 14, 2015

See also a start for an NEP at https://gist.github.com/seberg/976373b6a2b7c4188591

Of course, some of the things are a bit from my perspective; I did not actually run the examples, so I won't guarantee they are all correct ;) (I know that @njsmith was pondering some more restrictions in some places with boolean indices, but I do not see the reason for that right now.)

Same again: don't expect instant followup, it was more a way to keep me awake....
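
For readers new to the distinction the NEP draws, here is a minimal illustration using current NumPy, where np.ix_ emulates outer indexing for pure integer-array indices:

```
import numpy as np

a = np.arange(12).reshape(3, 4)

# Vectorized ("vindex"-style) indexing: arrays are broadcast together,
# picking one element per broadcast position.
a[[0, 2], [1, 3]]             # array([ 1, 11]), shape (2,)

# Outer ("oindex"-style) indexing: each array selects along its own
# axis, like a cross product; np.ix_ emulates this today.
a[np.ix_([0, 2], [1, 3])]     # shape (2, 2)
```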

@mhvk (Contributor) commented Jul 14, 2015

@seberg - the fancy indexing has me flummoxed often enough that I'd welcome any simplification! Two broad comments (not sure if you prefer them here or on your NEP; happy to repost):

  1. It would be lovely to have some simple way to use the output of np.arg[min|max|sort] as a proper index, to avoid hackery as in astropy; see the sketch below. (If there is a simpler way already, let me know!)
  2. Instead of adding two new methods, might it be better to add a keyword argument to take? E.g., method=[fancy|outer|vector], with the default fancy giving current behaviour to preserve backwards compatibility.
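
To illustrate the hackery meant in point 1 (a sketch, not astropy's actual code): argsort output cannot be used directly as an index, so one currently pairs it with an explicit arange index:

```
import numpy as np

a = np.array([[3, 1, 2],
              [9, 7, 8]])
idx = np.argsort(a, axis=1)

# np.take(a, idx, axis=1) does not sort each row; the companion arange
# index is needed to get the "inverse of argsort" behaviour:
a[np.arange(a.shape[0])[:, None], idx]   # each row sorted
```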

@seberg (Member Author) commented Jul 14, 2015

Well, take is a single integer-array index; it does not have all of these problems at all, so I don't really think that applies.

About your point 1: a function which does this would be nice. One could of course think about doing something like indexing (to also allow slices, etc.), but I am not clear on how you would do it. Something like np.pick(arr, argmin_res, axis=...) one could maybe do (not sure about the details now though, I did not think about it).

@jaimefrio (Member) commented
Perhaps a new make_me_an_indexing_tuple=False keyword argument for the argxxx functions?

@mhvk (Contributor) commented Jul 14, 2015

@jaimefrio - yes, probably much more sensible -- see #6078.

@seberg - the name is of course an implementation detail, though it is helpful to have an obvious one. For oindex, if I understand correctly, it is really a generalization of a slice: along each axis, instead of indices in a fixed pattern as set by slice(start, stop, step), one can have an arbitrary array of indices. Names like slice or subset would be more obvious to me (although the former might suggest it is a view, which this will not generally be).

The logic in suggesting take was that one can think of oindex as a more general form of take, if it were defined by (! marks lines that differ from the current docstring):

```
  Definition:  np.take(a, indices, axis=None, out=None, mode='raise')
  Docstring:
! Take elements from an array along one or more axes.

! This function does the same thing as "outer" indexing (indexing arrays
  using arrays); however, it can be easier to use if you need elements
! along one or more given axes.

  Parameters
  ----------
  a : array_like
      The source array.
! indices : array_like or list of array_likes and slices
      The indices of the values to extract.

      .. versionadded:: 1.8.0

      Also allow scalars for indices.
!
!     .. versionadded:: 1.11.0
!
!     Allow multiple axes
!
! axis : int or tuple of int, optional
      The axis over which to select values. By default, the flattened
!     input array is used for a single indices array, or the number of
!     axes equal to the length of the list of indices.
!
!
! a = np.arange(2*3*4).reshape(2, 3, 4)
! np.take(a, [1, 2], axis=1).shape
! (2, 2, 4)
!
! np.take(a, [slice(None), [1,2], [0, 3]]).shape
! (2, 2, 2)
!
! np.take(a, [[1,2], [0, 3]], axis=(1, 2)).shape
! (2, 2, 2)
!
! np.take(a, [[0], [1,2]]).shape
! (1, 2, 4)
```

@seberg (Member Author) commented Aug 10, 2015

Just to note, I added a small paragraph about this problem to the NEP; I think I will mail it to the list (unless someone feels I should add something) shortly after the 1.10 release.
One thing is that np.take has incompatible default logic for an index_arr with more than one dimension: you would need a second axes argument (to map the identically iterated axes, with the old one saying along which axis you take), and the defaults could not be the arg* inverse.
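
To make the incompatibility concrete (an illustration, not code from the PR):

```
import numpy as np

a = np.array([[1, 9],
              [8, 2]])
i = np.argmax(a, axis=1)   # array([1, 0])

# np.take's default treats i as selecting whole columns:
np.take(a, i, axis=1)      # shape (2, 2), not the row maxima

# The "arg* inverse" pairs each row with its own index instead:
a[np.arange(2), i]         # array([9, 8])
```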

@seberg (Member Author) commented Nov 29, 2015

Updated. It now implements everything in the NEP: oindex, vindex and lindex. However, it does not yet throw a fit when plain indexing is potentially not clear (but that should be a trivial addition).

It currently works by broadcasting the arrays; in principle it could be slightly faster by instead using axes reordering in nditer, but bleh ;).
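
A rough Python sketch of the broadcasting approach, assuming 1-D integer index arrays (the actual C implementation handles much more):

```
import numpy as np

def outer_via_broadcast(a, indices):
    # Give each 1-D index array its own axis so the arrays broadcast
    # "outer"-style, then reuse ordinary fancy indexing.
    n = len(indices)
    expanded = tuple(
        np.asarray(ind).reshape((1,) * i + (-1,) + (1,) * (n - 1 - i))
        for i, ind in enumerate(indices)
    )
    return a[expanded]

a = np.arange(12).reshape(3, 4)
outer_via_broadcast(a, ([0, 2], [1, 3]))   # same as a[np.ix_([0, 2], [1, 3])]
```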

@seberg (Member Author) commented Nov 29, 2015

Boooo! It should now also (hopefully) give the "unclear" warnings -- not sure if with the exact rules as in the NEP -- if you set deprecation warnings to always/error.

@homu (Contributor) commented Jan 19, 2016

☔ The latest upstream changes (presumably #7027) made this pull request unmergeable. Please resolve the merge conflicts.

@homu (Contributor) commented May 24, 2016

☔ The latest upstream changes (presumably #7667) made this pull request unmergeable. Please resolve the merge conflicts.

@homu (Contributor) commented Feb 16, 2017

☔ The latest upstream changes (presumably #8043) made this pull request unmergeable. Please resolve the merge conflicts.

@eric-wieser (Member) commented

@mhvk:

> It would be lovely to have some simple way to use the output of np.arg[min|max|sort] as a proper index

I've filed an issue for this at #8708. You can mostly leverage np.ix_ here.

@mattip (Member) commented Jun 21, 2018

Does this need a rebase/conflict resolution for the NEP discussion to progress?

seberg added 2 commits July 29, 2018 17:52
Implement

```
multiindex.prepared(dtype=None, shape=None, convert_booleans={"not_single", "always"})
```
which gives some information. On field access (which is not possible to reach) it would
return `{"type": "field-access", "orig_index": orig_index}`; on a non-field index it
returns a much larger dict with most information.
@seberg (Member Author) commented Jul 29, 2018

Tests should be passing now. I added a prepared_index to "multiindex" (bad name for now), which cannot be reached for normal indexing right now, but we can expose that. What this does can be tested with:

```
In [1]: class subclass(np.ndarray):
   ...:     def __getitem__(self, obj):
   ...:         if isinstance(obj, np.core.multiarray._multiindex):
   ...:             print(obj.prepared(convert_booleans="not_single"))
   ...:         return super().__getitem__(obj)
   ...:

In [2]: arr = np.zeros((2, 3)).view(subclass)

In [3]: arr.oindex[[0, 1], [0, 1]]
{'type': 'index', 'method': 'oindex', 'orig_index': ([0, 1], [0, 1]), 'view': False, 'simplified_index': (array([0, 1]), array([0, 1])), 'scalar': False, 'ellipsis_dims': None, 'result_ndim': 2}
Out[3]:
subclass([[0., 0.],
          [0., 0.]])

In [4]: arr.vindex[[0, 1], [0, 1]]
{'type': 'index', 'method': 'vindex', 'orig_index': ([0, 1], [0, 1]), 'view': False, 'simplified_index': (array([0, 1]), array([0, 1])), 'scalar': False, 'ellipsis_dims': None, 'result_ndim': 1}
Out[4]: subclass([0., 0.])

In [5]: arr.oindex[arr > 0]
{'type': 'index', 'method': 'oindex', 'orig_index': subclass([[False, False, False],
          [False, False, False]]), 'view': False, 'simplified_index': subclass([[False, False, False],
          [False, False, False]]), 'scalar': False, 'ellipsis_dims': None, 'result_ndim': 1}
Out[5]: subclass([], dtype=float64)

In [6]: arr.oindex[True, 0, ...]
{'type': 'index', 'method': 'oindex', 'orig_index': (True, 0, Ellipsis), 'view': False, 'simplified_index': (True, 0, Ellipsis), 'scalar': False, 'ellipsis_dims': 1, 'result_ndim': 2}
Out[6]: subclass([[0., 0., 0.]])

In [7]: arr.oindex[np.array(True), 0, ...]
{'type': 'index', 'method': 'oindex', 'orig_index': (array(True), 0, Ellipsis), 'view': False, 'simplified_index': (True, 0, Ellipsis), 'scalar': False, 'ellipsis_dims': 1, 'result_ndim': 2}
Out[7]: subclass([[0., 0., 0.]])

In [8]: arr.vindex[0, 1]
{'type': 'index', 'method': 'vindex', 'orig_index': (0, 1), 'view': False, 'simplified_index': (0, 1), 'scalar': True, 'ellipsis_dims': None, 'result_ndim': 0}
Out[8]: 0.0

In [9]: arr.vindex[0, 1, ...]
{'type': 'index', 'method': 'vindex', 'orig_index': (0, 1, Ellipsis), 'view': True, 'simplified_index': (0, 1, Ellipsis), 'scalar': False, 'ellipsis_dims': 0, 'result_ndim': 0}
Out[9]: subclass(0.)

In [10]: arr.vindex[0, np.array(1), ...]
{'type': 'index', 'method': 'vindex', 'orig_index': (0, array(1), Ellipsis), 'view': False, 'simplified_index': (0, 1, Ellipsis), 'scalar': False, 'ellipsis_dims': 0, 'result_ndim': 0}
Out[10]: subclass(0.)
```

@hameerabbasi would this type of thing help you? Note a few oddities you cannot see... 0-D booleans are converted to False/True (for plain indexing, which is not available). Also, the simplified_index includes the Ellipsis to be safer, but if view is False and you just use the simplified_index, you will have to enforce the copy manually! (This is a bit tedious, but it is to signal that it would otherwise be creating a scalar array.)

EDIT: To note, valid arguments for convert_booleans are "always" (not 0D) and "not_single". With "not_single", a single boolean array is given without the tuple wrapping it; all other indices are wrapped in tuples. Note that all arrays are converted to intp and all scalars to Python integers. I think we might be able to guarantee that, because we could add a convert_to_intp=True kwarg later.

EDIT: Oops, about 0-D booleans being True/False: this can be reached of course, I just had a bug.

@seberg (Member Author) commented Jul 29, 2018

Another note: this does not expose how outer indexing itself is done (e.g. the expanded arrays fed into "typical" indexing plus the transpose). We could expose this to some degree, but it is probably more involved.

@seberg (Member Author) commented Jul 29, 2018

Ah, screw it: the simplified_index is of course nonsense for oindex as is, because boolean indices would need to be grouped somehow.

EDIT: OK, one way around it would be to actually do the dimension expansion necessary to broadcast everything to a fancy index. The actual transpose would be a different issue, but I think we could report that as well.

EDIT: Or something like a tuple of arrays (note that a 1-D boolean index is OK), but without more hacks, that fully means that numpy does not understand that simplified index anymore....

```c
#define PLAIN_INDEXING 1
#define OUTER_INDEXING 2
#define VECTOR_INDEXING 4
#define FANCY_INDEXING 8
```
Member commented:
This seems to imply that these can be or'd together - does that make any sense at all?

Member Author replied:
Yeah, you are right, but before going too much in depth, we might want to mostly discuss the gist. There are also larger code blocks that need to be deleted due to not supporting bools in vindex. It's pretty alpha, though with the code removal and disabling that prepared thingy, it should probably work fine (plus making the warnings more conservative).

```c
PyObject *index;     /* The indexing object */
int indexing_method; /* See mapping.h */
/* If bound is 1, the following are information about the array */
int bound;
```
@eric-wieser (Member) commented Jul 29, 2018:
What purpose does binding serve? If it's useful, I'd be inclined to have separate Index and BoundIndex classes.

Member Author replied:
You do not have to pass in the shape and dtype again if you are a subclass, but I guess forcing the subclass to write prepared(self.shape, self.dtype) or .prepared(self) is OK too.

@eric-wieser (Member) commented Jul 30, 2018:
Whose job it is to do the binding seems kinda orthogonal; what I'm saying is that this would be better represented with multiple classes rather than a single class with a bunch of flags, especially given the precedent for bound objects in Python. It sounds like the only value in binding is to make the prepared() function work anyway? Perhaps that could be exposed as index.bind(shape, dtype).data.
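
A minimal Python sketch of the split being suggested here; all names are hypothetical and only show the shape such an API could take:

```
class Index:
    """Unbound index: just wraps the original indexing object."""
    def __init__(self, orig_index):
        self.orig_index = orig_index

    def bind(self, shape, dtype=None):
        return BoundIndex(self.orig_index, shape, dtype)


class BoundIndex(Index):
    """Index resolved against a concrete shape (and optionally dtype)."""
    def __init__(self, orig_index, shape, dtype=None):
        super().__init__(orig_index)
        self.shape = shape
        self.dtype = dtype

    @property
    def data(self):
        # Here the resolved information (view/scalar flags, simplified
        # index, result_ndim, ...) would be computed and returned.
        raise NotImplementedError
```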

Member Author replied:
Fair point, it's much like set and frozenset. We should probably only have the bound version anyway, expose all of these as attributes, and just allow creating a new one from an old one with a new shape (and dtype); could be a method or not.

I got a bit lost in the fastest way to return something instead of thinking about what is nicest. My only real interest currently is whether such info can help subclasses or not.

@hameerabbasi (Contributor) commented Jul 30, 2018

@seberg Thanks for the ping!

The problem here is that np.core.multiarray._multiindex itself is private. Maybe a function that'll convert a tuple?

I doubt anyone will need to know how .oindex is implemented in terms of legacy indexing. What would be nice is .oindex in terms of .vindex. AFAIK, Dask and XArray do not implement (or have plans to implement) legacy indexing, although I believe XArray plans to implement .oindex and .vindex, and so do I (in pydata/sparse).

@mattip It seems NEP-21 (which this is an implementation of) is still in the draft stage. We should ping the mailing list for acceptance before we actually merge this.

This is an implementation detail for sure, but I believe that since .vindex is simpler and more understandable than legacy indexing, we should flip the logic and actually implement legacy indexing in terms of .vindex. This may lead to a small performance gain and easier maintenance in the future if we ever want to change the default behaviour of __getitem__. Otherwise we might have the rather ugly legacy indexing as a base forever.

@seberg (Member Author) commented Jul 30, 2018

@hameerabbasi, this is far from ready, since it is lacking a lot in terms of tests, etc. The API would definitely not look as it does now. What I could imagine right now is to rather expose this as:

```
class NDIndex(index, array=None, shape=None, dtype=None, method="plain", convert_booleans=?):
    properties:
        type : {"indexing", "field access"}
        method : method
        view : if-result-is-view
        scalar : if-result-is-scalar
        vector_index : simplified_index_but_expanded_dims
        vector_transpose : transpose rule for the index result
```

I somewhat think that simply not allowing field access might be an option as well (would remove the need for that dtype). If this is supposed to really give information, then I think keeping the original index around is probably unnecessary.
Getting the transpose rule might be a bit annoying, but probably not so bad; I have to read the code a bit to be sure. If it helps subclasses quite a bit, that sounds fine. The problem is some lock-in to supporting this output.

Also just to note: numpy will cast all vector inputs (also the booleans) to intp arrays during the preprocessing (which this exposes), is that even desired for projects like dask/xarray?

@hameerabbasi (Contributor) commented Jul 30, 2018

The view thing might be useful for Dask in particular; they might use it to determine if they should create a view, and if this created view will use any extra memory. I don't believe they do this currently. Sparse arrays will have different view semantics, so we probably won't use it in the short term, but we might long-term. XArray probably won't need it.

The scalar thing is useful to know for pydata/sparse, currently we do something hacky depending on if the indices are incomplete and the last index is an ellipsis. I can't speak for Dask or XArray.

The vector_index thing should probably not be broadcast to the output dims. I can see myself using .flatten, and since this would not produce a view for broadcast dimensions, memory usage could blow up for something like .oindex. It's probably best if it were an input to .vindex; one can easily broadcast it with np.broadcast_arrays.
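
For instance, the compact index arrays can be kept around and expanded only on demand; np.broadcast_arrays returns views, so this stays cheap until something forces a copy:

```
import numpy as np

rows = np.array([[0], [2]])   # shape (2, 1)
cols = np.array([1, 3])       # shape (2,)

r, c = np.broadcast_arrays(rows, cols)   # both shape (2, 2), views
```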

> Getting the transpose rule might be a bit annoying, but probably not be so bad, have to read the code a bit though to be sure, so if it helps subclasses quite a bit, sounds fine. The problem is some lock-in to supporting this output.

The vector_transpose rule is useful if someone actually wants to implement legacy indexing. I can speak for myself: I don't plan to. I'm not sure if XArray or Dask will be interested in this, but given that no one has put in the effort to support it so far, nor have there been significant requests for it: probably not. I can probably explain the rule to you now (from the docs): if not all scalars and arrays (so-called advanced indices) are next to each other, then the "advanced index dimension" goes at the start; otherwise, it goes where these advanced indices were in the original array.
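
The rule is easy to demonstrate with current NumPy:

```
import numpy as np

a = np.zeros((2, 3, 4, 5))

# Advanced indices adjacent: the broadcast dimension stays in place.
a[:, [0, 1], [0, 1], :].shape   # (2, 2, 5)

# Advanced indices separated by a slice: the broadcast dimension moves
# to the front of the result.
a[[0, 1], :, [0, 1], :].shape   # (2, 3, 5)
```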

> numpy will cast all vector inputs (also the booleans) to intp arrays during the preprocessing (which this exposes), is that even desired for projects like dask/xarray?

AFAICT, I don't believe they'll care so long as it isn't worse than before.

@shoyer (Member) commented Jul 30, 2018 via email

@seberg (Member Author) commented Jul 30, 2018

Well, thinking about it a little, my current take would be:

  1. Advanced array-likes such as xarray or dask probably have little to no use for this anyway
     (they might strive to test against it, but the best we could do there is to make a list of the
     odder parts, so they can decide what to do).
  2. Since any subclass can choose not to use such an Index object, we should probably just do what
     comes relatively easy and not wonder too much about it. If xarray/dask needs code to decide
     whether a view should be returned, we can still help out with its own piece of code.
  3. Even the oindex transpose is tricky, at least once you allow for boolean arrays (it is somewhat
     OK for 1-D bools). This is the reason why I suggested vector_index: the vector index would be
     the correct index for .vindex to get the same result (minus the back transpose, and possibly
     difficulties with 0-D bools, I am not quite sure there). vector_index is a bit annoying, but it
     has the advantage that it can handle higher-dimensional booleans fine.
  4. The vector_transpose rule should be OK to implement and also means you can implement oindex
     immediately if you have vindex.
  5. About the binding: I decided to always force it. If someone wants to do funny stuff, they can
     get the original index themselves and feed it in again.

@hameerabbasi (Contributor) commented
> About the binding, decided to always force it. If someone wants to do funny stuff, they can get
> the original index themselves and feed it in again.

As long as there will be an externally exposed method that does the same thing.

Remember that pydata/sparse, XArray and Dask aren't subclasses; they're actually independently implemented duck arrays. Subclasses are things like astropy.Quantity, MaskedArray and np.matrix. Subclasses are not really recommended and are slowly being discouraged.

@seberg (Member Author) commented Jul 31, 2018

@hameerabbasi yes, of course, which is why above I always had shape=..., dtype=..., since that allows replicating things. Now, what I think we can do, and what would probably be nice, is to basically provide the tools such that if you have transpose and vindex defined, oindex (and all the others) can be supported trivially as well (there may be problems with 0-D booleans, I have to check/think about it).

But to get to the transpose part, we should probably clean up some technical debt in the MapIter code :( -- nothing huge, but also not very quick maybe.

@seberg (Member Author) commented Jul 31, 2018

So, actually, to really move on here, probably the best thing is to leave it at that: the Index object is probably a good idea, and for now it could really just store the original index (or the prepared one, it hardly matters; except that if it is the prepared one, we will need to enforce the same shape and fields in the dtype, so that's a bit annoying maybe).

And then continue trying to make the life of things such as pysparse easier afterwards. I have pointed out what we can do reasonably or with a bit more work. But for the next 2-4 months I really shouldn't do much here :(.

@seberg (Member Author) commented Aug 12, 2020

@hameerabbasi This is the old PR, which is probably largely finished. I expect the main open issues are some boolean indexing decisions (which I honestly do not think matter much either way) and how to handle subclasses (i.e. I think we may need a helper object to indicate which type of indexing is going on, so that a subclass implementing only __getitem__ probably works with the new methods as well).

Since there are deprecation tests up there, I guess that I already did some deprecations. I would suggest, if we pick this up, that we do not include any deprecations in the first iteration, but only add new features. That way existing tests can definitely stay untouched; deprecations would create a lot of churn (and merge conflicts).

This is high impact, but it stalled, probably largely because it was never reviewed. I do not want to pick it up myself, but I may be able to review it. This was written by myself in 2015 or so, so effectively, even without many code changes, that is a different person reviewing ;).

EDIT: Anyone picking this up, should probably also push officially accepting the corresponding NEP.

@hameerabbasi self-assigned this on Aug 14, 2020
Base automatically changed from master to main March 4, 2021 02:03
@charris (Member) commented Apr 21, 2021

@seberg Needs rebase.

@hameerabbasi removed their assignment on Apr 21, 2021
@seberg (Member Author) commented Apr 11, 2022

This would still be nice to revive! It may not be an insane amount of work. One thing that still needs to be settled is how exactly to work with subclasses. It may be OK to break them, but it would be nice to make sure that a subclass like masked arrays can keep working reasonably well.

There should be a few ideas for that discussed here (or in old mailing-list posts). In general, this was pretty far along, with that exception and probably code cleanup.

@seberg closed this on Apr 11, 2022
@seberg added the "64 - Good Idea" label (Inactive PR with a good start or idea. Consider studying it if you are working on a related issue.) on Apr 11, 2022
Labels: 01 - Enhancement · 25 - WIP · 64 - Good Idea · component: numpy._core