NEP: add default-dtype-object-deprecation nep 34 #14674


Merged · 11 commits into numpy:master · Oct 31, 2019

Conversation

@mattip (Member, Author) commented Oct 10, 2019

NEP to deprecate a = np.array([[1, 2], [1]]) without explicitly stating dtype=object.

xref gh-5303, related to gh-13913, gh-14341.
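For context, a minimal sketch of the pattern being deprecated (the results shown assume the pre-deprecation behavior this NEP starts from):

```
import numpy as np

# Ragged nested sequence: the inner lists have different lengths, so NumPy
# silently falls back to a 1-D array of dtype=object.
a = np.array([[1, 2], [1]])
print(a.dtype, a.shape)   # object (2,)

# Under this NEP, that fallback must be requested explicitly:
a = np.array([[1, 2], [1]], dtype=object)
```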

@rgommers (Member):

A few high level comments:

  • Title and description are unclear, is it about "list of lists" specifically, something a little broader (like tuple of lists/tuples too), or any kind of sequence?
  • Your current text says you want to raise an error for explicit dtype=object as well in some cases. This seems odd; explicit dtype=object should preserve current behavior, I'd think.
  • Related to the previous point: np.ragged_array_object is not a good idea, just make dtype=object not raise.
  • Please add some rationale for why to special-case this type of input, rather than making all object array creation raise unless it's explicitly requested.

@eric-wieser (Member):

np.ragged_array_object is not a good idea, just make dtype=object not raise.

I believe it is a good idea, because it would have a non-optional array_depth argument. This would solve the following problem:

a = np.ragged_array_object([[1, 2], [3, 4]], array_depth=1)
b = np.ragged_array_object([[1, 2], [3, 4, 5]], array_depth=1)
assert a.ndim == b.ndim == 1

Today there is no easy way to construct a.

The name is awful, but I think to be useful at all, a depth argument is needed.
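A sketch of why there is no easy way today (assuming pre-deprecation NumPy behavior): with equal-length inner lists, dtype=object does not stop shape discovery, so a cannot be built directly.

```
import numpy as np

# Equal-length inner lists: shape discovery still goes to depth 2, giving a
# (2, 2) array of scalars instead of a length-2 array holding two lists.
a = np.array([[1, 2], [3, 4]], dtype=object)
print(a.shape)   # (2, 2)

# Only the ragged input happens to stop at depth 1.
b = np.array([[1, 2], [3, 4, 5]], dtype=object)
print(b.shape)   # (2,)
```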

@rgommers (Member):

This would solve the following problem:

Try to look at this from an end user perspective. Why is this a problem worth solving? Why would you want such behavior? In every case I can think of, there are much better alternatives.

Another way to say this: if you would design such a feature without knowing about the current issue that gave rise to this NEP, would you ever propose np.ragged_array_object for inclusion in NumPy?

@eric-wieser (Member) commented Oct 13, 2019

Why would you want such behavior?

We're all agreed I think that no one wants the behavior of np.array([[1, 2], [1]])

My claim is that additionally no one wants the behavior of np.array([[1, 2], [1]], dtype=object). To illustrate that, consider:

student_lists = np.array([class1.students, class2.students], dtype=object)
assert student_lists.ndim == 1

This code works just fine unless by some coincidence the two classes have the same number of students. Because of this, I strongly think np.array([[1, 2], [1]], dtype=object) should also be an error.

What remains is the question "what is the correct way to write the above?". Today, the best I can come up with is the wasteful:

def ragged_array_object(seq, depth):
    arr = np.array(seq, dtype=object)  # this relies on the broken behavior!
    assert arr.ndim >= depth
    arr2 = np.empty(arr.shape[:depth], dtype=object)
    arr2[...] = arr
    return arr2

Perhaps forcing the user to do this workaround is ok (edit: forcing the user into a workaround would also break this particular workaround, since it relies on the behavior being removed), but removing np.array([[1, 2], [1]], dtype=object) would be a lot more palatable if we provided an easy replacement.

would you ever propose np.ragged_array_object for inclusion in NumPy?

Something similar to it, yes - I've repeatedly found myself wanting np.ragged_array_object(some_obj, depth=0).

@rgommers (Member):

We're all agreed I think that no one wants the behavior of np.array([[1, 2], [1]])

yes, I think so too

My claim is that additionally no one wants the behavior of np.array([[1, 2], [1]], dtype=object).

Perhaps, but not because of some minor inconsistency; more because object arrays are very hard to work with anyway, and hence not very useful.

What remains is the question "what is the correct way to write the above?".

That is not what remains; that's what I'm trying to make clear with my questions above. What remains is: are you (the user) really trying to work with ragged arrays here? If so, how good do you want that experience to be? NumPy is limited and is unlikely to ever get nice, well-thought-out support for ragged arrays. NumPy can try to make it less likely that users who don't want ragged arrays shoot themselves in the foot (hence, raising an error by default). Users who do want ragged arrays are probably better off using Arrow (I think, not an expert) or XND (which has proper ragged array support), or staying with lists of lists.

Perhaps forcing the user to do this workaround is ok, but removing np.array([[1, 2], [1]], dtype=object) would be a lot more palatable if we provided an easy replacement.

I think just documenting some workaround is much better, for the unlikely case where a user really wants this kind of object array (she probably doesn't!).

@seberg (Member) commented Oct 14, 2019

My 2 cents about ragged arrays: I am not sure I like a special function for it, but I think usage out there is large enough that we cannot just ignore it. The long way to write it is:

arr = np.empty(correct_shape, dtype=object)
arr[...] = values

This is fairly short, but not very discoverable; plus you need to know the exact shape rather than just the number of dimensions, which could be enough.
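As a concrete sketch of that pattern (the values here are assumed, and per-element assignment is used to avoid shape discovery on the right-hand side entirely):

```
import numpy as np

# The outer shape must be known up front; only the elements are objects.
values = [[1, 2], [3, 4, 5]]
arr = np.empty(len(values), dtype=object)
for i, v in enumerate(values):
    arr[i] = v            # per-element assignment; no shape discovery
print(arr.shape)          # (2,)
print(arr[1])             # [3, 4, 5]
```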

The next way would be to add a specific ndim=3 kwarg, so that we can stop shape discovery when we hit the correct dimension (we already do this internally; the ndim limit on the Python level is just always 32).
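A sketch of how such a kwarg might be used; ndim= here is purely hypothetical and not an existing np.array parameter:

```
# Hypothetical API only -- `ndim=` is the proposed kwarg, not part of NumPy.
# Shape discovery would stop after one dimension, so both inputs would give
# a 1-D object array of lists, regardless of whether they are ragged.
a = np.array([[1, 2], [3, 4]], dtype=object, ndim=1)     # shape (2,) (proposed)
b = np.array([[1, 2], [3, 4, 5]], dtype=object, ndim=1)  # shape (2,) (proposed)
```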

The last way would require an additional hook for dtypes (which we can always add later if we do new dtypes, and which actually would also somewhat make sense to solve the "tuple is scalar" issue for void coercion): provide a way to ask the DType "Is this a scalar or a sequence?" during coercion.

We could then write np.array([...], dtype=PyObject[np.ndarray]) (assuming the inner sequences are arrays in this case). I have not looked into that yet, because it seems to me that it is easier to just special-case tuples+void right now.

@seberg (Member) commented Oct 14, 2019

We may have to decide here (or at least mention) whether we want to do the same thing for the two other cases which create object arrays (possibly surprisingly):

  1. Failure to promote: np.array([np.array((1,2), "i,i"), 1]) (note that the conversion to string or string+floats is not as such a failure to promote right now, because we actually define it specifically)

  2. The dtype of large Python integers can fluctuate randomly (especially on Windows).

Point 2 is likely a different issue (or at least is much easier to address on its own). The first one may overlap with this NEP, depending on where we want to fix it. Note that if we are not in a hurry, I am hoping to rewrite the whole thing for new dtypes (I have actually done so, but it needs cleaning up). Although that does not matter too much w.r.t. starting this now in the old code already.

@mattip (Member, Author) commented Oct 24, 2019

Updated to put off discussion of depth until later, reformatted to use a Usage and Impact section.

Comment on lines 89 to 94
It was also suggested to add a kwarg `depth` to array creation, or perhaps to
add another array creation API function `ragged_array_object`. The goal was
to eliminate the ambiguity in creating an object array from `array([[1, 2],
[1]], dtype=object)`: should the returned array have a shape of `(1,)`, or
`(2,)`? This NEP does not deal with that issue, and only deprecates the use of
`array` with no `dtype=object` for ragged arrays.
Reviewer (Member):

Use double backticks (RST literal markup) throughout.

Reviewer (Member):

Perhaps worth adding "as a consequence of choosing not to deal with this issue, users of ragged arrays may be faced with a second deprecation cycle in the future" or something.

I think I agree with declaring this out of scope - there's a lot of value to making noise when ragged arrays weren't intended.

Reviewer (Member):

there's a lot of value to making noise when ragged arrays weren't intended.

I agree

I think I agree with declaring this out of scope

I don't mind either, but I would suggest to do whatever is easier (I don't know what that is). If just ragged arrays are easier, perhaps give that as a rationale. If we have to jump through extra hoops to do just ragged arrays (IIRC we failed once before?), then why bother?

Decimal(10)])``. This too is out of scope for the current NEP: only if all the
top-level elements are `sequences`_ will we require an explicit
``dtype=object``.
- It was also suggested to deprecate all automatic creation of ``object``-dtype
Reviewer (Member):

Perhaps drop "It was also suggested to" and "we could", and phrase these all in the imperative: "Deprecate all", "Continue with", "Add a kwarg".

arrays, which would require a dtype for something like ``np.array([Decimal(10),
Decimal(10)])``. This too is out of scope for the current NEP: only if all
the top-level elements are `sequences`_ will we require an explicit
``dtype=object``.
Reviewer (Member):

This is only if they are ragged, right? np.array([[[Decimal(1)]]]) is still fine?

@mattip (Member, Author):

the intention is that it should still just work
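(A quick sketch of the expected result, assuming the current fallback behavior: the input is nested but not ragged, so only the dtype falls back to object and no warning is intended.)

```
from decimal import Decimal
import numpy as np

# Not ragged: shape discovery succeeds; only the unknown scalar type falls
# back to dtype=object.
a = np.array([[[Decimal(1)]]])
print(a.shape, a.dtype)   # (1, 1, 1) object
```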

Reviewer (Member):

In that case, I think this sentence needs rewording

@mattip (Member, Author):

removed the confusing last clause of the sentence

@mattip (Member, Author) commented Oct 24, 2019

Copying some valuable comments in case they get lost in the rewrites

... I would suggest to do whatever is easier (I don't know what that is). If just ragged arrays are easier, perhaps give that as a rationale. If we have to jump through extra hoops to do just ragged arrays (IIRC we failed once before?), then why bother (i.e., fail if no dtype is specified, mattip)?

Since you call out ndarray here - will this start failing?

outer = np.array([None, None])
outer[0] = outer[1] = np.array([1, 2, 3])
np.array(outer).shape  # today: (2,)
np.array([outer]).shape  # today: (1, 2,)

outer_ragged = np.array([None, None])
outer_ragged[0] = np.array([1, 2, 3])
outer_ragged[1] = np.array([1, 2, 3, 4])
# will both of these emit warnings?
np.array(outer_ragged).shape  # today: (2,)
np.array([outer_ragged]).shape  # today: (1, 2,)

Examples of things that need decisions but should probably still work:

np.array([[[Decimal(1)]]])


@mattip (Member, Author) commented Oct 24, 2019

The previous failure was trying to change the error message of array(<lists-of-lists>, dtype=int) to something that indicated ragged arrays.

As for the examples, maybe we should get a PR going, mark it WIP and see how painful this all is and what we can and cannot detect.

@mattip (Member, Author) commented Oct 28, 2019

xref gh-14794

@mattip changed the title from "NEP: add default-dtype-object-deprecation nep" to "NEP: add default-dtype-object-deprecation nep 34" on Oct 28, 2019
@mattip (Member, Author) commented Oct 29, 2019

I think this is ready for the mailing list? From NEP 000

Once the PR is in place, the NEP should be announced on the mailing list for discussion. Discussion about implementation details will take place on the pull request, but once editorial issues are solved, the PR should be merged, even if with draft status. The mailing list e-mail will contain the NEP up to the section titled “Backward compatibility”,

cycle in the future.

- It was also suggested to deprecate all automatic creation of ``object``-dtype
arrays, which would require a dtype for something like ``np.array([Decimal(10),
Reviewer (Member):

I wrote a comment "this sentence doesn't make sense to me. Why would that require a new dtype?" and before hitting the "add comment" button I realized you mean "...would require adding `dtype=object` for ...". I suggest that as a rephrase.

Reviewer (Member):

Or possibly require adding an explicit ``dtype=object`` for ...

Backward compatibility
----------------------

Anyone depending on ragged nested sequences creating object arrays will need to
Reviewer (Member):

Suggested change:
- Anyone depending on ragged nested sequences creating object arrays will need to
+ Anyone depending on creating object arrays from ragged nested sequences will need to

``(1,)``, or ``(2,)``? This NEP does not deal with that issue, and only
deprecates the use of ``array`` with no ``dtype=object`` for ragged nested
sequences. Users of ragged nested sequences may face another deprecation
cycle in the future.
Reviewer (Member):

I'd add something like: Rationale: we expect that there are very few users who intend to use ragged arrays like that; this was never intended as a use case of NumPy arrays. Users are likely better off with another library or just using lists of lists.

Reviewer (Member):

The phrasing I was pushing for earlier was more along the lines of "this isn't a big enough problem to be worth bringing into the scope of this NEP", rather than "this isn't something in scope for numpy". I'd still consider attempting to solve this in some future NEP. The rationale in my mind was simply "this is a different problem that can be solved later, and is lower value than the contents of this NEP".

Reviewer (Member):

That phrasing works for me too.

I'd still consider attempting to solve this in some future NEP

I would suggest that we have many more interesting/important things to do. But yeah, we've never had the discussion so let's just postpone it rather than discussing it now.


- It was also suggested to deprecate all automatic creation of ``object``-dtype
arrays, which would require a dtype for something like ``np.array([Decimal(10),
Decimal(10)])``. This too is out of scope for the current NEP.
Reviewer (Member):

I'd add something like: Rationale: it's harder to assess the impact of this larger change; we're not sure how many users it may affect.

@rgommers (Member):

I think this is ready for the mailing list?

Agreed, it reads well. A few final comments made.

Comment on lines 81 to 93
behaviour will emit a ``DeprecationWarning``. There is an open question whether
the ``assert_equal`` family of functions should be changed or users be forced
to change code like

```
np.assert_equal(a, [[1, 2], 3])
```

to

```
np.assert_equal(a, np.array([[1, 2], 3], dtype=object))
```
@eric-wieser (Member) commented Oct 29, 2019

Is this decision any different to deciding whether the following should be allowed?

>>> np.add([1, (2, 3)], [4, (5, 6)])
array([5, (2, 3, 5, 6)], dtype=object)

vs requiring it be spelt

>>> np.add(np.array([1, (2, 3)], dtype=object), np.array([4, (5, 6)], dtype=object))
array([5, (2, 3, 5, 6)], dtype=object)

At any rate, it would be worth drawing attention to the fact that since this affects asarray, it affects almost every numpy function with list → array semantics that uses asarray internally.
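For instance (a sketch, assuming the deprecation applies to the internal coercion path as described): np.sum has to convert its list argument to an array first, so a ragged nested list would hit the same code path as a direct np.array call.

```
import numpy as np

# The implicit list -> array conversion inside np.sum triggers the same
# warning as np.array([[1, 2], [1]]) itself (and an error in later releases).
np.sum([[1, 2], [1]])
```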

@mattip (Member, Author):

I think you have a point. Special-casing array_equal and friends would make it hard to remove the deprecation in the future; we would need a try: ... except around each internal asarray use. I will drop this discussion point; users should modify their code now to avoid the warning.

@mattip (Member, Author):

reformulated and moved to Usage and Impact for emphasis.

@eric-wieser (Member):

Note that this comment still needs addressing:

Since you call out ndarray here - will this start failing?

My impression is that the answer is no (at least, based on your PR), and that the wording needs tweaking there. Let's add a test to the PR and see what the behavior is?

@mattip (Member, Author) commented Oct 29, 2019

Let's add a test to the PR and see what the behavior is?

Added here, they all pass with no DeprecationWarning

@eric-wieser (Member):

they all pass with no DeprecationWarning

Right - so we need to change either the implementation or the NEP, because currently they disagree. My feeling would be to just remain silent on np.ndarrays being considered sequence objects in the NEP, which is the easiest path - if nothing else, the case that we're trying to stop users being bitten by is for nested lists of regular python objects, not so much lists of arrays.

@mattip (Member, Author) commented Oct 29, 2019

Are you referring to the outer_ragged example? I added a footnote hopefully clarifying the algorithm.

Co-Authored-By: Hameer Abbasi <[email protected]>
@mattip (Member, Author) commented Oct 31, 2019

There were no responses on the mailing list. I assume that means no one opposes the NEP. Can we merge it, even in draft status?

@rgommers merged commit ed7a077 into numpy:master on Oct 31, 2019
@rgommers (Member):

Yep, merged! Thanks @mattip and @eric-wieser

@mattip deleted the nep-0034 branch on November 2, 2020.