API: Enforce one copy for __array__ when copy=True #26215

Merged
merged 4 commits into from
Apr 19, 2024

@mtsokol mtsokol commented Apr 5, 2024

Addresses: #26208

Hi @seberg @ngoldbaum,

This PR enables passing copy=True to def __array__(...) and prevents NumPy from making another copy once it can assume that copy=True was successfully passed and the copy was made by the implementor.

Following #26208 (comment) I implemented it with a copy_indicator that is set to 1, which later is used to decide whether NPY_ARRAY_ENSURECOPY flag can be removed.

I added a test test___array__copy_once that checks if only one copy is made with np.array(my_arr, copy=True). Without copy_indicator the test fails as np.array(my_arr, copy=True) doubles delta RSS (I used psutil package and added it as another test dependency).

I assumed that we want to prevent making implicit copies with __array__. Therefore the user is also responsible for properly implementing dtype argument.

import numpy as np

class MyArray:
    def __array__(self, dtype=None, copy=None):
        return np.array([1,2,3], dtype=int)

my_arr = MyArray()
np.array(my_arr, dtype=float)  # Error! Produced ndarray has incorrect dtype! `__array__` should use dtype arg
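
For contrast, here is a minimal sketch of an `__array__` implementation that honors both arguments, assuming NumPy is installed. The class name `HonestArray` and its `_data` and `calls` attributes are hypothetical and exist only for illustration. Which `copy` value NumPy actually passes depends on the NumPy version, so the sketch relies only on properties that should hold either way.

```python
import numpy as np


class HonestArray:
    """Hypothetical array-like that honors `dtype` and `copy` in `__array__`."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=np.int64)
        self.calls = []  # records the (dtype, copy) arguments NumPy passed in

    def __array__(self, dtype=None, copy=None):
        self.calls.append((dtype, copy))
        if dtype is not None and np.dtype(dtype) != self._data.dtype:
            if copy is False:
                raise ValueError("a copy is required to change the dtype")
            # astype returns a new array here, which also serves as the copy.
            return self._data.astype(dtype)
        if copy:
            return self._data.copy()
        return self._data


obj = HonestArray([1, 2, 3])
as_float = np.array(obj, dtype=float)  # dtype is honored by __array__
copied = np.array(obj, copy=True)      # result must not alias obj._data
```

Inspecting `obj.calls` afterwards shows what a given NumPy version passed to `__array__`.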

@mtsokol mtsokol self-assigned this Apr 5, 2024
@mtsokol mtsokol force-pushed the one-copy-__array__ branch from 592ae0a to 4f629bf Compare April 5, 2024 11:10
@mtsokol mtsokol linked an issue Apr 5, 2024 that may be closed by this pull request
@mtsokol mtsokol force-pushed the one-copy-__array__ branch 6 times, most recently from f7ff1b7 to d654d93 Compare April 5, 2024 13:52
@mtsokol mtsokol added this to the 2.1.0 release milestone Apr 5, 2024
@mtsokol mtsokol force-pushed the one-copy-__array__ branch from d654d93 to 8c9f073 Compare April 5, 2024 14:31
Member

@seberg seberg left a comment

Just two comments for now. But yeah, we may need the copy indicator unfortunately (maybe call it was_copied, though? It has to be passed by reference anyway, so it is obvious for the caller that it is an output).

It would be good to have a test that we don't pass copy=True if someone calls np.array([array_like]) (i.e. there is a list).

else:
return base_arr

process = psutil.Process()
Member

Can you just do this without psutil? Just store the copy you return, then do an identity/base check. That is easier, and there is no need for psutil magic.

Member Author

Ah, right - I removed psutil and stored copies/base arrays instead.

@@ -970,7 +972,7 @@ PyArray_DiscoverDTypeAndShape_Recursive(
npy_intp out_shape[NPY_MAXDIMS],
coercion_cache_obj ***coercion_cache_tail_ptr,
PyArray_DTypeMeta *fixed_DType, enum _dtype_discovery_flags *flags,
- int copy)
+ int copy, int *copy_indicator)
Member

This already has the _dtype_discovery_flags; I think it makes sense to use those "internally" in this part of the code.

Member Author

Sure! I merged copy_indicator into _dtype_discovery_flags.

@mtsokol mtsokol force-pushed the one-copy-__array__ branch from 8c9f073 to 973ce28 Compare April 8, 2024 18:02
Member Author

mtsokol commented Apr 8, 2024

(maybe call it was_copied, though?

@seberg Sure - renamed!

It would be good to have a test that we don't pass copy=True if someone calls np.array([array_like]) (i.e. there is a list).

Sure! I changed the implementation so that sequences don't pass copy=True to their elements and added a test for that.

@mtsokol mtsokol force-pushed the one-copy-__array__ branch 2 times, most recently from c1a3d15 to 854340f Compare April 8, 2024 22:10
@mtsokol mtsokol requested a review from seberg April 8, 2024 22:14
@mtsokol mtsokol force-pushed the one-copy-__array__ branch from 854340f to 553c473 Compare April 10, 2024 08:36
Member Author

mtsokol commented Apr 10, 2024

I did one more rebase to solve conflicts.

Member

@ngoldbaum ngoldbaum left a comment

One request for clarification inline and one more general suggestion.

I think was_copied and COPY_WAS_CREATED should be renamed was_copied_by__array__ and COPY_WAS_CREATED_BY__ARRAY__ to emphasize that this is specifically to handle subtleties around __array__ implementations. Alternative names or just comments around where they are used explaining that it has something to do with __array__ would help readers trying to understand why we have both an ensure copy flag and a flag tracking if a copy already happened.

Member

charris commented Apr 10, 2024

Does this need a backport?

@ngoldbaum
Member

Does this need a backport?

No, this change is targeted for 2.1.

Member Author

mtsokol commented Apr 10, 2024

I think was_copied and COPY_WAS_CREATED should be renamed was_copied_by__array__ and COPY_WAS_CREATED_BY__ARRAY__ to emphasize that this is specifically to handle subtleties around __array__ implementations.

These names are clearer and more specific - done!

Member

@seberg seberg left a comment

LGTM, I suppose we could use the argument in PyArray_DiscoverDTypeAndShape for other things in principle, although I am not sure there is anything concrete, so let's use the name, it is much nicer everywhere else (and all other places will go away when the deprecation goes away).

The ENSURE_COPY flag setting logic looks incorrect, though (I admit, I didn't try it).

/* TODO: As of NumPy 2.0 this path is only reachable by C-API. */
Py_SETREF(new, PyArray_NewCopy((PyArrayObject *)new, NPY_KEEPORDER));
if (was_copied_by__array__ != NULL && copy == 1 &&
must_copy_but_copy_kwarg_unimplemented == 0) {
Member

Could add a comment to remove the was_copied_by__array__ argument again here, but it is probably clear enough (when it happens).

Member Author

Done!

@@ -1615,6 +1622,10 @@ PyArray_FromAny_int(PyObject *op, PyArray_Descr *in_descr,
assert(cache->converted_obj == op);
arr = (PyArrayObject *)(cache->arr_or_sequence);
/* we may need to cast or assert flags (e.g. copy) */
if (was_copied_by__array__ == 1 && flags & NPY_ARRAY_ENSURECOPY) {
Member

Suggested change
- if (was_copied_by__array__ == 1 && flags & NPY_ARRAY_ENSURECOPY) {
+ if (was_copied_by__array__ == 1) {

You don't need to test this, it is OK to unset the flag if it isn't set. (to me that is clearer, but maybe it's me)
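
The point about unsetting a flag unconditionally can be illustrated with a small bit-mask sketch. The constants below are illustrative placeholders rather than values taken from NumPy's C headers, and `clear_ensurecopy` is a hypothetical helper.

```python
# Illustrative bit values; the real constants live in NumPy's C headers.
ENSURECOPY = 0x20
OTHER_FLAG = 0x01  # hypothetical unrelated flag


def clear_ensurecopy(flags: int) -> int:
    """Clear the ENSURECOPY bit without first checking whether it is set."""
    return flags & ~ENSURECOPY


with_bit = OTHER_FLAG | ENSURECOPY
without_bit = OTHER_FLAG

# Clearing is a no-op when the bit is absent, so no guard is needed.
cleared_a = clear_ensurecopy(with_bit)
cleared_b = clear_ensurecopy(without_bit)
```

Both results keep `OTHER_FLAG` and have the `ENSURECOPY` bit cleared, which is why the extra `flags & NPY_ARRAY_ENSURECOPY` test adds nothing.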

Member Author

Done!

@@ -1615,6 +1622,10 @@ PyArray_FromAny_int(PyObject *op, PyArray_Descr *in_descr,
assert(cache->converted_obj == op);
arr = (PyArrayObject *)(cache->arr_or_sequence);
/* we may need to cast or assert flags (e.g. copy) */
if (was_copied_by__array__ == 1 && flags & NPY_ARRAY_ENSURECOPY) {
flags = flags & ~NPY_ARRAY_ENSURECOPY;
flags = flags | NPY_ARRAY_ENSURENOCOPY;
Member

This looks wrong, so please add a test:

np.asarray(array_like, copy=True, order="F")

should just make two copies. You probably want to make sure that the user honors the dtype that was passed. That might be nice, but if it is not very clean to test for, I am not sure we should worry much.

Member Author

Ah right, a copy still needs to be made to convert to order="F". Then I shouldn't add the ENSURENOCOPY flag but only remove the ENSURECOPY one.

About honoring dtype: I can check if arr in cache has the same dtype as requested dtype variable. I wonder if it's sufficient for comparing dtypes in C-API:

PyArray_TYPE(arr) == dtype->type_num

Member

No, unfortunately checking that is not really trivial (the only way to do it, is the way that the function uses later to see if it has to make a copy).

Member

I just realized, it is even more complicated, for subarray dtypes (maybe those should just never be passed on). But that is probably not really new, although more pronounced by nudging downstream to actually implement dtype.

Maybe the solution is to just unset the dtype when a copy was made, although unfortunately that doesn't help us in the path when no copy was made and dtype is used by the __array__ implementor.

(Explanation: if you write dtype="(4,)i" you get back an array with an additional dimension and dtype=i. But if __array__ does that, then you end up doing another cast!)
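
The subarray-dtype behavior described here can be seen directly in NumPy. This is a small sketch, assuming NumPy is installed; the variable names are illustrative.

```python
import numpy as np

# A subarray dtype: four C-int values per element.
sub = np.dtype("(4,)i")

# Creating an array with it moves the subarray shape into the array shape,
# and the resulting array carries the plain base dtype instead.
arr = np.zeros(3, dtype=sub)

print(sub.shape)              # (4,)
print(arr.shape)              # (3, 4)
print(arr.dtype == sub.base)  # True
```

An `__array__` implementation that applies such a dtype itself returns an array whose dtype no longer equals the requested one, which is the source of the extra cast mentioned above.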

Member Author

@mtsokol mtsokol Apr 11, 2024

Then let's maybe skip checking dtype after __array__ call?
What we currently have is that:

np.array(array_like, dtype=new_dtype, copy=False)

errors when a dtype wasn't honored in __array__ implementation (which is nice I think) and

np.array(array_like, dtype=new_dtype)

gets another copy if the dtype wasn't honored, same as the order="F" case.

Member

Then let's maybe skip checking dtype after __array__ call?
For here, that is perfectly fine to me, it is just a mild performance nudge for downstream after all.

However, maybe we should make a new issue. I am slightly worried that downstream until now always just ignored dtype=, and now we will get new bugs with subarray dtypes.

OTOH, I will also say that, I don't expect anyone to explicitly run into it, so in a sense, it is an unrelated bug fix.

Member Author

I updated flags logic here and added a test with:

np.asarray(array_like, copy=True, order="F")
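
A rough sketch of the situation such a test covers, assuming NumPy is installed. The class `COrdered` is hypothetical, and `np.array` is used in place of `np.asarray` because it accepts `copy` on older NumPy releases as well. The sketch checks only the outcome (a Fortran-ordered result that does not alias the source), not how many copies a particular NumPy version makes.

```python
import numpy as np


class COrdered:
    """Hypothetical array-like whose `__array__` returns C-ordered data."""

    def __init__(self):
        self._data = np.arange(6).reshape(2, 3)  # C-contiguous

    def __array__(self, dtype=None, copy=None):
        # dtype is ignored to keep the sketch short; no dtype is requested below.
        return self._data.copy() if copy else self._data


obj = COrdered()

# Even if __array__ already made a copy, that copy is C-ordered, so NumPy
# still has to produce a second, Fortran-ordered array.
result = np.array(obj, copy=True, order="F")
```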

@mtsokol mtsokol force-pushed the one-copy-__array__ branch from e6a3ce5 to 3549902 Compare April 11, 2024 10:16
Member

@ngoldbaum ngoldbaum left a comment

I think this is OK to merge now.

I don't think this is going to cause much additional churn over adding the copy keyword in the first place, hopefully downstream did the right thing.

@charris charris changed the title API: Enforce one copy for __array__ when copy=True API: Enforce one copy for __array__ when copy=True Apr 17, 2024
Member Author

mtsokol commented Apr 19, 2024

Ok, merging it now.

Successfully merging this pull request may close these issues.

ENH: pass copy=True to __array__
4 participants