
DOC: Improve the description of the dtype parameter in numpy.array docstring #10614


Closed
godaygo opened this issue Feb 16, 2018 · 17 comments · Fixed by #22736

Comments

@godaygo
Contributor

godaygo commented Feb 16, 2018

The description of the dtype parameter in numpy.array docstring looks as follows:

dtype : data-type, optional

The desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence. This argument can only be used to ‘upcast’ the array. For downcasting, use the .astype(t) method.

which is rather misleading, because in reality the behavior is different. For integers, for example, the type chain on Windows is (int32 -> int64 -> uint64 -> object); in general it starts with np.int_. Note also that there is no np.uint32 in this chain, no np.int8, and so on.
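For illustration, here is how that chain plays out when constructing arrays from growing Python integers (outputs as observed on a 64-bit Linux build; the first result would be dtype('int32') on 64-bit Windows, where np.int_ is a C long):

>>> np.array([1]).dtype            # the default integer, np.int_
dtype('int64')
>>> np.array([2**40]).dtype        # fits in int64
dtype('int64')
>>> np.array([2**63]).dtype        # too big for int64, fits in uint64
dtype('uint64')
>>> np.array([2**64]).dtype        # too big for any fixed-width integer
dtype('O')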

One more thought, this behavior is also inconsistent with the following snippet:

>>> (np.array(1, np.int32) + np.array(1, np.uint64)).dtype
dtype('float64')  #instead of dtype('O') 

Thus in np.array NumPy follows Python's rules, while in expressions it follows C's rules. And it does so deliberately:

>>> (np.array([0x7fffffff]) + np.array([0x7fffffff])).dtype
dtype('int32')    #instead of dtype('uint32')

Another point is that by default the dtype will be numpy.float64 instead of np.int_, which contradicts the description. A related issue is #10405.

I don't know what the right way to resolve this tangle is. I opened this issue here because it seems there is no interest in discussing this on numpy's ML.

p.s.: one more point - is there any real benefit to defining np.int_ as C's long (instead of 8 bytes on 64-bit and 4 bytes on 32-bit), with the resulting differences between 64-bit Windows and other OSs? As for me, I have never seen the advantages of this choice, only a few inconveniences and the need to provide dtype= everywhere.

@eric-wieser
Member

eric-wieser commented Feb 16, 2018

is there any real benefit to define np.int_ as C's long

Yes, because we probably need to provide some way to obtain the C long type in numpy - we provide np.cint and np.longlong for the same reason.

I think the question you mean to ask is "Why is the default integer type np.int_"?

@godaygo
Contributor Author

godaygo commented Feb 16, 2018

@eric-wieser as I wrote, I asked this question on the ML because I think it is more suitable there than here, but there was no feedback. In fact, there are certainly a few related questions here; I'm sorry for that (but I did not want to split them up so as not to litter).

I think the question you mean to ask is "Why is the default integer type np.int_"?

Partly yes :-)

Nevertheless, I find this phrase - "If not given, then the type will be determined as the minimum type required to hold the objects in the sequence" - to be incorrect.


Also I can't understand why this casts to uint64:

>>> (np.array([0x8fffffffffffffff]) + np.array([0x8fffffffffffffff])).dtype
dtype('uint64')

But this does not cast to uint32, while producing the value array([-2]) (what?):

>>> (np.array([0x7fffffff]) + np.array([0x7fffffff])).dtype
dtype('int32')

Also I don't understand why this casts to double and not to object:

>>> (np.array(1, np.int32) + np.array(1, np.uint64)).dtype
dtype('float64')  #instead of dtype('O') 

And if C's rules are intended, why does it cast to double and not to float?

@ahaldane
Member

ahaldane commented Feb 16, 2018

The fact that np.int32 + np.uint64 -> np.float64 is something that we have often discussed on the mailing list and on github, eg #7126. For better or for worse, numpy type-conversion rules are indeed different from C type-conversion rules, and it seems it's too late to change it now for back-compatibility reasons.

Also, I'm not sure what you mean by "type chain", but if you mean the C type-conversion rules, they are more complicated than what you said. But numpy does not use C-conversion rules anyway.

As to the main issue here of whether the docstring could be improved, I agree it could. I think a more precise (but less descriptive) description might be something like:

The desired data-type for the array. If not given, then the type will be determined 
from the input objects using `np.result_type`.
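For example (just an illustration of the existing behavior that wording would describe, not part of the proposed docstring text):

>>> np.result_type(1, 2.0, 3j)
dtype('complex128')
>>> np.array([1, 2.0, 3j]).dtype
dtype('complex128')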

@eric-wieser
Member

eric-wieser commented Feb 16, 2018

This argument can only be used to ‘upcast’ the array. For downcasting, use the .astype(t) method.

This is plain wrong:

>>> a = np.array(256, np.uint16)
>>> np.array(a, dtype=np.uint8) # definitely not a downcast... /s
array(0, dtype=uint8)

@godaygo
Contributor Author

godaygo commented Feb 18, 2018

@ahaldane from your link I found another one, #5745. And it looks like a question of this type, with the corresponding surprise, comes up at least every two months.

Also, I'm not sure what you mean by "type chain", but if you mean the C type-conversion rules, they are more complicated than what you said. But numpy does not use C-conversion rules anyway.

I meant that numpy is trying to follow, or mimic, C's rules in some way. At least I have always had that feeling. The problem is that in C I know what will happen and the promotions are well defined, but in numpy I need to guess and check.

Also numpy's behavior of casting uint64 to float64 is pretty weird, and with this implicit cast precision can be lost, which is not good for implicit behavior. For me this behavior with integers has a smell of Python 2's two integer types. Maybe I'm wrong, but what are the downsides of casting to object and raising a warning? If it's backwards compatibility, I'm curious to see how someone relies on this uint -> float...
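To make the precision concern concrete, here is a small example (an added illustration under the current promotion rules; the values are arbitrary):

>>> big = np.uint64(2**63 + 1)
>>> (big + np.int32(0)).dtype            # mixed signed/unsigned promotes to float64
dtype('float64')
>>> int(big + np.int32(0)) == int(big)   # the low bits were silently lost
False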

Back to the topic:

To be honest, I find the current implicit treatment of dtype in np.array error prone. I think the following approach (idea) is more correct. There would be 6-7 possibilities: (maybe int32), int64, float64, complex128, bytes_, unicode_, object (in some cases with a warning), without any implicit guessing. And if someone needs a non-default type, it should be provided through the dtype= argument. There would be no unsigned defaults among these default types, because they are somewhat special. Does this sound reasonable (and possible), or is it too late to change?

@eric-wieser
Member

I think there's a major disconnect between the documentation and what we actually do. What you describe above sounds like what we do already (except with int_ instead of int32).

Can you give an example where that's not the case?

@godaygo
Contributor Author

godaygo commented Feb 18, 2018

@eric-wieser Yes, in the following respects:

  1. As you've noted, np.int_ - I think there should be no differences between OSs on the same computer architecture (on 8-bit -> 8-bit, on 16-bit -> 16-bit, and so on). I think this would be a real win. Also, some projects can't follow numpy's default np.int_ assumption. See "Default integer size in numpy and numba on Windows" [continuation of issue #2643] numba/numba#2729.
  2. np.array([0xffffffffffffffff]).dtype is currently dtype('uint64'), and unsigned types are tricky. I think this should be promoted to object with a warning.

p.s.
@eric-wieser

is there any real benefit to define np.int_ as C's long

Yes, because we probably need to provide some way to obtain the C long type in numpy - we provide np.cint and np.longlong for the same reason.

Why was np.long not enough (I know that it is not the same as np.int_; it is a Python 2 artifact)?

@ahaldane
Member

I think most of us devs feel we should try to limit the unintentional creation of object arrays rather than do it more often. See #5353, #7547, #368, for example. Object arrays can't perform quite a few operations that normal arrays can, and often lead to confusing bugs if the user wasn't expecting an object array.
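For instance (just one example of the kind of limitation meant here; the exact error message depends on the numpy version):

>>> obj = np.array([1, 2], dtype=object)
>>> np.exp(obj)   # fails: the object loop falls back to calling .exp() on each element, which Python ints don't have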

I agree the creation of the unsigned integers seems undocumented and might be surprising; I hadn't understood that part of your comment before. The code that actually does that is here (link). We should at least document it somewhere.

@godaygo
Contributor Author

godaygo commented Feb 19, 2018

I think most of us devs feel we should try to limit unintentional creation of object array rather than do it more often. See #5353, #7547, #368, for example.

@ahaldane I respect the choice of the developers, and of course I do not have as much experience with numpy. From the provided links I found your docs https://gist.github.com/ahaldane/c3f9bcf1f62d898be7c7. It seems they are in a WIP state.

#7547 is not really related; in any case it is hard to make a reasonable assumption about what the result of np.asarray([1.,2.,3.]) + 1000000000000000000000000000 should be. In fact, implicit type conversions also create situations where it is not clear what someone expects:

In [2]: np.asarray([1, 2, 3]) * 'a'
TypeError: ufunc 'multiply' did not contain a loop with signature matching types dtype('<U11') dtype('<U11') dtype('<U11')

In [3]: np.asarray([1, 2, 3], dtype=np.object) * 'a'
Out[3]: array(['a', 'aa', 'aaa'], dtype=object)

Reading through the other ones, I have a feeling that all these problems are related to the fact that numpy does not implement integers with unlimited width, combined with the enormous freedom in type conversions. So in my opinion, if you isolate infinite-precision integers into a type distinct from object and tighten the rules for implicit type conversion, most of these issues and questions will disappear by themselves. (Perhaps these thoughts fit adequately only in my head :-)

@ahaldane
Member

@godaygo There are actually a number of similar related problems involving dtypes which many developers (including those more ancient and revered than me :) have been planning to fix for a long time. See the "epic dtype cleanup plan" #2899. One goal is to make it easy to add something like a "multiprecision integer" type.

In fact, BIDS obtained a grant to fund two numpy developers, one of the goals being to finally implement this plan. Last I heard the position is still open and you can apply if it interests you: http://numpy-discussion.10968.n7.nabble.com/Position-at-BIDS-UC-Berkeley-to-work-on-NumPy-td45085.html

@ahaldane
Member

Also, your comment about a dedicated multiprecision int reminds me of this issue and comment: #3804 (comment)

@Atcold

Atcold commented May 8, 2018

I have a similar question:

a = np.array((1, 2, 3, 4))
a, a.shape, a.dtype
# (array([1, 2, 3, 4]), (4,), dtype('int64'))

It doesn't seem like dtype('int64') is "the minimum type required to hold the objects in the sequence".
Am I missing something?
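For comparison, the genuinely minimal type for these values would be much smaller (shown here with np.min_scalar_type purely for illustration; the second result is dtype('int32') on 64-bit Windows):

>>> np.min_scalar_type(4)          # the smallest type that can hold the largest value
dtype('uint8')
>>> np.array((1, 2, 3, 4)).dtype   # but the default integer np.int_ is used instead
dtype('int64')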

@mattip
Member

mattip commented Jun 17, 2020

The documentation of the dtype kwarg still states

The desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence.

I wish we could write "If not given, then the type will be determined according to internal arbitrary rules that may change between versions." In practice it would be good to link to a part of a NEP or other documentation that describes the algorithm used.

@seberg
Member

seberg commented Jun 17, 2020

I think "minimum" is a bit incorrect though. We basically do arr1 = np.array(item) for every element, and then go through all items using np.promote_types(arr1.dtype, arr2.dtype), etc.

There are a couple of subtleties to this, and I don't want to list them all, especially since I am trying to just remove some of them and hope nobody notices... (especially around strings). Plus what dtype is used for each individual integer can be surprising...

@burnpanck

I agree especially with the following point of @godaygo :

  1. As you've noted np.int_ - I think there should be no differences between OSs on the same computer architectures (on 8bit -> 8bit, on 16bit -> 16bit and so on). I think this is a real win. Also some projects can't follow such default np.int_ assumptions of numpy. See Default integer size in numpy and numba on Windows [continuation of #2643 issue] numba/numba#2729.

I have been a numpy user for over 15 years now, yet I was not aware of this difference. This bit us in production, where a lazy piece of code converted an m8 timedelta to dtype=int (instead of dtype="i8"), which caused wrong results only on Windows because the data silently wrapped around. A big fat warning in the documentation that np.int_ is a 32-bit int on 64-bit Windows might have prevented that; even better, the default int should follow least surprise (i.e. be the same on every 64-bit platform).
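A minimal sketch of that failure mode, with the 32-bit width spelled out explicitly so it reproduces on any platform (the value is made up for illustration):

>>> td = np.timedelta64(3_000_000_000, 'ns')  # ~3 seconds in nanoseconds
>>> int(td.astype('i8'))                      # dtype="i8": correct
3000000000
>>> int(td.astype(np.int32))                  # what dtype=int meant on 64-bit Windows: silently wraps
-1294967296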

@InessaPawson changed the title from "The description of the dtype parameter in numpy.array docstring" to "DOC: Improve the description of the dtype parameter in numpy.array docstring" on Aug 26, 2022
@melissawm
Member

I am wondering if there's anything we can do here to make this issue more actionable. It looks like the situation is as follows:

  • The description for the dtype parameter in numpy.array is imprecise
  • This is partly due to some historical behavior that may need deeper consideration and a proposal to get fixed
  • In the meantime, we can correct/improve the docstring by changing it to read something like:
dtype: data-type, optional

The desired data-type for the array. If not given, then the type will be determined according to internal arbitrary rules that may change between versions. See NEP 50 for more details.

Does this make sense? If not, any suggestions for some definition of "Done" for this issue?

@seberg
Member

seberg commented Dec 5, 2022

NEP 50 doesn't actually propose to change anything here yet :), although it does explain the problem a bit.

The weird subtleties I hinted at long ago are gone now. So the rules now are the same as calling np.array() on each element and then promoting them (pairwise from left to right, depth first).
The only weird part is that integer rule of trying long -> int64 -> uint64 -> object and that our promotion rules are sometimes weird/broken.

Can we just say something like "NumPy will try to use a default dtype that can represent the values (by applying promotion rules when necessary)"? That glosses over the integer nonsense, but at least it doesn't say "minimal"...

I don't have a great idea, but maybe it helps.
