array(...) casts mix lists of int and string to string automatically (excepted dtype object) #6550

Open
sdementen opened this issue Oct 22, 2015 · 8 comments
Labels
01 - Enhancement 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. component: numpy.dtype

Comments

@sdementen

The following code explains the issue

>>> import numpy
>>> print numpy.__version__
1.10.1
>>> a = [["a", 1], [2, 3]]
>>> b = numpy.array(a)
>>> print repr(a)
[['a', 1], [2, 3]]
>>> print repr(b)
array([['a', '1'],
       ['2', '3']],
      dtype='|S1')

I would have expected that omitting the dtype would give the same result as specifying dtype(object), so that the array can hold both ints and strings:

>>> b = numpy.array(a, dtype=numpy.dtype(object))
>>> print repr(b)
array([['a', 1],
       [2, 3]], dtype=object)

as the doc http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html describes

dtype : data-type, optional
The desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence. This argument can only be used to ‘upcast’ the array. For downcasting, use the .astype(t) method.

But I may have missed some point in the docs

@sdementen
Author

I think the same bug is behind the weird behavior of assert_array_equal shown below:

>>> from numpy.testing import assert_array_equal
>>> a = [["a", 1], [2, 3]]
>>> b = numpy.array(a, dtype=numpy.dtype(object))
>>> print repr(b)
array([['a', 1],
       [2, 3]], dtype=object)
>>> assert_array_equal(a,b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\lib\site-packages\numpy\testing\utils.py", line 782, in assert_array_equal
    verbose=verbose, header='Arrays are not equal')
  File "...\lib\site-packages\numpy\testing\utils.py", line 708, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not equal

(mismatch 75.0%)
 x: array([['a', '1'],
       ['2', '3']],
      dtype='|S1')
 y: array([['a', 1],
       [2, 3]], dtype=object)
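
A minimal sketch of one way around the failure above, assuming Python 3 and a reasonably recent NumPy. assert_array_equal converts the plain list with the default dtype inference before comparing, so converting the list yourself with dtype=object keeps the original elements on both sides. The name a_obj is only illustrative.

```python
import numpy as np
from numpy.testing import assert_array_equal

a = [["a", 1], [2, 3]]
b = np.array(a, dtype=object)

# Convert the list explicitly so that no dtype inference takes place.
a_obj = np.array(a, dtype=object)

# Both sides now hold the original Python objects, so the comparison passes.
assert_array_equal(a_obj, b)

# The elements keep their Python types (str and int here).
print(type(a_obj[0, 0]).__name__, type(a_obj[0, 1]).__name__)
```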

@shoyer
Member

shoyer commented Oct 22, 2015

I agree with you that object dtype would be preferable, but this is also pretty long-standing behavior in numpy.

@sdementen
Author

Does the bug occur only with strings and floats, or with any mixed-type data?
In any case, it would be good to fix the documentation by explaining
how the dtype is determined in these cases.


@njsmith
Member

njsmith commented Oct 22, 2015

This smells like it's related to gh-6061 (which has an unreviewed PR sitting around since July: gh-6067). Merging gh-6067 directly may not fix this because I'm not sure that array uses the same promotion rules as ufuncs, but the same basic logic applies. Silently "upcasting" numbers into strings doesn't make any sense, and I think the only reason it's survived this long is that people don't use numpy strings much so they don't get as much scrutiny as other things.

@njsmith
Member

njsmith commented Oct 22, 2015

@sdementen: Though note that there is also an independent plan to change array so that it never automatically returns object dtypes and requires you to specify dtype=object if that's what you want. So specifying dtype=object explicitly as a workaround for this issue is also future-proofing your code...

@sdementen
Author

I stumbled upon this bug while using xlwings (an interface to Excel). It was
reading a range of cells that contained strings (the header of the data) and
numbers (the data itself). It used a numpy array to read everything so that
the nice slice notation of arrays could be used. However, as there were
strings in the header, it converted the whole array to strings. Slicing
to get the header then gave strings (which is ok), but slicing to get the
data also returned strings (not ok).
Always converting to dtype=object is not optimal either, as sometimes the range
of cells contains only numbers (for which object is not the best dtype).
But indeed, the bug can be circumvented from outside numpy.
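
A sketch of how such a workaround outside numpy might look, assuming Python 3 and rows given as nested lists. The helper name to_array is hypothetical and not part of numpy or xlwings: it falls back to dtype=object only when strings and non-strings are mixed, and otherwise lets numpy infer a compact dtype.

```python
import numpy as np

def to_array(rows):
    """Hypothetical helper: use dtype=object only for mixed str/non-str data."""
    cells = [cell for row in rows for cell in row]
    has_str = any(isinstance(c, str) for c in cells)
    has_other = any(not isinstance(c, str) for c in cells)
    if has_str and has_other:
        # Mixed content: keep the original Python objects untouched.
        return np.array(rows, dtype=object)
    # Homogeneous content: let numpy pick the dtype as usual.
    return np.array(rows)

table = to_array([["x", "y"], [1, 2], [3, 4]])
header = table[0]               # still strings
data = table[1:].astype(float)  # numbers, converted explicitly
numbers_only = to_array([[1, 2], [3, 4]])
```

With this approach a purely numeric range keeps a numeric dtype, while a range with a header row keeps its numbers as numbers.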


@chrisjbillington
Contributor

This one has tripped me up recently - putting tuples in HDF5 attributes via h5py converts them to numpy arrays, and I put a tuple of an int and a string in and got back two strings later.

Most of the casting numpy does is to types that pass equality checks, but since '1' == 1 is False, this casting has extra risk, and I would definitely have preferred an error. It might be long standing behaviour in numpy, but I'd be happy to see it deprecated.
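
For illustration, a sketch of the "raise an error instead" behavior described above, implemented outside numpy and assuming Python 3. The helper name strict_array is hypothetical; it only checks flat sequences such as the tuples mentioned here.

```python
import numpy as np

def strict_array(values):
    """Hypothetical helper: refuse to mix strings and non-strings."""
    is_str = [isinstance(v, str) for v in values]
    if any(is_str) and not all(is_str):
        raise TypeError("refusing to mix strings and non-strings in one array")
    return np.array(values)

ints = strict_array((1, 2, 3))  # homogeneous input is passed through

try:
    strict_array((1, "one"))    # mixed input is rejected
    rejected = False
except TypeError:
    rejected = True
```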

@Al2a2d2m

I don't think that this is a bug. We can find the same behavior in R. Both upcast the dtype to the minimal type required to hold the data. The object dtype seems extreme to me. For example, numpy uses it when there is no matrix structure, that is, when the array elements do not all have the same dimension, e.g. np.array([[1,2,3],[1,2]]).
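
A small sketch of the ragged case mentioned above, assuming Python 3. The dtype=object is passed explicitly here because, as far as I know, recent numpy releases no longer build such an array implicitly and raise an error instead; with the explicit dtype the result is a 1-D array whose elements are the inner lists.

```python
import numpy as np

# Rows of different lengths cannot form a 2-D numeric array.
ragged = np.array([[1, 2, 3], [1, 2]], dtype=object)

print(ragged.shape)  # (2,)
print(ragged.dtype)  # object
print(ragged[0])     # [1, 2, 3]
```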

@shoyer shoyer added the 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. label May 10, 2019