Unicode byteorder seems mostly broken #3939

seberg · 2013-10-17T22:10:06Z

Is it supposed to be possible to use non-native unicode byteorder?

The unicode comparison functions cannot handle non-native byteorder, and the same applies to the dtype transfer functions. The copyswap functions do anticipate it, but only for 4-byte wide unicode (and I think 2-byte wide may be possible as a compile option?). Maybe that is why printing works...

In [19]: a = np.array(['asdf']).astype(unicode)

In [20]: a.byteswap().newbyteorder()
Out[20]: 
array([u'asdf'], 
      dtype='>U4')

In [21]: a.astype('>U4')
Out[21]: 
array([u'\U61000000\U73000000\U64000000\U66000000'], 
      dtype='>U4')

In [22]: a == a.byteswap().newbyteorder()
Out[22]: array([False], dtype=bool)
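The garbled values in Out[21] look like what you get when little-endian UTF-32 bytes are relabelled as big-endian without being swapped. A minimal stdlib-only sketch of that effect, assuming a little-endian source buffer (it does not call into NumPy, and the variable names are only illustrative):

```python
import struct

text = "asdf"

# UTF-32 code units as stored on a little-endian machine: b"a\x00\x00\x00..."
le_bytes = text.encode("utf-32-le")

# Relabel the same bytes as big-endian without swapping them.
misread = struct.unpack(">4I", le_bytes)
print([hex(v) for v in misread])

# A correct conversion swaps every 4-byte unit, so the values survive.
be_bytes = text.encode("utf-32-be")
correct = struct.unpack(">4I", be_bytes)
print([chr(v) for v in correct])
```

The first print shows 0x61000000 and friends, matching the `\U61000000...` escapes above; the second recovers the original characters.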

charris · 2013-10-17T22:19:16Z

Numpy only uses 32-bit unicode, while Python can be either 16 or 32 bits depending on the configuration.

seberg · 2013-10-17T22:25:21Z

Ah OK, that makes sense... So things are a bit simpler. We will need a dedicated unicode dtype transfer function, and either use the new iterator or force a copy in the comparison functions.
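One way to read the "force a copy" option is: bring both operands into a common byte order first, then compare. A rough stdlib-only sketch of that idea, assuming fixed 4-byte code units; `to_code_points` and `equal_any_order` are hypothetical helpers, not NumPy functions:

```python
import struct

def to_code_points(buf, byteorder):
    """Decode a UTF-32 buffer ('<' or '>' byte order) into a tuple of code points.

    This makes a normalized copy, which is the "force a copy" idea.
    """
    count = len(buf) // 4
    return struct.unpack(f"{byteorder}{count}I", buf)

def equal_any_order(buf_a, order_a, buf_b, order_b):
    """Compare two UTF-32 buffers that may differ in byte order."""
    return to_code_points(buf_a, order_a) == to_code_points(buf_b, order_b)

little = "asdf".encode("utf-32-le")
big = "asdf".encode("utf-32-be")

raw_equal = little == big                                  # byte-wise comparison
normalized_equal = equal_any_order(little, "<", big, ">")  # order-aware comparison
print(raw_equal, normalized_equal)
```

The byte-wise comparison fails for the same reason `a == a.byteswap().newbyteorder()` does above, while the normalized one succeeds at the cost of an extra copy.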

seberg · 2016-05-23T17:55:06Z

@jreback no idea if it interests you, but I just opened gh-7664 to hopefully fix this soonish. Oops, I had forgotten about the "numpy only uses 32bit" part, haha. Makes the code much nicer...

jreback · 2016-05-24T20:09:05Z

Thanks @seberg, unicode is something I mostly prefer not to deal with :)

charris added Defect labels Feb 24, 2014

jreback mentioned this issue May 27, 2014

HDFStore still corrupted reads with utf8 pandas-dev/pandas#7244

Open

ahaldane closed this as completed in 0bf9478 Jun 1, 2016
