BUG: HDFStore utf8 corrupted reads #6505

wabu · 2014-02-28T15:06:06Z

When utf8 encoded data is read back from an HDFStore, it is intermittently corrupted.
The output contains random values, or the read fails with a UnicodeDecodeError.

After commenting out the fast version (astype) in _unconvert_string_array the issue did not show up anymore.

python 3.3, numpy 1.8.0, pytables 3.1.0

I'll try to get a reproducible test up in the next few days.

jreback · 2014-02-28T15:18:54Z

are you looking at master? this changed from 0.13.1

this is very tricky because the decoding should work and be very fast via numpy (could be a bug there too). the vectorized decoding is quite slow, FYI (but I bet it ALWAYS works)

wabu · 2014-02-28T18:14:27Z

Yes, it's on master. I'll dig deeper into it. I just wanted to open a ticket already for reference.

jreback · 2014-02-28T18:23:34Z

perfect....note that the current code DOES pass 3.3 / 1.8 on windows / linux,

but if you have a test case that breaks...by all means

jreback · 2014-03-22T21:44:45Z

@wabu update on this?

jreback · 2014-04-06T15:57:19Z

@wabu update on this?

jreback · 2014-04-06T17:43:30Z

@wabu I just merged #6821 pls reopen if you have a test case which fails this. It should automatically decode via a fast loop, and fall back to a vectorized decode otherwise.
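
For readers following along, the "fast path first, slower decode as fallback" idea can be sketched roughly as below. This is only an illustration under assumptions, not the code merged in #6821: the helper name `decode_with_fallback` is hypothetical, and it operates on a plain numpy fixed-width bytes array rather than on data coming out of an HDFStore.

```python
import numpy as np


def decode_with_fallback(data, encoding="utf-8"):
    """Hypothetical helper: try a bulk cast first, decode per element if it fails."""
    data = np.asarray(data)
    try:
        # Fast path: bulk cast from fixed-width bytes ('S') to unicode ('U').
        # numpy interprets the bytes as ASCII here, so non-ASCII input raises.
        return data.astype("U").astype(object)
    except UnicodeError:
        # Slow path: decode every element explicitly with the requested encoding.
        decode = np.vectorize(lambda b: b.decode(encoding), otypes=[object])
        return decode(data)


ascii_arr = np.array([b"abc", b"de"])
utf8_arr = np.array(["héllo".encode("utf-8"), "日本".encode("utf-8")])
ascii_out = decode_with_fallback(ascii_arr)  # handled by the fast path
utf8_out = decode_with_fallback(utf8_arr)    # expected to take the fallback
```

The trade-off is the one described earlier in the thread: the bulk cast is quick but only safe for bytes it can interpret, while the per-element decode is slower and honours the encoding.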

wabu · 2014-05-27T06:27:12Z

sorry for the long absence. It's still a problem, so I created a test that fails. The test is still invalid for older python versions, but it shows that there is a problem.

I also tried to reproduce it with plain numpy.astype('U').astype(object), but it didn't show up. Moreover pytables can read the correct data from the hdf5 file.

jreback · 2014-05-27T07:31:46Z

hmm
can u try this on master
and show_versions

wabu · 2014-05-27T08:17:06Z

here's a more complete test.
It creates random valid utf8 strings, stores them with pandas, reads them back with plain pytables, and unconverts them both with astype and with the vectorized call; all of that runs fine. But when the data is read back with select, it starts returning wrong data. In production code I also got completely wrong strings; here they are mostly empty strings ...

INSTALLED VERSIONS
------------------
commit: None
python: 3.3.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-61-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8

pandas: 0.14.0rc1-73-g8793356
nose: 1.3.0
Cython: 0.19.1
numpy: 1.8.1
scipy: 0.12.0
statsmodels: 0.5.0
IPython: 2.0.0-dev
sphinx: None
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2013b
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.x
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: 2.5.1 (dt dec pq3 ext)
ERROR: data read corrupted (44/100)
                         orig h5ed
0   '?K>۫\x01cz;\n\x0bq\x16='   ''
1       'e0=\x14\u05ee0\x13F'   ''
2          '\x19]D,f<\n3P]Ֆ֭'   ''
3  '\x7fݘ\x1cF<U\tt6&\x03?\\'   ''
4   'љ\x1d{\x04\x01D/f[\x16q'   ''
ERROR: data read corrupted (52/100)
                      orig h5ed
0        'e\x1e˵{\x17jqҒ3'   ''
1         '͟ռ_μ(SXn5\x1e$'   ''
2  '?ّQ"\x13[\U00078cb4|%'   ''
3        '8ڼC5Y\x1cCV\x13'   ''
4     'h㷜\x03^h&xo\x171MI'   ''
ERROR: data read corrupted (41/100)
                         orig h5ed
0   'u#ШJ\x0fClA\x0c\x13lts`'   ''
1                 'g"v5ː\x16'   ''
2             '#tyV\x1bȹDSxh'   ''
3  '\x06\x10\x10WZb7/<)ڏ\x11'   ''
4     '\x17xښI\x1dݺ_K9\x19V<'   ''
ERROR: data read corrupted (44/100)
                          orig h5ed
0                '2ӣ\x12:ɠ2wH'   ''
1              'ydO}m\x1d˜]2&'   ''
2  '\x179\x17X8^\x07s\x19_=+̀'   ''
3             '?,08/\x08%ւ#}"'   ''
4              'P螩cftfSN\r\nj'   ''
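
The shape of the check behind the "data read corrupted (n/100)" lines above can be sketched as follows. This is not the linked test: the helpers `random_utf8_string` and `count_corrupted` are hypothetical names, and a plain encode/decode round trip stands in for the HDFStore write and `select` read, so this sketch is expected to report no corruption.

```python
import random


def random_utf8_string(rng, max_len=16):
    """Build a random string from code points in the 1- and 2-byte UTF-8 range."""
    # Code points 1..0x07FF avoid NUL and stay well below the surrogate range.
    return "".join(chr(rng.randint(1, 0x07FF)) for _ in range(rng.randint(1, max_len)))


def count_corrupted(original, read_back):
    """Return the indices where the value read back differs from the original."""
    return [i for i, (a, b) in enumerate(zip(original, read_back)) if a != b]


rng = random.Random(0)
orig = [random_utf8_string(rng) for _ in range(100)]

# Stand-in for the store/select round trip (an assumption of this sketch).
read_back = [s.encode("utf-8").decode("utf-8") for s in orig]

bad = count_corrupted(orig, read_back)
if bad:
    print("ERROR: data read corrupted ({}/{})".format(len(bad), len(orig)))
```

Replacing the stand-in with a real write followed by a read through `select` is what would turn this into a reproduction of the reports above.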

wabu · 2014-05-27T08:17:57Z

example with nonempty strings:

                        orig                                         h5ed
0  '7PFKۑw\x1d\x1bxyv\x03;S'   '\x002\x003\x004\x005\x006\x007\x008\x009'
1              '\x12漣\x03_L'  ':\x00;\x00<\x00=\x00>\x00?\x00@\x00A\x00B'
2            'ӉwxIy⣂&QC\x14'   '\x00C\x00D\x00E\x00F\x00G\x00H\x00I\x00J'
3          'u\x01\x10aۋZ9<s'   '\x00T\x00U\x00V\x00W\x00X\x00Y\x00Z\x00['
4       '!\x0b\x06зt\x13QMR'                                      '\x00c'

jreback · 2014-05-27T10:44:51Z

ok why don't u create a new issue (and link to this one)
and put your test link there too

jreback added the Bug label Feb 28, 2014

jreback added this to the 0.14.0 milestone Mar 22, 2014

jreback closed this as completed Apr 6, 2014

jreback added the Can't Repro label Apr 6, 2014

wabu mentioned this issue May 27, 2014

HDFStore still corrupted reads with utf8 #7244

Open
