BUG: HDFStore utf8 corrupted reads #6505

Closed
wabu opened this issue Feb 28, 2014 · 11 comments
Labels
Bug IO HDF5 read_hdf, HDFStore Unicode Unicode strings
Milestone


@wabu
Contributor

wabu commented Feb 28, 2014

When UTF-8 encoded data is read back from an HDFStore, it is randomly corrupted.
The output contains random values, or the read fails with a UnicodeDecodeError.

After commenting out the fast version (astype) in _unconvert_string_array the issue did not show up anymore.

Python 3.3, numpy 1.8.0, pytables 3.1.0

I'll try to get a reproducible test up in the next few days.

@jreback
Contributor

jreback commented Feb 28, 2014

are you looking at master? this changed from 0.13.1

this is very tricky because the decoding should work and be very fast via numpy (could be a bug there too). the vectorized decoding is quite slow, FYI (but I bet it ALWAYS works)

@wabu
Contributor Author

wabu commented Feb 28, 2014

Yes, it's on master. I'll dig deeper into it. I just wanted to open a ticket already for reference.

@jreback
Contributor

jreback commented Feb 28, 2014

perfect....note that the current code DOES pass 3.3 / 1.8 on windows / linux,

but if you have a test case that breaks...by all means

@jreback
Contributor

jreback commented Mar 22, 2014

@wabu update on this?

@jreback jreback added this to the 0.14.0 milestone Mar 22, 2014
@jreback
Contributor

jreback commented Apr 6, 2014

@wabu update on this?

@jreback
Contributor

jreback commented Apr 6, 2014

@wabu I just merged #6821 pls reopen if you have a test case which fails this. It should automatically decode via a fast loop, and fall back to a vectorized decode otherwise.
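For readers following along, here is a minimal sketch of the "fast path, then fall back" idea described above. It is not the code merged in #6821; the function name `decode_bytes_array` and the sample arrays are made up for illustration. It assumes the data arrives as a NumPy array of byte strings, and it relies on NumPy decoding bytes as ASCII during a bulk cast to a unicode dtype.

```python
import numpy as np


def decode_bytes_array(data, encoding="utf-8"):
    """Decode a NumPy array of byte strings into an object array of str.

    Hypothetical helper for illustration only, not the pandas implementation.
    """
    data = np.asarray(data)
    try:
        # Fast path: one bulk cast. NumPy treats the bytes as ASCII here,
        # so this raises as soon as a non-ASCII byte shows up.
        return data.astype("U").astype(object)
    except UnicodeError:
        # Slow path: decode every element with the requested encoding.
        decode = np.vectorize(lambda b: b.decode(encoding), otypes=[object])
        return decode(data)


ascii_only = np.array([b"abc", b"de"])
mixed = np.array(["naïve".encode("utf-8"), b"plain"])

print(decode_bytes_array(ascii_only).tolist())  # handled by the fast path
print(decode_bytes_array(mixed).tolist())       # handled by the fallback
```

The fallback only triggers when the cast raises. If a fast path ever returned wrong data without raising, a scheme like this would not notice, which is why a failing test case matters here.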

@wabu
Contributor Author

wabu commented May 27, 2014

Sorry for the long absence. It's still a problem, so I created a test that fails. The test is still invalid for older Python versions, but it shows that there is a problem.

I also tried to reproduce it with plain numpy.astype('U').astype(object), but it didn't show up. Moreover pytables can read the correct data from the hdf5 file.
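A NumPy-only round-trip check of this kind could look roughly like the sketch below. It is an assumption about the shape of such a check, not the code that was actually run: the sample strings are invented, and it decodes with `np.char.decode` instead of the `astype` chain mentioned above.

```python
import numpy as np

# Invented sample data: ASCII, Latin, CJK and a control character.
strings = ["plain", "naïve", "漢字", "ڏ\x11"]

# A fixed-width bytes array; the item size is measured in bytes, not characters.
encoded = np.array([s.encode("utf-8") for s in strings])

# Decode element-wise as UTF-8 and convert to Python str objects.
decoded = np.char.decode(encoded, "utf-8").astype(object)

mismatches = [(o, d) for o, d in zip(strings, decoded.tolist()) if o != d]
print(len(mismatches), "mismatches")
```

If a check like this passes while reads through pandas fail, the problem sits somewhere between the stored bytes and the decode step rather than in the decode step alone.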

@jreback
Contributor

jreback commented May 27, 2014

hmm
can u try this on master
and show_versions

@wabu
Contributor Author

wabu commented May 27, 2014

Here's a more complete test.
It creates random valid UTF-8 strings, stores them with pandas, reads them back with plain pytables, and unconverts them with both astype and the vectorized call. All of that runs fine. But when the data is read back with select, it starts to return wrong data. In production code I also got completely wrong strings; here they are mostly empty strings ...
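The linked test is not reproduced in this thread. The sketch below only illustrates the general shape of such a test under stated assumptions: `random_utf8_string` and `count_mismatches` are hypothetical names, the key `"df"` and column `"text"` are arbitrary, and the HDF5 part needs pandas with PyTables installed.

```python
import random


def random_utf8_string(rng, min_len=1, max_len=15):
    """Return a random str made only of code points that encode to valid UTF-8."""
    chars = []
    for _ in range(rng.randint(min_len, max_len)):
        cp = rng.randint(1, 0x10FFFF)  # start at 1 to avoid NUL
        while 0xD800 <= cp <= 0xDFFF:  # surrogates cannot be encoded as UTF-8
            cp = rng.randint(1, 0x10FFFF)
        chars.append(chr(cp))
    return "".join(chars)


def count_mismatches(path, n=100, seed=0):
    """Write n random strings to an HDFStore, select them back, count differences."""
    # Imported lazily so the generator above works without pandas/PyTables.
    import pandas as pd

    rng = random.Random(seed)
    orig = pd.DataFrame({"text": [random_utf8_string(rng) for _ in range(n)]})
    with pd.HDFStore(path, mode="w") as store:
        store.append("df", orig)
        h5ed = store.select("df")
    return int((orig["text"].values != h5ed["text"].values).sum())


# Sanity check of the generator itself: the string survives an encode/decode cycle.
sample = random_utf8_string(random.Random(0))
assert sample.encode("utf-8").decode("utf-8") == sample
```

Calling `count_mismatches("test.h5")` a few times with different seeds would give a count comparable to the "(44/100)" style figures in the output below, assuming the store is writable at that path.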

INSTALLED VERSIONS
------------------
commit: None
python: 3.3.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-61-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8

pandas: 0.14.0rc1-73-g8793356
nose: 1.3.0
Cython: 0.19.1
numpy: 1.8.1
scipy: 0.12.0
statsmodels: 0.5.0
IPython: 2.0.0-dev
sphinx: None
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2013b
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.x
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: 2.5.1 (dt dec pq3 ext)
ERROR: data read corrupted (44/100)
                         orig h5ed
0   '?K>۫\x01cz;\n\x0bq\x16='   ''
1       'e0=\x14\u05ee0\x13F'   ''
2          '\x19]D,f<\n3P]Ֆ֭'   ''
3  '\x7fݘ\x1cF<U\tt6&\x03?\\'   ''
4   'љ\x1d{\x04\x01D/f[\x16q'   ''
ERROR: data read corrupted (52/100)
                      orig h5ed
0        'e\x1e˵{\x17jqҒ3'   ''
1         '͟ռ_μ(SXn5\x1e$'   ''
2  '?ّQ"\x13[\U00078cb4|%'   ''
3        '8ڼC5Y\x1cCV\x13'   ''
4     'h㷜\x03^h&xo\x171MI'   ''
ERROR: data read corrupted (41/100)
                         orig h5ed
0   'u#ШJ\x0fClA\x0c\x13lts`'   ''
1                 'g"v5ː\x16'   ''
2             '#tyV\x1bȹDSxh'   ''
3  '\x06\x10\x10WZb7/<)ڏ\x11'   ''
4     '\x17xښI\x1dݺ_K9\x19V<'   ''
ERROR: data read corrupted (44/100)
                          orig h5ed
0                '2ӣ\x12:ɠ2wH'   ''
1              'ydO}m\x1d˜]2&'   ''
2  '\x179\x17X8^\x07s\x19_=+̀'   ''
3             '?,08/\x08%ւ#}"'   ''
4              'P螩cftfSN\r\nj'   ''

@wabu
Contributor Author

wabu commented May 27, 2014

example with nonempty strings:

                        orig                                         h5ed
0  '7PFKۑw\x1d\x1bxyv\x03;S'   '\x002\x003\x004\x005\x006\x007\x008\x009'
1              '\x12漣\x03_L'  ':\x00;\x00<\x00=\x00>\x00?\x00@\x00A\x00B'
2            'ӉwxIy⣂&QC\x14'   '\x00C\x00D\x00E\x00F\x00G\x00H\x00I\x00J'
3          'u\x01\x10aۋZ9<s'   '\x00T\x00U\x00V\x00W\x00X\x00Y\x00Z\x00['
4       '!\x0b\x06зt\x13QMR'                                      '\x00c'

@jreback
Contributor

jreback commented May 27, 2014

ok why don't u create a new issue (and link to this one)
and put your test link there too
