loadtxt() changes numbers if integers are read as strings #17277

fratajcz · 2020-09-09T10:57:46Z

When I read a file with loadtxt() and the dtype is set to string, the contents are changed if the column only contains integers. I need to read IDs, which can be integers but can also contain characters, so I need to read them as strings.

Namely, the last number (in this case 100000) loses one "0" and becomes 10000. Strangely, this only happens to the last number. Even weirder, it only happens if the list ends with a number ending in a zero. It took me hours to track down this issue in my code. Do you have an idea why this happens?

Reproducing code example:

>>> import numpy as np
>>> import pandas as pd

>>> liste = list(range(1,100001))
>>> df = pd.DataFrame(liste)
>>> df
            0
0           1
1           2
2           3
3           4
4           5
...       ...
99995   99996
99996   99997
99997   99998
99998   99999
99999  100000

[100000 rows x 1 columns]
>>> df.to_csv("testfile",header=False,index=False)
>>> liste2 = np.loadtxt("testfile",dtype="str",delimiter=",",skiprows=0,usecols=0)
>>> liste2[-1]
'10000'
>>> liste2 = np.loadtxt("testfile",dtype="int",delimiter=",",skiprows=0,usecols=0)
>>> liste2[-1]
100000
>>> liste2 = np.loadtxt("testfile",dtype="str",delimiter=",",skiprows=0,usecols=0)
>>> liste1str = list(map(str,liste))
>>> liste1str == liste2
array([ True,  True,  True, ...,  True,  True, False])
>>> liste2[99]
'100'
>>> liste2[999]
'1000'
>>> liste2[9999]
'10000'
>>> liste2[99999]
'10000'


WarrenWeckesser · 2020-09-09T13:32:20Z

@fratajcz, thanks for reporting this problem. This is a bug in how loadtxt determines the string dtype when the given dtype is 'S' or 'U' (with no explicit string length) and the longest string occurs after the first 50001 lines.

loadtxt reads lines of data in chunks of 50001 lines. The code that reads and assembles the chunks begins at https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L1148, where you can see that when dtype is, say, 'S', the first chunk will establish the actual length of the string dtype. Subsequent chunks are appended to the array X using X.resize(...), which does not change the dtype.
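The mechanism described above can be reproduced on a small scale without loadtxt. The following is only a minimal sketch: the three-element array stands in for the first chunk and is an assumption for illustration, not the actual code in npyio.py.

```python
import numpy as np

# Hypothetical stand-in for the first chunk: the longest entry has one
# character, so NumPy infers a one-character unicode dtype.
X = np.array(["1", "1", "1"], dtype="U")

# Grow the array in place, as loadtxt does for later chunks.
# resize() changes the shape only; the string width stays the same.
X.resize(4, refcheck=False)

# An entry from a later "chunk" is silently cut to the existing width.
X[3] = "9999"
print(X.dtype.itemsize // 4, "character(s) per entry")
print(repr(X[3]))
```

No error or warning is raised when the longer string is stored, which is why the truncation is easy to miss.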

Here's another example. The last value in the result of loadtxt should be b'9999' (and the dtype of the array should be 'S4').

In [50]: y = np.ones(50002, dtype=int)

In [51]: y[-1] = 9999

In [52]: np.savetxt('y.txt', y, fmt="%d")

In [53]: np.loadtxt('y.txt', dtype='S')
Out[53]: array([b'1', b'1', b'1', ..., b'1', b'1', b'9'], dtype='|S1')

If we skip one row, so all the data to be read is in the first block of 50001 lines, the dtype and the last value are correct:

In [54]: np.loadtxt('y.txt', dtype='S', skiprows=1)
Out[54]: array([b'1', b'1', b'1', ..., b'1', b'1', b'9999'], dtype='|S4')

WarrenWeckesser · 2020-09-09T13:47:09Z

@fratajcz (or anyone else encountering this problem): if you know in advance what the maximum length of your IDs will be, you can work around the bug by giving an explicit length with the dtype, e.g. if the maximum length of an ID is 8, you can use dtype='U8' in the call to loadtxt() instead of dtype="str".
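A minimal sketch of that workaround, applied to a single-column file like the one in the report. The helper names `load_ids` and `load_ids_any_length` are hypothetical, and the second function is an assumed alternative (not suggested in this thread) for the case where the maximum length is not known in advance.

```python
import numpy as np


def load_ids(path, max_len=8):
    """Load the first column as strings with an explicit width.

    Hypothetical helper; max_len must be at least the longest ID.
    """
    return np.loadtxt(path, dtype=f"U{max_len}", delimiter=",", usecols=0)


def load_ids_any_length(path):
    """Assumed alternative: bypass loadtxt and let np.array pick the width.

    np.array sees all entries at once, so the width covers the longest one.
    Blank lines are skipped; comment lines are not handled.
    """
    with open(path) as fh:
        fields = [line.split(",")[0].strip() for line in fh if line.strip()]
    return np.array(fields)
```

Note that with an explicit width, any ID longer than `max_len` would still be cut off, so the value should be chosen generously.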

fratajcz · 2020-09-09T15:02:10Z

Hi @WarrenWeckesser,

thanks for the clarification, that explains a lot!

Cheers!

Closes numpy#17277. If loadtxt is passed an unsized string or byte dtype, the size is set automatically from the longest entry in the first 50000 lines. If longer entries appeared later, they were silently truncated.
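Because the truncation is silent, it can be useful to check an already loaded array against the raw file. The following is a rough sketch, not part of the fix itself; `find_truncated` is a hypothetical helper, and it assumes a file with no skipped rows, comments, or blank lines, where the first field of each line corresponds to one array entry.

```python
import numpy as np


def find_truncated(path, loaded, delimiter=","):
    """Return (row, raw_text, loaded_text) for every row whose first
    field in the file differs from the value in the loaded array."""
    mismatches = []
    with open(path) as fh:
        for row, (line, value) in enumerate(zip(fh, loaded)):
            raw = line.rstrip("\r\n").split(delimiter)[0]
            if raw != str(value):
                mismatches.append((row, raw, str(value)))
    return mismatches
```

An empty result means every entry survived intact; any returned tuple shows the row index, the text in the file, and the shortened value in the array.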

seberg · 2022-01-11T14:41:00Z

This was fixed in 1.22 (it is also fixed in the C-parser, but that doesn't really matter).

seberg · 2022-01-12T03:59:44Z

Woops, no, not yet fixed probably, that would be gh-19042

seberg · 2022-02-08T13:50:33Z

Fixed by gh-20580

WarrenWeckesser changed the title ~~readtxt() changes numbers if integers are read as strings~~ loadtxt() changes numbers if integers are read as strings Sep 9, 2020

WarrenWeckesser added 00 - Bug component: numpy.lib labels Sep 9, 2020

leondgarse mentioned this issue Dec 23, 2020

Reformat and speed up IJB evaluation deepinsight/insightface#1349

Merged

DFEvans mentioned this issue May 19, 2021

BUG: fix string truncation bug in loadtxt #19042

Closed

mberr mentioned this issue Dec 21, 2021

load_triples and np.loadtxt might load incorrect data pykeen/pykeen#694

Closed

seberg closed this as completed Jan 11, 2022

seberg reopened this Jan 12, 2022

seberg closed this as completed Feb 8, 2022
