loadtxt() changes numbers if integers are read as strings #17277

Closed
fratajcz opened this issue Sep 9, 2020 · 7 comments

@fratajcz

fratajcz commented Sep 9, 2020

When I read a file with loadtxt() and set the dtype to string, the contents are changed if the column contains only integers. I need to read IDs, which can be integers but can also contain characters, so I need to read them as strings.

Specifically, the last number (here 100000) loses one "0" and becomes 10000. Strangely, this only happens to the last number, and only if the list ends with a number ending in a zero. It took me hours to track this down in my code. Do you have any idea why this happens?

Reproducing code example:

>>> import numpy as np
>>> import pandas as pd

>>> liste = list(range(1,100001))
>>> df = pd.DataFrame(liste)
>>> df
            0
0           1
1           2
2           3
3           4
4           5
...       ...
99995   99996
99996   99997
99997   99998
99998   99999
99999  100000

[100000 rows x 1 columns]
>>> df.to_csv("testfile",header=False,index=False)
>>> liste2 = np.loadtxt("testfile",dtype="str",delimiter=",",skiprows=0,usecols=0)
>>> liste2[-1]
'10000'
>>> liste2 = np.loadtxt("testfile",dtype="int",delimiter=",",skiprows=0,usecols=0)
>>> liste2[-1]
100000
>>> liste2 = np.loadtxt("testfile",dtype="str",delimiter=",",skiprows=0,usecols=0)
>>> liste1str = list(map(str,liste))
>>> liste1str == liste2
array([ True,  True,  True, ...,  True,  True, False])
>>> liste2[99]
'100'
>>> liste2[999]
'1000'
>>> liste2[9999]
'10000'
>>> liste2[99999]
'10000'
@WarrenWeckesser changed the title from "readtxt() changes numbers if integers are read as strings" to "loadtxt() changes numbers if integers are read as strings" on Sep 9, 2020
@WarrenWeckesser
Member

@fratajcz, thanks for reporting this problem. This is a bug in how loadtxt determines the string dtype when the given dtype is 'S' or 'U' (with no explicit string length) and the longest string occurs after 50001 lines.

loadtxt reads lines of data in chunks of 50001 lines. The code that reads and assembles the chunks begins at https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L1148, where you can see that when dtype is, say, 'S', the first chunk will establish the actual length of the string dtype. Subsequent chunks are appended to the array X using X.resize(...), which does not change the dtype.

Here's another example. The last value in the result of loadtxt should be b'9999' (and the dtype of the array should be 'S4').

In [50]: y = np.ones(50002, dtype=int)                                                       

In [51]: y[-1] = 9999                                                                        

In [52]: np.savetxt('y.txt', y, fmt="%d")                                                    

In [53]: np.loadtxt('y.txt', dtype='S')                                                      
Out[53]: array([b'1', b'1', b'1', ..., b'1', b'1', b'9'], dtype='|S1')

If we skip one row, so all the data to be read is in the first block of 50001 lines, the dtype and the last value are correct:

In [54]: np.loadtxt('y.txt', dtype='S', skiprows=1)                                          
Out[54]: array([b'1', b'1', b'1', ..., b'1', b'1', b'9999'], dtype='|S4')
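
The mechanism described in this comment can be imitated with plain array operations, without calling loadtxt at all. The following is a minimal sketch, not the actual npyio.py code: the array contents are made up for illustration, and only the resize-then-assign pattern is taken from the description above.

```python
import numpy as np

# First chunk: every entry is one character long, so NumPy infers a
# one-character unicode dtype for it.
X = np.array(["1", "1", "1"], dtype="U")
first_dtype = X.dtype

# A later chunk containing a longer entry; on its own it would get a
# four-character unicode dtype.
next_chunk = np.array(["1", "9999"], dtype="U")

# Grow X in place (the dtype is unchanged), then copy the later chunk in.
# The assignment casts the wider strings down to the narrower dtype and
# truncates them without any warning.
pos = X.shape[0]
X.resize((pos + next_chunk.shape[0],), refcheck=False)
X[pos:] = next_chunk

print(first_dtype, X.dtype)  # the same dtype before and after the resize
print(X[-1])                 # prints 9, not 9999
```

Because the width is fixed by whatever the first chunk happened to contain, any longer value that arrives later is cut to that width.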

@WarrenWeckesser
Member

@fratajcz (or anyone else encountering this problem): if you know in advance what the maximum length of your IDs will be, you can work around the bug by giving an explicit length with the dtype, e.g. if the maximum length of an ID is 8, you can use dtype='U8' in the call to loadtxt() instead of dtype="str".
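
As a rough illustration of this workaround, the sketch below rebuilds a file like the one in the report and loads it with an explicit width. The temporary file path and the variable names are assumptions made for the example; the second variant, which measures the longest field before loading, is an extra suggestion for cases where the maximum length is not known in advance.

```python
import os
import tempfile

import numpy as np

# Build a file like the one in the report: 100000 integer IDs, one per line.
path = os.path.join(tempfile.mkdtemp(), "testfile")
with open(path, "w") as f:
    f.writelines(f"{i}\n" for i in range(1, 100001))

# Variant 1: explicit width (assumes no ID is longer than 8 characters).
ids_fixed = np.loadtxt(path, dtype="U8", delimiter=",", usecols=0)

# Variant 2: scan the file once to find the longest field in column 0,
# then pass exactly that width to loadtxt.
with open(path) as f:
    max_len = max(len(line.strip().split(",")[0]) for line in f)
ids_measured = np.loadtxt(path, dtype=f"U{max_len}", delimiter=",", usecols=0)

print(ids_fixed[-1])       # prints 100000
print(ids_measured.dtype)  # a six-character unicode dtype
```

Variant 2 reads the file twice, which is the price of not having to guess an upper bound for the ID length.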

@fratajcz
Author

fratajcz commented Sep 9, 2020

Hi @WarrenWeckesser,

thanks for the clarification, that explains a lot!

Cheers!

DFEvans added a commit to DFEvans/numpy that referenced this issue May 19, 2021
Closes numpy#17277. If loadtxt is passed an unsized string or byte dtype,
the size is set automatically from the longest entry in the first
50000 lines. If longer entries appeared later, they were silently
truncated.
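
Since the truncation described in this commit message is silent, it can help to check an already loaded array against its source file. The sketch below is one possible check, not part of NumPy: `string_width` and `looks_truncated` are hypothetical helper names, and the code assumes a simple delimited file with no comment lines or quoting.

```python
import os
import tempfile

import numpy as np


def string_width(arr):
    """Return how many characters (or bytes) each element of arr can hold."""
    if arr.dtype.kind == "U":
        return arr.dtype.itemsize // np.dtype("U1").itemsize
    if arr.dtype.kind == "S":
        return arr.dtype.itemsize
    raise TypeError("expected a string or bytes array")


def looks_truncated(path, arr, delimiter=",", column=0):
    """Return True if the longest raw field is wider than the array's dtype."""
    with open(path) as f:
        longest = max(
            len(line.rstrip("\n").split(delimiter)[column]) for line in f
        )
    return longest > string_width(arr)


# Tiny demonstration with hand-built arrays instead of a real loadtxt call.
path = os.path.join(tempfile.mkdtemp(), "ids.txt")
with open(path, "w") as f:
    f.write("1\n9999\n")

truncated = np.array(["1", "9"], dtype="U1")
intact = np.array(["1", "9999"], dtype="U4")
print(looks_truncated(path, truncated))  # True
print(looks_truncated(path, intact))     # False
```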
anntzer pushed commits with the same message to anntzer/numpy that referenced this issue on Aug 17 (twice), Aug 21, Aug 22, Aug 23, and Aug 26, 2021.
@seberg
Member

seberg commented Jan 11, 2022

This was fixed in 1.22 (it is also fixed in the C-parser, but that doesn't really matter).

@seberg closed this as completed Jan 11, 2022
@seberg
Member

seberg commented Jan 12, 2022

Whoops, no, this is probably not fixed yet; that would be gh-19042.

@seberg reopened this Jan 12, 2022
@seberg
Member

seberg commented Feb 8, 2022

Fixed by gh-20580

@seberg closed this as completed Feb 8, 2022
@enricorox

This comment was marked as off-topic.
