I just spent an hour tracking down a bug related to appending one char array to another via indexed assignments. Most of my examples exhibited somewhat crazy behavior, and it took me a while to realize that it was all caused by the maximum string length of the initial array.
While np.char.add works fine (it creates a new array with a new dtype, i.e. a new maximum length), indexed assignments don't create a new array. Concatenation is not the problem at all; it's the assignments themselves. This is consistent with how numeric arrays work in numpy, but in combination with the fact that numpy creates char arrays with the smallest dtype possible, it is surprising to say the least.
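The original code examples did not survive here; as a minimal sketch of the behavior described above (array contents and widths are my own illustrative choices):

```python
import numpy as np

# Creating an array from short strings yields the narrowest dtype: <U2.
arr = np.array(["AB", "CD"])
print(arr.dtype)        # <U2

# np.char.add returns a NEW array with a wide-enough dtype.
joined = np.char.add(arr, arr)
print(joined.dtype)     # <U4
print(joined[0])        # ABAB

# Indexed assignment writes into the existing <U2 buffer,
# so the value is silently truncated.
arr[0] = "ABAB"
print(arr[0])           # AB
```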
Now, with numeric arrays, one hits this problem much less frequently, since numeric types are less granular and numpy does not choose the smallest dtype possible. E.g., np.array(1).dtype is int32 and not (u)int8. By contrast, np.array("1").dtype is <U1.
Also, some indexed assignments on numeric arrays fail with an overflow error; with char arrays, results are shortened silently.
For the record, here's a list of independent (I thought!) real-life surprising behaviors caused by the current behavior:

I have a number of ideas to make this work:

1. Do not autoselect the narrowest type possible upon creation of an array; instead, autoselect a wider type. (I must admit that when proposing this, I had not expected it to have an impact on memory at all, but it seems that x = np.array("1", dtype="<U500000000") does allocate a lot of memory.) Still, there is a reason for np.array(1).dtype to be int32, and that also has an impact on memory allocation.
2. Allow auto-extension ("promotion"?) of the array on indexed assignments.
3. Introduce a new auto-extending dtype. Again, before typing this, I had expected np.array("1", dtype=str) to be that, but it seems that np.array("1", dtype=str) is simply the same as np.array("1", dtype="<U1").
4. As a final alternative, I'd vote for throwing an error on indexed assignments that exceed the maximum length of the array.
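On the observation about dtype=str, the equivalence can be checked directly; dtype=str does not select a variable-length type:

```python
import numpy as np

# dtype=str resolves to the same fixed-width dtype as the default,
# not a variable-length string type.
x = np.array("1", dtype=str)
y = np.array("1")
print(x.dtype, y.dtype)   # <U1 <U1
```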
I agree that the behavior you're running into is confusing. Unfortunately it's very old behavior in NumPy and I don't think we can change any of those behaviors very easily since it's likely there are users relying on the truncation that you don't want. It's unfortunately the nature of fixed-width strings that NumPy has to choose a per-element string length and it can't anticipate what you might do subsequently with the array (although you could manually choose a string size in your application, since you know more about what's happening than NumPy does).
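The manual workaround mentioned here can be sketched as choosing a generous width up front (the width 8 below is an arbitrary choice, not anything NumPy prescribes):

```python
import numpy as np

# Pick a fixed width that covers all later assignments.
arr = np.array(["A", "B"], dtype="<U8")
arr[0] = arr[0] + "BCD"   # fits in the 8-character slots, no truncation
print(arr[0])             # ABCD
```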
That said, I'm actively working on a new string dtype for NumPy that won't have the limitations you're running into, see #24483.
What a crazy coincidence! Good luck - I may comment on that PR if I find time to read it (although I am sure there are plenty of numpyers contributing ideas and comments already).