
ENH: char arrays auto-extending on indexed assignments #24506


Closed
bersbersbers opened this issue Aug 23, 2023 · 3 comments
@bersbersbers
Contributor

Proposed new feature or change:

I just spent an hour tracking down a bug related to appending one char array to another using indexed assignments. Most of my examples exhibited behavior that looked erratic, and it took me a while to realize it was all caused by the fixed maximum element length of the initial array.

While np.char.add works fine, since it creates a new array with a new dtype (read: a new maximum length), indexed assignments don't create a new array. So the concatenation is not the problem at all; it's the assignments themselves. This is consistent with how numeric arrays work in numpy, but combined with the fact that numpy creates char arrays with the smallest dtype possible, it is surprising to say the least.

Take this example:

import numpy as np

arr = np.array(["A", "A", "A", "AA"])
arr[0] = "BB"
print(arr[0])

arr = np.array(["A", "A", "A"])
arr[0] = "BB"
print(arr[0])

I would have expected arr[0] to be 'BB' both times, but that holds only in the first case: the first array gets dtype <U2 (because of "AA"), so "BB" fits, while the second gets <U1 and the assignment is silently truncated to 'B'.
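The dtypes make this concrete; a small check using the same two arrays as above:

```python
import numpy as np

# The first array gets dtype <U2 (because of "AA"), so "BB" fits;
# the second gets <U1, and the assignment is silently truncated.
wide = np.array(["A", "A", "A", "AA"])
narrow = np.array(["A", "A", "A"])
print(wide.dtype, narrow.dtype)  # <U2 <U1

wide[0] = "BB"
narrow[0] = "BB"
print(wide[0], narrow[0])  # BB B
```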

As I said, this kind of behavior is not uncommon in numpy. The following example fails only in the second case:

import numpy as np

arr = np.array([1 << 30, 1 << 31])
arr[0] = 1 << 31
print(arr[0])

arr = np.array([1 << 30, 1 << 30])
arr[0] = 1 << 31
print(arr[0])

Now, with numeric arrays, one hits this problem much less frequently, since numeric types are less granular and numpy does not choose the smallest dtype possible. E.g., np.array(1).dtype is int32 (on platforms with a 32-bit default integer; int64 elsewhere) and not (u)int8. By contrast, np.array("1").dtype is <U1.
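The contrast in default dtype selection can be seen directly (the exact integer dtype is platform-dependent, so only its minimum width is asserted here):

```python
import numpy as np

# The default integer dtype is a machine word (int64 on most
# platforms, int32 on Windows), leaving lots of headroom; the
# default string dtype is exactly as wide as the longest input,
# leaving none at all.
i = np.array(1)
s = np.array("1")
print(i.dtype, s.dtype)
```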

Also, as seen above, indexed assignments on numeric arrays can fail with an overflow error; with char arrays, the result is silently truncated instead.

For the record, here's a list of (I thought!) independent, surprising real-life behaviors, all caused by this:

import numpy as np

X = "X"
LIST = [X] * 3
STR = X * 3

print("0) replacing fails as expected")
arr = np.array(["A", "B", "C"])
try:
    arr[arr != "C"] = LIST
except ValueError:
    print("Caught ValueError")
else:
    raise Exception("Expected ValueError")

print("1) replacing works despite length mismatch (should it fail?)")
arr = np.array(["A", "B", "C"])
arr[arr != "C"] = STR
print(arr)

print("2) appending does nothing (should print ['AX', 'BX', 'C'])")
arr = np.array(["A", "B", "C"])
arr[arr != "C"] = np.char.add(arr[arr != "C"], X)
print(arr)

print("3) appending works incompletely (should print ['A X', 'BB X', 'CC X'])")
arr = np.array(["A", "BB", "CC"])
arr[arr != "C"] = np.char.add(arr[arr != "C"], f" {X}")
print(arr)

print("4) appending works incompletely (should print ['A X', 'BBB X', 'CCC X'])")
arr = np.array(["A", "BBB", "CCC"], dtype="<U4")
arr[arr != "C"] = np.char.add(arr[arr != "C"], f" {X}")
print(arr)

I have a number of ideas to make this work:

  • Do not autoselect the narrowest type possible upon creation of an array; instead, autoselect a wider type. (I must admit that when proposing this, I had not expected it to have much impact on memory, but it seems that x = np.array("1", dtype="<U500000000") does allocate a lot of memory.) Still, there is a reason np.array(1).dtype is int32 rather than (u)int8, and that also has an impact on memory allocation.
  • Allow auto-extension ("promotion"?) of the array on indexed assignments.
  • Introduce a new auto-extending type. Again, before typing this, I had expected np.array("1", dtype=str) to be exactly that, but it turns out that np.array("1", dtype=str) is simply np.array("1", dtype="<U1").

As a final alternative, I'd vote for throwing an error on indexed assignments that exceed the maximum element length of the array's dtype.
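As an aside (a workaround rather than one of the proposals above), explicitly widening the dtype with astype before assigning avoids the silent truncation; a minimal sketch:

```python
import numpy as np

arr = np.array(["A", "B", "C"])  # dtype <U1

# Widen explicitly to make room for the longest value we plan to
# assign; astype returns a new array with the wider fixed-width dtype.
arr = arr.astype("<U2")
arr[0] = "BB"  # now fits, no silent truncation
print(arr)     # ['BB' 'B' 'C']
```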

@ngoldbaum
Member

I agree that the behavior you're running into is confusing. Unfortunately, it's very old behavior in NumPy, and I don't think we can change any of it easily, since there are likely users relying on the very truncation you don't want. It's unfortunately the nature of fixed-width strings that NumPy has to choose a per-element string length, and it can't anticipate what you might do subsequently with the array (although you could manually choose a string size in your application, since you know more about what's happening than NumPy does).
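Manually choosing a string size, as suggested, might look like this (a sketch; "<U10" is an arbitrary width, chosen to cover the longest value ever assigned):

```python
import numpy as np

# With an explicitly over-wide element dtype, the appended
# values fit and nothing is truncated.
arr = np.array(["A", "B", "C"], dtype="<U10")
arr[arr != "C"] = np.char.add(arr[arr != "C"], "X")
print(arr)  # ['AX' 'BX' 'C']
```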

That said, I'm actively working on a new string dtype for NumPy that won't have the limitations you're running into, see #24483.
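(Editorially, and as an assumption not stated in this thread: the work referenced in #24483 later shipped in NumPy 2.0 as np.dtypes.StringDType, a variable-width string dtype where indexed assignment never truncates. A guarded sketch that also runs on older NumPy versions lacking it:)

```python
import numpy as np

# np.dtypes.StringDType (NumPy >= 2.0) stores variable-width strings,
# so assigning a longer value simply works. The hasattr guard lets
# this sketch run, as a no-op, on older NumPy versions too.
ok = True
if hasattr(np, "dtypes") and hasattr(np.dtypes, "StringDType"):
    arr = np.array(["A", "B", "C"], dtype=np.dtypes.StringDType())
    arr[0] = "a much longer string"  # no truncation
    ok = (arr[0] == "a much longer string")
print(ok)
```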

@bersbersbers
Contributor Author

That said, I'm actively working on a new string dtype for NumPy that won't have the limitations you're running into, see #24483.

What a crazy coincidence! Good luck - I may comment on that PR if I find time to read it (although I am sure there are plenty of numpyers contributing ideas and comments already).

@ngoldbaum
Member

OK, closing this for now; I think it's pretty unlikely we're going to change any of these behaviors in the existing fixed-width string dtypes.
