
ENH: char arrays auto-extending on indexed assignments #24506


Closed
bersbersbers opened this issue Aug 23, 2023 · 3 comments
@bersbersbers
Contributor

Proposed new feature or change:

I just spent an hour tracking down a bug related to appending one char array to another using indexed assignments. Most of my examples exhibited behavior that looked erratic, and it took me a while to realize it was all caused by the fixed maximum element length of the initial array.

While np.char.add works fine, since it creates a new array with a new dtype (read: a new maximum length), indexed assignments don't create a new array. So the concatenation is not the problem at all; it's the assignments themselves. This is consistent with how numeric arrays work in numpy, but combined with the fact that numpy creates char arrays with the smallest dtype possible, it is surprising to say the least.

Take this example:

import numpy as np

arr = np.array(["A", "A", "A", "AA"])
arr[0] = "BB"
print(arr[0])

arr = np.array(["A", "A", "A"])
arr[0] = "BB"
print(arr[0])

I would have expected arr[0] to be 'BB' both times, but that holds only in the first case: the first array gets dtype <U2 (because of "AA"), so "BB" fits, while the second gets <U1 and the assignment is silently truncated to 'B'.
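The dtypes make this concrete; a small check using the same two arrays as above:

```python
import numpy as np

# The first array gets dtype <U2 (because of "AA"), so "BB" fits;
# the second gets <U1, and the assignment is silently truncated.
wide = np.array(["A", "A", "A", "AA"])
narrow = np.array(["A", "A", "A"])
print(wide.dtype, narrow.dtype)  # <U2 <U1

wide[0] = "BB"
narrow[0] = "BB"
print(wide[0], narrow[0])  # BB B
```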

As I said, this kind of behavior is not uncommon in numpy. The following example fails only in the second case:

import numpy as np

arr = np.array([1 << 30, 1 << 31])
arr[0] = 1 << 31
print(arr[0])

arr = np.array([1 << 30, 1 << 30])
arr[0] = 1 << 31
print(arr[0])

Now, with numeric arrays, one hits this problem much less frequently, since numeric types are less granular and numpy does not choose the smallest dtype possible. E.g., np.array(1).dtype is int32 (on platforms with a 32-bit default integer; int64 elsewhere) and not (u)int8. By contrast, np.array("1").dtype is <U1.
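The contrast in default dtype selection can be seen directly (the exact integer dtype is platform-dependent, so only its minimum width is asserted here):

```python
import numpy as np

# The default integer dtype is a machine word (int64 on most
# platforms, int32 on Windows), leaving lots of headroom; the
# default string dtype is exactly as wide as the longest input,
# leaving none at all.
i = np.array(1)
s = np.array("1")
print(i.dtype, s.dtype)
```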

Also, as seen above, indexed assignments on numeric arrays can fail with an overflow error; with char arrays, the result is silently truncated instead.

For the record, here's a list of (I thought!) independent, surprising real-life behaviors, all caused by this:

import numpy as np

X = "X"
LIST = [X] * 3
STR = X * 3

print("0) replacing fails as expected")
arr = np.array(["A", "B", "C"])
try:
    arr[arr != "C"] = LIST
except ValueError:
    print("Caught ValueError")
else:
    raise Exception("Expected ValueError")

print("1) replacing works despite length mismatch (should it fail?)")
arr = np.array(["A", "B", "C"])
arr[arr != "C"] = STR
print(arr)

print("2) appending does nothing (should print ['AX', 'BX', 'C'])")
arr = np.array(["A", "B", "C"])
arr[arr != "C"] = np.char.add(arr[arr != "C"], X)
print(arr)

print("3) appending works incompletely (should print ['A X', 'BB X', 'CC X'])")
arr = np.array(["A", "BB", "CC"])
arr[arr != "C"] = np.char.add(arr[arr != "C"], f" {X}")
print(arr)

print("4) appending works incompletely (should print ['A X', 'BBB X', 'CCC X'])")
arr = np.array(["A", "BBB", "CCC"], dtype="<U4")
arr[arr != "C"] = np.char.add(arr[arr != "C"], f" {X}")
print(arr)

I have a number of ideas to make this work:

  • Do not autoselect the narrowest type possible upon creation of an array; instead, autoselect a wider type. (I must admit that when proposing this, I had not expected it to have much impact on memory, but it seems that x = np.array("1", dtype="<U500000000") does allocate a lot of memory.) Still, there is a reason np.array(1).dtype is int32 rather than (u)int8, and that also has an impact on memory allocation.
  • Allow auto-extension ("promotion"?) of the array on indexed assignments.
  • Introduce a new auto-extending type. Again, before typing this, I had expected np.array("1", dtype=str) to be exactly that, but it turns out that np.array("1", dtype=str) is simply np.array("1", dtype="<U1").

As a final alternative, I'd vote for throwing an error on indexed assignments that exceed the maximum element length of the array's dtype.
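As an aside (a workaround rather than one of the proposals above), explicitly widening the dtype with astype before assigning avoids the silent truncation; a minimal sketch:

```python
import numpy as np

arr = np.array(["A", "B", "C"])  # dtype <U1

# Widen explicitly to make room for the longest value we plan to
# assign; astype returns a new array with the wider fixed-width dtype.
arr = arr.astype("<U2")
arr[0] = "BB"  # now fits, no silent truncation
print(arr)     # ['BB' 'B' 'C']
```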

@ngoldbaum
Member

I agree that the behavior you're running into is confusing. Unfortunately, it's very old behavior in NumPy, and I don't think we can change any of it easily, since there are likely users relying on the very truncation you don't want. It's unfortunately the nature of fixed-width strings that NumPy has to choose a per-element string length, and it can't anticipate what you might do subsequently with the array (although you could manually choose a string size in your application, since you know more about what's happening than NumPy does).
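Manually choosing a string size, as suggested, might look like this (a sketch; "<U10" is an arbitrary width, chosen to cover the longest value ever assigned):

```python
import numpy as np

# With an explicitly over-wide element dtype, the appended
# values fit and nothing is truncated.
arr = np.array(["A", "B", "C"], dtype="<U10")
arr[arr != "C"] = np.char.add(arr[arr != "C"], "X")
print(arr)  # ['AX' 'BX' 'C']
```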

That said, I'm actively working on a new string dtype for NumPy that won't have the limitations you're running into, see #24483.
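(Editorially, and as an assumption not stated in this thread: the work referenced in #24483 later shipped in NumPy 2.0 as np.dtypes.StringDType, a variable-width string dtype where indexed assignment never truncates. A guarded sketch that also runs on older NumPy versions lacking it:)

```python
import numpy as np

# np.dtypes.StringDType (NumPy >= 2.0) stores variable-width strings,
# so assigning a longer value simply works. The hasattr guard lets
# this sketch run, as a no-op, on older NumPy versions too.
ok = True
if hasattr(np, "dtypes") and hasattr(np.dtypes, "StringDType"):
    arr = np.array(["A", "B", "C"], dtype=np.dtypes.StringDType())
    arr[0] = "a much longer string"  # no truncation
    ok = (arr[0] == "a much longer string")
print(ok)
```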

@bersbersbers
Contributor Author

That said, I'm actively working on a new string dtype for NumPy that won't have the limitations you're running into, see #24483.

What a crazy coincidence! Good luck - I may comment on that PR if I find time to read it (although I am sure there are plenty of numpyers contributing ideas and comments already).

@ngoldbaum
Member

OK, closing this for now; I think it's pretty unlikely we're going to change any of these behaviors in the existing fixed-width string dtypes.
