Description
Describe the issue:
Normally, the buffer interface is disabled for StringDType:
>>> bytes(np.asarray(["foo"], dtype="T"))
ValueError: cannot include dtype 'numpy.dtypes.StringDType' in a buffer
hashlib.hash.digest bypasses this restriction.
Everything works as expected as long as all strings in the array are short enough to be embedded:
>>> hashlib.sha256(np.asarray(["foo"], dtype="T")).hexdigest()
'00b794341eca9773c68aa1f97a4607f3bdb174be0621d105bcb5c476918357f2'
>>> hashlib.sha256(np.asarray(["foo"], dtype="T")).hexdigest()
'00b794341eca9773c68aa1f97a4607f3bdb174be0621d105bcb5c476918357f2'
>>> hashlib.sha256(np.asarray(["bar"], dtype="T")).hexdigest()
'3ba8f5d1f21987f366af576a1b700c385e12e704c0e8d0766bff242424628acb'
However, as soon as strings exceed the embedding threshold and are stored in the arena, it falls apart, with different strings returning the same hash. It seems that the hash includes only the string length, not the buffer pointer or the string contents:
>>> a = np.asarray(["This string is extremely long"], dtype="T")
>>> b = np.asarray(["This string is extremely long"], dtype="T")
>>> c = np.asarray(["This string is different long"], dtype="T")
>>> d = np.asarray(["This string is extremely long and longer"], dtype="T")
>>> hashlib.sha256(a).hexdigest()
'efe5a2fd86134b1643906cbd7ee3bc22c3e6fd7c3b16f895618b939a4190f556'
>>> hashlib.sha256(b).hexdigest()
'efe5a2fd86134b1643906cbd7ee3bc22c3e6fd7c3b16f895618b939a4190f556'
>>> hashlib.sha256(c).hexdigest()
'efe5a2fd86134b1643906cbd7ee3bc22c3e6fd7c3b16f895618b939a4190f556' # Hash collision!
>>> hashlib.sha256(d).hexdigest()
'4fbaa09460a3fba24f4a354960fc0a0c06cc70e7cd49b2d37e0a326161a09e33'
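This is consistent with the packed layout described in NEP 55: for non-embedded strings, the array buffer holds per-element metadata (roughly an arena offset plus a size), not the characters themselves. A minimal model of that idea (the field names and struct layout below are illustrative, not NumPy's actual definition):

import struct

# Toy model only -- NOT NumPy's actual struct layout.
# For non-embedded strings, the array buffer holds per-element metadata,
# sketched here as (arena_offset, size), rather than the characters.
def toy_entry(arena_offset, size):
    return struct.pack("<QQ", arena_offset, size)

# Two different 29-character strings, each plausibly sitting at the same
# offset of its own freshly created arena, yield identical metadata bytes
# -- and therefore identical digests.
assert toy_entry(0, 29) == toy_entry(0, 29)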
This can have devastating consequences. When writing unit tests, it is very easy to forget that StringDType changes its memory layout once a string passes the maximum length for embedding. As a result, it is easy to write an entire test suite that never exercises long strings, reach production, and only then get hash collisions.
Mitigation:
The NumPy documentation already warns about this pitfall: “Also note that unlike fixed-width strings and most other NumPy data types, StringDType does not store the string data in the “main” ndarray data buffer. Instead, the array buffer is used to store metadata about where the string data are stored in memory. This difference means that code expecting the array buffer to contain string data will not function correctly, and will need to be updated to support StringDType.”
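Until this is fixed, one workaround is to refuse the raw buffer for StringDType arrays and hash a canonical byte encoding of the contents instead. A minimal sketch (safe_digest is a hypothetical helper, not part of any library):

import hashlib
import numpy as np

def safe_digest(arr):
    # Hypothetical helper: StringDType arrays (dtype kind 'T') expose only
    # metadata through the buffer protocol, so hash an explicit,
    # length-prefixed UTF-8 encoding of the contents instead.
    if arr.dtype.kind == "T":
        h = hashlib.sha256()
        for s in arr.ravel():
            data = str(s).encode("utf-8")
            h.update(len(data).to_bytes(8, "little"))  # length prefix avoids ambiguity
            h.update(data)
        return h.hexdigest()
    return hashlib.sha256(np.ascontiguousarray(arr)).hexdigest()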
Expected behaviour:
hashlib.hash.digest should always raise when called on a StringDType array, just like bytes does.
However, this would be a breaking change for users who simply never have long strings in their data, or who sanitize their data first, so it should follow a deprecation cycle.
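For concreteness, the desired end state (the proposed behaviour after the deprecation cycle, not what currently happens) would mirror the bytes example above:
>>> hashlib.sha256(np.asarray(["foo"], dtype="T"))
ValueError: cannot include dtype 'numpy.dtypes.StringDType' in a buffer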
Python and NumPy Versions:
Reproduced on
- python 3.9, numpy 2.0.2
- python 3.13, numpy 2.3.0
Context for the issue:
versioned-hdf5 hashes chunks for the purpose of deduplication: if a chunk is updated to a value identical to a chunk that already exists on disk, it is not written again, and two identical chunks in memory result in a single chunk on disk.
I'm currently in the process of adding NpyStrings support.
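For illustration, deduplication schemes of this shape silently drop data when two distinct chunks hash equal (names here are illustrative, not the actual versioned-hdf5 API):

import hashlib

written = {}  # digest -> chunk stored on disk

def write_chunk(chunk_bytes):
    # Content-addressed dedup: a chunk is written only if its digest is new,
    # so a hash collision makes a distinct chunk silently alias an old one.
    digest = hashlib.sha256(chunk_bytes).hexdigest()
    if digest not in written:
        written[digest] = chunk_bytes
    return digest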