ENH: Add a public API for generating hashable buffers

### Proposed new feature or change:

cf. https://github.com/numpy/numpy/issues/29226

The hash functions in the standard library's `hashlib` module read an ndarray through the buffer protocol, so the hash is computed over the raw bytes in the array's buffer.

This is suboptimal for dtypes that hold references (e.g. StringDType; see #29226 for how this goes wrong), but the problem is more general. Hashing an array through `hashlib` like this breaks in several ways:

* Where two values that compare equal can have different byte representations and thus different hashes:

```python
>>> import hashlib
>>> import numpy as np
>>> hashlib.sha256(np.array(+0.0)).hexdigest()
'af5570f5a1810b7af78caf4bc70a660f0df51e42baf91d4de5b2328de0e83dfc'
>>> hashlib.sha256(np.array(-0.0)).hexdigest()
'e6ad6c9a3a3b7658c35bacf6553fcb8ffe34387534a648fe18f875b8f7a86ddb'
>>> np.array(-0.0) == np.array(0.0)
np.True_
```

* Where two different arrays have identical byte buffers and thus identical hashes:

```python
>>> hashlib.sha256(np.array(['ab', 'cd'])).hexdigest()
'877c5b2fcb8523ae0edcc5b3207902bedc0ed296273c646291d34e6a63d8313e'
>>> hashlib.sha256(np.array(['abcd'])).hexdigest()
'877c5b2fcb8523ae0edcc5b3207902bedc0ed296273c646291d34e6a63d8313e'
```

```python
>>> byte_data = b'\x01\x02\x03\x04\x05\x06\x07\x08'
>>> hashlib.sha256(np.frombuffer(byte_data, dtype=np.uint8)).hexdigest()
'66840dda154e8a113c31dd0ad32f7f3a366a80e8136979d8f5a101d3d29d6f72'
>>> hashlib.sha256(np.frombuffer(byte_data, dtype=np.int64)).hexdigest()
'66840dda154e8a113c31dd0ad32f7f3a366a80e8136979d8f5a101d3d29d6f72'
```

## Proposal

IMO we should add a new array hashing API to NumPy. The hash should be based on the bytes of all the values in the array as well as the shape of the array and DType metadata.
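As a rough illustration of what "bytes plus shape plus dtype" could mean, here is a minimal sketch built on `hashlib`. The name `array_digest` is hypothetical and not an existing NumPy function. It assumes a dtype without references, and it uses `dtype.str` as a stand-in for DType metadata, which is not enough for structured or parametric user DTypes.

```python
import hashlib

import numpy as np


def array_digest(arr, algorithm="sha256"):
    """Hypothetical helper (not a NumPy API): digest of dtype, shape and values."""
    arr = np.asarray(arr)
    if arr.dtype.hasobject:
        # Reference-holding dtypes are out of scope for this sketch.
        raise TypeError("cannot hash the raw bytes of a dtype that holds references")
    h = hashlib.new(algorithm)
    # Mix in the dtype and shape so that identical byte buffers with
    # different interpretations produce different digests.
    h.update(arr.dtype.str.encode("ascii"))
    h.update(repr(arr.shape).encode("ascii"))
    # tobytes() copies the values out in C order, so non-contiguous
    # views hash the same as their contiguous copies.
    h.update(arr.tobytes())
    return h.hexdigest()
```

With this sketch, `np.array(['ab', 'cd'])` and `np.array(['abcd'])` no longer collide, and neither do the `uint8` and `int64` views of the same buffer. It does not address the `+0.0` / `-0.0` case, since it still hashes the value bytes as they are.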

We can also compute hashes for all user DTypes that don't include references, and add an API that user DTypes that do include references (as well as StringDType) can use to produce a hash from array items.
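For dtypes that hold references, the hash would have to come from the items rather than from the buffer. A minimal sketch of that idea for arrays of Python strings follows. The name `digest_from_items` and the length-prefixed UTF-8 encoding are assumptions made for illustration, not a proposed design.

```python
import hashlib

import numpy as np


def digest_from_items(arr, algorithm="sha256"):
    """Hypothetical item-based digest for arrays whose items are strings."""
    arr = np.asarray(arr)
    h = hashlib.new(algorithm)
    h.update(repr(arr.shape).encode("ascii"))
    for item in arr.flat:
        data = str(item).encode("utf-8")
        # The length prefix keeps ['ab', 'cd'] and ['abc', 'd'] distinct.
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.hexdigest()
```

Because the digest is built from the item values, it does not depend on where the referenced data lives in memory.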

To avoid adding a new ndarray member function, the Python API could be something like:

```python
>>> np.hash(arr, algorithm='sha256')
```

I don't think it needs any other keyword arguments, since it's a statistic based on the whole array and all its metadata.

ENH: Add a public API for generating hashable buffers #29229

Proposed new feature or change:

Proposal

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

ENH: Add a public API for generating hashable buffers #29229

Description

Proposed new feature or change:

Proposal

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions