Description
Proposed new feature or change:
cf. #29226
The hash functions in the standard-library hashlib module use the buffer protocol to read the bytes in the ndarray buffer.
This is suboptimal for dtypes that hold references (e.g. StringDType; see #29226 for how this goes wrong), but the issue is more general. Using hashlib like this breaks in several ways:
- Two values that compare equal can have different byte representations and thus different hashes:
>>> hashlib.sha256(np.array(+0.0)).hexdigest()
'af5570f5a1810b7af78caf4bc70a660f0df51e42baf91d4de5b2328de0e83dfc'
>>> hashlib.sha256(np.array(-0.0)).hexdigest()
'e6ad6c9a3a3b7658c35bacf6553fcb8ffe34387534a648fe18f875b8f7a86ddb'
>>> np.array(-0.0) == np.array(0.0)
np.True_
- Two different arrays can have identical byte buffers and thus identical hashes:
>>> hashlib.sha256(np.array(['ab', 'cd'])).hexdigest()
'877c5b2fcb8523ae0edcc5b3207902bedc0ed296273c646291d34e6a63d8313e'
>>> hashlib.sha256(np.array(['abcd'])).hexdigest()
'877c5b2fcb8523ae0edcc5b3207902bedc0ed296273c646291d34e6a63d8313e'
>>> byte_data = b'\x01\x02\x03\x04\x05\x06\x07\x08'
>>> hashlib.sha256(np.frombuffer(byte_data, dtype=np.uint8)).hexdigest()
'66840dda154e8a113c31dd0ad32f7f3a366a80e8136979d8f5a101d3d29d6f72'
>>> hashlib.sha256(np.frombuffer(byte_data, dtype=np.int64)).hexdigest()
'66840dda154e8a113c31dd0ad32f7f3a366a80e8136979d8f5a101d3d29d6f72'
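Each case above comes down to what is in the raw buffer. The following is a small check, not part of the original report, that compares the buffers directly with `ndarray.tobytes()` instead of going through hashlib:

```python
import numpy as np

# Equal values, different bytes: only the sign bit differs.
pos = np.array(+0.0).tobytes()
neg = np.array(-0.0).tobytes()
print(pos == neg)  # False

# Different arrays, identical bytes: dtype and shape are not in the buffer.
a = np.array(['ab', 'cd'])
b = np.array(['abcd'])
print(a.tobytes() == b.tobytes())  # True
print(a.dtype != b.dtype, a.shape != b.shape)  # True True

# The same bytes viewed as eight uint8 values or one int64 value.
byte_data = b'\x01\x02\x03\x04\x05\x06\x07\x08'
u8 = np.frombuffer(byte_data, dtype=np.uint8)
i64 = np.frombuffer(byte_data, dtype=np.int64)
print(u8.tobytes() == i64.tobytes())  # True
```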
Proposal
IMO we should add a new array hashing API to NumPy. The hash should be based on the bytes of all the values in the array as well as the shape of the array and DType metadata.
We can also compute hashes for all user DTypes that don't include references, and add an API that user DTypes that do include references (as well as StringDType) can use to produce a hash from array items.
To avoid adding a new ndarray member function, the Python API could be something like:
>>> np.hash(arr, algorithm='sha256')
I don't think it needs any other keyword arguments, since it's a statistic based on the whole array and all its metadata.
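To make the idea concrete, here is a rough pure-Python sketch of what such a function might compute. `array_hash` is a hypothetical stand-in for the proposed `np.hash`, not an existing NumPy API. The header layout (`dtype.str` plus the shape) is an assumption chosen for illustration; it does not cover richer DType metadata such as structured field names. Reference-holding dtypes would need the per-item hook described above, so the sketch simply rejects object arrays.

```python
import hashlib

import numpy as np


def array_hash(arr, algorithm="sha256"):
    """Hypothetical sketch of the proposed np.hash; not a NumPy API."""
    arr = np.asarray(arr)
    if arr.dtype.hasobject:
        # Pointers in the buffer are meaningless to hash; a per-item
        # hashing hook would be needed here (out of scope for this sketch).
        raise TypeError("dtypes that hold references are not supported")
    h = hashlib.new(algorithm)
    # Assumed header format: dtype string and shape, so that identical
    # buffers with different metadata produce different digests.
    h.update(f"{arr.dtype.str}|{arr.shape}|".encode("utf-8"))
    # tobytes() copies the values out in C order, so strided views hash
    # the same as an equal contiguous array.
    h.update(arr.tobytes(order="C"))
    return h.hexdigest()


byte_data = b'\x01\x02\x03\x04\x05\x06\x07\x08'
print(array_hash(np.array(['ab', 'cd'])) == array_hash(np.array(['abcd'])))  # False
print(array_hash(np.frombuffer(byte_data, dtype=np.uint8))
      == array_hash(np.frombuffer(byte_data, dtype=np.int64)))  # False
```

Note that this sketch still hashes `+0.0` and `-0.0` differently, because it is based on the bytes of the values; whether values that compare equal should hash equal is a separate design decision.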