ENH: Add a public API for generating hashable buffers #29229

Open
@ngoldbaum

Description

Proposed new feature or change:

cf. #29226

The standard-library hash functions in hashlib use the buffer protocol to read the raw bytes of the ndarray buffer.

This is suboptimal for dtypes that hold references (StringDType, see #29226 for how this goes wrong), but is a more general issue. Using hashlib like this breaks in several ways:

  • Where two values that compare equal can have different byte representations and thus different hashes:
>>> hashlib.sha256(np.array(+0.0)).hexdigest()
'af5570f5a1810b7af78caf4bc70a660f0df51e42baf91d4de5b2328de0e83dfc'
>>> hashlib.sha256(np.array(-0.0)).hexdigest()
'e6ad6c9a3a3b7658c35bacf6553fcb8ffe34387534a648fe18f875b8f7a86ddb'
>>> np.array(-0.0) == np.array(0.0)
np.True_
  • Where two arrays that differ in shape or dtype have identical byte buffers and thus identical hashes:
>>> hashlib.sha256(np.array(['ab', 'cd'])).hexdigest()
'877c5b2fcb8523ae0edcc5b3207902bedc0ed296273c646291d34e6a63d8313e'
>>> hashlib.sha256(np.array(['abcd'])).hexdigest()
'877c5b2fcb8523ae0edcc5b3207902bedc0ed296273c646291d34e6a63d8313e'
>>> byte_data = b'\x01\x02\x03\x04\x05\x06\x07\x08'
>>> hashlib.sha256(np.frombuffer(byte_data, dtype=np.uint8)).hexdigest()
'66840dda154e8a113c31dd0ad32f7f3a366a80e8136979d8f5a101d3d29d6f72'
>>> hashlib.sha256(np.frombuffer(byte_data, dtype=np.int64)).hexdigest()
'66840dda154e8a113c31dd0ad32f7f3a366a80e8136979d8f5a101d3d29d6f72'

Proposal

IMO we should add a new array hashing API to NumPy. The hash should be based on the bytes of all the values in the array as well as the shape of the array and DType metadata.

We can also compute hashes for all user DTypes that don't hold references. For user DTypes that do hold references (and for StringDType), we can add an API that lets the DType produce a hash from the array items.

To avoid adding a new ndarray member function, the Python API could be something like:

>>> np.hash(arr, algorithm='sha256')

I don't think it needs any other keyword arguments, since it's a statistic based on the whole array and all its metadata.
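As a rough illustration of the idea, here is a pure-Python sketch that mixes the dtype and shape into the digest before the data bytes. `array_digest` is a hypothetical helper, not an existing or proposed NumPy function, and the header layout is an assumption made for this example. It only covers plain-data dtypes: it rejects dtypes that report `hasobject`, it does not normalize signed zeros, and `dtype.str` does not capture every piece of DType metadata (structured field names, for instance).

```python
import hashlib
import numpy as np

def array_digest(arr, algorithm="sha256"):
    """Hypothetical sketch: hash dtype, shape, and data bytes together."""
    arr = np.asarray(arr)
    if arr.dtype.hasobject:
        # The raw bytes would be pointers here, so they are meaningless to hash.
        raise TypeError("this sketch does not support dtypes holding references")

    h = hashlib.new(algorithm)
    # Header (assumed layout): dtype string such as '<i8' or '<U2', plus shape.
    header = f"{arr.dtype.str}\n{arr.shape!r}\n".encode("ascii")
    # Length-prefix the header so it cannot be confused with the data bytes.
    h.update(len(header).to_bytes(8, "little"))
    h.update(header)
    # tobytes() copies the elements in C order, even for non-contiguous views.
    h.update(arr.tobytes(order="C"))
    return h.hexdigest()

# The colliding pairs from the examples above now get distinct digests.
byte_data = b'\x01\x02\x03\x04\x05\x06\x07\x08'
print(array_digest(np.array(['ab', 'cd'])) != array_digest(np.array(['abcd'])))
print(array_digest(np.frombuffer(byte_data, dtype=np.uint8))
      != array_digest(np.frombuffer(byte_data, dtype=np.int64)))
```

A real implementation would presumably live in C and dispatch to the DType for the per-item bytes, which is what would let it handle StringDType and equal-but-differently-encoded values such as +0.0 and -0.0.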
