
BUG: hashlib: hash collision for StringDType #29226

Open

@crusaderky

Description

Describe the issue:

Normally, the buffer interface is disabled for StringDType:

>>> bytes(np.asarray(["foo"], dtype="T"))
ValueError: cannot include dtype 'numpy.dtypes.StringDType' in a buffer

hashlib, however, bypasses this restriction: hash constructors such as hashlib.sha256 (and the update method of hash objects) consume the array's buffer without complaint.
Everything works as expected as long as all strings in the array are short enough to be embedded:

>>> hashlib.sha256(np.asarray(["foo"], dtype="T")).hexdigest()
'00b794341eca9773c68aa1f97a4607f3bdb174be0621d105bcb5c476918357f2'
>>> hashlib.sha256(np.asarray(["foo"], dtype="T")).hexdigest()
'00b794341eca9773c68aa1f97a4607f3bdb174be0621d105bcb5c476918357f2'
>>> hashlib.sha256(np.asarray(["bar"], dtype="T")).hexdigest()
'3ba8f5d1f21987f366af576a1b700c385e12e704c0e8d0766bff242424628acb'

However, as soon as a string is too long to be embedded and is stored in the arena instead, this falls apart, with different strings returning the same hash. It seems that the hash covers only the per-string metadata in the array buffer (such as the string length), not the character data it points to:

>>> a = np.asarray(["This string is extremely long"], dtype="T")
>>> b = np.asarray(["This string is extremely long"], dtype="T")
>>> c = np.asarray(["This string is different long"], dtype="T")
>>> d = np.asarray(["This string is extremely long and longer"], dtype="T")
>>> hashlib.sha256(a).hexdigest()
'efe5a2fd86134b1643906cbd7ee3bc22c3e6fd7c3b16f895618b939a4190f556'
>>> hashlib.sha256(b).hexdigest()
'efe5a2fd86134b1643906cbd7ee3bc22c3e6fd7c3b16f895618b939a4190f556'
>>> hashlib.sha256(c).hexdigest()
'efe5a2fd86134b1643906cbd7ee3bc22c3e6fd7c3b16f895618b939a4190f556'  # Hash collision!
>>> hashlib.sha256(d).hexdigest()
'4fbaa09460a3fba24f4a354960fc0a0c06cc70e7cd49b2d37e0a326161a09e33'
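One can verify that the data genuinely differs even though the digests match; a minimal sketch, assuming one of the affected versions listed below:

import hashlib
import numpy as np

a = np.asarray(["This string is extremely long"], dtype="T")
c = np.asarray(["This string is different long"], dtype="T")

# The element values clearly differ...
assert not np.array_equal(a, c)
# ...yet the buffer-based digests are identical on affected versions.
assert hashlib.sha256(a).hexdigest() == hashlib.sha256(c).hexdigest()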

This can have devastating consequences. When writing unit tests, it is very easy to forget that StringDType changes its memory layout once a string exceeds the maximum length for embedded strings. As a result, it is easy to write an entire test suite that never exercises long strings, ship to production, and only then hit hash collisions.
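One way to guard a test suite against this trap is to parametrize string fixtures over both layouts; a hedged sketch (test_digest_reflects_contents is hypothetical, and on affected versions it currently fails for the long case):

import hashlib
import numpy as np
import pytest

# Sketch of a regression test covering both the embedded and the arena
# layout. The embedded-length threshold is an implementation detail,
# so the long case uses a string well past any plausible limit.
@pytest.mark.parametrize("base", ["foo", "x" * 64])
def test_digest_reflects_contents(base):
    a = np.asarray([base + "a"], dtype="T")
    b = np.asarray([base + "b"], dtype="T")
    assert hashlib.sha256(a).hexdigest() != hashlib.sha256(b).hexdigest()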

Mitigation:

The documentation says

Also note that unlike fixed-width strings and most other NumPy data types, StringDType does not store the string data in the “main” ndarray data buffer. Instead, the array buffer is used to store metadata about where the string data are stored in memory. This difference means that code expecting the array buffer to contain string data will not function correctly, and will need to be updated to support StringDType.
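Until that update happens, code that needs a content-based digest can serialize the strings explicitly instead of hashing the array buffer; a minimal sketch, where stable_digest is a hypothetical helper rather than a NumPy API:

import hashlib
import numpy as np

def stable_digest(arr):
    # Hash the decoded string contents instead of the array buffer,
    # which for StringDType holds only per-string metadata.
    h = hashlib.sha256()
    if arr.dtype.kind == "T":  # StringDType
        for s in arr.ravel():
            data = str(s).encode("utf-8")
            h.update(len(data).to_bytes(8, "little"))  # length prefix avoids ambiguity
            h.update(data)
    else:
        h.update(np.ascontiguousarray(arr))
    return h.hexdigest()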

Expected behaviour:

Feeding a StringDType array to hashlib should always raise, just like bytes does.
However, this is a breaking change for users who simply never have long strings in their data, or who sanitize their data first, so it should follow a deprecation cycle.
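In the meantime, callers can enforce that behaviour themselves before handing arrays to hashlib; a hedged sketch (require_hashable_buffer is hypothetical):

import numpy as np

def require_hashable_buffer(arr):
    # Hypothetical user-side guard that mirrors what bytes() already
    # does: refuse StringDType (kind "T"), whose buffer holds metadata
    # rather than character data.
    if arr.dtype.kind == "T":
        raise ValueError("cannot hash the buffer of a StringDType array")
    return arr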

Python and NumPy Versions:

Reproduced on

  • python 3.9, numpy 2.0.2
  • python 3.13, numpy 2.3.0

Context for the issue:

versioned-hdf5 hashes chunks for the purpose of deduplication. If a chunk is updated to a value identical to a chunk that already exists on disk, it is not written again. Likewise, two identical chunks in memory result in a single chunk on disk.
I'm currently in the process of adding NpyStrings support.
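For illustration, a minimal sketch of that kind of hash-keyed deduplication (not versioned-hdf5's actual code); with the collision above, two chunks holding different long strings would silently collapse into one:

import hashlib
import numpy as np

store = {}  # digest -> chunk; one entry per unique chunk

def write_chunk(chunk):
    # Content-addressed write: a chunk identical to one already stored
    # is not written again; callers reference it by its digest.
    key = hashlib.sha256(np.ascontiguousarray(chunk)).hexdigest()
    if key not in store:
        store[key] = chunk.copy()
    return key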
