Description
Describe the issue:
Normally, the buffer interface is disabled for StringDType:
>>> bytes(np.asarray(["foo"], dtype="T"))
ValueError: cannot include dtype 'numpy.dtypes.StringDType' in a buffer
hashlib.hash.digest bypasses this restriction.
Everything works as expected as long as all strings in the array are short enough to be embedded:
>>> hashlib.sha256(np.asarray(["foo"], dtype="T")).hexdigest()
'00b794341eca9773c68aa1f97a4607f3bdb174be0621d105bcb5c476918357f2'
>>> hashlib.sha256(np.asarray(["foo"], dtype="T")).hexdigest()
'00b794341eca9773c68aa1f97a4607f3bdb174be0621d105bcb5c476918357f2'
>>> hashlib.sha256(np.asarray(["bar"], dtype="T")).hexdigest()
'3ba8f5d1f21987f366af576a1b700c385e12e704c0e8d0766bff242424628acb'
However, as soon as strings exceed the embedding threshold and are stored in the arena, it falls apart, with different strings returning the same hash. It seems that the hash includes only the string length, not the buffer pointer or the string contents:
>>> a = np.asarray(["This string is extremely long"], dtype="T")
>>> b = np.asarray(["This string is extremely long"], dtype="T")
>>> c = np.asarray(["This string is different long"], dtype="T")
>>> d = np.asarray(["This string is extremely long and longer"], dtype="T")
>>> hashlib.sha256(a).hexdigest()
'efe5a2fd86134b1643906cbd7ee3bc22c3e6fd7c3b16f895618b939a4190f556'
>>> hashlib.sha256(b).hexdigest()
'efe5a2fd86134b1643906cbd7ee3bc22c3e6fd7c3b16f895618b939a4190f556'
>>> hashlib.sha256(c).hexdigest()
'efe5a2fd86134b1643906cbd7ee3bc22c3e6fd7c3b16f895618b939a4190f556' # Hash collision!
>>> hashlib.sha256(d).hexdigest()
'4fbaa09460a3fba24f4a354960fc0a0c06cc70e7cd49b2d37e0a326161a09e33'
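This is consistent with the packed layout described in NEP 55: for non-embedded strings, the array buffer holds per-element metadata (roughly an arena offset plus a size), not the characters themselves. A minimal model of that idea (the field names and struct layout below are illustrative, not NumPy's actual definition):

import struct

# Toy model only -- NOT NumPy's actual struct layout.
# For non-embedded strings, the array buffer holds per-element metadata,
# sketched here as (arena_offset, size), rather than the characters.
def toy_entry(arena_offset, size):
    return struct.pack("<QQ", arena_offset, size)

# Two different 29-character strings, each plausibly sitting at the same
# offset of its own freshly created arena, yield identical metadata bytes
# -- and therefore identical digests.
assert toy_entry(0, 29) == toy_entry(0, 29)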
This can have devastating consequences. When writing unit tests, it is very easy to forget that StringDType changes its memory layout once a string passes the maximum length for embedding. As a result, it is easy to write an entire test suite that never exercises long strings, reach production, and only then get hash collisions.
Mitigation:
The NumPy documentation already warns about this pitfall: “Also note that unlike fixed-width strings and most other NumPy data types, StringDType does not store the string data in the “main” ndarray data buffer. Instead, the array buffer is used to store metadata about where the string data are stored in memory. This difference means that code expecting the array buffer to contain string data will not function correctly, and will need to be updated to support StringDType.”
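Until this is fixed, one workaround is to refuse the raw buffer for StringDType arrays and hash a canonical byte encoding of the contents instead. A minimal sketch (safe_digest is a hypothetical helper, not part of any library):

import hashlib
import numpy as np

def safe_digest(arr):
    # Hypothetical helper: StringDType arrays (dtype kind 'T') expose only
    # metadata through the buffer protocol, so hash an explicit,
    # length-prefixed UTF-8 encoding of the contents instead.
    if arr.dtype.kind == "T":
        h = hashlib.sha256()
        for s in arr.ravel():
            data = str(s).encode("utf-8")
            h.update(len(data).to_bytes(8, "little"))  # length prefix avoids ambiguity
            h.update(data)
        return h.hexdigest()
    return hashlib.sha256(np.ascontiguousarray(arr)).hexdigest()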
Expected behaviour:
hashlib.hash.digest should always raise when called on a StringDType array, just like bytes does.
However, this would be a breaking change for users who simply never have long strings in their data, or who sanitize their data first, so it should follow a deprecation cycle.
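For concreteness, the desired end state (the proposed behaviour after the deprecation cycle, not what currently happens) would mirror the bytes example above:
>>> hashlib.sha256(np.asarray(["foo"], dtype="T"))
ValueError: cannot include dtype 'numpy.dtypes.StringDType' in a buffer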
Python and NumPy Versions:
Reproduced on
- python 3.9, numpy 2.0.2
- python 3.13, numpy 2.3.0
Context for the issue:
versioned-hdf5 hashes chunks for the purpose of deduplication: if a chunk is updated to a value identical to a chunk that already exists on disk, it is not written again, and two identical chunks in memory result in a single chunk on disk.
I'm currently in the process of adding NpyStrings support.
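For illustration, deduplication schemes of this shape silently drop data when two distinct chunks hash equal (names here are illustrative, not the actual versioned-hdf5 API):

import hashlib

written = {}  # digest -> chunk stored on disk

def write_chunk(chunk_bytes):
    # Content-addressed dedup: a chunk is written only if its digest is new,
    # so a hash collision makes a distinct chunk silently alias an old one.
    digest = hashlib.sha256(chunk_bytes).hexdigest()
    if digest not in written:
        written[digest] = chunk_bytes
    return digest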