

Commit bc323aa

NEP: respond to NEP 55 PR review comments
1 parent d043fe7 commit bc323aa


doc/neps/nep-0055-string_dtype.rst

Lines changed: 52 additions & 23 deletions
@@ -405,34 +405,63 @@ layout:
405405
};
406406
407407
Where ``len`` is the length, in bytes, of the string and ``buf`` is a pointer to
408-
the beginning of a null-terminated UTF-8 encoded bytestream.
408+
the beginning of a UTF-8 encoded bytestream containing the string data. We do
409+
not append a trailing null character to the byte stream, so users attempting to
410+
pass the ``buf`` field to an API expecting a C string must create a copy
411+
with a trailing null. This choice also means that unlike the fixed-width strings
412+
in NumPy, ``StringDType`` array entries can contain arbitrary embedded or
413+
trailing null characters.
409414

410415
We propose storing the string data for this data type in an external
411416
heap-allocated arena buffer whose bookkeeping is managed by the ``StringDType``
412417
instance associated with the array. Using a per-array arena allocator ensures
413-
that the string buffers for nearby array elements are nearby on the heap. We
414-
do allow ``NULL`` ``npy_static_string`` entries in the array buffer,
415-
representing either an empty string or a missing data sentinel, depending on the
416-
parameters of the ``StringDType`` instance associated with the array, so string
417-
data for array entries are not necessarily always adjacent on the heap.
418-
419-
In addition to making a typedef for this struct public, we also plan to add an
420-
interface for allocating, copying, and freeing strings with this layout via the
421-
arena allocator to the public numpy C API to ease downstream integration.
422-
423-
Each array element has an overhead of 17 bytes on a 64 bit architecture,
424-
including one byte for a NULL-terminating character in the string buffer. We
425-
could reduce the memory overhead by using an unsigned 32 bit int as the length
426-
instead of ``size_t``, since real-world downstream usages of object string
427-
arrays that need to support individual array elements longer than the maximum 32
428-
bit unsigned int are likely rare.
429-
430-
Finally, in the future we may decide to exploit the small string optimization
431-
[6]_ to encode strings smaller than the size of the ``npy_static_string`` struct
418+
that the string buffers for nearby array elements are usually nearby on the
419+
heap. We do not guarantee that neighboring array elements are contiguous on the
420+
heap to support missing data and allow mutation of array entries; see below for
421+
more discussion on how these topics affect the memory layout.
422+
423+
In addition to making a typedef for ``npy_static_string`` public, we also plan
424+
to add an interface for allocating, copying, and freeing strings with this
425+
layout via the arena allocator to the public numpy C API to ease downstream
426+
integration.
427+
428+
In the future we may decide to exploit the small string optimization [6]_ to
429+
encode strings smaller than the size of the ``npy_static_string`` struct
432430
directly in the array buffer, bypassing the need for a heap allocation for that
433-
entry. For arrays consisting entirely of small strings this bypasses the need to
434-
do any sidecar heap allocations. This should be relatively straightforward to
435-
add but has not been completed yet to focus on other aspects of the proposal.
431+
entry. If this is implemented, we will reserve the most significant byte in the
432+
``len`` for flags, including a flag to indicate the array element is stored in
433+
the array buffer. For arrays consisting entirely of small strings this will
434+
bypass the need to do any sidecar heap allocations. This should be relatively
435+
straightforward to add but has not been completed yet to focus on other aspects
436+
of the proposal. While reserving a whole byte for flags may be unnecessary, we
437+
will still have 56 bits of space in the ``len`` field on a 64 bit architecture, which is much more than
438+
is likely to be necessary to store the length of a single array element in
439+
real-world use, and having 256 possibilities for flags gives us flexibility for
440+
the future.
441+
442+
Besides the string data itself, each array element requires 16 bytes of storage
443+
for the ``npy_static_string`` instance in the array buffer. In principle we
444+
could use a 32 bit integer to store the ``len`` field, saving 4 bytes per array
445+
element, but if we only use a single bit for the small string optimization
446+
that will still leave us with an uncomfortably small 7 bits of space in the
447+
``len`` field. In addition, making use of the small string optimization will
448+
somewhat offset the memory cost of a 64 bit ``len`` field, since many real-world
449+
use-cases employ small strings.
450+
451+
Missing Data
452+
++++++++++++
453+
454+
By default, zeroed out entries in the array buffer represent empty
455+
strings. However, if the DType instance was created with an ``na_object`` field,
456+
zeroed-out entries represent missing data. By making this choice, a zero-filled
457+
newly allocated buffer returned by ``calloc`` does not need any additional
458+
post-processing to produce an empty array. This choice also means casts between
459+
different missing data representations are views.
460+
461+
Whether or not the ``na_object`` is set, empty strings are not stored in the
462+
sidecar buffer since they require no additional storage besides the entry in the
463+
array buffer itself. This means that adjacent entries in the sidecar buffer are
464+
not necessarily adjacent entries in the array buffer.
436465

437466
Mutation and Thread Safety
438467
++++++++++++++++++++++++++
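
Note on the added text about trailing nulls: since ``buf`` is not null-terminated, code that hands the data to a C-string API has to copy it first. A minimal sketch of such a copy follows. The struct ``example_static_string`` and the helper ``example_to_c_string`` are hypothetical stand-ins that mirror the ``len``/``buf`` layout described in the diff; they are not part of the NumPy C API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in mirroring the layout described in the NEP text.
 * The real npy_static_string definition lives in NumPy and may differ. */
typedef struct example_static_string {
    size_t len;      /* length of the string data in bytes */
    const char *buf; /* UTF-8 bytes, NOT null-terminated */
} example_static_string;

/* Hypothetical helper: return a newly allocated, null-terminated copy of
 * the string data. The caller owns the result and must free() it.
 * Returns NULL if the allocation fails or len + 1 would overflow. */
char *
example_to_c_string(const example_static_string *s)
{
    assert(s != NULL);
    if (s->len == SIZE_MAX) {
        return NULL; /* len + 1 would wrap around */
    }
    char *out = malloc(s->len + 1);
    if (out == NULL) {
        return NULL;
    }
    if (s->len > 0) {
        memcpy(out, s->buf, s->len);
    }
    out[s->len] = '\0'; /* the terminator the array entry does not store */
    return out;
}
```

Because entries may contain embedded nulls, ``strlen`` on the copy can report fewer bytes than ``len``; callers that care about the full contents should keep ``len`` alongside the copy.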

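
Note on the added "Missing Data" text: the claim that a ``calloc``-allocated buffer needs no post-processing can be illustrated with a short sketch. All names here (``example_static_string``, ``example_alloc_entries``, ``example_classify``, the ``has_na_object`` flag) are hypothetical and only model the behavior described in the diff. The sketch also assumes that an all-zero-bits pointer compares equal to ``NULL``, which holds on mainstream platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the len/buf layout described in the NEP text. */
typedef struct example_static_string {
    size_t len;
    const char *buf;
} example_static_string;

typedef enum {
    EXAMPLE_EMPTY,   /* zeroed entry, no na_object: empty string */
    EXAMPLE_MISSING, /* zeroed entry, na_object set: missing data */
    EXAMPLE_OTHER    /* any non-zeroed entry; not modeled further here */
} example_kind;

/* Allocate n zero-filled entries. No per-entry initialization is needed:
 * every entry already reads as empty (or missing, if an na_object is set). */
example_static_string *
example_alloc_entries(size_t n)
{
    return calloc(n, sizeof(example_static_string));
}

/* Interpret one entry. has_na_object is a hypothetical flag standing in
 * for "the DType instance was created with an na_object". */
example_kind
example_classify(const example_static_string *s, int has_na_object)
{
    assert(s != NULL);
    if (s->len == 0 && s->buf == NULL) {
        return has_na_object ? EXAMPLE_MISSING : EXAMPLE_EMPTY;
    }
    return EXAMPLE_OTHER;
}
```

The same zeroed bytes are read as "empty" or "missing" depending only on the DType parameters, which is consistent with the diff's statement that casts between missing data representations are views.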