@@ -405,34 +405,63 @@ layout:
405
405
};
406
406
407
407
Where ``len`` is the length, in bytes, of the string and ``buf`` is a pointer to
408
- the beginning of a null-terminated UTF-8 encoded bytestream.
408
+ the beginning of a UTF-8 encoded bytestream containing the string data. We do
409
+ not append a trailing null character to the byte stream, so users attempting to
410
+ pass the ``buf`` field to an API expecting a C string must create a copy
411
+ with a trailing null. This choice also means that unlike the fixed-width strings
412
+ in NumPy, ``StringDType`` array entries can contain arbitrary embedded or
413
+ trailing null characters.
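
As a minimal sketch of the copy described above, the helper below builds a null-terminated C string from a length-plus-pointer pair. The struct ``example_static_string`` and the function ``example_to_cstring`` are hypothetical stand-ins written for illustration; they are not part of the NumPy C API, and the real struct would come from the NumPy headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the struct described above (assumption:
 * a byte length followed by a pointer to UTF-8 data). */
typedef struct {
    size_t len;       /* length of the string data in bytes */
    const char *buf;  /* UTF-8 data, NOT null-terminated */
} example_static_string;

/* Return a newly allocated, null-terminated copy of the string data,
 * or NULL if the allocation fails. The caller must free() the result.
 * Embedded nulls are copied verbatim, so C string functions will only
 * see the bytes before the first embedded null. */
static char *
example_to_cstring(const example_static_string *s)
{
    assert(s != NULL);
    if (s->len == SIZE_MAX) {
        return NULL;  /* len + 1 would overflow */
    }
    char *out = malloc(s->len + 1);
    if (out == NULL) {
        return NULL;
    }
    if (s->len > 0) {
        memcpy(out, s->buf, s->len);
    }
    out[s->len] = '\0';
    return out;
}
```

The copy costs one allocation per call, so code that hands many entries to a C-string API may prefer to reuse a single scratch buffer instead.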
409
414
410
415
We propose storing the string data for this data type in an external
411
416
heap-allocated arena buffer whose bookkeeping is managed by the ``StringDType``
412
417
instance associated with the array. Using a per-array arena allocator ensures
413
- that the string buffers for nearby array elements are nearby on the heap. We
414
- do allow ``NULL`` ``npy_static_string`` entries in the array buffer,
415
- representing either an empty string or a missing data sentinel, depending on the
416
- parameters of the ``StringDType`` instance associated with the array, so string
417
- data for array entries are not necessarily always adjacent on the heap.
418
-
419
- In addition to making a typedef for this struct public, we also plan to add an
420
- interface for allocating, copying, and freeing strings with this layout via the
421
- arena allocator to the public numpy C API to ease downstream integration.
422
-
423
- Each array element has an overhead of 17 bytes on a 64 bit architecture,
424
- including one byte for a NULL-terminating character in the string buffer. We
425
- could reduce the memory overhead by using an unsigned 32 bit int as the length
426
- instead of ``size_t``, since real-world downstream usages of object string
427
- arrays that need to support individual array elements longer than the maximum 32
428
- bit unsigned int are likely rare.
429
-
430
- Finally, in the future we may decide to exploit the small string optimization
431
- [6]_ to encode strings smaller than the size of the ``npy_static_string`` struct
418
+ that the string buffers for nearby array elements are usually nearby on the
419
+ heap. Neighboring array elements are not guaranteed to be contiguous on the
420
+ heap, because we need to support missing data and mutation of array entries; see
421
+ below for more discussion of how these topics affect the memory layout.
422
+
423
+ In addition to making a typedef for ``npy_static_string`` public, we also plan
424
+ to add an interface for allocating, copying, and freeing strings with this
425
+ layout via the arena allocator to the public NumPy C API to ease downstream
426
+ integration.
427
+
428
+ In the future we may decide to exploit the small string optimization [6]_ to
429
+ encode strings smaller than the size of the ``npy_static_string`` struct
432
430
directly in the array buffer, bypassing the need for a heap allocation for that
433
- entry. For arrays consisting entirely of small strings this bypasses the need to
434
- do any sidecar heap allocations. This should be relatively straightforward to
435
- add but has not been completed yet to focus on other aspects of the proposal.
431
+ entry. If this is implemented, we will reserve the most significant byte in the
432
+ ``len`` for flags, including a flag to indicate the array element is stored in
433
+ the array buffer. For arrays consisting entirely of small strings this will
434
+ bypass the need to do any sidecar heap allocations. This should be relatively
435
+ straightforward to add but has been deferred so that we can focus on other aspects
436
+ of the proposal. While reserving a whole byte for flags may be unnecessary, we
437
+ will still have 56 bits of space in the ``len`` field, which is much more than
438
+ is likely to be necessary to store the length of a single array element in
439
+ real-world use, and having 256 possibilities for flags gives us flexibility for
440
+ the future.
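
To make the proposed layout concrete, here is a rough sketch of packing a flag byte into the most significant byte of a 64 bit length field, which leaves the low 56 bits for the length itself. All names below, including the flag value ``EXAMPLE_FLAG_IN_ARRAY_BUFFER``, are hypothetical, and the sketch uses ``uint64_t`` on the assumption of a 64 bit ``size_t``; the actual encoding, if implemented, may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of the NumPy C API. */
#define EXAMPLE_FLAG_SHIFT 56
#define EXAMPLE_LEN_MASK ((UINT64_C(1) << EXAMPLE_FLAG_SHIFT) - 1)
#define EXAMPLE_FLAG_IN_ARRAY_BUFFER 0x01u  /* illustrative flag value */

/* Combine a flag byte and a length into a single 64 bit field. The
 * length must fit in the low 56 bits. */
static uint64_t
example_pack_len(uint8_t flags, uint64_t len)
{
    assert(len <= EXAMPLE_LEN_MASK);
    return ((uint64_t)flags << EXAMPLE_FLAG_SHIFT) | len;
}

/* Extract the flag byte from the most significant byte. */
static uint8_t
example_flags(uint64_t packed)
{
    return (uint8_t)(packed >> EXAMPLE_FLAG_SHIFT);
}

/* Extract the length from the low 56 bits. */
static uint64_t
example_len(uint64_t packed)
{
    return packed & EXAMPLE_LEN_MASK;
}
```

With this layout, reading the length costs one mask operation and checking a flag costs one shift and compare, so the bookkeeping overhead per access is small.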
441
+
442
+ Besides the string data itself, each array element requires 16 bytes of storage
443
+ for the ``npy_static_string`` instance in the array buffer. In principle we
444
+ could use a 32 bit integer to store the ``len`` field, saving 4 bytes per array
445
+ element, but even if we only use a single bit for the small string optimization
446
+ that will still leave us with an uncomfortably small 31 bits of space in the
447
+ ``len`` field. In addition, making use of the small string optimization will
448
+ somewhat offset the memory cost of a 64 bit ``len`` field, since many real-world
449
+ use-cases employ small strings.
450
+
451
+ Missing Data
452
+ ++++++++++++
453
+
454
+ By default, zeroed-out entries in the array buffer represent empty
455
+ strings. However, if the DType instance was created with an ``na_object`` field,
456
+ zeroed-out entries represent missing data. By making this choice, a zero-filled
457
+ newly allocated buffer returned by ``calloc`` does not need any additional
458
+ post-processing to produce an empty array. This choice also means casts between
459
+ different missing data representations are views.
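
The ``calloc`` point can be illustrated with a short sketch. The struct and function names below are hypothetical stand-ins, not the NumPy C API. The sketch also assumes that an all-bits-zero pointer compares equal to ``NULL``, which holds on mainstream platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the struct described in this section. */
typedef struct {
    size_t len;       /* length of the string data in bytes */
    const char *buf;  /* UTF-8 data, NOT null-terminated */
} example_static_string;

/* Allocate n zero-filled entries. No per-entry initialization is
 * needed: every entry already encodes an empty string (or missing
 * data, depending on how the DType instance was created). */
static example_static_string *
example_new_empty_array(size_t n)
{
    return calloc(n, sizeof(example_static_string));
}

/* Check whether an entry is in the zeroed-out state. */
static int
example_is_zeroed(const example_static_string *s)
{
    assert(s != NULL);
    return s->len == 0 && s->buf == NULL;
}
```

Because the zeroed state is the same bit pattern under both interpretations, only the DType instance decides whether an entry reads as an empty string or as missing data.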
460
+
461
+ Whether or not the ``na_object`` is set, empty strings are not stored in the
462
+ sidecar buffer since they require no additional storage besides the entry in the
463
+ array buffer itself. This means that adjacent entries in the sidecar buffer are
464
+ not necessarily adjacent entries in the array buffer.
436
465
437
466
Mutation and Thread Safety
438
467
++++++++++++++++++++++++++