| 1 | +# Quiescent-State Based Reclamation (QSBR) |
| 2 | + |
| 3 | +## Introduction |
| 4 | + |
| 5 | +When implementing lock-free data structures, a key challenge is determining |
| 6 | +when it is safe to free memory that has been logically removed from a |
| 7 | +structure. Freeing memory too early can lead to use-after-free bugs if another |
| 8 | +thread is still accessing it. Freeing it too late results in excessive memory |
| 9 | +consumption. |
| 10 | + |
| 11 | +Safe memory reclamation (SMR) schemes address this by delaying the free |
| 12 | +operation until all concurrent read accesses are guaranteed to have completed. |
| 13 | +Quiescent-State Based Reclamation (QSBR) is an SMR scheme used in Python's |
| 14 | +free-threaded build to manage the lifecycle of shared memory. |
| 15 | + |
| 16 | +QSBR requires threads to periodically report that they are in a quiescent |
| 17 | +state. A thread is in a quiescent state if it holds no references to shared |
| 18 | +objects that might be reclaimed. Think of it as a checkpoint where a thread |
| 19 | +signals, "I am not in the middle of any operation that relies on a shared |
| 20 | +resource." In Python, the `eval_breaker` provides a natural and convenient place |
| 21 | +for threads to report this state. |
| 22 | + |
| 23 | + |
| 24 | +## Use in Free-Threaded Python |
| 25 | + |
| 26 | +While CPython's memory management is dominated by reference counting and a |
| 27 | +tracing garbage collector, these mechanisms are not suitable for all data |
| 28 | +structures. For example, the backing array of a list object is not individually |
| 29 | +reference-counted but may have a shorter lifetime than the `PyListObject` that |
| 30 | +contains it. We could delay reclamation until the next GC run, but we want |
| 31 | +reclamation to be prompt and to run the GC less frequently in the free-threaded |
| 32 | +build, as it requires pausing all threads. |
| 33 | + |
| 34 | +Many operations in the free-threaded build are protected by locks. However, for |
| 35 | +performance-critical code, we want to allow reads to happen concurrently with |
| 36 | +updates. For instance, we want to avoid locking during most list read accesses. |
| 37 | +If a list is resized while another thread is reading it, QSBR provides the |
| 38 | +mechanism to determine when it is safe to free the list's old backing array. |
| 39 | + |
| 40 | +Specific use cases for QSBR include: |
| 41 | + |
| 42 | +* Dictionary keys (`PyDictKeysObject`) and list arrays (`_PyListArray`): When a |
| 43 | +dictionary or list that may be shared between threads is resized, we use QSBR |
| 44 | +to delay freeing the old keys or array until it's safe. For dicts and lists |
| 45 | +that are not shared, their storage can be freed immediately upon resize. |
| 46 | + |
| 47 | +* Mimalloc `mi_page_t`: Non-locking dictionary and list accesses require |
| 48 | +cooperation from the memory allocator. If an object is freed and its memory is |
| 49 | +reused, we must ensure the new object's reference count field is at the same |
| 50 | +memory location. In practice, this means when a mimalloc page (`mi_page_t`) |
| 51 | +becomes empty, we don't immediately allow it to be reused for allocations of a |
| 52 | +different size class. QSBR is used to determine when it's safe to repurpose the |
| 53 | +page or return its memory to the OS. |
| 54 | + |
| 55 | + |
| 56 | +## Implementation Details |
| 57 | + |
| 58 | + |
| 59 | +### Core Implementation |
| 60 | + |
| 61 | +The proposal to add QSBR to Python is contained in |
| 62 | +[GitHub issue 115103](https://github.com/python/cpython/issues/115103). |
| 63 | +Many details of that proposal have been copied here, so they can be kept |
| 64 | +up-to-date with the actual implementation. |
| 65 | + |
| 66 | +Python's QSBR implementation is based on FreeBSD's "Global Unbounded |
| 67 | +Sequences" [^1][^2][^3]. It relies on a few key counters: |
| 68 | + |
| 69 | +* Global Write Sequence (`wr_seq`): A per-interpreter counter that starts |
| 70 | +at 1 and is incremented by 2 each time it is advanced. This ensures its value is |
| 71 | +always odd, which can be used to distinguish it from other state values. When |
| 72 | +an object needs to be reclaimed, `wr_seq` is advanced, and the object is tagged |
| 73 | +with this new sequence number. |
| 74 | + |
| 75 | +* Per-Thread Read Sequence: Each thread has a local read sequence counter. When |
| 76 | +a thread reaches a quiescent state (e.g., at the `eval_breaker`), it copies the |
| 77 | +current global `wr_seq` to its local counter. |
| 78 | + |
| 79 | +* Global Read Sequence (`rd_seq`): This per-interpreter value stores the minimum |
| 80 | +of all per-thread read sequence counters (excluding detached threads). It is |
| 81 | +updated by a "polling" operation. |
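The relationship between the three counters can be sketched with a small Python model. This is an illustration only, not the C implementation: the class and method names (`Shared`, `ThreadState`, `advance`, `quiescent_state`, `poll_scan`) and the `attached` flag are hypothetical, and the sketch assumes that `rd_seq` never moves backwards and that it falls back to `wr_seq` when no thread is attached.

```python
from dataclasses import dataclass, field

QSBR_INITIAL = 1  # wr_seq starts at 1 (from the text above)
QSBR_INCR = 2     # and advances by 2, so it always stays odd


@dataclass
class ThreadState:
    """Hypothetical per-thread record holding the local read sequence."""
    seq: int = QSBR_INITIAL
    attached: bool = True  # assumption: detached threads are skipped by the scan


@dataclass
class Shared:
    """Hypothetical per-interpreter state holding the two global counters."""
    wr_seq: int = QSBR_INITIAL
    rd_seq: int = QSBR_INITIAL
    threads: list = field(default_factory=list)

    def advance(self) -> int:
        """Advance the global write sequence and return the new value."""
        self.wr_seq += QSBR_INCR
        return self.wr_seq

    def quiescent_state(self, thread: ThreadState) -> None:
        """A thread at a quiescent state copies wr_seq to its local counter."""
        thread.seq = self.wr_seq

    def poll_scan(self) -> int:
        """Recompute rd_seq as the minimum over all attached threads."""
        min_seq = self.wr_seq
        for t in self.threads:
            if t.attached and t.seq < min_seq:
                min_seq = t.seq
        # Assumption of this sketch: rd_seq only ever moves forward.
        self.rd_seq = max(self.rd_seq, min_seq)
        return self.rd_seq


demo = Shared()
t1, t2 = ThreadState(), ThreadState()
demo.threads.extend([t1, t2])
demo.advance()             # wr_seq becomes 3
demo.quiescent_state(t1)   # t1 has observed 3, t2 is still at 1
print(demo.poll_scan())    # 1: t2 has not yet reported a quiescent state
demo.quiescent_state(t2)
print(demo.poll_scan())    # 3: every attached thread has caught up
```

The scan is linear in the number of threads, which is why `rd_seq` is cached globally rather than recomputed on every check.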
| 82 | + |
| 83 | +To free an object, the following steps are taken: |
| 84 | + |
| 85 | +1. Advance the global `wr_seq`. |
| 86 | + |
| 87 | +2. Add the object's pointer to a deferred-free list, tagging it with the new |
| 88 | +   `wr_seq` value as its `qsbr_goal`. |
| 89 | + |
| 90 | +Periodically, a polling mechanism processes this deferred-free list: |
| 91 | + |
| 92 | +1. The minimum read sequence value across all active threads is calculated and |
| 93 | + stored as the global `rd_seq`. |
| 94 | + |
| 95 | +2. For each item on the deferred-free list, if its `qsbr_goal` is less than or |
| 96 | +   equal to the new `rd_seq`, its memory is freed, and it is removed from the |
| 97 | +   list. Otherwise, it remains on the list for a future attempt. |
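The free and poll steps above can be simulated in a few lines of Python. This is a sketch under stated assumptions rather than the actual C code: `QsbrModel`, `free_delayed`, `poll`, and the `freed` list are hypothetical names, threads are represented only by their read sequence values, and "freeing" an object just records it in a list.

```python
from collections import deque


class QsbrModel:
    """Toy model of the deferred-free list; all names are hypothetical."""

    def __init__(self, n_threads: int) -> None:
        self.wr_seq = 1                     # global write sequence
        self.rd_seq = 1                     # cached minimum of thread sequences
        self.thread_seq = [1] * n_threads   # per-thread read sequences
        self.deferred = deque()             # (qsbr_goal, obj) pairs
        self.freed = []                     # stands in for actually freeing memory

    def quiescent_state(self, tid: int) -> None:
        """Thread `tid` reports a quiescent state."""
        self.thread_seq[tid] = self.wr_seq

    def free_delayed(self, obj) -> None:
        """Step 1: advance wr_seq. Step 2: queue obj tagged with its goal."""
        self.wr_seq += 2
        self.deferred.append((self.wr_seq, obj))

    def poll(self) -> None:
        """Recompute rd_seq, then free every item whose goal has been reached."""
        lowest = min(self.thread_seq, default=self.wr_seq)
        self.rd_seq = max(self.rd_seq, lowest)
        remaining = deque()
        for goal, obj in self.deferred:
            if goal <= self.rd_seq:
                self.freed.append(obj)
            else:
                remaining.append((goal, obj))  # retry on a later poll
        self.deferred = remaining


model = QsbrModel(n_threads=2)
model.free_delayed("old list array")   # tagged with qsbr_goal 3
model.poll()
print(model.freed)                     # []: neither thread has reported yet
model.quiescent_state(0)
model.quiescent_state(1)
model.poll()
print(model.freed)                     # ['old list array']
```

An item stays queued for as long as any thread's read sequence is below its goal, so a single thread that never reaches a quiescent state holds back reclamation for everything queued after its last report.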
| 98 | + |
| 99 | + |
| 100 | +### Deferred Advance Optimization |
| 101 | + |
| 102 | +To reduce memory contention from frequent updates to the global `wr_seq`, its |
| 103 | +advancement is sometimes deferred. Instead of incrementing `wr_seq` on every |
| 104 | +reclamation request, each thread tracks its number of deferrals locally. Once |
| 105 | +the deferral count reaches a limit (`QSBR_DEFERRED_LIMIT`, currently 10), the |
| 106 | +thread advances the global `wr_seq` and resets its local count. |
| 107 | + |
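| 108 | +When an object is added to the deferred-free list, its `qsbr_goal` is set to |
| 109 | +`wr_seq` + 2. No thread can report that value until `wr_seq` is actually advanced, |
| 110 | +so deferring the global counter advancement remains safe. This optimization improves runtime |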
| 110 | +to defer the global counter advancement. This optimization improves runtime |
| 111 | +speed but may increase peak memory usage by slightly delaying when memory can |
| 112 | +be reclaimed. |
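A minimal sketch of the deferral logic, assuming the behaviour described above: the `Shared` and `Thread` classes and the `deferred_advance` method name are hypothetical stand-ins, and only the counter arithmetic is modelled.

```python
QSBR_DEFERRED_LIMIT = 10  # value quoted in the text above
QSBR_INCR = 2


class Shared:
    """Hypothetical holder for the global write sequence."""

    def __init__(self) -> None:
        self.wr_seq = 1


class Thread:
    """Hypothetical per-thread state with a local deferral counter."""

    def __init__(self, shared: Shared) -> None:
        self.shared = shared
        self.deferrals = 0

    def deferred_advance(self) -> int:
        """Return the qsbr_goal for a newly queued object."""
        self.deferrals += 1
        if self.deferrals < QSBR_DEFERRED_LIMIT:
            # Below the limit: no write to the shared counter. The goal is
            # the value wr_seq will take on its next advance.
            return self.shared.wr_seq + QSBR_INCR
        # Limit reached: advance the shared counter and reset the local count.
        self.deferrals = 0
        self.shared.wr_seq += QSBR_INCR
        return self.shared.wr_seq


shared = Shared()
thread = Thread(shared)
goals = [thread.deferred_advance() for _ in range(QSBR_DEFERRED_LIMIT)]
print(set(goals), shared.wr_seq)  # {3} 3: ten requests, one shared write
```

In this model all ten requests receive the same goal, so they become reclaimable together once every thread has observed the advanced `wr_seq`. That batching is the source of both the reduced contention and the slightly delayed reclamation.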
| 113 | + |
| 114 | + |
| 115 | +## Limitations |
| 116 | + |
| 117 | +Computing `rd_seq` requires scanning over all thread states. This operation |
| 118 | +could become a bottleneck in applications with a very large number of threads |
| 119 | +(e.g., >1,000). Future work may address this with more advanced mechanisms, |
| 120 | +such as a tree-based structure or incremental scanning. For now, the |
| 121 | +implementation prioritizes simplicity, with plans for refinement if |
| 122 | +multi-threaded benchmarks reveal performance issues. |
| 123 | + |
| 124 | + |
| 125 | +## References |
| 126 | + |
| 127 | +[^1]: https://youtu.be/ZXUIFj4nRjk?t=694 |
| 128 | +[^2]: https://people.kernel.org/joelfernandes/gus-vs-rcu |
| 129 | +[^3]: http://bxr.su/FreeBSD/sys/kern/subr_smr.c#44 |