feat: implement zero-copy return from workers via shm #75
Signed-off-by: kitsuyaazuma <[email protected]>
Pull Request Overview
This PR adds a zero-copy shared-memory IPC mode by pre-allocating shared-memory buffers for worker results and introducing utilities to move, replace, and reconstruct tensors without pickling.
- Introduce `SHMHandle`, `process_tensors_in_object`, and `reconstruct_from_shared_memory` in utils.
- Refactor `ProcessPoolClientTrainer` to prepare per-client shared-memory buffers and use the new utilities in `local_process` and `worker`.
- Update tests and the FedAvg trainer to exercise the zero-copy return path.
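As a rough illustration of the handle mechanism these utilities implement, here is a minimal stdlib sketch. `SHMHandle` is the PR's real name, but its fields and the helper functions below are assumptions for illustration, not the library's actual API:

```python
# Minimal sketch of the handle idea behind SHMHandle: a small, picklable
# record of where a payload's bytes live in shared memory, so only the
# handle (not the data) is pickled across the process boundary.
# The fields and helper names here are assumptions for illustration.
from dataclasses import dataclass
from multiprocessing import shared_memory

@dataclass
class SHMHandle:
    shm_name: str  # name of the SharedMemory segment
    offset: int    # byte offset of the payload within the segment
    nbytes: int    # payload size in bytes

def move_to_shm(data: bytes, shm: shared_memory.SharedMemory, offset: int) -> SHMHandle:
    """Write raw bytes into a pre-allocated segment and return a handle."""
    shm.buf[offset:offset + len(data)] = data
    return SHMHandle(shm.name, offset, len(data))

def reconstruct(handle: SHMHandle) -> bytes:
    """Reattach to the segment by name and read the payload back."""
    shm = shared_memory.SharedMemory(name=handle.shm_name)
    out = bytes(shm.buf[handle.offset:handle.offset + handle.nbytes])
    shm.close()
    return out

seg = shared_memory.SharedMemory(create=True, size=1024)
handle = move_to_shm(b"model weights", seg, offset=0)
assert reconstruct(handle) == b"model weights"
seg.close()
seg.unlink()
```

The real utilities traverse tensors inside arbitrary result objects; the sketch only shows the round trip for a single byte payload.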
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/test_core/test_client_trainer.py | Add tensor field, import SHMHandle, and implement buffer prep |
| src/blazefl/core/utils.pyi | Define SHMHandle, process_tensors_in_object, and reconstruction |
| src/blazefl/core/utils.py | Implement tensor traversal, replace, and reconstruction utilities |
| src/blazefl/core/client_trainer.pyi | Update worker signature, add prepare_uplink_package_buffer |
| src/blazefl/core/client_trainer.py | Use new utils to move/replace tensors and reconstruct results |
| src/blazefl/core/__init__.py[.pyi] | Update exports to include new utilities |
| src/blazefl/contrib/fedavg.py | Extend FedAvg trainer for shared-memory uplink packages |
Comments suppressed due to low confidence (3)
src/blazefl/core/utils.py:21
- The default `max_depth` is set to 1, but the docstring describes a default of 10. Align the code and documentation by either updating the default to 10 or correcting the docstring.
`obj: T, mode: Literal["move", "replace"], max_depth: int = 1`
src/blazefl/core/utils.py:41
- This line describes a default of 10 for `max_depth`, but the function signature uses 1. Please keep these in sync.
`max_depth: The maximum recursion depth. Defaults to 10.`
src/blazefl/core/utils.py:20
- [nitpick] Consider adding unit tests for `process_tensors_in_object` and `reconstruct_from_shared_memory` to verify correct round-trip behavior, handle nested structures, and test the shared-memory paths.
`def process_tensors_in_object(`
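The `max_depth` comments above concern a recursion-limited traversal of nested result objects. A rough sketch of that pattern (the function name and semantics here are hypothetical, not the library's actual code; leaves stand in for tensors):

```python
# Hypothetical sketch of a depth-limited traversal like the one bounded
# by max_depth in process_tensors_in_object. collect_leaves is a made-up
# name; it gathers non-container leaves (the role tensors play in the
# real utility), descending at most max_depth container levels.
from typing import Any

def collect_leaves(obj: Any, max_depth: int = 10, _depth: int = 0) -> list[Any]:
    if isinstance(obj, dict) and _depth < max_depth:
        return [x for v in obj.values() for x in collect_leaves(v, max_depth, _depth + 1)]
    if isinstance(obj, (list, tuple)) and _depth < max_depth:
        return [x for v in obj for x in collect_leaves(v, max_depth, _depth + 1)]
    if not isinstance(obj, (dict, list, tuple)):
        return [obj]
    return []  # container deeper than max_depth: left untouched

nested = {"a": [1, 2], "b": {"c": 3}}
print(collect_leaves(nested))               # -> [1, 2, 3]
print(collect_leaves(nested, max_depth=1))  # -> [] (nested containers not entered)
```

With a default of 1 rather than 10, only top-level values would be visited, which is why the mismatch Copilot flags could silently skip tensors in nested packages.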
WHAT
This PR introduces a "zero-copy" `shared_memory` IPC mode to significantly improve performance in multi-process training. It refactors the `ProcessPoolClientTrainer` to pre-allocate shared-memory buffers for worker results, eliminating serialization overhead for the return trip from workers to the parent process. A new utility, `process_tensors_in_object`, is introduced to handle both moving tensors to shared memory and creating lightweight "handle" packages.

WHY

Profiling revealed that even with shared memory for the parent-to-worker data path, pickling the `UplinkPackage` for the return trip was a major performance bottleneck. This change addresses that bottleneck directly by avoiding tensor serialization on the return path, which substantially reduces round-trip time and improves the overall throughput and scalability of the federated learning process.
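As a rough stdlib illustration of the pre-allocation idea described above (all names and the sizing logic here are hypothetical, not BlazeFL's actual API):

```python
# Sketch of per-client buffer pre-allocation, assuming the parent can
# size each client's uplink payload in advance. The parent creates one
# named shared-memory segment per client; a worker writes its result
# into its segment, and only a tiny handle crosses the process pool.
import pickle
from multiprocessing import shared_memory

def prepare_buffers(num_clients: int, payload_nbytes: int) -> dict[int, shared_memory.SharedMemory]:
    """Pre-allocate one named shared-memory segment per client."""
    return {
        cid: shared_memory.SharedMemory(create=True, size=payload_nbytes)
        for cid in range(num_clients)
    }

buffers = prepare_buffers(num_clients=3, payload_nbytes=1_000_000)

# The "handle" a worker would return is small and constant-size,
# whereas pickling the payload itself scales with the tensor data.
handle = {"shm_name": buffers[0].name, "nbytes": 1_000_000}
assert len(pickle.dumps(handle)) < 200

for shm in buffers.values():
    shm.close()
    shm.unlink()
```

This is only the mechanics of the buffer lifecycle; the PR's `prepare_uplink_package_buffer` presumably derives sizes from the model's tensors and manages cleanup inside the trainer.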