Releases: ai-dynamo/nixl

v1.2.0

30 May 01:23
5c80115

1.2.0

Summary

The NVIDIA® NIXL Release 1.2.0 adds OS-assigned port support to the metadata listener, so multi-peer agents can bind to port 0 and discover the kernel-chosen port at runtime instead of statically reserving one. It also tightens the Libfabric backend's EFA write path with FI_MORE-based descriptor batching pinned to a single endpoint per rail. The Libfabric change groups up to 16 consecutive write descriptors before flushing the doorbell to the device, reducing PCIe round trips for small-message, high-descriptor-count transfers while keeping the standard round-robin read path so reads do not regress.

NIXL 1.2.0 also tightens the build and packaging story for downstream consumers. UCX now configures UCX_MAX_HCA_PER_GPU=auto on UCX >= 1.21, NIXL gains a nixl_cuda_arch_list meson option to compile against a user-selected SM list instead of the full datacenter default sweep, and liburing is sourced from a meson wrap pinned to WrapDB liburing_2.14-1 so the POSIX backend's io_uring support is available out of the box from a source build and from every shipping container. The Rust bindings stop silently swallowing backend registration failures and now surface the C API status from register_memory directly to the caller.

Major Features

  • OS-Assigned Port Support for Metadata Listener: Added support for binding the metadata listener to port 0 so the OS picks an available port; the bound port is retrieved via getsockname() and emitted on NIXL_INFO. The Python multi-peer test helpers in .gitlab/test_python.sh now exercise this path so concurrent examples no longer collide on a hard-coded port. (#1439)
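
The port-0 pattern described above can be sketched with the standard socket API. This is a minimal illustration in Python, not the NIXL listener code; the helper name bind_os_assigned_port is hypothetical.

```python
import socket
from typing import Tuple


def bind_os_assigned_port(host: str = "127.0.0.1") -> Tuple[socket.socket, int]:
    """Bind a listening TCP socket to port 0 and return it with the port the OS chose."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, 0))  # port 0 asks the kernel to pick a free port
        sock.listen()
        port = sock.getsockname()[1]  # read back the port that was actually bound
    except OSError:
        sock.close()
        raise
    return sock, port
```

Two listeners created this way on the same host receive different ports, which is why concurrent examples no longer need a hard-coded port. The caller still has to publish the returned port to its peers.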

API Changes

  • [Core] Metadata Listener Port Type Migration: Upgraded the default listener-port variables and structure fields (listenPort, listen_port) in src/api/cpp/nixl_params.h, src/api/cpp/nixl_types.h, and src/utils/stream/metadata_stream.{h,cpp} from int to uint16_t to match standard socket definitions. The Rust bindings (src/bindings/rust/wrapper.{h,cpp}, src/bindings/rust/src/agent.rs) gain a DEFAULT_COMM_PORT constant mirroring the C++ default_comm_port. (#1439)

Enhancements

Performance

  • [Libfabric] EFA Doorbell Batching via FI_MORE: libfabric_backend.cpp now batches up to 16 consecutive write descriptors on the same EP-pinned rail and submits them through fi_writemsg() with the FI_MORE flag, draining via a flushing fi_writemsg() on batch close. A stable per-transfer base_offset is reserved once in postXfer() (instead of per-descriptor) so every descriptor in a transfer sees the same rail assignment. Delivers a 30%–58% write bandwidth improvement for small-message high-descriptor-count transfers; the read path keeps the existing round-robin layout to avoid regressing reads. (#1626)
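
The batching pattern can be modelled without any Libfabric calls. The sketch below is an assumption-laden illustration of the flag pattern only: every descriptor except the one that closes a batch carries a "more" hint, and the closing post stands in for the flush. The names plan_write_batches and count_doorbells are hypothetical and do not appear in libfabric_backend.cpp.

```python
from typing import List, Tuple

MAX_BATCH = 16  # batch limit quoted in the release note


def plan_write_batches(num_descs: int, max_batch: int = MAX_BATCH) -> List[Tuple[int, bool]]:
    """Return (descriptor_index, more_flag) pairs for one transfer.

    more_flag=True models a post that sets FI_MORE (more work follows);
    more_flag=False models the post that closes the batch and flushes.
    """
    plan = []
    for i in range(num_descs):
        closes_batch = (i % max_batch == max_batch - 1) or (i == num_descs - 1)
        plan.append((i, not closes_batch))
    return plan


def count_doorbells(plan: List[Tuple[int, bool]]) -> int:
    """Count the posts that flush, i.e. those without the more flag."""
    return sum(1 for _, more in plan if not more)
```

Under this model a 40-descriptor transfer flushes three times (16 + 16 + 8) rather than 40 times, which is the effect the release note attributes the bandwidth gain to.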

Networking & Backend

  • [UCX] Auto-Selected MAX_HCA_PER_GPU on UCX >= 1.21: src/plugins/ucx/ucx_utils.cpp now sets UCX_MAX_HCA_PER_GPU=auto when the linked UCX is >= 1.21, letting UCX pick the right HCA-to-GPU mapping on modern multi-HCA hosts instead of relying on the historical default. (#1637)
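
A version gate of this kind is easy to get wrong with string comparison, since "1.9" sorts after "1.21" as text. The sketch below shows the tuple-based comparison in Python. It is not the code in ucx_utils.cpp: the function names are hypothetical, and skipping the override when the user already set the variable is an assumption the release note does not confirm.

```python
from typing import Dict, Mapping, Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Turn '1.21.0' into (1, 21) so versions compare numerically."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def ucx_overrides(ucx_version: str, user_env: Mapping[str, str]) -> Dict[str, str]:
    """Return the settings to apply for the given UCX version."""
    overrides: Dict[str, str] = {}
    # Assumption: an explicit user setting wins over the automatic default.
    if parse_version(ucx_version) >= (1, 21) and "UCX_MAX_HCA_PER_GPU" not in user_env:
        overrides["UCX_MAX_HCA_PER_GPU"] = "auto"
    return overrides
```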

Packaging & Distribution

  • Configurable CUDA Target Selection (nixl_cuda_arch_list): Added a nixl_cuda_arch_list meson option (meson.build, meson_options.txt, contrib/build-wheel.sh) defaulting to sm_80, sm_86, sm_89, sm_90, sm_100, sm_103, sm_120 for full datacenter coverage. Users compiling for a single architecture can pass e.g. -Dnixl_cuda_arch_list=90,100 for materially faster builds; when nixl_ep is enabled, the sm_8x entries are dropped automatically since the EP example only supports newer architectures. (#1639)
  • liburing via meson Wrap: Replaced the per-container liburing install paths (apt liburing-dev on Ubuntu, git clone + make on manylinux, git clone + make in nixlbench builder, .gitlab/build.sh) with a single meson wrap pinned to WrapDB liburing_2.14-1. POSIX backend io_uring support is now always available when building from source and identical across contrib/Dockerfile, contrib/Dockerfile.manylinux, and nixlbench/contrib/Dockerfile. ATTRIBUTIONS-CPP.md bumped to liburing 2.14. (#1577)
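
The arch-list behaviour described for nixl_cuda_arch_list can be sketched as a small resolver. This is an illustration in Python, not the meson logic. The function name is hypothetical, the comma-separated numeric format follows the -Dnixl_cuda_arch_list=90,100 example, and applying the sm_8x filter to user-supplied lists as well as the default is an assumption.

```python
from typing import List

# Default targets listed in the release note.
DEFAULT_ARCHS = ["80", "86", "89", "90", "100", "103", "120"]


def resolve_arch_list(option: str = "", nixl_ep: bool = False) -> List[str]:
    """Expand an option string such as '90,100' into sm_* target names."""
    archs = [a.strip() for a in option.split(",") if a.strip()] or list(DEFAULT_ARCHS)
    if nixl_ep:
        # Drop the sm_8x entries (80, 86, 89) when the EP example is enabled.
        archs = [a for a in archs if not (len(a) == 2 and a.startswith("8"))]
    return [f"sm_{a}" for a in archs]
```

With no option the resolver returns the seven default targets; with "90,100" it returns only sm_90 and sm_100, which is where the shorter build time comes from.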

Bugfixes

  • [Rust] Surface register_memory Errors: src/bindings/rust/src/agent.rs now returns the C API status from register_memory instead of constructing a RegistrationHandle after a failed registration; the Rust integration tests in src/bindings/rust/tests/tests.rs are updated to create a backend and pass opt args so backend registration failures are exercised and no longer silently ignored. (#1632)

Known Issues

Full Changelog: 1.1.0...1.2.0

v1.1.0

12 May 06:06
05e4243

1.1.0

Summary

The NVIDIA® NIXL Release 1.1.0 introduces a high-throughput dispatch/combine kernel path for the NIXL-EP example program and modernizes the telemetry data model for production-scale deployments. A new dedicated nixl_ep_ht.cu kernel set ships alongside the renamed low-latency nixl_ep_ll.cu, paired with a VMM-based device memory allocator and elastic-scaling fixes covering destruction flows, non-consecutive rank topologies, and signaling-buffer corruption during scale-up, while the core plugin manager now defers backend loading until first use to reduce agent startup cost. Telemetry consumers must adopt two breaking changes -- the public event signature drops its timestamp field, and the Prometheus exporter migrates agent_xfer_time / agent_xfer_post_time from Gauge to Counter, suffixes counters with _total, and removes the per-backend metric category in favor of standardized transfer, performance, and memory categories. Downstream consumers should also switch their requirements.txt from nixl[cu12] / nixl[cu13] to plain nixl -- the meta wheel now bundles both CUDA backends and auto-selects at runtime based on torch.version.cuda (see [API Changes]).

NIXL 1.1.0 also delivers significant networking, storage, and backend improvements. The Libfabric backend extends NUMA-aware rail selection to additional EC2 instance topologies, adds completion-queue locking for FI_THREAD_COMPLETION semantics, and resolves multi-GPU memory registration and notification-override regressions on transfer-handle repost. A new Dell ObjectScale S3-over-RDMA accelerated engine joins the OBJ plugin for high-bandwidth object storage, UCX now raises an explicit error when VRAM is misclassified as host memory, and nixlbench adds Neuron (Trainium/Inferentia) device support. Across core transfer paths, batched insertion of sorted descriptor lists and a bounded in-memory telemetry buffer contribute to lower per-request overhead and predictable memory behavior in long-running agents.

Major Features

  • NIXL-EP High-Throughput Kernels: Added a new high-throughput dispatch/combine kernel path (examples/device/ep/csrc/kernels/nixl_ep_ht.cu) alongside the renamed low-latency nixl_ep_ll.cu, with matching test_ht.py coverage and configurable GPU timeouts so the example can be tuned for production-scale runs. (#1341, #1503, #1520)
  • Plugin Manager: Deferred Plugin Loading: Plugins are now loaded the first time they are actually used instead of at agent construction. Reduces agent startup cost when only a subset of plugins is exercised and removes dead code from the telemetry path. (#1546, #1564)
  • Dell ObjectScale S3-over-RDMA Engine: New accelerated S3 engine under src/plugins/obj/s3_accel/dell/ (with tests) that talks to Dell ObjectScale over RDMA via a dedicated client. Wired into obj_backend.cpp and exercised through test/gtest/unit/obj/. (#1327)
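
Deferred plugin loading, as described in the Plugin Manager item above, follows the usual load-on-first-use pattern. The sketch below is a generic Python illustration, not the NIXL plugin manager; the class and method names are hypothetical.

```python
from typing import Callable, Dict, List


class LazyPluginManager:
    """Register loaders up front, but run each one only on first use."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], object]] = {}
        self._loaded: Dict[str, object] = {}

    def register(self, name: str, loader: Callable[[], object]) -> None:
        # Registration is cheap: nothing is loaded here.
        self._loaders[name] = loader

    def get(self, name: str) -> object:
        # The loader runs once, on the first request for this plugin.
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded_names(self) -> List[str]:
        return sorted(self._loaded)
```

An agent that registers many backends but only ever calls get() for one of them pays the loading cost for that one alone.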

API Changes

  • [Telemetry] Prometheus Exporter Migration: Removed the NIXL_TELEMETRY_BACKEND event category and createOrUpdateBackendEvent(). Counters are now registered with a _total suffix to match Prometheus naming conventions, and the agent_xfer_time / agent_xfer_post_time metrics moved from Gauge to Counter. (#1308)

  • [Telemetry] Event Signature Cleanup: Removed the public timestamp field from nixlTelemetryEvent, simplified backend_engine.h and telemetry_plugin.h signatures, and tightened the buffer_plugin API. Downstream consumers must update event-construction sites. (#1522)

  • [Packaging] Switch downstream installs from nixl[cu12] / nixl[cu13] to plain nixl: Action required: update your requirements.txt (or pyproject.toml / setup.py) to depend on nixl instead of nixl[cu12] or nixl[cu13]. The nixl meta wheel on PyPI now installs both nixl-cu12 and nixl-cu13 backends in a single step and selects the correct one at runtime from torch.version.cuda. The [cu12] / [cu13] extras are still accepted as no-op aliases so existing pins keep working, but new installs should drop the extra.

    torch is a mandatory runtime dependency and must be installed from the PyTorch index matching your CUDA driver, either before nixl or in the same pip install invocation. nixl declares torch as a dependency, but the default PyPI torch is CPU-only; pass --index-url https://download.pytorch.org/whl/cu130 (or the appropriate CUDA variant) so pip resolves a CUDA-enabled build. When nixl is imported, the meta package reads torch.version.cuda to select between the bundled nixl-cu12 and nixl-cu13 backends. (#1574, #1578)
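
The runtime selection described above can be sketched as a function of the torch.version.cuda string, which is a value such as "12.8" on CUDA builds and None on CPU-only builds. This is an illustration in Python, not the meta package's code: the function name is hypothetical, and raising an error for unsupported or missing versions is an assumption.

```python
from typing import Optional


def select_backend(cuda_version: Optional[str]) -> str:
    """Map a torch.version.cuda value to the bundled backend name."""
    if cuda_version is None:
        # CPU-only torch builds report no CUDA version.
        raise RuntimeError("torch reports no CUDA version; install a CUDA-enabled build")
    major = int(cuda_version.split(".")[0])
    if major == 12:
        return "nixl-cu12"
    if major == 13:
        return "nixl-cu13"
    raise RuntimeError(f"no bundled backend for CUDA {cuda_version}")
```

In real use the argument would come from torch.version.cuda; it is passed in explicitly here so the sketch runs without torch installed.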

Enhancements

Performance

  • [Core] Batch Insertion for Sorted Descriptor Lists: Added a batched insert path in nixl_descriptors.cpp / nixl_memory_section.cpp so registering large descriptor lists no longer pays the per-element ordered-insert cost. New sec_desc_list gtest covers the path. (#1479)
  • [Core] Deferred Plugin Loading: See Major Features. Avoids parsing/loading unused backends at agent startup. (#1546, #1564)
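
The gain from batched insertion into a sorted list comes from replacing many ordered inserts with one sort of the new items and a single merge. The sketch below contrasts the two approaches in Python on (address, length) tuples. It illustrates the general technique only and is not the code in nixl_descriptors.cpp; the function names are hypothetical.

```python
import bisect
from heapq import merge
from typing import List, Sequence, Tuple

Desc = Tuple[int, int]  # (address, length), a stand-in for a descriptor


def insert_one_by_one(sorted_descs: Sequence[Desc], new_descs: Sequence[Desc]) -> List[Desc]:
    """Per-element ordered insert: each insert may shift O(n) entries."""
    out = list(sorted_descs)
    for d in new_descs:
        bisect.insort(out, d)
    return out


def insert_batched(sorted_descs: Sequence[Desc], new_descs: Sequence[Desc]) -> List[Desc]:
    """Sort the new batch once, then merge the two sorted runs in one pass."""
    return list(merge(sorted_descs, sorted(new_descs)))
```

Both functions produce the same ordered list; the batched version does it in O((n + m) + m log m) time rather than O(n · m) in the worst case.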

Networking & Backend

  • [Libfabric] NUMA-Aware Rail Selection on Additional Instance Types: Extended the rail-manager / topology code paths to recognize more EC2 instance topologies (including c5n.18xlarge), warn cleanly when no policy applies, and keep the NUMA-aware policy from regressing on hardware it has not been tuned for. (#1461)
  • [Libfabric] Endpoint Locking for FI_THREAD_COMPLETION: Expanded the CQ mutex to cover endpoint-bound posting operations; in-line CQ progress is now skipped when the dedicated progress thread is enabled. (#1457)
  • [Libfabric] Active Rail Tracking: Reworked libfabric_rail_manager reference counting and added a sizable mock/unit-test harness (libfabric_mock_stubs.h, rail_active_refcount_test.cpp) so rail activation/deactivation is now covered by tests. (#1510)
  • [Libfabric] Multi-GPU Memory Registration: Restored the original two-if registration pattern in libfabric_backend.cpp and downgraded the multi-GPU detection log from WARN to INFO. Validated end-to-end with nixlbench --scheme tp --mode MG. (#1506)
  • [Libfabric] EFA Hardware Warning: Emit a clear warning when EFA hardware is present but the LIBFABRIC backend is not in use, plus new hw_warning_test coverage. (#1287)
  • [Libfabric] Log-Level Cleanup in EFA Path: Reclassified noisy WARN messages and added an Accelerator-PCI prefix to topology-mapping INFO logs for context. (#1462)
  • [UCX] Disable Emulated RMA Protocols: On UCX >= 1.21, force PROTO_EMULATION_ENABLE=n and pin IB_TX_INLINE_RESP=0 in ucx_utils.cpp so the backend refuses to fall back to software-emulated RMA -- transfers either run over true RDMA or fail fast instead of silently degrading. (#1611)
  • [UCX] Timeout Warning on Device Memory List Creation: Added a configurable timeout warning in mem_list.cpp so slow VRAM registrations surface visibly instead of hanging silently. (#1410)
  • [UCX] Plugin Cleanup: Removed an unused ucx_backend class field. (#1512)
  • [Mooncake] Dependency Bump: Updated the Mooncake submodule/build to v0.3.9, including matching CI matrix entries. (#1448)

NIXL Expert Parallelism (EP)

  • VMM API for Device Memory Allocation: Added a new vmm.cpp / vmm.hpp layer so the NIXL-EP example uses the CUDA VMM API instead of plain cudaMalloc for its device buffers. (#1415)
  • CUDA Graph Reuse Across Elastic Scaling: The low-latency dispatch/combine kernel path now reuses its CUDA graphs across elastic scale up/down instead of rebuilding them on every rank change. Buffer.connect_ranks gains an activate=False option (LL mode only) so newly connected ranks can be staged masked and activated later via update_mask_buffer. (#1584)
  • High-Throughput Follow-Up Fixes: Removed a redundant count buffer, fixed internode destruction flows, re-guarded p2p_ptr_get with is_rank_masked, and merged duplicated !low_latency_mode blocks. (#1503)
  • Robust Destruction Flows: Deregister buffers before cudaFree, fix disconnect rank ordering, skip remote prepMemView when there are no peers, and warn (rather than throw) on destructor failures. (#1430)
  • Non-Consecutive Rank Support: Move p2p_ptr_get calls inside the rank-mask guard so configurations like ranks [0, 2] no longer dereference uninitialized P2P mappings. (#1478)
  • Signaling Buffer Corruption on Elastic Scale-Up: Size the signaling region for the maximum expert count so growing num_experts no longer overlaps with send/recv data. (#1451)
  • Planned-SIGTERM Handling in Elastic Test: elastic.py now recognizes the intentional SIGTERM injected to simulate rank failure and does not flag those workers as errors. (#1500)
  • GCC maybe-uninitialized Fix in ht_dispatch: Switched to raw pointers in ternary expressions matching the existing recv_topk_* pattern so -Werror=maybe-uninitialized no longer fails the example build. (#1525)
  • Cleanup: Removed an unused variable in the EP csrc kernels. (#1508)

Limitations

  • Cross-NVL-domain runs are not supported in this release. Launching NIXL EP across nodes that belong to different NVLink domains (for example, with SLURM --segment 1, or any allocation that spans NVL blocks) will fail during connection setup.

Packaging & Distribution

  • Unified Meta Wheel + CUDA-Matched Torch: Implementation side of the pip install nixl simplification described under API Changes. Adds a -Drelease_wheel=true meson option (meson_options.txt, nixl-meta/meson.build, pyproject.toml.in) that toggles the unified meta wheel for release builds vs. a single-backend wheel for source builds; the manylinux Dockerfile emits the meta wheel only on the cu12 pass since it is identical for cu12/cu13; the vLLM and SGLang Dockerfiles drop their per-CUDA wheel-selection logic; and `.gitlab/b...

v1.0.1

14 Apr 19:56
d196ff6

1.0.1

Summary

NVIDIA® NIXL Release 1.0.1 is a targeted maintenance release focusing on NIXL-EP stability fixes, libfabric transport reliability improvements, and build/packaging improvements across UCX, Python wheel, and Docker environments.

NIXL-EP Fixes

  • Fix Destruction Flows: Fixed resource cleanup and destruction ordering in NIXL-EP to prevent crashes and resource leaks during shutdown (#1452).
  • Fix Signaling Buffer Corruption During Elastic Scale-Up: Fixed a signaling buffer corruption issue in NIXL-EP that could occur when new nodes join during elastic scale-up, ensuring correct buffer state across topology changes (#1453).

Libfabric Fixes

  • Fix Notification Override on Transfer Handle Repost: Fixed an issue in the libfabric backend where updated notification messages were ignored when transfer handles were reposted, causing reposted transfers to always use the original notification from initial preparation time (#1482, #1433).
  • Fix Endpoint Thread Safety: Added proper mutex locking for all endpoint access in the libfabric backend to satisfy FI_THREAD_COMPLETION thread-safety requirements, preventing potential race conditions during concurrent I/O operations (#1483, #1457).

Build & Packaging

  • Enable UCX EP Support in Python Wheel Build: Added UCX endpoint support to the Python wheel build, enabling NIXL-EP functionality for pip-installed deployments (#1440).
  • Disable gdrcopy in UCX Build: Disabled gdrcopy in the UCX build to avoid linkage conflicts in environments where gdrcopy is not available or not needed (#1436).
  • Fix Abseil Version Conflicts: Resolved Abseil version conflicts in NIXL builds and Docker images that could cause linker errors or runtime symbol mismatches (#1432).
  • Bump RDMA Memory Check UCX Version: Updated the UCX version used for RDMA memory checks to align with the latest supported UCX release (#1445).
  • Pin Torch Version to 2.11: Pinned the PyTorch dependency to version 2.11 for reproducible builds and compatibility (#1471).
  • Add pkg-config Install: Added missing pkg-config installation to the build environment, fixing build failures in minimal container images (#1450).
  • Fix Dependency Issues: Removed strict PyTorch version check during module initialization to allow broader compatibility, and unified UCX checkout behavior to consistently use the configured UCX reference (#1488).

Full Changelog: 1.0.0...1.0.1

v1.0.0

13 Mar 06:56
4071a53

1.0.0

Summary

The NVIDIA® NIXL Release 1.0.0 marks a major milestone in API stability and production readiness. This release finalizes the Device API V2 transition by removing the legacy V1 implementation and normalizing naming conventions, establishing a clean and stable 1.0 API surface. A new two-phase configuration framework replaces ad-hoc environment and API configuration handling, providing a consistent and extensible runtime configuration model across all NIXL components.

NIXL 1.0.0 also delivers significant improvements to networking, storage, and backend infrastructure. The Libfabric backend gains NUMA-aware rail selection for topology-optimized transport on multi-socket systems, while the POSIX plugin introduces per-engine IO queuing for higher filesystem concurrency. Cloud storage maturity advances with Azure Blob connection string and CA bundle support, S3 CRT multipart threshold corrections, and expanded object storage functional testing. Across core transfer paths, descriptor iteration and transfer request creation have been optimized, telemetry overhead reduced, and UCCL batch transfer handling simplified, contributing to improved throughput and lower latency in production deployments.

Major Features

  • Device API V2 Finalization: Removed the legacy Device API V1 implementation and completed all V2-oriented cleanups, including normalizing MemoryView to MemView across the codebase. This establishes Device API V2 as the sole, stable device programming interface for NIXL 1.0. (#1342, #1337, #1376)
  • Configuration Framework: Introduced a comprehensive two-phase configuration system spanning environment-driven (Part 1) and API-driven (Part 2) settings. This replaces the previous ad-hoc configuration approach with a structured, extensible model for runtime configuration management. (#1301, #1346)
  • Libfabric NUMA-Aware Rail Selection: Added a NUMA-aware rail selection policy for DRAM_SEG memory types in the Libfabric backend. This enables topology-optimized transport selection on multi-socket systems by aligning memory access with the closest available network rail. (#1302)
  • Libfabric Control Rail Removal: Removed the legacy control rail from the Libfabric backend, simplifying the rail management architecture and reducing resource overhead. (#1386)
  • POSIX Per-Engine IO Queue: Added per-engine IO queue support in the POSIX plugin, improving concurrency and scheduling behavior for filesystem-backed storage workflows. (#1051)
  • Azure Blob Storage Enhancements: Extended the Azure Blob Storage plugin with connection string authentication support and configurable CA bundle settings, improving deployment flexibility in enterprise and air-gapped environments. (#1351, #1329)
  • Rust Runtime Library Resolution: Introduced dlopen-based nixl-sys stubs via libnixl_capi.so, enabling Rust consumers to resolve NIXL at runtime without build-time linking. Stubs use dlopen/dlsym for lazy forwarding to the real implementation. (#1358)
  • UCCL Batch Transfer Optimization: Simplified and optimized UCCL handling for batch transfer workflows, reducing overhead in multi-transfer scenarios. (#1271)

API Changes

  • [Device] API V1 Removal: The Device API V1 has been fully removed. Device API V2 is now the only supported device API path. Applications still using V1 interfaces must migrate to V2. (#1342)
  • [Device] MemoryView to MemView Rename: Device API V2 naming has been normalized from MemoryView to MemView across all interfaces. Downstream code should update symbol references accordingly. (#1337)
  • [Config] New Configuration Model: Runtime configuration now follows the new two-phase environment/API configuration framework. Existing environment variable and API-based configuration patterns should be reviewed against the new model. (#1301, #1346)

Enhancements

Performance

  • [Core] Transfer Request Optimization: Optimized createXferReq performance to reduce overhead in the hot transfer request creation path. (#1338)
  • [Core] Descriptor List Iteration: Optimized nixlDescList iteration for faster descriptor traversal during transfer operations. (#1322)
  • [Telemetry] Reduced Runtime Overhead: Optimized nixlTelemetry::addXferTime for lower per-transfer telemetry cost. (#1365)
  • [Telemetry] Remove Backend Export: Removed backend telemetry export to eliminate unnecessary overhead in the telemetry pipeline. (#1364)
  • [UCCL] Batch Transfer Simplification: Simplified and optimized UCCL for batch transfer workflows, reducing per-transfer overhead. (#1271)
  • [UCX] Removed Extra Copy: Eliminated an unnecessary data copy from prepMemoryView in the UCX plugin, reducing memory transfer overhead. (#1261)
  • [Core] Serialization Optimizations: Added validation checks and small optimizations to serialization/deserialization utilities. (#1277)

Networking & Backend

  • [Libfabric] Remove Post-Operation Retry Delay: Removed the delay between post-operation retries in the Libfabric backend, reducing latency for retry-sensitive workloads. (#1335)
  • [Libfabric] TCP Provider Fix: Fixed TCP provider behavior in the Libfabric backend for correct operation on TCP-based fabrics. (#1348)
  • [Libfabric] EFA Hardware Warning: Added a warning when EFA hardware is present but the Libfabric backend is not selected, improving user guidance during initialization. (#1287)
  • [UCX] VRAM Memory-Type Validation: NIXL now raises an error when UCX incorrectly reports VRAM memory as host memory, preventing silent misclassification that could cause data corruption. (#1385, #1393)
  • [UCX] Targeted GDA Configuration: RC GDA configuration is now applied only when relevant UCX transports are active, avoiding unnecessary configuration side-effects. (#1347)
  • [UCX] Utils Refactoring: Minor refactoring of UCX backend utility code for improved maintainability. (#1291)
  • [Core] Memory Handling Refactor Preparation: Refactored internal core/agent memory handling structures in preparation for follow-on memory management improvements. (#1361, #1370)
  • [Core] Listener Error Reporting: Fixed incorrect error reporting in the core metadata listener. (#1345)

NIXL-EP Example Improvements

  • Release Build Tuning: Disabled fast fault detection for release builds to reduce overhead in production deployments. (#1275)

Build & CI

  • CI: GPU Test Migration to Slurm: Moved GPU tests to Slurm-based scheduling with improved allocation timeouts and test timeout handling for more reliable GPU CI execution. (#1250, #1366, #1336, #1318)
  • CI: Build Abort Logic: Improved abort logic for obsolete builds and added automatic cancellation of previous dispatcher builds for the same PR. (#1317, #1281)
  • CI: Mooncake Non-Interactive Build: Configured Mooncake builds to run without interactive prompts. (#1290)
  • CI: GHA Runner Pinning: Pinned GitHub Actions runner versions and kubectl for improved CI stability. (#804)
  • CI: Demo Pinning: Pinned ci-demo to the stable_nixl tag for reproducible demo builds. (#1343)
  • CI: UCX Bug Workaround: Added a workaround for a UCX bug affecting CI test runs. (#1310)
  • Build: Wheel Environment Updates: Upgraded hwloc in the wheel build environment, disabled nvlm in manylinux Dockerfile, and updated S3 SDK version. (#1396, #1403, #1387)
  • Build: Meson Device API V2 Fix: Fixed the ucx_gpu_device_api_v2_available detection in meson.build. (#1316)
  • Code Quality Automation: Added CodeRabbit configuration for AI-assisted code reviews and expanded the code style guide for contributor consistency. (#1293, #1143)

Benchmarks

  • [nixlbench] Aggregate BW Fix: Fixed aggregate bandwidth calculation in pairwise single-group mode. (#1299)
  • [nixlbench] SHA256 Checksum: Switched to SHA256 checksum algorithm when uploading test objects for READ tests. (#1286)
  • [nixlbench] GUSLI READ Fix: Fixed GUSLI READ consistency check failure and cleanup crash. (#1300)
  • [nixlbench] CLI Simplification: Reduced the number of CLI argument combinations in benchmark test matrix. (#1363)

Documentation

  • Code Style Guide: Expanded the code style guide with additional guidance for contributors. (#1143)

Bugfixes

  • [OBJ/S3 CRT] Multipart Sizing: Aligned CRT partSize and multipart upload threshold with the CRT minimum part size limit, preventing invalid multipart configurations. (#1368)
  • [Libfabric] Unit Test Regression: Fixed a regression in Libfabric unit tests and updated related README and comments. (#1394)
  • [Libfabric] TCP Provider: Fixed TCP provider behavior for correct operation on TCP-based fabrics. (#1348)
  • [Core] Worker File Offset: Fixed file_offset calculation in nixl_worker. (#1399)
  • [Core] Listener Error Reporting: Fixed wrong error reporting in the core metadata listener. (#1345)
  • [Core] Metadata Listener Setup: Fixed silent failure on metadata listener socket setup, improving error visibility. (#1371)
  • [Core] Metadata Test Coverage: Handled all error message cases in metadata test for complete error path coverage. (#1372)
  • [UCX] VRAM Misclassification: Fixed silent VRAM-as-host-memory misclassification by raising an explicit error when UCX reports incorrect memory types. (#1385, #1393)
  • [UCX] Worker Test: Fixed ucx_worker_test for the USE_VRAM case. (#1373)
  • [POSIX] Naming: Fixed POSIX plugin naming conventions. (#1296)
  • [Core] String Utility Cleanup: Removed deprecated strEqual and cleaned up common string utility tools. (#1391, #1309)
  • [Core] CUDA Memory Init: Fixed CUDA memory initialization ordering. (#1360)

Test Infrastructure

  • Stricter Test Failures: Updated Google Test harness to fail on unexpected error or warning log messages, improving detection of regressions. (#1288)
    -...

0.10.1

03 Mar 20:13
d5c127e

0.10.1

Summary

NVIDIA® NIXL Release 0.10.1 is a targeted maintenance release focusing on improvements to Rust bindings, Python packaging, and testing infrastructure.

Rust Bindings

  • Runtime Library Resolution: Introduced libnixl_capi.so, a shared library that exports nixl_capi_* C symbols. This allows downstream consumers of the nixl-sys Rust crate to load NIXL at runtime without requiring build-time linking. The NIXL stubs have been rewritten from abort-on-call to use dlopen/dlsym for lazy forwarding to the real implementation (#1358).
  • Build Checks: Added a libnixl.so existence check in build.rs before attempting to link. This ensures the fallback to stubs works correctly when nixl-sys is used as a git dependency where headers are available but libraries are not installed (#1358).
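
The lazy-forwarding idea behind the stubs (open the library and look up each symbol only when it is first called) can be illustrated with Python's ctypes, which wraps dlopen and dlsym on POSIX systems. This sketch does not load libnixl_capi.so. It assumes a POSIX platform, resolves strlen from the running process as a stand-in, and the LazyLibrary class is hypothetical.

```python
import ctypes
from typing import Any, Dict, Optional, Sequence


class LazyLibrary:
    """Open the shared library and resolve each symbol only on first call."""

    def __init__(self, path: Optional[str] = None) -> None:
        self._path = path  # None means the current process (dlopen(NULL)) on POSIX
        self._lib: Optional[ctypes.CDLL] = None
        self._symbols: Dict[str, Any] = {}

    def call(self, name: str, restype: Any, argtypes: Sequence[Any], *args: Any) -> Any:
        if self._lib is None:
            self._lib = ctypes.CDLL(self._path)  # the dlopen step, deferred until now
        fn = self._symbols.get(name)
        if fn is None:
            fn = getattr(self._lib, name)  # the dlsym step; AttributeError if missing
            fn.restype = restype
            fn.argtypes = list(argtypes)
            self._symbols[name] = fn  # cache so later calls skip the lookup
        return fn(*args)
```

Constructing LazyLibrary does no loading at all, so a consumer can be built and started without the real library present and only fails if it actually calls into it.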

Python Packaging & Testing

  • Exact Version Pinning: Pinned the Python CUDA-specific dependencies (nixl-cu12 and nixl-cu13) to exact versions in the nixl meta-package. This prevents version mismatches and ensures consistent environments when installing the meta-package (#1354).
  • Testing Infrastructure: Added support for running Python tests with a pre-created virtual environment (with support for both standard pip and uv pip), improving CI flexibility and local development workflows (#1353).

Full Changelog: 0.10.0...0.10.1

0.10.0

18 Feb 01:38
b71d09e

Summary

The NVIDIA® NIXL Release 0.10.0 delivers major advancements in cloud storage, networking, and device integration, reinforcing NIXL’s hardware‑agnostic design. This release introduces full AWS Neuron device support, enabling developers to leverage heterogeneous compute environments with consistent performance and unified APIs. Alongside this, NIXL 0.10.0 expands storage capabilities with a new Azure Blob Storage plugin, S3 CRT client integration, and a hierarchical object storage architecture designed to support upcoming S3 RDMA acceleration.

The release also debuts the comprehensive Device API V2, extending across host-side, backend, and device-side implementations to simplify integration and improve portability. Networking enhancements include support for Slingshot/CXI providers and multiple Libfabric performance optimizations, such as CQ batch reads and an improved threading model for higher concurrency and throughput.

Major Features

  • Azure Blob Storage Plugin: Introduce the initial implementation of Azure Blob Storage support as another cloud object storage backend for NIXL. This plugin uses the Azure SDK for C++ to provide object segment storage and OAuth-based authentication through Microsoft Entra ID. It serves as a functional alternative to the existing S3 backend and includes integration tests, nixlbench benchmarking support, and CI infrastructure for validation. (#1233)
  • S3 CRT Client Support: Added support for the AWS Common Runtime (CRT) S3 client, enabling higher throughput S3 transfers with automatic multipart upload/download parallelism. (#1127)
  • Hierarchical Object Storage Architecture: Refactored the object storage plugin into a modular, inheritance-based client and engine architecture. This introduces a clean separation of concerns with support for standard S3, CRT-based S3, and a placeholder for future accelerated (RDMA) S3 backends with vendor-specific extensions. (#1247)
  • Device API V2: A comprehensive redesign of the NIXL Device API, delivering a new programming model across the full stack:
    - API definition and interface updates. (#1229)
    - Core host-side implementation. (#1230)
    - UCX backend implementation. (#1245)
    - GPU and UCX device-side implementation. (#1255)
  • Libfabric Slingshot/CXI Support: Added support for the HPE Slingshot/CXI provider in Libfabric, including FI_MR_ENDPOINT memory registration mode handling. This enables NIXL to run on Slingshot interconnect fabrics commonly found in HPC environments. (#1242)
  • AWS Neuron Device Support: Added OFI (OpenFabrics Interfaces) support for AWS Neuron devices (Trainium/Inferentia), enabling NIXL networking over Neuron-based accelerators. (#1258)
  • Dual License Updates: Updated licensing information in the GitHub repository to reflect dual licensing (Apache 2.0 + MIT for DeepEP-derived code).

API Changes

  • [Device] Device API V2: The Device API has been redesigned with new interfaces for host-side and device-side operations, along with a memory view API that simplifies device initialization. This is a significant API change from the v1 Device API. (#1229, #1230, #1245, #1255)

Enhancements

Performance

  • [Python] GIL Release: Released the Python Global Interpreter Lock (GIL) in time-consuming NIXL functions, improving concurrency for multi-threaded Python applications. (#1232)
  • [Libfabric] CQ Batch Reads & Threading Model: Implemented batch completion queue reads (16 entries per read) and changed the threading model from FI_THREAD_SAFE to FI_THREAD_COMPLETION for improved Libfabric performance. (#1272)
  • [Libfabric] Remove CM Thread: Removed the connection management thread as the Libfabric plugin moves to the EFA protocol, simplifying the connection flow and reducing overhead. (#1251)
  • [Libfabric] EFA Infinite RNR Retry: Enabled infinite Receiver Not Ready (RNR) retry at the EFA firmware level (FI_OPT_EFA_RNR_RETRY=7), improving reliability by preventing RNR timeout failures. (#1207)
  • [Libfabric] Notification Fragmentation: Implemented notification fragmentation for large messages in the Libfabric backend, enabling reliable transfer of messages that exceed single-notification size limits. (#1142)
  • [Libfabric] Disable Unsolicited Write Recv: Disabled unsolicited write receive for EFA RDM endpoints, reducing completion queue overflow under high write loads. (#1084)
  • [UCX] RC GDA Multi-Channel Config: Added configuration support for RC GDA (GPU Direct Access) number of channels, allowing tuning of GPU-Direct RDMA transport parallelism. (#1206)
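
The practical effect of the GIL release can be illustrated with a minimal sketch. It does not call NIXL: blocking_call is a hypothetical stand-in built on time.sleep, which releases the GIL while it waits, in the same way a long-running native function can once it releases the GIL. The thread count and duration are arbitrary values chosen for illustration.

```python
import threading
import time


def blocking_call(duration_s):
    # Hypothetical stand-in for a time-consuming native call.
    # time.sleep releases the GIL while it waits, so other threads can run.
    time.sleep(duration_s)


def run_concurrently(num_threads, duration_s):
    """Run blocking_call on several threads and return the wall-clock time."""
    threads = [
        threading.Thread(target=blocking_call, args=(duration_s,))
        for _ in range(num_threads)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


NUM_THREADS = 4
DURATION_S = 0.2

elapsed = run_concurrently(NUM_THREADS, DURATION_S)
serial_estimate = NUM_THREADS * DURATION_S  # cost if the calls could not overlap
print(f"elapsed={elapsed:.2f}s, serial estimate={serial_estimate:.2f}s")
```

When the blocking call releases the GIL, the elapsed time stays close to a single call's duration rather than the serial estimate.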

NIXL-EP example Improvements

  • Memory View API Migration: Migrated NIXL-EP example to the new Device API V2 memory view interface, greatly simplifying device and host initialization code. (#1256)
  • Unified Data and Counter Buffers: Consolidated data and counter buffer management for cleaner resource handling. (#1186)
  • Channels Modulo Elimination: Removed channels modulo logic for simplified channel management. (#1200)
  • UCX Multi-Channel API: Migrated to the UCX multi-channel API for improved transfer parallelism. (#1175)
  • TCPStore Migration: Migrated metadata exchange and elastic coordination from etcd to PyTorch TCPStore for simplified deployment. (#1144, #1155)

Networking & Backend

  • [UCX] Plugin Reorganization: Moved UCX utilities from src/utils/ucx/ to src/plugins/ucx/ for improved code organization and maintainability. (#1005, #1254, #1234)
  • [UCX] Removed CUDA Context Management: Removed manual CUDA context management from the UCX plugin, simplifying the backend. (#946)
  • [UCX] Hardware Warning Improvements: Added warnings when hardware is not supported by UCX, improving user feedback during initialization. (#1241)
  • [Core] Plugin Manager Refactor: Refactored the plugin manager for improved memory safety and string handling. (#1209)

Build & CI

  • CI: Azure Infrastructure: Added Azurite (Azure Storage emulator), Azure SDK, and Azure CLI to CI for Azure Blob Storage plugin testing. (#1263)
  • CI Build Speed: Improved speed of building NIXL in CI pipelines. (#1139)
  • Build Instructions: Added missing ninja install step and detailed build instructions for first-time users. (#1181)
  • Telemetry Metrics Description: Updated metrics types and descriptions for the telemetry system. (#1280)
  • External Plugin Headers: Installed missing headers required by external plugin builds. (#1135)

Benchmarks

  • [nixlbench] Config File Support: Added --config_file parameter support to nixlbench, enabling benchmark configuration via external files. (#989)
  • [kvbench] New Models: Added GPT-OSS and Qwen3 model support to kvbench for expanded LLM benchmarking. (#1108)
  • [nixlbench] Neuron Device Support: Added support for AWS Neuron devices with VRAM segment type in nixlbench, using dynamic library loading to avoid hard dependency on libnrt. (#1265)
  • [nixlbench] Filenames Parameter: Added --filenames parameter for specifying storage object names. (#1192)
  • [nixlbench] GUSLI Auto-Generated Config: nixlbench now automatically generates GUSLI configuration. (#1193)
  • [nixlbench] Backward Compatibility: Restored gflags-based flag parsing for backward compatibility. (#1289)

Documentation

  • Supported Platforms: Clarified supported platforms and Linux build prerequisites in the documentation. (#1211)
  • Build Directory Reference: Updated build directory references in README. (#1246)
  • Typo Fixes: Fixed typos across documentation files. (#1260)

Bugfixes

  • [GUSLI] Incorrect Offset: Fixed an incorrect offset calculation in the GUSLI plugin. (#1244)
  • [GPUNetIO] Build Fix: Fixed the GPUNetIO plugin build. (#1266)
  • [UCX] Config Parsing: Fixed UCX configuration parsing issues. (#1201)
  • [UCX] Memory Leak: Configured a sane value for the rcache unreleased threshold, fixing a memory leak in the UCX backend. (#1210)
  • [UCX] Plugin Bug: Fixed a bug in the UCX plugin. (#1082)
  • [UCCL] Consistency Checks: Fixed consistency check logic in the UCCL backend. (#1151)
  • [OBJ/S3] Virtual Addressing: Set useVirtualAddressing for S3CRTClient to work around an AWS SDK bug. (#1283)
  • [Prometheus] CMake Build: Fixed Prometheus telemetry plugin build with newer CMake versions. (#1196)
  • [Prometheus] Security Patch: Applied a security patch for a dependency of the Prometheus plugin. (#1204)
  • [Core] Role Validation: Fixed role parameter validation logic error. (#1237)
  • [Core] HW Detection Logging: Replaced errors and exceptions with warning log messages in hardware detection to avoid false-alarm error messages. (#1273)
  • [Telemetry] Dangling Pointer: Fixed a dangling pointer in the telemetry plugin getName/getVersion methods that could cause crashes. (#1148)
  • [Libfabric] GPU-to-EFA Mapping: Fixed GPU-to-EFA NIC mapping to use PCI bus IDs instead of device indices, correcting topology-aware routing on multi-GPU/multi-NIC systems. (#1149)
  • [Build] AWS Dependencies: Fixed AWS build dependencies. (#1262)
  • [CI] etcd Process Kill: Prevented pkill etcd from accidentally killing other containers' processes. (#1227)

Test Infrastructure

  • [Rust] Sync Manager Tests: Added test coverage for the Rust sync_manager. (#1006)
  • Debugging: Added descriptor list dump on posting with debug logging enabled for improved troubleshooting. (#1243)

Known Issues

Full Changelog: 0.9.0...0.10.0

0.9.0

21 Jan 19:17
2d475e4

Summary

NVIDIA® NIXL Release 0.9.0 delivers significant new capabilities and performance improvements. Key highlights include the introduction of the UCCL backend for optimized collective communication, a new Telemetry Plugin infrastructure with Prometheus support, and the NIXL-EP example demonstrating expert-parallel dispatch. This release also adds support for Python 3.14, enables Shared Memory for Libfabric intra-node transfers, and includes important performance optimizations for core request handling.

This version contains breaking changes, specifically the removal of support for Python 3.9.

Major Features

  • UCCL Backend Integration: Added support for the UCCL P2P backend, enabling efficient GPU memory transfers over RDMA. (#895)
  • Telemetry Plugin Infrastructure: A new extensible telemetry plugin system has been introduced, allowing for custom metric exporters.
    • Prometheus Exporter: A new plugin to export metrics to Prometheus. (#1091)
    • Cyclic Buffer Exporter: A plugin for high-performance cyclic buffer telemetry. (#1088)
    • Plugin Manager: Infrastructure to support loading and managing telemetry plugins. (#1070)
  • NIXL-EP (Elasticity Example): Introduced examples/device/ep, a comprehensive example demonstrating expert-parallel dispatch and combine operations using the NIXL device API. This includes improved metadata fetching and CI integration. (#1043, #1132, #1104, #1077)
    • Enable CUDA IPC NVLINK backend: NIXL-EP now enables the CUDA IPC NVLINK backend for improved intra-node GPU communication. (#1099)
  • Libfabric Shared Memory Support: Enabled the shared memory provider (shm) for NVLink intra-node transfers in the Libfabric backend. This improves performance for local GPU-to-GPU communication by leveraging NVLink without requiring network transport. (#1076)

Breaking Changes

  • Python 3.9 Support Removed: Support for Python 3.9 has been removed. The supported Python versions are now 3.10 through 3.14. (#1071)
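
Downstream projects that want to fail early on an unsupported interpreter could guard against it with a check like the following minimal sketch. The function name and constants are hypothetical; only the 3.10 through 3.14 range comes from the note above.

```python
import sys

# Supported range taken from the release note above.
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 14)


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version is within the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if not is_supported_python():
    print("Warning: this interpreter is outside the 3.10 through 3.14 range")
```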

API Changes

  • [Python] Python 3.14 Support: Added official support for Python 3.14. (#1071)
  • [Python] Explicit API Exports: Python APIs are now explicitly exported using __all__ to control the public namespace and enable cleaner imports. (#1062)
  • [Rust] Custom Backend Parameters: Added support for passing custom backend parameters in Rust bindings. (#900)
  • [Rust] Descriptor Serialization: Added Serde serialization support for RegDescList and XferDescList. (#829)
  • [Rust] Indexing Support: Added Index/IndexMut and get/get_mut methods for descriptor lists. (#1003)
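
For readers unfamiliar with how __all__ shapes a module's public namespace, here is a minimal, self-contained sketch. The module name demo_pkg and its functions are hypothetical and are not NIXL's actual exports; the module is built in memory so the example runs without any files.

```python
import sys
import types

# Source of a throwaway module. Only public_api is listed in __all__.
SOURCE = '''
__all__ = ["public_api"]

def public_api():
    return "ok"

def unlisted():
    return "not exported by a star-import"
'''

# Create the hypothetical module in memory and register it for import.
demo = types.ModuleType("demo_pkg")
exec(SOURCE, demo.__dict__)
sys.modules["demo_pkg"] = demo

# A star-import copies only the names listed in __all__.
namespace = {}
exec("from demo_pkg import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

Names left out of __all__ remain reachable through explicit attribute access (for example demo.unlisted), so __all__ documents the public surface rather than enforcing privacy.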

Enhancements

Performance

  • [Core] Request Handling Optimization: Implemented request handling optimizations to reduce overhead for large batches of small messages. (#1009)
  • [UCX] Relaxed Ordering: Set UCX_IB_PCI_RELAXED_ORDERING=try by default to improve PCI performance where supported. (#1012)
  • [Libfabric] EFA Unsolicited Write Recv: Added FI_OPT_EFA_USE_UNSOLICITED_WRITE_RECV option to disable unsolicited write receives on EFA RDM, reducing CQ overflows under high load. (#1084)
  • [Libfabric] Large Message Notifications: Implemented notification fragmentation for large messages to better support large transfers (e.g., TensorRT-LLM disaggregated workloads). (#1182)
  • [NIXL-EP] Parallel Metadata Fetch: Optimized connection establishment in NIXL-EP by parallelizing metadata fetches. (#1132)
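
The general idea behind notification fragmentation can be sketched as follows. This is not the Libfabric backend's wire format: the header layout (message id, fragment index, fragment count) and the function names are assumptions made for illustration only.

```python
import struct

# Assumed header layout: message id (u32), fragment index (u16), fragment count (u16).
HEADER = struct.Struct("!IHH")


def fragment(message_id, payload, max_fragment_payload):
    """Split payload into header-prefixed fragments of bounded size."""
    if max_fragment_payload <= 0:
        raise ValueError("max_fragment_payload must be positive")
    chunks = [
        payload[i:i + max_fragment_payload]
        for i in range(0, len(payload), max_fragment_payload)
    ] or [b""]  # an empty payload still produces one fragment
    if len(chunks) > 0xFFFF:
        raise ValueError("payload needs more fragments than the header can count")
    return [
        HEADER.pack(message_id, index, len(chunks)) + chunk
        for index, chunk in enumerate(chunks)
    ]


def reassemble(fragments):
    """Rebuild the payload from fragments that may arrive in any order."""
    parts = {}
    message_id = None
    expected = None
    for frag in fragments:
        mid, index, count = HEADER.unpack_from(frag)
        if message_id is None:
            message_id, expected = mid, count
        elif (mid, count) != (message_id, expected):
            raise ValueError("fragments belong to different messages")
        if index >= count:
            raise ValueError("fragment index out of range")
        parts[index] = frag[HEADER.size:]
    if expected is None or len(parts) != expected:
        raise ValueError("message is incomplete")
    return b"".join(parts[i] for i in range(expected))


payload = bytes(range(256)) * 10           # 2560 bytes
frags = fragment(7, payload, 1000)         # split into bounded pieces
restored = reassemble(list(reversed(frags)))  # order does not matter
```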

Build & CI

  • Selective Plugin Building: Added Meson options to selectively enable or disable specific plugins during build (-Denable_plugins=...). (#951)
  • CUDA 13 Support: Updated CI infrastructure to support CUDA 13 for GPU tests. (#996)
  • POSIX Plugin Dependencies: Fixed POSIX plugin dependency handling in Meson to resolve build issues with TRTLLM. (#1086)
  • Python License: Fixed license identifier in Python packages to match the LICENSE file. (#1119)
  • manylinux wheel packaging: Added the uring library to the manylinux 0.9.0 Docker image so it’s included with wheels (enables POSIX plugin access to performant async I/O options). (#1185)

Documentation

  • Libfabric Guide: Improved clarity and grammar in the Libfabric README. (#1007)
  • Examples: Enhanced the basic Python examples and nixl_ep documentation. (#1078, #1093)

Bugfixes

  • [POSIX] AIO Resubmission: Fixed a bug where the Linux AIO plugin could not correctly resubmit I/O requests. (#1020)
  • [Libfabric] Sockets Deadlock: Fixed a connection deadlock with the sockets provider by reducing the CQ read timeout. (#1080)
  • [Libfabric] Topology Grouping: Fixed GPU NIC grouping logic in libfabric_topology when multiple GPUs share a NIC. (#1024)
  • [Libfabric] GPU-to-EFA Mapping: Fixed GPU-to-EFA mapping by using PCI bus IDs instead of GPU IDs, ensuring correct device association. (#1184)
  • [Core] Metadata Crash: Fixed a crash that occurred if a peer closed the connection during metadata exchange. (#854)
  • [Telemetry] Dangling Pointer: Resolved a dangling pointer issue in getName/getVersion in the telemetry plugin. (#1148)
  • [Benchmark] nixlbench Memory: nixlbench now allocates page-aligned memory by default to ensure consistency. (#1060)
  • [Python] venv Support: Fixed Python test scripts to work correctly inside uv virtual environments. (#1106)
  • [UCX] Device API Detection: Fixed detection logic for UCX GPU device API support. (#990)
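
As an illustration of what page-aligned allocation means, the following minimal sketch allocates a buffer through an anonymous memory mapping and checks that its start address is a multiple of the page size. It is unrelated to how nixlbench allocates memory internally; the helper name is hypothetical.

```python
import ctypes
import mmap


def page_aligned_buffer(size):
    # Anonymous mappings are handed out in whole pages,
    # so the buffer starts on a page boundary.
    return mmap.mmap(-1, size)


buf = page_aligned_buffer(3 * mmap.PAGESIZE)

# Obtain the buffer's start address through ctypes.
view = ctypes.c_char.from_buffer(buf)
address = ctypes.addressof(view)
is_aligned = address % mmap.PAGESIZE == 0

del view     # release the exported buffer before closing the mapping
buf.close()
```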

Benchmarks & Test Infrastructure

  • [kvbench] FLOPs Estimation: Added Tensor Parallel (TP) scaling and MLP FLOPs to compute time estimates in kvbench. (#1083)
  • [nixlbench] Consistency Check: Added data validation consistency checks to nixlbench. (#1103)

Known Issues

Full Changelog: 0.8.0...0.9.0

0.8.0

20 Nov 02:03
08d6094

Summary

NVIDIA® NIXL Release 0.8.0 delivers significant performance improvements, major new capabilities, and important dependency updates. Key highlights include a massive optimization for large-batch workloads in the UCX backend, the introduction of a new POSIX backend using Linux AIO for high-performance storage I/O, and direct CUDA memory registration support for the Libfabric backend.

This version contains breaking changes, including the removal of the legacy Multi-Object UCX backend and an update to the minimum required Libfabric version. It also introduces support for Python 3.13 and changes the default build type to Release for optimized performance out of the box.

Major Features & Improvements

  • UCX Performance for Large-Batch Workloads: The request handling mechanism in the UCX backend has been overhauled to reduce overheads. For workloads with large batches (~64k) of small messages (<1KB), as in sglang and other LLM inference engines that use paged attention, this change significantly reduces latency and improves time-to-first-token (TTFT). Internal benchmarks show a ~50% performance increase in nixlbench and a ~20% TTFT reduction in sglang for these scenarios. (#982)
  • Linux AIO plugin for the POSIX backend: The POSIX backend now leverages the Linux Asynchronous I/O (AIO) API where available. This provides a high-performance, asynchronous interface for data transfers to and from local storage. Internal benchmarks show an increase in read throughput for read sizes above 100 kB. (#885)
  • Libfabric CUDA Memory Registration: The Libfabric backend can now directly register CUDA memory regions using fi_mr_regattr. This adds support for extended memory registration attributes and optimized RDMA behavior. (#960)
  • Python: Added support for Python 3.13. (#994)

Breaking Changes

  • UCX Multi-Object Backend Removed: The legacy Multi-Object (UCX_MO) backend has been removed. Users should migrate to the primary UCX backend, which now incorporates multi-device support. (#898)
  • Libfabric Minimum Version Increased: The minimum required version of Libfabric has been raised to v1.21.0 to support new features. (#961)
  • Default Build Type is now Release: When building from source, the default build type is now Release instead of Debug. This ensures that default builds are optimized for performance. (#869)

API Changes

  • [Rust] New RegDescList and XferDescList APIs have been added to the Rust interface for descriptor management. (#828)
  • [Python] Obsolete and unused code from the Python API has been removed. (#985)

Enhancements

  • [Build] The build system now searches for libraries in paths specified by the NIXL_PREFIX environment variable, making it easier to link against custom builds. (#998)

Bugfixes

  • [Core] Metadata exchanges over sockets have been improved with better error handling. (#999)
  • [Core] The communication queue used for metadata exchange over sockets is now fully flushed before the listener thread is stopped, preventing potential data loss during shutdown. (#830)
  • [Libfabric] Fixed multiple issues with metadata handling for partial loads and offset calculations that could lead to data corruption. (#969, #978)
  • [Bindings] Resolved a crash in Python examples that occurred when the NIXL_PLUGIN_DIR environment variable was not set. (#963)
  • [Rust] Fixed an issue that prevented Rust stubs from building correctly. (#1001)
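
The flush-before-stop pattern mentioned for the metadata listener can be sketched in general terms as follows. This is not NIXL's implementation: the listener function, the sentinel, and the message names are hypothetical, and a plain in-process queue stands in for the socket communication queue.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the listener to exit


def listener(inbox, handled):
    """Process items until the stop sentinel is reached."""
    while True:
        item = inbox.get()
        try:
            if item is _STOP:
                break
            handled.append(item)
        finally:
            inbox.task_done()


inbox = queue.Queue()
handled = []
worker = threading.Thread(target=listener, args=(inbox, handled))
worker.start()

for i in range(100):
    inbox.put(f"metadata-{i}")

# Shutdown: the sentinel is queued behind all pending work (FIFO order),
# so every earlier item is handled before the listener exits.
inbox.put(_STOP)
inbox.join()    # wait until every queued item has been processed
worker.join()   # then wait for the thread itself to finish
```

Stopping the thread with a flag checked between items, without draining the queue first, is the kind of shutdown that can drop pending messages.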

Benchmarks & Test Infrastructure

  • [nixlbench] Fixed a memory management bug related to cudaFree. (#965)
  • [nixlbench] The tool will now correctly exit with a failure code if an I/O vector consistency check fails. (#992)
  • [CI] Switched CI jobs from PyTorch-based images to cuda-dl-base images for better unification and consistency. (#924)

Known Issues

  • [GPUNETIO] The GPUNETIO plugin is not available in CUDA 13 environments. This will be addressed in a future release.

Full Changelog: 0.7.1...0.8.0

0.7.1

06 Nov 19:06
97c9b5b

Summary

NVIDIA® NIXL Release 0.7.1 is a maintenance release that introduces Python packaging support for CUDA 13.0, improves the interface of the Device API, and resolves critical issues in the Libfabric backend.

  • CUDA 12 and CUDA 13 Python Wheels:
    This release provides officially supported CUDA 13 Python wheels, splitting the packaging into a nixl meta-package and the platform-specific packages nixl-cu12 and nixl-cu13. More information below under Python packaging changes.
  • Device API: Single-thread Support for Multiple Queue Pairs:
    Single-threaded applications can now drive multiple QPs from a single thread, for higher performance without the overhead of creating multiple agents or multiple threads.
  • Device API: Improved Asynchronous API Handling:
    The Device API plugin's post functions were changed to return NIXL_IN_PROG to signal that an operation has been submitted but is not yet complete, improving the predictability and performance of asynchronous calls.

Python packaging changes

NIXL is now packaged using a nixl PyPI meta-package and CUDA platform-specific wheels for CUDA 12 (nixl-cu12) and CUDA 13 (nixl-cu13). (#915, #954, #956, #966)

PyPI users

The desired wheel can now be installed from PyPI with pip install nixl[cu12] or pip install nixl[cu13].

For backwards compatibility, pip install nixl automatically installs nixl[cu12], continuing to work seamlessly for CUDA 12 users without requiring changes to downstream project dependencies.

CUDA 13 users must use pip install nixl[cu13] to install NIXL.

If both nixl-cu12 and nixl-cu13 are installed at the same time in an environment, nixl-cu13 takes precedence.
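
To check which platform-specific wheels are present in an environment, a minimal sketch using the standard library's importlib.metadata could look like this. The distribution names come from the notes above; the helper functions are hypothetical, and active_wheel only mirrors the stated precedence rule rather than querying NIXL itself.

```python
from importlib import metadata

# Distribution names from the release notes, listed in precedence order.
CANDIDATES = ("nixl-cu13", "nixl-cu12")


def installed_nixl_wheels():
    """Return {distribution name: version} for the wheels that are installed."""
    found = {}
    for name in CANDIDATES:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found


def active_wheel(found):
    """Pick the wheel that takes precedence, or None if neither is installed."""
    for name in CANDIDATES:
        if name in found:
            return name
    return None


wheels = installed_nixl_wheels()
print(wheels, active_wheel(wheels))
```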

Python installation from source

Pip installations from source code through pip install . now require additional steps to build and install the nixl meta-package:

On CUDA 12:

pip install .
pip install meson meson-python pybind11 tomlkit
meson setup build
ninja -C build
pip install build/src/bindings/python/nixl-meta/nixl-*-py3-none-any.whl

On CUDA 13:

pip install .
pip install meson meson-python pybind11 tomlkit
./contrib/tomlutil.py --wheel-name nixl-cu13 pyproject.toml
meson setup build
ninja -C build
pip install build/src/bindings/python/nixl-meta/nixl-*-py3-none-any.whl

See also this for a full example in docker.

API changes

  • [Device API] Asynchronous post functions now return NIXL_IN_PROG to signal that an operation is in flight, providing a clearer status for non-blocking calls (#911).
  • [Device API] Added support for worker_id selection in the backend, allowing for more granular performance tuning with multiple QPs driven by a single thread (#938).
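
The calling pattern implied by an in-progress status can be sketched as below. This is a host-side Python illustration only, not the Device API: Status, FakeRequest, and post_and_wait are hypothetical stand-ins, with Status.IN_PROG playing the role of NIXL_IN_PROG.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "SUCCESS"
    IN_PROG = "IN_PROG"  # stand-in for NIXL_IN_PROG


class FakeRequest:
    """Hypothetical request that stays in progress for a fixed number of checks."""

    def __init__(self, checks_in_progress):
        self._remaining = checks_in_progress

    def post(self):
        # Posting only submits the operation; it does not wait for completion.
        return Status.IN_PROG

    def check(self):
        if self._remaining > 0:
            self._remaining -= 1
            return Status.IN_PROG
        return Status.SUCCESS


def post_and_wait(request, max_polls=1000):
    """Post a request, then poll until it leaves the in-progress state."""
    status = request.post()
    polls = 0
    while status is Status.IN_PROG:
        if polls >= max_polls:
            raise TimeoutError("request did not complete in time")
        status = request.check()
        polls += 1
    return status, polls


status, polls = post_and_wait(FakeRequest(checks_in_progress=3))
```

Treating the in-progress status as a normal result, rather than as an error, lets the caller overlap other work between status checks.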

Bugfixes

  • [Libfabric] Corrected an issue with offset calculations that could cause data corruption in certain transfer scenarios (#883).
  • [Libfabric] Fixed a bug in handling asymmetrical rail configurations on heterogeneous nodes (#908).
  • [Libfabric] Addressed issues with metadata handling for partial and multi-load transfers (#976, #986).
  • [Bindings] Fixed an error in the Python API when NIXL_LOG_LEVEL was set to TRACE (#890).
  • [Plugins] Reduced logging noise by changing messages for plugins with missing external dependencies from ERROR to INFO (#967).

Known issues

  • [GPUNETIO] The GPUNETIO plugin is not available in CUDA 13 environments. This will be addressed in a future release.

Benchmarks

  • nixlbench now allows compilation even if the etcd development libraries are not found on the system (#959).
  • The PostXferReq timer in nixlbench was changed to provide more accurate latency measurements (#944).
  • For CUDA 13, the nixlbench PyTorch dependency is now installed from the stable channel (#943).

Full Changelog: 0.7.0...0.7.1

0.7.0

24 Oct 19:05
3deee09

Summary

NVIDIA® NIXL Release 0.7.0 introduces the new GUSLI storage plugin, adds build support for the CUDA 13.0 toolkit, and delivers key improvements to the Libfabric, UCX and GPUNetIO backends.

  • GUSLI Storage Plugin:
    Introduces a new storage backend based on NVIDIA GUSLI for high-performance access to flash storage.
  • CUDA 13 Build Support:
    Enables NIXL and the nixlbench suite to be built from source with CUDA 13.0, ensuring compatibility with the latest drivers and libraries. Official binary packages with CUDA 13.0 support are planned for a future release.

New Features

  • Introduced the GUSLI storage plugin for high-throughput, low-latency I/O (#887).

Improvements

  • [Build] Added support for building NIXL and nixlbench from source with CUDA 13 (#820).
  • [Libfabric] Added support for non-GDR instances on AWS and resolved Python wheel compatibility issues (#901, #937).
  • [Device API] Improved support for GPU-initiated UCX transfers in MoE workloads with easier-to-use APIs and lower latency (#815).

Bugfixes

  • [Libfabric] Addressed several stability issues, including double-free errors in topology initialization, resource cleanup on disconnect, and handling of asymmetrical rail configurations in heterogeneous nodes (#839, #860, #926).
  • [Libfabric] Improved resilience by manually progressing the completion queue when resources are unavailable and adding retry logic for failed operations (#856, #859).
  • [Libfabric] Fixed EFA device discovery to correctly identify device IDs (#876).

Dependencies

  • [GPUNetIO] Upgraded the GPUNetIO backend to the DOCA 3.1 Verbs library (#733).

Bindings

  • The Python wheel no longer bundles libfabric, libefa, and libhwloc, because these libraries are available in many AWS base images and bundling them risks version mismatches (#937).
  • Added a new Python example for remote storage operations (#841).

Benchmarks

  • Added nixlbench support for the new GUSLI backend (#897, #929).
  • Improved nixlbench flexibility by making etcd optional for storage backends and fixing logic for key collisions (#862, #878).
  • Fixed an API parameter mismatch in the nixlbench POSIX benchmark (#880).

Build and Test Infrastructure

  • Expanded CI coverage by enabling tests on DGX systems and adding GPU-specific tests for nixlbench (#834, #780).
  • Resolved a package installation failure for libibverbs-dev on the CUDA 13 base container image (#889).
  • Improved the build system to avoid building gtest when CUDA or UCX dependencies are not found (#925).
  • Added a backend selection option for easier debugging of different transfer backends (#822).
  • Refined release build packaging to correctly manage the inclusion of tests and examples (#872, #896).

Full Changelog: 0.6.1...0.7.0