Description
Sighting report
In NVMe/TCP with TCP zero-copy enabled, stopping a listener can cause 8-byte I/O data corruption on the initiator: because requests are abruptly aborted during qpair teardown, I/O buffers are reused (and SPDK-internal pointers written into them) while the kernel still references them.
Expected Behavior
When an NVMe/TCP listener is stopping, all in-flight I/O requests—especially those already submitted to the kernel via TCP zero-copy—should be safely completed or drained before their associated I/O buffers are released or reused. No data corruption should occur on the initiator side.
Note: This issue is specific to TCP zero-copy (i.e., spdk_sock zero-copy send/receive), not bdev-level zero-copy. The two mechanisms are independent in SPDK.
Current Behavior
During shutdown of an NVMe/TCP listener with TCP zero-copy enabled, there is a race condition where I/O buffers are prematurely freed and overwritten with internal SPDK metadata (e.g., a pointer address) while the kernel is still using them for outstanding TCP sends. This results in the initiator receiving corrupted data—specifically, 8-byte segments overwritten with SPDK-internal pointer values (e.g., 0x7f...xxxx).
This issue only occurs when TCP zero-copy is enabled. In non-TCP-zero-copy mode, data is copied into kernel buffers before the SPDK I/O buffer is released, so buffer reuse is safe.
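For context, this matches the buffer-ownership rules of the underlying Linux zero-copy send path. Below is a minimal sketch of those kernel semantics (plain socket API, not SPDK code; fd is assumed to be a connected TCP socket and buf the payload):

```c
/*
 * Minimal sketch of Linux MSG_ZEROCOPY buffer ownership (plain socket API,
 * not SPDK code). Assumes fd is a connected TCP socket and buf holds the
 * controller-to-host payload.
 */
#include <sys/socket.h>
#include <string.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

static void send_copy_vs_zerocopy(int fd, char *buf, size_t len)
{
	/* Regular send(): the kernel copies buf into its own socket buffers
	 * before returning, so buf can be reused immediately. */
	send(fd, buf, len, 0);
	memset(buf, 0, len);                    /* safe */

	/* Zero-copy send(): the kernel pins the pages behind buf and
	 * transmits directly from them later. buf must not be touched until
	 * the completion notification arrives on the socket error queue. */
	int one = 1;
	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
	send(fd, buf, len, MSG_ZEROCOPY);
	/* Reusing buf at this point is exactly the corruption described above. */
}
```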
Root Cause & Call Stacks
The problem arises from two concurrent paths during qpair destruction:
1. TCP requests are abruptly aborted while still in-flight in the kernel
During listener stop, the qpair cleanup path calls spdk_sock_abort_requests(), which aborts all pending socket requests—including those already submitted to the kernel via zero-copy send:
nvmf_qpair_request_cleanup (only waits for all bdev I/Os to complete)
└─ state_cb (_nvmf_qpair_destroy)
└─ spdk_nvmf_poll_group_remove
└─ nvmf_transport_poll_group_remove
└─ nvmf_tcp_poll_group_remove
└─ spdk_sock_group_remove_sock
└─ posix_sock_group_impl_remove_sock
└─ spdk_sock_abort_requests // ← Aborts pending_req that are already sent to kernel
At this point, the I/O buffers are still referenced by the kernel (due to TCP zero-copy), but the TCP socket layer cancels these requests and invokes the upper-layer callbacks with an ECANCELED error code.
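The kernel drops its reference to a zero-copy buffer only when the matching completion is reported on the socket error queue; aborting the SPDK-level request does not change that. A minimal sketch of how those completions are observed with the plain Linux API (not the SPDK sock abstraction; fd is assumed to be the zero-copy TCP socket):

```c
/*
 * Minimal sketch: drain MSG_ZEROCOPY completion notifications from the
 * socket error queue (plain Linux API, not SPDK). Until a send's sequence
 * number shows up here, the kernel may still read from its buffer.
 */
#include <sys/socket.h>
#include <linux/errqueue.h>
#include <stdio.h>

static void read_zerocopy_completions(int fd)
{
	char ctrl[128];
	struct msghdr msg = { 0 };

	msg.msg_control = ctrl;
	msg.msg_controllen = sizeof(ctrl);

	while (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) >= 0) {
		struct cmsghdr *cm;

		for (cm = CMSG_FIRSTHDR(&msg); cm != NULL; cm = CMSG_NXTHDR(&msg, cm)) {
			struct sock_extended_err *serr =
				(struct sock_extended_err *)CMSG_DATA(cm);

			if (serr->ee_errno == 0 &&
			    serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
				/* Sends ee_info..ee_data completed; only now
				 * are their buffers free for reuse. */
				printf("zero-copy sends %u..%u done\n",
				       serr->ee_info, serr->ee_data);
			}
		}
		msg.msg_controllen = sizeof(ctrl);
	}
}
```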
2. I/O buffers are immediately recycled and overwritten with metadata
Shortly after, during qpair destruction, the same (now-aborted) requests are cleaned up and their buffers returned to the poll group cache, with a TAILQ link pointer written into the first bytes of each buffer:
_nvmf_tcp_qpair_destroy
└─ nvmf_tcp_cleanup_all_states
└─ nvmf_tcp_drain_state_queue (state=TCP_REQUEST_STATE_TRANSFERRING_CONTROLLER_TO_HOST)
└─ nvmf_tcp_request_free
└─ nvmf_tcp_req_process
└─ spdk_nvmf_request_free_buffers
└─ TAILQ_INSERT_HEAD(&group->buf_cache, (struct spdk_nvmf_transport_pg_cache_buf *)req->buffers[i], link)
// ← Writes a TAILQ link (pointer) into the first 8 bytes of the I/O buffer
Because the buffer is still pinned by a zero-copy send, the kernel may transmit these corrupted first 8 bytes to the initiator.
Although close() is called on the TCP socket before freeing I/O buffers, it does not guarantee that zero-copy buffers already submitted to the kernel won’t be transmitted afterward.
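The 8-byte pattern itself matches how TAILQ_INSERT_HEAD works: the macro stores its forward link in the first pointer-sized field of the element, which here is the start of the data buffer. A standalone demonstration using generic sys/queue.h code (not the actual SPDK structures):

```c
/*
 * Standalone demo of the corruption mechanism: linking a raw data buffer
 * into a TAILQ free list overwrites its first 8 bytes (on 64-bit) with a
 * pointer value. Generic sys/queue.h code, not the actual SPDK structures.
 */
#include <sys/queue.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct cache_buf {
	TAILQ_ENTRY(cache_buf) link;        /* link lives at offset 0 */
};

TAILQ_HEAD(buf_cache, cache_buf);

int main(void)
{
	struct buf_cache cache = TAILQ_HEAD_INITIALIZER(cache);
	char *io_buf_a = malloc(4096);
	char *io_buf_b = malloc(4096);
	unsigned long first8;

	if (io_buf_a == NULL || io_buf_b == NULL) {
		return 1;
	}
	memset(io_buf_b, 0xAA, 4096);               /* pretend: real I/O data */

	/* Same pattern as spdk_nvmf_request_free_buffers(): the raw I/O
	 * buffers are cast to cache entries and linked into a free list. */
	TAILQ_INSERT_HEAD(&cache, (struct cache_buf *)io_buf_a, link);
	TAILQ_INSERT_HEAD(&cache, (struct cache_buf *)io_buf_b, link);

	/* io_buf_b's first 8 bytes no longer hold 0xAA data; they now hold
	 * the address of io_buf_a, i.e. pointer-sized "corruption". */
	memcpy(&first8, io_buf_b, sizeof(first8));
	printf("first 8 bytes of io_buf_b: 0x%016lx\n", first8);
	free(io_buf_b);
	free(io_buf_a);
	return 0;
}
```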
Additional Evidence
During the failure window, the following message appears just before the corrupted I/O returns:
"The recv state of tqpair=%p is same with the state(%d) to be set"
This confirms the TCP request was already in TRANSFERRING_CONTROLLER_TO_HOST state and was aborted mid-transfer.
Possible Solution
The core issue is that nvmf_qpair_request_cleanup() only checks for bdev-layer outstanding I/Os, but ignores in-flight TCP zero-copy sends that have already left the bdev layer.
Proposed fixes:
- Introduce a transport-level “drain” phase that ensures all zero-copy buffers are no longer referenced by the kernel before spdk_nvmf_request_free_buffers() is allowed to run (see the sketch below).
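One possible shape for that ordering, purely as an illustration: every name below (the request context, the zcopy_inflight flag, and both helpers) is hypothetical and not existing SPDK API; the sketch only shows that buffers must stay owned by the request until the socket layer reports that the kernel reference is gone.

```c
/*
 * Hypothetical sketch of a transport-level drain phase. None of these
 * types or helpers exist in SPDK today; they only illustrate the required
 * ordering: the buffers of a request go back to the pool only after the
 * kernel has signalled zero-copy completion for that request's send.
 */
#include <stdbool.h>

struct tcp_req_ctx {
	bool zcopy_inflight;    /* set when the zero-copy send was queued */
	/* ... I/O buffers owned by this request ... */
};

/* Invoked by the sock layer once the kernel reports the zero-copy send
 * completed (or failed after the kernel dropped its buffer reference). */
static void tcp_req_zcopy_done(struct tcp_req_ctx *req)
{
	req->zcopy_inflight = false;
}

/* Teardown path: instead of freeing buffers unconditionally (as
 * nvmf_tcp_request_free() effectively does today), defer any request
 * whose zero-copy send is still owned by the kernel. */
static bool tcp_req_try_release_buffers(struct tcp_req_ctx *req)
{
	if (req->zcopy_inflight) {
		/* keep the request on a draining list; a poller retries */
		return false;
	}
	/* now safe to do the equivalent of spdk_nvmf_request_free_buffers() */
	return true;
}
```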
Steps to Reproduce
- Configure an SPDK NVMe/TCP target with TCP zero-copy enabled (e.g., uring sock_impl + zerocopy_send=true).
- Start an initiator (nvme connect) and issue continuous read I/O (to trigger the controller-to-host data path).
- While I/O is active, stop the NVMf listener (e.g., via the RPC nvmf_subsystem_listener_delete).
- Observe that a small fraction of I/O responses contain 8-byte corruption matching SPDK pointer values.
- Correlate with the log message "The recv state of tqpair=%p is same with the state(%d) to be set" appearing just before the corruption.
Context (Environment including OS version, SPDK version, etc.)
- SPDK version: master
- OS: [e.g., Ubuntu 22.04, Linux kernel ≥ 5.10 with SO_ZEROCOPY support]
- Transport: NVMe/TCP with TCP zero-copy enabled (not bdev zero-copy)
- Sock implementation: tcp with enable_zerocopy_send_server=true
- Application: Internal chunkd service using SPDK’s NVMf target library
- Reproducibility: Low probability per I/O, but consistent under load during listener teardown