IO data corruption in NVMe/TCP **TCP zero-copy** mode during listener shutdown due to premature I/O buffer reuse #3760

@KKRainbow

Description

Sighting report

In NVMe/TCP with TCP zero-copy enabled, stopping a listener can cause 8-byte I/O data corruption on the initiator: because requests are abruptly aborted during qpair teardown, I/O buffers are reused (and SPDK-internal pointers written into them) while the kernel still references them for in-flight zero-copy sends.

Expected Behavior

When an NVMe/TCP listener is stopping, all in-flight I/O requests—especially those already submitted to the kernel via TCP zero-copy—should be safely completed or drained before their associated I/O buffers are released or reused. No data corruption should occur on the initiator side.

Note: This issue is specific to TCP zero-copy (i.e., spdk_sock zero-copy send/receive), not bdev-level zero-copy. The two mechanisms are independent in SPDK.

Current Behavior

During shutdown of an NVMe/TCP listener with TCP zero-copy enabled, there is a race condition where I/O buffers are prematurely freed and overwritten with internal SPDK metadata (e.g., a pointer address) while the kernel is still using them for outstanding TCP sends. This results in the initiator receiving corrupted data—specifically, 8-byte segments overwritten with SPDK-internal pointer values (e.g., 0x7f...xxxx).

This issue only occurs when TCP zero-copy is enabled. In non-TCP-zero-copy mode, data is copied into kernel buffers before the SPDK I/O buffer is released, so buffer reuse is safe.

Root Cause & Call Stacks

The problem arises from two concurrent paths during qpair destruction:

1. TCP requests are abruptly aborted while still in-flight in the kernel

During listener stop, the qpair cleanup path calls spdk_sock_abort_requests(), which aborts all pending socket requests—including those already submitted to the kernel via zero-copy send:

nvmf_qpair_request_cleanup (only waits for outstanding bdev I/Os to complete)
└─ state_cb (_nvmf_qpair_destroy)
└─ spdk_nvmf_poll_group_remove
└─ nvmf_transport_poll_group_remove
└─ nvmf_tcp_poll_group_remove
└─ spdk_sock_group_remove_sock
└─ posix_sock_group_impl_remove_sock
└─ spdk_sock_abort_requests // ← aborts pending requests that were already sent to the kernel

At this point the I/O buffers are still referenced by the kernel (due to TCP zero-copy), but the TCP socket layer has canceled these requests and invoked the upper-layer callbacks with an ECANCELED error code.
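For reference, here is a minimal standalone sketch of the underlying hazard (plain Linux sockets, not SPDK code; it assumes a connected TCP socket with SO_ZEROCOPY already enabled, Linux ≥ 4.14): after a MSG_ZEROCOPY send, the kernel keeps referencing the user buffer until it reports completion on the socket error queue, so reusing the buffer after the abort callback can still change what goes on the wire.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Not SPDK code: a minimal illustration of the MSG_ZEROCOPY reuse hazard. */
static void zerocopy_reuse_hazard(int fd, char *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

	/* The kernel pins 'buf' and may transmit from it until it reports
	 * completion on the socket error queue (SO_EE_ORIGIN_ZEROCOPY). */
	if (sendmsg(fd, &msg, MSG_ZEROCOPY) < 0) {
		return;
	}

	/*
	 * UNSAFE: reusing 'buf' before the zero-copy completion arrives.
	 * Overwriting its first 8 bytes here (as the buffer-recycle path in
	 * the next call stack effectively does) can put the new contents on
	 * the wire, even if the socket is closed shortly afterwards.
	 */
	memset(buf, 0, sizeof(void *));
}
```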

2. I/O buffers are immediately recycled and overwritten with metadata

Shortly after, during qpair destruction, the same (now-aborted) requests are cleaned up and their buffers returned to the poll group cache—with a dirty pointer written into the buffer header:

_nvmf_tcp_qpair_destroy
└─ nvmf_tcp_cleanup_all_states
└─ nvmf_tcp_drain_state_queue (state=TCP_REQUEST_STATE_TRANSFERRING_CONTROLLER_TO_HOST)
└─ nvmf_tcp_request_free
└─ nvmf_tcp_req_process
└─ spdk_nvmf_request_free_buffers
└─ TAILQ_INSERT_HEAD(&group->buf_cache, (struct spdk_nvmf_transport_pg_cache_buf *)req->buffers[i], link)
// ← Writes a TAILQ link (pointer) into the first 8 bytes of the I/O buffer

Because the buffer is still mapped in a zero-copy send, the kernel may transmit this corrupted 8-byte header to the initiator.
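To make the 8-byte overwrite concrete, here is a simplified sketch, with illustrative names rather than the actual SPDK definitions, of how inserting the raw data buffer into a per-poll-group free list writes a link pointer over its first bytes:

```c
#include <sys/queue.h>

/*
 * Simplified, self-contained sketch; the real SPDK structures differ.
 * The poll-group cache treats the raw data buffer itself as a list element,
 * so the list link lands in the first bytes of whatever data the buffer held.
 */
struct pg_cache_buf {
	STAILQ_ENTRY(pg_cache_buf) link;   /* an 8-byte pointer at offset 0 */
};

STAILQ_HEAD(buf_cache_head, pg_cache_buf);

static void recycle_io_buffer(struct buf_cache_head *cache, void *io_buf)
{
	/*
	 * Reinterpret the I/O data buffer as a cache entry and insert it.
	 * This writes the 'link' pointer into io_buf[0..7]; a TAILQ entry
	 * would likewise place its link pointers at the start of the buffer.
	 * If the kernel still references this buffer for a pending zero-copy
	 * send, those bytes can reach the initiator as corrupted data.
	 */
	STAILQ_INSERT_HEAD(cache, (struct pg_cache_buf *)io_buf, link);
}
```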

Although close() is called on the TCP socket before freeing I/O buffers, it does not guarantee that zero-copy buffers already submitted to the kernel won’t be transmitted afterward.

Additional Evidence

During the failure window, the following message appears just before the corrupted I/O returns:

"The recv state of tqpair=%p is same with the state(%d) to be set"

This confirms the TCP request was already in TRANSFERRING_CONTROLLER_TO_HOST state and was aborted mid-transfer.

Possible Solution

The core issue is that nvmf_qpair_request_cleanup() only checks for bdev-layer outstanding I/Os, but ignores in-flight TCP zero-copy sends that have already left the bdev layer.

Proposed fix:

  • Introduce a transport-level "drain" phase that ensures no zero-copy buffers are still referenced by the kernel before spdk_nvmf_request_free_buffers is allowed to run (a hypothetical sketch follows below).
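One possible shape for such a drain phase, sketched with purely hypothetical names (this is not an existing SPDK API): the transport counts zero-copy sends handed to the kernel per qpair and gates the buffer-freeing step of qpair destruction on that count reaching zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical drain gate; none of these names exist in SPDK today. */
struct tcp_qpair_drain {
	uint32_t zcopy_pending;   /* sends still referenced by the kernel */
	bool destroying;          /* qpair/listener teardown in progress  */
};

/* Called when the kernel acknowledges a zero-copy send completion.
 * Returns true when a deferred destroy may now free the I/O buffers. */
static bool qpair_zcopy_complete(struct tcp_qpair_drain *q)
{
	assert(q->zcopy_pending > 0);
	q->zcopy_pending--;
	return q->destroying && q->zcopy_pending == 0;
}

/* Called from the qpair-destroy path before spdk_nvmf_request_free_buffers().
 * Returns true if it is already safe to free buffers; otherwise destruction
 * must wait until qpair_zcopy_complete() reports that the drain finished. */
static bool qpair_try_drain(struct tcp_qpair_drain *q)
{
	q->destroying = true;
	return q->zcopy_pending == 0;
}
```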

Steps to Reproduce

  1. Configure an SPDK NVMe/TCP target with TCP zero-copy enabled (e.g., uring sock_impl + zerocopy_send=true).
  2. Start an initiator (nvme connect) and issue continuous read I/O (to trigger controller-to-host data path).
  3. While I/O is active, stop the NVMf listener (e.g., via the nvmf_subsystem_remove_listener RPC).
  4. Observe that a small fraction of I/O responses contain 8-byte corruption matching SPDK pointer values.
  5. Correlate with log message: "The recv state of tqpair=%p is same with the state(%d) to be set" appearing just before corruption.

Context (Environment including OS version, SPDK version, etc.)

  • SPDK version: master
  • OS: [e.g., Ubuntu 22.04, Linux kernel ≥ 5.10 with SO_ZEROCOPY support]
  • Transport: NVMe/TCP with TCP zero-copy enabled (not bdev zero-copy)
  • Sock implementation: tcp with enable_zerocopy_send_server=true
  • Application: Internal chunkd service using SPDK’s NVMf target library
  • Reproducibility: Low probability per I/O, but consistent under load during listener teardown
