Description
Sighting report
In NVMe/TCP with TCP zero-copy enabled, stopping a listener can cause 8-byte I/O data corruption on the initiator: because requests are abruptly aborted during qpair teardown, I/O buffers are reused (and SPDK-internal pointers written into them) while the kernel still references them.
Expected Behavior
When an NVMe/TCP listener is stopping, all in-flight I/O requests—especially those already submitted to the kernel via TCP zero-copy—should be safely completed or drained before their associated I/O buffers are released or reused. No data corruption should occur on the initiator side.
Note: This issue is specific to TCP zero-copy (i.e., spdk_sock zero-copy send/receive), not bdev-level zero-copy. The two mechanisms are independent in SPDK.
Current Behavior
During shutdown of an NVMe/TCP listener with TCP zero-copy enabled, there is a race condition where I/O buffers are prematurely freed and overwritten with internal SPDK metadata (e.g., a pointer address) while the kernel is still using them for outstanding TCP sends. This results in the initiator receiving corrupted data—specifically, 8-byte segments overwritten with SPDK-internal pointer values (e.g., 0x7f...xxxx).
This issue only occurs when TCP zero-copy is enabled. In non-TCP-zero-copy mode, data is copied into kernel buffers before the SPDK I/O buffer is released, so buffer reuse is safe.
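For context, this matches the buffer-ownership rules of the underlying Linux zero-copy send path. Below is a minimal sketch of those kernel semantics (plain socket API, not SPDK code; fd is assumed to be a connected TCP socket and buf the payload):

```c
/*
 * Minimal sketch of Linux MSG_ZEROCOPY buffer ownership (plain socket API,
 * not SPDK code). Assumes fd is a connected TCP socket and buf holds the
 * controller-to-host payload.
 */
#include <sys/socket.h>
#include <string.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

static void send_copy_vs_zerocopy(int fd, char *buf, size_t len)
{
	/* Regular send(): the kernel copies buf into its own socket buffers
	 * before returning, so buf can be reused immediately. */
	send(fd, buf, len, 0);
	memset(buf, 0, len);                    /* safe */

	/* Zero-copy send(): the kernel pins the pages behind buf and
	 * transmits directly from them later. buf must not be touched until
	 * the completion notification arrives on the socket error queue. */
	int one = 1;
	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
	send(fd, buf, len, MSG_ZEROCOPY);
	/* Reusing buf at this point is exactly the corruption described above. */
}
```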
Root Cause & Call Stacks
The problem arises from two concurrent paths during qpair destruction:
1. TCP requests are abruptly aborted while still in-flight in the kernel
During listener stop, the qpair cleanup path calls spdk_sock_abort_requests(), which aborts all pending socket requests—including those already submitted to the kernel via zero-copy send:
nvmf_qpair_request_cleanup (only waits for all bdev I/Os to complete)
└─ state_cb (_nvmf_qpair_destroy)
└─ spdk_nvmf_poll_group_remove
└─ nvmf_transport_poll_group_remove
└─ nvmf_tcp_poll_group_remove
└─ spdk_sock_group_remove_sock
└─ posix_sock_group_impl_remove_sock
└─ spdk_sock_abort_requests // ← Aborts pending_req that are already sent to kernel
At this point, the I/O buffers are still referenced by the kernel (due to TCP zero-copy), but the TCP socket layer cancels these requests and invokes the upper-layer callbacks with an ECANCELED error code.
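The kernel drops its reference to a zero-copy buffer only when the matching completion is reported on the socket error queue; aborting the SPDK-level request does not change that. A minimal sketch of how those completions are observed with the plain Linux API (not the SPDK sock abstraction; fd is assumed to be the zero-copy TCP socket):

```c
/*
 * Minimal sketch: drain MSG_ZEROCOPY completion notifications from the
 * socket error queue (plain Linux API, not SPDK). Until a send's sequence
 * number shows up here, the kernel may still read from its buffer.
 */
#include <sys/socket.h>
#include <linux/errqueue.h>
#include <stdio.h>

static void read_zerocopy_completions(int fd)
{
	char ctrl[128];
	struct msghdr msg = { 0 };

	msg.msg_control = ctrl;
	msg.msg_controllen = sizeof(ctrl);

	while (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) >= 0) {
		struct cmsghdr *cm;

		for (cm = CMSG_FIRSTHDR(&msg); cm != NULL; cm = CMSG_NXTHDR(&msg, cm)) {
			struct sock_extended_err *serr =
				(struct sock_extended_err *)CMSG_DATA(cm);

			if (serr->ee_errno == 0 &&
			    serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
				/* Sends ee_info..ee_data completed; only now
				 * are their buffers free for reuse. */
				printf("zero-copy sends %u..%u done\n",
				       serr->ee_info, serr->ee_data);
			}
		}
		msg.msg_controllen = sizeof(ctrl);
	}
}
```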
2. I/O buffers are immediately recycled and overwritten with metadata
Shortly after, during qpair destruction, the same (now-aborted) requests are cleaned up and their buffers returned to the poll group cache, with a TAILQ link pointer written into the first bytes of each buffer:
_nvmf_tcp_qpair_destroy
└─ nvmf_tcp_cleanup_all_states
└─ nvmf_tcp_drain_state_queue (state=TCP_REQUEST_STATE_TRANSFERRING_CONTROLLER_TO_HOST)
└─ nvmf_tcp_request_free
└─ nvmf_tcp_req_process
└─ spdk_nvmf_request_free_buffers
└─ TAILQ_INSERT_HEAD(&group->buf_cache, (struct spdk_nvmf_transport_pg_cache_buf *)req->buffers[i], link)
// ← Writes a TAILQ link (pointer) into the first 8 bytes of the I/O buffer
Because the buffer is still pinned by a zero-copy send, the kernel may transmit these corrupted first 8 bytes to the initiator.
Although close() is called on the TCP socket before freeing I/O buffers, it does not guarantee that zero-copy buffers already submitted to the kernel won’t be transmitted afterward.
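The 8-byte pattern itself matches how TAILQ_INSERT_HEAD works: the macro stores its forward link in the first pointer-sized field of the element, which here is the start of the data buffer. A standalone demonstration using generic sys/queue.h code (not the actual SPDK structures):

```c
/*
 * Standalone demo of the corruption mechanism: linking a raw data buffer
 * into a TAILQ free list overwrites its first 8 bytes (on 64-bit) with a
 * pointer value. Generic sys/queue.h code, not the actual SPDK structures.
 */
#include <sys/queue.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct cache_buf {
	TAILQ_ENTRY(cache_buf) link;        /* link lives at offset 0 */
};

TAILQ_HEAD(buf_cache, cache_buf);

int main(void)
{
	struct buf_cache cache = TAILQ_HEAD_INITIALIZER(cache);
	char *io_buf_a = malloc(4096);
	char *io_buf_b = malloc(4096);
	unsigned long first8;

	if (io_buf_a == NULL || io_buf_b == NULL) {
		return 1;
	}
	memset(io_buf_b, 0xAA, 4096);               /* pretend: real I/O data */

	/* Same pattern as spdk_nvmf_request_free_buffers(): the raw I/O
	 * buffers are cast to cache entries and linked into a free list. */
	TAILQ_INSERT_HEAD(&cache, (struct cache_buf *)io_buf_a, link);
	TAILQ_INSERT_HEAD(&cache, (struct cache_buf *)io_buf_b, link);

	/* io_buf_b's first 8 bytes no longer hold 0xAA data; they now hold
	 * the address of io_buf_a, i.e. pointer-sized "corruption". */
	memcpy(&first8, io_buf_b, sizeof(first8));
	printf("first 8 bytes of io_buf_b: 0x%016lx\n", first8);
	free(io_buf_b);
	free(io_buf_a);
	return 0;
}
```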
Additional Evidence
During the failure window, the following message appears just before the corrupted I/O returns:
"The recv state of tqpair=%p is same with the state(%d) to be set"
This confirms the TCP request was already in TRANSFERRING_CONTROLLER_TO_HOST state and was aborted mid-transfer.
Possible Solution
The core issue is that nvmf_qpair_request_cleanup() only checks for bdev-layer outstanding I/Os, but ignores in-flight TCP zero-copy sends that have already left the bdev layer.
Proposed fixes:
- Introduce a transport-level “drain” phase that ensures all zero-copy buffers are no longer referenced by the kernel before spdk_nvmf_request_free_buffers() is allowed to run (see the sketch below).
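One possible shape for that ordering, purely as an illustration: every name below (the request context, the zcopy_inflight flag, and both helpers) is hypothetical and not existing SPDK API; the sketch only shows that buffers must stay owned by the request until the socket layer reports that the kernel reference is gone.

```c
/*
 * Hypothetical sketch of a transport-level drain phase. None of these
 * types or helpers exist in SPDK today; they only illustrate the required
 * ordering: the buffers of a request go back to the pool only after the
 * kernel has signalled zero-copy completion for that request's send.
 */
#include <stdbool.h>

struct tcp_req_ctx {
	bool zcopy_inflight;    /* set when the zero-copy send was queued */
	/* ... I/O buffers owned by this request ... */
};

/* Invoked by the sock layer once the kernel reports the zero-copy send
 * completed (or failed after the kernel dropped its buffer reference). */
static void tcp_req_zcopy_done(struct tcp_req_ctx *req)
{
	req->zcopy_inflight = false;
}

/* Teardown path: instead of freeing buffers unconditionally (as
 * nvmf_tcp_request_free() effectively does today), defer any request
 * whose zero-copy send is still owned by the kernel. */
static bool tcp_req_try_release_buffers(struct tcp_req_ctx *req)
{
	if (req->zcopy_inflight) {
		/* keep the request on a draining list; a poller retries */
		return false;
	}
	/* now safe to do the equivalent of spdk_nvmf_request_free_buffers() */
	return true;
}
```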
Steps to Reproduce
- Configure an SPDK NVMe/TCP target with TCP zero-copy enabled (e.g., uring sock_impl + zerocopy_send=true).
- Start an initiator (nvme connect) and issue continuous read I/O (to trigger the controller-to-host data path).
- While I/O is active, stop the NVMf listener (e.g., via the RPC nvmf_subsystem_listener_delete).
- Observe that a small fraction of I/O responses contain 8-byte corruption matching SPDK pointer values.
- Correlate with the log message "The recv state of tqpair=%p is same with the state(%d) to be set" appearing just before the corruption.
Context (Environment including OS version, SPDK version, etc.)
- SPDK version: master
- OS: [e.g., Ubuntu 22.04, Linux kernel ≥ 5.10 with SO_ZEROCOPY support]
- Transport: NVMe/TCP with TCP zero-copy enabled (not bdev zero-copy)
- Sock implementation: tcp with enable_zerocopy_send_server=true
- Application: Internal chunkd service using SPDK’s NVMf target library
- Reproducibility: Low probability per I/O, but consistent under load during listener teardown