The splice() weekly news

[Posted April 24, 2006 by corbet]

Jens Axboe sent around a note on the status of splice(). He notes that the splice() and tee() interfaces - on both the user and kernel side - should be stable now, with no further changes anticipated. The sendfile() system call has been reworked to use the splice() machinery, though that process will not be complete until after the 2.6.18 kernel cycle opens.

While splice() might be stable, things are still happening. In particular, Jens has added yet another system call:

    long vmsplice(int fd, void *buffer, size_t len, unsigned int flags);

While the regular splice() call will connect a pipe to a file, this call, instead, is designed to feed user-space memory directly into a pipe. So the memory range of len bytes starting at buffer will be pushed into the pipe represented by fd. The flags argument is not currently used.

Using vmsplice(), an application which generates data in a memory buffer can send that data on to its eventual destination in a zero-copy manner. With a suitably-sized buffer, the application can do easy double-buffering; half of the buffer can be under I/O with vmsplice() while the other half is being filled. If the buffer is big enough, the application need only call vmsplice() each time half of the buffer has been filled, and the rest will simply work with no need for multiple threads or complicated synchronization mechanisms.

Getting the buffer size right is important, however. If the buffer is at least twice as large as the maximum number of pages that the kernel will load into a pipe at any given time, a successful vmsplice() of half of the buffer can be safely interpreted by the application as meaning that the other half of the buffer is no longer under I/O. Since half of the buffer will completely fill the space available within a kernel pipe, that half can only be inserted when all other data has been consumed out of the pipe - in simple situations, anyway. So, after vmsplice() succeeds, the application can safely refill the second half with new data. If the application gets confused, however, it could find itself overwriting data which has not yet been consumed by the kernel.
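The double-buffering scheme described here might be sketched as follows. This is only a sketch, not code from Jens's patch: it uses the iovec-based vmsplice() prototype that went into 2.6.17-rc3 (as exposed by glibc) rather than the prototype shown earlier, and the names vmsplice_all(), produce(), and fill_pattern() are hypothetical. It assumes the caller supplies a page-aligned buffer, each half of which is as large as the pipe's capacity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* vmsplice() */
#include <string.h>
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>

/* Hypothetical callback type: fill len bytes at dst for a given round. */
typedef void (*fill_fn)(char *dst, size_t len, unsigned round);

/* Hypothetical example producer: round 0 writes 'A', round 1 'B', ... */
static void fill_pattern(char *dst, size_t len, unsigned round)
{
    memset(dst, 'A' + (int)(round % 26), len);
}

/* Splice all of [base, base + len) into the pipe, looping on short
 * transfers. Returns 0 on success, -1 on error. */
static int vmsplice_all(int pipe_wr, char *base, size_t len)
{
    while (len > 0) {
        struct iovec iov = { .iov_base = base, .iov_len = len };
        ssize_t n = vmsplice(pipe_wr, &iov, 1, 0);
        if (n <= 0)
            return -1;
        base += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Alternate between the two halves of buf: fill one half, then hand it
 * to the pipe. Assumption: each half equals the pipe's capacity, so the
 * splice of one half can only finish once the other half has drained. */
static int produce(int pipe_wr, char *buf, size_t buf_len,
                   unsigned rounds, fill_fn fill)
{
    size_t half = buf_len / 2;

    assert(buf_len % 2 == 0);
    for (unsigned r = 0; r < rounds; r++) {
        char *cur = buf + (r % 2) * half;

        fill(cur, half, r);
        if (vmsplice_all(pipe_wr, cur, half) < 0)
            return -1;
    }
    return 0;
}
```

A caller would pass the write end of a pipe whose read end is being drained by some consumer, for example produce(fd, buf, buf_len, rounds, fill_pattern). The reuse of each half is only safe in the simple single-pipe case; if the data travels onward through other kernel buffers, a successful return says nothing about whether those pages are still referenced.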

Jens's patch adds a couple of fcntl() operations intended to help in this regard. The F_GETPSZ operation will return the maximum number of pages which can be inserted into a pipe buffer, which is also the maximum number of pages which can be under I/O from a vmsplice() operation. There is also F_SETPSZ for changing the maximum size, though that operation just returns EINVAL for now.

Linus, however, worries that this information is not enough to know that a given page is no longer under I/O. In situations where there are other buffers in the kernel - perhaps just another pipe in series - the kernel could still have references to a page even after that page has been consumed out of the original pipe. Networking adds some challenges of its own: if a page has been vmsplice()ed to a TCP socket, it will not be reusable until the remote host has acknowledged the receipt of the data contained within that page. That acknowledgment will arrive long after the page has been consumed out of the pipe buffer.
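For readers who want to experiment with sizing such a buffer: F_GETPSZ is not available in mainline kernels, but much later kernels (2.6.35 and newer) offer a different fcntl() operation, F_GETPIPE_SZ, which reports a pipe's capacity in bytes rather than pages. The sketch below uses that later interface, which is not the one in the patch discussed here; the helper name double_buffer_size() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* fcntl(), F_GETPIPE_SZ */
#include <unistd.h>

/* Hypothetical helper: return a buffer size (in bytes) twice the
 * capacity of the given pipe, or -1 if the capacity cannot be read.
 * F_GETPIPE_SZ is a later interface (Linux 2.6.35+), not F_GETPSZ. */
static long double_buffer_size(int pipe_fd)
{
    int cap = fcntl(pipe_fd, F_GETPIPE_SZ);

    if (cap < 0)
        return -1;
    assert(cap > 0);    /* a pipe always holds at least one page */
    return 2L * cap;
}
```

Knowing the capacity does nothing to address the concern raised above: it only tells the application when a page has left this particular pipe, not when the kernel has dropped its last reference to it.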

What this all means is that the vmsplice() interface probably needs a bit more work. In particular, there may need to be yet another system call which will allow an application to know that the kernel is done with a specific page. The current vmsplice() implementation is also unable to connect an incoming pipe to user-space memory. Making the read side work is a rather more complicated affair, and may not happen anytime in the near future.

Index entries for this article
Kernel	splice()
Kernel	vmsplice()


The splice() weekly news

Posted Apr 27, 2006 3:08 UTC (Thu) by dang (guest, #310) [Link] (2 responses)

Jens's work on splice also led to an interesting bit of benchmarking involving Nick Piggin's lockless page cache patch. Check your favorite threaded version of LKML for today around 9:50 EST. More cool stuff.

The splice() weekly news

Posted Apr 27, 2006 8:22 UTC (Thu) by cloose (guest, #5066) [Link] (1 responses)

To save others from searching:
http://marc.theaimsgroup.com/?l=linux-mm&m=1146059631...

The splice() weekly news

Posted Apr 27, 2006 14:11 UTC (Thu) by jzbiciak (guest, #5246) [Link]

Very interesting! Granted, it's a microbenchmark, but it still shows a place where Linux wasn't scaling as well as one would like. (In this case, it was a pretty dramatic fall off.)

2.6.17-rc3 vmsplice differs

Posted Apr 27, 2006 11:37 UTC (Thu) by axboe (subscriber, #904) [Link]

Some corrections to this article, which are excusable since it was probably done prior to 2.6.17-rc3 being released with vmsplice included:

The syscall actually looks like this now:

    long vmsplice(int fd, const struct iovec *iov, unsigned long nr_segs,
                  unsigned int flags);

So you can pass in several chunks and get them spliced into the pipe in one go. The fcntl() bits are removed for now, as the article mentions it isn't completely clear how we'll handle the reuse case yet.
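As a rough illustration of the vector form, here is a minimal sketch that pushes two separate chunks of user memory into a pipe with a single call. It assumes the glibc wrapper for vmsplice() (declared in <fcntl.h> when _GNU_SOURCE is defined); the helper name splice_two_chunks() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* vmsplice() */
#include <string.h>
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>

/* Hypothetical helper: splice two NUL-terminated strings into the
 * write end of a pipe in one go. Returns the number of bytes spliced,
 * or -1 on error. */
static ssize_t splice_two_chunks(int pipe_wr, const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    struct iovec iov[2] = {
        { .iov_base = (void *)a, .iov_len = strlen(a) },
        { .iov_base = (void *)b, .iov_len = strlen(b) },
    };

    /* No flags are passed; both segments go in with one system call. */
    return vmsplice(pipe_wr, iov, 2, 0);
}
```

The memory behind both chunks has to stay untouched until the data has been consumed, which is exactly the reuse question that remains open.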

How does this differ from aio_write?

Posted Apr 27, 2006 15:32 UTC (Thu) by kingdon (guest, #4526) [Link] (3 responses)

Seems like there is a more familiar interface for saying "write this data at some point, and let me know when you are done so I can reuse (or de-mmap) this memory", namely aio_write (or lio_listio where there are several noncontiguous blocks of memory).

Now, there might be various semantic differences (like whether one has to write entire pages or can write less), but I'm curious whether the two things could/should be separate or unified.

The tricky part...

Posted Apr 27, 2006 16:33 UTC (Thu) by axboe (subscriber, #904) [Link] (2 responses)

isn't so much how to notify reusability, but rather when to determine the safety of doing so. We still need to change a bit of infrastructure for this - eg, get rid of ->sendpage() and actually pass the pipe_inode_info down for network transmit and only have it do the ->release on the buffers when they have been sent out. The issue right now is that the ->release is done as soon as we pass the page to the network stack, which is too soon of course.

What does vmsplice() add?

Posted Apr 28, 2006 2:17 UTC (Fri) by xoddam (subscriber, #2322) [Link] (1 responses)

Determining completion is indeed tricky, especially where remote hosts
are involved. But there are already two good choices for the interface:
write() blocks until the buffer may be reused by the application,
aio_write() instead posts a notification. What does vmsplice() add?

(Changing to a vector operation changes the names of the functions, but
not the nature of the question).

What does vmsplice() add?

Posted Apr 28, 2006 4:21 UTC (Fri) by axboe (subscriber, #904) [Link]

Blocking on the next vmsplice comes automatically, since you can't replace buffers that haven't been ->release'd yet. So that's how it already works. As I said, the missing bit is getting the release right.

proliferation

Posted Apr 27, 2006 20:10 UTC (Thu) by ncm (guest, #165) [Link]

It's disappointing to see this proliferation of system calls, which will now need to be supported forever. Couldn't we have the VM system integrated with I/O and network buffering, similar to NetBSD's UVM, and just use read() and write() with suitably page-aligned and mmapped buffers? Or am I missing something essential?