Date:      Thu, 9 Apr 2009 10:16:13 -0700
From:      Zachary Loafman <zachary.loafman@isilon.com>
To:        freebsd-arch@freebsd.org
Subject:   splice() in FreeBSD
Message-ID:  <20090409171613.GC9442@isilon.com>

Arch -

Isilon has internally been using the FreeBSD sendfile() (with
modifications) and our own recvfile() in order to accomplish zero-copy
read/write for the userland portions of our stack (CIFS,
NDMP). However, these interfaces are limited. In particular,
sendfile/recvfile prevent any other thread from dealing with the same
socket until the call is complete. That's somewhat silly - it would be
nicer to split the read-from-file/write-to-file portion from the
read-from-socket/write-to-socket portion. That also eases some of the
decisions that only the layer above can really make - for example, in
the sendfile() case, you don't really know if it's appropriate to send
a partial read or whether the caller really needs all the data.

What we'd like is something like splice(). The Linux splice interface
is documented here: http://linux.die.net/man/2/splice and the
internals are discussed here: http://kerneltrap.org/node/6505 . We
don't need the sillier portions of it - Isilon couldn't care less about
vmsplice()/tee(). We need the ability to shuffle data from one source
to one sink, and then to turn around later and use that sink as a
source. At first, I found the splice() interface a bit of an
abomination, but a pipe is a somewhat natural place to act as a data
staging area. If we implemented splice() alone, this wouldn't require
any real VM hackery - you can imagine just shuffling mbufs through the
pipe (or, say, a unix domain socket) to accomplish a limited form of
this.
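
For concreteness, here is a minimal userland sketch of the "one source to
one sink, with a pipe as the staging area" pattern, written against the
Linux splice() interface from the man page cited above. It is not a
FreeBSD API. The names relay_via_pipe() and demo_roundtrip() are
hypothetical, and error handling is deliberately thin.

```c
#define _GNU_SOURCE		/* exposes splice() in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to len bytes from in_fd to out_fd, staging
 * the data in a pipe. Returns the number of bytes delivered to out_fd,
 * or -1 on error.
 */
static ssize_t
relay_via_pipe(int in_fd, int out_fd, size_t len)
{
	int p[2];
	int failed = 0;
	ssize_t total = 0;

	assert(in_fd >= 0 && out_fd >= 0);
	if (pipe(p) == -1)
		return (-1);

	while (len > 0 && !failed) {
		/* Stage 1: source -> pipe (the pipe is the sink). */
		ssize_t n = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
		if (n < 0)
			failed = 1;
		if (n <= 0)
			break;		/* error or EOF on the source */
		len -= (size_t)n;

		/* Stage 2: pipe -> sink (the pipe is now the source). */
		while (n > 0) {
			ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)n,
			    SPLICE_F_MOVE);
			if (m <= 0) {
				failed = 1;
				break;
			}
			n -= m;
			total += m;
		}
	}
	close(p[0]);
	close(p[1]);
	return (failed ? -1 : total);
}

/* Hypothetical demo: temp file -> pipe -> unix domain socket. */
static int
demo_roundtrip(void)
{
	static const char msg[] = "hello splice";
	char buf[sizeof(msg)] = { 0 };
	int sv[2];
	FILE *f = tmpfile();

	if (f == NULL || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);
	fwrite(msg, 1, sizeof(msg), f);
	fflush(f);
	lseek(fileno(f), 0, SEEK_SET);	/* splice reads from the fd offset */

	if (relay_via_pipe(fileno(f), sv[0], sizeof(msg)) !=
	    (ssize_t)sizeof(msg))
		return (-1);
	if (read(sv[1], buf, sizeof(buf)) != (ssize_t)sizeof(msg))
		return (-1);
	return (memcmp(buf, msg, sizeof(msg)) == 0 ? 0 : -1);
}
```

Because the two stages are separate calls, the layer above is free to
decide between them whether a short read is acceptable, which is the
decision sendfile() cannot hand back to the caller.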

As part of this, and in order to get something upstreamable, it seems
like we would need a few things:

*) Agreement on syscall APIs - My initial proposal is to adopt splice
verbatim. Initially the interface may not be truly zero-copy for many
cases, but it's a start. It also increases portability for any Linux
apps that are trying to make use of it.

*) Unification of uio and mbufs somehow? Isilon currently has private
patches that add *_MBUF variants for I/O VOPs (e.g. we have a
VOP_READ_MBUF in addition to the standard VOP_READ). Isilon is in a
somewhat unique place here - I'm not sure a general file system can
handle this as easily. At the top-half, our system in many ways acts a
lot like a router, so we can handle things like VOP_READ_MBUF by
taking file data off our back-end (which comes in as mbufs off IB),
header splitting, then just slinging the mbufs out the
front-end. However, I think our *_MBUF VOP variants are actually a
little gross. I would rather figure out a way to unify the uio and
mbuf APIs (they're both scatter/gather lists in their own special
way) and then call into a single VOP.
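
To illustrate what "both are scatter/gather lists" could mean in
practice, here is a purely hypothetical userland sketch: a cursor that
walks either an iovec array (the uio style) or a linked chain of
segments (standing in for an mbuf chain), with one consumer that does
not care which it was given. None of these names (seg_chain, sg_cursor,
sg_next, sg_copyout) exist in FreeBSD. This is not a proposed kernel
interface, just the shape of the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>		/* struct iovec */

/* Hypothetical stand-in for an mbuf chain: linked data segments. */
struct seg_chain {
	struct seg_chain *next;
	void		*data;
	size_t		 len;
};

/* Hypothetical unified cursor over either representation. */
struct sg_cursor {
	enum { SG_IOVEC, SG_CHAIN } kind;
	union {
		struct {
			const struct iovec *iov;
			int cnt;
		} v;
		const struct seg_chain *c;
	} u;
};

/* Yield the next segment; returns 0 once the list is exhausted. */
static int
sg_next(struct sg_cursor *cur, const void **base, size_t *len)
{
	if (cur->kind == SG_IOVEC) {
		if (cur->u.v.cnt == 0)
			return (0);
		*base = cur->u.v.iov->iov_base;
		*len = cur->u.v.iov->iov_len;
		cur->u.v.iov++;
		cur->u.v.cnt--;
		return (1);
	}
	if (cur->u.c == NULL)
		return (0);
	*base = cur->u.c->data;
	*len = cur->u.c->len;
	cur->u.c = cur->u.c->next;
	return (1);
}

/*
 * One consumer for both representations (standing in for "a single VOP").
 * Copies at most cap bytes into dst and returns the number copied; a
 * segment that does not fit is truncated and its remainder is dropped.
 */
static size_t
sg_copyout(struct sg_cursor *cur, char *dst, size_t cap)
{
	const void *base;
	size_t len, off = 0;

	while (off < cap && sg_next(cur, &base, &len)) {
		if (len > cap - off)
			len = cap - off;
		memcpy(dst + off, base, len);
		off += len;
	}
	return (off);
}
```

The hard parts of a real unification (ownership, reference counting,
page-backed versus mbuf-backed storage) are exactly what this sketch
leaves out.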

Isilon can get a limited, non-upstreamable thing working fairly
quickly - we can use a unix domain socket as the intermediate buffer
and use our existing *_MBUF VOPs. But it would be nice to have some
consensus going forward, so that we can internally march towards
something we can upstream.

...Zach

-- 
Zach Loafman | Staff Engineer | Isilon Systems


