Date:      Mon, 13 Apr 2009 15:00:40 -0400
From:      gnn@freebsd.org
To:        Zachary Loafman <zachary.loafman@isilon.com>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: splice() in FreeBSD
Message-ID:  <7iskkcgyzr.wl%gnn@neville-neil.com>
In-Reply-To: <20090409171613.GC9442@isilon.com>
References:  <20090409171613.GC9442@isilon.com>

At Thu, 9 Apr 2009 10:16:13 -0700,
Zachary Loafman wrote:
> 
> Arch -
> 
> Isilon has internally been using the FreeBSD sendfile() (with
> modifications) and our own recvfile() in order to accomplish zero-copy
> read/write for the userland portions of our stack (CIFS,
> NDMP). However, these interfaces are limited. In particular,
> sendfile/recvfile prevent any other thread from dealing with the same
> socket until the call is complete. That's somewhat silly - it would be
> nicer to split the read-from-file/write-to-file portion from the
> read-from-socket/write-to-socket portion. That also eases some of the
> decisions that only the layer above can really make - for example, in
> the sendfile() case, you don't really know if it's appropriate to send
> a partial read or whether the caller really needs all the data.
> 
> What we'd like is something like splice(). The Linux splice interface
> is documented here: http://linux.die.net/man/2/splice and the
> internals are discussed here: http://kerneltrap.org/node/6505 . We
> don't need the sillier portions of it - Isilon couldn't care less about
> vmsplice()/tee(). We need the ability to shuffle data from one source
> to one sink, and then to turn around later and use that sink as a
> source. At first, I found the splice() interface a bit of an
> abomination, but a pipe is a somewhat natural place to act as a data
> staging area. If we just implemented splice alone, this wouldn't
> require any real VM hackery - you can imagine just shuffling mbufs
> through the pipe to accomplish a limited form of this (or, say, a unix
> domain socket).
> 
> As part of this, and in order to get something upstreamable, it seems
> like we would need a few things:
> 
> *) Agreement on syscall APIs - My initial proposal is to adopt splice
> verbatim. Initially the interface may not be truly zero-copy for many
> cases, but it's a start. It also increases portability for any Linux
> apps that are trying to make use of it.
> 
> *) Unification of uio and mbufs somehow? Isilon currently has private
> patches that add *_MBUF variants for I/O VOPs (e.g. we have a
> VOP_READ_MBUF in addition to the standard VOP_READ). Isilon is in a
> somewhat unique place here - I'm not sure a general file system can
> handle this as easily. At the top-half, our system in many ways acts a
> lot like a router, so we can handle things like VOP_READ_MBUF by
> taking file data off our back-end (which comes in as mbufs off IB),
> header splitting, then just slinging the mbufs out the
> front-end. However, I think our *_MBUF VOP variants are actually a
> little gross. I would rather figure out a way to unify the uio and
> mbuf APIs - they're both scatter/gather lists in their own special
> way, then call into a single VOP.
> 
> Isilon can get a limited, non-upstreamable thing working fairly
> quickly - we can use a unix domain socket as the intermediate buffer
> and use our existing *_MBUF VOPs. But it would be nice if we had some
> consensus going forward, then we can internally march towards
> something we can upstream.
> 

I like the idea.  I find the name "splice" a bit confusing, but we're
probably stuck with it since it's already in use.  If/when you have
patches, send them along.

Best,
George


