Date: Mon, 13 Apr 2009 15:00:40 -0400
Message-ID: <7iskkcgyzr.wl%gnn@neville-neil.com>
From: gnn@freebsd.org
To: Zachary Loafman
Cc: freebsd-arch@freebsd.org
Subject: Re: splice() in FreeBSD
In-Reply-To: <20090409171613.GC9442@isilon.com>
References: <20090409171613.GC9442@isilon.com>
List-Id: Discussion related to FreeBSD architecture

At Thu, 9 Apr 2009 10:16:13 -0700,
Zachary Loafman wrote:
>
> Arch -
>
> Isilon has internally been using the FreeBSD sendfile() (with
> modifications) and our own recvfile() in order to accomplish zero-copy
> read/write for the userland portions of our stack (CIFS,
> NDMP). However, these interfaces are limited. In particular,
> sendfile/recvfile prevent any other thread from dealing with the same
> socket until the call is complete. That's somewhat silly - it would be
> nicer to split the read-from-file/write-to-file portion from the
> read-from-socket/write-to-socket portion. That also eases some of the
> decisions that only the layer above can really make - for example, in
> the sendfile() case, you don't really know if it's appropriate to send
> a partial read or whether the caller really needs all the data.
>
> What we'd like is something like splice(). The Linux splice interface
> is documented here: http://linux.die.net/man/2/splice and the
> internals are discussed here: http://kerneltrap.org/node/6505 . We
> don't need the sillier portions of it - Isilon could care less about
> vmsplice()/tee(). We need the ability to shuffle data from one source
> to one sink, and then to turn around later and use that sink as a
> source. At first, I found the splice() interface a bit of an
> abomination, but a pipe is a somewhat natural place to act as a data
> staging area. If we just implemented splice alone, this wouldn't
> require any real VM hackery - you can imagine just shuffling mbufs
> through the pipe to accomplish a limited form of this (or, say, a unix
> domain socket).
>
> As part of this, and in order to get something upstreamable, it seems
> like we would need a few things:
>
> *) Agreement on syscall APIs - My initial proposal is to adopt splice
> verbatim. Initially the interface may not be truly zero-copy for many
> cases, but it's a start. It also increases portability for any Linux
> apps that are trying to make use of it.
>
> *) Unification of uio and mbufs somehow? Isilon currently has private
> patches that add *_MBUF variants for I/O VOPs (e.g. we have a
> VOP_READ_MBUF in addition to the standard VOP_READ). Isilon is in a
> somewhat unique place here - I'm not sure a general file system can
> handle this as easily. At the top-half, our system in many ways acts a
> lot like a router, so we can handle things like VOP_READ_MBUF by
> taking file data off our back-end (which comes in as mbufs off IB),
> header splitting, then just slinging the mbufs out the
> front-end. However, I think our *_MBUF VOP variants are actually a
> little gross. I would rather figure out a way to unify the uio and
> mbuf APIs - they're both scatter/gather lists in their own special
> way, then call into a single VOP.
>
> Isilon can get a limited, non-upstreamable thing working fairly
> quickly - we can use a unix domain socket as the intermediate buffer
> and use our existing *_MBUF VOPs. But it would be nice if we had some
> consensus going forward, then we can internally march towards
> something we can upstream.
>

I like the idea, though I don't know if I like the name "splice"
because to me it's a bit confusing, but we're probably stuck with the
name since it's already in use.

If/when you have patches send them along next.

Best,
George
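
For readers who have not used the Linux interface, the staging pattern
described in the quoted proposal (source into a pipe, then pipe into a
sink) can be sketched roughly as below. This is only an illustration
under stated assumptions, not the API proposed for FreeBSD: it uses
Python's os.splice(), which wraps Linux splice(2) and exists only on
Linux with Python 3.10 or later, and it falls back to an ordinary
copying read/write elsewhere. The names copy_via_pipe, _move, and CHUNK
are hypothetical.

```python
import os

# Hypothetical chunk size: one 4 KiB page, kept small so that a staged
# chunk always fits in the pipe buffer.
CHUNK = 4096


def _move(src_fd, dst_fd, count):
    """Move up to `count` bytes from src_fd to dst_fd; one end is a pipe."""
    if hasattr(os, "splice"):  # Linux with Python 3.10+
        return os.splice(src_fd, dst_fd, count)
    # Fallback where splice() is unavailable: a plain copying read/write.
    data = os.read(src_fd, count)
    view = memoryview(data)
    while view:
        view = view[os.write(dst_fd, view):]
    return len(data)


def copy_via_pipe(src_fd, dst_fd, length):
    """Copy up to `length` bytes, staging each chunk in a pipe."""
    pipe_r, pipe_w = os.pipe()
    copied = 0
    try:
        while copied < length:
            # Stage: source -> pipe (the read-from-file half).
            staged = _move(src_fd, pipe_w, min(CHUNK, length - copied))
            if staged == 0:  # source reached EOF early
                break
            # Drain: pipe -> sink (the write-to-socket half).
            while staged > 0:
                sent = _move(pipe_r, dst_fd, staged)
                staged -= sent
                copied += sent
    finally:
        os.close(pipe_r)
        os.close(pipe_w)
    return copied
```

The sink can equally be a connected socket's file descriptor, which is
the file-to-socket case discussed above. Because the two halves are
separate calls, a caller could stage data in the pipe and decide later
when, or whether, to drain it.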