From: Zachary Loafman
To: freebsd-arch@freebsd.org
Date: Thu, 9 Apr 2009 10:16:13 -0700
Subject: splice() in FreeBSD
Message-ID: <20090409171613.GC9442@isilon.com>

Arch -

Isilon has internally been using the FreeBSD sendfile() (with
modifications) and our own recvfile() in order to accomplish zero-copy
read/write for the userland portions of our stack (CIFS, NDMP).

However, these interfaces are limited. In particular, sendfile/recvfile
prevent any other thread from dealing with the same socket until the
call is complete. That's somewhat silly - it would be nicer to split the
read-from-file/write-to-file portion from the
read-from-socket/write-to-socket portion.
That also eases some of the decisions that only the layer above can
really make - for example, in the sendfile() case, you don't really know
if it's appropriate to send a partial read or whether the caller really
needs all the data.

What we'd like is something like splice(). The Linux splice interface is
documented here: http://linux.die.net/man/2/splice and the internals are
discussed here: http://kerneltrap.org/node/6505 . We don't need the
sillier portions of it - Isilon couldn't care less about
vmsplice()/tee(). We need the ability to shuffle data from one source to
one sink, and then to turn around later and use that sink as a source.

At first, I found the splice() interface a bit of an abomination, but a
pipe is a somewhat natural place to act as a data staging area. If we
just implemented splice alone, this wouldn't require any real VM
hackery - you can imagine just shuffling mbufs through the pipe (or,
say, a unix domain socket) to accomplish a limited form of this.

As part of this, and in order to get something upstreamable, it seems
like we would need a few things:

*) Agreement on syscall APIs - My initial proposal is to adopt splice
verbatim. Initially the interface may not be truly zero-copy for many
cases, but it's a start. It also increases portability for any Linux
apps that are trying to make use of it.

*) Unification of uio and mbufs somehow? Isilon currently has private
patches that add *_MBUF variants for I/O VOPs (e.g. we have a
VOP_READ_MBUF in addition to the standard VOP_READ). Isilon is in a
somewhat unique place here - I'm not sure a general file system can
handle this as easily. At the top-half, our system in many ways acts a
lot like a router, so we can handle things like VOP_READ_MBUF by taking
file data off our back-end (which comes in as mbufs off IB), header
splitting, then just slinging the mbufs out the front-end. However, I
think our *_MBUF VOP variants are actually a little gross.
I would rather figure out a way to unify the uio and mbuf APIs (they're
both scatter/gather lists in their own special way) and then call into a
single VOP.

Isilon can get a limited, non-upstreamable thing working fairly
quickly - we can use a unix domain socket as the intermediate buffer and
use our existing *_MBUF VOPs. But it would be nice if we had some
consensus going forward, so that we can internally march towards
something we can upstream.

...Zach

--
Zach Loafman | Staff Engineer | Isilon Systems
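
For readers unfamiliar with the interface, the two-stage pattern
described in the message (source into a pipe, then later pipe into a
sink) can be sketched against the existing Linux splice(2) call. This is
only an illustration under stated assumptions: it targets Linux, not any
FreeBSD implementation, and the helper names splice_all() and
splice_demo() are hypothetical and do not come from the message. The
staged length must fit in the pipe buffer unless another thread is
draining the pipe concurrently.

```c
#define _GNU_SOURCE		/* exposes splice() and SPLICE_F_MOVE in <fcntl.h> */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to len bytes from src_fd to dst_fd.
 * At least one of the two descriptors must be a pipe end.
 * Returns the number of bytes moved (short on end of input), -1 on error.
 */
static ssize_t
splice_all(int src_fd, int dst_fd, size_t len)
{
	size_t done = 0;

	while (done < len) {
		/* SPLICE_F_MOVE is only a hint; the kernel may still copy. */
		ssize_t n = splice(src_fd, NULL, dst_fd, NULL, len - done,
		    SPLICE_F_MOVE);
		if (n == 0)
			break;			/* end of input */
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return (-1);
		}
		done += (size_t)n;
	}
	return ((ssize_t)done);
}

/*
 * Hypothetical demo: stage a small file into a pipe, then drain the
 * pipe into one end of a unix domain socket pair. The two splice_all()
 * calls are independent, so the layer above decides when (and whether)
 * the staged data goes out. Returns 0 on success, -1 on failure.
 */
int
splice_demo(void)
{
	static const char msg[] = "hello splice";
	char path[] = "/tmp/splice-sketch-XXXXXX";
	char buf[sizeof(msg)];
	int pfd[2] = { -1, -1 }, sv[2] = { -1, -1 }, rc = -1;
	ssize_t staged, sent;
	int file_fd = mkstemp(path);

	if (file_fd < 0 || pipe(pfd) != 0 ||
	    socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		goto out;
	if (write(file_fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg) ||
	    lseek(file_fd, 0, SEEK_SET) != 0)
		goto out;

	/* Stage 1: file -> pipe (the pipe is the staging area). */
	staged = splice_all(file_fd, pfd[1], sizeof(msg));
	if (staged != (ssize_t)sizeof(msg))
		goto out;

	/* Stage 2, possibly much later: pipe -> socket. */
	sent = splice_all(pfd[0], sv[0], (size_t)staged);
	if (sent == staged && read(sv[1], buf, sizeof(buf)) == staged &&
	    memcmp(buf, msg, sizeof(msg)) == 0)
		rc = 0;
out:
	if (file_fd >= 0) {
		unlink(path);
		close(file_fd);
	}
	if (pfd[0] >= 0) { close(pfd[0]); close(pfd[1]); }
	if (sv[0] >= 0) { close(sv[0]); close(sv[1]); }
	return (rc);
}
```

Because stage 1 and stage 2 are separate calls, the caller can inspect
how much was actually staged before committing anything to the socket,
which is the partial-read decision that sendfile() hides.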