Date: Mon, 28 Apr 2014 20:37:19 -0400 (EDT)
From: Rick Macklem
To: Ivan Voras
Cc: freebsd-fs@freebsd.org
Subject: Re: RFC: using ceph as a backend for an NFSv4.1 pNFS server
Message-ID: <1459248112.3139531.1398731839613.JavaMail.root@uoguelph.ca>
List-Id: Filesystems <freebsd-fs@freebsd.org>

Ivan Voras wrote:
> On 26/04/2014 21:47, Rick Macklem wrote:
> > Any other comments w.r.t.
> > this would be appreciated, including generic stuff like "we
> > couldn't care less about pNFS" or technical details/opinions.
> >
> > Thanks in advance for any feedback, rick
> > ps: I'm nowhere near committing to do this at this point and
> > I do realize that even completing the ceph port to FreeBSD
> > might be beyond my limited resources.
>
> What functionality from ceph would pNFS really need? Would pNFS
> need to be implemented with a single back-end storage like ceph or
> could it be modular? (I don't have much experience here but it
> looks like HDFS is becoming popular for some big-data applications.)
>
Well, I doubt I can answer this, but here is a simple summary of what
a pNFS server does:
- The NFSv4.1/pNFS server (sometimes called a metadata server or MDS)
  handles all the normal NFS stuff, including reads/writes of the
  files. However, it can also hand out layouts, which tell the client
  where to read/write the file's data on another data server (DS).
- There are RFCs describing three ways the client can read/write data
  on a DS:
  1 - File layout, where the client uses a subset of NFSv4.1
      (read/write plus enough other operations to use them).
  2 - Block/volume layout, where the client uses iSCSI to read/write
      blocks of the file's data.
  3 - Object layout, where the object storage commands are used over
      iSCSI.
I think you can see that any of these require a lot of work to be
done "behind the curtains" so that the MDS can know where the file's
data lives (and it can be striped across multiple DSs, etc.).
To implement this "from the ground up" is way beyond my limited
time/resources (and expertise). I hope that I can find an open source
cluster file system that handles most of the "behind the curtains"
stuff, so that all the NFSv4.1 server needs to do is ask the cluster
file system where the file/object's data lives and generate a layout
from that.
(I'm basically looking for a path of least work. ;-)

Exactly what is needed from the cluster fs isn't obvious to me at
this time (and depends on the layout type), but here are some
thoughts:
- where the file's data lives, plus the info needed for the layout,
  so the client can read and write the file's data at the DS
- notification when the file's data location changes, so the MDS can
  recall the now-stale layout
- allowing the file to grow without the MDS having to do anything
  when the client writes to the DS (the MDS needs to have a way to
  find out the current size of the file)
- allowing the DSs to be built easily, using FreeBSD and the cluster
  file system's tools (ideally on top of underlying FreeBSD file
  systems like ZFS, to avoid "yet another" file system)
There are probably a lot more of these.

My hunch is that doing this for even one cluster file system will be
at/beyond my time/resource limits. I also suspect these cluster file
systems are different enough that each would be a lot of effort, even
ignoring the fact that none of them are ported to FreeBSD. I'd also
like to avoid porting a file system into FreeBSD. What I do like
about ceph (and gluster is similar, I think?) is that they are
layered on top of a regular file system, so they can use ZFS for the
actual storage handling.

rick