From: Bakul Shah <bakul@bitblocks.com>
To: Eric Anderson
Cc: freebsd-fs@freebsd.org
Date: Fri, 01 Jul 2005 17:38:24 -0700
Subject: Re: Cluster Filesystem for FreeBSD - any interest?
In-reply-to: Your message of "Fri, 01 Jul 2005 07:32:24 CDT." <42C537D8.2000403@centtech.com>
Message-Id: <200507020038.j620cO7F071025@gate.bitblocks.com>

> > A couple FS specific suggestions:
> > - perhaps clustering can be built on top of existing
> >   filesystems.  Each machine's local filesystem is considered
> >   a cache and you use some sort of cache coherency protocol.
> >   That way you don't have to deal with filesystem allocation
> >   and layout issues.
>
> I see - that's an interesting idea.  Almost like each machine could
> mount the shared version read-only, then slap a layer on top that is
> connected to a cache coherency manager (maybe there is a daemon on
> each node, and the nodes sync their caches via the network) to keep
> the filesystems 'in sync'.  Then maybe only one elected node actually
> writes the data to the disk.  If that node dies, then another node is
> elected.

\begin{handwaving}
What I was thinking of:

- The cluster system ensures that there are at least N copies of every
  file at N+ separate locations.
- More than N copies may be cached depending on usage pattern.
- Any node can write.  The system takes care of replication and
  placement.
- Metadata and directories are implemented *above* this level.
- More likely you'd want to map file *fragments* to local files, so
  that a file can grow beyond one disk, and smaller fragments mean you
  don't have to cache an entire file.
- You still need to mediate access at the file level, but this is no
  different from two or more processes accessing a local file.

Of course, the devil is in the details!  (Rough sketches of both ideas
follow below.)

> > - a network wide stable storage `disk' may be easier to do
> >   given GEOM.  There are at least N copies of each data block.
> >   Data may be cached locally at any site but writing data is
> >   done as a distributed transaction.  So again cache
> >   coherency is needed.  A network RAID if you will!
>
> I'm not sure how this would work.  A network RAID with geom+ggate is
> simple (I've done this a couple times - cool!), but how does that get
> me shared read-write access to the same data?

What I had in mind was something like this: each logical block is
backed by N physical blocks at N sites.  Individual filesystems live
in partitions of this space.  So in effect you have a single NFS
server per filesystem that deals with all metadata and directory
lookups, but thanks to caching, read access should be faster.  When a
server goes down, another server can be elected.
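To make the first idea a bit more concrete, here is a minimal sketch
(Python purely for modelling convenience; a real implementation would
be kernel code).  All names here (Node, FragmentStore, REPLICAS) are
invented for illustration, and the failure handling, cache
invalidation and membership changes a real cluster needs are omitted:

import hashlib

REPLICAS = 3  # the "N" in "at least N copies"

class Node:
    """One cluster member; its local filesystem is the cache/store."""
    def __init__(self, name):
        self.name = name
        # fragment id -> (version, data); stands in for files in the
        # node's local filesystem
        self.store = {}

class FragmentStore:
    """Maps file fragments to N nodes and replicates writes to all."""
    def __init__(self, nodes):
        self.nodes = nodes

    def placement(self, frag_id):
        # Deterministic placement: hash the fragment id and pick N
        # distinct nodes.  Any node can compute this, so any node can
        # locate or write a fragment without asking a central map.
        h = int(hashlib.sha1(frag_id.encode()).hexdigest(), 16)
        start = h % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(REPLICAS)]

    def write(self, frag_id, version, data):
        # "Any node can write": the writer pushes the new version to
        # all N replicas.  A real system would make this a distributed
        # transaction (see the next sketch) rather than best effort.
        for node in self.placement(frag_id):
            node.store[frag_id] = (version, data)

    def read(self, frag_id):
        # Read from any replica; version numbers let a cached copy be
        # checked for staleness.
        for node in self.placement(frag_id):
            if frag_id in node.store:
                return node.store[frag_id]
        raise KeyError(frag_id)

# Fragment 0 of file "a" ends up replicated on 3 of the 5 nodes.
cluster = FragmentStore([Node("n%d" % i) for i in range(5)])
cluster.write("a:frag0", version=1, data=b"hello")
print(cluster.read("a:frag0"))    # (1, b'hello'), from any replica

Metadata and directories would then be built on top of this fragment
layer, as the list above says.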
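And the corresponding sketch for the "network RAID" variant: each
logical block is backed by N physical blocks at N sites, and a write
either commits at all of them or at none (a toy two-phase commit
standing in for the distributed transaction).  Again the names
(BlockSite, LogicalDisk) are invented; real code would sit under GEOM
and reach the remote sites over the network, e.g. via ggate:

class BlockSite:
    """One site holding a physical copy of each logical block."""
    def __init__(self):
        self.blocks = {}    # logical block number -> committed data
        self.pending = {}   # staged writes awaiting commit

    def prepare(self, blkno, data):
        self.pending[blkno] = data
        return True         # a real site could refuse (down, disk full)

    def commit(self, blkno):
        self.blocks[blkno] = self.pending.pop(blkno)

    def abort(self, blkno):
        self.pending.pop(blkno, None)

class LogicalDisk:
    """A logical block device replicated across N sites."""
    def __init__(self, sites):
        self.sites = sites

    def write(self, blkno, data):
        # Phase 1: every site stages the write.
        if all(site.prepare(blkno, data) for site in self.sites):
            # Phase 2: all sites commit, so the copies stay identical.
            for site in self.sites:
                site.commit(blkno)
            return True
        # Someone refused: undo the staged writes everywhere.
        for site in self.sites:
            site.abort(blkno)
        return False

    def read(self, blkno):
        # Any site can serve a read; data may also be cached locally.
        return self.sites[0].blocks[blkno]

disk = LogicalDisk([BlockSite() for _ in range(3)])
disk.write(7, b"superblock")
print(disk.read(7))    # b'superblock', identical at all 3 sites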
> :) I understand.  Any nudging in the right direction here would be
> appreciated.

I'd probably start with modelling a single filesystem and how it maps
to a sequence of disk blocks (*without* using any code or worrying
about details of formats, but capturing the essential elements).  I'd
describe various operations in terms of preconditions and
postconditions.  Then I'd extend the model to deal with redundancy and
so on.  Then I'd model various failure modes, etc.

If you are interested _enough_ we can take this offline and try to
work something out.  You may even be able to use perl to create an
`executable' specification :-)  A toy example of what such a
specification might look like is sketched below.
\end{handwaving}
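For instance, such an executable specification might start like this
(Python here standing in for the perl suggestion; FSModel and its
operations are invented for illustration).  The model captures only
the essential element -- a filesystem maps each file to a sequence of
disk blocks -- with operations stated as preconditions (assertions on
entry) and postconditions (assertions on exit):

class FSModel:
    def __init__(self, nblocks):
        self.free = set(range(nblocks))  # unallocated disk blocks
        self.files = {}                  # name -> list of block numbers

    def create(self, name):
        assert name not in self.files            # precondition
        self.files[name] = []
        assert self.files[name] == []            # postcondition

    def append_block(self, name):
        assert name in self.files and self.free  # preconditions
        old_len = len(self.files[name])
        blk = self.free.pop()
        self.files[name].append(blk)
        # Postconditions: the file grew by one block and that block
        # is no longer free.
        assert len(self.files[name]) == old_len + 1
        assert blk not in self.free

    def check_invariant(self):
        used = [b for blocks in self.files.values() for b in blocks]
        assert len(used) == len(set(used))       # no block shared
        assert not (set(used) & self.free)       # free/used disjoint

fs = FSModel(nblocks=16)
fs.create("a")
fs.append_block("a")
fs.check_invariant()

Redundancy and failure modes would then be added by extending the
state (say, N copies per block) and strengthening the assertions.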