From: Bakul Shah <bakul@bitblocks.com>
To: Eric Anderson
Cc: freebsd-fs@freebsd.org
Date: Fri, 01 Jul 2005 17:38:24 -0700
Subject: Re: Cluster Filesystem for FreeBSD - any interest?
In-reply-to: Your message of "Fri, 01 Jul 2005 07:32:24 CDT." <42C537D8.2000403@centtech.com>
Message-Id: <200507020038.j620cO7F071025@gate.bitblocks.com>

> > A couple FS specific suggestions:
> > - perhaps clustering can be built on top of existing
> >   filesystems.  Each machine's local filesystem is considered
> >   a cache and you use some sort of cache coherency protocol.
> >   That way you don't have to deal with filesystem allocation
> >   and layout issues.
>
> I see - that's an interesting idea.  Almost like each machine could
> mount the shared version read-only, then slap a layer on top that is
> connected to a cache coherency manager (maybe there is a daemon on
> each node, and the nodes sync their caches via the network) to keep
> the filesystems 'in sync'.  Then maybe only one elected node actually
> writes the data to the disk.  If that node dies, then another node is
> elected.

\begin{handwaving}
What I was thinking of:

- The cluster system ensures that there are at least N copies of every
  file at N+ separate locations.
- More than N copies may be cached depending on usage pattern.
- Any node can write.  The system takes care of replication and
  placement.
- Metadata and directories are implemented *above* this level.
- More likely you'd want to map file *fragments* to local files, so
  that a file can grow beyond one disk, and smaller fragments mean you
  don't have to cache an entire file.
- You still need to mediate access at the file level, but this is no
  different from two or more processes accessing a local file.

Of course, the devil is in the details!  (Rough sketches of both ideas
follow below.)

> > - a network wide stable storage `disk' may be easier to do
> >   given GEOM.  There are at least N copies of each data block.
> >   Data may be cached locally at any site but writing data is
> >   done as a distributed transaction.  So again cache
> >   coherency is needed.  A network RAID if you will!
>
> I'm not sure how this would work.  A network RAID with geom+ggate is
> simple (I've done this a couple times - cool!), but how does that get
> me shared read-write access to the same data?

What I had in mind was something like this: each logical block is
backed by N physical blocks at N sites.  Individual filesystems live
in partitions of this space.  So in effect you have a single NFS
server per filesystem that deals with all metadata and directory
lookups, but thanks to caching, read access should be faster.  When a
server goes down, another server can be elected.
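To make the first idea a bit more concrete, here is a minimal sketch
(Python purely for modelling convenience; a real implementation would
be kernel code).  All names here (Node, FragmentStore, REPLICAS) are
invented for illustration, and the failure handling, cache
invalidation and membership changes a real cluster needs are omitted:

import hashlib

REPLICAS = 3  # the "N" in "at least N copies"

class Node:
    """One cluster member; its local filesystem is the cache/store."""
    def __init__(self, name):
        self.name = name
        # fragment id -> (version, data); stands in for files in the
        # node's local filesystem
        self.store = {}

class FragmentStore:
    """Maps file fragments to N nodes and replicates writes to all."""
    def __init__(self, nodes):
        self.nodes = nodes

    def placement(self, frag_id):
        # Deterministic placement: hash the fragment id and pick N
        # distinct nodes.  Any node can compute this, so any node can
        # locate or write a fragment without asking a central map.
        h = int(hashlib.sha1(frag_id.encode()).hexdigest(), 16)
        start = h % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(REPLICAS)]

    def write(self, frag_id, version, data):
        # "Any node can write": the writer pushes the new version to
        # all N replicas.  A real system would make this a distributed
        # transaction (see the next sketch) rather than best effort.
        for node in self.placement(frag_id):
            node.store[frag_id] = (version, data)

    def read(self, frag_id):
        # Read from any replica; version numbers let a cached copy be
        # checked for staleness.
        for node in self.placement(frag_id):
            if frag_id in node.store:
                return node.store[frag_id]
        raise KeyError(frag_id)

# Fragment 0 of file "a" ends up replicated on 3 of the 5 nodes.
cluster = FragmentStore([Node("n%d" % i) for i in range(5)])
cluster.write("a:frag0", version=1, data=b"hello")
print(cluster.read("a:frag0"))    # (1, b'hello'), from any replica

Metadata and directories would then be built on top of this fragment
layer, as the list above says.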
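And the corresponding sketch for the "network RAID" variant: each
logical block is backed by N physical blocks at N sites, and a write
either commits at all of them or at none (a toy two-phase commit
standing in for the distributed transaction).  Again the names
(BlockSite, LogicalDisk) are invented; real code would sit under GEOM
and reach the remote sites over the network, e.g. via ggate:

class BlockSite:
    """One site holding a physical copy of each logical block."""
    def __init__(self):
        self.blocks = {}    # logical block number -> committed data
        self.pending = {}   # staged writes awaiting commit

    def prepare(self, blkno, data):
        self.pending[blkno] = data
        return True         # a real site could refuse (down, disk full)

    def commit(self, blkno):
        self.blocks[blkno] = self.pending.pop(blkno)

    def abort(self, blkno):
        self.pending.pop(blkno, None)

class LogicalDisk:
    """A logical block device replicated across N sites."""
    def __init__(self, sites):
        self.sites = sites

    def write(self, blkno, data):
        # Phase 1: every site stages the write.
        if all(site.prepare(blkno, data) for site in self.sites):
            # Phase 2: all sites commit, so the copies stay identical.
            for site in self.sites:
                site.commit(blkno)
            return True
        # Someone refused: undo the staged writes everywhere.
        for site in self.sites:
            site.abort(blkno)
        return False

    def read(self, blkno):
        # Any site can serve a read; data may also be cached locally.
        return self.sites[0].blocks[blkno]

disk = LogicalDisk([BlockSite() for _ in range(3)])
disk.write(7, b"superblock")
print(disk.read(7))    # b'superblock', identical at all 3 sites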
> :) I understand.  Any nudging in the right direction here would be
> appreciated.

I'd probably start with modelling a single filesystem and how it maps
to a sequence of disk blocks (*without* using any code or worrying
about details of formats, but capturing the essential elements).  I'd
describe various operations in terms of preconditions and
postconditions.  Then I'd extend the model to deal with redundancy and
so on.  Then I'd model various failure modes, etc.

If you are interested _enough_ we can take this offline and try to
work something out.  You may even be able to use perl to create an
`executable' specification :-)  A toy example of what such a
specification might look like is sketched below.
\end{handwaving}
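For instance, such an executable specification might start like this
(Python here standing in for the perl suggestion; FSModel and its
operations are invented for illustration).  The model captures only
the essential element -- a filesystem maps each file to a sequence of
disk blocks -- with operations stated as preconditions (assertions on
entry) and postconditions (assertions on exit):

class FSModel:
    def __init__(self, nblocks):
        self.free = set(range(nblocks))  # unallocated disk blocks
        self.files = {}                  # name -> list of block numbers

    def create(self, name):
        assert name not in self.files            # precondition
        self.files[name] = []
        assert self.files[name] == []            # postcondition

    def append_block(self, name):
        assert name in self.files and self.free  # preconditions
        old_len = len(self.files[name])
        blk = self.free.pop()
        self.files[name].append(blk)
        # Postconditions: the file grew by one block and that block
        # is no longer free.
        assert len(self.files[name]) == old_len + 1
        assert blk not in self.free

    def check_invariant(self):
        used = [b for blocks in self.files.values() for b in blocks]
        assert len(used) == len(set(used))       # no block shared
        assert not (set(used) & self.free)       # free/used disjoint

fs = FSModel(nblocks=16)
fs.create("a")
fs.append_block("a")
fs.check_invariant()

Redundancy and failure modes would then be added by extending the
state (say, N copies per block) and strengthening the assertions.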