Date:      Wed, 22 Oct 2003 01:43:10 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        "Robert J. Adams (jason)" <radams@siscom.net>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: >1 systems 1 FS
Message-ID:  <3F96431E.A30656E3@mindspring.com>
References:  <3F95B946.8010309@newshosting.com> <20031021233414.GJ99943@elvis.mu.org> <3F95C6F3.8030005@siscom.net>

"Robert J. Adams (jason)" wrote:
> Alfred Perlstein wrote:
> >>Hello,
> >>
> >>I'm working on a new cluster design and had a quick question. If I have
> >>a few boxes mounting the same FS (over a SAN) all read-only will it
> >>work? Will I have any trouble? Has anyone tried this with UFS/UFS2 ..
> >
> > You shouldn't.
> 
> I shouldn't do this or I shouldn't have trouble? :)
> 
> >>Lets take it one step further.. lets say I have 1 box that mounts it
> >>RW.. and it updates the contents .. will the other systems that have it
> >>mounted RO puke?
> >
> >
> > Likely.
> 
> Well shit.. I need this.

Then you need a new FS.

The issue is that you need block-level or block-range locking on
the device, over the shared interface wire, to be able to do this
effectively, since a device that is the target of multiple master
devices has to know which of them to permit onto which blocks and
which to refuse.
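
To make the bookkeeping concrete, here is a minimal sketch (purely
illustrative -- not any actual standard or target firmware) of the
sort of reservation table the device itself would have to keep, and
consult, before letting an initiator's I/O through:

    #include <stdint.h>

    /* Illustrative only: a target-side table of block-range
     * reservations.  Each entry records which initiator currently
     * owns which range of blocks, and in what mode. */
    #define MAX_RESERVATIONS 64

    struct blk_reservation {
        uint32_t initiator_id;  /* which master holds the range */
        uint64_t start_blk;     /* first block of the range */
        uint64_t nblks;         /* length of the range in blocks */
        int      exclusive;     /* 1 = writer, 0 = shared reader */
    };

    static struct blk_reservation restab[MAX_RESERVATIONS];
    static int nres;

    /* Should this I/O from 'initiator' be allowed onto the blocks? */
    int
    range_permitted(uint32_t initiator, uint64_t blk, uint64_t len,
        int writing)
    {
        for (int i = 0; i < nres; i++) {
            struct blk_reservation *r = &restab[i];
            int overlap = blk < r->start_blk + r->nblks &&
                r->start_blk < blk + len;

            if (!overlap || r->initiator_id == initiator)
                continue;
            if (writing || r->exclusive)
                return (0);     /* conflicting owner: refuse */
        }
        return (1);
    }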

Firewire was supposed to fix this, and so was SCSI 3.  The parts
of the SCSI 3 standard that deal with this particular issue have
not been finalized, because each device vendor is jockeying to
get their implementation standardized to get a jump on all the
other vendors, instead of cooperating on establishing an open
standard.  This is one of the main reasons that the SCSI 3
standard is not yet final (the other main reason is that a number
of the participants also sell IDE disks, and whatever's bad for
SCSI is good for IDE, so they are being obstructionist jerks
because they can).

There are a number of FS implementations that can deal with this,
however, and the way they do it is by implementing an
out-of-device-control-band block-level or range of blocks locking
protocol, usually over ethernet, to ensure that they can get
exclusive access to the blocks.  Usually, this is implemented as
multiple reader, single writer locking, with the ability to go
exclusive ("SIX locking" -- "Shared Intention eXclusive"; look
for it in your favorite search engine).
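
For reference, the mode lattice behind SIX locking is the classic
granular-locking compatibility matrix from the database literature
(Gray et al.); a minimal sketch, not lifted from any particular FS:

    /* IS = intention shared, IX = intention exclusive, S = shared,
     * SIX = shared + intention exclusive, X = exclusive.
     * compat[held][requested] is 1 if 'requested' can be granted on
     * an object while another holder already has 'held' on it. */
    enum lockmode { IS, IX, S, SIX, X };

    static const int compat[5][5] = {
        /*           IS  IX   S  SIX   X */
        /* IS  */  {  1,  1,  1,   1,  0 },
        /* IX  */  {  1,  1,  0,   0,  0 },
        /* S   */  {  1,  0,  1,   0,  0 },
        /* SIX */  {  1,  0,  0,   0,  0 },
        /* X   */  {  0,  0,  0,   0,  0 },
    };

A node holding SIX on a range can keep reading the whole range while
escalating individual blocks to X for writing, which is exactly the
multiple reader, single writer with escalation behavior described
above.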

Obviously, doing this in-band with explicit enforcement, with no
inter-node failure recovery necessary because the locks are stored
in the physical device (i.e. the SCSI 3 approach), would have
significant performance benefits over an external lock manager
that relies on the machines voluntarily participating and not
going down.
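
To illustrate the "voluntarily participating and not going down"
part: such lock managers commonly hand out leases that expire if a
node stops renewing them, and it is entirely up to the client to
refuse I/O it no longer holds a lease for -- nothing in the device
will stop it.  A minimal sketch, with made-up structure and function
names rather than any real FS's protocol:

    #include <stdint.h>
    #include <time.h>

    /* A block-range lease granted by an external lock manager. */
    struct blk_lease {
        uint64_t start_blk;
        uint64_t nblks;
        int      exclusive;
        time_t   expires;   /* manager reclaims the range after this */
    };

    /* The client must check this itself before issuing the I/O;
     * the device enforces nothing. */
    int
    lease_covers(const struct blk_lease *l, uint64_t blk, uint64_t len,
        int writing, time_t now)
    {
        if (now >= l->expires)
            return (0);             /* stale: must re-acquire */
        if (writing && !l->exclusive)
            return (0);
        return (blk >= l->start_blk &&
            blk + len <= l->start_blk + l->nblks);
    }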

One example of an FS that can do this is GFS, from Sistina; they
used to have an open-source version (under the GPL), but appear
to have since come to their senses.  I ported all the user space
tools for GFS to FreeBSD in about 4 hours of work one night, when
it was still available under the GPL.  See their propaganda at:

	http://www.sistina.com/products_gfs.htm

IBM also has two FS's that can do this, but they don't even run
on Linux, let alone FreeBSD.

In theory, SGI CXFS will also do this (I haven't gotten enough
information from non-proprietary channels to be able to disclose
much here and be on sound legal footing).


Another company that had a product in this space was Zambeel; they
were a Fremont startup, and, among other people, they had hired
Mohit Aron from Rice University (he did the ResCon LRP implementation
and was associated with the SCALA Server project and Peter Druschel's
group).  The company showed a lot of promise, but apparently burnt
all its first round money, to the tune of $65M, at the rate of $1M/month,
with only 90 people in headcount a little more than a year ago.

Unfortunately, they croaked last April:
http://www.byteandswitch.com/document.asp?doc_id=31886&site=byteandswitch
and it's not likely that anyone will be jumping into the space very
soon, since it hasn't been very profitable for the companies trying
to stake out territory there.


Anyway, the normal way this is handled for SAN/NAS devices is
to carve out a logical volume region on a per-machine basis, and
forget the locking altogether (giving a management node "ownership"
of the "as yet unallocated regions"), which avoid contention by
separation of the contention domain entirely.  Not a very
satisfying way of doing it, if you ask me.
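
What that carving amounts to is a static map of non-overlapping
regions, with the management node owning the as-yet-unallocated
tail, so no runtime locking is ever needed.  A minimal sketch, with
node ids and region sizes invented for illustration:

    #include <stdint.h>

    #define NODE_MGMT 0     /* owns the unallocated remainder */

    struct region {
        uint32_t node_id;   /* machine that owns this slice */
        uint64_t start_blk;
        uint64_t nblks;     /* 0 = "to end of device" */
    };

    static const struct region volmap[] = {
        { 1,         0,             1ULL << 24 },
        { 2,         1ULL << 24,    1ULL << 24 },
        { NODE_MGMT, 2ULL << 24,    0          },
    };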

-- Terry
