From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 11:45:34 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 09AB016A41F; Mon, 26 Sep 2005 11:45:34 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from mh2.centtech.com (moat3.centtech.com [207.200.51.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 38DB143D48; Mon, 26 Sep 2005 11:45:32 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from [10.177.171.220] (neutrino.centtech.com [10.177.171.220]) by mh2.centtech.com (8.13.1/8.13.1) with ESMTP id j8QBjU66046492; Mon, 26 Sep 2005 06:45:31 -0500 (CDT) (envelope-from anderson@centtech.com) Message-ID: <4337DF56.6030407@centtech.com> Date: Mon, 26 Sep 2005 06:45:26 -0500 From: Eric Anderson User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.11) Gecko/20050914 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Brian Candler References: <20050924141025.GA1236@uk.tiscali.com> In-Reply-To: <20050924141025.GA1236@uk.tiscali.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.82/1102/Sun Sep 25 09:04:56 2005 on mh2.centtech.com X-Virus-Status: Clean Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 11:45:34 -0000 Brian Candler wrote: > Hello, > > I was wondering if anyone would care to share their experiences in > synchronising filesystems across a number of nodes in a cluster. I can think > of a number of options, but before changing what I'm doing at the moment I'd > like to see if anyone has good experiences with any of the others. > > The application: a clustered webserver. The users' CGIs run in a chroot > environment, and these clearly need to be identical (otherwise a CGI running > on one box would behave differently when running on a different box). > Ultimately I'd like to synchronise the host OS on each server too. > > Note that this is a single-master, multiple-slave type of filesystem > synchronisation I'm interested in. > > > 1. Keep a master image on an admin box, and rsync it out to the frontends > ------------------------------------------------------------------------- > > This is what I'm doing at the moment. Install a master image in > /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and > rsync it. [Actually I'm exporting it using NFS, and the frontends run rsync > locally when required to update their local copies against the NFS master] > > Disadvantages: > > - rsyncing a couple of gigs of data is not particularly fast, even when only > a few files have changed > > - if a sysadmin (wrongly) changes a file on a front-end instead of on the > master copy in the admin box, then the change will be lost when the next > rsync occurs. They might think they've fixed a problem, and then (say) 24 > hours later their change is wiped. However if this is a config file, the > fact that the old file has been reinstated might not be noticed until the > daemon is restarted or the box rebooted - maybe months later. This I think > is the biggest fundamental problem. 
> > - files can be added locally and they will remain indefinitely (unless we > use rsync --delete which is a bit scary). If this is done then adding a new > machine into the cluster by rsyncing from the master will not pick up these > extra files. > > So, here are the alternatives I'm considering, and I'd welcome any > additional suggestions too. Here's a few ideas on this: do multiple rsyncs, one for each top level directory. That might speed up your total rsync process. Another similar method is using a content revisioning system. This is only good for some cases, but something like subversion might work ok here. > 2. Run the images directly off NFS > ---------------------------------- > > I've had this running before, even the entire O/S, and it works just fine. > However the NFS server itself then becomes a critical > single-point-of-failure: if it has to be rebooted and is out of service for > 2 minutes, then the whole cluster is out of service for that time. > > I think this is only feasible if I can build a highly-available NFS server, > which really means a pair of boxes serving the same data. Since the system > image is read-only from the point of view of the frontends, this should be > easy enough: > > frontends frontends > | | | | | | > NFS -----------> NFS > server 1 sync server 2 > > As far as I know, NFS clients don't support the idea of failing over from > one server to another, so I'd have to make a server pair which transparently > fails over. > > I could make one NFS server take over the other server's IP address using > carp or vrrp. However, I suspect that the clients might notice. I know that > NFS is 'stateless' in the sense that a server can be rebooted, but for a > client to be redirected from one server to the other, I expect that these > filesytems would have to be *identical*, down to the level of the inode > numbers being the same. > > If that's true, then rsync between the two NFS servers won't cut it. I was > thinking of perhaps using geom_mirror plus ggated/ggatec to make a > block-identical read-only mirror image on NFS server 2 - this also has the > advantage that any updates are close to instantaneous. > > What worries me here is how NFS server 2, which has the mirrored filesystem > mounted read-only, will take to having the data changed under its nose. Does > it for example keep caches of inodes in memory, and what would happen if > those inodes on disk were to change? I guess I can always just unmount and > remount the filesystem on NFS server 2 after each change. I've tried doing something similar. I used fiber attached storage, and had multiple hosts mounting the same partition. It seemed as though when host A mounted the filesystem read-write, and then host B mounted it read-only, any changes made by host A were not seen by B, and even remounting did not always bring it up to current state. I believe it has to do with the buffer cache and host A's desire to keep things (like inode changes, block maps, etc) in cache and not write them to disk. FreeBSD does not currently have a multi-system cache coherency protocol to distribute that information to other hosts. This is something I think would be very useful for many people. I suppose you could just mount the filesystem when you know a change has happened, but you still may not see the change. Maybe mounting the filesystem on host A with the sync option would help. 
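To make the per-directory rsync idea above concrete, here is a minimal sketch; the master and local paths are hypothetical, and it assumes the master copy is reachable from the frontend (e.g. via the NFS mount from option 1):

   #!/bin/sh
   # One rsync per top-level directory of the master image, run in
   # parallel so a change in one subtree does not wait on a full walk
   # of all the others.  Paths below are examples only.
   MASTER=/master/webroot/cgi    # master copy (e.g. NFS-mounted)
   LOCAL=/webroot/cgi            # local copy on this frontend

   for dir in "$MASTER"/*; do
       [ -d "$dir" ] || continue          # top-level plain files would need one extra rsync
       name=$(basename "$dir")
       # add --delete only if wiping locally-added files is acceptable
       rsync -a "$dir/" "$LOCAL/$name/" &
   done
   wait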
> My other concern is about susceptibility to DoS-type attacks: if one > frontend were to go haywire and start hammering the NFS servers really hard, > it could impact on all the other machines in the cluster. > > However, the problems of data synchronisation are solved: any change made on > the NFS server is visible identically to all front-ends, and sysadmins can't > make changes on the front-ends because the NFS export is read-only. This was my first thought too, and a highly available NFS server is something any NFS heavy installation wants (needs). There are a few implementations of clustered filesystems out there, but non for FreeBSD (yet). What that allows is multiple machines talking to a shared storage with read/write access. Very handy, but since you only need read-only access, I think your problem is much simpler, and you can get away with a lot less. > 3. Use a network distributed filesystem - CODA? AFS? > ---------------------------------------------------- > > If each frontend were to access the filesystem as a read-only network mount, > but have a local copy to work with in the case of disconnected operation, > then the SPOF of an NFS server would be eliminated. > > However, I have no experience with CODA, and although it's been in the tree > since 2002, the README's don't inspire confidence: > > "It is mostly working, but hasn't been run long enough to be sure all the > bugs are sorted out. ... This code is not SMP ready" > > Also, a local cache is no good if the data you want during disconnected > operation is not in the cache at that time, which I think means this idea is > not actually a very good one. There is also a port for coda. I've been reading about this, and it's an interesting filesystem, but I'm just not sure of it's usefulness yet. > 4. Mount filesystems read-only > ------------------------------ > > On each front-end I could store /webroot/cgi on a filesystem mounted > read-only to prevent tampering (as long as the sysadmin doesn't remount it > read-write of course). That would work reasonably well, except that being > mounted read-only I couldn't use rsync to update it! > > It might also work with geom_mirror and ggated/ggatec, except for the issue > I raised before about changing blocks on a filesystem under the nose of a > client who is actively reading from it. I suppose you could mount r/w only when doing the rsync, then switch back to ro once complete. You should be able to do this online, without any issues or taking the filesystem offline. > 5. Using a filesystem which really is read-only > ----------------------------------------------- > > Better tamper-protection could be had by keeping data in a filesystem > structure which doesn't support any updates at all - such as cd9660 or > geom_uzip. > > The issue here is how to roll out a new version of the data. I could push > out a new filesystem image into a second partition, but it would then be > necessary to unmount the old filesystem and remount the new on the same > place, and you can't really unmount a filesystem which is in use. So this > would require a reboot. > > I was thinking that some symlink trickery might help: > > /webroot/cgi -> /webroot/cgi1 > /webroot/cgi1 # filesystem A mounted here > /webroot/cgi2 # filesystem B mounted here > > It should be possible to unmount /webroot/cgi2, dd in a new image, remount > it, and change the symlink to point to /webroot/cgi2. 
After a little while, > hopefully all the applications will stop using files in /webroot/cgi1, so > this one can be unmounted and a new one put in its place on the next update. > However this is not guaranteed, especially if there are long-lived processes > using binary images in this partition. You'd still have to stop and restart > all those processes. > > If reboots were acceptable, then the filesystem image could also be stored > in ramdisk pulled in via pxeboot. This makes sense especially for geom_uzip > where the data is pre-compressed. However I would still prefer to avoid > frequent reboots if at all possible. Also, whilst a ramdisk might be OK for > the root filesystem, a typical CGI environment (with perl, php, ruby, > python, and loads of libraries) would probably be too large anyway. > > > 6. Journaling filesystem replication > ------------------------------------ > > If the data were stored on a journaling filesystem on the master box, and > the journal logs were distributed out to the slaves, then they would all > have identical filesystem copies and only a minimal amount of data would > need to be pushed out to each machine on each change. (This would be rather > like NetApps and their snap-mirroring system). However I'm not aware of any > journaling filesystem for FreeBSD, let alone whether it would support > filesystem replication in this way. There is a project underway for UFSJ (UFS journaling). Maybe once it is complete, and bugs are ironed out, one could implement a journal distribution piece to send the journal updates to multiple hosts and achieve what you are thinking, however, that only distributes the meta-data, and not the actual data. Good luck finding your ultimate solution! Eric -- ------------------------------------------------------------------------ Eric Anderson Sr. Systems Administrator Centaur Technology Anything that works is better than anything that doesn't. ------------------------------------------------------------------------ From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 12:25:37 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id B5DB616A41F for ; Mon, 26 Sep 2005 12:25:37 +0000 (GMT) (envelope-from filip@wuytack.net) Received: from london.wuytack.net (host-84-9-106-97.bulldogdsl.com [84.9.106.97]) by mx1.FreeBSD.org (Postfix) with ESMTP id B1DB943D49 for ; Mon, 26 Sep 2005 12:25:36 +0000 (GMT) (envelope-from filip@wuytack.net) Received: (qmail 99551 invoked by uid 1003); 26 Sep 2005 12:42:58 -0000 Received: from filip@wuytack.net by london.wuytack.net by uid 89 with qmail-scanner-1.22 (clamscan: 0.71. spamassassin: 2.63. Clear:RC:1(82.110.72.114):. Processed in 5.41758 secs); 26 Sep 2005 12:42:58 -0000 Received: from unknown (HELO ?127.0.0.1?) 
(filip@wuytack.net@82.110.72.114) by 10.11.12.4 with SMTP; 26 Sep 2005 12:42:52 -0000 Message-ID: <4337E8A7.6070107@wuytack.net> Date: Mon, 26 Sep 2005 13:25:11 +0100 From: filip wuytack User-Agent: Mozilla Thunderbird 1.0.6 (Windows/20050716) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Eric Anderson References: <20050924141025.GA1236@uk.tiscali.com> <4337DF56.6030407@centtech.com> In-Reply-To: <4337DF56.6030407@centtech.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org, Brian Candler Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 12:25:37 -0000 Eric Anderson wrote: > Brian Candler wrote: > >> Hello, >> >> I was wondering if anyone would care to share their experiences in >> synchronising filesystems across a number of nodes in a cluster. I can >> think >> of a number of options, but before changing what I'm doing at the >> moment I'd >> like to see if anyone has good experiences with any of the others. >> >> The application: a clustered webserver. The users' CGIs run in a chroot >> environment, and these clearly need to be identical (otherwise a CGI >> running >> on one box would behave differently when running on a different box). >> Ultimately I'd like to synchronise the host OS on each server too. >> >> Note that this is a single-master, multiple-slave type of filesystem >> synchronisation I'm interested in. >> >> >> 1. Keep a master image on an admin box, and rsync it out to the frontends >> ------------------------------------------------------------------------- >> >> This is what I'm doing at the moment. Install a master image in >> /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and >> rsync it. [Actually I'm exporting it using NFS, and the frontends run >> rsync >> locally when required to update their local copies against the NFS >> master] >> >> Disadvantages: >> >> - rsyncing a couple of gigs of data is not particularly fast, even >> when only >> a few files have changed >> >> - if a sysadmin (wrongly) changes a file on a front-end instead of on the >> master copy in the admin box, then the change will be lost when the next >> rsync occurs. They might think they've fixed a problem, and then (say) 24 >> hours later their change is wiped. However if this is a config file, the >> fact that the old file has been reinstated might not be noticed until the >> daemon is restarted or the box rebooted - maybe months later. This I >> think >> is the biggest fundamental problem. >> >> - files can be added locally and they will remain indefinitely (unless we >> use rsync --delete which is a bit scary). If this is done then adding >> a new >> machine into the cluster by rsyncing from the master will not pick up >> these >> extra files. >> >> So, here are the alternatives I'm considering, and I'd welcome any >> additional suggestions too. > > > Here's a few ideas on this: do multiple rsyncs, one for each top level > directory. That might speed up your total rsync process. Another > similar method is using a content revisioning system. This is only good > for some cases, but something like subversion might work ok here. > > > >> 2. 
Run the images directly off NFS >> ---------------------------------- >> >> I've had this running before, even the entire O/S, and it works just >> fine. >> However the NFS server itself then becomes a critical >> single-point-of-failure: if it has to be rebooted and is out of >> service for >> 2 minutes, then the whole cluster is out of service for that time. >> >> I think this is only feasible if I can build a highly-available NFS >> server, >> which really means a pair of boxes serving the same data. Since the >> system >> image is read-only from the point of view of the frontends, this >> should be >> easy enough: >> >> frontends frontends >> | | | | | | >> NFS -----------> NFS >> server 1 sync server 2 >> >> As far as I know, NFS clients don't support the idea of failing over from >> one server to another, so I'd have to make a server pair which >> transparently >> fails over. >> >> I could make one NFS server take over the other server's IP address using >> carp or vrrp. However, I suspect that the clients might notice. I know >> that >> NFS is 'stateless' in the sense that a server can be rebooted, but for a >> client to be redirected from one server to the other, I expect that these >> filesytems would have to be *identical*, down to the level of the inode >> numbers being the same. >> >> If that's true, then rsync between the two NFS servers won't cut it. I >> was >> thinking of perhaps using geom_mirror plus ggated/ggatec to make a >> block-identical read-only mirror image on NFS server 2 - this also has >> the >> advantage that any updates are close to instantaneous. >> >> What worries me here is how NFS server 2, which has the mirrored >> filesystem >> mounted read-only, will take to having the data changed under its >> nose. Does >> it for example keep caches of inodes in memory, and what would happen if >> those inodes on disk were to change? I guess I can always just unmount >> and >> remount the filesystem on NFS server 2 after each change. > > > I've tried doing something similar. I used fiber attached storage, and > had multiple hosts mounting the same partition. It seemed as though > when host A mounted the filesystem read-write, and then host B mounted > it read-only, any changes made by host A were not seen by B, and even > remounting did not always bring it up to current state. I believe it > has to do with the buffer cache and host A's desire to keep things (like > inode changes, block maps, etc) in cache and not write them to disk. > FreeBSD does not currently have a multi-system cache coherency protocol > to distribute that information to other hosts. This is something I > think would be very useful for many people. I suppose you could just > mount the filesystem when you know a change has happened, but you still > may not see the change. Maybe mounting the filesystem on host A with > the sync option would help. > >> My other concern is about susceptibility to DoS-type attacks: if one >> frontend were to go haywire and start hammering the NFS servers really >> hard, >> it could impact on all the other machines in the cluster. >> >> However, the problems of data synchronisation are solved: any change >> made on >> the NFS server is visible identically to all front-ends, and sysadmins >> can't >> make changes on the front-ends because the NFS export is read-only. > > > This was my first thought too, and a highly available NFS server is > something any NFS heavy installation wants (needs). 
There are a few > implementations of clustered filesystems out there, but non for FreeBSD > (yet). What that allows is multiple machines talking to a shared > storage with read/write access. Very handy, but since you only need > read-only access, I think your problem is much simpler, and you can get > away with a lot less. > > >> 3. Use a network distributed filesystem - CODA? AFS? >> ---------------------------------------------------- >> >> If each frontend were to access the filesystem as a read-only network >> mount, >> but have a local copy to work with in the case of disconnected operation, >> then the SPOF of an NFS server would be eliminated. >> >> However, I have no experience with CODA, and although it's been in the >> tree >> since 2002, the README's don't inspire confidence: >> >> "It is mostly working, but hasn't been run long enough to be sure >> all the >> bugs are sorted out. ... This code is not SMP ready" >> >> Also, a local cache is no good if the data you want during disconnected >> operation is not in the cache at that time, which I think means this >> idea is >> not actually a very good one. > > > There is also a port for coda. I've been reading about this, and it's > an interesting filesystem, but I'm just not sure of it's usefulness yet. > > >> 4. Mount filesystems read-only >> ------------------------------ >> >> On each front-end I could store /webroot/cgi on a filesystem mounted >> read-only to prevent tampering (as long as the sysadmin doesn't >> remount it >> read-write of course). That would work reasonably well, except that being >> mounted read-only I couldn't use rsync to update it! >> >> It might also work with geom_mirror and ggated/ggatec, except for the >> issue >> I raised before about changing blocks on a filesystem under the nose of a >> client who is actively reading from it. > > > I suppose you could mount r/w only when doing the rsync, then switch > back to ro once complete. You should be able to do this online, without > any issues or taking the filesystem offline. > > >> 5. Using a filesystem which really is read-only >> ----------------------------------------------- >> >> Better tamper-protection could be had by keeping data in a filesystem >> structure which doesn't support any updates at all - such as cd9660 or >> geom_uzip. >> >> The issue here is how to roll out a new version of the data. I could push >> out a new filesystem image into a second partition, but it would then be >> necessary to unmount the old filesystem and remount the new on the same >> place, and you can't really unmount a filesystem which is in use. So this >> would require a reboot. >> >> I was thinking that some symlink trickery might help: >> >> /webroot/cgi -> /webroot/cgi1 >> /webroot/cgi1 # filesystem A mounted here >> /webroot/cgi2 # filesystem B mounted here >> >> It should be possible to unmount /webroot/cgi2, dd in a new image, >> remount >> it, and change the symlink to point to /webroot/cgi2. After a little >> while, >> hopefully all the applications will stop using files in /webroot/cgi1, so >> this one can be unmounted and a new one put in its place on the next >> update. >> However this is not guaranteed, especially if there are long-lived >> processes >> using binary images in this partition. You'd still have to stop and >> restart >> all those processes. >> >> If reboots were acceptable, then the filesystem image could also be >> stored >> in ramdisk pulled in via pxeboot. This makes sense especially for >> geom_uzip >> where the data is pre-compressed. 
However I would still prefer to avoid >> frequent reboots if at all possible. Also, whilst a ramdisk might be >> OK for >> the root filesystem, a typical CGI environment (with perl, php, ruby, >> python, and loads of libraries) would probably be too large anyway. >> >> >> 6. Journaling filesystem replication >> ------------------------------------ >> >> If the data were stored on a journaling filesystem on the master box, and >> the journal logs were distributed out to the slaves, then they would all >> have identical filesystem copies and only a minimal amount of data would >> need to be pushed out to each machine on each change. (This would be >> rather >> like NetApps and their snap-mirroring system). However I'm not aware >> of any >> journaling filesystem for FreeBSD, let alone whether it would support >> filesystem replication in this way. > > > There is a project underway for UFSJ (UFS journaling). Maybe once it > is complete, and bugs are ironed out, one could implement a journal > distribution piece to send the journal updates to multiple hosts and > achieve what you are thinking, however, that only distributes the > meta-data, and not the actual data. > > Have a look at dragonfly BSD for this. They are working on a journaling filesystem that will do just that. ~ Fil > Good luck finding your ultimate solution! > > Eric > > From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 12:46:16 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9724C16A41F; Mon, 26 Sep 2005 12:46:16 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from mh2.centtech.com (moat3.centtech.com [207.200.51.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2252743D48; Mon, 26 Sep 2005 12:46:15 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from [10.177.171.220] (neutrino.centtech.com [10.177.171.220]) by mh2.centtech.com (8.13.1/8.13.1) with ESMTP id j8QCkENe047545; Mon, 26 Sep 2005 07:46:14 -0500 (CDT) (envelope-from anderson@centtech.com) Message-ID: <4337ED91.8080200@centtech.com> Date: Mon, 26 Sep 2005 07:46:09 -0500 From: Eric Anderson User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.11) Gecko/20050914 X-Accept-Language: en-us, en MIME-Version: 1.0 To: filip wuytack References: <20050924141025.GA1236@uk.tiscali.com> <4337DF56.6030407@centtech.com> <4337E8A7.6070107@wuytack.net> In-Reply-To: <4337E8A7.6070107@wuytack.net> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.82/1102/Sun Sep 25 09:04:56 2005 on mh2.centtech.com X-Virus-Status: Clean Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 12:46:16 -0000 filip wuytack wrote: > > > Eric Anderson wrote: > >> Brian Candler wrote: >> >>> Hello, >>> >>> I was wondering if anyone would care to share their experiences in >>> synchronising filesystems across a number of nodes in a cluster. I >>> can think >>> of a number of options, but before changing what I'm doing at the >>> moment I'd >>> like to see if anyone has good experiences with any of the others. >>> >>> The application: a clustered webserver. 
The users' CGIs run in a chroot >>> environment, and these clearly need to be identical (otherwise a CGI >>> running >>> on one box would behave differently when running on a different box). >>> Ultimately I'd like to synchronise the host OS on each server too. >>> >>> Note that this is a single-master, multiple-slave type of filesystem >>> synchronisation I'm interested in. >>> >>> >>> 1. Keep a master image on an admin box, and rsync it out to the >>> frontends >>> ------------------------------------------------------------------------- >>> >>> >>> This is what I'm doing at the moment. Install a master image in >>> /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and >>> rsync it. [Actually I'm exporting it using NFS, and the frontends run >>> rsync >>> locally when required to update their local copies against the NFS >>> master] >>> >>> Disadvantages: >>> >>> - rsyncing a couple of gigs of data is not particularly fast, even >>> when only >>> a few files have changed >>> >>> - if a sysadmin (wrongly) changes a file on a front-end instead of on >>> the >>> master copy in the admin box, then the change will be lost when the next >>> rsync occurs. They might think they've fixed a problem, and then >>> (say) 24 >>> hours later their change is wiped. However if this is a config file, the >>> fact that the old file has been reinstated might not be noticed until >>> the >>> daemon is restarted or the box rebooted - maybe months later. This I >>> think >>> is the biggest fundamental problem. >>> >>> - files can be added locally and they will remain indefinitely >>> (unless we >>> use rsync --delete which is a bit scary). If this is done then adding >>> a new >>> machine into the cluster by rsyncing from the master will not pick up >>> these >>> extra files. >>> >>> So, here are the alternatives I'm considering, and I'd welcome any >>> additional suggestions too. >> >> >> >> Here's a few ideas on this: do multiple rsyncs, one for each top level >> directory. That might speed up your total rsync process. Another >> similar method is using a content revisioning system. This is only >> good for some cases, but something like subversion might work ok here. >> >> >> >>> 2. Run the images directly off NFS >>> ---------------------------------- >>> >>> I've had this running before, even the entire O/S, and it works just >>> fine. >>> However the NFS server itself then becomes a critical >>> single-point-of-failure: if it has to be rebooted and is out of >>> service for >>> 2 minutes, then the whole cluster is out of service for that time. >>> >>> I think this is only feasible if I can build a highly-available NFS >>> server, >>> which really means a pair of boxes serving the same data. Since the >>> system >>> image is read-only from the point of view of the frontends, this >>> should be >>> easy enough: >>> >>> frontends frontends >>> | | | | | | >>> NFS -----------> NFS >>> server 1 sync server 2 >>> >>> As far as I know, NFS clients don't support the idea of failing over >>> from >>> one server to another, so I'd have to make a server pair which >>> transparently >>> fails over. >>> >>> I could make one NFS server take over the other server's IP address >>> using >>> carp or vrrp. However, I suspect that the clients might notice. 
I >>> know that >>> NFS is 'stateless' in the sense that a server can be rebooted, but for a >>> client to be redirected from one server to the other, I expect that >>> these >>> filesytems would have to be *identical*, down to the level of the inode >>> numbers being the same. >>> >>> If that's true, then rsync between the two NFS servers won't cut it. >>> I was >>> thinking of perhaps using geom_mirror plus ggated/ggatec to make a >>> block-identical read-only mirror image on NFS server 2 - this also >>> has the >>> advantage that any updates are close to instantaneous. >>> >>> What worries me here is how NFS server 2, which has the mirrored >>> filesystem >>> mounted read-only, will take to having the data changed under its >>> nose. Does >>> it for example keep caches of inodes in memory, and what would happen if >>> those inodes on disk were to change? I guess I can always just >>> unmount and >>> remount the filesystem on NFS server 2 after each change. >> >> >> >> I've tried doing something similar. I used fiber attached storage, >> and had multiple hosts mounting the same partition. It seemed as >> though when host A mounted the filesystem read-write, and then host B >> mounted it read-only, any changes made by host A were not seen by B, >> and even remounting did not always bring it up to current state. I >> believe it has to do with the buffer cache and host A's desire to keep >> things (like inode changes, block maps, etc) in cache and not write >> them to disk. FreeBSD does not currently have a multi-system cache >> coherency protocol to distribute that information to other hosts. >> This is something I think would be very useful for many people. I >> suppose you could just mount the filesystem when you know a change has >> happened, but you still may not see the change. Maybe mounting the >> filesystem on host A with the sync option would help. >> >>> My other concern is about susceptibility to DoS-type attacks: if one >>> frontend were to go haywire and start hammering the NFS servers >>> really hard, >>> it could impact on all the other machines in the cluster. >>> >>> However, the problems of data synchronisation are solved: any change >>> made on >>> the NFS server is visible identically to all front-ends, and >>> sysadmins can't >>> make changes on the front-ends because the NFS export is read-only. >> >> >> >> This was my first thought too, and a highly available NFS server is >> something any NFS heavy installation wants (needs). There are a few >> implementations of clustered filesystems out there, but non for >> FreeBSD (yet). What that allows is multiple machines talking to a >> shared storage with read/write access. Very handy, but since you only >> need read-only access, I think your problem is much simpler, and you >> can get away with a lot less. >> >> >>> 3. Use a network distributed filesystem - CODA? AFS? >>> ---------------------------------------------------- >>> >>> If each frontend were to access the filesystem as a read-only network >>> mount, >>> but have a local copy to work with in the case of disconnected >>> operation, >>> then the SPOF of an NFS server would be eliminated. >>> >>> However, I have no experience with CODA, and although it's been in >>> the tree >>> since 2002, the README's don't inspire confidence: >>> >>> "It is mostly working, but hasn't been run long enough to be sure >>> all the >>> bugs are sorted out. ... 
This code is not SMP ready" >>> >>> Also, a local cache is no good if the data you want during disconnected >>> operation is not in the cache at that time, which I think means this >>> idea is >>> not actually a very good one. >> >> >> >> There is also a port for coda. I've been reading about this, and >> it's an interesting filesystem, but I'm just not sure of it's >> usefulness yet. >> >> >>> 4. Mount filesystems read-only >>> ------------------------------ >>> >>> On each front-end I could store /webroot/cgi on a filesystem mounted >>> read-only to prevent tampering (as long as the sysadmin doesn't >>> remount it >>> read-write of course). That would work reasonably well, except that >>> being >>> mounted read-only I couldn't use rsync to update it! >>> >>> It might also work with geom_mirror and ggated/ggatec, except for the >>> issue >>> I raised before about changing blocks on a filesystem under the nose >>> of a >>> client who is actively reading from it. >> >> >> >> I suppose you could mount r/w only when doing the rsync, then switch >> back to ro once complete. You should be able to do this online, >> without any issues or taking the filesystem offline. >> >> >>> 5. Using a filesystem which really is read-only >>> ----------------------------------------------- >>> >>> Better tamper-protection could be had by keeping data in a filesystem >>> structure which doesn't support any updates at all - such as cd9660 or >>> geom_uzip. >>> >>> The issue here is how to roll out a new version of the data. I could >>> push >>> out a new filesystem image into a second partition, but it would then be >>> necessary to unmount the old filesystem and remount the new on the same >>> place, and you can't really unmount a filesystem which is in use. So >>> this >>> would require a reboot. >>> >>> I was thinking that some symlink trickery might help: >>> >>> /webroot/cgi -> /webroot/cgi1 >>> /webroot/cgi1 # filesystem A mounted here >>> /webroot/cgi2 # filesystem B mounted here >>> >>> It should be possible to unmount /webroot/cgi2, dd in a new image, >>> remount >>> it, and change the symlink to point to /webroot/cgi2. After a little >>> while, >>> hopefully all the applications will stop using files in >>> /webroot/cgi1, so >>> this one can be unmounted and a new one put in its place on the next >>> update. >>> However this is not guaranteed, especially if there are long-lived >>> processes >>> using binary images in this partition. You'd still have to stop and >>> restart >>> all those processes. >>> >>> If reboots were acceptable, then the filesystem image could also be >>> stored >>> in ramdisk pulled in via pxeboot. This makes sense especially for >>> geom_uzip >>> where the data is pre-compressed. However I would still prefer to avoid >>> frequent reboots if at all possible. Also, whilst a ramdisk might be >>> OK for >>> the root filesystem, a typical CGI environment (with perl, php, ruby, >>> python, and loads of libraries) would probably be too large anyway. >>> >>> >>> 6. Journaling filesystem replication >>> ------------------------------------ >>> >>> If the data were stored on a journaling filesystem on the master box, >>> and >>> the journal logs were distributed out to the slaves, then they would all >>> have identical filesystem copies and only a minimal amount of data would >>> need to be pushed out to each machine on each change. (This would be >>> rather >>> like NetApps and their snap-mirroring system). 
However I'm not aware >>> of any >>> journaling filesystem for FreeBSD, let alone whether it would support >>> filesystem replication in this way. >> >> >> >> There is a project underway for UFSJ (UFS journaling). Maybe once it >> is complete, and bugs are ironed out, one could implement a journal >> distribution piece to send the journal updates to multiple hosts and >> achieve what you are thinking, however, that only distributes the >> meta-data, and not the actual data. >> >> > Have a look at dragonfly BSD for this. They are working on a journaling > filesystem that will do just that. Do you have a link to some information on this? I've been looking at Dragonfly, but I'm having trouble finding good information on what is already working, in planning, etc. Eric -- ------------------------------------------------------------------------ Eric Anderson Sr. Systems Administrator Centaur Technology Anything that works is better than anything that doesn't. ------------------------------------------------------------------------ From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 15:33:27 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3336B16A41F; Mon, 26 Sep 2005 15:33:27 +0000 (GMT) (envelope-from b.candler@pobox.com) Received: from leto.uk.clara.net (leto.uk.clara.net [80.168.69.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id B2F9F43D62; Mon, 26 Sep 2005 15:33:20 +0000 (GMT) (envelope-from b.candler@pobox.com) Received: from bloodhound.noc.clara.net ([195.8.70.207]) by leto.uk.clara.net with esmtp (Exim 4.43) id 1EJuyk-000HWz-Rg; Mon, 26 Sep 2005 16:33:18 +0100 Received: from personal by bloodhound.noc.clara.net with local (Exim 4.52 (FreeBSD)) id 1EJuyy-0006vz-Ff; Mon, 26 Sep 2005 16:33:32 +0100 Date: Mon, 26 Sep 2005 16:33:32 +0100 From: Brian Candler To: filip wuytack Message-ID: <20050926153332.GA26373@uk.tiscali.com> References: <20050924141025.GA1236@uk.tiscali.com> <4337DF56.6030407@centtech.com> <4337E8A7.6070107@wuytack.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4337E8A7.6070107@wuytack.net> User-Agent: Mutt/1.4.2.1i Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 15:33:27 -0000 On Mon, Sep 26, 2005 at 01:25:11PM +0100, filip wuytack wrote: > Have a look at dragonfly BSD for this. They are working on a journaling > filesystem that will do just that. Someone else mentioned that. However DragonFly BSD seems very short of documentation on the web; I did finally find some on-line manpages courtesy of google (couldn't find them linked from www.dragonflybsd.org) >From what I read, I also get the impression that the journalling feature is rather a work-in-progress right now. Regards, Brian. 
From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 18:16:38 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 77E7516A41F; Mon, 26 Sep 2005 18:16:38 +0000 (GMT) (envelope-from ike@lesmuug.org) Received: from beth.easthouston.org (dsl254-117-002.nyc1.dsl.speakeasy.net [216.254.117.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0CE6043D66; Mon, 26 Sep 2005 18:16:35 +0000 (GMT) (envelope-from ike@lesmuug.org) Received: from [192.168.1.22] (249-218.customer.cloud9.net [168.100.249.218]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by beth.easthouston.org (Postfix) with ESMTP id 932B2D9AC20; Mon, 26 Sep 2005 14:16:34 -0400 (EDT) In-Reply-To: <20050924141025.GA1236@uk.tiscali.com> References: <20050924141025.GA1236@uk.tiscali.com> Mime-Version: 1.0 (Apple Message framework v734) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Content-Transfer-Encoding: 7bit From: Isaac Levy Date: Mon, 26 Sep 2005 14:16:31 -0400 To: Brian Candler X-Mailer: Apple Mail (2.734) Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 18:16:38 -0000 Hi Brian, All, This email has one theme: GEOM! :) On Sep 24, 2005, at 10:10 AM, Brian Candler wrote: > Hello, > > I was wondering if anyone would care to share their experiences in > synchronising filesystems across a number of nodes in a cluster. I > can think > of a number of options, but before changing what I'm doing at the > moment I'd > like to see if anyone has good experiences with any of the others. > > The application: a clustered webserver. The users' CGIs run in a > chroot > environment, and these clearly need to be identical (otherwise a > CGI running > on one box would behave differently when running on a different box). > Ultimately I'd like to synchronise the host OS on each server too. > > Note that this is a single-master, multiple-slave type of filesystem > synchronisation I'm interested in. I just wanted to throw out some quick thoughts on a totally different approach which nobody has really explored in this thread, solutions which are production level software. (Sorry if I'm repeating things or giving out info yall' already know:) -- Geom: http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/geom- intro.html The core Disk IO framework for FreeBSD, as of 5.x, led by PHK: http://www.bsdcan.org/2004/papers/geom.pdf This framework itself is not as useful to you as the utilities which make use of it, -- Geom Gate: http://kerneltrap.org/news/freebsd?from=20 Network device-level client/server disk mapping tool. (VERY IMPORTANT COMPONENT, it's reportedly faster, and more stable than NFS has ever been- so people have immediately and happily deployed it in production systems!) -- Gvinum and Gmirror: Gmirror http://people.freebsd.org/~rse/mirror/ http://www.ie.freebsd.org/doc/en_US.ISO8859-1/books/handbook/geom.html (Sidenote: even Greg Lehey (original author of Vinum), has stated that it's better to use Geom-based tools than Vinum for the forseeable future.) 
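To make the above concrete, a much-simplified sketch of combining ggated/ggatec with gmirror; the addresses, device names and paths are made-up examples, and (as noted elsewhere in this thread) the remote copy is only safe to mount after the master has stopped writing to it:

   # On the slave box: export a spare partition read-write to the master.
   # ggated(8) reads /etc/gg.exports by default.
   echo "10.0.0.1 RW /dev/ad1s1e" > /etc/gg.exports
   ggated

   # On the master box: attach the slave's partition over the network;
   # it appears as a local disk device (here /dev/ggate0).
   kldload geom_gate      # if not compiled into the kernel
   kldload geom_mirror
   ggatec create -u 0 10.0.0.2 /dev/ad1s1e

   # Mirror the master's local data partition with the remote one, so
   # every write on the master is replicated at block level.  In practice
   # you would also pick a balance algorithm that keeps reads local.
   gmirror label -v gm0 /dev/ad0s1e /dev/ggate0
   newfs /dev/mirror/gm0
   mount /dev/mirror/gm0 /webroot/cgi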
-- In a nutshell, to address your needs, let me toss out the following example setup: I know of one web-shop in Canada, which is running 2 machines for every virtual cluster, in the following configuration: 2 servers, 4 SATA drives per box, quad copper/ethernet gigabit nic on each box each drive is mirrored using gmirror, over each of the gigabit ethernet nics each box is running Vinum Raid5 across the 4 mirrored drives The drives are then sliced appropriately, and server resources are distributed across the boxes- with various slices mounted on each box. The folks I speak of simply have a suite of failover shell scripts prepared, in the event of a machine experiencing total hardware failure. Pretty tough stuff, very high-performance, and CHEAP. -- With that, I'm working towards similar setups, oriented around redundant jailed systems, with an eventual end to tie CARP (from pf) into the mix to make for nearly-instantaneous jailed failover redundancy- (but it's going to be some time before I have what I want worked out for production on my own). Regardless, it's worth tapping into the GEOM dialogues, as there are many new ways of working with disks coming into existence- and the GEOM framework itself provides an EXTREMELY solid base to bring 'exotic' disk configurations up to production level quickly. (Also noteworthy, there's a couple of encrypted disk systems based on GEOM emerging now too...) -- Hope all that helps, Best, .ike From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 20:38:09 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2F96F16A496; Mon, 26 Sep 2005 20:38:09 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from mh1.centtech.com (moat3.centtech.com [207.200.51.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6D82C43D58; Mon, 26 Sep 2005 20:38:07 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from [10.177.171.220] (neutrino.centtech.com [10.177.171.220]) by mh1.centtech.com (8.13.1/8.13.1) with ESMTP id j8QKc6nK077563; Mon, 26 Sep 2005 15:38:06 -0500 (CDT) (envelope-from anderson@centtech.com) Message-ID: <43385C29.5060406@centtech.com> Date: Mon, 26 Sep 2005 15:38:01 -0500 From: Eric Anderson User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.11) Gecko/20050914 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Isaac Levy References: <20050924141025.GA1236@uk.tiscali.com> In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.82/1102/Sun Sep 25 09:04:56 2005 on mh1.centtech.com X-Virus-Status: Clean Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 20:38:09 -0000 Isaac Levy wrote: > Hi Brian, All, > > This email has one theme: GEOM! :) > > On Sep 24, 2005, at 10:10 AM, Brian Candler wrote: > >> Hello, >> >> I was wondering if anyone would care to share their experiences in >> synchronising filesystems across a number of nodes in a cluster. I >> can think >> of a number of options, but before changing what I'm doing at the >> moment I'd >> like to see if anyone has good experiences with any of the others. 
>> >> The application: a clustered webserver. The users' CGIs run in a chroot >> environment, and these clearly need to be identical (otherwise a CGI >> running >> on one box would behave differently when running on a different box). >> Ultimately I'd like to synchronise the host OS on each server too. >> >> Note that this is a single-master, multiple-slave type of filesystem >> synchronisation I'm interested in. > > > I just wanted to throw out some quick thoughts on a totally different > approach which nobody has really explored in this thread, solutions > which are production level software. (Sorry if I'm repeating things or > giving out info yall' already know:) > > -- > Geom: > http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/geom- intro.html > > The core Disk IO framework for FreeBSD, as of 5.x, led by PHK: > http://www.bsdcan.org/2004/papers/geom.pdf > > This framework itself is not as useful to you as the utilities which > make use of it, > > -- > Geom Gate: > http://kerneltrap.org/news/freebsd?from=20 > > Network device-level client/server disk mapping tool. > (VERY IMPORTANT COMPONENT, it's reportedly faster, and more stable than > NFS has ever been- so people have immediately and happily deployed it > in production systems!) > > -- > Gvinum and Gmirror: > > Gmirror > http://people.freebsd.org/~rse/mirror/ > http://www.ie.freebsd.org/doc/en_US.ISO8859-1/books/handbook/geom.html > > (Sidenote: even Greg Lehey (original author of Vinum), has stated that > it's better to use Geom-based tools than Vinum for the forseeable future.) > > -- > In a nutshell, to address your needs, let me toss out the following > example setup: > > I know of one web-shop in Canada, which is running 2 machines for every > virtual cluster, in the following configuration: > > 2 servers, > 4 SATA drives per box, > quad copper/ethernet gigabit nic on each box > > each drive is mirrored using gmirror, over each of the gigabit ethernet > nics > each box is running Vinum Raid5 across the 4 mirrored drives > > The drives are then sliced appropriately, and server resources are > distributed across the boxes- with various slices mounted on each box. > The folks I speak of simply have a suite of failover shell scripts > prepared, in the event of a machine experiencing total hardware failure. > > Pretty tough stuff, very high-performance, and CHEAP. > > -- > With that, I'm working towards similar setups, oriented around > redundant jailed systems, with an eventual end to tie CARP (from pf) > into the mix to make for nearly-instantaneous jailed failover > redundancy- (but it's going to be some time before I have what I want > worked out for production on my own). > > Regardless, it's worth tapping into the GEOM dialogues, as there are > many new ways of working with disks coming into existence- and the GEOM > framework itself provides an EXTREMELY solid base to bring 'exotic' > disk configurations up to production level quickly. > (Also noteworthy, there's a couple of encrypted disk systems based on > GEOM emerging now too...) I think the original poster (and I at least) knew about this already, but what I still fail to see is how you can get several machines using the same data at the same time, and still do updates to that data? The only way I know of is to use a syncing tool (like rsync) or a shared filesystem (like NFS, or CXFS, or Polyserve FS, opengfs, etc), none of which run on FreeBSD. 
What I read from above, is a redundant server setup, not a high-performance setup (meaning multiple machines serving the same data to many clients). If I'm missing something, please fill me in.. Eric -- ------------------------------------------------------------------------ Eric Anderson Sr. Systems Administrator Centaur Technology Anything that works is better than anything that doesn't. ------------------------------------------------------------------------ From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 21:27:34 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8757816A41F; Mon, 26 Sep 2005 21:27:34 +0000 (GMT) (envelope-from ike@lesmuug.org) Received: from beth.easthouston.org (dsl254-117-002.nyc1.dsl.speakeasy.net [216.254.117.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2C02F43D48; Mon, 26 Sep 2005 21:27:34 +0000 (GMT) (envelope-from ike@lesmuug.org) Received: from [192.168.1.22] (249-218.customer.cloud9.net [168.100.249.218]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by beth.easthouston.org (Postfix) with ESMTP id 5115BD9B887; Mon, 26 Sep 2005 17:27:33 -0400 (EDT) In-Reply-To: <43385C29.5060406@centtech.com> References: <20050924141025.GA1236@uk.tiscali.com> <43385C29.5060406@centtech.com> Mime-Version: 1.0 (Apple Message framework v734) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Content-Transfer-Encoding: 7bit From: Isaac Levy Date: Mon, 26 Sep 2005 17:27:30 -0400 To: Eric Anderson X-Mailer: Apple Mail (2.734) Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 21:27:34 -0000 Hi Eric, All, On Sep 26, 2005, at 4:38 PM, Eric Anderson wrote: > I think the original poster (and I at least) knew about this > already, but what I still fail to see is how you can get several > machines using the same data at the same time, and still do updates > to that data? The only way I know of is to use a syncing tool > (like rsync) or a shared filesystem (like NFS, or CXFS, or > Polyserve FS, opengfs, etc), none of which run on FreeBSD. Gotcha, I did skip somewhat to the side of the original requirements, > > What I read from above, is a redundant server setup, not a high- > performance setup (meaning multiple machines serving the same data > to many clients). If I'm missing something, please fill me in.. I'm not certain that my intention was to provide the best answer, but to provide yet another set of tools to get the job done. In effect, a terse example of how someone could use the Geom tools I mentioned, to meet this requirement: + Setup mirrored disks across machines as discussed before + Mount a slice of that disk Read/Write on one machine (acting as master) + Mount that same slice Readonly on both machines, using Geom Gate, and serve data from there. - If the master machine dies, mount the volume Read/Write on the other machine I'm not certain if this meets the requirements precisely, but I believe there may be a combination of these Geom-based utilities which would- and they are all actively under continued development. 
-- Eric, you are definately correct, that there's not really a disk- level mechanism to maintain concurrent writes between volumes mounted across servers using FreeBSD (excepting NFS, which in this context, makes me say *yuck*). Anyone with some spare time want to take up this problem as a new Geom project? ;) However, based on my experiences with distributed database clusters, I believe it's fair to say that any persistent data (writes) are a very difficult task to get done right across a cluster- and maintain contextually sane levels of performance, (due to resource locking issues, mixed with network latency, etc...) I guess I'm saying this is a big-picture computing problem IMHO, and I don't know of a good solution here (though I'm curious about what kind of work has been done in Dragonfly which is relevant?) > > Eric > -- Got a spare NetApp anyone? My head hurts. :) Best, .ike From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 21:39:36 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3541716A41F; Mon, 26 Sep 2005 21:39:36 +0000 (GMT) (envelope-from b.candler@pobox.com) Received: from orb.pobox.com (orb.pobox.com [207.8.226.5]) by mx1.FreeBSD.org (Postfix) with ESMTP id B6C4E43D48; Mon, 26 Sep 2005 21:39:35 +0000 (GMT) (envelope-from b.candler@pobox.com) Received: from orb (localhost [127.0.0.1]) by orb.pobox.com (Postfix) with ESMTP id 8C4871DDA; Mon, 26 Sep 2005 17:39:56 -0400 (EDT) Received: from billdog.local.linnet.org (dsl-212-74-113-66.access.uk.tiscali.com [212.74.113.66]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by orb.sasl.smtp.pobox.com (Postfix) with ESMTP id 30976A2; Mon, 26 Sep 2005 17:39:54 -0400 (EDT) Received: from brian by billdog.local.linnet.org with local (Exim 4.50 (FreeBSD)) id 1EK0kf-0000Cx-U9; Mon, 26 Sep 2005 22:43:09 +0100 Date: Mon, 26 Sep 2005 22:43:09 +0100 From: Brian Candler To: Isaac Levy Message-ID: <20050926214309.GA766@uk.tiscali.com> References: <20050924141025.GA1236@uk.tiscali.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.1i Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 21:39:36 -0000 On Mon, Sep 26, 2005 at 02:16:31PM -0400, Isaac Levy wrote: > I just wanted to throw out some quick thoughts on a totally different > approach which nobody has really explored in this thread Geom (gmirror plus ggated/ggatec) was what I suggested for syncing two NFS servers (my option 2) or for direct synchronisation of the clients' filesystems to the servers (my option 4). The problem occurs when a client actually *mounts* and uses the mirrored copy, rather than just keeping a mirrored copy for resilience. > Geom Gate: > http://kerneltrap.org/news/freebsd?from=20 > > Network device-level client/server disk mapping tool. > (VERY IMPORTANT COMPONENT, it's reportedly faster, and more stable > than NFS has ever been- so people have immediately and happily > deployed it in production systems!) NFS and geom gate are two different things, so you can't really compare them directly. 
NFS shares files; geom gate shares a block-level device. With NFS you can have one server and multiple clients, and the clients can access the filesystem read-write. With geom gate you just have remote access to a disk partition, and essentially you can only do what you could do with a local block device.

Incidentally, NFS has been *hugely* dependable for me in production environments. However, I've always used expensive and beefy NFS servers (NetApp) whilst FreeBSD is just the client.

> I know of one web-shop in Canada, which is running 2 machines for
> every virtual cluster, in the following configuration:
>
> 2 servers,
> 4 SATA drives per box,
> quad copper/ethernet gigabit nic on each box
>
> each drive is mirrored using gmirror, over each of the gigabit
> ethernet nics
> each box is running Vinum Raid5 across the 4 mirrored drives
>
> The drives are then sliced appropriately, and server resources are
> distributed across the boxes - with various slices mounted on each box.
> The folks I speak of simply have a suite of failover shell scripts
> prepared, in the event of a machine experiencing total hardware failure.

Right. But unless I'm mistaken, the remote mirrors are just backup copies of the data. Those remote mirrors are not actually *mounted* as filesystems.

I think you're talking about a master/slave failover scenario. With careful arrangement, machine 1 can be master for dataset A and slave for dataset B, while machine 2 is slave for A and master for B, so you're not wasting your second machine. If machine 1 fails, machine 2 can take over both datasets. That's fine.

However, what I need is for dataset A to be generated on machine 1 and identical copies to be available on machines 2, 3, 4, 5...9. Not just *stored* there, but actually *used* there, as live read-only copies. So if machine 1 makes a change to the dataset, all the other machines notice the change properly and start using it immediately.

From what I've heard, I can't use gmirror from machine 1 to machines 2-9, because you can't mount a filesystem read-only while some other machine magically updates the blocks from under its nose. The filesystem gets confused because its local caches of blocks and inodes become out of date when the data in the block device changes.

> Regardless, it's worth tapping into the GEOM dialogues

GEOM is definitely cool, and a strong selling point for moving from 4.x to 5.x.

Regards,

Brian.
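As an aside, Brian's option 2 - keeping a block-identical but unmounted copy of the active NFS server's data on a standby box - might be assembled roughly as follows. This is only a sketch: the host names, addresses and the da0s1d/da1s1d partitions are invented, and the two partitions need to be (roughly) the same size.

    # on NFS server 2 (the standby): export a spare partition read-write
    # to server 1, and do NOT mount it locally while server 1 is writing
    echo "192.168.0.1 RW /dev/da1s1d" >> /etc/gg.exports
    ggated

    # on NFS server 1 (the active server): mirror a local partition onto
    # the exported remote one, then build and serve the filesystem
    kldload geom_gate geom_mirror
    ggatec create -u 0 server2.example.net /dev/da1s1d    # appears as /dev/ggate0
    gmirror label -v gm0 /dev/da0s1d /dev/ggate0
    newfs /dev/mirror/gm0
    mount /dev/mirror/gm0 /export                         # NFS-export /export from here

If server 1 dies, server 2 can load geom_mirror, let its surviving component come up as a degraded /dev/mirror/gm0, fsck and mount it, and take over the service address with failover scripts - which is the master/slave scenario described above, not the many-live-read-only-copies setup Brian is after.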
From owner-freebsd-cluster@FreeBSD.ORG Tue Sep 27 11:25:29 2005
Date: Tue, 27 Sep 2005 13:25:26 +0200 (CEST)
Message-Id: <200509271125.j8RBPQxk094107@lurza.secnetix.de>
From: Oliver Fromme
To: freebsd-cluster@FreeBSD.ORG
Subject: Re: Options for synchronising filesystems

Isaac Levy wrote:
> In effect, a terse example of how someone could use the Geom tools I
> mentioned to meet this requirement:
>
> + Set up mirrored disks across machines as discussed before
> + Mount a slice of that disk read/write on one machine (acting as
>   master)
> + Mount that same slice read-only on both machines, using Geom Gate,
>   and serve data from there

That doesn't work. You would need a cache coherency protocol for that setup to work correctly. Or mount it read-only _everywhere_ (including the master). If you have to perform updates, you would have to perform this sequence (a rough command sketch follows below):

1. remount the master read-write,
2. do the updates,
3. remount the master read-only,
4. flush the caches on the slaves (umount; mount).

But that means you'll have a short downtime each time you update things -- probably not what you want, especially if a redundant setup is the goal.

Currently, the _only_ way to mount the same filesystem on multiple FreeBSD systems is NFS (or third-party software like CODA). Another alternative, as others have mentioned, is to duplicate or synchronize the filesystems on each server regularly (rsync, unison, whatever).

> Eric, you are definitely correct that there's not really a disk-level
> mechanism to maintain concurrent writes between volumes mounted
> across servers using FreeBSD

Not even concurrent reads and writes (i.e. one host writes and the others read). See above.

> (excepting NFS, which in this context makes me say *yuck*).

NFS certainly has its disadvantages, but it works pretty well when set up in a reasonable way.

> I guess I'm saying this is a big-picture computing problem IMHO, and
> I don't know of a good solution here (though I'm curious about what
> kind of work has been done in DragonFly which is relevant?)

So far, DragonFly BSD hasn't done anything regarding cache coherency for clusters, though I think it is planned for the future. So, as of today, DragonFly doesn't provide a solution for the above problem either.
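To make the remount sequence above concrete, here is a minimal sketch. The /data mount point and the ggate device are placeholders, and step 3 only succeeds if nothing on the master still holds files open for writing:

    # on the master
    mount -u -o rw /data            # 1. remount read-write
    # 2. ... apply the updates ...
    mount -u -o ro /data            # 3. remount read-only

    # on each slave: flush the stale block/inode caches
    umount /data                    # 4a. drop the old mount
    mount -o ro /dev/ggate0 /data   # 4b. mount again, read-only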
However, Matt Dillon has worked on a journalling feature for UFS which might be helpful for the situation given by the OP. Not everything is implemented yet, but it _is_ already usable to maintain remote mirrors. On July 5th, Matt wrote:

"I can now run a buildworld loop, and the mirror generated by the journal stays in synch (diff -r reports no differences if I idle the buildworld, wait a second or two for the journal to catch up, and run it)."

Best regards
   Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Services with a focus on FreeBSD: http://www.secnetix.de/bsd

Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

It's trivial to make fun of Microsoft products, but it takes a real
man to make them work, and a God to make them do anything useful.