From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 11:45:34 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 09AB016A41F; Mon, 26 Sep 2005 11:45:34 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from mh2.centtech.com (moat3.centtech.com [207.200.51.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 38DB143D48; Mon, 26 Sep 2005 11:45:32 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from [10.177.171.220] (neutrino.centtech.com [10.177.171.220]) by mh2.centtech.com (8.13.1/8.13.1) with ESMTP id j8QBjU66046492; Mon, 26 Sep 2005 06:45:31 -0500 (CDT) (envelope-from anderson@centtech.com) Message-ID: <4337DF56.6030407@centtech.com> Date: Mon, 26 Sep 2005 06:45:26 -0500 From: Eric Anderson User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.11) Gecko/20050914 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Brian Candler References: <20050924141025.GA1236@uk.tiscali.com> In-Reply-To: <20050924141025.GA1236@uk.tiscali.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.82/1102/Sun Sep 25 09:04:56 2005 on mh2.centtech.com X-Virus-Status: Clean Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 11:45:34 -0000 Brian Candler wrote: > Hello, > > I was wondering if anyone would care to share their experiences in > synchronising filesystems across a number of nodes in a cluster. I can think > of a number of options, but before changing what I'm doing at the moment I'd > like to see if anyone has good experiences with any of the others. > > The application: a clustered webserver. The users' CGIs run in a chroot > environment, and these clearly need to be identical (otherwise a CGI running > on one box would behave differently when running on a different box). > Ultimately I'd like to synchronise the host OS on each server too. > > Note that this is a single-master, multiple-slave type of filesystem > synchronisation I'm interested in. > > > 1. Keep a master image on an admin box, and rsync it out to the frontends > ------------------------------------------------------------------------- > > This is what I'm doing at the moment. Install a master image in > /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and > rsync it. [Actually I'm exporting it using NFS, and the frontends run rsync > locally when required to update their local copies against the NFS master] > > Disadvantages: > > - rsyncing a couple of gigs of data is not particularly fast, even when only > a few files have changed > > - if a sysadmin (wrongly) changes a file on a front-end instead of on the > master copy in the admin box, then the change will be lost when the next > rsync occurs. They might think they've fixed a problem, and then (say) 24 > hours later their change is wiped. However if this is a config file, the > fact that the old file has been reinstated might not be noticed until the > daemon is restarted or the box rebooted - maybe months later. This I think > is the biggest fundamental problem. 
> > - files can be added locally and they will remain indefinitely (unless we > use rsync --delete which is a bit scary). If this is done then adding a new > machine into the cluster by rsyncing from the master will not pick up these > extra files. > > So, here are the alternatives I'm considering, and I'd welcome any > additional suggestions too. Here's a few ideas on this: do multiple rsyncs, one for each top level directory. That might speed up your total rsync process. Another similar method is using a content revisioning system. This is only good for some cases, but something like subversion might work ok here. > 2. Run the images directly off NFS > ---------------------------------- > > I've had this running before, even the entire O/S, and it works just fine. > However the NFS server itself then becomes a critical > single-point-of-failure: if it has to be rebooted and is out of service for > 2 minutes, then the whole cluster is out of service for that time. > > I think this is only feasible if I can build a highly-available NFS server, > which really means a pair of boxes serving the same data. Since the system > image is read-only from the point of view of the frontends, this should be > easy enough: > > frontends frontends > | | | | | | > NFS -----------> NFS > server 1 sync server 2 > > As far as I know, NFS clients don't support the idea of failing over from > one server to another, so I'd have to make a server pair which transparently > fails over. > > I could make one NFS server take over the other server's IP address using > carp or vrrp. However, I suspect that the clients might notice. I know that > NFS is 'stateless' in the sense that a server can be rebooted, but for a > client to be redirected from one server to the other, I expect that these > filesytems would have to be *identical*, down to the level of the inode > numbers being the same. > > If that's true, then rsync between the two NFS servers won't cut it. I was > thinking of perhaps using geom_mirror plus ggated/ggatec to make a > block-identical read-only mirror image on NFS server 2 - this also has the > advantage that any updates are close to instantaneous. > > What worries me here is how NFS server 2, which has the mirrored filesystem > mounted read-only, will take to having the data changed under its nose. Does > it for example keep caches of inodes in memory, and what would happen if > those inodes on disk were to change? I guess I can always just unmount and > remount the filesystem on NFS server 2 after each change. I've tried doing something similar. I used fiber attached storage, and had multiple hosts mounting the same partition. It seemed as though when host A mounted the filesystem read-write, and then host B mounted it read-only, any changes made by host A were not seen by B, and even remounting did not always bring it up to current state. I believe it has to do with the buffer cache and host A's desire to keep things (like inode changes, block maps, etc) in cache and not write them to disk. FreeBSD does not currently have a multi-system cache coherency protocol to distribute that information to other hosts. This is something I think would be very useful for many people. I suppose you could just mount the filesystem when you know a change has happened, but you still may not see the change. Maybe mounting the filesystem on host A with the sync option would help. 
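To make the per-directory rsync idea above concrete, here is a minimal sketch; the master and local paths are hypothetical, and it assumes the master copy is reachable from the frontend (e.g. via the NFS mount from option 1):

   #!/bin/sh
   # One rsync per top-level directory of the master image, run in
   # parallel so a change in one subtree does not wait on a full walk
   # of all the others.  Paths below are examples only.
   MASTER=/master/webroot/cgi    # master copy (e.g. NFS-mounted)
   LOCAL=/webroot/cgi            # local copy on this frontend

   for dir in "$MASTER"/*; do
       [ -d "$dir" ] || continue          # top-level plain files would need one extra rsync
       name=$(basename "$dir")
       # add --delete only if wiping locally-added files is acceptable
       rsync -a "$dir/" "$LOCAL/$name/" &
   done
   wait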
> My other concern is about susceptibility to DoS-type attacks: if one > frontend were to go haywire and start hammering the NFS servers really hard, > it could impact on all the other machines in the cluster. > > However, the problems of data synchronisation are solved: any change made on > the NFS server is visible identically to all front-ends, and sysadmins can't > make changes on the front-ends because the NFS export is read-only. This was my first thought too, and a highly available NFS server is something any NFS heavy installation wants (needs). There are a few implementations of clustered filesystems out there, but non for FreeBSD (yet). What that allows is multiple machines talking to a shared storage with read/write access. Very handy, but since you only need read-only access, I think your problem is much simpler, and you can get away with a lot less. > 3. Use a network distributed filesystem - CODA? AFS? > ---------------------------------------------------- > > If each frontend were to access the filesystem as a read-only network mount, > but have a local copy to work with in the case of disconnected operation, > then the SPOF of an NFS server would be eliminated. > > However, I have no experience with CODA, and although it's been in the tree > since 2002, the README's don't inspire confidence: > > "It is mostly working, but hasn't been run long enough to be sure all the > bugs are sorted out. ... This code is not SMP ready" > > Also, a local cache is no good if the data you want during disconnected > operation is not in the cache at that time, which I think means this idea is > not actually a very good one. There is also a port for coda. I've been reading about this, and it's an interesting filesystem, but I'm just not sure of it's usefulness yet. > 4. Mount filesystems read-only > ------------------------------ > > On each front-end I could store /webroot/cgi on a filesystem mounted > read-only to prevent tampering (as long as the sysadmin doesn't remount it > read-write of course). That would work reasonably well, except that being > mounted read-only I couldn't use rsync to update it! > > It might also work with geom_mirror and ggated/ggatec, except for the issue > I raised before about changing blocks on a filesystem under the nose of a > client who is actively reading from it. I suppose you could mount r/w only when doing the rsync, then switch back to ro once complete. You should be able to do this online, without any issues or taking the filesystem offline. > 5. Using a filesystem which really is read-only > ----------------------------------------------- > > Better tamper-protection could be had by keeping data in a filesystem > structure which doesn't support any updates at all - such as cd9660 or > geom_uzip. > > The issue here is how to roll out a new version of the data. I could push > out a new filesystem image into a second partition, but it would then be > necessary to unmount the old filesystem and remount the new on the same > place, and you can't really unmount a filesystem which is in use. So this > would require a reboot. > > I was thinking that some symlink trickery might help: > > /webroot/cgi -> /webroot/cgi1 > /webroot/cgi1 # filesystem A mounted here > /webroot/cgi2 # filesystem B mounted here > > It should be possible to unmount /webroot/cgi2, dd in a new image, remount > it, and change the symlink to point to /webroot/cgi2. 
After a little while, > hopefully all the applications will stop using files in /webroot/cgi1, so > this one can be unmounted and a new one put in its place on the next update. > However this is not guaranteed, especially if there are long-lived processes > using binary images in this partition. You'd still have to stop and restart > all those processes. > > If reboots were acceptable, then the filesystem image could also be stored > in ramdisk pulled in via pxeboot. This makes sense especially for geom_uzip > where the data is pre-compressed. However I would still prefer to avoid > frequent reboots if at all possible. Also, whilst a ramdisk might be OK for > the root filesystem, a typical CGI environment (with perl, php, ruby, > python, and loads of libraries) would probably be too large anyway. > > > 6. Journaling filesystem replication > ------------------------------------ > > If the data were stored on a journaling filesystem on the master box, and > the journal logs were distributed out to the slaves, then they would all > have identical filesystem copies and only a minimal amount of data would > need to be pushed out to each machine on each change. (This would be rather > like NetApps and their snap-mirroring system). However I'm not aware of any > journaling filesystem for FreeBSD, let alone whether it would support > filesystem replication in this way. There is a project underway for UFSJ (UFS journaling). Maybe once it is complete, and bugs are ironed out, one could implement a journal distribution piece to send the journal updates to multiple hosts and achieve what you are thinking, however, that only distributes the meta-data, and not the actual data. Good luck finding your ultimate solution! Eric -- ------------------------------------------------------------------------ Eric Anderson Sr. Systems Administrator Centaur Technology Anything that works is better than anything that doesn't. ------------------------------------------------------------------------ From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 12:25:37 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id B5DB616A41F for ; Mon, 26 Sep 2005 12:25:37 +0000 (GMT) (envelope-from filip@wuytack.net) Received: from london.wuytack.net (host-84-9-106-97.bulldogdsl.com [84.9.106.97]) by mx1.FreeBSD.org (Postfix) with ESMTP id B1DB943D49 for ; Mon, 26 Sep 2005 12:25:36 +0000 (GMT) (envelope-from filip@wuytack.net) Received: (qmail 99551 invoked by uid 1003); 26 Sep 2005 12:42:58 -0000 Received: from filip@wuytack.net by london.wuytack.net by uid 89 with qmail-scanner-1.22 (clamscan: 0.71. spamassassin: 2.63. Clear:RC:1(82.110.72.114):. Processed in 5.41758 secs); 26 Sep 2005 12:42:58 -0000 Received: from unknown (HELO ?127.0.0.1?) 
(filip@wuytack.net@82.110.72.114) by 10.11.12.4 with SMTP; 26 Sep 2005 12:42:52 -0000 Message-ID: <4337E8A7.6070107@wuytack.net> Date: Mon, 26 Sep 2005 13:25:11 +0100 From: filip wuytack User-Agent: Mozilla Thunderbird 1.0.6 (Windows/20050716) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Eric Anderson References: <20050924141025.GA1236@uk.tiscali.com> <4337DF56.6030407@centtech.com> In-Reply-To: <4337DF56.6030407@centtech.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org, Brian Candler Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 12:25:37 -0000 Eric Anderson wrote: > Brian Candler wrote: > >> Hello, >> >> I was wondering if anyone would care to share their experiences in >> synchronising filesystems across a number of nodes in a cluster. I can >> think >> of a number of options, but before changing what I'm doing at the >> moment I'd >> like to see if anyone has good experiences with any of the others. >> >> The application: a clustered webserver. The users' CGIs run in a chroot >> environment, and these clearly need to be identical (otherwise a CGI >> running >> on one box would behave differently when running on a different box). >> Ultimately I'd like to synchronise the host OS on each server too. >> >> Note that this is a single-master, multiple-slave type of filesystem >> synchronisation I'm interested in. >> >> >> 1. Keep a master image on an admin box, and rsync it out to the frontends >> ------------------------------------------------------------------------- >> >> This is what I'm doing at the moment. Install a master image in >> /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and >> rsync it. [Actually I'm exporting it using NFS, and the frontends run >> rsync >> locally when required to update their local copies against the NFS >> master] >> >> Disadvantages: >> >> - rsyncing a couple of gigs of data is not particularly fast, even >> when only >> a few files have changed >> >> - if a sysadmin (wrongly) changes a file on a front-end instead of on the >> master copy in the admin box, then the change will be lost when the next >> rsync occurs. They might think they've fixed a problem, and then (say) 24 >> hours later their change is wiped. However if this is a config file, the >> fact that the old file has been reinstated might not be noticed until the >> daemon is restarted or the box rebooted - maybe months later. This I >> think >> is the biggest fundamental problem. >> >> - files can be added locally and they will remain indefinitely (unless we >> use rsync --delete which is a bit scary). If this is done then adding >> a new >> machine into the cluster by rsyncing from the master will not pick up >> these >> extra files. >> >> So, here are the alternatives I'm considering, and I'd welcome any >> additional suggestions too. > > > Here's a few ideas on this: do multiple rsyncs, one for each top level > directory. That might speed up your total rsync process. Another > similar method is using a content revisioning system. This is only good > for some cases, but something like subversion might work ok here. > > > >> 2. 
Run the images directly off NFS >> ---------------------------------- >> >> I've had this running before, even the entire O/S, and it works just >> fine. >> However the NFS server itself then becomes a critical >> single-point-of-failure: if it has to be rebooted and is out of >> service for >> 2 minutes, then the whole cluster is out of service for that time. >> >> I think this is only feasible if I can build a highly-available NFS >> server, >> which really means a pair of boxes serving the same data. Since the >> system >> image is read-only from the point of view of the frontends, this >> should be >> easy enough: >> >> frontends frontends >> | | | | | | >> NFS -----------> NFS >> server 1 sync server 2 >> >> As far as I know, NFS clients don't support the idea of failing over from >> one server to another, so I'd have to make a server pair which >> transparently >> fails over. >> >> I could make one NFS server take over the other server's IP address using >> carp or vrrp. However, I suspect that the clients might notice. I know >> that >> NFS is 'stateless' in the sense that a server can be rebooted, but for a >> client to be redirected from one server to the other, I expect that these >> filesytems would have to be *identical*, down to the level of the inode >> numbers being the same. >> >> If that's true, then rsync between the two NFS servers won't cut it. I >> was >> thinking of perhaps using geom_mirror plus ggated/ggatec to make a >> block-identical read-only mirror image on NFS server 2 - this also has >> the >> advantage that any updates are close to instantaneous. >> >> What worries me here is how NFS server 2, which has the mirrored >> filesystem >> mounted read-only, will take to having the data changed under its >> nose. Does >> it for example keep caches of inodes in memory, and what would happen if >> those inodes on disk were to change? I guess I can always just unmount >> and >> remount the filesystem on NFS server 2 after each change. > > > I've tried doing something similar. I used fiber attached storage, and > had multiple hosts mounting the same partition. It seemed as though > when host A mounted the filesystem read-write, and then host B mounted > it read-only, any changes made by host A were not seen by B, and even > remounting did not always bring it up to current state. I believe it > has to do with the buffer cache and host A's desire to keep things (like > inode changes, block maps, etc) in cache and not write them to disk. > FreeBSD does not currently have a multi-system cache coherency protocol > to distribute that information to other hosts. This is something I > think would be very useful for many people. I suppose you could just > mount the filesystem when you know a change has happened, but you still > may not see the change. Maybe mounting the filesystem on host A with > the sync option would help. > >> My other concern is about susceptibility to DoS-type attacks: if one >> frontend were to go haywire and start hammering the NFS servers really >> hard, >> it could impact on all the other machines in the cluster. >> >> However, the problems of data synchronisation are solved: any change >> made on >> the NFS server is visible identically to all front-ends, and sysadmins >> can't >> make changes on the front-ends because the NFS export is read-only. > > > This was my first thought too, and a highly available NFS server is > something any NFS heavy installation wants (needs). 
There are a few > implementations of clustered filesystems out there, but non for FreeBSD > (yet). What that allows is multiple machines talking to a shared > storage with read/write access. Very handy, but since you only need > read-only access, I think your problem is much simpler, and you can get > away with a lot less. > > >> 3. Use a network distributed filesystem - CODA? AFS? >> ---------------------------------------------------- >> >> If each frontend were to access the filesystem as a read-only network >> mount, >> but have a local copy to work with in the case of disconnected operation, >> then the SPOF of an NFS server would be eliminated. >> >> However, I have no experience with CODA, and although it's been in the >> tree >> since 2002, the README's don't inspire confidence: >> >> "It is mostly working, but hasn't been run long enough to be sure >> all the >> bugs are sorted out. ... This code is not SMP ready" >> >> Also, a local cache is no good if the data you want during disconnected >> operation is not in the cache at that time, which I think means this >> idea is >> not actually a very good one. > > > There is also a port for coda. I've been reading about this, and it's > an interesting filesystem, but I'm just not sure of it's usefulness yet. > > >> 4. Mount filesystems read-only >> ------------------------------ >> >> On each front-end I could store /webroot/cgi on a filesystem mounted >> read-only to prevent tampering (as long as the sysadmin doesn't >> remount it >> read-write of course). That would work reasonably well, except that being >> mounted read-only I couldn't use rsync to update it! >> >> It might also work with geom_mirror and ggated/ggatec, except for the >> issue >> I raised before about changing blocks on a filesystem under the nose of a >> client who is actively reading from it. > > > I suppose you could mount r/w only when doing the rsync, then switch > back to ro once complete. You should be able to do this online, without > any issues or taking the filesystem offline. > > >> 5. Using a filesystem which really is read-only >> ----------------------------------------------- >> >> Better tamper-protection could be had by keeping data in a filesystem >> structure which doesn't support any updates at all - such as cd9660 or >> geom_uzip. >> >> The issue here is how to roll out a new version of the data. I could push >> out a new filesystem image into a second partition, but it would then be >> necessary to unmount the old filesystem and remount the new on the same >> place, and you can't really unmount a filesystem which is in use. So this >> would require a reboot. >> >> I was thinking that some symlink trickery might help: >> >> /webroot/cgi -> /webroot/cgi1 >> /webroot/cgi1 # filesystem A mounted here >> /webroot/cgi2 # filesystem B mounted here >> >> It should be possible to unmount /webroot/cgi2, dd in a new image, >> remount >> it, and change the symlink to point to /webroot/cgi2. After a little >> while, >> hopefully all the applications will stop using files in /webroot/cgi1, so >> this one can be unmounted and a new one put in its place on the next >> update. >> However this is not guaranteed, especially if there are long-lived >> processes >> using binary images in this partition. You'd still have to stop and >> restart >> all those processes. >> >> If reboots were acceptable, then the filesystem image could also be >> stored >> in ramdisk pulled in via pxeboot. This makes sense especially for >> geom_uzip >> where the data is pre-compressed. 
However I would still prefer to avoid >> frequent reboots if at all possible. Also, whilst a ramdisk might be >> OK for >> the root filesystem, a typical CGI environment (with perl, php, ruby, >> python, and loads of libraries) would probably be too large anyway. >> >> >> 6. Journaling filesystem replication >> ------------------------------------ >> >> If the data were stored on a journaling filesystem on the master box, and >> the journal logs were distributed out to the slaves, then they would all >> have identical filesystem copies and only a minimal amount of data would >> need to be pushed out to each machine on each change. (This would be >> rather >> like NetApps and their snap-mirroring system). However I'm not aware >> of any >> journaling filesystem for FreeBSD, let alone whether it would support >> filesystem replication in this way. > > > There is a project underway for UFSJ (UFS journaling). Maybe once it > is complete, and bugs are ironed out, one could implement a journal > distribution piece to send the journal updates to multiple hosts and > achieve what you are thinking, however, that only distributes the > meta-data, and not the actual data. > > Have a look at dragonfly BSD for this. They are working on a journaling filesystem that will do just that. ~ Fil > Good luck finding your ultimate solution! > > Eric > > From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 12:46:16 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 9724C16A41F; Mon, 26 Sep 2005 12:46:16 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from mh2.centtech.com (moat3.centtech.com [207.200.51.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2252743D48; Mon, 26 Sep 2005 12:46:15 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from [10.177.171.220] (neutrino.centtech.com [10.177.171.220]) by mh2.centtech.com (8.13.1/8.13.1) with ESMTP id j8QCkENe047545; Mon, 26 Sep 2005 07:46:14 -0500 (CDT) (envelope-from anderson@centtech.com) Message-ID: <4337ED91.8080200@centtech.com> Date: Mon, 26 Sep 2005 07:46:09 -0500 From: Eric Anderson User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.11) Gecko/20050914 X-Accept-Language: en-us, en MIME-Version: 1.0 To: filip wuytack References: <20050924141025.GA1236@uk.tiscali.com> <4337DF56.6030407@centtech.com> <4337E8A7.6070107@wuytack.net> In-Reply-To: <4337E8A7.6070107@wuytack.net> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.82/1102/Sun Sep 25 09:04:56 2005 on mh2.centtech.com X-Virus-Status: Clean Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 12:46:16 -0000 filip wuytack wrote: > > > Eric Anderson wrote: > >> Brian Candler wrote: >> >>> Hello, >>> >>> I was wondering if anyone would care to share their experiences in >>> synchronising filesystems across a number of nodes in a cluster. I >>> can think >>> of a number of options, but before changing what I'm doing at the >>> moment I'd >>> like to see if anyone has good experiences with any of the others. >>> >>> The application: a clustered webserver. 
The users' CGIs run in a chroot >>> environment, and these clearly need to be identical (otherwise a CGI >>> running >>> on one box would behave differently when running on a different box). >>> Ultimately I'd like to synchronise the host OS on each server too. >>> >>> Note that this is a single-master, multiple-slave type of filesystem >>> synchronisation I'm interested in. >>> >>> >>> 1. Keep a master image on an admin box, and rsync it out to the >>> frontends >>> ------------------------------------------------------------------------- >>> >>> >>> This is what I'm doing at the moment. Install a master image in >>> /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and >>> rsync it. [Actually I'm exporting it using NFS, and the frontends run >>> rsync >>> locally when required to update their local copies against the NFS >>> master] >>> >>> Disadvantages: >>> >>> - rsyncing a couple of gigs of data is not particularly fast, even >>> when only >>> a few files have changed >>> >>> - if a sysadmin (wrongly) changes a file on a front-end instead of on >>> the >>> master copy in the admin box, then the change will be lost when the next >>> rsync occurs. They might think they've fixed a problem, and then >>> (say) 24 >>> hours later their change is wiped. However if this is a config file, the >>> fact that the old file has been reinstated might not be noticed until >>> the >>> daemon is restarted or the box rebooted - maybe months later. This I >>> think >>> is the biggest fundamental problem. >>> >>> - files can be added locally and they will remain indefinitely >>> (unless we >>> use rsync --delete which is a bit scary). If this is done then adding >>> a new >>> machine into the cluster by rsyncing from the master will not pick up >>> these >>> extra files. >>> >>> So, here are the alternatives I'm considering, and I'd welcome any >>> additional suggestions too. >> >> >> >> Here's a few ideas on this: do multiple rsyncs, one for each top level >> directory. That might speed up your total rsync process. Another >> similar method is using a content revisioning system. This is only >> good for some cases, but something like subversion might work ok here. >> >> >> >>> 2. Run the images directly off NFS >>> ---------------------------------- >>> >>> I've had this running before, even the entire O/S, and it works just >>> fine. >>> However the NFS server itself then becomes a critical >>> single-point-of-failure: if it has to be rebooted and is out of >>> service for >>> 2 minutes, then the whole cluster is out of service for that time. >>> >>> I think this is only feasible if I can build a highly-available NFS >>> server, >>> which really means a pair of boxes serving the same data. Since the >>> system >>> image is read-only from the point of view of the frontends, this >>> should be >>> easy enough: >>> >>> frontends frontends >>> | | | | | | >>> NFS -----------> NFS >>> server 1 sync server 2 >>> >>> As far as I know, NFS clients don't support the idea of failing over >>> from >>> one server to another, so I'd have to make a server pair which >>> transparently >>> fails over. >>> >>> I could make one NFS server take over the other server's IP address >>> using >>> carp or vrrp. However, I suspect that the clients might notice. 
I >>> know that >>> NFS is 'stateless' in the sense that a server can be rebooted, but for a >>> client to be redirected from one server to the other, I expect that >>> these >>> filesytems would have to be *identical*, down to the level of the inode >>> numbers being the same. >>> >>> If that's true, then rsync between the two NFS servers won't cut it. >>> I was >>> thinking of perhaps using geom_mirror plus ggated/ggatec to make a >>> block-identical read-only mirror image on NFS server 2 - this also >>> has the >>> advantage that any updates are close to instantaneous. >>> >>> What worries me here is how NFS server 2, which has the mirrored >>> filesystem >>> mounted read-only, will take to having the data changed under its >>> nose. Does >>> it for example keep caches of inodes in memory, and what would happen if >>> those inodes on disk were to change? I guess I can always just >>> unmount and >>> remount the filesystem on NFS server 2 after each change. >> >> >> >> I've tried doing something similar. I used fiber attached storage, >> and had multiple hosts mounting the same partition. It seemed as >> though when host A mounted the filesystem read-write, and then host B >> mounted it read-only, any changes made by host A were not seen by B, >> and even remounting did not always bring it up to current state. I >> believe it has to do with the buffer cache and host A's desire to keep >> things (like inode changes, block maps, etc) in cache and not write >> them to disk. FreeBSD does not currently have a multi-system cache >> coherency protocol to distribute that information to other hosts. >> This is something I think would be very useful for many people. I >> suppose you could just mount the filesystem when you know a change has >> happened, but you still may not see the change. Maybe mounting the >> filesystem on host A with the sync option would help. >> >>> My other concern is about susceptibility to DoS-type attacks: if one >>> frontend were to go haywire and start hammering the NFS servers >>> really hard, >>> it could impact on all the other machines in the cluster. >>> >>> However, the problems of data synchronisation are solved: any change >>> made on >>> the NFS server is visible identically to all front-ends, and >>> sysadmins can't >>> make changes on the front-ends because the NFS export is read-only. >> >> >> >> This was my first thought too, and a highly available NFS server is >> something any NFS heavy installation wants (needs). There are a few >> implementations of clustered filesystems out there, but non for >> FreeBSD (yet). What that allows is multiple machines talking to a >> shared storage with read/write access. Very handy, but since you only >> need read-only access, I think your problem is much simpler, and you >> can get away with a lot less. >> >> >>> 3. Use a network distributed filesystem - CODA? AFS? >>> ---------------------------------------------------- >>> >>> If each frontend were to access the filesystem as a read-only network >>> mount, >>> but have a local copy to work with in the case of disconnected >>> operation, >>> then the SPOF of an NFS server would be eliminated. >>> >>> However, I have no experience with CODA, and although it's been in >>> the tree >>> since 2002, the README's don't inspire confidence: >>> >>> "It is mostly working, but hasn't been run long enough to be sure >>> all the >>> bugs are sorted out. ... 
This code is not SMP ready" >>> >>> Also, a local cache is no good if the data you want during disconnected >>> operation is not in the cache at that time, which I think means this >>> idea is >>> not actually a very good one. >> >> >> >> There is also a port for coda. I've been reading about this, and >> it's an interesting filesystem, but I'm just not sure of it's >> usefulness yet. >> >> >>> 4. Mount filesystems read-only >>> ------------------------------ >>> >>> On each front-end I could store /webroot/cgi on a filesystem mounted >>> read-only to prevent tampering (as long as the sysadmin doesn't >>> remount it >>> read-write of course). That would work reasonably well, except that >>> being >>> mounted read-only I couldn't use rsync to update it! >>> >>> It might also work with geom_mirror and ggated/ggatec, except for the >>> issue >>> I raised before about changing blocks on a filesystem under the nose >>> of a >>> client who is actively reading from it. >> >> >> >> I suppose you could mount r/w only when doing the rsync, then switch >> back to ro once complete. You should be able to do this online, >> without any issues or taking the filesystem offline. >> >> >>> 5. Using a filesystem which really is read-only >>> ----------------------------------------------- >>> >>> Better tamper-protection could be had by keeping data in a filesystem >>> structure which doesn't support any updates at all - such as cd9660 or >>> geom_uzip. >>> >>> The issue here is how to roll out a new version of the data. I could >>> push >>> out a new filesystem image into a second partition, but it would then be >>> necessary to unmount the old filesystem and remount the new on the same >>> place, and you can't really unmount a filesystem which is in use. So >>> this >>> would require a reboot. >>> >>> I was thinking that some symlink trickery might help: >>> >>> /webroot/cgi -> /webroot/cgi1 >>> /webroot/cgi1 # filesystem A mounted here >>> /webroot/cgi2 # filesystem B mounted here >>> >>> It should be possible to unmount /webroot/cgi2, dd in a new image, >>> remount >>> it, and change the symlink to point to /webroot/cgi2. After a little >>> while, >>> hopefully all the applications will stop using files in >>> /webroot/cgi1, so >>> this one can be unmounted and a new one put in its place on the next >>> update. >>> However this is not guaranteed, especially if there are long-lived >>> processes >>> using binary images in this partition. You'd still have to stop and >>> restart >>> all those processes. >>> >>> If reboots were acceptable, then the filesystem image could also be >>> stored >>> in ramdisk pulled in via pxeboot. This makes sense especially for >>> geom_uzip >>> where the data is pre-compressed. However I would still prefer to avoid >>> frequent reboots if at all possible. Also, whilst a ramdisk might be >>> OK for >>> the root filesystem, a typical CGI environment (with perl, php, ruby, >>> python, and loads of libraries) would probably be too large anyway. >>> >>> >>> 6. Journaling filesystem replication >>> ------------------------------------ >>> >>> If the data were stored on a journaling filesystem on the master box, >>> and >>> the journal logs were distributed out to the slaves, then they would all >>> have identical filesystem copies and only a minimal amount of data would >>> need to be pushed out to each machine on each change. (This would be >>> rather >>> like NetApps and their snap-mirroring system). 
However I'm not aware >>> of any >>> journaling filesystem for FreeBSD, let alone whether it would support >>> filesystem replication in this way. >> >> >> >> There is a project underway for UFSJ (UFS journaling). Maybe once it >> is complete, and bugs are ironed out, one could implement a journal >> distribution piece to send the journal updates to multiple hosts and >> achieve what you are thinking, however, that only distributes the >> meta-data, and not the actual data. >> >> > Have a look at dragonfly BSD for this. They are working on a journaling > filesystem that will do just that. Do you have a link to some information on this? I've been looking at Dragonfly, but I'm having trouble finding good information on what is already working, in planning, etc. Eric -- ------------------------------------------------------------------------ Eric Anderson Sr. Systems Administrator Centaur Technology Anything that works is better than anything that doesn't. ------------------------------------------------------------------------ From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 15:33:27 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3336B16A41F; Mon, 26 Sep 2005 15:33:27 +0000 (GMT) (envelope-from b.candler@pobox.com) Received: from leto.uk.clara.net (leto.uk.clara.net [80.168.69.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id B2F9F43D62; Mon, 26 Sep 2005 15:33:20 +0000 (GMT) (envelope-from b.candler@pobox.com) Received: from bloodhound.noc.clara.net ([195.8.70.207]) by leto.uk.clara.net with esmtp (Exim 4.43) id 1EJuyk-000HWz-Rg; Mon, 26 Sep 2005 16:33:18 +0100 Received: from personal by bloodhound.noc.clara.net with local (Exim 4.52 (FreeBSD)) id 1EJuyy-0006vz-Ff; Mon, 26 Sep 2005 16:33:32 +0100 Date: Mon, 26 Sep 2005 16:33:32 +0100 From: Brian Candler To: filip wuytack Message-ID: <20050926153332.GA26373@uk.tiscali.com> References: <20050924141025.GA1236@uk.tiscali.com> <4337DF56.6030407@centtech.com> <4337E8A7.6070107@wuytack.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4337E8A7.6070107@wuytack.net> User-Agent: Mutt/1.4.2.1i Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 15:33:27 -0000 On Mon, Sep 26, 2005 at 01:25:11PM +0100, filip wuytack wrote: > Have a look at dragonfly BSD for this. They are working on a journaling > filesystem that will do just that. Someone else mentioned that. However DragonFly BSD seems very short of documentation on the web; I did finally find some on-line manpages courtesy of google (couldn't find them linked from www.dragonflybsd.org) >From what I read, I also get the impression that the journalling feature is rather a work-in-progress right now. Regards, Brian. 
From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 18:16:38 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 77E7516A41F; Mon, 26 Sep 2005 18:16:38 +0000 (GMT) (envelope-from ike@lesmuug.org) Received: from beth.easthouston.org (dsl254-117-002.nyc1.dsl.speakeasy.net [216.254.117.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0CE6043D66; Mon, 26 Sep 2005 18:16:35 +0000 (GMT) (envelope-from ike@lesmuug.org) Received: from [192.168.1.22] (249-218.customer.cloud9.net [168.100.249.218]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by beth.easthouston.org (Postfix) with ESMTP id 932B2D9AC20; Mon, 26 Sep 2005 14:16:34 -0400 (EDT) In-Reply-To: <20050924141025.GA1236@uk.tiscali.com> References: <20050924141025.GA1236@uk.tiscali.com> Mime-Version: 1.0 (Apple Message framework v734) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Content-Transfer-Encoding: 7bit From: Isaac Levy Date: Mon, 26 Sep 2005 14:16:31 -0400 To: Brian Candler X-Mailer: Apple Mail (2.734) Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 18:16:38 -0000 Hi Brian, All, This email has one theme: GEOM! :) On Sep 24, 2005, at 10:10 AM, Brian Candler wrote: > Hello, > > I was wondering if anyone would care to share their experiences in > synchronising filesystems across a number of nodes in a cluster. I > can think > of a number of options, but before changing what I'm doing at the > moment I'd > like to see if anyone has good experiences with any of the others. > > The application: a clustered webserver. The users' CGIs run in a > chroot > environment, and these clearly need to be identical (otherwise a > CGI running > on one box would behave differently when running on a different box). > Ultimately I'd like to synchronise the host OS on each server too. > > Note that this is a single-master, multiple-slave type of filesystem > synchronisation I'm interested in. I just wanted to throw out some quick thoughts on a totally different approach which nobody has really explored in this thread, solutions which are production level software. (Sorry if I'm repeating things or giving out info yall' already know:) -- Geom: http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/geom- intro.html The core Disk IO framework for FreeBSD, as of 5.x, led by PHK: http://www.bsdcan.org/2004/papers/geom.pdf This framework itself is not as useful to you as the utilities which make use of it, -- Geom Gate: http://kerneltrap.org/news/freebsd?from=20 Network device-level client/server disk mapping tool. (VERY IMPORTANT COMPONENT, it's reportedly faster, and more stable than NFS has ever been- so people have immediately and happily deployed it in production systems!) -- Gvinum and Gmirror: Gmirror http://people.freebsd.org/~rse/mirror/ http://www.ie.freebsd.org/doc/en_US.ISO8859-1/books/handbook/geom.html (Sidenote: even Greg Lehey (original author of Vinum), has stated that it's better to use Geom-based tools than Vinum for the forseeable future.) 
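To make the above concrete, a much-simplified sketch of combining ggated/ggatec with gmirror; the addresses, device names and paths are made-up examples, and (as noted elsewhere in this thread) the remote copy is only safe to mount after the master has stopped writing to it:

   # On the slave box: export a spare partition read-write to the master.
   # ggated(8) reads /etc/gg.exports by default.
   echo "10.0.0.1 RW /dev/ad1s1e" > /etc/gg.exports
   ggated

   # On the master box: attach the slave's partition over the network;
   # it appears as a local disk device (here /dev/ggate0).
   kldload geom_gate      # if not compiled into the kernel
   kldload geom_mirror
   ggatec create -u 0 10.0.0.2 /dev/ad1s1e

   # Mirror the master's local data partition with the remote one, so
   # every write on the master is replicated at block level.  In practice
   # you would also pick a balance algorithm that keeps reads local.
   gmirror label -v gm0 /dev/ad0s1e /dev/ggate0
   newfs /dev/mirror/gm0
   mount /dev/mirror/gm0 /webroot/cgi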
-- In a nutshell, to address your needs, let me toss out the following example setup: I know of one web-shop in Canada, which is running 2 machines for every virtual cluster, in the following configuration: 2 servers, 4 SATA drives per box, quad copper/ethernet gigabit nic on each box each drive is mirrored using gmirror, over each of the gigabit ethernet nics each box is running Vinum Raid5 across the 4 mirrored drives The drives are then sliced appropriately, and server resources are distributed across the boxes- with various slices mounted on each box. The folks I speak of simply have a suite of failover shell scripts prepared, in the event of a machine experiencing total hardware failure. Pretty tough stuff, very high-performance, and CHEAP. -- With that, I'm working towards similar setups, oriented around redundant jailed systems, with an eventual end to tie CARP (from pf) into the mix to make for nearly-instantaneous jailed failover redundancy- (but it's going to be some time before I have what I want worked out for production on my own). Regardless, it's worth tapping into the GEOM dialogues, as there are many new ways of working with disks coming into existence- and the GEOM framework itself provides an EXTREMELY solid base to bring 'exotic' disk configurations up to production level quickly. (Also noteworthy, there's a couple of encrypted disk systems based on GEOM emerging now too...) -- Hope all that helps, Best, .ike From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 20:38:09 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 2F96F16A496; Mon, 26 Sep 2005 20:38:09 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from mh1.centtech.com (moat3.centtech.com [207.200.51.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6D82C43D58; Mon, 26 Sep 2005 20:38:07 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from [10.177.171.220] (neutrino.centtech.com [10.177.171.220]) by mh1.centtech.com (8.13.1/8.13.1) with ESMTP id j8QKc6nK077563; Mon, 26 Sep 2005 15:38:06 -0500 (CDT) (envelope-from anderson@centtech.com) Message-ID: <43385C29.5060406@centtech.com> Date: Mon, 26 Sep 2005 15:38:01 -0500 From: Eric Anderson User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.11) Gecko/20050914 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Isaac Levy References: <20050924141025.GA1236@uk.tiscali.com> In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.82/1102/Sun Sep 25 09:04:56 2005 on mh1.centtech.com X-Virus-Status: Clean Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 20:38:09 -0000 Isaac Levy wrote: > Hi Brian, All, > > This email has one theme: GEOM! :) > > On Sep 24, 2005, at 10:10 AM, Brian Candler wrote: > >> Hello, >> >> I was wondering if anyone would care to share their experiences in >> synchronising filesystems across a number of nodes in a cluster. I >> can think >> of a number of options, but before changing what I'm doing at the >> moment I'd >> like to see if anyone has good experiences with any of the others. 
>> >> The application: a clustered webserver. The users' CGIs run in a chroot >> environment, and these clearly need to be identical (otherwise a CGI >> running >> on one box would behave differently when running on a different box). >> Ultimately I'd like to synchronise the host OS on each server too. >> >> Note that this is a single-master, multiple-slave type of filesystem >> synchronisation I'm interested in. > > > I just wanted to throw out some quick thoughts on a totally different > approach which nobody has really explored in this thread, solutions > which are production level software. (Sorry if I'm repeating things or > giving out info yall' already know:) > > -- > Geom: > http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/geom- intro.html > > The core Disk IO framework for FreeBSD, as of 5.x, led by PHK: > http://www.bsdcan.org/2004/papers/geom.pdf > > This framework itself is not as useful to you as the utilities which > make use of it, > > -- > Geom Gate: > http://kerneltrap.org/news/freebsd?from=20 > > Network device-level client/server disk mapping tool. > (VERY IMPORTANT COMPONENT, it's reportedly faster, and more stable than > NFS has ever been- so people have immediately and happily deployed it > in production systems!) > > -- > Gvinum and Gmirror: > > Gmirror > http://people.freebsd.org/~rse/mirror/ > http://www.ie.freebsd.org/doc/en_US.ISO8859-1/books/handbook/geom.html > > (Sidenote: even Greg Lehey (original author of Vinum), has stated that > it's better to use Geom-based tools than Vinum for the forseeable future.) > > -- > In a nutshell, to address your needs, let me toss out the following > example setup: > > I know of one web-shop in Canada, which is running 2 machines for every > virtual cluster, in the following configuration: > > 2 servers, > 4 SATA drives per box, > quad copper/ethernet gigabit nic on each box > > each drive is mirrored using gmirror, over each of the gigabit ethernet > nics > each box is running Vinum Raid5 across the 4 mirrored drives > > The drives are then sliced appropriately, and server resources are > distributed across the boxes- with various slices mounted on each box. > The folks I speak of simply have a suite of failover shell scripts > prepared, in the event of a machine experiencing total hardware failure. > > Pretty tough stuff, very high-performance, and CHEAP. > > -- > With that, I'm working towards similar setups, oriented around > redundant jailed systems, with an eventual end to tie CARP (from pf) > into the mix to make for nearly-instantaneous jailed failover > redundancy- (but it's going to be some time before I have what I want > worked out for production on my own). > > Regardless, it's worth tapping into the GEOM dialogues, as there are > many new ways of working with disks coming into existence- and the GEOM > framework itself provides an EXTREMELY solid base to bring 'exotic' > disk configurations up to production level quickly. > (Also noteworthy, there's a couple of encrypted disk systems based on > GEOM emerging now too...) I think the original poster (and I at least) knew about this already, but what I still fail to see is how you can get several machines using the same data at the same time, and still do updates to that data? The only way I know of is to use a syncing tool (like rsync) or a shared filesystem (like NFS, or CXFS, or Polyserve FS, opengfs, etc), none of which run on FreeBSD. 
What I read from above, is a redundant server setup, not a high-performance setup (meaning multiple machines serving the same data to many clients). If I'm missing something, please fill me in.. Eric -- ------------------------------------------------------------------------ Eric Anderson Sr. Systems Administrator Centaur Technology Anything that works is better than anything that doesn't. ------------------------------------------------------------------------ From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 21:27:34 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8757816A41F; Mon, 26 Sep 2005 21:27:34 +0000 (GMT) (envelope-from ike@lesmuug.org) Received: from beth.easthouston.org (dsl254-117-002.nyc1.dsl.speakeasy.net [216.254.117.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2C02F43D48; Mon, 26 Sep 2005 21:27:34 +0000 (GMT) (envelope-from ike@lesmuug.org) Received: from [192.168.1.22] (249-218.customer.cloud9.net [168.100.249.218]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by beth.easthouston.org (Postfix) with ESMTP id 5115BD9B887; Mon, 26 Sep 2005 17:27:33 -0400 (EDT) In-Reply-To: <43385C29.5060406@centtech.com> References: <20050924141025.GA1236@uk.tiscali.com> <43385C29.5060406@centtech.com> Mime-Version: 1.0 (Apple Message framework v734) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: Content-Transfer-Encoding: 7bit From: Isaac Levy Date: Mon, 26 Sep 2005 17:27:30 -0400 To: Eric Anderson X-Mailer: Apple Mail (2.734) Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 21:27:34 -0000 Hi Eric, All, On Sep 26, 2005, at 4:38 PM, Eric Anderson wrote: > I think the original poster (and I at least) knew about this > already, but what I still fail to see is how you can get several > machines using the same data at the same time, and still do updates > to that data? The only way I know of is to use a syncing tool > (like rsync) or a shared filesystem (like NFS, or CXFS, or > Polyserve FS, opengfs, etc), none of which run on FreeBSD. Gotcha, I did skip somewhat to the side of the original requirements, > > What I read from above, is a redundant server setup, not a high- > performance setup (meaning multiple machines serving the same data > to many clients). If I'm missing something, please fill me in.. I'm not certain that my intention was to provide the best answer, but to provide yet another set of tools to get the job done. In effect, a terse example of how someone could use the Geom tools I mentioned, to meet this requirement: + Setup mirrored disks across machines as discussed before + Mount a slice of that disk Read/Write on one machine (acting as master) + Mount that same slice Readonly on both machines, using Geom Gate, and serve data from there. - If the master machine dies, mount the volume Read/Write on the other machine I'm not certain if this meets the requirements precisely, but I believe there may be a combination of these Geom-based utilities which would- and they are all actively under continued development. 
-- Eric, you are definately correct, that there's not really a disk- level mechanism to maintain concurrent writes between volumes mounted across servers using FreeBSD (excepting NFS, which in this context, makes me say *yuck*). Anyone with some spare time want to take up this problem as a new Geom project? ;) However, based on my experiences with distributed database clusters, I believe it's fair to say that any persistent data (writes) are a very difficult task to get done right across a cluster- and maintain contextually sane levels of performance, (due to resource locking issues, mixed with network latency, etc...) I guess I'm saying this is a big-picture computing problem IMHO, and I don't know of a good solution here (though I'm curious about what kind of work has been done in Dragonfly which is relevant?) > > Eric > -- Got a spare NetApp anyone? My head hurts. :) Best, .ike From owner-freebsd-cluster@FreeBSD.ORG Mon Sep 26 21:39:36 2005 Return-Path: X-Original-To: freebsd-cluster@freebsd.org Delivered-To: freebsd-cluster@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3541716A41F; Mon, 26 Sep 2005 21:39:36 +0000 (GMT) (envelope-from b.candler@pobox.com) Received: from orb.pobox.com (orb.pobox.com [207.8.226.5]) by mx1.FreeBSD.org (Postfix) with ESMTP id B6C4E43D48; Mon, 26 Sep 2005 21:39:35 +0000 (GMT) (envelope-from b.candler@pobox.com) Received: from orb (localhost [127.0.0.1]) by orb.pobox.com (Postfix) with ESMTP id 8C4871DDA; Mon, 26 Sep 2005 17:39:56 -0400 (EDT) Received: from billdog.local.linnet.org (dsl-212-74-113-66.access.uk.tiscali.com [212.74.113.66]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by orb.sasl.smtp.pobox.com (Postfix) with ESMTP id 30976A2; Mon, 26 Sep 2005 17:39:54 -0400 (EDT) Received: from brian by billdog.local.linnet.org with local (Exim 4.50 (FreeBSD)) id 1EK0kf-0000Cx-U9; Mon, 26 Sep 2005 22:43:09 +0100 Date: Mon, 26 Sep 2005 22:43:09 +0100 From: Brian Candler To: Isaac Levy Message-ID: <20050926214309.GA766@uk.tiscali.com> References: <20050924141025.GA1236@uk.tiscali.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.1i Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org Subject: Re: Options for synchronising filesystems X-BeenThere: freebsd-cluster@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Clustering FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 26 Sep 2005 21:39:36 -0000 On Mon, Sep 26, 2005 at 02:16:31PM -0400, Isaac Levy wrote: > I just wanted to throw out some quick thoughts on a totally different > approach which nobody has really explored in this thread Geom (gmirror plus ggated/ggatec) was what I suggested for syncing two NFS servers (my option 2) or for direct synchronisation of the clients' filesystems to the servers (my option 4). The problem occurs when a client actually *mounts* and uses the mirrored copy, rather than just keeping a mirrored copy for resilience. > Geom Gate: > http://kerneltrap.org/news/freebsd?from=20 > > Network device-level client/server disk mapping tool. > (VERY IMPORTANT COMPONENT, it's reportedly faster, and more stable > than NFS has ever been- so people have immediately and happily > deployed it in production systems!) NFS and geom gate are two different things, so you can't really compare them directly. 
NFS shares files; geom gate shares a block-level device. With NFS you can have one server and multiple clients, and the clients can access the filesystem read-write. With geom gate you just have remote access to a disk partition, and essentially you can only do what you could do with a local block device.

Incidentally, NFS has been *hugely* dependable for me in production environments. However, I've always used expensive and beefy NFS servers (NetApp) whilst FreeBSD is just the client.

> I know of one web-shop in Canada, which is running 2 machines for
> every virtual cluster, in the following configuration:
>
> 2 servers,
> 4 SATA drives per box,
> quad copper/ethernet gigabit nic on each box
>
> each drive is mirrored using gmirror, over each of the gigabit
> ethernet nics
> each box is running Vinum Raid5 across the 4 mirrored drives
>
> The drives are then sliced appropriately, and server resources are
> distributed across the boxes - with various slices mounted on each box.
> The folks I speak of simply have a suite of failover shell scripts
> prepared, in the event of a machine experiencing total hardware failure.

Right. But unless I'm mistaken, the remote mirrors are just backup copies of the data. Those remote mirrors are not actually *mounted* as filesystems.

I think you're talking about a master/slave failover scenario. With careful arrangement, machine 1 can be master for dataset A and slave for dataset B, while machine 2 is slave for A and master for B, so you're not wasting your second machine. If machine 1 fails, machine 2 can take over both datasets. That's fine.

However, what I need is for dataset A to be generated on machine 1 and identical copies to be available on machines 2, 3, 4, 5...9. Not just *stored* there, but actually *used* there, as live read-only copies. So if machine 1 makes a change to the dataset, all the other machines notice the change properly and start using it immediately.

From what I've heard, I can't use gmirror from machine 1 to machines 2-9, because you can't mount a filesystem read-only while some other machine magically updates the blocks from under its nose. The filesystem gets confused because its local caches of blocks and inodes become out of date when the data in the block device changes.

> Regardless, it's worth tapping into the GEOM dialogues

GEOM is definitely cool, and a strong selling point for moving from 4.x to 5.x.

Regards,

Brian.
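As an aside, Brian's option 2 - keeping a block-identical but unmounted copy of the active NFS server's data on a standby box - might be assembled roughly as follows. This is only a sketch: the host names, addresses and the da0s1d/da1s1d partitions are invented, and the two partitions need to be (roughly) the same size.

    # on NFS server 2 (the standby): export a spare partition read-write
    # to server 1, and do NOT mount it locally while server 1 is writing
    echo "192.168.0.1 RW /dev/da1s1d" >> /etc/gg.exports
    ggated

    # on NFS server 1 (the active server): mirror a local partition onto
    # the exported remote one, then build and serve the filesystem
    kldload geom_gate geom_mirror
    ggatec create -u 0 server2.example.net /dev/da1s1d    # appears as /dev/ggate0
    gmirror label -v gm0 /dev/da0s1d /dev/ggate0
    newfs /dev/mirror/gm0
    mount /dev/mirror/gm0 /export                         # NFS-export /export from here

If server 1 dies, server 2 can load geom_mirror, let its surviving component come up as a degraded /dev/mirror/gm0, fsck and mount it, and take over the service address with failover scripts - which is the master/slave scenario described above, not the many-live-read-only-copies setup Brian is after.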
From owner-freebsd-cluster@FreeBSD.ORG Tue Sep 27 11:25:29 2005
Date: Tue, 27 Sep 2005 13:25:26 +0200 (CEST)
Message-Id: <200509271125.j8RBPQxk094107@lurza.secnetix.de>
From: Oliver Fromme
To: freebsd-cluster@FreeBSD.ORG
Subject: Re: Options for synchronising filesystems

Isaac Levy wrote:
> In effect, a terse example of how someone could use the Geom tools I
> mentioned to meet this requirement:
>
> + Set up mirrored disks across machines as discussed before
> + Mount a slice of that disk read/write on one machine (acting as
>   master)
> + Mount that same slice read-only on both machines, using Geom Gate,
>   and serve data from there

That doesn't work. You would need a cache coherency protocol for that setup to work correctly. Or mount it read-only _everywhere_ (including the master). If you have to perform updates, you would have to perform this sequence (a rough command sketch follows below):

1. remount the master read-write,
2. do the updates,
3. remount the master read-only,
4. flush the caches on the slaves (umount; mount).

But that means you'll have a short downtime each time you update things -- probably not what you want, especially if a redundant setup is the goal.

Currently, the _only_ way to mount the same filesystem on multiple FreeBSD systems is NFS (or third-party software like CODA). Another alternative, as others have mentioned, is to duplicate or synchronize the filesystems on each server regularly (rsync, unison, whatever).

> Eric, you are definitely correct that there's not really a disk-level
> mechanism to maintain concurrent writes between volumes mounted
> across servers using FreeBSD

Not even concurrent reads and writes (i.e. one host writes and the others read). See above.

> (excepting NFS, which in this context makes me say *yuck*).

NFS certainly has its disadvantages, but it works pretty well when set up in a reasonable way.

> I guess I'm saying this is a big-picture computing problem IMHO, and
> I don't know of a good solution here (though I'm curious about what
> kind of work has been done in DragonFly which is relevant?)

So far, DragonFly BSD hasn't done anything regarding cache coherency for clusters, though I think it is planned for the future. So, as of today, DragonFly doesn't provide a solution for the above problem either.
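To make the remount sequence above concrete, here is a minimal sketch. The /data mount point and the ggate device are placeholders, and step 3 only succeeds if nothing on the master still holds files open for writing:

    # on the master
    mount -u -o rw /data            # 1. remount read-write
    # 2. ... apply the updates ...
    mount -u -o ro /data            # 3. remount read-only

    # on each slave: flush the stale block/inode caches
    umount /data                    # 4a. drop the old mount
    mount -o ro /dev/ggate0 /data   # 4b. mount again, read-only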
However, Matt Dillon has worked on a journalling feature for UFS which might be helpful for the situation given by the OP. Not everything is implemented yet, but it _is_ already usable to maintain remote mirrors. On July 5th, Matt wrote:

"I can now run a buildworld loop, and the mirror generated by the journal stays in synch (diff -r reports no differences if I idle the buildworld, wait a second or two for the journal to catch up, and run it)."

Best regards
   Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Services with a focus on FreeBSD: http://www.secnetix.de/bsd

Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

It's trivial to make fun of Microsoft products, but it takes a real
man to make them work, and a God to make them do anything useful.