From owner-freebsd-cluster@FreeBSD.ORG Sat Sep 24 14:06:54 2005
Date: Sat, 24 Sep 2005 15:10:25 +0100
From: Brian Candler <b.candler@pobox.com>
To: freebsd-cluster@freebsd.org, freebsd-isp@freebsd.org
Message-ID: <20050924141025.GA1236@uk.tiscali.com>
Subject: Options for synchronising filesystems

Hello,

I was wondering if anyone would care to share their experiences in
synchronising filesystems across a number of nodes in a cluster. I can think
of a number of options, but before changing what I'm doing at the moment I'd
like to see if anyone has good experiences with any of the others.

The application: a clustered webserver. The users' CGIs run in a chroot
environment, and these clearly need to be identical (otherwise a CGI running
on one box would behave differently when running on a different box).
Ultimately I'd like to synchronise the host OS on each server too.

Note that it is single-master, multiple-slave filesystem synchronisation
I'm interested in.

1. Keep a master image on an admin box, and rsync it out to the frontends
--------------------------------------------------------------------------

This is what I'm doing at the moment. Install a master image in
/webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and
rsync it out. [Actually I'm exporting it using NFS, and the frontends run
rsync locally when required to update their local copies against the NFS
master.]

Disadvantages:

- rsyncing a couple of gigs of data is not particularly fast, even when
  only a few files have changed.

- if a sysadmin (wrongly) changes a file on a front-end instead of on the
  master copy on the admin box, then the change will be lost when the next
  rsync occurs. They might think they've fixed a problem, and then (say) 24
  hours later their change is wiped. If this is a config file, the fact
  that the old file has been reinstated might not be noticed until the
  daemon is restarted or the box rebooted - maybe months later. This, I
  think, is the biggest fundamental problem.

- files can be added locally and they will remain indefinitely (unless we
  use rsync --delete, which is a bit scary). If files have been added
  locally in this way, then adding a new machine into the cluster by
  rsyncing from the master will not pick them up.
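For concreteness, the per-frontend update step is roughly the following
(the /master mount point is just a placeholder for wherever the NFS export
happens to be mounted):

    #!/bin/sh
    # Pull the master image (NFS-mounted read-only) into the local copy.
    MASTER=/master/webroot/cgi        # placeholder for the NFS mount
    LOCAL=/webroot/cgi

    rsync -a --numeric-ids "$MASTER/" "$LOCAL/"
    # rsync -a --numeric-ids --delete "$MASTER/" "$LOCAL/"  # stricter, but scary

The trailing slashes make rsync copy the contents of the master directory
rather than the directory itself; and without --delete, files created
locally on the frontend are left in place, which is exactly the third
problem above.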
So, here are the alternatives I'm considering, and I'd welcome any
additional suggestions too.

2. Run the images directly off NFS
----------------------------------

I've had this running before, even for the entire O/S, and it works just
fine. However, the NFS server itself then becomes a critical
single-point-of-failure: if it has to be rebooted and is out of service for
2 minutes, then the whole cluster is out of service for that time.

I think this is only feasible if I can build a highly-available NFS server,
which really means a pair of boxes serving the same data. Since the system
image is read-only from the point of view of the frontends, this should be
easy enough:

    frontends              frontends
      | | |                  | | |
       NFS    ----------->    NFS
     server 1      sync     server 2

As far as I know, NFS clients don't support the idea of failing over from
one server to another, so I'd have to make a server pair which transparently
fails over. I could make one NFS server take over the other server's IP
address using carp or vrrp. However, I suspect that the clients might
notice. I know that NFS is 'stateless' in the sense that a server can be
rebooted, but for a client to be redirected from one server to the other, I
expect that the two filesystems would have to be *identical*, down to the
level of the inode numbers being the same. If that's true, then rsync
between the two NFS servers won't cut it.

I was thinking of perhaps using geom_mirror plus ggated/ggatec to make a
block-identical read-only mirror image on NFS server 2 - this also has the
advantage that any updates are close to instantaneous. What worries me here
is how NFS server 2, which has the mirrored filesystem mounted read-only,
will take to having the data changed under its nose. Does it, for example,
keep caches of inodes in memory, and what would happen if those inodes on
disk were to change? I guess I can always just unmount and remount the
filesystem on NFS server 2 after each change.

My other concern is susceptibility to DoS-type attacks: if one frontend
were to go haywire and start hammering the NFS servers really hard, it
could impact all the other machines in the cluster.

However, the problems of data synchronisation are solved: any change made
on the NFS server is visible identically to all front-ends, and sysadmins
can't make changes on the front-ends because the NFS export is read-only.
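For what it's worth, the geom_mirror plus ggated/ggatec arrangement I have
in mind would look roughly like this - hostnames and device names are
invented, and I haven't actually tried it, so treat it as a sketch only:

    # On NFS server 2: export a spare partition over the network.
    # /etc/gg.exports would contain a line such as:
    #     nfs1.example.com RW /dev/ad1s1d
    ggated

    # On NFS server 1: attach server 2's partition as /dev/ggate0 and
    # mirror it with the equivalent local partition.
    ggatec create -u 0 nfs2.example.com /dev/ad1s1d
    gmirror label -v webroot /dev/ad1s1d /dev/ggate0
    newfs /dev/mirror/webroot
    mount /dev/mirror/webroot /webroot     # read-write master, exported via NFS

    # On NFS server 2: mount its local component read-only (and unmount/
    # remount it after each change, because of the caveat above).
    mount -r /dev/ad1s1d /webroot

Writes on server 1 would then hit both the local disk and server 2's disk
at more or less the same time, which is what makes updates close to
instantaneous.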
3. Use a network distributed filesystem - CODA? AFS?
----------------------------------------------------

If each frontend were to access the filesystem as a read-only network
mount, but had a local copy to work with in the case of disconnected
operation, then the SPOF of an NFS server would be eliminated.

However, I have no experience with CODA, and although it's been in the tree
since 2002, the READMEs don't inspire confidence: "It is mostly working,
but hasn't been run long enough to be sure all the bugs are sorted out.
... This code is not SMP ready."

Also, a local cache is no good if the data you want during disconnected
operation is not in the cache at that time, which I think means this idea
is not actually a very good one.

4. Mount filesystems read-only
------------------------------

On each front-end I could store /webroot/cgi on a filesystem mounted
read-only to prevent tampering (as long as the sysadmin doesn't remount it
read-write, of course). That would work reasonably well, except that being
mounted read-only I couldn't use rsync to update it! It might also work
with geom_mirror and ggated/ggatec, except for the issue I raised before
about changing blocks on a filesystem under the nose of a client which is
actively reading from it.

5. Use a filesystem which really is read-only
---------------------------------------------

Better tamper-protection could be had by keeping the data in a filesystem
structure which doesn't support any updates at all - such as cd9660 or
geom_uzip. The issue here is how to roll out a new version of the data. I
could push out a new filesystem image into a second partition, but it would
then be necessary to unmount the old filesystem and remount the new one in
the same place, and you can't really unmount a filesystem which is in use.
So this would require a reboot.

I was thinking that some symlink trickery might help (see the sketch at the
end of this section):

    /webroot/cgi -> /webroot/cgi1
    /webroot/cgi1     # filesystem A mounted here
    /webroot/cgi2     # filesystem B mounted here

It should be possible to unmount /webroot/cgi2, dd in a new image, remount
it, and change the symlink to point to /webroot/cgi2. After a little while,
hopefully, all the applications will stop using files in /webroot/cgi1, so
that one can be unmounted and a new image put in its place on the next
update. However this is not guaranteed, especially if there are long-lived
processes using binary images in that partition. You'd still have to stop
and restart all those processes.

If reboots were acceptable, then the filesystem image could also be stored
in a ramdisk pulled in via pxeboot. This makes sense especially for
geom_uzip, where the data is pre-compressed. However, I would still prefer
to avoid frequent reboots if at all possible. Also, whilst a ramdisk might
be OK for the root filesystem, a typical CGI environment (with perl, php,
ruby, python, and loads of libraries) would probably be too large anyway.
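Here is the swap procedure I have in mind for option 5, roughly - the image
path and spare partition are placeholders, and it assumes the cgi1/cgi2
layout sketched above:

    #!/bin/sh
    # Push a new read-only image onto whichever partition is not live,
    # then flip the symlink to point at it.
    NEWIMG=/var/tmp/cgi-new.iso     # placeholder: new cd9660 (or uzip) image
    SPARE=/dev/ad0s1e               # placeholder: partition behind /webroot/cgi2

    umount /webroot/cgi2
    dd if="$NEWIMG" of="$SPARE" bs=64k
    mount -r -t cd9660 "$SPARE" /webroot/cgi2

    # -h stops ln following the existing symlink into the old directory;
    # there is still a brief window while the old link is replaced.
    ln -shf /webroot/cgi2 /webroot/cgi

On the next update the same dance happens in the other direction, once
everything has let go of /webroot/cgi1.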
6. Journaling filesystem replication
------------------------------------

If the data were stored on a journaling filesystem on the master box, and
the journal logs were distributed out to the slaves, then they would all
have identical filesystem copies and only a minimal amount of data would
need to be pushed out to each machine on each change. (This would be rather
like NetApp filers and their snap-mirroring system.) However, I'm not aware
of any journaling filesystem for FreeBSD, let alone one which would support
filesystem replication in this way.

Well, that's what I've come up with so far. I'd be very interested to hear
if people have any other strategies or suggestions, particularly with
practical experience in a clustered/ISP environment.

Regards,

Brian Candler.


From owner-freebsd-cluster@FreeBSD.ORG Sat Sep 24 16:06:58 2005
Date: Sat, 24 Sep 2005 18:06:56 +0200 (CEST)
From: Oliver Fromme <oliver.fromme@secnetix.de>
To: freebsd-cluster@FreeBSD.ORG
Message-Id: <200509241606.j8OG6u9N066588@lurza.secnetix.de>
In-Reply-To: <20050924141025.GA1236@uk.tiscali.com>
Subject: Re: Options for synchronising filesystems

Just a few things that came to my mind when reading your message ...

Brian Candler wrote:
 > [...]
 > 2. Run the images directly off NFS
 > ----------------------------------
 > [...]
 > As far as I know, NFS clients don't support the idea of failing over
 > from one server to another, so I'd have to make a server pair which
 > transparently fails over.

NetApp filers support that (in a cluster configuration). It works very
well; I've used such NetApp filer clusters as NFS servers for a server farm
running FreeBSD for several years. Disadvantage: not exactly cheap.

 > 6. Journaling filesystem replication
 > ------------------------------------
 >
 > If the data were stored on a journaling filesystem on the master box,
 > and the journal logs were distributed out to the slaves, then they would
 > all have identical filesystem copies and only a minimal amount of data
 > would need to be pushed out to each machine on each change.

DragonFly BSD supports exactly that.

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd

Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"We're sysadmins. To us, data is a protocol-overhead."