From owner-freebsd-fs@FreeBSD.ORG Thu Jun  6 23:34:33 2013
Date: Thu, 6 Jun 2013 16:34:17 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: mxb
Cc: "freebsd-fs@freebsd.org"
Subject: Re: zpool export/import on failover - The pool metadata is corrupted
Message-ID: <20130606233417.GA46506@icarus.home.lan>
References: <016B635E-4EDC-4CDF-AC58-82AC39CBFF56@alumni.chalmers.se>
 <20130606223911.GA45807@icarus.home.lan>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>
> Sure, the script is not perfect yet and does not handle a lot of
> stuff, but shifting the focus from zpool import/export to the script
> itself is not that clever, as this works most of the time.
>
> The question is WHY ZFS corrupts metadata when it should not.
> Sometimes. I've seen the zpool go stale when manually
> importing/exporting the pool.
>
> On 7 jun 2013, at 00:39, Jeremy Chadwick wrote:
>
> > On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
> >>
> >> Then MASTER goes down, CARP on the second node goes MASTER
> >> (devd.conf, and script for lifting):
> >>
> >> root@nfs2:/root # cat /etc/devd.conf
> >>
> >> notify 30 {
> >>     match "system" "IFNET";
> >>     match "subsystem" "carp0";
> >>     match "type" "LINK_UP";
> >>     action "/etc/zfs_switch.sh active";
> >> };
> >>
> >> notify 30 {
> >>     match "system" "IFNET";
> >>     match "subsystem" "carp0";
> >>     match "type" "LINK_DOWN";
> >>     action "/etc/zfs_switch.sh backup";
> >> };
> >>
> >> root@nfs2:/root # cat /etc/zfs_switch.sh
> >> #!/bin/sh
> >>
> >> DATE=`date +%Y%m%d`
> >> HOSTNAME=`hostname`
> >>
> >> ZFS_POOL="jbod"
> >>
> >> case $1 in
> >> active)
> >>     echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
> >>     sleep 10
> >>     /sbin/zpool import -f jbod
> >>     /etc/rc.d/mountd restart
> >>     /etc/rc.d/nfsd restart
> >>     ;;
> >> backup)
> >>     echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
> >>     /sbin/zpool export jbod
> >>     /etc/rc.d/mountd restart
> >>     /etc/rc.d/nfsd restart
> >>     ;;
> >> *)
> >>     exit 0
> >>     ;;
> >> esac
> >>
> >> This works most of the time, but sometimes I'm forced to re-create
> >> the pool. Those machines are supposed to go into prod. Losing the
> >> pool (and the data inside it) stops me from deploying this setup.
> >
> > This script looks highly error-prone. Hasty hasty... :-)
> >
> > This script assumes that the "zpool" commands (import and export)
> > always work/succeed; there is no exit code ($?) checking being used.
> >
> > Since this is run from within devd(8): where do stdout/stderr go
> > when running a program/script under devd(8)? Do they effectively go
> > to the bit bucket (/dev/null)? If so, you'd never know whether the
> > import or export actually succeeded (the export sounds more likely
> > to be the problem point).
> >
> > I imagine there would be some situations where the export would
> > fail (some files on filesystems under pool "jbod" still in use),
> > yet CARP is already blindly assuming everything will be fantastic.
> > Surprise.
> >
> > I also do not know if devd.conf(5) "action" commands spawn a
> > sub-shell (/bin/sh) or not. If they don't, you won't be able to use
> > things like 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.
> > You would then need to implement the equivalent of logging within
> > your zfs_switch.sh script.
> >
> > You may want to consider the -f flag to zpool import/export
> > (particularly export). However, there are risks involved -- userland
> > applications which have an fd/fh open on a file which is stored on a
> > filesystem that has now completely disappeared can sometimes crash
> > (segfault) or behave very oddly (100% CPU usage, etc.) depending on
> > how they're designed.
> >
> > Basically what I'm trying to say is that devd(8) being used as a
> > form of HA (high availability) and load balancing is not always
> > possible. Real/true HA (especially with SANs) is often done very
> > differently (now you know why it's often proprietary. :-) )

Add error checking to your script. That's my first and foremost
recommendation. It's not hard to do, really. :-) (A rough sketch
follows below.)

After you do that and still experience the issue (e.g. you see no
actual errors/issues during the export/import phases), I recommend
removing the "cache" devices, which are "independent" on each system,
from the pool entirely.
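To be concrete about the error-checking point: here's a rough,
untested sketch of what the "backup" half of zfs_switch.sh could look
like with exit status checking and its own logging (the log path
/var/log/failover.log is just an example, since stdout/stderr under
devd(8) may go nowhere useful):

#!/bin/sh
# Sketch only -- untested; adjust pool name and log path to taste.
DATE=`date +%Y%m%d`
HOSTNAME=`hostname`
ZFS_POOL="jbod"
LOG="/var/log/failover.log"     # example location, pick your own

case $1 in
backup)
    echo "$DATE: $HOSTNAME switching to BACKUP" >> $LOG
    /sbin/zpool export $ZFS_POOL >> $LOG 2>&1
    rc=$?
    if [ $rc -ne 0 ]; then
        # Export failed (e.g. filesystems still busy) -- log it,
        # complain loudly, and do NOT pretend the failover worked.
        echo "$DATE: zpool export $ZFS_POOL FAILED (exit $rc)" >> $LOG
        echo "zpool export failed with exit code $rc" | \
            mail -s "$DATE: $HOSTNAME export FAILED" root
        exit 1
    fi
    /etc/rc.d/mountd restart >> $LOG 2>&1
    /etc/rc.d/nfsd restart >> $LOG 2>&1
    ;;
esac

The "active" half would wrap /sbin/zpool import -f the same way, and
bail out (with a loud mail) instead of restarting mountd/nfsd when
the import fails.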
Quoting you (for readers, since I snipped it from my previous reply):

>>> Note, that ZIL(mirrored) resides on external enclosure. Only L2ARC
>>> is both local and external - da1,da2, da13s2, da14s2

I interpret this to mean the primary and backup nodes (physical
systems) have actual disks which are not part of the "external
enclosure". If that's the case -- those disks are always going to
vary in their contents and metadata. Those are never going to be 100%
identical all the time (is this not obvious?). I'm surprised your
stuff has worked at all using that model, honestly.

ZFS is going to bitch/cry if it cannot verify the integrity of
certain things, all the way down to the L2ARC. That's my
understanding of it at least, meaning there must always be "some"
kind of metadata that has to be kept/maintained there.

Alternately you could try doing this:

    zpool remove jbod cache daX daY ...
    zpool export jbod

Then on the other system:

    zpool import jbod
    zpool add jbod cache daX daY ...

Where daX and daY are the disks which are independent to each system
(not on the "external enclosure").

Finally, it would also be useful/worthwhile if you would provide
"dmesg" from both systems, and for you to explain the physical wiring
along with what device (e.g. daX) correlates with what exact thing on
each system. (We right now have no knowledge of that, and your terse
explanations imply we do -- we need to know more.)

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |