From owner-freebsd-fs@FreeBSD.ORG Tue Jun 11 08:46:28 2013
From: mxb <mxb@alumni.chalmers.se>
To: Jeremy Chadwick <jdc@koitsu.org>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: zpool export/import on failover - The pool metadata is corrupted
Date: Tue, 11 Jun 2013 10:46:22 +0200
Message-Id: <61E414CF-FCD3-42BB-9533-A40EA934DB99@alumni.chalmers.se>
In-Reply-To: <20130606233417.GA46506@icarus.home.lan>
References: <016B635E-4EDC-4CDF-AC58-82AC39CBFF56@alumni.chalmers.se>
 <20130606223911.GA45807@icarus.home.lan>
 <20130606233417.GA46506@icarus.home.lan>

Thanks to everyone who replied.

Removing the local L2ARC cache disks (da1, da2) from the pool indeed
turned out to be the cure for my problem.

Next up is to test removing/re-adding the cache disks around each
export/import, as Jeremy suggested.
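
Roughly what I intend to script -- untested, and da1/da2 here stand for
whichever cache disks are local to the node in question:

    # Node giving up the pool: detach the node-local L2ARC devices
    # first, then export, so that no local-only vdev follows the pool.
    /sbin/zpool remove jbod da1 da2 && /sbin/zpool export jbod

    # Node taking over the pool: import first, then re-add that node's
    # own local cache disks.
    /sbin/zpool import jbod && /sbin/zpool add jbod cache da1 da2
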
//mxb

On 7 jun 2013, at 01:34, Jeremy Chadwick wrote:

> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>
>> Sure, the script is not perfect yet and does not handle a lot of
>> cases, but shifting the spotlight from zpool import/export to the
>> script itself is not that clever, as the script works most of the
>> time.
>>
>> The question is WHY ZFS corrupts its metadata when it should not --
>> sometimes. I've seen a stale zpool when manually importing/exporting
>> the pool.
>>
>>
>> On 7 jun 2013, at 00:39, Jeremy Chadwick wrote:
>>
>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>
>>>> When the MASTER goes down, CARP on the second node becomes MASTER
>>>> (devd.conf, plus a script that does the lifting):
>>>>
>>>> root@nfs2:/root # cat /etc/devd.conf
>>>>
>>>> notify 30 {
>>>>     match "system"    "IFNET";
>>>>     match "subsystem" "carp0";
>>>>     match "type"      "LINK_UP";
>>>>     action "/etc/zfs_switch.sh active";
>>>> };
>>>>
>>>> notify 30 {
>>>>     match "system"    "IFNET";
>>>>     match "subsystem" "carp0";
>>>>     match "type"      "LINK_DOWN";
>>>>     action "/etc/zfs_switch.sh backup";
>>>> };
>>>>
>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
>>>> #!/bin/sh
>>>>
>>>> DATE=`date +%Y%m%d`
>>>> HOSTNAME=`hostname`
>>>>
>>>> ZFS_POOL="jbod"
>>>>
>>>> case "$1" in
>>>> active)
>>>>     echo "Switching to ACTIVE and importing ZFS" | \
>>>>         mail -s "$DATE: $HOSTNAME switching to ACTIVE" root
>>>>     sleep 10
>>>>     /sbin/zpool import -f ${ZFS_POOL}
>>>>     /etc/rc.d/mountd restart
>>>>     /etc/rc.d/nfsd restart
>>>>     ;;
>>>> backup)
>>>>     echo "Switching to BACKUP and exporting ZFS" | \
>>>>         mail -s "$DATE: $HOSTNAME switching to BACKUP" root
>>>>     /sbin/zpool export ${ZFS_POOL}
>>>>     /etc/rc.d/mountd restart
>>>>     /etc/rc.d/nfsd restart
>>>>     ;;
>>>> *)
>>>>     exit 0
>>>>     ;;
>>>> esac
>>>>
>>>> This works most of the time, but sometimes I'm forced to re-create
>>>> the pool. These machines are supposed to go into production, and
>>>> losing the pool (and the data inside it) stops me from deploying
>>>> this setup.
>>>
>>> This script looks highly error-prone. Hasty hasty... :-)
>>>
>>> This script assumes that the "zpool" commands (import and export)
>>> always work/succeed; there is no exit code ($?) checking being used.
>>>
>>> Since this is run from within devd(8): where do stdout/stderr go
>>> when running a program/script under devd(8)? Do they effectively go
>>> to the bit bucket (/dev/null)? If so, you'd never know whether the
>>> import or export actually succeeded (the export sounds more likely
>>> to be the problem point).
>>>
>>> I imagine there would be some situations where the export would
>>> fail (some files on filesystems under pool "jbod" still in use),
>>> yet CARP is already blindly assuming everything will be fantastic.
>>> Surprise.
>>>
>>> I also do not know if devd.conf(5) "action" commands spawn a
>>> sub-shell (/bin/sh) or not. If they don't, you won't be able to use
>>> things like
>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.
>>> You would then need to implement the equivalent of logging within
>>> your zfs_switch.sh script.
>>>
>>> You may want to consider the -f flag to zpool import/export
>>> (particularly export). However, there are risks involved -- userland
>>> applications which have an fd/fh open on a file which is stored on a
>>> filesystem that has now completely disappeared can sometimes crash
>>> (segfault) or behave very oddly (100% CPU usage, etc.), depending on
>>> how they're designed.
>>>
>>> Basically, what I'm trying to say is that using devd(8) as a form of
>>> HA (high availability) and load balancing is not always possible.
>>> Real/true HA (especially with SANs) is often done very differently
>>> (now you know why it's often proprietary :-) ).
>
> Add error checking to your script. That's my first and foremost
> recommendation. It's not hard to do, really. :-)
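
Understood. For the archives, the kind of checking I have in mind for
the "backup" branch -- an untested sketch that reuses the ZFS_POOL and
HOSTNAME variables from the top of zfs_switch.sh, and the log path from
Jeremy's example above:

    LOG="/var/log/failover.log"

    log() {
        echo "`date '+%Y%m%d %H:%M:%S'` $1" >> ${LOG}
    }

    # devd(8) may well be sending stdout/stderr to the bit bucket, so
    # keep our own log and capture zpool's output in it as well.
    /sbin/zpool export ${ZFS_POOL} >> ${LOG} 2>&1
    rc=$?
    if [ ${rc} -ne 0 ]; then
        log "zpool export ${ZFS_POOL} failed (exit ${rc}), not switching"
        echo "export of ${ZFS_POOL} failed, pool NOT handed over" | \
            mail -s "${HOSTNAME}: failover to BACKUP aborted" root
        exit 1
    fi
    log "zpool export ${ZFS_POOL} OK"
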
> After you do that and still experience the issue (e.g. you see no
> actual errors/issues during the export/import phases), I recommend
> removing the "cache" devices, which are "independent" on each system,
> from the pool entirely. Quoting you (for readers, since I snipped it
> from my previous reply):
>
>>>> Note that the ZIL (mirrored) resides on the external enclosure.
>>>> Only the L2ARC is both local and external - da1, da2, da13s2, da14s2
>
> I interpret this to mean the primary and backup nodes (physical
> systems) have actual disks which are not part of the "external
> enclosure". If that's the case -- those disks are always going to
> vary in their contents and metadata. They are never going to be 100%
> identical all the time (is this not obvious?). I'm surprised your
> setup has worked at all using that model, honestly.
>
> ZFS is going to bitch/cry if it cannot verify the integrity of certain
> things, all the way down to the L2ARC. That's my understanding of it,
> at least, meaning there must always be "some" kind of metadata that
> has to be kept/maintained there.
>
> Alternately, you could try doing this:
>
> zpool remove jbod daX daY ...
> zpool export jbod
>
> Then, on the other system:
>
> zpool import jbod
> zpool add jbod cache daX daY ...
>
> Where daX and daY are the cache disks which are local to each system
> (not on the "external enclosure").
>
> Finally, it would also be useful/worthwhile if you would provide
> "dmesg" output from both systems and explain the physical wiring,
> along with which device (e.g. daX) corresponds to which exact piece
> of hardware on each system. (Right now we have no knowledge of that,
> and your terse explanations imply we do -- we need to know more.)
>
> --
> | Jeremy Chadwick                                   jdc@koitsu.org |
> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> | Making life hard for others since 1977.             PGP 4BD6C0CB |