From owner-freebsd-fs@FreeBSD.ORG Tue Jun 11 08:46:28 2013
From: mxb <mxb@alumni.chalmers.se>
To: Jeremy Chadwick <jdc@koitsu.org>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: zpool export/import on failover - The pool metadata is corrupted
Date: Tue, 11 Jun 2013 10:46:22 +0200
Message-Id: <61E414CF-FCD3-42BB-9533-A40EA934DB99@alumni.chalmers.se>
In-Reply-To: <20130606233417.GA46506@icarus.home.lan>
References: <016B635E-4EDC-4CDF-AC58-82AC39CBFF56@alumni.chalmers.se>
 <20130606223911.GA45807@icarus.home.lan>
 <20130606233417.GA46506@icarus.home.lan>

Thanks to everyone who replied.

Removing the local L2ARC cache disks (da1, da2) from the pool indeed
turned out to be the cure for my problem.

Next up is to test removing/re-adding the cache disks around each
export/import, as Jeremy suggested.
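
Roughly what I intend to script -- untested, and da1/da2 here stand for
whichever cache disks are local to the node in question:

    # Node giving up the pool: detach the node-local L2ARC devices
    # first, then export, so that no local-only vdev follows the pool.
    /sbin/zpool remove jbod da1 da2 && /sbin/zpool export jbod

    # Node taking over the pool: import first, then re-add that node's
    # own local cache disks.
    /sbin/zpool import jbod && /sbin/zpool add jbod cache da1 da2
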
//mxb

On 7 jun 2013, at 01:34, Jeremy Chadwick wrote:

> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>
>> Sure, the script is not perfect yet and does not handle a lot of
>> cases, but shifting the spotlight from zpool import/export to the
>> script itself is not that clever, as the script works most of the
>> time.
>>
>> The question is WHY ZFS corrupts its metadata when it should not --
>> sometimes. I've seen a stale zpool when manually importing/exporting
>> the pool.
>>
>>
>> On 7 jun 2013, at 00:39, Jeremy Chadwick wrote:
>>
>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>
>>>> When the MASTER goes down, CARP on the second node becomes MASTER
>>>> (devd.conf, plus a script that does the lifting):
>>>>
>>>> root@nfs2:/root # cat /etc/devd.conf
>>>>
>>>> notify 30 {
>>>>     match "system"    "IFNET";
>>>>     match "subsystem" "carp0";
>>>>     match "type"      "LINK_UP";
>>>>     action "/etc/zfs_switch.sh active";
>>>> };
>>>>
>>>> notify 30 {
>>>>     match "system"    "IFNET";
>>>>     match "subsystem" "carp0";
>>>>     match "type"      "LINK_DOWN";
>>>>     action "/etc/zfs_switch.sh backup";
>>>> };
>>>>
>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
>>>> #!/bin/sh
>>>>
>>>> DATE=`date +%Y%m%d`
>>>> HOSTNAME=`hostname`
>>>>
>>>> ZFS_POOL="jbod"
>>>>
>>>> case "$1" in
>>>> active)
>>>>     echo "Switching to ACTIVE and importing ZFS" | \
>>>>         mail -s "$DATE: $HOSTNAME switching to ACTIVE" root
>>>>     sleep 10
>>>>     /sbin/zpool import -f ${ZFS_POOL}
>>>>     /etc/rc.d/mountd restart
>>>>     /etc/rc.d/nfsd restart
>>>>     ;;
>>>> backup)
>>>>     echo "Switching to BACKUP and exporting ZFS" | \
>>>>         mail -s "$DATE: $HOSTNAME switching to BACKUP" root
>>>>     /sbin/zpool export ${ZFS_POOL}
>>>>     /etc/rc.d/mountd restart
>>>>     /etc/rc.d/nfsd restart
>>>>     ;;
>>>> *)
>>>>     exit 0
>>>>     ;;
>>>> esac
>>>>
>>>> This works most of the time, but sometimes I'm forced to re-create
>>>> the pool. These machines are supposed to go into production, and
>>>> losing the pool (and the data inside it) stops me from deploying
>>>> this setup.
>>>
>>> This script looks highly error-prone. Hasty hasty... :-)
>>>
>>> This script assumes that the "zpool" commands (import and export)
>>> always work/succeed; there is no exit code ($?) checking being used.
>>>
>>> Since this is run from within devd(8): where do stdout/stderr go
>>> when running a program/script under devd(8)? Do they effectively go
>>> to the bit bucket (/dev/null)? If so, you'd never know whether the
>>> import or export actually succeeded (the export sounds more likely
>>> to be the problem point).
>>>
>>> I imagine there would be some situations where the export would
>>> fail (some files on filesystems under pool "jbod" still in use),
>>> yet CARP is already blindly assuming everything will be fantastic.
>>> Surprise.
>>>
>>> I also do not know if devd.conf(5) "action" commands spawn a
>>> sub-shell (/bin/sh) or not. If they don't, you won't be able to use
>>> things like
>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.
>>> You would then need to implement the equivalent of logging within
>>> your zfs_switch.sh script.
>>>
>>> You may want to consider the -f flag to zpool import/export
>>> (particularly export). However, there are risks involved -- userland
>>> applications which have an fd/fh open on a file which is stored on a
>>> filesystem that has now completely disappeared can sometimes crash
>>> (segfault) or behave very oddly (100% CPU usage, etc.), depending on
>>> how they're designed.
>>>
>>> Basically, what I'm trying to say is that using devd(8) as a form of
>>> HA (high availability) and load balancing is not always possible.
>>> Real/true HA (especially with SANs) is often done very differently
>>> (now you know why it's often proprietary :-) ).
>
> Add error checking to your script. That's my first and foremost
> recommendation. It's not hard to do, really. :-)
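
Understood. For the archives, the kind of checking I have in mind for
the "backup" branch -- an untested sketch that reuses the ZFS_POOL and
HOSTNAME variables from the top of zfs_switch.sh, and the log path from
Jeremy's example above:

    LOG="/var/log/failover.log"

    log() {
        echo "`date '+%Y%m%d %H:%M:%S'` $1" >> ${LOG}
    }

    # devd(8) may well be sending stdout/stderr to the bit bucket, so
    # keep our own log and capture zpool's output in it as well.
    /sbin/zpool export ${ZFS_POOL} >> ${LOG} 2>&1
    rc=$?
    if [ ${rc} -ne 0 ]; then
        log "zpool export ${ZFS_POOL} failed (exit ${rc}), not switching"
        echo "export of ${ZFS_POOL} failed, pool NOT handed over" | \
            mail -s "${HOSTNAME}: failover to BACKUP aborted" root
        exit 1
    fi
    log "zpool export ${ZFS_POOL} OK"
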
> After you do that and still experience the issue (e.g. you see no
> actual errors/issues during the export/import phases), I recommend
> removing the "cache" devices, which are "independent" on each system,
> from the pool entirely. Quoting you (for readers, since I snipped it
> from my previous reply):
>
>>>> Note that the ZIL (mirrored) resides on the external enclosure.
>>>> Only the L2ARC is both local and external - da1, da2, da13s2, da14s2
>
> I interpret this to mean the primary and backup nodes (physical
> systems) have actual disks which are not part of the "external
> enclosure". If that's the case -- those disks are always going to
> vary in their contents and metadata. They are never going to be 100%
> identical all the time (is this not obvious?). I'm surprised your
> setup has worked at all using that model, honestly.
>
> ZFS is going to bitch/cry if it cannot verify the integrity of certain
> things, all the way down to the L2ARC. That's my understanding of it,
> at least, meaning there must always be "some" kind of metadata that
> has to be kept/maintained there.
>
> Alternately, you could try doing this:
>
> zpool remove jbod daX daY ...
> zpool export jbod
>
> Then, on the other system:
>
> zpool import jbod
> zpool add jbod cache daX daY ...
>
> Where daX and daY are the cache disks which are local to each system
> (not on the "external enclosure").
>
> Finally, it would also be useful/worthwhile if you would provide
> "dmesg" output from both systems and explain the physical wiring,
> along with which device (e.g. daX) corresponds to which exact piece
> of hardware on each system. (Right now we have no knowledge of that,
> and your terse explanations imply we do -- we need to know more.)
>
> --
> | Jeremy Chadwick                                   jdc@koitsu.org |
> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> | Making life hard for others since 1977.             PGP 4BD6C0CB |