From: mxb <mxb@alumni.chalmers.se>
To: araujo@FreeBSD.org
Cc: freebsd-fs@freebsd.org
Date: Thu, 27 Jun 2013 12:22:32 +0200
Subject: Re: zpool export/import on failover - The pool metadata is corrupted

This solution is built on top of CARP. One of the nodes is the preferred master (by way of its advskew).

The trigger chain is: CARP -> devd -> failover_script.sh (zfs import/export).

On 27 jun 2013, at 11:43, Marcelo Araujo <araujo@FreeBSD.org> wrote:

> For this failover solution, did you create a heartbeat or something like that? How do you avoid split-brain?
>
> Best Regards.
>
>
> 2013/6/27 mxb
>
> A note for the archives.
>
> So far I have not experienced any problems with either the local (per head unit) or the external (on the disk enclosure)
> caches while importing and exporting my pool. The disks I use on both nodes are identical - same manufacturer, size and model.
>
> da1,da2   - local
> da32,da33 - external
>
> Export/import is done WITHOUT removing/adding the local disks.
>
> root@nfs1:/root # zpool status
>   pool: jbod
>  state: ONLINE
>   scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         jbod          ONLINE       0     0     0
>           raidz3-0    ONLINE       0     0     0
>             da10      ONLINE       0     0     0
>             da11      ONLINE       0     0     0
>             da12      ONLINE       0     0     0
>             da13      ONLINE       0     0     0
>             da14      ONLINE       0     0     0
>             da15      ONLINE       0     0     0
>             da16      ONLINE       0     0     0
>             da17      ONLINE       0     0     0
>             da18      ONLINE       0     0     0
>             da19      ONLINE       0     0     0
>         logs
>           mirror-1    ONLINE       0     0     0
>             da32s1    ONLINE       0     0     0
>             da33s1    ONLINE       0     0     0
>         cache
>           da32s2      ONLINE       0     0     0
>           da33s2      ONLINE       0     0     0
>           da1         ONLINE       0     0     0
>           da2         ONLINE       0     0     0
>
> On 25 jun 2013, at 21:22, mxb wrote:
>
> >
> > I think I've found the root of this issue.
> > It looks like "wiring down" the disks the same way on both nodes (as suggested) fixes it.
> >
> > //mxb
> >
> > On 20 jun 2013, at 12:30, mxb wrote:
> >
> >>
> >> Well,
> >>
> >> I'm back to square one.
> >>
> >> After some uptime and successful import/export from one node to another, I eventually got 'metadata corruption'.
> >> I previously had no problem with import/export while, for example, rebooting the master node (nfs1), but not THIS time.
> >> The metadata got corrupted while rebooting the master node??
> >>
> >> Any ideas?
> >>
> >> [root@nfs1 ~]# zpool import
> >>    pool: jbod
> >>      id: 7663925948774378610
> >>   state: FAULTED
> >>  status: The pool metadata is corrupted.
> >>  action: The pool cannot be imported due to damaged devices or data.
> >>     see: http://illumos.org/msg/ZFS-8000-72
> >>  config:
> >>
> >>         jbod          FAULTED  corrupted data
> >>           raidz3-0    ONLINE
> >>             da3       ONLINE
> >>             da4       ONLINE
> >>             da5       ONLINE
> >>             da6       ONLINE
> >>             da7       ONLINE
> >>             da8       ONLINE
> >>             da9       ONLINE
> >>             da10      ONLINE
> >>             da11      ONLINE
> >>             da12      ONLINE
> >>         cache
> >>           da13s2
> >>           da14s2
> >>         logs
> >>           mirror-1    ONLINE
> >>             da13s1    ONLINE
> >>             da14s1    ONLINE
> >> [root@nfs1 ~]# zpool import jbod
> >> cannot import 'jbod': I/O error
> >>         Destroy and re-create the pool from
> >>         a backup source.
> >> [root@nfs1 ~]#
> >>
> >> On 11 jun 2013, at 10:46, mxb wrote:
> >>
> >>>
> >>> Thanks to everyone who replied.
> >>> Removing the local L2ARC cache disks (da1,da2) indeed turned out to be a cure for my problem.
> >>>
> >>> Next is to test with add/remove after import/export, as Jeremy suggested.
> >>>
> >>> //mxb
> >>>
> >>> On 7 jun 2013, at 01:34, Jeremy Chadwick wrote:
> >>>
> >>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
> >>>>>
> >>>>> Sure, the script is not perfect yet and does not handle a lot of things, but shifting the spotlight from
> >>>>> zpool import/export to the script itself is not that clever, as the script works most of the time.
> >>>>>
> >>>>> The question is WHY ZFS corrupts its metadata when it should not. Sometimes.
> >>>>> I've seen the zpool go stale when manually importing/exporting the pool.
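(For the archives: "wiring down" above refers to pinning the CAM device numbering in /boot/device.hints so that a given physical disk always probes as the same daN name on both heads. The snippet below is a minimal sketch, not the poster's actual configuration; the mps0 controller and the target numbers are hypothetical, so check camcontrol devlist for the real ones before copying anything.

hint.scbus.0.at="mps0"
hint.da.0.at="scbus0"
hint.da.0.target="0"
hint.da.0.unit="0"
hint.da.1.at="scbus0"
hint.da.1.target="1"
hint.da.1.unit="0"

The same hints go into /boot/device.hints on both nodes, so the pool's vdev names stay identical regardless of which head imports it.)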
> >>>>>
> >>>>>
> >>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick wrote:
> >>>>>
> >>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
> >>>>>>>
> >>>>>>> When the MASTER goes down, CARP on the second node becomes MASTER (devd.conf, and the script for lifting):
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/devd.conf
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>       match "system"          "IFNET";
> >>>>>>>       match "subsystem"       "carp0";
> >>>>>>>       match "type"            "LINK_UP";
> >>>>>>>       action "/etc/zfs_switch.sh active";
> >>>>>>> };
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>       match "system"          "IFNET";
> >>>>>>>       match "subsystem"       "carp0";
> >>>>>>>       match "type"            "LINK_DOWN";
> >>>>>>>       action "/etc/zfs_switch.sh backup";
> >>>>>>> };
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
> >>>>>>> #!/bin/sh
> >>>>>>>
> >>>>>>> DATE=`date +%Y%m%d`
> >>>>>>> HOSTNAME=`hostname`
> >>>>>>>
> >>>>>>> ZFS_POOL="jbod"
> >>>>>>>
> >>>>>>> case $1 in
> >>>>>>>     active)
> >>>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
> >>>>>>>         sleep 10
> >>>>>>>         /sbin/zpool import -f jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     backup)
> >>>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
> >>>>>>>         /sbin/zpool export jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     *)
> >>>>>>>         exit 0
> >>>>>>>         ;;
> >>>>>>> esac
> >>>>>>>
> >>>>>>> This works most of the time, but sometimes I'm forced to re-create the pool. These machines are supposed to go into production.
> >>>>>>> Losing the pool (and the data inside it) stops me from deploying this setup.
> >>>>>>
> >>>>>> This script looks highly error-prone.  Hasty hasty... :-)
> >>>>>>
> >>>>>> This script assumes that the "zpool" commands (import and export) always
> >>>>>> work/succeed; there is no exit code ($?) checking being used.
> >>>>>>
> >>>>>> Since this is run from within devd(8): where does stdout/stderr go to
> >>>>>> when running a program/script under devd(8)?  Does it effectively go
> >>>>>> to the bit bucket (/dev/null)?  If so, you'd never know if the import or
> >>>>>> export actually succeeded or not (the export sounds more likely to be
> >>>>>> the problem point).
> >>>>>>
> >>>>>> I imagine there would be some situations where the export would fail
> >>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
> >>>>>> already blindly assuming everything will be fantastic.  Surprise.
> >>>>>>
> >>>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
> >>>>>> (/bin/sh) or not.  If they don't, you won't be able to use things like
> >>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.  You
> >>>>>> would then need to implement the equivalent of logging within your
> >>>>>> zfs_switch.sh script.
> >>>>>>
> >>>>>> You may want to consider the -f flag to zpool import/export
> >>>>>> (particularly export).  However there are risks involved -- userland
> >>>>>> applications which have an fd/fh open on a file which is stored on a
> >>>>>> filesystem that has now completely disappeared can sometimes crash
> >>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
> >>>>>> they're designed.
> >>>>>>
> >>>>>> Basically what I'm trying to say is that devd(8) being used as a form of
> >>>>>> HA (high availability) and load balancing is not always possible.
> >>>>>> Real/true HA (especially with SANs) is often done very differently (now
> >>>>>> you know why it's often proprietary. :-) )
> >>>>
> >>>> Add error checking to your script.  That's my first and foremost
> >>>> recommendation.  It's not hard to do, really.  :-)
> >>>>
> >>>> After you do that and still experience the issue (e.g. you see no actual
> >>>> errors/issues during the export/import phases), I recommend removing
> >>>> the "cache" devices which are "independent" on each system from the pool
> >>>> entirely.  Quoting you (for readers, since I snipped it from my previous
> >>>> reply):
> >>>>
> >>>>>>> Note that the ZIL (mirrored) resides on the external enclosure. Only the L2ARC
> >>>>>>> is both local and external - da1, da2, da13s2, da14s2.
> >>>>
> >>>> I interpret this to mean the primary and backup nodes (physical systems)
> >>>> have actual disks which are not part of the "external enclosure".  If
> >>>> that's the case -- those disks are always going to vary in their
> >>>> contents and metadata.  Those are never going to be 100% identical all
> >>>> the time (is this not obvious?).  I'm surprised your stuff has worked at
> >>>> all using that model, honestly.
> >>>>
> >>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
> >>>> things, all the way down to the L2ARC.  That's my understanding of it at
> >>>> least, meaning there must always be "some" kind of metadata that has to
> >>>> be kept/maintained there.
> >>>>
> >>>> Alternately you could try doing this:
> >>>>
> >>>>   zpool remove jbod cache daX daY ...
> >>>>   zpool export jbod
> >>>>
> >>>> Then on the other system:
> >>>>
> >>>>   zpool import jbod
> >>>>   zpool add jbod cache daX daY ...
> >>>>
> >>>> Where daX and daY are the disks which are independent to each system
> >>>> (not on the "external enclosure").
> >>>>
> >>>> Finally, it would also be useful/worthwhile if you would provide
> >>>> "dmesg" from both systems and for you to explain the physical wiring
> >>>> along with what device (e.g. daX) correlates with what exact thing on
> >>>> each system.  (We right now have no knowledge of that, and your terse
> >>>> explanations imply we do -- we need to know more.)
> >>>>
> >>>> --
> >>>> | Jeremy Chadwick                                   jdc@koitsu.org |
> >>>> | UNIX Systems Administrator               http://jdc.koitsu.org/ |
> >>>> | Making life hard for others since 1977.            PGP 4BD6C0CB |
> >>>>
> >>>
> >>
> >
>
> --
> Marcelo Araujo
> araujo@FreeBSD.org
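(Following the error-checking advice above, a reworked /etc/zfs_switch.sh might look like the sketch below. This is not taken from the thread: the log file path and the notify helper are assumptions, and only the pool name, the mail-to-root behaviour and the rc.d restarts mirror the original script. Since devd(8) may not give the script a useful stdout/stderr, everything is redirected into the log file so a failed export or import is at least visible afterwards.

#!/bin/sh
# Sketch only: zfs_switch.sh with exit-status checking and logging.
# LOG is a placeholder path; adjust to taste.

DATE=`date +%Y%m%d`
HOSTNAME=`hostname`
ZFS_POOL="jbod"
LOG="/var/log/failover.log"

log() {
        # Timestamped entry in the failover log.
        echo "`date '+%Y-%m-%d %H:%M:%S'` $*" >> ${LOG}
}

notify() {
        # Mail a short status line to root, as the original script does.
        echo "$1" | mail -s "${DATE}: ${HOSTNAME} $1" root
}

case $1 in
active)
        notify "switching to ACTIVE"
        sleep 10
        /sbin/zpool import -f ${ZFS_POOL} >> ${LOG} 2>&1
        rc=$?
        if [ ${rc} -ne 0 ]; then
                log "zpool import ${ZFS_POOL} FAILED (exit ${rc})"
                notify "zpool import FAILED"
                exit 1
        fi
        log "zpool import ${ZFS_POOL} succeeded"
        /etc/rc.d/mountd restart >> ${LOG} 2>&1
        /etc/rc.d/nfsd restart >> ${LOG} 2>&1
        ;;
backup)
        notify "switching to BACKUP"
        /sbin/zpool export ${ZFS_POOL} >> ${LOG} 2>&1
        rc=$?
        if [ ${rc} -ne 0 ]; then
                log "zpool export ${ZFS_POOL} FAILED (exit ${rc}); pool may still be in use"
                notify "zpool export FAILED"
                exit 1
        fi
        log "zpool export ${ZFS_POOL} succeeded"
        /etc/rc.d/mountd restart >> ${LOG} 2>&1
        /etc/rc.d/nfsd restart >> ${LOG} 2>&1
        ;;
*)
        exit 0
        ;;
esac
)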