From: mxb <mxb@alumni.chalmers.se>
To: araujo@FreeBSD.org
Cc: freebsd-fs@freebsd.org
Date: Thu, 27 Jun 2013 12:22:32 +0200
Subject: Re: zpool export/import on failover - The pool metadata is corrupted

This solution is built on top of CARP. One of the nodes is the preferred master (by way of its advskew).

The trigger chain is: CARP -> devd -> failover_script.sh (zfs import/export).

On 27 jun 2013, at 11:43, Marcelo Araujo <araujo@FreeBSD.org> wrote:

> For this failover solution, did you create a heartbeat or something like that? How do you avoid split-brain?
>
> Best Regards.
>
>
> 2013/6/27 mxb
>
> A note for the archives.
>
> So far I have not experienced any problems with either the local (per head unit) or the external (on the disk enclosure)
> caches while importing and exporting my pool. The disks I use on both nodes are identical - same manufacturer, size and model.
>
> da1,da2   - local
> da32,da33 - external
>
> Export/import is done WITHOUT removing/adding the local disks.
>
> root@nfs1:/root # zpool status
>   pool: jbod
>  state: ONLINE
>   scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         jbod          ONLINE       0     0     0
>           raidz3-0    ONLINE       0     0     0
>             da10      ONLINE       0     0     0
>             da11      ONLINE       0     0     0
>             da12      ONLINE       0     0     0
>             da13      ONLINE       0     0     0
>             da14      ONLINE       0     0     0
>             da15      ONLINE       0     0     0
>             da16      ONLINE       0     0     0
>             da17      ONLINE       0     0     0
>             da18      ONLINE       0     0     0
>             da19      ONLINE       0     0     0
>         logs
>           mirror-1    ONLINE       0     0     0
>             da32s1    ONLINE       0     0     0
>             da33s1    ONLINE       0     0     0
>         cache
>           da32s2      ONLINE       0     0     0
>           da33s2      ONLINE       0     0     0
>           da1         ONLINE       0     0     0
>           da2         ONLINE       0     0     0
>
> On 25 jun 2013, at 21:22, mxb wrote:
>
> >
> > I think I've found the root of this issue.
> > It looks like "wiring down" the disks the same way on both nodes (as suggested) fixes it.
> >
> > //mxb
> >
> > On 20 jun 2013, at 12:30, mxb wrote:
> >
> >>
> >> Well,
> >>
> >> I'm back to square one.
> >>
> >> After some uptime and successful import/export from one node to another, I eventually got 'metadata corruption'.
> >> I previously had no problem with import/export while, for example, rebooting the master node (nfs1), but not THIS time.
> >> The metadata got corrupted while rebooting the master node??
> >>
> >> Any ideas?
> >>
> >> [root@nfs1 ~]# zpool import
> >>    pool: jbod
> >>      id: 7663925948774378610
> >>   state: FAULTED
> >>  status: The pool metadata is corrupted.
> >>  action: The pool cannot be imported due to damaged devices or data.
> >>     see: http://illumos.org/msg/ZFS-8000-72
> >>  config:
> >>
> >>         jbod          FAULTED  corrupted data
> >>           raidz3-0    ONLINE
> >>             da3       ONLINE
> >>             da4       ONLINE
> >>             da5       ONLINE
> >>             da6       ONLINE
> >>             da7       ONLINE
> >>             da8       ONLINE
> >>             da9       ONLINE
> >>             da10      ONLINE
> >>             da11      ONLINE
> >>             da12      ONLINE
> >>         cache
> >>           da13s2
> >>           da14s2
> >>         logs
> >>           mirror-1    ONLINE
> >>             da13s1    ONLINE
> >>             da14s1    ONLINE
> >> [root@nfs1 ~]# zpool import jbod
> >> cannot import 'jbod': I/O error
> >>         Destroy and re-create the pool from
> >>         a backup source.
> >> [root@nfs1 ~]#
> >>
> >> On 11 jun 2013, at 10:46, mxb wrote:
> >>
> >>>
> >>> Thanks to everyone who replied.
> >>> Removing the local L2ARC cache disks (da1,da2) indeed turned out to be a cure for my problem.
> >>>
> >>> Next is to test with add/remove after import/export, as Jeremy suggested.
> >>>
> >>> //mxb
> >>>
> >>> On 7 jun 2013, at 01:34, Jeremy Chadwick wrote:
> >>>
> >>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
> >>>>>
> >>>>> Sure, the script is not perfect yet and does not handle a lot of things, but shifting the spotlight from
> >>>>> zpool import/export to the script itself is not that clever, as the script works most of the time.
> >>>>>
> >>>>> The question is WHY ZFS corrupts its metadata when it should not. Sometimes.
> >>>>> I've seen the zpool go stale when manually importing/exporting the pool.
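(For the archives: "wiring down" above refers to pinning the CAM device numbering in /boot/device.hints so that a given physical disk always probes as the same daN name on both heads. The snippet below is a minimal sketch, not the poster's actual configuration; the mps0 controller and the target numbers are hypothetical, so check camcontrol devlist for the real ones before copying anything.

hint.scbus.0.at="mps0"
hint.da.0.at="scbus0"
hint.da.0.target="0"
hint.da.0.unit="0"
hint.da.1.at="scbus0"
hint.da.1.target="1"
hint.da.1.unit="0"

The same hints go into /boot/device.hints on both nodes, so the pool's vdev names stay identical regardless of which head imports it.)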
> >>>>>
> >>>>>
> >>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick wrote:
> >>>>>
> >>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
> >>>>>>>
> >>>>>>> When the MASTER goes down, CARP on the second node becomes MASTER (devd.conf, and the script for lifting):
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/devd.conf
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>       match "system"          "IFNET";
> >>>>>>>       match "subsystem"       "carp0";
> >>>>>>>       match "type"            "LINK_UP";
> >>>>>>>       action "/etc/zfs_switch.sh active";
> >>>>>>> };
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>       match "system"          "IFNET";
> >>>>>>>       match "subsystem"       "carp0";
> >>>>>>>       match "type"            "LINK_DOWN";
> >>>>>>>       action "/etc/zfs_switch.sh backup";
> >>>>>>> };
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
> >>>>>>> #!/bin/sh
> >>>>>>>
> >>>>>>> DATE=`date +%Y%m%d`
> >>>>>>> HOSTNAME=`hostname`
> >>>>>>>
> >>>>>>> ZFS_POOL="jbod"
> >>>>>>>
> >>>>>>> case $1 in
> >>>>>>>     active)
> >>>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
> >>>>>>>         sleep 10
> >>>>>>>         /sbin/zpool import -f jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     backup)
> >>>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
> >>>>>>>         /sbin/zpool export jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     *)
> >>>>>>>         exit 0
> >>>>>>>         ;;
> >>>>>>> esac
> >>>>>>>
> >>>>>>> This works most of the time, but sometimes I'm forced to re-create the pool. These machines are supposed to go into production.
> >>>>>>> Losing the pool (and the data inside it) stops me from deploying this setup.
> >>>>>>
> >>>>>> This script looks highly error-prone.  Hasty hasty... :-)
> >>>>>>
> >>>>>> This script assumes that the "zpool" commands (import and export) always
> >>>>>> work/succeed; there is no exit code ($?) checking being used.
> >>>>>>
> >>>>>> Since this is run from within devd(8): where does stdout/stderr go to
> >>>>>> when running a program/script under devd(8)?  Does it effectively go
> >>>>>> to the bit bucket (/dev/null)?  If so, you'd never know if the import or
> >>>>>> export actually succeeded or not (the export sounds more likely to be
> >>>>>> the problem point).
> >>>>>>
> >>>>>> I imagine there would be some situations where the export would fail
> >>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
> >>>>>> already blindly assuming everything will be fantastic.  Surprise.
> >>>>>>
> >>>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
> >>>>>> (/bin/sh) or not.  If they don't, you won't be able to use things like
> >>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.  You
> >>>>>> would then need to implement the equivalent of logging within your
> >>>>>> zfs_switch.sh script.
> >>>>>>
> >>>>>> You may want to consider the -f flag to zpool import/export
> >>>>>> (particularly export).  However there are risks involved -- userland
> >>>>>> applications which have an fd/fh open on a file which is stored on a
> >>>>>> filesystem that has now completely disappeared can sometimes crash
> >>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
> >>>>>> they're designed.
> >>>>>>
> >>>>>> Basically what I'm trying to say is that devd(8) being used as a form of
> >>>>>> HA (high availability) and load balancing is not always possible.
> >>>>>> Real/true HA (especially with SANs) is often done very differently (now
> >>>>>> you know why it's often proprietary. :-) )
> >>>>
> >>>> Add error checking to your script.  That's my first and foremost
> >>>> recommendation.  It's not hard to do, really.  :-)
> >>>>
> >>>> After you do that and still experience the issue (e.g. you see no actual
> >>>> errors/issues during the export/import phases), I recommend removing
> >>>> the "cache" devices which are "independent" on each system from the pool
> >>>> entirely.  Quoting you (for readers, since I snipped it from my previous
> >>>> reply):
> >>>>
> >>>>>>> Note that the ZIL (mirrored) resides on the external enclosure. Only the L2ARC
> >>>>>>> is both local and external - da1, da2, da13s2, da14s2.
> >>>>
> >>>> I interpret this to mean the primary and backup nodes (physical systems)
> >>>> have actual disks which are not part of the "external enclosure".  If
> >>>> that's the case -- those disks are always going to vary in their
> >>>> contents and metadata.  Those are never going to be 100% identical all
> >>>> the time (is this not obvious?).  I'm surprised your stuff has worked at
> >>>> all using that model, honestly.
> >>>>
> >>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
> >>>> things, all the way down to the L2ARC.  That's my understanding of it at
> >>>> least, meaning there must always be "some" kind of metadata that has to
> >>>> be kept/maintained there.
> >>>>
> >>>> Alternately you could try doing this:
> >>>>
> >>>>   zpool remove jbod cache daX daY ...
> >>>>   zpool export jbod
> >>>>
> >>>> Then on the other system:
> >>>>
> >>>>   zpool import jbod
> >>>>   zpool add jbod cache daX daY ...
> >>>>
> >>>> Where daX and daY are the disks which are independent to each system
> >>>> (not on the "external enclosure").
> >>>>
> >>>> Finally, it would also be useful/worthwhile if you would provide
> >>>> "dmesg" from both systems and for you to explain the physical wiring
> >>>> along with what device (e.g. daX) correlates with what exact thing on
> >>>> each system.  (We right now have no knowledge of that, and your terse
> >>>> explanations imply we do -- we need to know more.)
> >>>>
> >>>> --
> >>>> | Jeremy Chadwick                                   jdc@koitsu.org |
> >>>> | UNIX Systems Administrator               http://jdc.koitsu.org/ |
> >>>> | Making life hard for others since 1977.            PGP 4BD6C0CB |
> >>>>
> >>>
> >>
> >
>
> --
> Marcelo Araujo
> araujo@FreeBSD.org
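(Following the error-checking advice above, a reworked /etc/zfs_switch.sh might look like the sketch below. This is not taken from the thread: the log file path and the notify helper are assumptions, and only the pool name, the mail-to-root behaviour and the rc.d restarts mirror the original script. Since devd(8) may not give the script a useful stdout/stderr, everything is redirected into the log file so a failed export or import is at least visible afterwards.

#!/bin/sh
# Sketch only: zfs_switch.sh with exit-status checking and logging.
# LOG is a placeholder path; adjust to taste.

DATE=`date +%Y%m%d`
HOSTNAME=`hostname`
ZFS_POOL="jbod"
LOG="/var/log/failover.log"

log() {
        # Timestamped entry in the failover log.
        echo "`date '+%Y-%m-%d %H:%M:%S'` $*" >> ${LOG}
}

notify() {
        # Mail a short status line to root, as the original script does.
        echo "$1" | mail -s "${DATE}: ${HOSTNAME} $1" root
}

case $1 in
active)
        notify "switching to ACTIVE"
        sleep 10
        /sbin/zpool import -f ${ZFS_POOL} >> ${LOG} 2>&1
        rc=$?
        if [ ${rc} -ne 0 ]; then
                log "zpool import ${ZFS_POOL} FAILED (exit ${rc})"
                notify "zpool import FAILED"
                exit 1
        fi
        log "zpool import ${ZFS_POOL} succeeded"
        /etc/rc.d/mountd restart >> ${LOG} 2>&1
        /etc/rc.d/nfsd restart >> ${LOG} 2>&1
        ;;
backup)
        notify "switching to BACKUP"
        /sbin/zpool export ${ZFS_POOL} >> ${LOG} 2>&1
        rc=$?
        if [ ${rc} -ne 0 ]; then
                log "zpool export ${ZFS_POOL} FAILED (exit ${rc}); pool may still be in use"
                notify "zpool export FAILED"
                exit 1
        fi
        log "zpool export ${ZFS_POOL} succeeded"
        /etc/rc.d/mountd restart >> ${LOG} 2>&1
        /etc/rc.d/nfsd restart >> ${LOG} 2>&1
        ;;
*)
        exit 0
        ;;
esac
)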