Date:      Thu, 27 Jun 2013 17:43:17 +0800
From:      Marcelo Araujo <araujobsdport@gmail.com>
To:        mxb <mxb@alumni.chalmers.se>
Cc:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: zpool export/import on failover - The pool metadata is corrupted
Message-ID:  <CAOfEmZj=12VOEv6RRQUAmRtm6Mp+xHo47DwT+wmUDqmRyQJU3w@mail.gmail.com>
In-Reply-To: <47B6A89F-6444-485A-88DD-69A9A93D9B3F@alumni.chalmers.se>
References:  <D7F099CB-855F-43F8-ACB5-094B93201B4B@alumni.chalmers.se> <CAKYr3zyPLpLau8xsv3fCkYrpJVzS0tXkyMn4E2aLz29EMBF9cA@mail.gmail.com> <016B635E-4EDC-4CDF-AC58-82AC39CBFF56@alumni.chalmers.se> <20130606223911.GA45807@icarus.home.lan> <C3FC39B3-D09F-4E73-9476-3BFC8B817278@alumni.chalmers.se> <20130606233417.GA46506@icarus.home.lan> <61E414CF-FCD3-42BB-9533-A40EA934DB99@alumni.chalmers.se> <09717048-12BE-474B-9B20-F5E72D00152E@alumni.chalmers.se> <5A26ABDE-C7F2-41CC-A3D1-69310AB6BC36@alumni.chalmers.se> <47B6A89F-6444-485A-88DD-69A9A93D9B3F@alumni.chalmers.se>

For this failover solution, did you create a heartbeat or something like
that? How do you avoid split-brain?
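
E.g., before the backup node imports the pool, do you first verify that the
other node is really dead? A minimal sketch of the kind of guard I mean (the
peer address below is made up):

        # refuse to import if the peer still answers
        if ping -c 3 -t 5 10.0.0.1 > /dev/null 2>&1; then
                logger "peer still alive, refusing to import jbod"
                exit 1
        fi
        /sbin/zpool import -f jbod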

Best Regards.


2013/6/27 mxb <mxb@alumni.chalmers.se>

>
> A note for the archives.
>
> I have so far not experienced any problems with either the local (per head
> unit) or the external (on the disk enclosure) caches while importing and
> exporting my pool. The disks I use on both nodes are identical -
> manufacturer, size, model.
>
> da1,da2 - local
> da32,da33 - external
>
> Export/import is done WITHOUT removing/adding local disks.
>
> root@nfs1:/root # zpool status
>   pool: jbod
>  state: ONLINE
>   scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         jbod        ONLINE       0     0     0
>           raidz3-0  ONLINE       0     0     0
>             da10    ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>             da12    ONLINE       0     0     0
>             da13    ONLINE       0     0     0
>             da14    ONLINE       0     0     0
>             da15    ONLINE       0     0     0
>             da16    ONLINE       0     0     0
>             da17    ONLINE       0     0     0
>             da18    ONLINE       0     0     0
>             da19    ONLINE       0     0     0
>         logs
>           mirror-1  ONLINE       0     0     0
>             da32s1  ONLINE       0     0     0
>             da33s1  ONLINE       0     0     0
>         cache
>           da32s2    ONLINE       0     0     0
>           da33s2    ONLINE       0     0     0
>           da1       ONLINE       0     0     0
>           da2       ONLINE       0     0     0
>
> On 25 jun 2013, at 21:22, mxb <mxb@alumni.chalmers.se> wrote:
>
> >
> > I think I've found the root of this issue.
> > Looks like "wiring down" the disks the same way on both nodes (as
> > suggested) fixes it; a sketch of what I mean follows.
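> >
> > The wiring goes in /boot/device.hints, identical on both nodes. The
> > controller name and bus/target/lun numbers below are made-up examples;
> > adjust them to the real topology so e.g. da32 always gets the same
> > number:
> >
> >         hint.scbus.2.at="mps0"
> >         hint.da.32.at="scbus2"
> >         hint.da.32.target="5"
> >         hint.da.32.unit="0"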
> >
> > //mxb
> >
> > On 20 jun 2013, at 12:30, mxb <mxb@alumni.chalmers.se> wrote:
> >
> >>
> >> Well,
> >>
> >> I'm back to square one.
> >>
> >> After some uptime and successful imports/exports from one node to the
> >> other, I eventually got 'metadata corruption'.
> >> I had no problem with import/export while e.g. rebooting the master
> >> node (nfs1), but not THIS time.
> >> Metadata got corrupted while rebooting the master node??
> >>
> >> Any ideas?
> >>
> >> [root@nfs1 ~]# zpool import
> >>  pool: jbod
> >>    id: 7663925948774378610
> >> state: FAULTED
> >> status: The pool metadata is corrupted.
> >> action: The pool cannot be imported due to damaged devices or data.
> >>  see: http://illumos.org/msg/ZFS-8000-72
> >> config:
> >>
> >>      jbod        FAULTED  corrupted data
> >>        raidz3-0  ONLINE
> >>          da3     ONLINE
> >>          da4     ONLINE
> >>          da5     ONLINE
> >>          da6     ONLINE
> >>          da7     ONLINE
> >>          da8     ONLINE
> >>          da9     ONLINE
> >>          da10    ONLINE
> >>          da11    ONLINE
> >>          da12    ONLINE
> >>      cache
> >>        da13s2
> >>        da14s2
> >>      logs
> >>        mirror-1  ONLINE
> >>          da13s1  ONLINE
> >>          da14s1  ONLINE
> >> [root@nfs1 ~]# zpool import jbod
> >> cannot import 'jbod': I/O error
> >>      Destroy and re-create the pool from
> >>      a backup source.
> >> [root@nfs1 ~]#
> >>
> >> On 11 jun 2013, at 10:46, mxb <mxb@alumni.chalmers.se> wrote:
> >>
> >>>
> >>> Thanks to everyone who replied.
> >>> Removing the local L2ARC cache disks (da1,da2) indeed turned out to be
> >>> the cure for my problem.
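> >>> (i.e., something along the lines of: zpool remove jbod da1 da2)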
> >>>
> >>> Next is to test add/remove after import/export, as Jeremy suggested.
> >>>
> >>> //mxb
> >>>
> >>> On 7 jun 2013, at 01:34, Jeremy Chadwick <jdc@koitsu.org> wrote:
> >>>
> >>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
> >>>>>
> >>>>> Sure, the script is not perfect yet and does not handle a lot of
> >>>>> cases, but moving the spotlight from zpool import/export to the
> >>>>> script itself is not that clever, as the script works most of the
> >>>>> time.
> >>>>>
> >>>>> The question is WHY ZFS sometimes corrupts metadata when it should
> >>>>> not. I've seen a stale zpool when manually importing/exporting the
> >>>>> pool.
> >>>>>
> >>>>>
> >>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick <jdc@koitsu.org> wrote:
> >>>>>
> >>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
> >>>>>>>
> >>>>>>> When the MASTER goes down, CARP on the second node goes MASTER
> >>>>>>> (devd.conf, and the script it triggers):
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/devd.conf
> >>>>>>>
> >>>>>>>
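> >>>>>>> # carp0 came up: this node is becoming MASTER, take over the pool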
> >>>>>>> notify 30 {
> >>>>>>> match "system"          "IFNET";
> >>>>>>> match "subsystem"       "carp0";
> >>>>>>> match "type"            "LINK_UP";
> >>>>>>> action "/etc/zfs_switch.sh active";
> >>>>>>> };
> >>>>>>>
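> >>>>>>> # carp0 went down: this node is now BACKUP, give up the pool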
> >>>>>>> notify 30 {
> >>>>>>> match "system"          "IFNET";
> >>>>>>> match "subsystem"       "carp0";
> >>>>>>> match "type"            "LINK_DOWN";
> >>>>>>> action "/etc/zfs_switch.sh backup";
> >>>>>>> };
> >>>>>>>
> >>>>>>> root@nfs2:/root # cat /etc/zfs_switch.sh
> >>>>>>> #!/bin/sh
> >>>>>>>
> >>>>>>> DATE=`date +%Y%m%d`
> >>>>>>> HOSTNAME=`hostname`
> >>>>>>>
> >>>>>>> ZFS_POOL="jbod"
> >>>>>>>
> >>>>>>>
> >>>>>>> case $1 in
> >>>>>>>         active)
> >>>>>>>                 echo "Switching to ACTIVE and importing ZFS" |
> mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
> >>>>>>>                 sleep 10
> >>>>>>>                 /sbin/zpool import -f jbod
> >>>>>>>                 /etc/rc.d/mountd restart
> >>>>>>>                 /etc/rc.d/nfsd restart
> >>>>>>>                 ;;
> >>>>>>>         backup)
> >>>>>>>                 echo "Switching to BACKUP and exporting ZFS" |
> mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
> >>>>>>>                 /sbin/zpool export jbod
> >>>>>>>                 /etc/rc.d/mountd restart
> >>>>>>>                 /etc/rc.d/nfsd restart
> >>>>>>>                 ;;
> >>>>>>>         *)
> >>>>>>>                 exit 0
> >>>>>>>                 ;;
> >>>>>>> esac
> >>>>>>>
> >>>>>>> This works most of the time, but sometimes I'm forced to re-create
> >>>>>>> the pool. Those machines are supposed to go into production.
> >>>>>>> Losing the pool (and the data inside it) stops me from deploying
> >>>>>>> this setup.
> >>>>>>
> >>>>>> This script looks highly error-prone.  Hasty hasty...  :-)
> >>>>>>
> >>>>>> This script assumes that the "zpool" commands (import and export)
> >>>>>> always work/succeed; there is no exit code ($?) checking being used.
> >>>>>>
> >>>>>> Since this is run from within devd(8): where does stdout/stderr go
> >>>>>> to when running a program/script under devd(8)?  Does it effectively
> >>>>>> go to the bit bucket (/dev/null)?  If so, you'd never know if the
> >>>>>> import or export actually succeeded or not (the export sounds more
> >>>>>> likely to be the problem point).
> >>>>>>
> >>>>>> I imagine there would be some situations where the export would fail
> >>>>>> (some files on filesystems under pool "jbod" still in use), yet
> >>>>>> CARP is already blindly assuming everything will be fantastic.
> >>>>>> Surprise.
> >>>>>>
> >>>>>> I also do not know if devd.conf(5) "action" commands spawn a
> >>>>>> sub-shell (/bin/sh) or not.  If they don't, you won't be able to use
> >>>>>> things like
> >>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.
> >>>>>> You would then need to implement the equivalent of logging within
> >>>>>> your zfs_switch.sh script.
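> >>>>>>
> >>>>>> For example, near the top of zfs_switch.sh (just a sketch; the log
> >>>>>> path is arbitrary):
> >>>>>>
> >>>>>>         # send everything this script prints to a log file
> >>>>>>         exec >> /var/log/failover.log 2>&1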
> >>>>>>
> >>>>>> You may want to consider the -f flag to zpool import/export
> >>>>>> (particularly export).  However there are risks involved -- userland
> >>>>>> applications which have an fd/fh open on a file which is stored on a
> >>>>>> filesystem that has now completely disappeared can sometimes crash
> >>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on
> >>>>>> how they're designed.
> >>>>>>
> >>>>>> Basically what I'm trying to say is that devd(8) being used as a
> >>>>>> form of HA (high availability) and load balancing is not always
> >>>>>> possible.  Real/true HA (especially with SANs) is often done very
> >>>>>> differently (now you know why it's often proprietary.  :-) )
> >>>>
> >>>> Add error checking to your script.  That's my first and foremost
> >>>> recommendation.  It's not hard to do, really.  :-)
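> >>>>
> >>>> For example (an untested sketch, reusing the names from your script):
> >>>>
> >>>>         if ! /sbin/zpool export jbod; then
> >>>>                 echo "zpool export of jbod FAILED" | \
> >>>>                         mail -s "$HOSTNAME: export FAILED" root
> >>>>                 exit 1
> >>>>         fi
> >>>>
> >>>> and likewise for the import.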
> >>>>
> >>>> After you do that and still experience the issue (e.g. you see no
> >>>> actual errors/issues during the export/import phases), I recommend
> >>>> removing the "cache" devices which are "independent" on each system
> >>>> from the pool entirely.  Quoting you (for readers, since I snipped it
> >>>> from my previous reply):
> >>>>
> >>>>>>> Note that the ZIL (mirrored) resides on the external enclosure.
> >>>>>>> Only the L2ARC is both local and external - da1,da2, da13s2, da14s2
> >>>>
> >>>> I interpret this to mean the primary and backup nodes (physical
> >>>> systems) have actual disks which are not part of the "external
> >>>> enclosure".  If that's the case -- those disks are always going to
> >>>> vary in their contents and metadata.  Those are never going to be 100%
> >>>> identical all the time (is this not obvious?).  I'm surprised your
> >>>> stuff has worked at all using that model, honestly.
> >>>>
> >>>> ZFS is going to bitch/cry if it cannot verify the integrity of
> >>>> certain things, all the way down to the L2ARC.  That's my
> >>>> understanding of it at least, meaning there must always be "some" kind
> >>>> of metadata that has to be kept/maintained there.
> >>>>
> >>>> Alternately you could try doing this:
> >>>>
> >>>> zpool remove jbod daX daY ...
> >>>> zpool export jbod
> >>>>
> >>>> Then on the other system:
> >>>>
> >>>> zpool import jbod
> >>>> zpool add jbod cache daX daY ...
> >>>>
> >>>> Where daX and daY are the disks which are independent to each system
> >>>> (not on the "external enclosure").
> >>>>
> >>>> Finally, it would also be useful/worthwhile if you would provide
> >>>> "dmesg" from both systems and for you to explain the physical wiring
> >>>> along with what device (e.g. daX) correlates with what exact thing on
> >>>> each system.  (We right now have no knowledge of that, and your terse
> >>>> explanations imply we do -- we need to know more)
> >>>>
> >>>> --
> >>>> | Jeremy Chadwick                                   jdc@koitsu.org |
> >>>> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> >>>> | Making life hard for others since 1977.             PGP 4BD6C0CB |
> >>>>
> >>>
> >>
> >
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>



-- 
Marcelo Araujo
araujo@FreeBSD.org


