Date:      Mon, 27 Oct 2014 21:47:44 -0400
From:      Zaphod Beeblebrox <zbeeble@gmail.com>
To:        Robert Banz <rob@nofocus.org>
Cc:        freebsd-fs <freebsd-fs@freebsd.org>, Steven Hartland <smh@freebsd.org>
Subject:   Re: ZFS errors on the array but not the disk.
Message-ID:  <CACpH0McVeuUGoC45rsK-cwrG0TFd_s=Cj66G7_TX=8a8jNBWQQ@mail.gmail.com>
In-Reply-To: <CA+-fWwBgh-mzKFRVhtddZVZz9j8T2fh-M-gpgR+4XmchbW8W1A@mail.gmail.com>
References:  <CACpH0MeAvs6rzWUo3uF8uTygPk6qnZE8W=3-zsiTAKdvm4N01w@mail.gmail.com> <CAOtMX2g5GYZqYgWNmD_K_TSdTc8oxvvpe4463ni=sEX_b7_Erw@mail.gmail.com> <CACpH0MfL1J8fbP+Mkdop8C=iTJmvscDv16mVynSqXC0uspdLfw@mail.gmail.com> <544B12B8.8060302@freebsd.org> <CACpH0Md8f1dAqUvgAMnKN+iZbWmL2ANXuwj7xDqkiGcHaiS9jg@mail.gmail.com> <CACpH0MdQDi85pvks+E1A2OYRKYXi6CMiXcsL4U1Ud5r_Zw4d8g@mail.gmail.com> <CA+-fWwBgh-mzKFRVhtddZVZz9j8T2fh-M-gpgR+4XmchbW8W1A@mail.gmail.com>

Well... why wouldn't this trigger an error with (say) the checksums on the
devices themselves?  And without throwing an error, why is the vdev
re-resilvering?  I don't have spare hardware to throw at it.  It's otherwise a
sane system: it can "make -j32 buildworld" without choking, and it can
download several hundred torrents at a time without corrupting them.  That
hardly seems like suspect hardware.
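One way to double-check that the member disks really are clean is sketched
below; it assumes the pool name vr2 and a member device like ada14 from the
status output quoted further down, and that smartctl from the smartmontools
port is installed:

    # per-disk and per-vdev READ/WRITE/CKSUM counters, plus the damaged-object list
    zpool status -v vr2
    # drive-reported reallocated/pending-sector and CRC counters
    smartctl -a /dev/ada14 | egrep 'Reallocated|Pending|CRC'

If every leaf and every SMART counter stays at zero while only the vdev
counter climbs, that points away from a single bad disk.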

On Mon, Oct 27, 2014 at 7:13 PM, Robert Banz <rob@nofocus.org> wrote:

> Have you tried different hardware? This screams something's up anywhere in
> the stack -- DRAM, cabling, controller...
>
> On Mon, Oct 27, 2014 at 11:34 AM, Zaphod Beeblebrox <zbeeble@gmail.com>
> wrote:
>
>> Ok... This is just frustrating.  I've trusted ZFS through many versions...
>> and pretty much... it's delivered.  There are five symptoms here:
>>
>> 1. After each reboot, resilver starts again... even if, after the resilver,
>> I complete a full scrub.
>>
>> 2. Seemingly random objects (files, zvols or snapshot items) get marked as
>> having errors.  When I say random, to be clear: different items each time.
>>
>> 3. None of the drives are showing errors in zpool status, nor are they
>> chucking errors into dmesg.
>>
>> 4. Errors are being logged against the vdev (only one of the two vdevs)
>> and the array (half as many as the vdev).
>>
>> 5. The activity light for the recently replaced disk does not "flash"
>> "with" the others in its vdev during either resilver or scrub.  This last
>> bit might need some explanation.  I realize that raidz-1 stripes do not
>> always use all the disks, but "generally" the activity lights of the drives
>> in a vdev go "together"... In this case, the light of the recently replaced
>> drive is off much of the time...
>>
>> Is there anything I can/should do?  I pulled the new disk, moved its
>> partitions around (it's larger than the array disks because you can't buy
>> 1.5T drives anymore) and then re-added it... so I've tried that.
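For reference, the repartition-and-replace procedure described above would
look roughly like the sketch below.  The device name ada14 is borrowed from
the pool listing further down, but the partition size and the vr2-d5 label
are placeholders, not the exact values used here:

    gpart destroy -F ada14                                     # wipe the old partition table
    gpart create -s gpt ada14                                  # create a fresh GPT scheme
    gpart add -t freebsd-zfs -a 4k -l vr2-d5 -s 1500G ada14    # size to match the 1.5T members, 4k-aligned
    zpool replace vr2 <old-disk-or-guid> gpt/vr2-d5            # resilver onto the new partition
    zpool status vr2                                           # watch resilver progress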
>>
>>
>> On Fri, Oct 24, 2014 at 11:47 PM, Zaphod Beeblebrox <zbeeble@gmail.com>
>> wrote:
>>
>> > Thanks for the heads up.  I'm following releng/10.1 and 271683 seems to be
>> > part of that, but a good catch/guess.
>> >
>> >
>> > On Fri, Oct 24, 2014 at 11:02 PM, Steven Hartland <smh@freebsd.org>
>> wrote:
>> >
>> >> There was an issue which would cause resilver restarts, fixed by r265253
>> >> <https://svnweb.freebsd.org/base?view=revision&revision=265253>, which was
>> >> MFC'ed to stable/10 in r271683
>> >> <https://svnweb.freebsd.org/base?view=revision&revision=271683>, so you'll
>> >> want to make sure you're later than that.
>> >>
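A quick way to check where a system sits relative to that MFC, assuming the
source tree lives in /usr/src and was checked out with svn/svnlite:

    uname -a                               # kernels built from svn include the rNNNNNN revision string
    svnlite info /usr/src | grep Revision  # revision of the checked-out source tree

Anything at or past r271683 on stable/10 (or a release built from it) should
carry the fix.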
>> >>
>> >> On 24/10/2014 19:42, Zaphod Beeblebrox wrote:
>> >>
>> >>> I manually replaced a disk... and the array was scrubbed recently.
>> >>> Interestingly, I seem to be in the "endless loop" of resilvering problem.
>> >>> Not much I can find on it, but resilvering will complete and I can then
>> >>> run another scrub.  It will complete, too.  Then rebooting causes another
>> >>> resilvering.
>> >>>
>> >>> Another odd data point: it seems as if the things that show up as
>> >>> "errors" change from resilvering to resilvering.
>> >>>
>> >>> One bug, it would seem, is that once ZFS has detected an error... another
>> >>> scrub can reset it, but no attempt is made to read through the error if
>> >>> you access the object directly.
>> >>>
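Concretely, the reset-and-rescrub cycle mentioned above amounts to the
following, with the pool name vr2 as in the status output below; whether the
"errors:" list shrinks depends on the data actually verifying on the next
scrub:

    zpool clear vr2        # zero the READ/WRITE/CKSUM counters
    zpool scrub vr2        # re-read and re-verify every allocated block
    zpool status -v vr2    # see whether the same objects are flagged again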
>> >>> On Fri, Oct 24, 2014 at 11:33 AM, Alan Somers <asomers@freebsd.org>
>> >>> wrote:
>> >>>
>> >>>> On Thu, Oct 23, 2014 at 11:37 PM, Zaphod Beeblebrox <zbeeble@gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>>> What does it mean when checksum errors appear on the array (and the vdev)
>> >>>>> but not on any of the disks?  See the paste below.  One would think that
>> >>>>> there isn't some ephemeral data stored somewhere that is not one of the
>> >>>>> disks, yet "cksum" errors show only on the vdev and the array lines.
>> >>>>>
>> >>>>> Help?
>> >>>>>
>> >>>>> [2:17:316]root@virtual:/vr2/torrent/in> zpool status
>> >>>>>    pool: vr2
>> >>>>>   state: ONLINE
>> >>>>> status: One or more devices is currently being resilvered.  The pool will
>> >>>>>          continue to function, possibly in a degraded state.
>> >>>>> action: Wait for the resilver to complete.
>> >>>>>    scan: resilver in progress since Thu Oct 23 23:11:29 2014
>> >>>>>          1.53T scanned out of 22.6T at 62.4M/s, 98h23m to go
>> >>>>>          119G resilvered, 6.79% done
>> >>>>> config:
>> >>>>>
>> >>>>>          NAME               STATE     READ WRITE CKSUM
>> >>>>>          vr2                ONLINE       0     0    36
>> >>>>>            raidz1-0         ONLINE       0     0    72
>> >>>>>              label/vr2-d0   ONLINE       0     0     0
>> >>>>>              label/vr2-d1   ONLINE       0     0     0
>> >>>>>              gpt/vr2-d2c    ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)
>> >>>>>              gpt/vr2-d3b    ONLINE       0     0     0  block size: 512B configured, 4096B native
>> >>>>>              gpt/vr2-d4a    ONLINE       0     0     0  block size: 512B configured, 4096B native
>> >>>>>              ada14          ONLINE       0     0     0
>> >>>>>              label/vr2-d6   ONLINE       0     0     0
>> >>>>>              label/vr2-d7c  ONLINE       0     0     0
>> >>>>>              label/vr2-d8   ONLINE       0     0     0
>> >>>>>            raidz1-1         ONLINE       0     0     0
>> >>>>>              gpt/vr2-e0     ONLINE       0     0     0  block size: 512B configured, 4096B native
>> >>>>>              gpt/vr2-e1     ONLINE       0     0     0  block size: 512B configured, 4096B native
>> >>>>>              gpt/vr2-e2     ONLINE       0     0     0  block size: 512B configured, 4096B native
>> >>>>>              gpt/vr2-e3     ONLINE       0     0     0
>> >>>>>              gpt/vr2-e4     ONLINE       0     0     0  block size: 512B configured, 4096B native
>> >>>>>              gpt/vr2-e5     ONLINE       0     0     0  block size: 512B configured, 4096B native
>> >>>>>              gpt/vr2-e6     ONLINE       0     0     0  block size: 512B configured, 4096B native
>> >>>>>              gpt/vr2-e7     ONLINE       0     0     0  block size: 512B configured, 4096B native
>> >>>>>
>> >>>>> errors: 43 data errors, use '-v' for a list
>> >>>>>
>> >>>> The checksum errors will appear on the raidz vdev instead of a leaf if
>> >>>> vdev_raidz.c can't determine which leaf vdev was responsible.  This
>> >>>> could happen if two or more leaf vdevs return bad data for the same
>> >>>> block, which would also lead to unrecoverable data errors.  I see that
>> >>>> you have some unrecoverable data errors, so maybe that's what happened
>> >>>> to you.
>> >>>>
>> >>>> Subtle design bugs in ZFS can also lead to vdev_raidz.c being unable
>> >>>> to determine which child was responsible for a checksum error.
>> >>>> However, I've only seen that happen when a raidz vdev has a mirror
>> >>>> child.  That can only happen if the child is a spare or replacing
>> >>>> vdev.  Did you activate any spares, or did you manually replace a
>> >>>> vdev?
>> >>>>
>> >>>> -Alan
>> >>>>
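For what it's worth, a manual replacement shows up in zpool status as a
"replacing" node with the old and new devices underneath it, which is the
mirror-like child Alan describes; an illustrative (not actual) excerpt:

              replacing-5        ONLINE       0     0     0
                label/old-disk   ONLINE       0     0     0
                gpt/new-disk     ONLINE       0     0     0  (resilvering)

Once the resilver finishes, the replacing node and the old device disappear
and only the new leaf remains.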


