From owner-freebsd-fs@FreeBSD.ORG Mon Oct 27 18:34:12 2014
Date: Mon, 27 Oct 2014 14:34:09 -0400
Subject: Re: ZFS errors on the array but not the disk.
From: Zaphod Beeblebrox
To: Steven Hartland
Cc: freebsd-fs

Ok... this is just frustrating. I've trusted ZFS through many versions, and
it has pretty much always delivered. There are five symptoms here:

1. After each reboot, the resilver starts again, even if I complete a full
   scrub after the previous resilver finishes.
2. Seemingly random objects (files, zvols or snapshot items) get marked as
   having errors. By "random" I mean, to be clear, different items each time.
3. None of the drives show errors in zpool status, nor are they chucking
   errors into dmesg.
4. Errors are being logged against the vdev (only one of the two vdevs) and
   against the array (half as many as the vdev).
5. The activity light of the recently replaced disk does not flash in step
   with the others in its vdev during either resilver or scrub.

This last bit might need some explanation. I realize that raidz1 stripes do
not always use all the disks, but generally the activity lights of the drives
in a vdev go together. In this case, the light of the recently replaced drive
is off much of the time.

Is there anything I can/should do? I pulled the new disk, moved its
partitions around (it's larger than the array disks because you can't buy
1.5T drives anymore) and then re-added it... so I've tried that.

On Fri, Oct 24, 2014 at 11:47 PM, Zaphod Beeblebrox wrote:

> Thanks for the heads up. I'm following releng/10.1, and r271683 seems to
> be part of that, but a good catch/guess.
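Since the fix Steven mentions was MFC'ed to stable/10 as r271683, one way to
sanity-check a running system is to compare the svn revision embedded in the
`uname -v` string against that number. The sketch below assumes a helper name
(`kernel_has_fix`) and sample version strings that are made up for
illustration, not taken from this thread:

```shell
# Hypothetical helper: pull the first svn revision token (rNNNNNN) out of a
# `uname -v`-style string and check it against the MFC revision r271683.
kernel_has_fix() {
  rev=$(printf '%s\n' "$1" | grep -o 'r[0-9][0-9]*' | head -n 1 | tr -d 'r')
  # Fail if no revision token was found, or if it predates the fix.
  [ -n "$rev" ] && [ "$rev" -ge 271683 ]
}

# Illustrative usage with a made-up version string:
if kernel_has_fix "FreeBSD 10.1-PRERELEASE #0 r273340: Mon Oct 20 12:00:00 UTC 2014"; then
  echo "resilver-restart fix (r265253, MFC'ed as r271683) should be present"
else
  echo "kernel predates r271683; resilver restarts may be the known bug"
fi
```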
>> On Fri, Oct 24, 2014 at 11:02 PM, Steven Hartland wrote:
>>
>>> There was an issue which would cause resilver restarts, fixed by r265253
>>> <https://svnweb.freebsd.org/base?view=revision&revision=265253>, which
>>> was MFC'ed to stable/10 in r271683
>>> <https://svnweb.freebsd.org/base?view=revision&revision=271683>, so
>>> you'll want to make sure you're later than that.
>>>
>>> On 24/10/2014 19:42, Zaphod Beeblebrox wrote:
>>>
>>>> I manually replaced a disk... and the array was scrubbed recently.
>>>> Interestingly, I seem to be in the "endless loop" of resilvering
>>>> problem. Not much I can find on it, but the resilver will complete,
>>>> and I can then run another scrub. It will complete, too. Then
>>>> rebooting causes another resilver.
>>>>
>>>> Another odd data point: it seems as if the things that show up as
>>>> "errors" change from resilver to resilver.
>>>>
>>>> One bug, it would seem, is that once ZFS has detected an error,
>>>> another scrub can reset it, but no attempt is made to read through
>>>> the error if you access the object directly.
>>>>
>>>> On Fri, Oct 24, 2014 at 11:33 AM, Alan Somers wrote:
>>>>
>>>>> On Thu, Oct 23, 2014 at 11:37 PM, Zaphod Beeblebrox wrote:
>>>>>
>>>>>> What does it mean when checksum errors appear on the array (and the
>>>>>> vdev) but not on any of the disks? See the paste below. One would
>>>>>> think that there isn't some ephemeral data stored somewhere that is
>>>>>> not one of the disks, yet "cksum" errors show only on the vdev and
>>>>>> the array lines. Help?
>>>>>>
>>>>>> [2:17:316]root@virtual:/vr2/torrent/in> zpool status
>>>>>>   pool: vr2
>>>>>>  state: ONLINE
>>>>>> status: One or more devices is currently being resilvered.  The pool
>>>>>>         will continue to function, possibly in a degraded state.
>>>>>> action: Wait for the resilver to complete.
>>>>>>   scan: resilver in progress since Thu Oct 23 23:11:29 2014
>>>>>>         1.53T scanned out of 22.6T at 62.4M/s, 98h23m to go
>>>>>>         119G resilvered, 6.79% done
>>>>>> config:
>>>>>>
>>>>>>         NAME              STATE     READ WRITE CKSUM
>>>>>>         vr2               ONLINE       0     0    36
>>>>>>           raidz1-0        ONLINE       0     0    72
>>>>>>             label/vr2-d0  ONLINE       0     0     0
>>>>>>             label/vr2-d1  ONLINE       0     0     0
>>>>>>             gpt/vr2-d2c   ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)
>>>>>>             gpt/vr2-d3b   ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>             gpt/vr2-d4a   ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>             ada14         ONLINE       0     0     0
>>>>>>             label/vr2-d6  ONLINE       0     0     0
>>>>>>             label/vr2-d7c ONLINE       0     0     0
>>>>>>             label/vr2-d8  ONLINE       0     0     0
>>>>>>           raidz1-1        ONLINE       0     0     0
>>>>>>             gpt/vr2-e0    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>             gpt/vr2-e1    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>             gpt/vr2-e2    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>             gpt/vr2-e3    ONLINE       0     0     0
>>>>>>             gpt/vr2-e4    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>             gpt/vr2-e5    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>             gpt/vr2-e6    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>             gpt/vr2-e7    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>
>>>>>> errors: 43 data errors, use '-v' for a list
>>>>>
>>>>> The checksum errors will appear on the raidz vdev instead of a leaf
>>>>> if vdev_raidz.c can't determine which leaf vdev was responsible.
>>>>> This could happen if two or more leaf vdevs return bad data for the
>>>>> same block, which would also lead to unrecoverable data errors. I
>>>>> see that you have some unrecoverable data errors, so maybe that's
>>>>> what happened to you.
>>>>>
>>>>> Subtle design bugs in ZFS can also lead to vdev_raidz.c being unable
>>>>> to determine which child was responsible for a checksum error.
>>>>> However, I've only seen that happen when a raidz vdev has a mirror
>>>>> child. That can only happen if the child is a spare or replacing
>>>>> vdev. Did you activate any spares, or did you manually replace a
>>>>> vdev?
>>>>>
>>>>> -Alan
>>>
>>> _______________________________________________
>>> freebsd-fs@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
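Alan's point about attribution can be illustrated with a toy model. This is a
deliberately simplified sketch, not how vdev_raidz.c actually works: real ZFS
uses per-block fletcher/SHA checksums and Galois-field parity, while this
stand-in uses an integer sum as the "checksum" and XOR as the parity. The idea
it demonstrates is the same: reconstruction is tried with each child dropped
in turn, and only if exactly one such reconstruction passes the checksum can
the error be charged to a specific child; otherwise it lands on the vdev.

```shell
# Toy raidz1 stripe: three data columns plus an XOR parity column.
good0=5 good1=9 good2=3
parity=$(( $good0 ^ $good1 ^ $good2 ))     # XOR parity of the data columns
checksum=$(( $good0 + $good1 + $good2 ))   # stand-in for the block checksum

# attribute <c0> <c1> <c2>: report "ok" if the stripe verifies as read,
# "disk N" if rebuilding column N from parity fixes the checksum, or
# "vdev" if no single-column reconstruction works (2+ bad columns).
attribute() {
  a=$1 b=$2 c=$3
  if [ $(( $a + $b + $c )) -eq "$checksum" ]; then
    echo "ok"; return
  fi
  for drop in 0 1 2; do
    r0=$a r1=$b r2=$c
    case $drop in
      0) r0=$(( $parity ^ $r1 ^ $r2 )) ;;
      1) r1=$(( $parity ^ $r0 ^ $r2 )) ;;
      2) r2=$(( $parity ^ $r0 ^ $r1 )) ;;
    esac
    if [ $(( $r0 + $r1 + $r2 )) -eq "$checksum" ]; then
      echo "disk $drop"; return
    fi
  done
  echo "vdev"
}

attribute 5 9 3   # clean stripe       -> ok
attribute 7 9 3   # one bad column     -> disk 0 (error attributable to a leaf)
attribute 7 2 3   # two bad columns    -> vdev (charged to the raidz vdev)
```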