From owner-freebsd-hackers@FreeBSD.ORG Fri Oct 24 15:33:25 2014
Date: Fri, 24 Oct 2014 09:33:22 -0600
Sender: asomers@gmail.com
Subject: Re: ZFS errors on the array but not the disk.
From: Alan Somers
To: Zaphod Beeblebrox
Cc: freebsd-fs, FreeBSD Hackers
List-Id: Technical Discussions relating to FreeBSD

On Thu, Oct 23, 2014 at 11:37 PM, Zaphod Beeblebrox wrote:
> What does it mean when checksum errors appear on the array (and the vdev)
> but not on any of the disks? See the paste below. One would think that
> there isn't some ephemeral data stored somewhere that is not one of the
> disks, yet "cksum" errors show only on the vdev and the array lines. Help?
>
> [2:17:316]root@virtual:/vr2/torrent/in> zpool status
>   pool: vr2
>  state: ONLINE
> status: One or more devices is currently being resilvered. The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>   scan: resilver in progress since Thu Oct 23 23:11:29 2014
>         1.53T scanned out of 22.6T at 62.4M/s, 98h23m to go
>         119G resilvered, 6.79% done
> config:
>
>         NAME               STATE     READ WRITE CKSUM
>         vr2                ONLINE       0     0    36
>           raidz1-0         ONLINE       0     0    72
>             label/vr2-d0   ONLINE       0     0     0
>             label/vr2-d1   ONLINE       0     0     0
>             gpt/vr2-d2c    ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)
>             gpt/vr2-d3b    ONLINE       0     0     0  block size: 512B configured, 4096B native
>             gpt/vr2-d4a    ONLINE       0     0     0  block size: 512B configured, 4096B native
>             ada14          ONLINE       0     0     0
>             label/vr2-d6   ONLINE       0     0     0
>             label/vr2-d7c  ONLINE       0     0     0
>             label/vr2-d8   ONLINE       0     0     0
>           raidz1-1         ONLINE       0     0     0
>             gpt/vr2-e0     ONLINE       0     0     0  block size: 512B configured, 4096B native
>             gpt/vr2-e1     ONLINE       0     0     0  block size: 512B configured, 4096B native
>             gpt/vr2-e2     ONLINE       0     0     0  block size: 512B configured, 4096B native
>             gpt/vr2-e3     ONLINE       0     0     0
>             gpt/vr2-e4     ONLINE       0     0     0  block size: 512B configured, 4096B native
>             gpt/vr2-e5     ONLINE       0     0     0  block size: 512B configured, 4096B native
>             gpt/vr2-e6     ONLINE       0     0     0  block size: 512B configured, 4096B native
>             gpt/vr2-e7     ONLINE       0     0     0  block size: 512B configured, 4096B native
>
> errors: 43 data errors, use '-v' for a list

The checksum errors will appear on the raidz vdev instead of a leaf if
vdev_raidz.c can't determine which leaf vdev was responsible. This could
happen if two or more leaf vdevs return bad data for the same block, which
would also lead to unrecoverable data errors. I see that you have some
unrecoverable data errors, so maybe that's what happened to you.

Subtle design bugs in ZFS can also lead to vdev_raidz.c being unable to
determine which child was responsible for a checksum error. However, I've
only seen that happen when a raidz vdev has a mirror child. That can only
happen if the child is a spare or replacing vdev. Did you activate any
spares, or did you manually replace a vdev?

-Alan
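To illustrate the attribution logic described above, here is a toy sketch in C of how a raidz1-style layer can decide which child to blame. It is NOT the actual vdev_raidz.c code: the real implementation works on sector columns with Fletcher/SHA checksums stored in the block pointer, while this model uses a byte-sum "checksum" and XOR parity (raidz1's single parity is XOR). All names (`attribute_error`, `toy_cksum`, `rebuild`) are invented for this example. The key idea is the same: assume each child in turn is the bad one, reconstruct from parity, and only charge that child if exactly one such reconstruction passes the checksum; otherwise the error is charged to the raidz vdev itself.

```c
#include <stdint.h>
#include <string.h>

#define NDATA 3   /* data columns (children) in this toy raidz1 */
#define COLSZ 4   /* bytes per column */

/* Toy "checksum": byte sum of the reassembled data.  Real ZFS
 * verifies a Fletcher or SHA-256 checksum from the block pointer. */
static unsigned toy_cksum(const uint8_t d[NDATA][COLSZ]) {
    unsigned s = 0;
    for (int c = 0; c < NDATA; c++)
        for (int i = 0; i < COLSZ; i++)
            s += d[c][i];
    return s;
}

/* Rebuild column `bad` from the others plus XOR parity `p`. */
static void rebuild(uint8_t d[NDATA][COLSZ], const uint8_t p[COLSZ], int bad) {
    for (int i = 0; i < COLSZ; i++) {
        uint8_t v = p[i];
        for (int c = 0; c < NDATA; c++)
            if (c != bad)
                v ^= d[c][i];
        d[bad][i] = v;
    }
}

/* Blame each child in turn.  Return the child's index if exactly one
 * single-child reconstruction matches the expected checksum; return -1
 * (charge the raidz vdev) if none does -- e.g. when two or more
 * children returned bad data for the same block. */
static int attribute_error(const uint8_t d[NDATA][COLSZ],
                           const uint8_t p[COLSZ], unsigned expected) {
    int culprit = -1, matches = 0;
    for (int bad = 0; bad < NDATA; bad++) {
        uint8_t trial[NDATA][COLSZ];
        memcpy(trial, d, sizeof(trial));
        rebuild(trial, p, bad);
        if (toy_cksum(trial) == expected) {
            culprit = bad;
            matches++;
        }
    }
    return (matches == 1) ? culprit : -1;
}
```

With one corrupted child, `attribute_error` returns that child's index (a CKSUM count on the leaf); with two corrupted children, no single reconstruction can pass the checksum, so it returns -1, matching the situation in the paste: the CKSUM count lands on the raidz vdev and the block becomes an unrecoverable data error.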