From owner-freebsd-fs@FreeBSD.ORG Mon Jun 22 01:10:26 2015
From: Willem Jan Withagen <wjw@digiware.nl>
Date: Mon, 22 Jun 2015 00:43:41 +0200
To: Tom Curry
CC: freebsd-fs@freebsd.org
Subject: Re: This diskfailure should not panic a system, but just disconnect disk from ZFS
Message-ID: <55873E1D.9010401@digiware.nl>
References: <5585767B.4000206@digiware.nl> <558590BD.40603@isletech.net> <5586C396.9010100@digiware.nl>

On 21/06/2015 21:50, Tom Curry wrote:
> Was there by chance a lot of disk activity going on when this occurred?

Define 'a lot'?? But very likely, since the system is also a backup
location for several external services which back up through rsync,
and they can generate quite some traffic. Next to the fact that it
also serves an NVR with a ZVOL through iSCSI...

--WjW

> On Sun, Jun 21, 2015 at 10:00 AM, Willem Jan Withagen wrote:
>
>     On 20/06/2015 18:11, Daryl Richards wrote:
>     > Check the failmode setting on your pool. From man zpool:
>     >
>     >   failmode=wait | continue | panic
>     >
>     >     Controls the system behavior in the event of catastrophic
>     >     pool failure. This condition is typically a result of a
>     >     loss of connectivity to the underlying storage device(s)
>     >     or a failure of all devices within the pool. The behavior
>     >     of such an event is determined as follows:
>     >
>     >     wait      Blocks all I/O access until the device
>     >               connectivity is recovered and the errors are
>     >               cleared. This is the default behavior.
>     >
>     >     continue  Returns EIO to any new write I/O requests but
>     >               allows reads to any of the remaining healthy
>     >               devices. Any write requests that have yet to be
>     >               committed to disk would be blocked.
>     >
>     >     panic     Prints out a message to the console and
>     >               generates a system crash dump.
>
>     'mmm
>
>     Did not know about this setting. Nice one, but alas my current
>     setting is:
>
>       zfsboot  failmode  wait  default
>       zfsraid  failmode  wait  default
>
>     So either the setting is not working, or something else is up?
>     Is waiting only meant to wait a limited time, and then panic anyway?
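For reference, the property itself can be inspected and changed with
zpool(8); a quick sketch, with my pool names from above and 'continue'
purely as an example value:

----
# show the current failmode of both pools (NAME PROPERTY VALUE SOURCE)
zpool get failmode zfsboot zfsraid

# return EIO on new writes instead of blocking (or panicking) the box
zpool set failmode=continue zfsraid
----

It is a normal pool property change, so it should take effect
immediately, without an export/import or a reboot.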
>     But then I still wonder why, even in the 'continue' case, the ZFS
>     system ends up in a state where the filesystem is not able to
>     continue its standard functioning (read and write) and disconnects
>     the disk???
>
>     All failmode settings result in a seriously handicapped system...
>     On a raidz2 system I would perhaps have expected this to occur
>     when the second disk goes into thin space??
>
>     The other question is: the man page talks about
>     'Controls the system behavior in the event of catastrophic pool failure'
>     And is a hung disk a 'catastrophic pool failure'?
>
>     Still very puzzled?
>
>     --WjW
>
>     > On 2015-06-20 10:19 AM, Willem Jan Withagen wrote:
>     >> Hi,
>     >>
>     >> Found my system rebooted this morning:
>     >>
>     >> Jun 20 05:28:33 zfs kernel: sonewconn: pcb 0xfffff8011b6da498: Listen
>     >> queue overflow: 8 already in queue awaiting acceptance (48 occurrences)
>     >> Jun 20 05:28:33 zfs kernel: panic: I/O to pool 'zfsraid' appears to be
>     >> hung on vdev guid 18180224580327100979 at '/dev/da0'.
>     >> Jun 20 05:28:33 zfs kernel: cpuid = 0
>     >> Jun 20 05:28:33 zfs kernel: Uptime: 8d9h7m9s
>     >> Jun 20 05:28:33 zfs kernel: Dumping 6445 out of 8174
>     >> MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
>     >>
>     >> Which leads me to believe that /dev/da0 went out on vacation,
>     >> leaving ZFS in trouble.... But the array is:
>     >> ----
>     >> NAME                SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
>     >> zfsraid            32.5T  13.3T  19.2T         -     7%    41%  1.00x  ONLINE  -
>     >>   raidz2           16.2T  6.67T  9.58T         -     8%    41%
>     >>     da0                -      -      -         -      -      -
>     >>     da1                -      -      -         -      -      -
>     >>     da2                -      -      -         -      -      -
>     >>     da3                -      -      -         -      -      -
>     >>     da4                -      -      -         -      -      -
>     >>     da5                -      -      -         -      -      -
>     >>   raidz2           16.2T  6.67T  9.58T         -     7%    41%
>     >>     da6                -      -      -         -      -      -
>     >>     da7                -      -      -         -      -      -
>     >>     ada4               -      -      -         -      -      -
>     >>     ada5               -      -      -         -      -      -
>     >>     ada6               -      -      -         -      -      -
>     >>     ada7               -      -      -         -      -      -
>     >>   mirror            504M  1.73M   502M         -    39%     0%
>     >>     gpt/log0           -      -      -         -      -      -
>     >>     gpt/log1           -      -      -         -      -      -
>     >> cache                  -      -      -         -      -      -
>     >>   gpt/raidcache0    109G  1.34G   107G         -     0%     1%
>     >>   gpt/raidcache1    109G   787M   108G         -     0%     0%
>     >> ----
>     >>
>     >> And thus I would have expected ZFS to disconnect /dev/da0,
>     >> switch to the DEGRADED state and continue, letting the operator
>     >> fix the broken disk.
>     >> Instead it chooses to panic, which is not a nice thing to do. :)
>     >>
>     >> Or do I have too high hopes of ZFS?
>     >>
>     >> Next question to answer is why this WD RED on:
>     >>
>     >> arcmsr0@pci0:7:14:0: class=0x010400 card=0x112017d3 chip=0x112017d3 rev=0x00 hdr=0x00
>     >>     vendor   = 'Areca Technology Corp.'
>     >>     device   = 'ARC-1120 8-Port PCI-X to SATA RAID Controller'
>     >>     class    = mass storage
>     >>     subclass = RAID
>     >>
>     >> got hung, and nothing for this shows in SMART....
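PS: the 'appears to be hung' panic does not seem to come from failmode
at all but from the ZFS deadman watchdog, which fires when a vdev I/O
stays outstanding for too long. A rough sketch of the knobs I would
look at on FreeBSD, assuming the vfs.zfs.deadman_* sysctl names, which
I have not verified on this exact release:

----
# current state of the watchdog and its timeout (in milliseconds)
sysctl vfs.zfs.deadman_enabled
sysctl vfs.zfs.deadman_synctime_ms

# to disable the hung-I/O panic, set the tunable in /boot/loader.conf
# (the pool would then presumably just hang in 'wait' mode instead):
# vfs.zfs.deadman_enabled=0
----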