Date: Thu, 7 May 2015 12:50:49 +0300
From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: freebsd-stable@freebsd.org
Subject: Re: zfs, cam sticking on failed disk

On Thu, May 07, 2015 at 09:41:43AM +0100, Steven Hartland wrote:

> On 07/05/2015 09:07, Slawa Olhovchenkov wrote:
> > I have a zpool of 12 vdevs (mirrors).
> > One disk in one vdev has gone out of service and stopped serving requests:
> >
> > dT: 1.036s  w: 1.000s
> >   L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
> >      0      0      0      0    0.0      0      0    0.0    0.0| ada0
> >      0      0      0      0    0.0      0      0    0.0    0.0| ada1
> >      1      0      0      0    0.0      0      0    0.0    0.0| ada2
> >      0      0      0      0    0.0      0      0    0.0    0.0| ada3
> >      0      0      0      0    0.0      0      0    0.0    0.0| da0
> >      0      0      0      0    0.0      0      0    0.0    0.0| da1
> >      0      0      0      0    0.0      0      0    0.0    0.0| da2
> >      0      0      0      0    0.0      0      0    0.0    0.0| da3
> >      0      0      0      0    0.0      0      0    0.0    0.0| da4
> >      0      0      0      0    0.0      0      0    0.0    0.0| da5
> >      0      0      0      0    0.0      0      0    0.0    0.0| da6
> >      0      0      0      0    0.0      0      0    0.0    0.0| da7
> >      0      0      0      0    0.0      0      0    0.0    0.0| da8
> >      0      0      0      0    0.0      0      0    0.0    0.0| da9
> >      0      0      0      0    0.0      0      0    0.0    0.0| da10
> >      0      0      0      0    0.0      0      0    0.0    0.0| da11
> >      0      0      0      0    0.0      0      0    0.0    0.0| da12
> >      0      0      0      0    0.0      0      0    0.0    0.0| da13
> >      0      0      0      0    0.0      0      0    0.0    0.0| da14
> >      0      0      0      0    0.0      0      0    0.0    0.0| da15
> >      0      0      0      0    0.0      0      0    0.0    0.0| da16
> >      0      0      0      0    0.0      0      0    0.0    0.0| da17
> >      0      0      0      0    0.0      0      0    0.0    0.0| da18
> >     24      0      0      0    0.0      0      0    0.0    0.0| da19
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >      0      0      0      0    0.0      0      0    0.0    0.0| da20
> >      0      0      0      0    0.0      0      0    0.0    0.0| da21
> >      0      0      0      0    0.0      0      0    0.0    0.0| da22
> >      0      0      0      0    0.0      0      0    0.0    0.0| da23
> >      0      0      0      0    0.0      0      0    0.0    0.0| da24
> >      0      0      0      0    0.0      0      0    0.0    0.0| da25
> >      0      0      0      0    0.0      0      0    0.0    0.0| da26
> >      0      0      0      0    0.0      0      0    0.0    0.0| da27
> >
> > As a result, ZFS operations on this pool have stopped too.
> > `zpool list -v` does not work.
> > `zpool detach tank da19` does not work.
> > Applications using this pool are stuck in the `zfs` wchan and cannot be killed.
> >
> > # camcontrol tags da19 -v
> > (pass19:isci0:0:3:0): dev_openings  7
> > (pass19:isci0:0:3:0): dev_active    25
> > (pass19:isci0:0:3:0): allocated     25
> > (pass19:isci0:0:3:0): queued        0
> > (pass19:isci0:0:3:0): held          0
> > (pass19:isci0:0:3:0): mintags       2
> > (pass19:isci0:0:3:0): maxtags       255
> >
> > How can I cancel these 24 requests?
> > Why don't these requests time out (it has been 3 hours already)?
> > How can I force-detach this disk? (I have already tried `camcontrol reset` and `camcontrol rescan`.)
> > Why doesn't ZFS (or GEOM) time out the requests and reroute them to da18?
> >
> If they are in mirrors, in theory you can just pull the disk, isci will 
> report to cam and cam will report to ZFS which should all recover.

Yes, it is mirrored with da18.
I am surprised that ZFS doesn't use da18. The whole zpool is completely stuck.
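
For reference, a stalled device like da19 in the gstat output quoted above can be picked out mechanically. This is only a rough sketch: the helper name `find_stalled` is hypothetical, it assumes the column layout shown above (L(q) first, ops/s second, device name last), and a single sample only hints at a stall. A device has to show the same picture over several samples before it is worth acting on.

```shell
#!/bin/sh
# find_stalled (hypothetical helper): read gstat-style rows on stdin and
# print "name queue_length" for every device that has requests queued
# but completed no operations during the sample.
find_stalled() {
    awk '
        # Data rows start with two integers (L(q) and ops/s); this skips
        # the "dT:" line, the column header and any marker lines.
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            if ($1 > 0 && $2 == 0)
                print $NF, $1
        }
    '
}
```

It would be fed from something like `gstat -b | find_stalled`, repeated a few times to confirm the queue is not draining.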

> With regards to not timing out this could be a default issue, but having 

I understand that there is no universally acceptable timeout for all
cases: a good disk, a good but saturated disk, tape, tape library,
failed disk, etc. In my case it is a failed disk. This model has already
failed before (another unit, with the same symptoms).

Maybe there is some trick for cancelling/aborting all requests in the
queue and removing the disk from the system?
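
To see where the outstanding requests are sitting, the `camcontrol tags da19 -v` output quoted above can be summarised with a small filter. This is a sketch under a stated assumption: that `dev_active` counts commands already handed to the controller and `queued` counts commands still waiting inside CAM. The helper names `tag_field` and `classify_tags` are hypothetical.

```shell
#!/bin/sh
# tag_field (hypothetical helper): print the value of one counter from
# `camcontrol tags <dev> -v` style output on stdin. Each line looks like
# "(pass19:isci0:0:3:0): dev_active    25".
tag_field() {
    awk -v key="$1" '$2 == key { print $3; exit }'
}

# classify_tags (hypothetical helper): summarise the counters on stdin.
# Assumption: dev_active > 0 with queued == 0 means the commands are held
# by the driver or device rather than waiting in the CAM queue.
classify_tags() {
    tags=$(cat)
    active=$(printf '%s\n' "$tags" | tag_field dev_active)
    queued=$(printf '%s\n' "$tags" | tag_field queued)
    active=${active:-0}
    queued=${queued:-0}
    if [ "$active" -gt 0 ] && [ "$queued" -eq 0 ]; then
        echo "active=$active queued=$queued held-below-cam-queue"
    else
        echo "active=$active queued=$queued"
    fi
}
```

With the numbers quoted above (`camcontrol tags da19 -v | classify_tags`) this reports 25 active and 0 queued, which under that assumption points at the driver or the disk itself.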

> a very quick look that's not obvious in the code as 
> isci_io_request_construct etc do indeed set a timeout when 
> CAM_TIME_INFINITY hasn't been requested.
> 
> The sysctl hw.isci.debug_level may be able to provide more information, 
> but be aware this can be spammy.

I am in this situation right now. Which commands would be useful to
run after setting hw.isci.debug_level?