From owner-freebsd-stable@FreeBSD.ORG  Tue Apr 24 18:04:38 2012
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 14D12106564A
	for <freebsd-stable@freebsd.org>; Tue, 24 Apr 2012 18:04:38 +0000 (UTC)
	(envelope-from dustinwenz@xtechllc.com)
Received: from internet02.xtechllc.com (internet02.tru-signal.com
	[65.127.24.21]) by mx1.freebsd.org (Postfix) with ESMTP id CC9A98FC12
	for <freebsd-stable@freebsd.org>; Tue, 24 Apr 2012 18:04:37 +0000 (UTC)
Received: from service02.office.ebureau.com (service02.office.ebureau.com
	[192.168.20.15])
	by internet02.xtechllc.com (Postfix) with ESMTP id D4BC4C434F5
	for <freebsd-stable@freebsd.org>; Tue, 24 Apr 2012 12:55:07 -0500 (CDT)
Received: from localhost (localhost [127.0.0.1])
	by service02.office.ebureau.com (Postfix) with ESMTP id 78AE998C3B36
	for <freebsd-stable@freebsd.org>; Tue, 24 Apr 2012 12:55:07 -0500 (CDT)
X-Virus-Scanned: amavisd-new at ebureau.com
Received: from service02.office.ebureau.com ([127.0.0.1])
	by localhost (service02.office.iscompanies.com [127.0.0.1])
	(amavisd-new, port 10024)
	with ESMTP id 4OrtGBAQ3XZj for <freebsd-stable@freebsd.org>;
	Tue, 24 Apr 2012 12:55:06 -0500 (CDT)
Received: from square.office.iscompanies.com (square.office.iscompanies.com
	[10.10.20.22])
	by service02.office.ebureau.com (Postfix) with ESMTPSA id B29E298C3B28
	for <freebsd-stable@freebsd.org>; Tue, 24 Apr 2012 12:55:06 -0500 (CDT)
From: Dustin Wenz <dustinwenz@xtechllc.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Date: Tue, 24 Apr 2012 12:55:06 -0500
Message-Id: <17DD4C39-6905-4A5B-AE86-87F149CBD5BC@xtechllc.com>
To: freebsd-stable@freebsd.org
Mime-Version: 1.0 (Apple Message framework v1257)
X-Mailer: Apple Mail (2.1257)
Subject: Can MPS discard a misbehaving disk?
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Apr 2012 18:04:38 -0000

I am having trouble with MPS becoming unresponsive in certain disk
failure conditions. So far, I've experienced this with 3TB Hitachi disks
(0S03208) and 3TB Seagate Barracuda disks (ST3000DM001, firmware CC9D)
while using the MPS driver with an LSI SAS2116 controller on FreeBSD
8.2-STABLE.

In these particular instances, the disks are part of a zpool of mirrors.
When a disk fails, I generally see a message like "kernel:
(da5:mps0:0:5:0): SCSI command timeout on device handle 0x0017 SMID
148", followed by an indefinite number of "mps0: (0:5:0) terminated ioc
804b scsi 0 state c xfer 65536" messages.
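
For anyone who wants to tally these messages per device, here is a rough
sketch in Python. The two patterns are modelled only on the messages
quoted above, so treat them as an assumption: other versions of the
driver may word them differently. The function name summarize is just
for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the two messages quoted above (assumption: other
# driver versions may format these lines differently).
TIMEOUT_RE = re.compile(
    r"\((?P<dev>\w+):(?P<ctrl>mps\d+):(?P<bus>\d+):(?P<target>\d+):(?P<lun>\d+)\): "
    r"SCSI command timeout on device handle (?P<handle>0x[0-9a-fA-F]+) "
    r"SMID (?P<smid>\d+)"
)
TERMINATED_RE = re.compile(
    r"(?P<ctrl>mps\d+): \((?P<bus>\d+):(?P<target>\d+):(?P<lun>\d+)\) terminated ioc"
)


def _address(match):
    """Build a (controller, bus, target, lun) key from a regex match."""
    return (
        match["ctrl"],
        int(match["bus"]),
        int(match["target"]),
        int(match["lun"]),
    )


def summarize(lines):
    """Return (timed_out, terminated).

    timed_out maps each address that logged a command timeout to its
    device name; terminated counts "terminated ioc" lines per address.
    """
    timed_out = {}
    terminated = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timed_out[_address(m)] = m["dev"]
            continue
        m = TERMINATED_RE.search(line)
        if m:
            terminated[_address(m)] += 1
    return timed_out, terminated
```

Feeding it the lines of the system log should show which device timed
out and how many terminated commands followed for the same address.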

What I want to happen in this case is for the disk to simply go
offline in the zpool, so that the pool continues functioning.
However, the pool status still shows the disk as online. Any attempt to
disable the disk (such as with zpool offline, remove, or detach) will
hang and never complete, as will attempting a rescan with camcontrol. Of
course, any attempt to access data in the pool will hang as well.

Rebooting the system in this state is also bad; when the disk is first
discovered, it will begin a cycle of mps scsi errors during startup that
never seem to stop. The only way to recover, at least that I know of, is
to physically remove the disk from the chassis. Once I do that, the
system continues running perfectly.

Basically my question is this: How can I get MPS to ignore a failed disk
and never attempt to access it again? I don't care if it does so
automatically, or if I need to perform some administrative operation
to drop the device reference. I've seen a number of people on the list
having problems that appear similar to this, but those seem to have more
to do with firmware or compatibility issues. In my case, these disks are
definitely dead... they no longer work in any other systems, and often
make sad clicking noises.

I suppose this is also something that ZFS could do, independent of the
driver. If a device is unresponsive, shouldn't ZFS take it offline on
its own?

	- .Dustin