From: Dustin Wenz
Date: Tue, 24 Apr 2012 12:55:06 -0500
To: freebsd-stable@freebsd.org
Message-Id: <17DD4C39-6905-4A5B-AE86-87F149CBD5BC@xtechllc.com>
Subject: Can MPS discard a misbehaving disk?

I am having trouble with MPS becoming unresponsive in certain disk failure conditions. So far, I've experienced this with 3TB Hitachi disks (0S03208) and 3TB Seagate Barracuda disks (ST3000DM001, firmware CC9D) while using the MPS driver with an LSI SAS2116 controller on FreeBSD 8.2-STABLE.

In these particular instances, the disks are part of a zpool of mirrors. When a disk fails, I generally see a message like "kernel: (da5:mps0:0:5:0): SCSI command timeout on device handle 0x0017 SMID 148", followed by an indefinite number of "mps0: (0:5:0) terminated ioc 804b scsi 0 state c xfer 65536" messages.

What I would want to happen in this case is for the disk to simply go offline in the zpool, so that the pool can continue functioning. However, the pool status still shows the disk as online. Any attempt to disable the disk (such as with zpool offline, remove, or detach) will hang and never complete, as will attempting a rescan with camcontrol. Of course, any attempt to access data in the pool will hang as well.

Rebooting the system in this state is also bad; when the disk is first discovered, it will begin a cycle of mps scsi errors during startup that never seem to stop. The only way to recover, at least that I know of, is to physically remove the disk from the chassis. Once I do that, the system continues running perfectly.

Basically my question is this: How can I get MPS to ignore a failed disk and never attempt to access it again? I don't care if it does so automatically, or if I need to perform some administrative operation to drop the device reference.
I've seen a number of people on the list having problems that appear similar to this, but those seem to have more to do with firmware or compatibility issues. In my case, these disks are definitely dead... they no longer work in any other systems, and often make sad clicking noises.

I suppose this is also something that ZFS could do, independent of the driver. If a device is unresponsive, shouldn't it take it offline on its own?

- .Dustin