From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: freebsd-hackers@freebsd.org
Date: Fri, 12 Sep 2008 10:45:24 +0100
Subject: ZFS w/failing drives - any equivalent of Solaris FMA?

Hi,

Recently, a ZFS pool on my FreeBSD box started showing lots of errors on one drive in a mirrored pair. The pool consists of around 14 drives (7 mirrored pairs), hung off a couple of SuperMicro 8-port SATA controllers (one drive of each pair on each controller).

One of the drives started picking up a lot of errors (by the end it was returning errors for pretty much every read and write issued) and taking ages to complete the I/Os. However, ZFS kept trying to use the drive - e.g. when I attached another drive to the remaining 'good' drive in the mirrored pair, ZFS was still trying to read data off the failed drive (as well as the remaining good one) in order to complete its resilver to the newly attached drive.

Having posted on the OpenSolaris ZFS list, it appears that under Solaris there is an 'FMA engine' which communicates drive failures and the like to ZFS, advising ZFS when a drive should be marked as 'failed'. Is there anything similar to this on FreeBSD yet? I.e. does/can anything on the system tell ZFS "this drive is experiencing failures", rather than ZFS just seeing lots of timed-out I/O 'errors' (as appears to be the case)?

In the end, the failing drive was timing out literally every I/O. I did recover the situation by detaching it from the pool (which hung the machine - probably because ZFS had to update the metadata on all drives, including the failed one). A reboot brought the pool back, minus the 'failed' drive, so enough of the 'detach' must have completed. The newly attached drive completed its resilver in half an hour (as opposed to an estimated 755 hours and climbing with the failing drive still in the pool, limping along).

-Kp
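
For reference, a minimal sketch of the zpool operations described above, assuming a pool named 'tank' and hypothetical device names (da1 as the failing drive, da2 as its healthy mirror partner, da8 as the newly attached replacement - none of these names come from the original post):

  # Attach a new drive as an additional member of the mirror holding da2;
  # ZFS resilvers onto it from the existing mirror members.
  zpool attach tank da2 da8

  # Watch resilver progress and per-device read/write/checksum error counts.
  zpool status -v tank

  # Once the new drive has finished resilvering, drop the failing drive
  # from the mirror.
  zpool detach tank da1

Note that zpool itself only reacts to the errors it sees on each I/O; whether anything below it proactively marks the device as faulted (rather than letting every request time out first) is exactly the FMA-style question raised above.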