From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: freebsd-hackers@freebsd.org
Date: Fri, 12 Sep 2008 10:45:24 +0100
Subject: ZFS w/failing drives - any equivalent of Solaris FMA?

Hi,

Recently, a ZFS pool on my FreeBSD box started showing lots of errors on one drive in a mirrored pair. The pool consists of around 14 drives (7 mirrored pairs), hung off a couple of SuperMicro 8-port SATA controllers (one drive of each pair on each controller).

One of the drives started picking up a lot of errors (by the end it was returning errors for pretty much every read and write issued) and taking ages to complete the I/Os. However, ZFS kept trying to use the drive - e.g. when I attached another drive to the remaining 'good' drive in the mirrored pair, ZFS was still trying to read data off the failed drive (as well as the remaining good one) in order to complete its resilver to the newly attached drive.

Having posted on the OpenSolaris ZFS list, it appears that under Solaris there is an 'FMA engine' which communicates drive failures and the like to ZFS, advising ZFS when a drive should be marked as 'failed'. Is there anything similar to this on FreeBSD yet? I.e. does/can anything on the system tell ZFS "this drive is experiencing failures", rather than ZFS just seeing lots of timed-out I/O 'errors' (as appears to be the case)?

In the end, the failing drive was timing out literally every I/O. I did recover the situation by detaching it from the pool (which hung the machine - probably because ZFS had to update the metadata on all drives, including the failed one). A reboot brought the pool back, minus the 'failed' drive, so enough of the 'detach' must have completed. The newly attached drive completed its resilver in half an hour (as opposed to an estimated 755 hours and climbing with the failing drive still in the pool, limping along).

-Kp
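
For reference, a minimal sketch of the zpool operations described above, assuming a pool named 'tank' and hypothetical device names (da1 as the failing drive, da2 as its healthy mirror partner, da8 as the newly attached replacement - none of these names come from the original post):

  # Attach a new drive as an additional member of the mirror holding da2;
  # ZFS resilvers onto it from the existing mirror members.
  zpool attach tank da2 da8

  # Watch resilver progress and per-device read/write/checksum error counts.
  zpool status -v tank

  # Once the new drive has finished resilvering, drop the failing drive
  # from the mirror.
  zpool detach tank da1

Note that zpool itself only reacts to the errors it sees on each I/O; whether anything below it proactively marks the device as faulted (rather than letting every request time out first) is exactly the FMA-style question raised above.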