Date:      Fri, 12 Sep 2008 17:44:27 +0200 (CEST)
From:      Oliver Fromme <olli@lurza.secnetix.de>
To:        freebsd-hackers@FreeBSD.ORG, kpielorz_lst@tdx.co.uk
Subject:   Re: ZFS w/failing drives - any equivalent of Solaris FMA?
Message-ID:  <200809121544.m8CFiRHQ099725@lurza.secnetix.de>
In-Reply-To: <C984A6E7B1C6657CD8C4F79E@Slim64.dmpriest.net.uk>

Karl Pielorz wrote:
 > Recently, a ZFS pool on my FreeBSD box started showing lots of errors on 
 > one drive in a mirrored pair.
 > 
 > The pool consists of around 14 drives (as 7 mirrored pairs), hung off a 
 > couple of SuperMicro 8-port SATA controllers (1 drive of each pair is on 
 > each controller).
 > 
 > One of the drives started picking up a lot of errors (by the end it was 
 > returning errors for pretty much any read/write issued) - and taking ages 
 > to complete the I/Os.
 > 
 > However, ZFS kept trying to use the drive - e.g. as I attached another 
 > drive to the remaining 'good' drive in the mirrored pair, ZFS was still 
 > trying to read data off the failed drive (and remaining good one) in order 
 > to complete its resilver to the newly attached drive.
 > 
 > Having posted on the OpenSolaris ZFS list - it appears that, under 
 > Solaris, there's an 'FMA Engine' which communicates drive failures and the 
 > like to ZFS - advising ZFS when a drive should be marked as 'failed'.
 > 
 > Is there anything similar to this on FreeBSD yet? - i.e. Does/can anything 
 > on the system tell ZFS "This drive's experiencing failures" rather than ZFS 
 > just seeing lots of timed out I/O 'errors'? (as appears to be the case).
 > 
 > In the end, the failing drive was timing out literally every I/O - I did 
 > recover the situation by detaching it from the pool (which hung the machine 
 > - probably caused by ZFS having to update the meta-data on all drives, 
 > including the failed one). A reboot brought the pool back, minus the 
 > 'failed' drive, so enough of the 'detach' must have completed.
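
(For reference, the attach / resilver / detach sequence described above
maps onto zpool commands roughly like this; the pool and device names
are placeholders only, assuming a pool called "tank" and ata(4) disks:

    zpool status -v tank        # see which vdev is throwing errors
    zpool attach tank ad4 ad8   # mirror a fresh disk onto the good half
    zpool detach tank ad6       # once resilvered, drop the failing disk
)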

Did you try "atacontrol detach" to remove the disk from
the bus?  I haven't tried that with ZFS, but gmirror
automatically detects when a disk has gone away, and
doesn't try to do anything with it anymore.  It certainly
should not hang the machine.  After all, what's the
purpose of a RAID when you have to reboot upon drive
failure?  ;-)
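
Untested on my side with ZFS, but the rough sequence I'd try is to
offline the disk in ZFS first and then drop its ATA channel; the pool,
disk and channel names below (tank, ad6, ata3) are just examples:

    zpool offline tank ad6    # stop ZFS from issuing I/O to the bad disk
    atacontrol list           # find the channel the disk is attached to
    atacontrol detach ata3    # detach that channel from the ATA driver

Note that "atacontrol detach" takes the whole channel away, so a second
disk hanging off the same channel would disappear with it.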

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

"C++ is over-complicated nonsense. And Bjorn Shoestrap's book
a danger to public health. I tried reading it once, I was in
recovery for months."
        -- Cliff Sarginson


