Date:      Fri, 12 Sep 2008 15:34:30 +0100
From:      Karl Pielorz <kpielorz_lst@tdx.co.uk>
To:        Jeremy Chadwick <koitsu@FreeBSD.org>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: ZFS w/failing drives - any equivalent of Solaris FMA?
Message-ID:  <3BE629D093001F6BA2C6791C@Slim64.dmpriest.net.uk>
In-Reply-To: <20080912132102.GB56923@icarus.home.lan>
References:  <C984A6E7B1C6657CD8C4F79E@Slim64.dmpriest.net.uk> <20080912132102.GB56923@icarus.home.lan>



--On 12 September 2008 06:21 -0700 Jeremy Chadwick <koitsu@FreeBSD.org> 
wrote:

> As far as I know, there is no such "standard" mechanism in FreeBSD.  If
> the drive falls off the bus entirely (e.g. detached), I would hope ZFS
> would notice that.  I can imagine it (might) also depend on if the disk
> subsystem you're using is utilising CAM or not (e.g. disks should be daX
> not adX); Scott Long might know if something like this is implemented in
> CAM.  I'm fairly certain nothing like this is implemented in ata(4).

For ATA, at the moment, I don't think ZFS will notice even if a drive 
detaches. As on my system the other day, I think it'll just keep issuing I/O 
commands to the drive even after it has disappeared (though the failures 
might come back much quicker if the device has 'gone' to the point where 
FreeBSD immediately returns 'fail' for every request).

> Ideally, it would be the job of the controller and controller driver to
> announce to underlying I/O operations fail/success.  Do you agree?
>
> I hope this "FMA Engine" on Solaris only *tells* underlying pieces of
> I/O errors, rather than acting on them (e.g. automatically yanking the
> disk off the bus for you).  I'm in no way shunning Solaris, I'm simply
> saying such a mechanism could be as risky/deadly as it could be useful.

Yeah, I guess so - I think the way it's meant to happen (and this is only 
AFAIK) is that FMA 'detects' a failing drive by applying some configurable 
policy to it. That policy would also include notifying ZFS, so that ZFS 
could then decide to stop issuing I/O commands to that device.
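
To make the idea concrete, here is a rough sketch (in Python, purely for 
illustration) of that kind of policy: count I/O errors per device inside a 
time window and, once a configurable limit is hit, mark the device faulted 
and notify whoever is interested. This is NOT how FMA is actually 
implemented, and nothing like it exists in FreeBSD as far as this thread 
knows - the names ErrorRatePolicy, record_error and on_fault are all made 
up, and the callback just stands in for "tell ZFS to stop using the disk".

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Optional, Set


class ErrorRatePolicy:
    """Hypothetical policy: fault a device after too many I/O errors
    inside a sliding time window. Illustrative only, not FMA's design."""

    def __init__(self, max_errors: int, window_secs: float,
                 on_fault: Callable[[str], None]) -> None:
        self.max_errors = max_errors      # errors tolerated per window
        self.window_secs = window_secs    # length of the sliding window
        self.on_fault = on_fault          # stand-in for "notify ZFS"
        self.faulted: Set[str] = set()
        self._errors: Dict[str, Deque[float]] = {}

    def record_error(self, device: str, now: Optional[float] = None) -> bool:
        """Record one I/O error; return True if the device is faulted."""
        if device in self.faulted:
            return True  # already faulted, do not notify again
        now = time.monotonic() if now is None else now
        stamps = self._errors.setdefault(device, deque())
        stamps.append(now)
        # Forget errors that have aged out of the window.
        while stamps and now - stamps[0] > self.window_secs:
            stamps.popleft()
        if len(stamps) >= self.max_errors:
            self.faulted.add(device)
            self.on_fault(device)
            return True
        return False
```

With something like ErrorRatePolicy(max_errors=3, window_secs=60.0, 
on_fault=...), a drive that throws three errors inside a minute gets 
faulted once, while a drive that throws an odd error every few minutes is 
left alone. The real limits would obviously need tuning per system.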

None of this seems to be in place, at least for ATA under FreeBSD. When a 
drive goes bad, you can end up with hours' worth of I/O timeouts until 
someone intervenes.

I did enquire on the OpenSolaris list about setting limits for 'errors' in 
ZFS. The reply was that FMA (at least in Solaris) is responsible for this, 
and simply informs ZFS of the condition. We don't appear (again, at least 
for ATA) to have anything similar in FreeBSD yet :(

-Kp



