Date:      Tue, 23 Feb 2010 19:35:42 +0200
From:      Alexander Motin <mav@FreeBSD.org>
To:        Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Cc:        freebsd-stable@FreeBSD.org
Subject:   Re: ahcich timeouts, only with ahci, not with ataahci
Message-ID:  <4B8411EE.5030909@FreeBSD.org>
In-Reply-To: <4B840C54.3010304@omnilan.de>
References:  <1266934981.00222684.1266922202@10.7.7.3> <4B83EFD4.8050403@FreeBSD.org> <4B83FD62.2020407@omnilan.de> <4B83FFEF.7010509@FreeBSD.org> <4B840C54.3010304@omnilan.de>

Harald Schmalzbauer wrote:
> Alexander Motin wrote on 23.02.2010 17:18 (localtime):
> ...
>>> I guess if it's a HDD firmware issue with NCQ the hang shouldn't happen
>>> when NCQ is disabled.
>>
>> Just in case it is a real I/O timeout, run a full surface test with SMART.
> 
> Unfortunately I couldn't find new firmware from Samsung, although one
> drive shows version 1AG01113 while the other two have 1AG01118. But the
> timeout happened at different channels, so it's not one certain disk...
> 
> One question for my understanding: If the drive doesn't complete a
> command, regardless of whether it's due to a firmware bug, a disk
> surface error or whatever, is there no way for the driver to terminate
> the request and take the drive offline after some time? This would be
> very important behaviour for me. It doesn't make sense to build RAIDZ
> storage when a failing drive hangs the complete machine, even if the
> system partitions are on a completely different SSD.

That's what timeouts are for. When a timeout is detected, the driver
resets the device and reports an error to the upper layer. After
receiving the error, CAM reinitializes the device. If the device is
completely dead, reinitialization fails and the device is dropped
immediately. If the device is still alive, reinitialization succeeds
and CAM retries the command. If all retries fail, the error is reported
to the GEOM layer and then possibly to the file system. I have no idea
how RAIDZ behaves in such a case. Maybe after a few such errors it
should drop that device from the array.
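
To make the sequence above easier to follow, here is a rough toy model
of it in Python. It is only a sketch of the decision flow, not the
actual ahci(4) or CAM code: the function name, the Outcome values, the
two callbacks and the retry count of 4 are all made up for
illustration.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical result names, chosen only for this sketch.
    COMPLETED = "command completed"
    DEVICE_DROPPED = "reinit failed, device dropped"
    ERROR_TO_UPPER = "retries exhausted, error passed to upper layers"


def submit_with_recovery(execute, reinitialize, retries=4):
    """Model one command going through timeout recovery.

    execute()      -> True if the command completes, False on timeout.
    reinitialize() -> True if the device comes back after the reset.
    retries        -> assumed retry budget (not a real CAM default).
    """
    # One initial attempt plus the allowed number of retries.
    for _ in range(retries + 1):
        if execute():
            return Outcome.COMPLETED
        # Timeout: the device is reset and then reinitialized.
        if not reinitialize():
            # Completely dead device: dropped immediately.
            return Outcome.DEVICE_DROPPED
        # Device is still alive: loop around and retry the command.
    # Half-dead device: every attempt timed out, report the error upward.
    return Outcome.ERROR_TO_UPPER


if __name__ == "__main__":
    print(submit_with_recovery(lambda: True, lambda: True))
    print(submit_with_recovery(lambda: False, lambda: False))
    print(submit_with_recovery(lambda: False, lambda: True))
```

The third call is the half-dead case: the device survives every reset
but never completes the command, so each attempt costs a full timeout
before the error finally goes up.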

A timeout is the worst possible case for any device, as it takes too
much time and doesn't give any recovery information. A half-dead device
is the worst possible case of timeout. It is difficult to say which way
is better: drop the last drive from a degraded array and lose all
information, or retry forever. There is probably no right answer.

-- 
Alexander Motin


