Date:      Mon, 3 Jun 2013 15:06:53 +0100
From:      Mike Pumford <michaelp@bsquare.com>
To:        <freebsd-stable@FreeBSD.org>
Subject:   Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
Message-ID:  <51ACA2FD.2050609@bsquare.com>
In-Reply-To: <1369840577.1258.45.camel@revolution.hippie.lan>
References:  <201305291421.r4TELY8p042536@grabthar.secnetix.de> <1369840577.1258.45.camel@revolution.hippie.lan>

Ian Lepore wrote:
> On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote:
>> Steven Hartland wrote:
>>   > Have you checked your sata cables and psu outputs?
>>   >
>>   > Both of these could be the underlying cause of poor signalling.
>>
>> I can't easily check that because it is a cheap rented
>> server in a remote location.
>>
>> But I don't believe it is bad cabling or PSU anyway, or
>> otherwise the problem would occur intermittently all the
>> time if the load on the disks is sufficiently high.
>> But it only occurs at tags=3 and above.  At tags=2 it does
>> not occur at all, no matter how hard I hammer on the disks.
>>
>> At the moment I'm inclined to believe that it is either
>> a bug in the HDD firmware or in the controller.  The disks
>> aren't exactly new, they're 400 GB Samsung ones that are
>> several years old.  I think it's not uncommon to have bugs
>> in the NCQ implementation in such disks.
>>
>> The only thing that puzzles me is the fact that the problem
>> also disappears completely when I reduce the SATA rev from
>> II to I, even at tags=32.
>>
>
> It seems to me that you dismiss signaling problems too quickly.
> Consider the possibilities... A bad cable leads to intermittent errors
> at higher speeds.  When NCQ is disabled or limited the software handles
> these errors pretty much transparently.  When NCQ is not limited and
> there are many outstanding requests, suddenly the error handling in the
> software breaks down somehow and a minor recoverable problem becomes an
> in-your-face error.
>
It could also be a software bug in the way CAM handles the failure of
NCQ commands. When command queueing is used on a SCSI drive and a queued
command fails, only that command fails. On a SATA device, a queued command
failure fails ALL currently queued commands. I've not looked at the
code, but do the SATA CAM drivers do the right thing here?
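
To make the difference concrete, here is a toy model of the two failure behaviours. It is only a sketch of the semantics described above, not a representation of the actual CAM or ahci(4) code, and the function names are made up for illustration.

```python
def fail_scsi(outstanding, failed_tag):
    """SCSI-style tagged queueing: only the failing command is lost.

    Returns (commands_reported_failed, commands_still_in_flight).
    """
    if failed_tag not in outstanding:
        raise ValueError("failed_tag must be an outstanding tag")
    still_in_flight = [t for t in outstanding if t != failed_tag]
    return [failed_tag], still_in_flight


def fail_sata_ncq(outstanding, failed_tag):
    """SATA NCQ: one failing command aborts every outstanding queued command.

    Returns (commands_reported_failed, commands_the_host_must_reissue).
    The host side has to work out which tag really failed and requeue
    the rest; that recovery step is where a driver bug could hide.
    """
    if failed_tag not in outstanding:
        raise ValueError("failed_tag must be an outstanding tag")
    must_reissue = [t for t in outstanding if t != failed_tag]
    return [failed_tag], must_reissue


if __name__ == "__main__":
    tags = [0, 1, 2, 3]  # four commands in flight, tag 2 hits an error
    print("SCSI:", fail_scsi(tags, 2))
    print("SATA:", fail_sata_ncq(tags, 2))
```

In the SCSI case the second list is work that carries on untouched; in the SATA case it is work the host has to recover and reissue correctly.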

Fewer queued commands make it less likely that multiple commands will be
in progress when a failure occurs.  A lower link rate also makes the link
more tolerant of signalling problems.
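
As a rough back-of-the-envelope illustration of both points, the sketch below assumes the queue is kept full (worst case) and uses the nominal SATA line rates of 1.5 Gbit/s for rev I and 3.0 Gbit/s for rev II. The helper names are hypothetical.

```python
def max_collateral_aborts(queue_depth):
    """Worst-case number of innocent commands taken down by one NCQ error,
    assuming the queue is completely full when the error occurs."""
    return max(queue_depth - 1, 0)


def unit_interval_ps(line_rate_gbps):
    """Duration of one bit on the wire, in picoseconds."""
    return 1000.0 / line_rate_gbps


if __name__ == "__main__":
    for tags in (2, 3, 32):
        print(f"tags={tags}: up to {max_collateral_aborts(tags)} other commands aborted")
    for rev, rate in (("I", 1.5), ("II", 3.0)):
        print(f"SATA rev {rev}: {unit_interval_ps(rate):.1f} ps per bit")
```

Halving the line rate doubles the time available for each bit, which gives a marginal cable or connector more room before bits are misread.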

Mike



