Date:      Tue, 04 Jun 2013 09:05:42 +0300
From:      Alexander Motin <mav@FreeBSD.org>
To:        Jeremy Chadwick <jdc@koitsu.org>
Cc:        freebsd-stable@FreeBSD.org, Mike Pumford <michaelp@bsquare.com>
Subject:   Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
Message-ID:  <51AD83B6.3090204@FreeBSD.org>
In-Reply-To: <20130603202206.GA49602@icarus.home.lan>
References:  <201305291421.r4TELY8p042536@grabthar.secnetix.de> <1369840577.1258.45.camel@revolution.hippie.lan> <51ACA2FD.2050609@bsquare.com> <20130603202206.GA49602@icarus.home.lan>

On 03.06.2013 23:22, Jeremy Chadwick wrote:
> On Mon, Jun 03, 2013 at 03:06:53PM +0100, Mike Pumford wrote:
>> Ian Lepore wrote:
>>> On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote:
>>>> Steven Hartland wrote:
>>>>   > Have you checked your sata cables and psu outputs?
>>>>   >
>>>>   > Both of these could be the underlying cause of poor signalling.
>>>>
>>>> I can't easily check that because it is a cheap rented
>>>> server in a remote location.
>>>>
>>>> But I don't believe it is bad cabling or PSU anyway, or
>>>> otherwise the problem would occur intermittently all the
>>>> time if the load on the disks is sufficiently high.
>>>> But it only occurs at tags=3 and above.  At tags=2 it does
>>>> not occur at all, no matter how hard I hammer on the disks.
>>>>
>>>> At the moment I'm inclined to believe that it is either
>>>> a bug in the HDD firmware or in the controller.  The disks
>>>> aren't exactly new, they're 400 GB Samsung ones that are
>>>> several years old.  I think it's not uncommon to have bugs
>>>> in the NCQ implementation in such disks.
>>>>
>>>> The only thing that puzzles me is the fact that the problem
>>>> also disappears completely when I reduce the SATA rev from
>>>> II to I, even at tags=32.
>>>>
>>>
>>> It seems to me that you dismiss signaling problems too quickly.
>>> Consider the possibilities... A bad cable leads to intermittent errors
>>> at higher speeds.  When NCQ is disabled or limited the software handles
>>> these errors pretty much transparently.  When NCQ is not limited and
>>> there are many outstanding requests, suddenly the error handling in the
>>> software breaks down somehow and a minor recoverable problem becomes an
>>> in-your-face error.
>>>
>> It could also be a software bug in the way CAM handles the failure
>> of NCQ commands. When command queueing is used on a SCSI drive and a
>> queued command fails only that command fails. A queued command
>> failure on a SATA device fails ALL currently queued commands. I've
>> not looked at the code but do the SATA CAM drivers do the right
>> thing here?
>
> Quoting T13/2015-D ATA8-ACS2 WD spec:
>
> "If an error occurs while the device is processing an NCQ command, then
> the device shall return command aborted for all NCQ commands that are in
> the queue and shall return command aborted for any new commands, except
> a READ LOG EXT command requesting log address 10h, until the device
> completes a READ LOG EXT command requesting log address 10h (i.e.,
> reading the NCQ Command Error log) without error."
>
> While I can't easily provide an answer to your question, I can tell you
> that sys/dev/ahci/ahci.c does execute READ LOG EXT (command 0x2f) for
> certain scenarios (the code is in function ahci_issue_recovery()).

I am not aware of any flaws in the present CAM ATA error recovery logic.
Sending READ LOG EXT is indeed implemented at the ahci(4) driver level (same
as in siis(4) and mvs(4)), since it was complicated/impossible to do in
shared code: higher levels have no idea about the tag allocation done by
lower-level drivers.
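
To illustrate the recovery step discussed above, here is a minimal sketch in C. It is not the actual ahci(4) code: the function name ncq_classify(), the outcome enum, and the "held" bitmask are hypothetical names made up for this example. It assumes the usual layout of the NCQ Command Error log (log address 10h), where byte 0 carries the failed tag in bits 4:0 and the NQ flag in bit 7. It also assumes a simple policy for the NQ case (requeue everything), which is a choice made for the sketch only.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define NCQ_SLOTS 32

/* What to do with each command slot after an NCQ error (hypothetical). */
enum outcome { SLOT_IDLE, SLOT_FAILED, SLOT_REQUEUE };

/*
 * Classify the commands that were outstanding when the device aborted
 * its queue.
 *
 * log  - first bytes of the NCQ Command Error log page (log address 10h)
 * held - bitmask of tags that were in flight; only the low-level driver
 *        knows this mapping, which is why recovery lives there
 * out  - per-slot result
 *
 * Returns the failed tag, or -1 if the log does not identify one
 * (NQ bit set: the error came from a non-queued command).
 */
static int
ncq_classify(const uint8_t *log, uint32_t held, enum outcome out[NCQ_SLOTS])
{
    int i, tag;

    assert(log != NULL && out != NULL);

    /* Bit 7 (NQ) clear means bits 4:0 hold a valid NCQ tag. */
    tag = ((log[0] & 0x80) == 0) ? (log[0] & 0x1f) : -1;

    for (i = 0; i < NCQ_SLOTS; i++) {
        if ((held & (UINT32_C(1) << i)) == 0)
            out[i] = SLOT_IDLE;      /* nothing was queued in this slot */
        else if (i == tag)
            out[i] = SLOT_FAILED;    /* the one command that really failed */
        else
            out[i] = SLOT_REQUEUE;   /* innocent bystander: retry it */
    }
    return tag;
}
```

With tags 0-3 in flight and log[0] == 0x03, only slot 3 is marked as failed and slots 0-2 are marked for requeue. This is the distinction Mike asked about: the device aborts the whole queue, but only one command is reported upward as an error.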

> The one person who can answer this question is mav@, who is now CC'd.
>
>> Fewer commands queued makes it less likely that multiple commands
>> will be in progress when a failure occurs.  A lower link rate also
>> makes you more immune to signal failures.
>
> He isn't seeing SATA-level signal/link failure; the AHCI driver would
> complain about that, and those messages aren't there.  Unless, of
> course, those messages are only visible when verbose booting is enabled
> (I hope not).

Just a curious historical point: I had one old system with an NVIDIA MCP55
chipset where Linux had worked well before, but FreeBSD had problems with
SATA -- all disk transfers were really slow, but without any reported
errors, and at some point the system started to hang. That series of
chipsets had a long history of problems, so for some time I was looking
for a way to handle it in software. But after many experiments I
accidentally found out that disabling 6 small but very powerful fans
worked around the problem. I checked the PSU voltages, and they were fine.
Moving the fans to a separate PSU also helped. Finally I just replaced
the system's main PSU with a different one and the problems went away. My
best guess was that the capacitors in that PSU, due to old age, were unable
to filter the fans' electrical noise, which started to interfere with SATA
and later with other signals. Now the same PSU works perfectly fine in the
same case with a smaller Atom-based motherboard, without any issues.

I am not saying that the ahci(4) driver is perfect, but hardware issues are
always possible, even if the system worked fine before.

-- 
Alexander Motin


