Date:      Wed, 03 Mar 2010 10:18:05 +0200
From:      Alexander Motin <mav@FreeBSD.org>
To:        Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Cc:        freebsd-stable@FreeBSD.org
Subject:   Re: ahcich timeouts, only with ahci, not with ataahci
Message-ID:  <4B8E1B3D.306@FreeBSD.org>
In-Reply-To: <4B8E1489.2070306@omnilan.de>
References:  <1266934981.00222684.1266922202@10.7.7.3> <4B83EFD4.8050403@FreeBSD.org> <4B8E1489.2070306@omnilan.de>

Harald Schmalzbauer wrote:
> Alexander Motin wrote on 23.02.2010 16:10 (localtime):
>> Harald Schmalzbauer wrote:
>>> I'm frequently getting my machine locked with ahcichX timeouts:
>>> ahcich2: Timeout on slot 0
>>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr
>>> 00000000
>>> ahcich2: Timeout on slot 8
>>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr
>>> 00000000
>>> ahcich2: Timeout on slot 8
>>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr
>>> 00000000
>>> ...
>>
>> Given that is (Interrupt status) is zero and `rs == cs | ss` (the running
>> command bitmasks in the driver and the hardware), the controller doesn't
>> report command completion. Given the TFD status 0xc0 with the BUSY bit
>> set, I would suppose that either the disk is stuck in command processing
>> for some reason, or the controller missed the command completion status.
>>
>> Have you noticed a 30-second pause (the default ATA timeout) before the
>> timeout message is printed? I just want to be sure that the driver waited
>> long enough before giving up.
>>
>>> This happens when backup over GbE overloads ZFS/HDD capabilities.
>>> I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking
>>> up almost immediately, but it still happens.
>>> When I don't use ahci but ataahci (the old driver, if I understand things
>>> correctly) I also see the ZFS burst write congestion, but it doesn't
>>> lead to controller timeouts that block the machine.
>>>
>>> Sometimes the machine recovers from the disk lock, but most often I have
>>> to reboot.
>>
>> How does it look when it doesn't recover? Can you send me the full log
>> messages?
> 
> Hello, this morning I had a stall, but the machine recovered after about
> one minute. Here's what I got from the kernel:
> ahcich2: Timeout on slot 29
> ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr
> 00000000
> em1: watchdog timeout -- resetting
> em1: watchdog timeout -- resetting
> ahcich2: Timeout on slot 10
> ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 serr
> 00000000
> ahcich2: Timeout on slot 18
> ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr
> 00000000
> ahcich2: Timeout on slot 2
> ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr
> 00000000
> ahcich2: Timeout on slot 2
> ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 serr
> 00000000
> 
> Does this tell you something useful?

It doesn't. Judging by the logged register contents, commands are indeed
still running and no interrupts are requested. It is interesting to see the
em1 watchdog timeouts there. Could they be related somehow?
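
For anyone who wants to apply the same reading to their own logs, here is a
minimal sketch of the check described in this thread: is == 0, rs == cs | ss,
and the BUSY bit in tfd. It is not part of the driver. The function name
decode_dump and the returned keys are hypothetical, and it assumes rs is the
driver's running-slot mask while cs and ss come from the hardware.

```python
import re

# ATA status BUSY bit, carried in the low byte of the tfd value.
ATA_S_BUSY = 0x80

# Register names in the order they appear in the ahcich timeout dump.
FIELDS = ("is", "cs", "ss", "rs", "tfd", "serr")
DUMP_RE = re.compile(r"\b" + " ".join(f"{name} ([0-9a-f]+)" for name in FIELDS))


def decode_dump(text):
    """Summarize one register dump (hypothetical helper, not driver code)."""
    # Mail clients wrap the serr value onto the next line and add '>' quote
    # marks, so drop the marks and collapse all whitespace first.
    flat = " ".join(text.replace(">", " ").split())
    match = DUMP_RE.search(flat)
    if match is None:
        raise ValueError("no ahcich register dump found")
    reg = dict(zip(FIELDS, (int(value, 16) for value in match.groups())))
    return {
        # Non-zero is means the controller has raised an interrupt cause.
        "interrupt_pending": reg["is"] != 0,
        # The driver's running mask should equal the union of hardware masks.
        "driver_matches_hw": reg["rs"] == (reg["cs"] | reg["ss"]),
        # BUSY set means the device has not finished the current command.
        "busy": bool(reg["tfd"] & ATA_S_BUSY),
        # Slot numbers the driver still considers in flight.
        "running_slots": [i for i in range(32) if (reg["rs"] >> i) & 1],
    }
```

Fed the dumps quoted above, every one reports no pending interrupt and
matching masks, and only the last one (tfd 40) has BUSY cleared.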

-- 
Alexander Motin


