Date:      Fri, 26 Mar 2010 00:04:41 +0100
From:      Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        Alexander Motin <mav@freebsd.org>, freebsd-stable@freebsd.org
Subject:   Strong elcheapo-ZFS-disk recommendation [Was: Re: ahcich timeouts, only with ahci, not with ataahci]
Message-ID:  <4BABEC09.8070709@omnilan.de>
In-Reply-To: <4B9CC493.30009@omnilan.de>
References:  <1266934981.00222684.1266922202@10.7.7.3>	<4B83EFD4.8050403@FreeBSD.org>	<4B8E1489.2070306@omnilan.de>	<4B8E1B3D.306@FreeBSD.org>	<4B8E1DA9.2090406@omnilan.de>	<20100303110647.GA51588@icarus.home.lan>	<4B9C034B.90900@omnilan.de> <4B9CC493.30009@omnilan.de>

Harald Schmalzbauer schrieb am 14.03.2010 12:12 (localtime):
> Harald Schmalzbauer schrieb am 13.03.2010 22:27 (localtime):
>> Am 03.03.2010 12:06, schrieb Jeremy Chadwick:
>>> On Wed, Mar 03, 2010 at 09:28:25AM +0100, Harald Schmalzbauer wrote:
>>>> Alexander Motin schrieb am 03.03.2010 09:18 (localtime):
>>>>> Harald Schmalzbauer wrote:
>>>>>> Alexander Motin schrieb am 23.02.2010 16:10 (localtime):
>>>>>>> Harald Schmalzbauer wrote:
>>>>>>>> I'm frequently getting my machine locked with ahcichX timeouts:
>>>>>>>> ahcich2: Timeout on slot 0
>>>>>>>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr 00000000
>>>>>>>> ahcich2: Timeout on slot 8
>>>>>>>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000
>>>>>>>> ahcich2: Timeout on slot 8
>>>>>>>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr 00000000
>>>>>>>> ...
>>>>>>> Given that is (interrupt status) is zero and `rs == cs | ss` (the
>>>>>>> running-command bitmasks in driver and hardware), the controller
>>>>>>> doesn't report command completion. Looking at the TFD status 0xc0
>>>>>>> with the BUSY bit set, I would suppose that either the disk is stuck
>>>>>>> in command processing for some reason, or the controller missed the
>>>>>>> command completion status.
>>>>>>>
>>>>>>> Have you noticed a 30-second pause (the default ATA timeout) before
>>>>>>> the timeout message is printed? I just want to be sure that the
>>>>>>> driver waited long enough before giving up.
>>>>>>>
>>>>>>>> This happens when backup over GbE overloads ZFS/HDD capabilities.
>>>>>>>> I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from
>>>>>>>> locking up almost immediately, but it still happens.
>>>>>>>> When I don't use ahci but ataahci (the old driver, if I understand
>>>>>>>> things correctly) I also see the ZFS burst write congestion, but
>>>>>>>> this doesn't lead to controller timeouts that block the machine.
>>>>>>>>
>>>>>>>> Sometimes the machine recovers from the disk lock, but most often
>>>>>>>> I have to reboot.
>>>>>>> How does it look when it doesn't? Can you send me the full log messages?
>>>>>> Hello, this morning I had a stall, but the machine recovered after
>>>>>> about one minute. Here's what I got from the kernel:
>>>>>> ahcich2: Timeout on slot 29
>>>>>> ahcich2: is 00000000 cs 00000003 ss e0000003 rs e0000003 tfd c0 serr 00000000
>>>>>> em1: watchdog timeout -- resetting
>>>>>> em1: watchdog timeout -- resetting
>>>>>> ahcich2: Timeout on slot 10
>>>>>> ahcich2: is 00000000 cs 00006000 ss 00007c00 rs 00007c00 tfd c0 serr 00000000
>>>>>> ahcich2: Timeout on slot 18
>>>>>> ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd c0 serr 00000000
>>>>>> ahcich2: Timeout on slot 2
>>>>>> ahcich2: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd c0 serr 00000000
>>>>>> ahcich2: Timeout on slot 2
>>>>>> ahcich2: is 00000000 cs 00000000 ss 0000000c rs 0000000c tfd 40 serr 00000000
>>>>>>
>>>>>> Does this tell you something useful?
>>>>> It doesn't. Looking at the logged register content, commands are
>>>>> indeed still running and no interrupts are requested. It is
>>>>> interesting to see the em1 watchdog timeout there. Aren't they
>>>>> related somehow?
>>>>     dmesg | grep "irq 18":
>>>> uhci0: <Intel 82801I (ICH9) USB controller> port 0x20c0-0x20df irq 18 at device 26.0 on pci0
>>>> uhci4: <Intel 82801I (ICH9) USB controller> port 0x2040-0x205f irq 18 at device 29.2 on pci0
>>>> em1: <Intel(R) PRO/1000 Network Connection 6.9.14> port 0x1000-0x103f mem 0xe1920000-0xe193ffff,0xe1900000-0xe191ffff irq 18 at device 2.0 on pci3
>>>> ichsmb0: <Intel 82801I (ICH9) SMBus controller> port 0x2000-0x201f mem 0xe1a22000-0xe1a220ff irq 18 at device 31.3 on pci0
>>>>
>>>> They don't share the same IRQ at least.
...
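
Side note on the register dumps quoted above: going by Alexander's
explanation, each dump can be checked mechanically. Below is a minimal
Python sketch under stated assumptions, not code from the ahci driver.
The regular expression, the function name decode_dump and the result
keys are made up for illustration. The only facts used are that "is" is
the interrupt status, that rs should equal cs | ss, and that bit 0x80 of
the ATA status byte in tfd is BSY. It also assumes each wrapped log line
has been rejoined into a single line.

```python
import re

# Register names in the order the quoted timeout message prints them.
FIELDS = ("is", "cs", "ss", "rs", "tfd", "serr")

# Hypothetical pattern for illustration: "<name> <hex>" pairs separated by spaces.
DUMP_RE = re.compile(r"\b" + r"\s+".join(rf"{name} ([0-9a-f]+)" for name in FIELDS))

ATA_STATUS_BSY = 0x80  # BSY bit of the ATA status byte (low byte of tfd)


def decode_dump(line):
    """Decode one register dump line; return None if the line is not a dump."""
    m = DUMP_RE.search(line)
    if m is None:
        return None
    regs = dict(zip(FIELDS, (int(value, 16) for value in m.groups())))
    return {
        # Non-zero "is" would mean the controller raised an interrupt.
        "interrupt_pending": regs["is"] != 0,
        # Per the thread: running slots should equal cs | ss.
        "masks_consistent": regs["rs"] == (regs["cs"] | regs["ss"]),
        # One bit per command slot still marked as running.
        "outstanding_slots": [bit for bit in range(32) if (regs["rs"] >> bit) & 1],
        # 0xc0 has BSY set, 0x40 does not.
        "device_busy": bool(regs["tfd"] & ATA_STATUS_BSY),
    }
```

For the "Timeout on slot 29" dump this reports no pending interrupt,
consistent masks, slots 0, 1, 29, 30 and 31 outstanding, and a busy
device, which matches the reading given in the thread.
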
For the record: I replaced the Samsung F2 1.5TB 5200rpm EcoGreen drives.
In my dreams that should have improved my 3-disk RAIDZ from 33MB/s
average (>5G transfers) to about 60MB/s.
In reality, it improved it to 90MB/s, _and_ completely eliminated the
ahcich timeouts, as well as the burst writes during which the complete
machine was stuck while ZFS flushed/wrote transaction groups.
So the difference in ZFS usage between the disks is far beyond my
imagination.
I can highly recommend the:
=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K2000
Device Model:     Hitachi HDS722020ALA330
Serial Number:    JK1174YAH9ZH7W
Firmware Version: JKAOA28A
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Mar 25 23:48:13 2010 CET

Some TB restored so far, no errors, no oddities, no problems at all.
Same server, same FreeBSD, but ahci.ko enabled again (so with NCQ,
thanks mav and friends).

I can confirm that the F2 Samsung drives worked fine with the old ata
driver (that is, without NCQ enabled) and ZFS. They did their job for 2
weeks without any error in that time, but reproducibly showed ahcich
timeouts (with the newer ahci.ko) if the load was higher than about
50MB/s on a raidz with 3 disks (same ICH9).
So I got my problem solved by replacing my HDDs (even the old ones had
the latest firmware) and also got triple performance :))

Just to share the info.

Thanks,

-Harry

