Date:      Fri, 12 Oct 2012 09:41:56 -0700
From:      "Keegan,Nate" <nate.keegan@prescott-az.gov>
To:        "freebsd-questions@freebsd.org" <freebsd-questions@freebsd.org>
Subject:   ahcich Timeouts SATA SSD
Message-ID:  <0488BA670C8E594D93BE0556FEB89063054C373D29@obsidian.ad.cityofprescott.org>

My configuration is as follows:

FreeBSD 8.2-RELEASE
Supermicro X8DTi-LN4F (Intel Tylersburg 5520 chipset) motherboard
24 GB system memory
32 x Hitachi Deskstar 5K3000 disks connected to 4 x Intel SASUC8I (LSI 3081E-R) in IT mode
2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap
SSDs are connected to the on-board SATA ports on the motherboard

This system was commissioned in February of 2012 and ran without issue as a ZFS backup system on our network until about 3 weeks ago.

At that time I started getting kernel panics due to timeouts on the on-board SATA devices. The only change to the system since it was built was the addition of an SSD for swap (32 GB swap device), and this issue did not appear until several months after that was added.

My initial thought was that I might have a bad SSD, so I swapped out one of the Crucial SSDs, and the problem happened again a few days later.

I then moved to systematically replacing items such as SATA cables, memory, motherboard, etc. and the problem continued. For example, I swapped out the 4 SATA cables with brand new SATA cables and waited to see if the problem happened again. Once it did I moved on to replacing the motherboard with an identical motherboard, waited, etc.

I could not find an obvious hardware-related explanation for this behavior, so about a week and a half ago I did a fresh install of FreeBSD 9.0-RELEASE to move from the ATA driver to the AHCI driver, as I found some evidence that this was helpful.

The problem continued with something like this:

ahcich0: Timeout on slot 29 port 0
ahcich0: is 000000000 cs 00000000 ss e0000000 rs e0000000 tfd 40 serr 00000000 cmd 0004df17

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): lost device

ahcich0: AHCI reset: device not ready after 3100ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000003 ss 800000003 rs 80000003 tfd 80 serr 0000000 cmd 0004df17
(ada0:ahcich0:0:0:0): removing device entry

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 000000000 rs 0000002 tfd 80 serr 00000000 cmd 004c117
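
For reading the register dumps above, here is a minimal sketch. It assumes the `tfd` field is the AHCI port task file data register (low byte ATA status, next byte ATA error) and that `cs`, `ss`, and `rs` are per-slot bitmasks. The bit names are the standard ATA status bits; the function names `decode_tfd` and `list_slots` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of FreeBSD.

# Decode a hex tfd value (e.g. "80" or "00000080") into ATA status flags.
decode_tfd() {
    val=$(( 0x$1 ))
    status=$(( val & 0xff ))          # low byte: ATA status register
    error=$(( (val >> 8) & 0xff ))    # next byte: ATA error register
    flags=""
    if [ $(( status & 0x80 )) -ne 0 ]; then flags="$flags BSY"; fi
    if [ $(( status & 0x40 )) -ne 0 ]; then flags="$flags DRDY"; fi
    if [ $(( status & 0x08 )) -ne 0 ]; then flags="$flags DRQ"; fi
    if [ $(( status & 0x01 )) -ne 0 ]; then flags="$flags ERR"; fi
    printf 'status=0x%02x error=0x%02x flags=%s\n' "$status" "$error" "${flags# }"
}

# List the slot numbers (0-31) whose bits are set in a hex mask.
list_slots() {
    mask=$(( 0x$1 ))
    slots=""
    i=0
    while [ "$i" -lt 32 ]; do
        if [ $(( (mask >> i) & 1 )) -ne 0 ]; then
            slots="$slots $i"
        fi
        i=$(( i + 1 ))
    done
    echo "${slots# }"
}
```

With the values from the first dump, `decode_tfd 40` reports DRDY and `list_slots e0000000` reports slots 29 30 31. In the later dumps, `decode_tfd 80` reports BSY, which is consistent with the "device not ready" reset messages.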

When this happens the only way to recover the system is to hard boot via IPMI (cutting the power rather than hitting reset). I cannot say that a hard power cycle is necessary every time, but more often than not it is, as the on-board AHCI portion of the BIOS does not always see the disks after the event without one.

I have done a bunch of Google work on this and have seen the issue appear in FreeNAS and FreeBSD, but no clear-cut resolution in terms of how to address it or what causes it. Some people had a bad SSD, others had to disable NCQ or power management on their SSD, others had trouble with particular brands of SSD (Samsung), etc.

Nothing conclusive so far.

At the present time the issue happens every 1-2 hours unless I have the following in my /boot/loader.conf after the ahci_load statement:

ahci_load="YES"

# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1

hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1

I have a script in /usr/local/etc/rc.d which disables NCQ on these drives:

#!/bin/sh

CAMCONTROL=/sbin/camcontrol

$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null
$CAMCONTROL tags ada2 -N 1 > /dev/null
$CAMCONTROL tags ada3 -N 1 > /dev/null

exit 0
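
A loop form of the same script, as a minimal sketch: it runs the same `camcontrol tags <disk> -N 1` command per disk, but warns if a call fails (for example because a disk has already dropped off the bus). The function name `disable_ncq` and the overridable `CAMCONTROL` variable are illustrative choices, not anything FreeBSD provides.

```shell
#!/bin/sh
# Sketch only: same command as the script above, wrapped in a loop.
CAMCONTROL=${CAMCONTROL:-/sbin/camcontrol}

# Limit each named disk to a single outstanding tag.
# Returns non-zero if any camcontrol call failed.
disable_ncq() {
    rc=0
    for disk in "$@"; do
        if ! "$CAMCONTROL" tags "$disk" -N 1; then
            echo "warning: could not set tags on $disk" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage (commented out so the sketch does nothing by itself):
# disable_ncq ada0 ada1 ada2 ada3 > /dev/null
```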

I went ahead and pulled the Intel SSDs as they were showing ASR and hardware resets which incremented. Removing both of these disks from the system did not change the situation.

The combination of /boot/loader.conf and this script gets me 6 days or so of operation before the issue pops up again. If I remove these two items I get maybe 2 hours before the issue happens again.

Right now I'm down to one OS disk and one swap disk and that is it for SSD disks on the system.

At the last reboot (yesterday) I disabled APM on the disks (ada0 and ada1 at this point) to see if that makes a difference, as I found a reference to this being a potential problem.

I'm looking for insight/help on this as I'm about out of options. If there is a way to gather more information when this happens, post up information, etc. I'm open to trying it.
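
One small way to summarize what has already been captured is to count the timeout messages per AHCI channel. This is a minimal sketch, assuming the kernel messages reach a log file or a saved console capture in the same form as the excerpts above. The function name `summarize_timeouts` is made up for illustration.

```shell
#!/bin/sh
# Sketch only: counts "Timeout on slot" / "Poll timeout on slot" lines
# per ahcich channel from text on standard input.
summarize_timeouts() {
    awk '
        match($0, /ahcich[0-9]+: (Poll )?[Tt]imeout on slot [0-9]+/) {
            # The matched text starts with the channel name, e.g. "ahcich0".
            split(substr($0, RSTART, RLENGTH), f, ":")
            count[f[1]]++
        }
        END { for (ch in count) print ch, count[ch] }
    ' | sort
}

# Usage (the log path is an assumption; adjust to wherever messages land):
# summarize_timeouts < /var/log/messages
```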

What is driving me crazy is that I can't seem to come up with a concrete explanation as to why now and not back when the system was built. The issue only seems to happen when the system is idle, and the SSDs do not see much action other than hosting the OS, scripts, etc., while the drives on the Intel/LSI controllers are where the actual I/O happens.

The system logs do not show anything prior to the event. The OS will respond to ping requests after the issue, and if you have an active SSH session you will remain connected to the system until you attempt to do something like 'ls', 'ps', etc.

New SSH requests to the system get 'connection refused'.

As far as I can see I have three real options left:

* Hope that someone here knows something I don't
* Ditch SSD for straight SATA disks (I plan on doing this next week, before the next likely occurrence sometime Wed am), as perhaps there is some odd SATA/SSD interaction with FreeBSD or with the controller that I'm not aware of (I haven't had this happen with plain SATA and FreeBSD before)
* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended purpose of this system

I'm open to suggestions, direction, etc. to see if I can nail down what is going on and put this issue to bed for not only myself but for anyone else who might run into it before I lose what little hair and sanity I have left...heh

- Nate


