Date: Mon, 22 Jul 2013 16:17:14 -0700
From: Dieter BSD <dieterbsd@gmail.com>
To: freebsd-hardware@freebsd.org
Subject: Re: Reset Problem with SATA Port Multiplier
Message-ID: <CAA3ZYrCrz-%2BJWFDnYU5ueBeuawZ9QpMNFYJ=rNG-%2BBj9LYrHmQ@mail.gmail.com>

> Drives: 45 * Seagate Altos ST3000NC002
> Port Multipliers: 9 * SiI3826
> SATA Controller: 3 * Marvell 88SX7042
>
> After a few hours of a database-like workload over ZFS (NCQ enable, disk
> write caches disabled), a disk becomes unresponsive (we think due to a
> drive firmware problem):

I have an 8.2 machine with SiI3132 controllers with SiI3726 pm and a
variety of drives. I have been getting the "Timeout on slot <small
integer>" followed by "lost device". Sometimes the device reappears.
(Although the /dev/ufs/label does *not* reappear. :-( ) I have not
seen the other drives on the pm get removed, or had to power cycle to
recover.

Seagate ST3000DM001 with CC4B firmware seems especially bad.
ST3000DM001 with CC24 firmware have been ok. So your theory that the
drive firmware has a problem seems promising.

Sounds like FreeBSD is doing something bad to the pm, which Linux
isn't doing. Perhaps log the commands the OS sends to the controller
(over the network to a 2nd machine, or to a local disk not on a pm)
and compare BSD to Linux? Perhaps start logging when you get the
first timeout, to save hours of commands to wade through.

Alternately you could stare at the driver sources until enlightenment
occurs.

AFAIK FreeBSD has never gotten a proper workaround for the quirk in
the 1st generation SiI sata controllers, while they run fine on
NetBSD. There might be a bug/quirk in the pm's firmware that FreeBSD
triggers but Linux doesn't.
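
The log-and-compare idea could be sketched roughly as below. This is
only an illustration: it assumes each OS's command trace has already
been captured to a list of text lines of the form "<timestamp>
<command text>", which is a made-up format, and the command names
(CMD_A etc.) are placeholders, not real driver output. The function
names are hypothetical too.

```python
import difflib


def normalize(line):
    """Drop the leading timestamp so only the command text is compared.

    Assumed (hypothetical) line format: "<timestamp> <command text>".
    """
    parts = line.strip().split(maxsplit=1)
    return parts[1] if len(parts) == 2 else line.strip()


def diff_traces(bsd_lines, linux_lines):
    """Return a unified diff of two command traces, ignoring timestamps."""
    a = [normalize(l) for l in bsd_lines]
    b = [normalize(l) for l in linux_lines]
    return list(
        difflib.unified_diff(a, b, fromfile="freebsd", tofile="linux", lineterm="")
    )


if __name__ == "__main__":
    # Placeholder traces; real ones would come from whatever logging
    # gets added to the drivers on each OS.
    bsd = ["0.10 CMD_A", "0.20 CMD_B", "0.30 CMD_C"]
    linux = ["5.00 CMD_A", "5.10 CMD_C"]
    for line in diff_traces(bsd, linux):
        print(line)
```

Lines prefixed with "-" appear only in the FreeBSD trace and lines
prefixed with "+" only in the Linux trace, so a command that only one
OS sends to the pm stands out without reading both logs in full.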