Date: Sun, 14 Apr 2013 12:44:40 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Zaphod Beeblebrox <zbeeble@gmail.com>
Cc: freebsd-fs <freebsd-fs@freebsd.org>, Radio młodych bandytów <radiomlodychbandytow@o2.pl>, support@lists.pcbsd.org
Subject: Re: A failed drive causes system to hang
Message-ID: <20130414194440.GB38338@icarus.home.lan>
In-Reply-To: <CACpH0Mebufi5=bEsu6MF03NCn6gDmKkx-OP3sP14t3Xe3CXdpw@mail.gmail.com>
References: <516A8092.2080002@o2.pl> <9C59759CB64B4BE282C1D1345DD0C78E@multiplay.co.uk> <516AF61B.7060204@o2.pl> <20130414185117.GA38259@icarus.home.lan> <CACpH0Mebufi5=bEsu6MF03NCn6gDmKkx-OP3sP14t3Xe3CXdpw@mail.gmail.com>

On Sun, Apr 14, 2013 at 02:58:15PM -0400, Zaphod Beeblebrox wrote:
> I'd like to throw in my two cents here.  I've seen this (drives in RAID-1
> configuration) hanging whole systems.  Back in the IDE days, two drives
> were connected with one cable --- I largely wrote it off as a deficiency of
> IDE hardware and resolved to buy SCSI hardware for more important systems.
> Of late, the physical hardware for SCSI (SAS) and SATA drives have
> converged.  I'm willing to accept that SAS hardware may be built to a
> different standard, but I'm suspicious of the fact that a bad SATA drive on
> an ACH* controller can hang the whole system.

Note to readers: this is borderline off-topic and is going to confuse
the thread even more.  I will respond to this ONLY ONCE, and WILL NOT be
responding to this part of the thread past this point.

I have only seen this happen on very specific controllers (JMicron for
example), where either the AHCI driver was broken/badly written, or the
underlying AHCI option ROM/firmware code was broken/badly written.

> ... it's not complete, however.  Often pulling the drive's cable will
> unfreeze things.  It's also not entirely consistent.  Drives I have
> behind 4:1 port multipliers haven't (so far) hung the system that
> they're on (which uses ACH10).  Right now, I have a remote ACH10
> system that's hung hard a couple of times --- and it passes both its
> short and long SMART tests on both drives.

PMPs (port multipliers) are a *completely* separate beast, where some
AHCI controllers (at a silicon level) screw up/break.  In fact, the
IXP600/700 is one such controller, and workarounds had to be put into
FreeBSD and Linux for them.  I can dig up the commits if need be.

Rule of thumb (which you know -- this is for other readers): when using
a PMP, it's VERY IMPORTANT that this be disclosed up front.  PMPs add a
serious complication to analysis of the SATA subsystem as a whole, and
in a lot of cases visibility into details is lost as a result.
PMPs in general are "bleh".

> Is there no global timeout we can depend on here?

Please see kern.cam.ada.default_timeout (for adaX devices) and
kern.cam.pmp.default_timeout (for I/O requests going across a PMP).

Otherwise Alexander Motin (mav@) would be the guy to ask about PMP
issues, and/or get him hardware + provide a reliable reproduction
methodology for the issue.

All the above said:

Respectfully, please do not conflate your issue with this one.  Please
start a new thread about this issue if you wish (do not reply to this
thread and change the Subject line; please actually start a brand new
email to ensure no References headers are retained).

There is already too much crap going on in this thread, with 4 different
people with what are 4 different issues, and nobody at this point is
able to keep track of it all (including the participants).

This situation happens way, WAY too often with storage-related matters
on the list.  ANYTHING ZFS-related and ANYTHING storage-related results
in bandwagon-jumping and threads that spiral out of control/become
almost useless and certainly impossible to follow.  It needs to stop.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
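
For readers who want to check the two timeout tunables named above
(kern.cam.ada.default_timeout and kern.cam.pmp.default_timeout), here is
a minimal sketch in Python.  It assumes a FreeBSD host with sysctl(8) on
the PATH and that both values are plain integers (ada(4) and pmp(4)
describe them as seconds; verify this against your release).  The helper
names parse_sysctl_output and read_cam_timeouts are made up for this
illustration and are not part of any FreeBSD tool.

```python
import subprocess

# Sysctl names taken from the message above.
TIMEOUT_SYSCTLS = (
    "kern.cam.ada.default_timeout",
    "kern.cam.pmp.default_timeout",
)


def parse_sysctl_output(text):
    """Parse sysctl(8) output lines of the form 'name: value' into a dict."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip anything that is not a 'name: value' line
        # Assumption: these tunables are integers.
        values[name.strip()] = int(value.strip())
    return values


def read_cam_timeouts(names=TIMEOUT_SYSCTLS):
    """Query the given sysctls (FreeBSD only) and return {name: value}."""
    result = subprocess.run(
        ["sysctl", *names], capture_output=True, text=True, check=True
    )
    return parse_sysctl_output(result.stdout)
```

On a FreeBSD machine, calling read_cam_timeouts() returns a dict keyed by
sysctl name.  This only reads the values; whether raising or lowering
them helps with a hanging drive is a separate question that this sketch
does not answer.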