From owner-freebsd-stable Sun Nov 28 9:57:35 1999
Delivered-To: freebsd-stable@freebsd.org
Received: from panzer.kdm.org (panzer.kdm.org [216.160.178.169])
    by hub.freebsd.org (Postfix) with ESMTP id 338F514D4C
    for ; Sun, 28 Nov 1999 09:57:32 -0800 (PST)
    (envelope-from ken@panzer.kdm.org)
Received: (from ken@localhost)
    by panzer.kdm.org (8.9.3/8.9.1) id KAA21363;
    Sun, 28 Nov 1999 10:56:12 -0700 (MST)
    (envelope-from ken)
Message-Id: <199911281756.KAA21363@panzer.kdm.org>
Subject: Re: ahc problems (with vinum?)
In-Reply-To: <199911281633.KAA55332@aurora.sol.net> from Joe Greco at "Nov 28, 1999 10:33:07 am"
To: jgreco@ns.sol.net (Joe Greco)
Date: Sun, 28 Nov 1999 10:56:12 -0700 (MST)
Cc: dgilbert@velocet.ca, stable@freebsd.org
From: "Kenneth D. Merry"
X-Mailer: ELM [version 2.4ME+ PL54 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Joe Greco wrote...
> Just having spent a week debugging a (very) intermittent SCSI bus problem,
> I agree that I've seen some odd behaviour of this sort. What's even more
> exasperating is that, at least in some cases, it does appear to recover
> the one device that erred, but the rest stop functioning.
>
> I've got serial consoles on my machines, let me see if I can dig up...
>

Okay, you've got two problems that I can see.

[ ... ]

> Copyright (c) 1992-1999 FreeBSD Inc.
> Copyright (c) 1982, 1986, 1989, 1991, 1993
> The Regents of the University of California. All rights reserved.
> FreeBSD 3.3-RELEASE #0: Mon Nov 22 13:38:07 CST 1999
> root@host:/usr/src/sys/compile/DEMO

The first problem is that you're running 3.3-R with two 7890s. Justin
worked around a bug in the 7890 in the Adaptec driver shortly after 3.3
came out. I'd recommend at the very least updating your Adaptec driver,
although depending on your circumstances, it might be easier to just
update to the latest -stable.

That isn't where your problems are showing up, however. (Likely you
haven't loaded your system enough to trigger the 7890 problem.)

[ ... ]

> ahc0: rev 0x00 int a irq 19 on pci0.6.0
> ahc0: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
> hfa0: rev 0x00 int a irq 19 on pci0.9.0
> chip4: rev 0x03 on pci0.10.0
> ahc1: rev 0x00 int a irq 17 on pci0.11.0
> ahc1: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
> ahc2: rev 0x00 int a irq 16 on pci0.12.0
> ahc2: aic7880 Wide Channel A, SCSI Id=7, 16/255 SCBs

[ ... ]

> # sh run&
> # dd: /dev/rda17: Device not configured
> dd: /dev/rda18: Device not configured
> (da13:ahc2:0:3:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
> (da13:ahc2:0:3:0): Other SCB Timeout
> (da11:ahc2:0:1:0): SCB 0xb - timed out in datain phase, SEQADDR == 0x110
> (da11:ahc2:0:1:0): Other SCB Timeout
> (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x110
> (da10:ahc2:0:0:0): BDR message in message buffer
> (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x10f
> (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> ahc2: Issued Channel A Bus Reset. 7 SCBs aborted
> (da11:ahc2:0:1:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x153
> (da11:ahc2:0:1:0): Other SCB Timeout
> (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
> (da10:ahc2:0:0:0): BDR message in message buffer
> (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
> (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> ahc2: Issued Channel A Bus Reset. 3 SCBs aborted
> (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
> (da10:ahc2:0:0:0): BDR message in message buffer
> (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x10f
> (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> ahc2: Issued Channel A Bus Reset. 6 SCBs aborted
> 4357+1 records in
> 4357+1 records out
> 4569600000 bytes transferred in 428.640450 secs (10660683 bytes/sec)

[ ... ]

"Timed out in {datain|dataout} phase" means that a transaction took
longer than 60 seconds to complete, and the bus was stuck in
datain/dataout phase at the time.

This is almost always the result of a cabling or termination problem.

So you'll probably want to replace the cable on your Ultra-Wide chain,
and verify that the termination is correct.

> run is a little script that sucks data in from all SCSI drives with dd and
> dumps it to /dev/null, in parallel.
>
> Now, when the bus reset happens, often the drive listed will actually
> recover and continue going, but if so, the others will typically stop (but
> dd is just waiting for data). This isn't written in stone, I've seen all
> drives drop off, and I've also seen the whole thing recover just fine.
> I have no idea what the result was for the incident listed above. It was
> one of dozens of incidents.

Well, the SCSI layer does its best to recover, but naturally if you've
got cabling problems that cause you to get stuck in certain bus phases,
it won't be able to recover from everything.

> The "reboot" bit is also mildly interesting. FreeBSD (cam?) seems to have
> lots of problems halting or rebooting in the event that a device is
> unavailable or a scbus is hung. I'd guess that it is waiting to flush some
> buffers or something, except that my tests only do reads - no writes.

Reboot problems with buffers not getting flushed are generally because
of issues in the higher level code. I don't know if they've been fixed
in -current or not, and I really don't have a good handle on why it
happens.

Perhaps someone else has an idea on that one..

Ken
--
Kenneth Merry
ken@kdm.org

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message
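
When there are dozens of incidents like the one quoted above, it can help to tally the "timed out in ... phase" messages per controller and device, to see whether they all land on one chain. What follows is only a rough sketch in Python. It assumes the console messages have exactly the shape quoted in this thread (other driver versions may word them differently), and the names `summarize`, `TIMEOUT_RE`, `RESET_RE` and `SAMPLE` are made up for illustration, not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Assumption: message shapes are copied from the console output quoted in
# this thread; they are not taken from the driver source.
TIMEOUT_RE = re.compile(
    r"\((?P<dev>\w+):(?P<ctrl>\w+):\d+:\d+:\d+\): SCB 0x[0-9a-f]+ - "
    r"timed out in (?P<phase>\w+) phase"
)
RESET_RE = re.compile(
    r"(?P<ctrl>\w+): Issued Channel \w+ Bus Reset\. (?P<n>\d+) SCBs aborted"
)


def summarize(log_text):
    """Count timeouts per (controller, device, phase) and resets per controller."""
    timeouts = Counter()
    resets = Counter()
    aborted = Counter()
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[(m["ctrl"], m["dev"], m["phase"])] += 1
            continue
        m = RESET_RE.search(line)
        if m:
            resets[m["ctrl"]] += 1
            aborted[m["ctrl"]] += int(m["n"])
    return timeouts, resets, aborted


# A few lines taken from the quoted console output, quote markers included.
SAMPLE = """\
> (da13:ahc2:0:3:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
> (da13:ahc2:0:3:0): Other SCB Timeout
> (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x110
> (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x10f
> ahc2: Issued Channel A Bus Reset. 7 SCBs aborted
"""

if __name__ == "__main__":
    timeouts, resets, aborted = summarize(SAMPLE)
    for (ctrl, dev, phase), count in sorted(timeouts.items()):
        print(f"{ctrl} {dev}: {count} timeout(s) in {phase} phase")
    for ctrl in sorted(resets):
        print(f"{ctrl}: {resets[ctrl]} bus reset(s), {aborted[ctrl]} SCBs aborted")
```

If every timeout ends up under a single controller, as with ahc2 in the quoted output, that is consistent with a problem on that one chain's cable or termination and not with the disks themselves.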