From owner-freebsd-stable  Sun Nov 28  9:57:35 1999
Delivered-To: freebsd-stable@freebsd.org
Received: from panzer.kdm.org (panzer.kdm.org [216.160.178.169])
	by hub.freebsd.org (Postfix) with ESMTP id 338F514D4C
	for <stable@freebsd.org>; Sun, 28 Nov 1999 09:57:32 -0800 (PST)
	(envelope-from ken@panzer.kdm.org)
Received: (from ken@localhost)
	by panzer.kdm.org (8.9.3/8.9.1) id KAA21363;
	Sun, 28 Nov 1999 10:56:12 -0700 (MST)
	(envelope-from ken)
Message-Id: <199911281756.KAA21363@panzer.kdm.org>
Subject: Re: ahc problems (with vinum?)
In-Reply-To: <199911281633.KAA55332@aurora.sol.net> from Joe Greco at "Nov 28, 1999 10:33:07 am"
To: jgreco@ns.sol.net (Joe Greco)
Date: Sun, 28 Nov 1999 10:56:12 -0700 (MST)
Cc: dgilbert@velocet.ca, stable@freebsd.org
From: "Kenneth D. Merry" <ken@kdm.org>
X-Mailer: ELM [version 2.4ME+ PL54 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Joe Greco wrote...
> Just having spent a week debugging a (very) intermittent SCSI bus problem,
> I agree that I've seen some odd behaviour of this sort.  What's even more
> exasperating is that, at least in some cases, it does appear to recover
> the one device that erred, but the rest stop functioning.
> 
> I've got serial consoles on my machines, let me see if I can dig up...
> 

Okay, you've got two problems that I can see.

[ ... ]


> Copyright (c) 1992-1999 FreeBSD Inc.
> Copyright (c) 1982, 1986, 1989, 1991, 1993
> 	The Regents of the University of California. All rights reserved.
> FreeBSD 3.3-RELEASE #0: Mon Nov 22 13:38:07 CST 1999
>     root@host:/usr/src/sys/compile/DEMO

The first problem is that you're running 3.3-R with two 7890s.  Justin
worked around a bug in the 7890 in the Adaptec driver shortly after 3.3
came out.  I'd recommend at the very least updating your Adaptec driver,
although depending on your circumstances, it might be easier to just update
to the latest -stable.

That isn't where your problems are showing up, however.  (Likely you
haven't loaded your system enough to trigger the 7890 problem.)

[ ... ]

> ahc0: <Adaptec aic7890/91 Ultra2 SCSI adapter> rev 0x00 int a irq 19 on pci0.6.0
> ahc0: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
> hfa0: <FORE Systems PCA-200E ATM> rev 0x00 int a irq 19 on pci0.9.0
> chip4: <PCI to PCI bridge (vendor=1011 device=0024)> rev 0x03 on pci0.10.0
> ahc1: <Adaptec 2940 Ultra2 SCSI adapter> rev 0x00 int a irq 17 on pci0.11.0
> ahc1: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
> ahc2: <Adaptec 2940 Ultra SCSI adapter> rev 0x00 int a irq 16 on pci0.12.0
> ahc2: aic7880 Wide Channel A, SCSI Id=7, 16/255 SCBs

[ ... ]

> # sh run&
> # dd: /dev/rda17: Device not configured
> dd: /dev/rda18: Device not configured
> (da13:ahc2:0:3:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
> (da13:ahc2:0:3:0): Other SCB Timeout
> (da11:ahc2:0:1:0): SCB 0xb - timed out in datain phase, SEQADDR == 0x110
> (da11:ahc2:0:1:0): Other SCB Timeout
> (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x110
> (da10:ahc2:0:0:0): BDR message in message buffer
> (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x10f
> (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> ahc2: Issued Channel A Bus Reset. 7 SCBs aborted
> (da11:ahc2:0:1:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x153
> (da11:ahc2:0:1:0): Other SCB Timeout
> (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
> (da10:ahc2:0:0:0): BDR message in message buffer
> (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
> (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> ahc2: Issued Channel A Bus Reset. 3 SCBs aborted
> (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
> (da10:ahc2:0:0:0): BDR message in message buffer
> (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x10f
> (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> ahc2: Issued Channel A Bus Reset. 6 SCBs aborted
> 4357+1 records in
> 4357+1 records out
> 4569600000 bytes transferred in 428.640450 secs (10660683 bytes/sec)

[ ... ]

"Timed out in {datain|dataout} phase" means that a transaction took longer
than 60 seconds to complete, and the bus was stuck in datain/dataout phase
at the time.

This is almost always the result of a cabling or termination problem.

So you'll probably want to replace the cable on your Ultra-Wide chain, and
verify that the termination is correct.
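
To see which chain the timeouts cluster on, the console log can be tallied by controller and device. The following is only a rough sketch: the regular expression is inferred from the message format quoted above, and the function name tally_timeouts is made up for illustration, not part of any FreeBSD tool.

```python
import re
import sys
from collections import Counter

# Assumed line format, taken from the console output quoted above, e.g.:
#   (da13:ahc2:0:3:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
TIMEOUT_RE = re.compile(
    r"\((?P<dev>\w+):(?P<ctrl>\w+):(?P<bus>\d+):(?P<target>\d+):(?P<lun>\d+)\): "
    r"SCB 0x[0-9a-f]+ - timed out in (?P<phase>\w+) phase"
)


def tally_timeouts(lines):
    """Count timeout messages per (controller, device, bus phase)."""
    counts = Counter()
    for line in lines:
        # search() rather than match() so "> " quote prefixes are tolerated.
        m = TIMEOUT_RE.search(line)
        if m:
            counts[(m.group("ctrl"), m.group("dev"), m.group("phase"))] += 1
    return counts


if __name__ == "__main__":
    # Usage: feed a saved console log on stdin.
    for (ctrl, dev, phase), n in sorted(tally_timeouts(sys.stdin).items()):
        print(f"{ctrl} {dev} {phase}: {n}")
```

In the excerpt above every timeout is on ahc2, which is the aic7880 Ultra-Wide controller, so that is the chain to check first.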

> run is a little script that sucks data in from all SCSI drives with dd and
> dumps it to /dev/null, in parallel.
> 
> Now, when the bus reset happens, often the drive listed will actually
> recover and continue going, but if so, the others will typically stop (but
> dd is just waiting for data).  This isn't written in stone, I've seen all
> drives drop off, and I've also seen the whole thing recover just fine.
> I have no idea what the result was for the incident listed above.  It was
> one of dozens of incidents.
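
For anyone wanting to reproduce this kind of load, a parallel read test along the lines described can be sketched as below. This is not the "run" script from the quoted message (which uses dd from sh); it is a Python stand-in, and the names drain and drain_all and the 1 MiB chunk size are assumptions chosen for illustration.

```python
import threading


def drain(path, results, chunk=1 << 20):
    """Read `path` to EOF and discard the data.

    Stores the number of bytes read in results[path], or the OSError
    if the device cannot be opened or read (e.g. "Device not configured").
    """
    total = 0
    try:
        # buffering=0 gives unbuffered reads, roughly what dd does.
        with open(path, "rb", buffering=0) as f:
            while True:
                buf = f.read(chunk)
                if not buf:
                    break
                total += len(buf)
        results[path] = total
    except OSError as err:
        results[path] = err


def drain_all(paths):
    """Read every path in parallel, one thread per device."""
    results = {}
    threads = [threading.Thread(target=drain, args=(p, results)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling drain_all with a list of raw device paths returns a byte count per device that completed, and the error for each one that did not. As with dd, a thread simply blocks if its drive stops answering.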

Well, the SCSI layer does its best to recover, but naturally if you've got
cabling problems that cause you to get stuck in certain bus phases, it
won't be able to recover from everything.

> The "reboot" bit is also mildly interesting.  FreeBSD (cam?) seems to have
> lots of problems halting or rebooting in the event that a device is
> unavailable or a scbus is hung.  I'd guess that it is waiting to flush some
> buffers or something, except that my tests only do reads - no writes.

Reboot problems with buffers not getting flushed are generally caused by
issues in the higher-level code.  I don't know whether they've been fixed
in -current, and I really don't have a good handle on why they happen.
Perhaps someone else has an idea on that one.

Ken
-- 
Kenneth Merry
ken@kdm.org

