Date:      Sun, 28 Nov 1999 13:45:12 -0600 (CST)
From:      Joe Greco <jgreco@ns.sol.net>
To:        ken@kdm.org (Kenneth D. Merry)
Cc:        dgilbert@velocet.ca, stable@freebsd.org
Subject:   Re: ahc problems (with vinum?)
Message-ID:  <199911281945.NAA68811@aurora.sol.net>
In-Reply-To: <199911281756.KAA21363@panzer.kdm.org> from "Kenneth D. Merry" at "Nov 28, 1999 10:56:12 am"

> > Copyright (c) 1992-1999 FreeBSD Inc.
> > Copyright (c) 1982, 1986, 1989, 1991, 1993
> > 	The Regents of the University of California. All rights reserved.
> > FreeBSD 3.3-RELEASE #0: Mon Nov 22 13:38:07 CST 1999
> >     root@host:/usr/src/sys/compile/DEMO
> 
> The first problem is that you're running 3.3-R with two 7890s.  Justin
> worked around a bug in the 7890 in the Adaptec driver shortly after 3.3
> came out.  I'd recommend at the very least updating your Adaptec driver,
> although depending on your circumstances, it might be easier to just update
> to the latest -stable.

Noted.  One is an onboard controller, part of the ASUS P2B-DS.  This
particular system was supposed to have a 3940, but I didn't have one
so I crammed in two 2940-type controllers.  Would this also be an issue
for a system with the onboard controller and a 3940-type controller?

> That isn't where your problems are showing up, however.  (Likely you
> haven't loaded your system enough to trigger the 7890 problem.)

Maybe/maybe not.  What might I expect to see from such a problem?

I have certainly beaten the $#!+ out of these systems in a variety of ways,
and have run into some odd things.  Most were traceable to SCSI issues.
Some didn't get classified.  I'm running vinum in a ten-filesystem config
on top of the 18 18GB drives, and I copy in data from another machine.  I
then have an application which mmap()'s the files, doing search and replace
ops on the data.  Running this app in parallel causes the system to hang
(eventually causing the watchdog to expire and reset the system).  Running
it serially on one fs at a time doesn't.  This is probably the most
worrisome of the issues I've seen.  If you have a recommended revision of
the ahc driver you'd like me to try, let me know.
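
For anyone who wants to reproduce the parallel-versus-serial difference, here
is a minimal sketch of that kind of workload.  It is not the actual
application: the function names, the equal-length restriction on the patterns,
and the use of threads rather than separate processes are all assumptions made
for illustration.

```python
import mmap
import os
from concurrent.futures import ThreadPoolExecutor


def replace_in_place(path, old, new):
    """mmap() the file and overwrite each occurrence of old with new.

    Returns the number of replacements.  The patterns must be the same
    length because the mapping cannot grow or shrink the file.
    """
    if not old or len(old) != len(new):
        raise ValueError("patterns must be non-empty and of equal length")
    if os.path.getsize(path) == 0:
        return 0  # an empty file cannot be mapped
    count = 0
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
        pos = m.find(old)
        while pos != -1:
            m[pos:pos + len(old)] = new
            count += 1
            pos = m.find(old, pos + len(old))
        m.flush()  # push dirty pages back to the file
    return count


def replace_all_serial(paths, old, new):
    """One file at a time (the case that did not hang)."""
    return [replace_in_place(p, old, new) for p in paths]


def replace_all_parallel(paths, old, new, workers=4):
    """Several files at once (the case that eventually hung the system)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: replace_in_place(p, old, new), paths))
```

Pointing replace_all_parallel() at one file per filesystem should approximate
the parallel run; replace_all_serial() over the same list is the serial
comparison.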

> > # sh run&
> > # dd: /dev/rda17: Device not configured
> > dd: /dev/rda18: Device not configured
> > (da13:ahc2:0:3:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
> > (da13:ahc2:0:3:0): Other SCB Timeout
> > (da11:ahc2:0:1:0): SCB 0xb - timed out in datain phase, SEQADDR == 0x110
> > (da11:ahc2:0:1:0): Other SCB Timeout
> > (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x110
> > (da10:ahc2:0:0:0): BDR message in message buffer
> > (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x10f
> > (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> > ahc2: Issued Channel A Bus Reset. 7 SCBs aborted
> > (da11:ahc2:0:1:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x153
> > (da11:ahc2:0:1:0): Other SCB Timeout
> > (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
> > (da10:ahc2:0:0:0): BDR message in message buffer
> > (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
> > (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> > ahc2: Issued Channel A Bus Reset. 3 SCBs aborted
> > (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
> > (da10:ahc2:0:0:0): BDR message in message buffer
> > (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x10f
> > (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> > ahc2: Issued Channel A Bus Reset. 6 SCBs aborted
> > 4357+1 records in
> > 4357+1 records out
> > 4569600000 bytes transferred in 428.640450 secs (10660683 bytes/sec)
> 
> [ ... ]
> 
> "Timed out in {datain|dataout} phase" means that a transaction took longer
> than 60 seconds to complete, and the bus was stuck in datain/dataout phase
> at the time.
> 
> This is almost always the result of a cabling or termination problem.
> 
> So you'll probably want to replace the cable on your Ultra-Wide chain, and
> verify that the termination is correct.

It's more complex than that. :-)  These machines are intended for deployment
in remote areas, and realistically I may never see many of them ever again
after that point.  They are rackmount in Antec PC cases and Kingston 9-bay
drive arrays; the drives themselves are mounted in Antec 690 drive modules.
This allows for easy replacement/upgrade in the event of problems, and with
the exception of this one problem-child machine, has worked out fantastically
so far.  But it introduces multi-multi variables into the equation.  The
3940-to-PC backplate cable, the external cables, the terminators, the
internal 9-position Kingston ribbon cable, any of the 9 receiving brackets,
any of the 9 drive modules, and any of the 9 drives can potentially be an
issue.  The Antec drive modules seem to be the typical source of flakiness;
about 1 in 20 seems to give problems.

Okay, now, stop rolling your eyes.  I know it is ugly from a SCSI
perspective, but it is very functional and very useful, not to mention very
nice and damn fast.  It's hard to build something like that which can also
be deployed in a remote location where you'll have to explain to someone who
has 1/2 a clue what you want replaced, and how.  I prefer the
no-screwdriver-required method.
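
With that many possible culprits, one way to narrow things down is to count
which device and controller each timeout names.  The sketch below tallies the
console messages quoted earlier in this message.  The regular expressions are
written against those lines only, and the function name is made up for
illustration, so treat it as a starting point rather than a complete parser.

```python
import re
from collections import Counter

# Matches e.g. "(da13:ahc2:0:3:0): SCB 0xa - timed out in datain phase, ..."
TIMEOUT_RE = re.compile(
    r"\((?P<dev>da\d+):(?P<ctlr>ahc\d+):\d+:\d+:\d+\): "
    r"SCB 0x[0-9a-f]+ - timed out in (?P<phase>\w+) phase"
)
# Matches e.g. "ahc2: Issued Channel A Bus Reset. 7 SCBs aborted"
RESET_RE = re.compile(r"(?P<ctlr>ahc\d+): Issued Channel (?P<chan>[AB]) Bus Reset")


def tally(lines):
    """Count timeouts per (controller, device) and resets per (controller, channel)."""
    timeouts = Counter()
    resets = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[(m["ctlr"], m["dev"])] += 1
            continue
        m = RESET_RE.search(line)
        if m:
            resets[(m["ctlr"], m["chan"])] += 1
    return timeouts, resets


# Sample lines copied from the log quoted above.
SAMPLE = """\
(da13:ahc2:0:3:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
(da13:ahc2:0:3:0): Other SCB Timeout
(da11:ahc2:0:1:0): SCB 0xb - timed out in datain phase, SEQADDR == 0x110
(da11:ahc2:0:1:0): Other SCB Timeout
(da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x110
(da10:ahc2:0:0:0): BDR message in message buffer
(da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x10f
(da10:ahc2:0:0:0): no longer in timeout, status = 34b
ahc2: Issued Channel A Bus Reset. 7 SCBs aborted
""".splitlines()

if __name__ == "__main__":
    timeouts, resets = tally(SAMPLE)
    for key, n in timeouts.most_common():
        print(key, n)
    print(dict(resets))
```

If the counts keep pointing at the same device after a drive module has been
swapped to a different bay, that would suggest the module or drive rather than
the cable or bracket.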

> > run is a little script that sucks data in from all SCSI drives with dd and
> > dumps it to /dev/null, in parallel.
> > 
> > Now, when the bus reset happens, often the drive listed will actually
> > recover and continue going, but if so, the others will typically stop (but
> > dd is just waiting for data).  This isn't written in stone, I've seen all
> > drives drop off, and I've also seen the whole thing recover just fine.
> > I have no idea what the result was for the incident listed above.  It was
> > one of dozens of incidents.
> 
> Well, the SCSI layer does its best to recover, but naturally if you've got
> cabling problems that cause you to get stuck in certain bus phases, it
> won't be able to recover from everything.

I thought a bus reset was supposed to deal with bus phase issues...?  But
I'm admittedly an armchair SCSI quarterback.  I used to see Suns that had
a heterogeneous SCSI array of mildly incompatible SCSI devices routinely
go through the jam-reset-restart sequence.
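
The "run" script itself is not included in this thread.  For readers who want
a comparable read-only test, here is a rough equivalent of what is described
(read every drive in parallel and throw the data away).  The 1 MB block size,
the function names, and the use of threads instead of one dd process per drive
are assumptions, not details taken from the original script.

```python
from concurrent.futures import ThreadPoolExecutor


def drain(path, block_size=1 << 20):
    """Read path to end-of-file, discarding the data; return bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(block_size)
            if not chunk:
                return total
            total += len(chunk)


def drain_all(paths):
    """Read all paths concurrently.

    Returns a dict mapping each path to its byte count, or to the OSError
    raised for it (for example when a device is not configured).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        futures = {path: pool.submit(drain, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except OSError as exc:
                results[path] = exc
    return results
```

Called with a list of raw device nodes, drain_all() only issues reads, which
matches the read-only nature of the test described above.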

> > The "reboot" bit is also mildly interesting.  FreeBSD (cam?) seems to have
> > lots of problems halting or rebooting in the event that a device is
> > unavailable or a scbus is hung.  I'd guess that it is waiting to flush some
> > buffers or something, except that my tests only do reads - no writes.
> 
> Reboot problems with buffers not getting flushed are generally because
> of issues in the higher level code.  I don't know if they've been fixed
> in -current or not, and I really don't have a good handle on why it
> happens.  Perhaps someone else has an idea on that one..


... Joe

-------------------------------------------------------------------------------
Joe Greco - Systems Administrator			      jgreco@ns.sol.net
Solaria Public Access UNIX - Milwaukee, WI			   414/342-4847


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message
