Date:      Sun, 28 Nov 1999 14:54:44 -0700 (MST)
From:      "Kenneth D. Merry" <ken@kdm.org>
To:        jgreco@ns.sol.net (Joe Greco)
Cc:        dgilbert@velocet.ca, stable@freebsd.org
Subject:   Re: ahc problems (with vinum?)
Message-ID:  <199911282154.OAA22358@panzer.kdm.org>
In-Reply-To: <199911281945.NAA68811@aurora.sol.net> from Joe Greco at "Nov 28, 1999 01:45:12 pm"

Joe Greco wrote...
> > > Copyright (c) 1992-1999 FreeBSD Inc.
> > > Copyright (c) 1982, 1986, 1989, 1991, 1993
> > > 	The Regents of the University of California. All rights reserved.
> > > FreeBSD 3.3-RELEASE #0: Mon Nov 22 13:38:07 CST 1999
> > >     root@host:/usr/src/sys/compile/DEMO
> > 
> > The first problem is that you're running 3.3-R with two 7890s.  Justin
> > worked around a bug in the 7890 in the Adaptec driver shortly after 3.3
> > came out.  I'd recommend at the very least updating your Adaptec driver,
> > although depending on your circumstances, it might be easier to just update
> > to the latest -stable.
> 
> Noted.  One is an onboard controller, part of the ASUS P2B-DS.  This
> particular system was supposed to have a 3940, but I didn't have one
> so I crammed in two 2940-type controllers.  Would this also be an issue
> for a system with the onboard controller and a 3940-type controller?

It will be an issue for any system with a 7890/1 in it.  I'm not sure if
the same bug affects the 7896/7, so I can't say whether using a 3950 would
fix the problem.

> > That isn't where your problems are showing up, however.  (Likely you
> > haven't loaded your system enough to trigger the 7890 problem.)
> 
> Maybe/maybe not.  What might I expect to see from such a problem?

Well, I know you would probably get some data corruption.  I can't remember
which list the thread was on, but you can search for "data corruption" and
"aic7890" in the -current and -hackers list archives and see what turns up.

> I have certainly beat the $#!+ out of these systems in a variety of ways,
> and have run into some odd things.  Most were traceable to SCSI issues.
> Some didn't get classified.  I'm running vinum in a ten-filesystem config
> on top of the 18 18GB drives, and I copy in data from another machine.  I
> then have an application which mmap()'s the files, doing search and replace
> ops on the data.  Running this app in parallel causes the system to hang
> (eventually causing the watchdog to expire and reset the system).  Running
> it serially on one fs at a time doesn't.  This is probably the most
> worrisome of the issues I've seen.  If you have a recommended revision of
> the ahc driver you'd like me to try, let me know.

Yes, you should run a version of the driver that has Justin's fix from
September 20th.  Unfortunately, he didn't find the problem before 3.3 came
out.

> > > (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
> > > (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> > > ahc2: Issued Channel A Bus Reset. 3 SCBs aborted
> > > (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
> > > (da10:ahc2:0:0:0): BDR message in message buffer
> > > (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x10f
> > > (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> > > ahc2: Issued Channel A Bus Reset. 6 SCBs aborted
> > > 4357+1 records in
> > > 4357+1 records out
> > > 4569600000 bytes transferred in 428.640450 secs (10660683 bytes/sec)
> > 
> > [ ... ]
> > 
> > "Timed out in {datain|dataout} phase" means that a transaction took longer
> > than 60 seconds to complete, and the bus was stuck in datain/dataout phase
> > at the time.
> > 
> > This is almost always the result of a cabling or termination problem.
> > 
> > So you'll probably want to replace the cable on your Ultra-Wide chain, and
> > verify that the termination is correct.
> 
> It's more complex than that. :-)  These machines are intended for deployment
> in remote areas, and realistically I may never see many of them ever again
> after that point.  They are rackmount in Antec PC cases and Kingston 9-bay
> drive arrays, the drives themselves are mounted in Antec 690 drive modules.
> This allows for easy replacement/upgrade in the event of problems, and with
> the exception of this one problem-child machine, has worked out fantastic
> so far.  But it introduces multi-multi variables into the equation.  The
> 3940-to-PC backplate cable, the external cables, the terminators, the
> internal 9-position Kingston ribbon cable, any of the 9 receiving brackets,
> any of the 9 drive modules, and any of the 9 drives can potentially be an
> issue.  The Antec drive modules seem to be the typical source of flakiness,
> about 1:20 seem to give problems.
> 
> Okay, now, stop rolling your eyes.  I know it is ugly from a SCSI
> perspective, but it is very functional and very useful, not to mention very
> nice and damn fast.  It's hard to build something like that which can also
> be deployed in a remote location where you'll have to explain to someone who
> has 1/2 a clue what you want replaced, and how.  I prefer the
> no-screwdriver-required method.

Oh, I can certainly appreciate the idiot-proof approach.  In your
situation, it makes a lot of sense.  However, it'll make it a little more
difficult to track down the problem.

I'm pretty sure you've got a problem somewhere in your Ultra-Wide chain,
and the fact that you've had good success with the same configuration
before seems to point to that.

It could be bent connector pins or who knows what, but you'll have to track
it down one way or another to solve this problem.  (It is unlikely
that this is a software problem, since you're seeing this on a chain driven
by a 7880, not a 7890.)
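
One way to narrow down which part of the chain is at fault is to tally the
timeout messages per controller, device, and bus phase.  The following is a
minimal sketch, not part of the ahc driver or any FreeBSD tool.  It assumes
the kernel messages look like the ones quoted in this thread; the function
name tally_timeouts and the SAMPLE data are purely illustrative.

```python
import re
from collections import Counter

# Matches kernel messages of the form quoted in this thread, e.g.
# (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
# The pattern is an assumption based only on those quoted lines.
TIMEOUT_RE = re.compile(
    r"\((?P<dev>[a-z]+\d+):(?P<ctrl>ahc\d+):\d+:\d+:\d+\): "
    r"SCB 0x[0-9a-f]+ - timed out in (?P<phase>\w+) phase"
)


def tally_timeouts(lines):
    """Count timeout messages per (controller, device, phase)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            key = (match.group("ctrl"), match.group("dev"), match.group("phase"))
            counts[key] += 1
    return counts


# Sample input copied from the messages quoted earlier in this thread.
SAMPLE = """\
(da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
(da10:ahc2:0:0:0): no longer in timeout, status = 34b
ahc2: Issued Channel A Bus Reset. 3 SCBs aborted
(da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
(da10:ahc2:0:0:0): BDR message in message buffer
(da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x10f
(da10:ahc2:0:0:0): no longer in timeout, status = 34b
ahc2: Issued Channel A Bus Reset. 6 SCBs aborted
""".splitlines()

if __name__ == "__main__":
    # Most frequently affected (controller, device, phase) first.
    for (ctrl, dev, phase), n in tally_timeouts(SAMPLE).most_common():
        print(f"{ctrl} {dev} {phase}: {n}")
```

If the counts cluster on a single device, that drive or its module is the
first thing to swap.  If they are spread across every device on one
controller, the shared cable or termination is the more likely suspect.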

> > > run is a little script that sucks data in from all SCSI drives with dd and
> > > dumps it to /dev/null, in parallel.
> > > 
> > > Now, when the bus reset happens, often the drive listed will actually
> > > recover and continue going, but if so, the others will typically stop (but
> > > dd is just waiting for data).  This isn't written in stone, I've seen all
> > > drives drop off, and I've also seen the whole thing recover just fine.
> > > I have no idea what the result was for the incident listed above.  It was
> > > one of dozens of incidents.
> > 
> > Well, the SCSI layer does its best to recover, but naturally if you've got
> > cabling problems that cause you to get stuck in certain bus phases, it
> > won't be able to recover from everything.
> 
> I thought a bus reset was supposed to deal with bus phase issues...?  But
> I'm admittedly an armchair SCSI quarterback.  I used to see Suns that had
> a heterogeneous SCSI array of mildly incompatible SCSI devices routinely
> go through the jam-reset-restart sequence.

It does, generally, but if you've got flaky cabling, it's hard to guarantee
that the bus reset will fix all of your problems.
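
For re-testing after each cable or module swap, the parallel read test
described earlier in the thread (dd from every drive to /dev/null at once)
can be approximated as below.  This is only a sketch under assumptions: the
original script was not posted, the function names drain and drain_all are
hypothetical, and the device paths must be supplied by whoever runs it.

```python
import sys
import threading


def drain(path, block_size=1 << 20):
    """Read path sequentially to EOF, discarding the data; return bytes read."""
    total = 0
    # buffering=0 gives unbuffered reads, roughly like dd with a fixed bs.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def drain_all(paths, block_size=1 << 20):
    """Read all paths in parallel, one thread each.

    Returns a dict mapping each path to the number of bytes read, or to the
    OSError raised while reading it.
    """
    results = {}

    def worker(path):
        try:
            results[path] = drain(path, block_size)
        except OSError as exc:
            results[path] = exc

    threads = [threading.Thread(target=worker, args=(p,)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks for as long as any reader is stuck waiting for data
    return results


if __name__ == "__main__":
    # Device paths come from the command line; none are assumed here.
    for path, outcome in drain_all(sys.argv[1:]).items():
        print(f"{path}: {outcome}")
```

As with the dd version, a reader that stalls after a bus reset simply never
returns, so a run that does not finish is itself the symptom to look for.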

Ken
-- 
Kenneth Merry
ken@kdm.org

