Date:      Sun, 28 Nov 1999 16:11:40 -0600 (CST)
From:      Joe Greco <jgreco@ns.sol.net>
To:        ken@kdm.org (Kenneth D. Merry)
Cc:        dgilbert@velocet.ca, stable@freebsd.org
Subject:   Re: ahc problems (with vinum?)
Message-ID:  <199911282211.QAA79181@aurora.sol.net>
In-Reply-To: <199911282154.OAA22358@panzer.kdm.org> from "Kenneth D. Merry" at "Nov 28, 1999  2:54:44 pm"

> > Noted.  One is an onboard controller, part of the ASUS P2B-DS.  This
> > particular system was supposed to have a 3940, but I didn't have one
> > so I crammed in two 2940-type controllers.  Would this also be an issue
> > for a system with the onboard controller and a 3940-type controller?
> 
> It will be an issue for any system with a 7890/1 in it.  I'm not sure if
> the same bug affects the 7896/7, so I can't say whether using a 3950 would
> fix the problem.
> 
> > > That isn't where your problems are showing up, however.  (Likely you
> > > haven't loaded your system enough to trigger the 7890 problem.)
> > 
> > Maybe/maybe not.  What might I expect to see from such a problem?
> 
> Well, I know you would probably get some data corruption.  I can't remember
> which list the thread was on, but you can search for "data corruption" and
> "aic7890" in the -current and -hackers list archives and see what turns up.

Ok.

> > I have certainly beat the $#!+ out of these systems in a variety of ways,
> > and have run into some odd things.  Most were traceable to SCSI issues.
> > Some didn't get classified.  I'm running vinum in a ten-filesystem config
> > on top of the 18 18GB drives, and I copy in data from another machine.  I
> > then have an application which mmap()'s the files, doing search and replace
> > ops on the data.  Running this app in parallel causes the system to hang
> > (eventually causing the watchdog to expire and reset the system).  Running
> > it serially on one fs at a time doesn't.  This is probably the most
> > worrisome of the issues I've seen.  If you have a recommended revision of
> > the ahc driver you'd like me to try, let me know.
> 
> Yes, you should run a version of the driver that has Justin's fix from
> September 20th.  Unfortunately, he didn't find the problem before 3.3 came
> out.

Ok.
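
For anyone trying to reproduce the load, the workload quoted above (mmap() the files, then search and replace in place) could be approximated with something like the following. This is only a sketch under assumptions: the real application is not shown in this thread, `replace_in_place` is a hypothetical name, and the sketch assumes same-length replacements so the file size never changes.

```python
import mmap


def replace_in_place(path, old, new):
    """Overwrite every occurrence of `old` with `new` through a shared mmap.

    Hypothetical stand-in for the application described in the thread.
    `old` and `new` must be bytes of equal, non-zero length.
    Returns the number of replacements made.
    """
    if not old or len(old) != len(new):
        raise ValueError("old and new must be non-empty and the same length")

    count = 0
    with open(path, "r+b") as f:
        # mmap cannot map a zero-length file, so bail out early.
        if f.seek(0, 2) == 0:
            return 0
        with mmap.mmap(f.fileno(), 0) as mm:
            pos = mm.find(old)
            while pos != -1:
                mm[pos:pos + len(new)] = new      # write through the mapping
                count += 1
                pos = mm.find(old, pos + len(new))
            mm.flush()                            # push dirty pages to disk
    return count
```

Running one such process per filesystem at the same time would correspond to the parallel case; running them one after another would correspond to the serial case.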

> > > > (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
> > > > (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> > > > ahc2: Issued Channel A Bus Reset. 3 SCBs aborted
> > > > (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
> > > > (da10:ahc2:0:0:0): BDR message in message buffer
> > > > (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x10f
> > > > (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> > > > ahc2: Issued Channel A Bus Reset. 6 SCBs aborted
> > > > 4357+1 records in
> > > > 4357+1 records out
> > > > 4569600000 bytes transferred in 428.640450 secs (10660683 bytes/sec)
> > > 
> > > [ ... ]
> > > 
> > > "Timed out in {datain|dataout} phase" means that a transaction took longer
> > > than 60 seconds to complete, and the bus was stuck in datain/dataout phase
> > > at the time.
> > > 
> > > This is almost always the result of a cabling or termination problem.
> > > 
> > > So you'll probably want to replace the cable on your Ultra-Wide chain, and
> > > verify that the termination is correct.
> > 
> > It's more complex than that. :-)  These machines are intended for deployment
> > in remote areas, and realistically I may never see many of them ever again
> > after that point.  They are rackmount in Antec PC cases and Kingston 9-bay
> > drive arrays; the drives themselves are mounted in Antec 690 drive modules.
> > This allows for easy replacement/upgrade in the event of problems, and with
> > the exception of this one problem-child machine, has worked out fantastically
> > so far.  But it introduces multi-multi variables into the equation.  The
> > 3940-to-PC backplate cable, the external cables, the terminators, the
> > internal 9-position Kingston ribbon cable, any of the 9 receiving brackets,
> > any of the 9 drive modules, and any of the 9 drives can potentially be an
> > issue.  The Antec drive modules seem to be the typical source of flakiness;
> > about 1 in 20 seem to give problems.
> > 
> > Okay, now, stop rolling your eyes.  I know it is ugly from a SCSI
> > perspective, but it is very functional and very useful, not to mention very
> > nice and damn fast.  It's hard to build something like that which can also
> > be deployed in a remote location where you'll have to explain to someone who
> > has 1/2 a clue what you want replaced, and how.  I prefer the
> > no-screwdriver-required method.
> 
> Oh, I can certainly appreciate the idiot-proof approach.  In your
> situation, it makes a lot of sense.  However it'll make it a little more
> difficult to track down the problem.

Already tracked down and fixed as of last week; it just took some time, since
the problem only manifested itself after really hammering on the thing for a
while.  Sorry I didn't make that clear.  :-)  It makes for a really sucky
debug cycle... try "x", hammer on system for hours, watch for errors.  You
know.  Bleah.
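
The "watch for errors" part of that cycle can be made less tedious by tallying the ahc messages out of the kernel log. A minimal sketch follows, assuming the log lines look like the ones quoted earlier in this message; the regular expressions and the `summarize` helper are illustrative, not part of any FreeBSD tool, and other driver revisions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modeled on the messages quoted in this thread (an assumption).
TIMEOUT_RE = re.compile(
    r"\((?P<dev>\w+):(?P<ctrl>\w+):\d+:\d+:\d+\): "
    r"SCB 0x[0-9a-f]+ - timed out in (?P<phase>\w+) phase"
)
RESET_RE = re.compile(
    r"(?P<ctrl>\w+): Issued Channel (?P<chan>[A-Z]) Bus Reset\. "
    r"(?P<n>\d+) SCBs aborted"
)


def summarize(lines):
    """Count timeouts per (device, phase) and resets/aborted SCBs per controller."""
    timeouts = Counter()
    resets = Counter()
    aborted = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[(m["dev"], m["phase"])] += 1
            continue
        m = RESET_RE.search(line)
        if m:
            resets[m["ctrl"]] += 1
            aborted[m["ctrl"]] += int(m["n"])
    return timeouts, resets, aborted
```

Feeding it the output of dmesg after each test run gives a quick per-device count, which makes it easier to see whether swapping a cable, terminator, or drive module actually changed anything.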

... Joe

-------------------------------------------------------------------------------
Joe Greco - Systems Administrator			      jgreco@ns.sol.net
Solaria Public Access UNIX - Milwaukee, WI			   414/342-4847

