Date:      Wed, 20 Aug 1997 09:08:10 +0930
From:      Greg Lehey <grog@lemis.com>
To:        "Justin T. Gibbs" <gibbs@plutotech.com>
Cc:        FreeBSD SCSI Mailing List <freebsd-scsi@freebsd.org>
Subject:   Re: Bus resets. Grrrr.
Message-ID:  <19970820090810.54774@lemis.com>
In-Reply-To: <199708191654.KAA24228@pluto.plutotech.com>; from Justin T. Gibbs on Tue, Aug 19, 1997 at 10:53:54AM -0600
References:  <19970819153023.02433@lemis.com> <199708191654.KAA24228@pluto.plutotech.com>

On Tue, Aug 19, 1997 at 10:53:54AM -0600, Justin T. Gibbs wrote:
>>> What version of the kernel are you using
>>
>> Recent versions of -current.  The ones I reported it against were some
>> time last week.  I've just rebuilt with a version supped this morning.
>
> And it is still reproducible?

I changed the configuration file and added (inter alia)
AHC_SCBPAGING_ENABLE.  The resultant kernel hung solid three times in
the course of a couple of hours, once with a disk activity light on
solid, and the other two without.  I removed AHC_SCBPAGING_ENABLE, and
last night the backup went through for the first time in a week.  It
ran fine until last Wednesday, however, so this could be a
coincidence.

>>> So, what does a "timeout while idle" tell us?  Well, it means that either
>>> the timeout that the type driver (in this case the "st" driver)
>>
>> In fact, this was the sd driver, specifically sd0.  It always seems to
>> be sd0, although I have 3 disks connected to the bus, which tends to
>> confirm the theory that there is something wrong with the physical
>> bus.  It could also, of course, indicate that the disk is dying.
>
> Hmmm. How many devices are active at the time that the timeout occurs?
> Since you are not using tagged queuing, you would need 5 devices active
> at a time to overflow the QOUTFIFO (the bug that I fixed recently) on an
> aic7860 based controller.

No, at the moment the chain only has four devices connected, but I
notice it finds two LUNs for the tape changer:

ahc0: <Adaptec 2940 SCSI host adapter> rev 0x03 int a irq 12 on pci0.18.0
ahc0: aic7870 Single Channel, SCSI Id=7, 16 SCBs
ahc0: waiting for scsi devices to settle
scbus0 at ahc0 bus 0
scbus0 target 0 lun 0: <MICROP 2112-15MQ1094802 HQ48> type 0 fixed SCSI 2
sd0 at scbus0 target 0 lun 0
sd0: Direct-Access 1001MB (2051615 512 byte sectors)
sd0: with 1760 cyls, 15 heads, and an average 77 sectors/track
scbus0 target 3 lun 0: <IBM DORS-32160 WA0A> type 0 fixed SCSI 2
sd1 at scbus0 target 3 lun 0
sd1: Direct-Access 2063MB (4226725 512 byte sectors)
sd1: with 6703 cyls, 5 heads, and an average 126 sectors/track
scbus0 target 4 lun 0: <ARCHIVE Python 28849-XXX 4.CM> type 1 removable SCSI 2
st0 at scbus0 target 4 lun 0
st0: Sequential-Access density code 0x24, 512-byte blocks, write-enabled
scbus0 target 4 lun 1: <ARCHIVE Python 28849-XXX 4.CM> type 8 removable SCSI 2
uk0 at scbus0 target 4 lun 1
uk0: Unknown 
scbus0 target 5 lun 0: <TANDBERG  TDC 3800 -03:> type 1 removable SCSI 1
st1 at scbus0 target 5 lun 0
st1: Sequential-Access density code 0x0,  drive empty

You can be pretty sure that the Tandberg tape is not active, though--I
don't use it very often.
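
The target and LUN count can be checked mechanically from the probe messages above. This is only a rough Python sketch: the regular expression and the `tally` helper are made up for illustration and are not part of any FreeBSD tool. Whether the changer's second LUN counts as an "active device" for the QOUTFIFO question is a separate matter that this does not answer.

```python
import re

# Probe lines copied from the boot messages above.
PROBE_LINES = """\
scbus0 target 0 lun 0: <MICROP 2112-15MQ1094802 HQ48> type 0 fixed SCSI 2
scbus0 target 3 lun 0: <IBM DORS-32160 WA0A> type 0 fixed SCSI 2
scbus0 target 4 lun 0: <ARCHIVE Python 28849-XXX 4.CM> type 1 removable SCSI 2
scbus0 target 4 lun 1: <ARCHIVE Python 28849-XXX 4.CM> type 8 removable SCSI 2
scbus0 target 5 lun 0: <TANDBERG  TDC 3800 -03:> type 1 removable SCSI 1
"""

# Hypothetical pattern for the "scbusN target T lun L: <...>" probe lines.
PROBE_RE = re.compile(r"^scbus(\d+) target (\d+) lun (\d+): <([^>]*)>")


def tally(text):
    """Return (set of target IDs, list of (target, lun) pairs) found in text."""
    pairs = []
    for line in text.splitlines():
        m = PROBE_RE.match(line)
        if m:
            pairs.append((int(m.group(2)), int(m.group(3))))
    targets = {t for t, _ in pairs}
    return targets, pairs


targets, pairs = tally(PROBE_LINES)
print(len(targets), "targets,", len(pairs), "LUNs")  # 4 targets, 5 LUNs
```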

>>> specified was too short, or the aic7xxx driver lost the command
>>> somewhere either en route to or from the device.  The latter problem
>>> did occur under heavy load prior to my latest "spin lock" change to
>>> the driver.
>>
>> When was that?  Would it also have the effect that the abort message
>> wouldn't be taken?
>
> The abort probably was taken, but the tape drive took a long time to
> release the bus, which was why the bus was reset.  I put my fix in
> the kernel on 8/13 in rev 1.121 of aic7xxx.c.

That timing fits this fix.  The previous kernel was compiled from
sources current on the 9th.

>>> The first problem seems really common in the st driver especially
>>> when older media or a rewind operation is involved.  You can try
>>> bumping up the timeouts in sys/scsi/st.c to see if this solves your
>>> problem.
>>
>> As I said, this wasn't a tape device timeout.  In any case, this
>> always seems to happen when the tape is writing, which makes it look
>> more like the heavy load scenario.
>
> Could it be that you don't have disconnections enabled for your tape drive?
> You should check both SCSI-Select for the 2940 and any relevant jumpers
> on the tape drive itself.  If disconnections are disabled, a tape write that
> required multiple retries could easily tie up the SCSI bus for the 10s
> needed to make a disk command time out.

You'd see that on the activity light, right?  In any case, the host
adapter is set correctly, and the tape doesn't seem to have any such
config switch.  Would there be another way to test that?

>> In this connection, it's interesting to report how I tried to recover
>> from the problem.  I'm writing several files to a non-rewinding
>> device, and lately they've been dying in the same file.  I check the
>> return status from tar, and if it's non-0, do a bsf 1, an fsf 1, and
>> restart the tar.  The first bsf 1 always fails, apparently because the
>> drive doesn't know where it is.  The second bsf 1 succeeds.
>
> The first one probably fails because the device isn't ready.  

That's what I thought, too, so I put a sleep 30 into the script.  The
first bsf 1 still fails, and the second still succeeds.
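
For reference, the recovery logic described above can be sketched as follows. This is a minimal illustration in Python, not the actual script (which is not shown in this message). The device node `/dev/nrst0`, the restart limit, and all function names are assumptions.

```python
import subprocess
import time

TAPE = "/dev/nrst0"  # assumption: non-rewinding device node for st0


def run(cmd):
    """Run a command (argument list) and return its exit status."""
    return subprocess.call(cmd)


def write_file(tar_cmd, runner=run, sleep=time.sleep,
               max_restarts=3, settle=30):
    """Run tar; on a non-zero status, reposition the tape and restart.

    Returns True once tar exits 0, False if max_restarts is exhausted.
    """
    for attempt in range(max_restarts + 1):
        if runner(tar_cmd) == 0:
            return True
        if attempt == max_restarts:
            break
        sleep(settle)  # give the drive time to become ready
        # The first bsf has been failing, so try it at most twice.
        for _ in range(2):
            if runner(["mt", "-f", TAPE, "bsf", "1"]) == 0:
                break
        # Skip forward over the filemark to the start of the failed file.
        runner(["mt", "-f", TAPE, "fsf", "1"])
    return False
```

A call such as `write_file(["tar", "cf", TAPE, "/home"])` would then write one archive; the path `/home` is only an example. The `runner` and `sleep` parameters exist so the logic can be exercised without a tape drive.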

> What error is reported on the console?

I can't remember seeing one.  I can't reproduce this at will, but I've
looked through /var/adm/messages, and I don't see anything.
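
Rather than looking by eye, the log can be filtered for SCSI-related lines. The sketch below is a guess at a useful filter: the keywords (timeout, reset) and the device-name pattern are assumptions, not a list of the messages the driver actually prints.

```python
import re

# Assumed pattern: a line naming an ahc/sd/st unit followed by a
# timeout or reset keyword. Adjust to match the real messages.
PATTERN = re.compile(
    r"\b(ahc\d+|sd\d+|st\d+)\b.*(timed ?out|timeout|reset)",
    re.IGNORECASE,
)


def scsi_events(lines):
    """Return the lines that mention a SCSI unit and a timeout or reset."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def scan(path="/var/adm/messages"):
    """Read the log at path and return the matching lines."""
    with open(path, errors="replace") as f:
        return scsi_events(f)
```

Calling `scan()` returns the matching lines from the log; an empty list would agree with seeing nothing by hand.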

Greg



