Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 18 Aug 1997 21:59:02 -0600
From:      "Justin T. Gibbs" <gibbs@plutotech.com>
To:        Greg Lehey <grog@lemis.com>
Cc:        freebsd-scsi@FreeBSD.ORG (FreeBSD SCSI Mailing List)
Subject:   Re: Bus resets. Grrrr. 
Message-ID:  <199708190359.VAA10011@pluto.plutotech.com>
In-Reply-To: Your message of "Sun, 17 Aug 1997 10:59:43 %2B0930." <199708170129.KAA03776@freebie.lemis.com> 

next in thread | previous in thread | raw e-mail | index | archive | help
>This is the third time in a row that I haven't been able to complete a backup
>because of "recoverable" SCSI errors.  Here's a pretty typical scenario:

What version of the kernel are you using with what AHC options enabled
in your kernel config file?

>Aug 17 10:27:19 freebie /kernel: sd0: SCB 0x4 - timed out while idle, LASTPHAS
>E == 0x1, SCSISIGI == 0x0
>
>What does this mean?  What can time out when nothing's happening?  Or is
>this a timeout accepting a new command when it shouldn't have to?  Is this
>a device or a driver logic error?

The message simply indicates the state of the SCSI bus at the time the
timeout occured.  In this case, the sequencer was sitting in it's idle
loop waiting for new work to be queued by the kernel driver or for a
device to reconnect and continue a previously established connection.
To understand why a timeout can occur in this state or any other, it's
probably best to explain the timeout model that is used in FreeBSD.
The "type" drivers (e.g. sd, cd, st, worm, etc) specifies a timeout value
for each command it issues.  When the adaptec driver queues a command
to the controller, it sets up a "callout" that will occur timeout ms
in the future for the just queued command.  In this particular case,
that "callout" fired while the sequencer wasn't working on any commands.

So, what does a "timeout while idle" tell us?  Well, it means that either
the timeout that the type driver (in this case the "st" driver) specified
was too short, or the aic7xxx driver lost the command somewhere either
in route to or from the device.  The latter problem did occur under
heavy load prior to my latest "spin lock" change to the driver.  The
first problem seems really common in the st driver especially when older
media or a rewind operation is involved.  You can try bumping up the
timeouts in sys/scsi/st.c to see if this solves your problem.

>
>Aug 17 10:27:31 freebie /kernel: SEQADDR = 0x9 SCSISEQ = 0x12 SSTAT0 = 0x5 SST
>AT1 = 0xa
>Aug 17 10:27:31 freebie /kernel: sd0: Queueing an Abort SCB
>Aug 17 10:27:31 freebie /kernel: sd0: Abort Message Sent
>Aug 17 10:27:31 freebie /kernel: sd0: SCB 0x4 - timed out in message out phase
>, SCSISIGI == 0xa4
>Aug 17 10:27:31 freebie /kernel: SEQADDR = 0x9a SCSISEQ = 0x12 SSTAT0 = 0x5 SS
>TAT1 = 0x2
>
>If I understand this correctly, this means that the abort SCB wasn't received
>either, so the driver does a bus reset:

What it means is that the tape drive accepted the connection from
the controller, most likely accepted the ABORT message, but took
longer than the driver allowed for it to process the abort request,
free the bus, and thus signal that the abort was successful.  So,
we take out the hammer and reset the bus.  The timeout in the
aic7xxx driver for abort requests may be too short.

>
>Aug 17 10:27:31 freebie /kernel: ahc0: Issued Channel A Bus Reset. 3 SCBs abor
>ted
>Aug 17 10:27:32 freebie /kernel: Clearing bus reset
>Aug 17 10:27:32 freebie /kernel: Clearing 'in-reset' flag
>Aug 17 10:27:32 freebie /kernel: sd0: no longer in timeout
>
>... which works.
>
>Aug 17 10:27:32 freebie /kernel: sd0: SCB 0x4 - timed out in command phase, SC
>SISIGI == 0x84
>
>So why do we get another timeout?  Or is this overlapping?

This is due to the way that timeout process occurs in the current driver.
If I have 20 operations outstanding, each with a timeout queued, those
timeouts are still in effect while I am doing recovery processing.  So,
even if I just unwedged the bus, a timeout for a pending command can
fire "logically too soon".

>Aug 17 10:27:32 freebie /kernel: SEQADDR = 0x42 SCSISEQ = 0x12 SSTAT0 = 0x7 SS
>TAT1 = 0x2
>Aug 17 10:27:32 freebie /kernel: sd0: abort message in message buffer
>Aug 17 10:27:32 freebie /kernel: sd1: SCB 0x3 timedout while recovery in progr
>ess
>Aug 17 10:27:32 freebie /kernel: sd0: SCB 1 - Abort Completed.
>Aug 17 10:27:32 freebie /kernel: sd0: no longer in timeout
>Aug 17 10:27:32 freebie /kernel: sd1: UNIT ATTENTION asc:29,0
>Aug 17 10:27:32 freebie /kernel: sd1:  Power on, reset, or bus device reset oc
>curred
>Aug 17 10:27:32 freebie /kernel: , retries:3
>
>So sd3 complains, but carries on with no harm done,

Yup.  The "sd" type driver seems to handle a bus reset gracefully.

>Aug 17 10:27:32 freebie /kernel: st0: UNIT ATTENTION asc:29,0
>Aug 17 10:27:32 freebie /kernel: st0:  Power on, reset, or bus
>                                       device reset occurred
>Aug 17 10:27:32 freebie /kernel: st0: Target Busy
>
>but the tape dies.  Is there a good reason for this?  I would have
>thought that it would make sense for a power on or reset, but not
>for a bus reset.  Does a tape unit lose its position or data when
>it receives a bus reset?

As you can tell by the sense code that is returned, your tape drive
draws no distinction between "Power on", "reset" (i.e. bus reset),
or "bus device reset" and is probably returning "Target Busy" because
it is going through self test.  Any information regarding tape position
is almost certainly lost as is probably the case for the compression/density
settings.  The "st" driver should be able to restore the drive to
the previous condition though since it knows all of the information
to do so.  This is a bug.

>Is anybody doing anything about this?

In a way I am, and in a way I'm not.  All of these problems are
well known to me and are being addressed in a complete rewrite of
the FreeBSD SCSI layer.  Ken Merry and myself are working diligently
on this task and I have chosen to put 99% of my effort into completing
this work instead of attempting to patch around the outstanding
problems in the current implementation.  The current status of the
rewrite includes support for the Adaptec aic7xxx cards, disks,
cdroms, application pass through (what scsi(8) would use to access
devices), and fledgling target mode support.  Once we have a tape
driver, target mode, and some remaining error recovery issues dealt
with, the code will be available for beta test while all remaining
controller drivers are ported to the new system and a few remaining
features are added.  If you are willing to do without some features
and device support and are interrested in giving early feedback on
the work, contact me in private mail.

>
>Greg
>

--
Justin T. Gibbs
===========================================
  FreeBSD: Turning PCs into workstations
===========================================



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199708190359.VAA10011>