Date:      Sat, 15 Nov 1997 22:10:52 -0700
From:      "Justin T. Gibbs" <gibbs@plutotech.com>
To:        harold barker Hbarker <hbarker@rhiannon.sm.dsms.com>
Cc:        hackers@FreeBSD.org, scsi@FreeBSD.org, aic7xxx@FreeBSD.org
Subject:   Re: AHC / SCSI UPDATE
Message-ID:  <199711160511.WAA24691@pluto.plutotech.com>

Sorry for not responding sooner, but I don't read this list regularly
anymore...

>If the person responsible for the code in question will email me, I will
>ship/open for login a machine that exhibits the problem.

That would be me, but I do believe that I have a system here that
exhibits the same problem you are having.  When I have a fix for
this machine, I might take you up on your offer if the fix doesn't
seem to work with your equipment.

Here's a little info about what we (Ken Merry and myself) have determined
about the problem so far.

System:
       P6-233 256k cache
       2940UW (SCSI ID 7)
       1 X PLEXTOR CD-ROM PX-4XCS 1.04 (SCSI ID 4)
       2 X QUANTUM XP34550W LXY4 (SCSI IDs 0 and 1)

How to repeat:
       Run concurrent I/O to all 3 devices.

Symptom:
       After a varying period of time, disk 0 or 1 stops performing
       reselections for its outstanding I/O.  This eventually results
       in a timeout, usually with the controller in an "idle" state.

Using a SCSI bus analyzer, we've looked at the transactions on the bus
that lead up to this state.  No protocol errors were discovered.  What
we did find, however, was a disturbing pattern of disconnections and
reconnections from the CDROM drive.  The Plextor seems to perform
disconnections "often enough" to allow other targets to perform a
reselection, but unfortunately seems to partake in the next arbitration
phase if it has a task to continue.  Since the arbitration algorithm
breaks "ties" based on the SCSI ID (from highest to lowest priority
7->0, 15->8), and the CD drive at ID 4 outranks both disks, this
effectively gives the CD drive the bus for as long as it wants it.
Since the CD drive only handles a single task at a time, one would
think that there would be plenty of time that the CD was idle and not
wanting the bus.  Unfortunately, it seems that the SCSI system/aic7xxx
driver is fast enough to process a command completion for the CD drive,
set up a new command to send, and participate in the next arbitration
phase.  As the controller has the highest priority ID on the bus, this
again "starves" the drives and opens the possibility for the CD drive
to start requesting the bus.
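
To make the tie-breaking rule concrete, here is a rough Python sketch
of the priority ordering described above (7->0, then 15->8).  The
function names and the "always contending" model are my own
simplification, not anything taken from the driver or from the bus
traces; a real bus has timing, disconnect and selection behavior that
this ignores.

```python
def arbitration_priority(scsi_id: int) -> int:
    """Rank a SCSI ID on a wide bus; a larger rank wins arbitration."""
    if not 0 <= scsi_id <= 15:
        raise ValueError("wide SCSI IDs run from 0 to 15")
    # IDs 7..0 outrank IDs 15..8; within each group the higher ID wins.
    return scsi_id + 8 if scsi_id <= 7 else scsi_id - 8


def arbitration_winner(contenders):
    """Return the ID that wins when all `contenders` arbitrate together."""
    return max(contenders, key=arbitration_priority)


def count_wins(rounds, always_contending, waiting):
    """Count wins per ID, assuming (hypothetically) that every ID in
    `always_contending` has new work ready for every arbitration phase
    and the IDs in `waiting` keep retrying each round."""
    ids = list(always_contending) + list(waiting)
    wins = {scsi_id: 0 for scsi_id in ids}
    for _ in range(rounds):
        wins[arbitration_winner(ids)] += 1
    return wins
```

With the IDs from the system above, count_wins(100, [4], [0, 1]) hands
every round to ID 4, and adding the controller at ID 7 to the always
contending set hands every round to ID 7.  In both cases the disks at
IDs 0 and 1 never win, which is the starvation pattern in question.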

In the end, what I believe is happening is that the drive exhausts its
"reconnect attempt" count, and decides not to attempt to contact the
initiator again.  In the case of an Atlas II, if the initiator selects
the drive (say to send an abort or abort tag message), the drive starts
making reconnection attempts again and the wedge is cleared.  Other drives
may not behave as nicely.
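
The behavior I suspect can be modeled as a small state machine.  This
is only a sketch of the hypothesis: the class, its method names and
the retry budget are all invented for illustration, and real drive
firmware may count or recover differently.

```python
class ReselectingDrive:
    """Toy model of a drive with a limited reconnect attempt budget."""

    def __init__(self, retry_budget: int):
        # `retry_budget` is a hypothetical number; actual drives do not
        # necessarily expose or document such a value.
        self.retry_budget = retry_budget
        self.attempts_left = retry_budget

    @property
    def wedged(self) -> bool:
        """True once the drive has given up trying to reselect."""
        return self.attempts_left == 0

    def lose_arbitration(self) -> None:
        """Record one failed reconnect attempt."""
        if self.attempts_left > 0:
            self.attempts_left -= 1

    def selected_by_initiator(self) -> None:
        """Model the Atlas II behavior: a selection (for example to
        deliver an abort message) restarts reconnection attempts."""
        self.attempts_left = self.retry_budget
```

A drive built with ReselectingDrive(3) reports wedged after three
calls to lose_arbitration(), and a call to selected_by_initiator()
clears the wedge again.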

So, what can be done about this?  I'm currently looking through the SCSI
II and III specs to determine what the standard has to say about reconnect
attempt failures and how to properly deal with them.  It may be that
the SCSI layer/Adaptec driver can take actions that will work on most
devices.

For a more immediate fix, I suggest experimenting with:

	1) Swapping the IDs on your devices so that hard drives have higher
	   arbitration priority on the bus.  The Adaptec BIOS will still
	   find your disks in the proper order for you to boot even if
	   you give your CDROM or tape drives lower IDs than the hard
	   disks.

	2) Playing with the settings in the Disconnect-Reconnect
	   Mode Page (page 0x02).  Try setting the "Disconnect Time Limit"
	   variable to something other than 0.  This is the minimum time,
	   in 100 microsecond increments, the device waits after
	   disconnecting before attempting reselection.

For many of you, I would expect solution 1 to work just fine.  For those
of you with lots of disks on a single chain (even if you don't have a
tape or CDROM drive), you will probably have to try solution 2.
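
For solution 2, the following Python sketch shows where the field
lives in the page data and how a time converts to field units.  It
assumes the SCSI-2 layout of the Disconnect-Reconnect page (16 bytes,
page length 0x0E, "Disconnect Time Limit" as a big-endian 16-bit value
at bytes 6-7, counted in 100 microsecond increments).  The function
names are hypothetical, and the sketch only edits a byte buffer; how
you fetch the page with MODE SENSE and write it back with MODE SELECT
depends on your tools and is not shown.

```python
import struct

PAGE_CODE = 0x02                  # Disconnect-Reconnect mode page
PAGE_LENGTH = 0x0E                # bytes following the 2-byte page header
DISCONNECT_TIME_LIMIT_OFFSET = 6  # bytes 6-7 of the page, big-endian


def ms_to_limit_units(milliseconds: float) -> int:
    """Convert milliseconds to 100 microsecond field units."""
    units = round(milliseconds * 10)
    if not 0 <= units <= 0xFFFF:
        raise ValueError("value does not fit in the 16-bit field")
    return units


def set_disconnect_time_limit(page: bytes, milliseconds: float) -> bytes:
    """Return a copy of `page` with the Disconnect Time Limit replaced."""
    if len(page) != PAGE_LENGTH + 2:
        raise ValueError("expected a 16-byte Disconnect-Reconnect page")
    # The low 6 bits of byte 0 hold the page code; the top bit is PS.
    if (page[0] & 0x3F) != PAGE_CODE or page[1] != PAGE_LENGTH:
        raise ValueError("not a Disconnect-Reconnect mode page")
    patched = bytearray(page)
    struct.pack_into(">H", patched, DISCONNECT_TIME_LIMIT_OFFSET,
                     ms_to_limit_units(milliseconds))
    return bytes(patched)
```

Under these assumptions a 1.5 millisecond limit is stored as the value
15 (bytes 0x00 0x0F).  Which values actually help is something to find
by experiment on your own chain.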

Remember that it's not really the type of device that matters, but the
possibility of starvation.  If you have lots of concurrent I/O going on
to multiple disks on a single chain, you can still experience this problem
(Hi Satoshi!).

More information when it becomes available.

--
Justin
