Message-Id: <199602051212.WAA04227@orion.devetir.qld.gov.au>
To: freebsd-scsi@freebsd.org
cc: syssgm@devetir.qld.gov.au
Subject: aha1542 MBO problem in 2.0.5
Date: Mon, 05 Feb 1996 22:12:41 +1000
From: Stephen McKay <syssgm@devetir.qld.gov.au>
Sender: owner-freebsd-scsi@freebsd.org
Precedence: bulk

Since I revamped my machine (16->24MB RAM, DX33->DX4/100, +CD-ROM), I have had
various SCSI problems.  I still run 2.0.5 (because I'm using the PC too much
to upgrade yet), and use a BT545S SCSI card to run the disk, tape and CD-ROM.
With the bt driver I either can't access my Archive 2525 at all, or I get
crashes and reboots (a bounce buffer problem, maybe).  So I use the aha driver
instead.

Unfortunately, I get lots of messages when accessing the disk and the CDROM
simultaneously, like:

    aha0: MBO 01 and not 00 (free)
    sd0(aha0:0:0): timed out

The timeouts are not necessarily paired with complaints about MBO.

Now, I have a wild theory about the MBO not free problem. :-)

Outgoing mailboxes are paired up with ccbs pretty early on in the aha driver.
Thereafter, ccbs are allocated, and mailboxes just come with them.  The most
recently freed ccb is the next to be allocated, so when the system is busy,
it is highly likely that a ccb will be reused immediately.  This implies that
its outgoing mailbox will be reused just as quickly.  The manual with my BT545S
proudly proclaims its multitasking nature, so perhaps when it gets really
busy it postpones marking the mailbox as read (that is, free).  This seems
plausible because mailboxes are supposed to be used in a round robin manner,
so there are bound to be a few still free.
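
To make the reuse pattern concrete, here is a minimal sketch of a free list
where the most recently freed ccb is handed out next.  It is a toy model and
not the driver code: the names (lifo_ccb, lifo_pool, lifo_get, lifo_put), the
pool size and the integer mailbox index are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NCCB 4  /* hypothetical pool size, not the driver's AHA_MBX_SIZE */

struct lifo_ccb {
    struct lifo_ccb *next;  /* link in the free list */
    int mbx;                /* index of the mailbox bound to this ccb */
};

struct lifo_pool {
    struct lifo_ccb ccbs[NCCB];
    struct lifo_ccb *free_head;  /* most recently freed ccb sits here */
};

/* Push onto the head, so this ccb is the next one handed out. */
void lifo_put(struct lifo_pool *p, struct lifo_ccb *c)
{
    assert(c != NULL);
    c->next = p->free_head;
    p->free_head = c;
}

/* Pop from the head; returns NULL when the pool is exhausted. */
struct lifo_ccb *lifo_get(struct lifo_pool *p)
{
    struct lifo_ccb *c = p->free_head;
    if (c != NULL)
        p->free_head = c->next;
    return c;
}

/* Bind ccb i to mailbox i once, then put every ccb on the free list. */
void lifo_init(struct lifo_pool *p)
{
    p->free_head = NULL;
    for (int i = 0; i < NCCB; i++) {
        p->ccbs[i].mbx = i;
        lifo_put(p, &p->ccbs[i]);
    }
}

/* Run get/put cycles one at a time and count the distinct mailboxes used. */
int lifo_distinct_mailboxes(int cycles)
{
    struct lifo_pool p;
    int used[NCCB] = {0};
    int distinct = 0;

    lifo_init(&p);
    for (int i = 0; i < cycles; i++) {
        struct lifo_ccb *c = lifo_get(&p);
        assert(c != NULL);
        if (!used[c->mbx]) {
            used[c->mbx] = 1;
            distinct++;
        }
        lifo_put(&p, c);
    }
    return distinct;
}
```

In this model, a run of back-to-back commands keeps hitting one mailbox no
matter how many cycles are run, which is the immediate reuse described above.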

So, the scenario I am postulating is:

	host - allocate and set up ccb
	host - mark mailbox as active
	bt545 - read mailbox
	bt545 - read ccb, do the work, mark ccb done
	bt545 - interrupt host
	bt545 - (become really busy and defer updating mailbox)
	host - reallocate same ccb
	host - complain about mailbox still marked busy
	host - set up ccb
	host - mark mailbox as active
	bt545 - (finish being busy)
	bt545 - mark mailbox as free
	host - timeout (because bt545 ignored the mailbox)
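
The sequence above can be played out in a tiny model.  This is only a sketch
of the theory, not of the real BT545 firmware, whose behaviour I am guessing
at: the struct, the function names and the defer_clear switch are all
hypothetical.  The 00 and 01 values are taken from the "MBO 01 and not 00
(free)" message.

```c
#include <assert.h>
#include <stdbool.h>

enum mbo_state { MBO_FREE = 0x00, MBO_START = 0x01 };

struct mbo_model {
    enum mbo_state mbo;   /* the one outgoing mailbox under study */
    bool clear_pending;   /* adapter read the entry but has not freed it yet */
    int posted;           /* commands the host believes it queued */
    int seen;             /* commands the adapter actually picked up */
    int complaints;       /* "MBO 01 and not 00 (free)" messages */
};

/* Host side: complain if the mailbox looks busy, then use it anyway. */
void host_post(struct mbo_model *m)
{
    if (m->mbo != MBO_FREE)
        m->complaints++;
    m->mbo = MBO_START;
    m->posted++;
}

/* Adapter side: pick up a command, optionally deferring the "free" write. */
void adapter_read(struct mbo_model *m, bool defer_clear)
{
    if (m->mbo != MBO_START || m->clear_pending)
        return;
    m->seen++;
    if (defer_clear)
        m->clear_pending = true;
    else
        m->mbo = MBO_FREE;
}

/* Adapter side: the late "free" write, which wipes out any newer entry. */
void adapter_finish_clear(struct mbo_model *m)
{
    if (m->clear_pending) {
        m->mbo = MBO_FREE;
        m->clear_pending = false;
    }
}

/* Play the scenario once and return how many commands were lost. */
int run_scenario(bool defer_clear)
{
    struct mbo_model m = { MBO_FREE, false, 0, 0, 0 };

    host_post(&m);                  /* first command */
    adapter_read(&m, defer_clear);  /* adapter does the work */
    host_post(&m);                  /* same ccb and mailbox reused at once */
    adapter_finish_clear(&m);       /* late write lands on the new entry */
    adapter_read(&m, false);        /* adapter finds nothing to do */
    return m.posted - m.seen;
}
```

With the deferred write enabled the second command is never seen by the
adapter in this model, so the host would be left waiting until its timeout.
With it disabled nothing is lost.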

To combat this, I've changed the ccb allocation policy to reuse the oldest
rather than the newest free ccb, in the expectation that this would access
mailboxes almost round robin.
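
As a standalone illustration of the oldest-first policy, here is a minimal
sketch of a free list with a head and a tail pointer.  It is a toy model with
made-up names (fifo_ccb, fifo_pool, fifo_get, fifo_put) and a made-up pool
size, not a copy of the driver code.

```c
#include <assert.h>
#include <stddef.h>

#define NCCB 4  /* hypothetical pool size, not the driver's AHA_MBX_SIZE */

struct fifo_ccb {
    struct fifo_ccb *next;  /* link in the free list */
    int mbx;                /* index of the mailbox bound to this ccb */
};

struct fifo_pool {
    struct fifo_ccb ccbs[NCCB];
    struct fifo_ccb *head;  /* oldest free ccb, handed out next */
    struct fifo_ccb *tail;  /* newest free ccb, handed out last */
};

/* Append at the tail so a freshly freed ccb waits its turn. */
void fifo_put(struct fifo_pool *p, struct fifo_ccb *c)
{
    assert(c != NULL);
    c->next = NULL;
    if (p->head == NULL)
        p->head = c;
    else
        p->tail->next = c;
    p->tail = c;
}

/* Take from the head; returns NULL when the pool is exhausted. */
struct fifo_ccb *fifo_get(struct fifo_pool *p)
{
    struct fifo_ccb *c = p->head;
    if (c != NULL) {
        p->head = c->next;
        if (p->head == NULL)
            p->tail = NULL;
    }
    return c;
}

/* Bind ccb i to mailbox i once, then queue every ccb in index order. */
void fifo_init(struct fifo_pool *p)
{
    p->head = NULL;
    p->tail = NULL;
    for (int i = 0; i < NCCB; i++) {
        p->ccbs[i].mbx = i;
        fifo_put(p, &p->ccbs[i]);
    }
}

/* Run get/put cycles one at a time and count the distinct mailboxes used. */
int fifo_distinct_mailboxes(int cycles)
{
    struct fifo_pool p;
    int used[NCCB] = {0};
    int distinct = 0;

    fifo_init(&p);
    for (int i = 0; i < cycles; i++) {
        struct fifo_ccb *c = fifo_get(&p);
        assert(c != NULL);
        if (!used[c->mbx]) {
            used[c->mbx] = 1;
            distinct++;
        }
        fifo_put(p.head == NULL ? &p : &p, c);
    }
    return distinct;
}
```

Even with only one command in flight at a time, this model walks through
every mailbox before coming back to the first.  The order is only "almost"
round robin in general, because ccbs that complete out of order rejoin the
queue out of order.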

I applied the patch given below, and thrashed the disk, tape and CD-ROM
simultaneously (running tar and wc on big files in a loop) without any
failures or errors logged.  I reverted to the previous kernel and "MBO not
free" errors turned up almost immediately.  Then I got a couple of "pid 301:
sh: uid 0:  exited on signal 11" type messages and hurriedly terminated the
experiment.  I'm back on the patched kernel and abusing it as I type.

So, it appears that treating one's outgoing mailboxes in the official
round robin manner is not optional.  I intend to add proper round robin
code myself soon, but I'm realistic enough about my erratic spare time
to invite others to beat me to it.
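
For anyone tempted, one possible shape for proper round robin code is a
cursor over the mailbox array that is independent of ccb allocation.  This is
a rough sketch under my own assumptions, not tested against the driver: the
names (mbo_slot, mbo_ring, ring_post, ring_adapter_free) and the ring size are
hypothetical, and a plain pointer stands in for the ccb address a real mailbox
would carry.

```c
#include <assert.h>
#include <stddef.h>

#define NMBX 4          /* hypothetical ring size */
#define MBO_FREE  0x00  /* values as in the "MBO 01 and not 00" message */
#define MBO_START 0x01

struct mbo_slot {
    unsigned char cmd;  /* MBO_FREE or MBO_START */
    void *ccb;          /* stand-in for the ccb address */
};

struct mbo_ring {
    struct mbo_slot slot[NMBX];
    int next;           /* the only slot the host may use next */
};

void ring_init(struct mbo_ring *r)
{
    for (int i = 0; i < NMBX; i++) {
        r->slot[i].cmd = MBO_FREE;
        r->slot[i].ccb = NULL;
    }
    r->next = 0;
}

/*
 * Post a ccb in strict rotation.  Returns the slot index used, or -1 if
 * the slot at the cursor is still busy.  The host never skips ahead, so
 * a slot is not touched again until all the others have had a turn.
 */
int ring_post(struct mbo_ring *r, void *ccb)
{
    struct mbo_slot *m = &r->slot[r->next];
    int used = r->next;

    if (m->cmd != MBO_FREE)
        return -1;      /* caller should retry later */
    m->ccb = ccb;
    m->cmd = MBO_START;
    r->next = (r->next + 1) % NMBX;
    return used;
}

/* Stand-in for the adapter marking a slot free after reading it. */
void ring_adapter_free(struct mbo_ring *r, int i)
{
    assert(i >= 0 && i < NMBX);
    r->slot[i].cmd = MBO_FREE;
}
```

The point of the design is that a ccb would pick up whichever mailbox is next
in rotation at submit time, rather than being bound to one mailbox for life.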

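Anyway, here's my patch against 2.0.5 (but -current doesn't LOOK much
different).  Besides the allocation change, it makes the "MBO not free" case
release the ccb and return TRY_AGAIN_LATER rather than just printing the
complaint and carrying on: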

Patch relative to "aha1542.c,v 1.45 1995/05/30 08:01:05 rgrimes Exp"

--- aha1542.c	Tue May 30 18:01:05 1995
+++ aha1542.sgm.c	Sun Feb  4 21:26:02 1996
@@ -302,6 +302,7 @@
 	long int kv_phys_xor;
 	struct aha_mbx aha_mbx;	/* all the mailboxes */
 	struct aha_ccb *aha_ccb_free;	/* the next free ccb */
+	struct aha_ccb *aha_ccb_tail;	/* end of the free ccb list */
 	struct aha_ccb aha_ccb[AHA_MBX_SIZE];	/* all the CCBs      */
 	int     aha_int;	/* irq level        */
 	int     aha_dma;	/* DMA req channel  */
@@ -782,14 +783,20 @@
 	if (!(flags & SCSI_NOMASK))
 		opri = splbio();
 
-	ccb->next = aha->aha_ccb_free;
-	aha->aha_ccb_free = ccb;
 	ccb->flags = CCB_FREE;
+
+	ccb->next = NULL;
+	if (aha->aha_ccb_free == NULL)
+	    aha->aha_ccb_free = ccb;
+	else
+	    aha->aha_ccb_tail->next = ccb;
+	aha->aha_ccb_tail = ccb;
+
 	/*
 	 * If there were none, wake anybody waiting for
 	 * one to come free, starting with queued entries
 	 */
-	if (!ccb->next) {
+	if (aha->aha_ccb_free == aha->aha_ccb_tail) {
 		wakeup((caddr_t)&aha->aha_ccb_free);
 	}
 	if (!(flags & SCSI_NOMASK))
@@ -819,6 +826,8 @@
 	}
 	if (rc) {
 		aha->aha_ccb_free = aha->aha_ccb_free->next;
+		if (aha->aha_ccb_free == NULL)
+		    aha->aha_ccb_tail = NULL;	/* Unnecessary, but neat. */
 		rc->flags = CCB_ACTIVE;
 	}
 	if (!(flags & SCSI_NOMASK))
@@ -1214,6 +1223,7 @@
 	 * into a free-list
 	 * this is a kludge but it works
 	 */
+	aha->aha_ccb_tail = &aha->aha_ccb[0];
 	for (i = 0; i < AHA_MBX_SIZE; i++) {
 		aha->aha_ccb[i].next = aha->aha_ccb_free;
 		aha->aha_ccb_free = &aha->aha_ccb[i];
@@ -1354,9 +1364,13 @@
 		xs->error = XS_DRIVER_STUFFUP;
 		return (TRY_AGAIN_LATER);
 	}
-	if (ccb->mbx->cmd != AHA_MBO_FREE)
+	if (ccb->mbx->cmd != AHA_MBO_FREE) {
 		printf("aha%d: MBO %02x and not %02x (free)\n",
-		unit, ccb->mbx->cmd, AHA_MBO_FREE);
+			unit, ccb->mbx->cmd, AHA_MBO_FREE);
+		aha_free_ccb(unit, ccb, flags);
+		xs->error = XS_DRIVER_STUFFUP;
+		return (TRY_AGAIN_LATER);
+	}
 
 	/*
 	 * Put all the arguments for the xfer in the ccb

Stephen McKay.