FreeBSD Mail Archives

Date:      Sat, 14 Apr 2001 20:39:25 +0200
From:      J Wunsch <j@uriah.heep.sax.de>
To:        freebsd-scsi@freebsd.org
Subject:   Re: Problem with current sa(4) driver
Message-ID:  <20010414203925.A63281@uriah.heep.sax.de>
In-Reply-To: <20010413000553.A5245@uriah.heep.sax.de>; from j@uriah.heep.sax.de on Fri, Apr 13, 2001 at 12:05:53AM %2B0200
References:  <20010410233410.A9305@uriah.heep.sax.de> <Pine.LNX.4.21.0104101440510.16909-100000@zeppo.feral.com> <20010411083552.A8646@uriah.heep.sax.de> <20010413000553.A5245@uriah.heep.sax.de>

Hi folks,

it seems the latest cleanup in CAMs error handling code (ken's commit
from 2001/03/27) has broken the ability for the sa(4) driver to handle
residuals properly.  I first noticed it on a new kernel that the
implied test read for any mt(1) operation broke on my Tandberg TDC4222
drive.  The test attempts to read 8 KB from the tape, but tape
blocking was usually 10 or 32 KB, depending on how the supplied
cartridge has been written.  The symptom was that the driver continued
to read the entire tape, until it eventually hit EOM.  Thus, a `mt
blocksize variable' took usually an hour if a full backup cartridge
was in the drive...

I then noticed that any form of tape block size vs. read(2) buffer
sice mismatch causes this behaviour, also for other drive types (see
below).  Since Matt replied that he has no idea why, i got interested
in investigating myself :), and pursued it a little.

Let me first start with my recent mail to Matt that contains a summary
of the code paths involved.

> Since i got hold of an old
> 
> # camcontrol inquiry sa1
> pass6: <ARCHIVE Python 28388-XXX 5.AC> Removable Sequential Access SCSI-2 device 
> pass6: Serial Number DR0J340
> pass6: 7.812MB/s transfers (7.812MHz, offset 8)
> 
> drive as well, i've tried the same procedure there.  Similar behaviour
> to the Tandberg TDC4222 drive.  As soon as i request a too large block
> from the drive, it starts reading the entire tape (until i reset the
> SCSI bus, or it will eventually encounter EOM).
> 
> I've now DDB'ed it, and can tell you at least the following:
> 
> sadone() calls saerror(), which will always return ERESTART (== -1),
> thus it loops at this point.
> 
> saerror(), when called, bcopy()s the sense data, resid etc. into
> softc, then (most likely) enters the SSD_ILI handling, and eventually
> calls cam_periph_error() to handle the error.  I think something
> might be wrong with this.
> 
> cam_periph_error() eventually calls camperiphscsisenseerror(), which
> in turn calls scsi_error_action().  err_action will (apparently) be
> set by this to SS_RETRY.  In the end, this will cause
> cam_periph_error() to return ERESTART.
> 
> I did not investigate yet scsi_error_action() itself, nor examined its
> return value directly (only indirectly based on the code path inside
> cam_periph_error()).

So now, i think it's rather in the domain of the sa(4) driver to
handle an illegal length indication properly by itself, since it's
rather special to handling tapes that a `short read' from the tape
(supplied blocksize to read(2) is larger than logical block size on
the tape) is basically a normal operating condition which is never to
be returned as an EIO, but always to be reported (in the b_resid)
filed to the bio layer.

So my first attempt was to catch this case, and never call
cam_periph_error() in the first place at all, but return 0 from
saerror(), indicating it not actually to be an error at all.  Copying
the residual value over to csio->resid has already been properly
handled in saerror().  This procedure solves the initial problem; a
test read with a 32 KB blocksize dd command from a 10 KB blocked tar
tape results in 10240 bytes being read.  However, using this method,
saclose() always jams the process waiting in state "cbwait", as a
result of the final call to "saprevent(..., PR_ALLOW)".

Somehow i thought it's necessary to actually call cam_periph_error(),
in order to fetch the check condition from the device.  OK, i did so,
and rearranged my modification to first call cam_periph_error(), but
to then ignore its returned ERESTART, and have saerror() return 0.
This reproducibly causes a segmentation violation panic with the
following stack trace:

(gdb) bt
#0  0xc011f683 in heap_down (queue_array=0xc04e66fc, index=-1, num_entries=0)
    at ../../cam/cam_queue.c:345
#1  0xc011f390 in camq_remove (queue=0xc0851a44, index=-1)
    at ../../cam/cam_queue.c:180
#2  0xc0121f0c in xpt_run_dev_sendq (bus=0xc08229c0)
    at ../../cam/cam_queue.h:199
#3  0xc012487f in camisr (V_queue=0xc02bc4b4) at ../../cam/cam_xpt.c:6946
#4  0xc015e857 in ithread_loop (arg=0xc04f1980) at ../../kern/kern_intr.c:517
#5  0xc015da01 in fork_exit (callout=0xc015e618 <ithread_loop>, 
    arg=0xc04f1980, frame=0xc39c8fa8) at ../../kern/kern_fork.c:731

The error happens because of:

(gdb) up
#1  0xc011f390 in camq_remove (queue=0xc0851a44, index=-1)
    at ../../cam/cam_queue.c:180
(gdb) p index
$1 = -2

(I have also observed -1 once.)

cam_pinfo *
camq_remove(struct camq *queue, int index)
{
	cam_pinfo *removed_entry;

	if (index == 0 || index > queue->entries)
		return (NULL);
	removed_entry = queue->queue_array[index];
	if (queue->entries != index) {
		queue->queue_array[index] = queue->queue_array[queue->entries];
		queue->queue_array[index]->index = index;
=>		heap_down(queue->queue_array, index, queue->entries - 1);
	}
	removed_entry->index = CAM_UNQUEUED_INDEX;
	queue->entries--;
	return (removed_entry);
}


Now, is this a bug somewhere in the CAM code, or did i overlook
something in my attempt to fix the sa driver?

I've also reviewed the old (working) code, and it seems that one
always returned 0 from cam_periph_error(), after first decrementing
the retry count:

	case SSD_CURRENT_ERROR:
	{
		
		switch (sense_key) {
		case SSD_KEY_NO_SENSE:
		^^^^^^^^^^^^^^^^^^^^^^ <- we do have no sense for an ILI cond.
			/* Why were we called then? Well don't bail now */
			/* FALLTHROUGH */
		case SSD_KEY_EQUAL:
			/* These should be filtered by the peripheral drivers */
			/* FALLTHROUGH */
		case SSD_KEY_MISCOMPARE:
			print_sense = FALSE;
			/* FALLTHROUGH */
		case SSD_KEY_RECOVERED_ERROR:

			/* decrement the number of retries */
			retry = ccb->ccb_h.retry_count > 0;
			if (retry)
				ccb->ccb_h.retry_count--;

			error = 0;
			break;

[I've finally subscribed to this list again, so no need to send
me a personal Cc.]
-- 
cheers, J"org               .-.-.   --... ...--   -.. .  DL8DTL

http://www.sax.de/~joerg/                        NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-scsi" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20010414203925.A63281>

Header And Logo

Peripheral Links

Site Navigation

Header And Logo

Peripheral Links

Search

Site Navigation