Date:      Wed, 16 Jun 2010 17:32:18 -0600
From:      Scott Long <scottl@samsco.org>
To:        Andrew Boyer <aboyer@averesystems.com>
Cc:        freebsd-scsi@freebsd.org
Subject:   Re: Overlapped Commands error
Message-ID:  <C46A13B3-BFA7-4FD7-AD52-F0A60D6CF424@samsco.org>
In-Reply-To: <51DD9715-89B2-4058-A4FE-7097603013CC@averesystems.com>
References:  <51DD9715-89B2-4058-A4FE-7097603013CC@averesystems.com>

On Jun 16, 2010, at 10:17 AM, Andrew Boyer wrote:
> Hello SCSI experts,
> We recently saw this SCSI command error:
>
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): READ(10). CDB: 28 0 2 c8 7f a0 0 0 20 0
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): CAM Status: SCSI Status Error
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): SCSI Status: Check Condition
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): ABORTED COMMAND asc:4e,0
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): Overlapped commands attempted field replaceable unit: 1
>> Jun 15 15:08:32 eval12 kernel: (da1:mpt0:0:1:0): Retrying Command (per Sense Data)
>> Jun 15 15:08:37 eval12 kernel: mpt0: request 0xffffffff815d5c20:40101 timed out for ccb 0xffffff000d54d800 (req->ccb 0xffffff000d54d800)
>> Jun 15 15:08:37 eval12 kernel: mpt0: attempting to abort req 0xffffffff815d5c20:40101 function 0
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_wait_req(1) timed out
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_cam_event: 0x0
>> Jun 15 15:08:38 eval12 kernel: mpt0: mpt_cam_event: 0x0
>> Jun 15 15:08:38 eval12 kernel: mpt0: completing timedout/aborted req 0xffffffff815d5c20:40101
>> Jun 15 15:09:00 eval12 kernel: mpt0: mpt_cam_event: 0x16
>> Jun 15 15:09:00 eval12 kernel: mpt0: mpt_cam_event: 0x12
>> Jun 15 15:09:00 eval12 kernel: mpt0: mpt_cam_event: 0x16
>
> No one here has ever seen this before.  We're using a CAM and MPT
> stack from August 2009 with an LSI1068e HBA connected to Seagate SAS
> HDDs.
>
> This is what the SCSI Architecture Manual (SAM-5 draft) has to say
> about overlapped commands:

>> [...]
>
> Can anyone point me to where in the stack the command identifier is
> assigned?  I see where MPT assigns tags in target mode, but it's the
> initiator in this case.  Any advice?

Don't want to step on Matt, but wanted to expand on what he's said so
far.

CAM doesn't assign tag identifiers for initiator I/O; it leaves that up
to the driver and hardware.  The tag_id field that you see in CCBs is
for target I/O only.  In the case of MPT, the firmware assigns tags,
while on simpler controllers like ESP the driver does it.  CAM does
provide the tag action message, i.e. SIMPLE, ORDERED, or HEAD_OF_Q, and
it's up to the driver to relay that to the hardware, which MPT does in
mpt_start().
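
As a rough illustration of that relay step, the sketch below maps a
CAM-style tag action onto the queueing bits of a per-request control
word.  It is not the code in mpt_start(): the enum, the sketch_ names,
and the bit values are placeholders invented for this example, and the
real definitions would come from the CAM and MPT headers.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the tag action carried in a CCB. */
enum sketch_tag_action {
	SKETCH_TAG_NONE,	/* no valid tag action supplied */
	SKETCH_TAG_SIMPLE,
	SKETCH_TAG_HEAD_OF_Q,
	SKETCH_TAG_ORDERED
};

/* Placeholder control bits, not copied from the MPT headers. */
#define SKETCH_CTRL_SIMPLEQ	0x0000u
#define SKETCH_CTRL_HEADOFQ	0x0100u
#define SKETCH_CTRL_ORDEREDQ	0x0200u
#define SKETCH_CTRL_UNTAGGED	0x0500u

/*
 * Translate the tag action chosen by the upper layer into the
 * queueing bits that the driver hands to the hardware.  The tag
 * identifier itself is not chosen here; it is left to the firmware.
 */
uint32_t
sketch_tag_to_control(enum sketch_tag_action action)
{
	switch (action) {
	case SKETCH_TAG_SIMPLE:
		return (SKETCH_CTRL_SIMPLEQ);
	case SKETCH_TAG_HEAD_OF_Q:
		return (SKETCH_CTRL_HEADOFQ);
	case SKETCH_TAG_ORDERED:
		return (SKETCH_CTRL_ORDEREDQ);
	case SKETCH_TAG_NONE:
	default:
		/* No tag action requested: send the command untagged. */
		return (SKETCH_CTRL_UNTAGGED);
	}
}
```

The point of the sketch is only the division of labor: the upper layer
picks the action, and the driver translates it without inventing a tag
number of its own.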

The MPT architecture abstracts a lot of the transport protocol away, so
it's generally assumed that it's going to do the right thing in a case
like this.  I don't know if the firmware is wrong, or if FreeBSD is
wrong.  CAM almost always attaches a SIMPLE action flag to I/O
commands, and the MPT driver looks like it will faithfully translate
that into the corresponding MPT flag.  By looking at the inquiry data,
it's roughly possible to determine whether the device supports tagged
queuing, so maybe CAM needs to be smarter about this.  Instead of the TQ
flag just affecting command scheduling, maybe it also needs to suppress
attaching the SIMPLE action flag, and likewise the MPT driver should set
an UNTAGGED flag to match.
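
A minimal sketch of the inquiry check, assuming the raw standard
INQUIRY response is available as a byte buffer.  In the standard
INQUIRY data the CmdQue bit is bit 1 of byte 7; the function and macro
names below are made up for this example and are not CAM's.  As noted
above this is only a rough indicator, since a device can misreport it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INQ_FLAGS_BYTE	7	/* byte holding the CmdQue bit */
#define SKETCH_INQ_CMDQUE	0x02	/* bit 1: command queuing supported */

/*
 * Return true if the device advertises tagged command queuing in its
 * standard INQUIRY data.  Short or missing buffers are treated as
 * "no tagged queuing" so the caller falls back to untagged commands.
 */
bool
sketch_supports_tagged_queuing(const uint8_t *inq, size_t len)
{
	if (inq == NULL || len <= SKETCH_INQ_FLAGS_BYTE)
		return (false);
	return ((inq[SKETCH_INQ_FLAGS_BYTE] & SKETCH_INQ_CMDQUE) != 0);
}
```

A caller could use the result to decide whether to attach a SIMPLE tag
action at all, rather than only using it for scheduling.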

I would expect the MPT firmware to look at the inquiry data and behave
appropriately despite what might be sent in the MPT I/O request, but
again, maybe that's asking too much.  If you're adventurous, try
modifying the MPT driver to always set the MPI_SCSIIO_CONTROL_UNTAGGED
flag in mpt_start(), and see if that makes your problem go away.
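
One way to picture that experiment, assuming the task attribute
occupies a masked field inside the request's control word.  The mask
and bit values here are placeholders and sketch_force_untagged() is a
made-up helper, not a function in the driver; in the real driver the
value to set would be MPI_SCSIIO_CONTROL_UNTAGGED.

```c
#include <stdint.h>

/* Placeholder layout of the task attribute field (assumed). */
#define SKETCH_CTRL_TASKATTR_MASK	0x00000700u
#define SKETCH_CTRL_UNTAGGED		0x00000500u

/*
 * Discard whatever queueing attribute was already set and force the
 * request to go out untagged.  All bits outside the task attribute
 * field (data direction and so on) are left untouched.
 */
uint32_t
sketch_force_untagged(uint32_t control)
{
	control &= ~SKETCH_CTRL_TASKATTR_MASK;
	control |= SKETCH_CTRL_UNTAGGED;
	return (control);
}
```

Clearing the field before setting it matters: OR-ing the untagged
value on top of an existing ORDERED or HEAD_OF_Q attribute could
otherwise produce a different attribute than the one intended.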

>
> Also, is CAM doing the right thing by retrying?  scsi_error_action()
> in cam/scsi/scsi_all.c always sets the retry bit on aborted commands,
> even though the spec quoted above makes it sound like this should be a
> fatal error ("This is considered a catastrophic failure on the part of
> the SCSI initiator device").  Should scsi_error_action() be looking at
> the Additional Sense Code?
>

The error recovery code in CAM already cross-references the ASC/ASCQ
against an action table, but that table is often incomplete for uncommon
edge cases.  Try the following:

RCS file: /usr1/ncvs/src/sys/cam/scsi/scsi_all.c,v
retrieving revision 1.55.2.3
diff -u -r1.55.2.3 scsi_all.c
--- scsi_all.c	14 Feb 2010 19:38:27 -0000	1.55.2.3
+++ scsi_all.c	16 Jun 2010 23:31:47 -0000
@@ -1962,7 +1962,7 @@
 	{ SST(0x4D, 0xFF, SS_RDEF | SSQ_RANGE,
 	    NULL) },			/* Range 0x00->0xFF */
 	/* DTLPWROMAEBKVF */
-	{ SST(0x4E, 0x00, SS_RDEF,
+	{ SST(0x4E, 0x00, SS_FATAL | ENXIO,
 	    "Overlapped commands attempted") },
 	/*  T             */
 	{ SST(0x50, 0x00, SS_RDEF,
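
To show the idea behind the patch in isolation, here is a small
stand-alone sketch of an ASC/ASCQ action table with the 0x4E/0x00 entry
marked fatal.  The struct, enum, and function names are invented for
this example and do not match scsi_all.c, and the fallback of "retry
with EIO" for codes missing from the table is an assumption standing in
for the sense-key default.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_action {
	SKETCH_RETRY,
	SKETCH_FAIL
};

struct sketch_asc_entry {
	uint8_t asc;
	uint8_t ascq;
	enum sketch_action action;
	int error;		/* errno handed back to the caller */
	const char *desc;
};

/* One entry only; a real table covers many more codes. */
static const struct sketch_asc_entry sketch_asc_table[] = {
	{ 0x4E, 0x00, SKETCH_FAIL, ENXIO, "Overlapped commands attempted" },
};

static const struct sketch_asc_entry *
sketch_asc_lookup(uint8_t asc, uint8_t ascq)
{
	size_t i, n;

	n = sizeof(sketch_asc_table) / sizeof(sketch_asc_table[0]);
	for (i = 0; i < n; i++) {
		if (sketch_asc_table[i].asc == asc &&
		    sketch_asc_table[i].ascq == ascq)
			return (&sketch_asc_table[i]);
	}
	return (NULL);
}

/* Pick an action for an ABORTED COMMAND based on its ASC/ASCQ. */
enum sketch_action
sketch_error_action(uint8_t asc, uint8_t ascq, int *error)
{
	const struct sketch_asc_entry *e;

	assert(error != NULL);
	e = sketch_asc_lookup(asc, ascq);
	if (e == NULL) {
		/* Assumed default when the table has no entry: retry. */
		*error = EIO;
		return (SKETCH_RETRY);
	}
	*error = e->error;
	return (e->action);
}
```

With the table entry in place, an overlapped-commands sense code stops
being retried and surfaces as ENXIO, which is the behavior the diff
above is aiming for.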

Scott



