Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 25 Apr 2016 10:38:41 -0600
From:      "Stephen McConnell" <stephen.mcconnell@broadcom.com>
To:        "'Dan Langille'" <dan@langille.org>, <freebsd-scsi@freebsd.org>
Subject:   RE: terminated ioc 804b scsi 0 state c xfer 0
Message-ID:  <0d7401d19f10$ee329300$ca97b900$@broadcom.com>
In-Reply-To: <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org>
References:  <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org>

next in thread | previous in thread | raw e-mail | index | archive | help


> -----Original Message-----
> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-
> scsi@freebsd.org] On Behalf Of Dan Langille
> Sent: Monday, April 25, 2016 9:40 AM
> To: freebsd-scsi@freebsd.org
> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
> 
> > On Apr 25, 2016, at 8:17 AM, Dan Langille <dan@langille.org> wrote:
> >
> >>
> >> On Apr 24, 2016, at 9:35 AM, Dan Langille <dan@langille.org> wrote:
> >>
> >> More of the pasted output is also at
> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae
> <https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae>;
> and added smartctl output.
> >>
> >> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI SAS2008
> card.  It's running a zfs root system.
> >>
> >> This morning the system was unresponsive via ssh. Attempts to log in at
> the console did not yield a password prompt.
> >>
> >> A power cycle brought the system online.  Inspecting /var/log/messages,
I
> found about 63,000 entries similar to those which appear below.
> >>
> >> zpool status of all are OK. A scrub is in progress for one pool (since
before
> this issue arose). da7 is in that pool.
> >>
> >>
> >> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
> >> 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi
> >> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
> >> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614
> >> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
> >> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20
> >> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
> >> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
> >> 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b
> >> scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
> >> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674
> >> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
> >> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 18
> >> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer
> >> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB:
> >> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc
> >> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel:
> >> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00
> >> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr
> >> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b
> >> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi 0
> >> state c xfer 0
> >>
> >> Is this a cabling issue?  The drive is a SATA device (smartctl output
in the
> URL above).  Anyone familiar with these errors?
> >
> > This morning:
> >
> > 13410079654596185797  REMOVED      0     0     0  was /dev/da7p3
> >
> > At least I know i'm looking for Serial Number: 13Q8PNBYS
> >
> > From the logs:
> >
> > Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0
> > Apr 25 05:34:50 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n
> 13Q8PNBYS detached
> > Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
> > d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0
> > state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0):
> > READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204
> > terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB:
28
> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25 05:34:51
> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue
> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b scsi
0
> state c xfer(da7: 0
> > Apr 25 05:34:51 knew kernel: mps1:0:    (da7:mps1:0:17:0): READ(10).
CDB:
> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b
scsi 0
> state c xfer 17:0
> > Apr 25 05:34:51 knew kernel: 0):        (da7:mps1:0:17:0): WRITE(10).
CDB: 2a
> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b scsi
0
> state c xfeError 5, Periph was invalidated
> > Apr 25 05:34:51 knew kernel: r 0
> > Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
> > d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel:
> > (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length
> > 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer
> > (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr 25
> > 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1:
> > mps1:0:IOCStatus = 0x4b while resetting device 0xa Apr 25 05:34:51
> > knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25
> > 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 05:34:51
> > knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00
> > 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
> > Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
> > (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
> > knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00
> > 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
> > Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
> > (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
> > knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00
> > 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 knew
> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00
Apr
> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally
Re-
> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5,
> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR: Device
> swap: provider da7p2 disconnected.
> > Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS 'vdev
is
> removed, pool_guid=15378250086669402288
> vdev_guid=13410079654596185797''
> > Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr
> > 25 05:34:51 knew ZFS: vdev is removed, pool_guid=15378250086669402288
> > vdev_guid=13410079654596185797
> 
> Current status: after powering off the box, reseating the cables for that
drive, I
> powered up the system and a resilver commenced which completed in 30
> minutes.
> 
> Seems OK now.  I am not sure if the two events are related.

Recently, a bug was uncovered where a device is gets 'lost'.

Here's what happens:
The first message in your "failure on Monday"  log is for
'mpssas_prepare_remove'.  This message is likely logged because the FW sends
an event to the driver that the device is no longer responsive (pulled,
cable issue, or something else).  When the driver gets this event, it sends
a reset to the device to clear out any pending I/O.  This is where the
'terminated ioc' messages come from.  When the reset completes, the driver
is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle for
that disk is removed from FW's list.  Then, when the device comes back
on-line, everything is fine. But, with this bug, before that SAS_IO_UNIT
message is sent to FW, the driver exits the function where that happens
(mpssas_remove_device).  This happens where you see the log message,
"IOCStatus - 0x4b while resetting device 0x0a".  The driver logs that
message and then exits.  What the driver should do is log that message and
continue on to send the SAS_IO_UNIT message to FW.  The fix is to remove the
two lines in the driver shown here with '>>':

	if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) {
		mps_printf(sc, "IOCStatus = 0x%x while resetting device
0x%x\n",
		   le16toh(reply->IOCStatus), handle);
>>		mpssas_free_tm(sc, tm);
>>		return;
	}

A reboot will solve the problem, as you saw, but the real fix is to remove
the DevHandle as described above.  This fix will go into the driver before
the next scheduled release and then MFC'd to 10.x.

Steve

> 
> --
> Dan Langille - BSDCan / PGCon
> dan@langille.org
> 
> 
> 
> 
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?0d7401d19f10$ee329300$ca97b900$>