Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 10 May 2016 06:10:00 -0500
From:      Josh Paetzel <josh@tcbug.org>
To:        Stephen McConnell <stephen.mcconnell@broadcom.com>
Cc:        freebsd-scsi@freebsd.org, Scott Long <scott4long@yahoo.com>
Subject:   Re: terminated ioc 804b scsi 0 state c xfer 0
Message-ID:  <0BA358CB-779B-460E-9FFC-FA033C136825@tcbug.org>
In-Reply-To: <3D06CE25-4159-4F30-A8C5-8A188144681B@yahoo.com>
References:  <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> <0d7401d19f10$ee329300$ca97b900$@broadcom.com> <3D06CE25-4159-4F30-A8C5-8A188144681B@yahoo.com>

next in thread | previous in thread | raw e-mail | index | archive | help


> On Apr 25, 2016, at 12:30 PM, Scott Long via freebsd-scsi <freebsd-scsi@fr=
eebsd.org> wrote:
>=20
>=20
>> On Apr 25, 2016, at 10:38 AM, Stephen McConnell via freebsd-scsi <freebsd=
-scsi@freebsd.org> wrote:
>>=20
>>=20
>>=20
>>> -----Original Message-----
>>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-
>>> scsi@freebsd.org] On Behalf Of Dan Langille
>>> Sent: Monday, April 25, 2016 9:40 AM
>>> To: freebsd-scsi@freebsd.org
>>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
>>>=20
>>>>> On Apr 25, 2016, at 8:17 AM, Dan Langille <dan@langille.org> wrote:
>>>>>=20
>>>>>=20
>>>>> On Apr 24, 2016, at 9:35 AM, Dan Langille <dan@langille.org> wrote:
>>>>>=20
>>>>> More of the pasted output is also at
>>> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae
>>> <https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae>;
>>> and added smartctl output.
>>>>>=20
>>>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI SAS2008=

>>> card.  It's running a zfs root system.
>>>>>=20
>>>>> This morning the system was unresponsive via ssh. Attempts to log in a=
t
>>> the console did not yield a password prompt.
>>>>>=20
>>>>> A power cycle brought the system online.  Inspecting /var/log/messages=
,
>> I
>>> found about 63,000 entries similar to those which appear below.
>>>>>=20
>>>>> zpool status of all are OK. A scrub is in progress for one pool (since=

>> before
>>> this issue arose). da7 is in that pool.
>>>>>=20
>>>>>=20
>>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>>>> 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi
>>>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
>>>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614
>>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
>>>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20
>>>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
>>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>>>> 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b
>>>>> scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
>>>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674
>>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
>>>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 18
>>>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer
>>>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB:
>>>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc
>>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel:
>>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00
>>>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr
>>>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b
>>>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi 0
>>>>> state c xfer 0
>>>>>=20
>>>>> Is this a cabling issue?  The drive is a SATA device (smartctl output
>> in the
>>> URL above).  Anyone familiar with these errors?
>>>>=20
>>>> This morning:
>>>>=20
>>>> 13410079654596185797  REMOVED      0     0     0  was /dev/da7p3
>>>>=20
>>>> At least I know i'm looking for Serial Number: 13Q8PNBYS
>>>>=20
>>>> =46rom the logs:
>>>>=20
>>>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0
>>>> Apr 25 05:34:50 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n
>>> 13Q8PNBYS detached
>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0
>>>> state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0):
>>>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204
>>>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB=
:
>> 28
>>> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25 05:34:5=
1
>>> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue
>>> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 2=
8
>>> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b scs=
i
>> 0
>>> state c xfer(da7: 0
>>>> Apr 25 05:34:51 knew kernel: mps1:0:    (da7:mps1:0:17:0): READ(10).
>> CDB:
>>> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b
>> scsi 0
>>> state c xfer 17:0
>>>> Apr 25 05:34:51 knew kernel: 0):        (da7:mps1:0:17:0): WRITE(10).
>> CDB: 2a
>>> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b scs=
i
>> 0
>>> state c xfeError 5, Periph was invalidated
>>>> Apr 25 05:34:51 knew kernel: r 0
>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
>>>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel:
>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length
>>>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer
>>>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr 25
>>>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1:
>>>> mps1:0:IOCStatus =3D 0x4b while resetting device 0xa Apr 25 05:34:51
>>>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25
>>>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 05:34:51
>>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00
>>>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
>>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00
>>>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
>>>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00
>>>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 knew
>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00
>> Apr
>>> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally
>> Re-
>>> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5,
>>> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR: Device
>>> swap: provider da7p2 disconnected.
>>>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS 'vde=
v
>> is
>>> removed, pool_guid=3D15378250086669402288
>>> vdev_guid=3D13410079654596185797''
>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr
>>>> 25 05:34:51 knew ZFS: vdev is removed, pool_guid=3D15378250086669402288=

>>>> vdev_guid=3D13410079654596185797
>>>=20
>>> Current status: after powering off the box, reseating the cables for tha=
t
>> drive, I
>>> powered up the system and a resilver commenced which completed in 30
>>> minutes.
>>>=20
>>> Seems OK now.  I am not sure if the two events are related.
>>=20
>> Recently, a bug was uncovered where a device is gets 'lost'.
>>=20
>> Here's what happens:
>> The first message in your "failure on Monday"  log is for
>> 'mpssas_prepare_remove'.  This message is likely logged because the FW se=
nds
>> an event to the driver that the device is no longer responsive (pulled,
>> cable issue, or something else).  When the driver gets this event, it sen=
ds
>> a reset to the device to clear out any pending I/O.  This is where the
>> 'terminated ioc' messages come from.  When the reset completes, the drive=
r
>> is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle for=

>> that disk is removed from FW's list.  Then, when the device comes back
>> on-line, everything is fine. But, with this bug, before that SAS_IO_UNIT
>> message is sent to FW, the driver exits the function where that happens
>> (mpssas_remove_device).  This happens where you see the log message,
>> "IOCStatus - 0x4b while resetting device 0x0a".  The driver logs that
>> message and then exits.  What the driver should do is log that message an=
d
>> continue on to send the SAS_IO_UNIT message to FW.  The fix is to remove t=
he
>> two lines in the driver shown here with '>>':
>>=20
>>    if (le16toh(reply->IOCStatus) !=3D MPI2_IOCSTATUS_SUCCESS) {
>>        mps_printf(sc, "IOCStatus =3D 0x%x while resetting device
>> 0x%x\n",
>>           le16toh(reply->IOCStatus), handle);
>>>>        mpssas_free_tm(sc, tm);
>>>>        return;
>>    }
>>=20
>> A reboot will solve the problem, as you saw, but the real fix is to remov=
e
>> the DevHandle as described above.  This fix will go into the driver befor=
e
>> the next scheduled release and then MFC'd to 10.x.
>=20
> Thanks for the diagnosis, Steve.  I forgot about that case.  We should als=
o make
> this chain of events more evident in the syslog, it=E2=80=99s very confusi=
ng when it happens.
> I=E2=80=99m not exactly sure yet what it should look like.
>=20
> Scott
>=20

Steve,

We are seeing this same error case for mpr and mps in FreeNAS. We are decoup=
led from FreeBSD releases and have an auto-update mechanism.

I'd be very happy to test patches against stable/10 on our nightly users.=20=




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?0BA358CB-779B-460E-9FFC-FA033C136825>