From owner-freebsd-scsi@freebsd.org Mon May 30 18:27:46 2016
From: Dan Langille <dan@langille.org>
To: freebsd-scsi@freebsd.org
Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
Date: Mon, 30 May 2016 14:27:31 -0400

> On Apr 25, 2016, at 12:38 PM, Stephen McConnell wrote:
>
>> -----Original Message-----
>> From: Dan Langille
>> Sent: Monday, April 25, 2016 9:40 AM
>> To: freebsd-scsi@freebsd.org
>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
>>
>>> On Apr 25, 2016, at 8:17 AM, Dan Langille wrote:
>>>
>>>> On Apr 24, 2016, at 9:35 AM, Dan Langille wrote:
>>>>
>>>> More of the pasted output, including smartctl output, is at
>>>> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae
>>>>
>>>> I have a FreeBSD 10.2-RELEASE-p14 box which contains an LSI SAS2008 card. It's running a ZFS root system.
>>>>
>>>> This morning the system was unresponsive via ssh. Attempts to log in at the console did not yield a password prompt.
>>>>
>>>> A power cycle brought the system online. Inspecting /var/log/messages, I found about 63,000 entries similar to those which appear below.
>>>>
>>>> zpool status reports all pools OK. A scrub is in progress for one pool (since before this issue arose). da7 is in that pool.
>>>>
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi 0 state c xfer 0
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614 terminated ioc 804b scsi 0 state c xfer 0
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b scsi 0 state c xfer 0
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674 terminated ioc 804b scsi 0 state c xfer 0
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 18 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer 12288
>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc 804b scsi 0 state c xfer 0
>>>> Apr 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0
>>>> Apr 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi 0 state c xfer 0
>>>>
>>>> Is this a cabling issue? The drive is a SATA device (smartctl output in the URL above). Is anyone familiar with these errors?
>>>
>>> This morning:
>>>
>>>   13410079654596185797  REMOVED  0  0  0  was /dev/da7p3
>>>
>>> At least I know I'm looking for Serial Number: 13Q8PNBYS
>>>
>>> From the logs:
>>>
>>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0
>>> Apr 25 05:34:50 knew kernel: da7: s/n 13Q8PNBYS detached

Just for the record, this happened again this morning. Fixed by power cycle.

May 30 03:22:08 knew kernel: mps1: mpssas_prepare_remove: Sending reset for target ID 17
May 30 03:22:10 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0
May 30 03:22:10 knew kernel: da7: s/n 13Q8PNBYS detached
May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c 91 c0 00 00 08 00 length 4096 SMID 179 terminated ioc 804b scsi 0 state c xfer 0
May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 6b bf db a0 00 00 f0 00 length 122880 SMID 938 terminated ioc 804b scsi 0 state c xf(da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c 91 c0 00 00 08 00
May 30 03:22:10 knew kernel: er 122880

>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0 state c xfer 0
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204 terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 53 e0 00 00 08 00
>>> Apr 25 05:34:51 knew kernel: 0
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b scsi 0 state c xfer(da7: 0
>>> Apr 25 05:34:51 knew kernel: mps1:0: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b scsi 0 state c xfer 17:0
>>> Apr 25 05:34:51 knew kernel: 0): (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b scsi 0 state c xfeError 5, Periph was invalidated
>>> Apr 25 05:34:51 knew kernel: r 0
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
>>> Apr 25 05:34:51 knew kernel: 0
>>> Apr 25 05:34:51 knew kernel: (da7:mps1: mps1:0:IOCStatus = 0x4b while resetting device 0xa
>>> Apr 25 05:34:51 knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17
>>> Apr 25 05:34:51 knew kernel: Error 5, Periph was invalidated
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00 20 00
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, Periph was invalidated
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00 10 00
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, Periph was invalidated
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00 00 90 00
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, Periph was invalidated
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, Periph was invalidated
>>> Apr 25 05:34:51 knew kernel: GEOM_MIRROR: Device swap: provider da7p2 disconnected.
>>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=15378250086669402288 vdev_guid=13410079654596185797''
>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed
>>> Apr 25 05:34:51 knew ZFS: vdev is removed, pool_guid=15378250086669402288 vdev_guid=13410079654596185797
>>
>> Current status: after powering off the box and reseating the cables for that drive, I powered up the system and a resilver commenced, completing in 30 minutes.
>>
>> Seems OK now. I am not sure if the two events are related.
>
> Recently, a bug was uncovered where a device gets 'lost'.
>
> Here's what happens:
> The first message in your "failure on Monday" log comes from 'mpssas_prepare_remove'. This message is likely logged because the FW sends an event to the driver that the device is no longer responsive (pulled, cable issue, or something else). When the driver gets this event, it sends a reset to the device to clear out any pending I/O. This is where the 'terminated ioc' messages come from. When the reset completes, the driver is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle for that disk is removed from FW's list. Then, when the device comes back on-line, everything is fine. But, with this bug, before that SAS_IO_UNIT message is sent to FW, the driver exits the function where that happens (mpssas_remove_device). This happens where you see the log message, "IOCStatus = 0x4b while resetting device 0x0a". The driver logs that message and then exits. What the driver should do is log that message and continue on to send the SAS_IO_UNIT message to FW.
> The fix is to remove the two lines in the driver shown here with '>>':
>
>     if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) {
>             mps_printf(sc, "IOCStatus = 0x%x while resetting device 0x%x\n",
>                 le16toh(reply->IOCStatus), handle);
>>>           mpssas_free_tm(sc, tm);
>>>           return;
>     }
>
> A reboot will solve the problem, as you saw, but the real fix is to remove the DevHandle as described above. This fix will go into the driver before the next scheduled release and then be MFC'd to 10.x.
>
> Steve
>
>> --
>> Dan Langille - BSDCan / PGCon
>> dan@langille.org
>>
>> _______________________________________________
>> freebsd-scsi@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
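[To make the intended control flow concrete, here is a minimal sketch of that check after the fix Stephen describes: the two marked lines are dropped so the function falls through and still removes the DevHandle. Only the if-block is taken from the mail above; the surrounding comments describe assumed context, not verbatim driver source.]

    /*
     * Sketch of the patched check in mpssas_remove_device() (mps driver).
     * A non-SUCCESS IOCStatus from the target reset is still logged, but
     * is no longer treated as fatal.
     */
    if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) {
            mps_printf(sc, "IOCStatus = 0x%x while resetting device 0x%x\n",
                le16toh(reply->IOCStatus), handle);
            /*
             * The mpssas_free_tm()/return pair is removed here: fall
             * through instead of exiting early.
             */
    }
    /* ...continue on and send the SAS_IO_UNIT message so the FW releases
     * the DevHandle from its list... */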
From owner-freebsd-scsi@freebsd.org Tue May 31 16:17:47 2016
From: "Stephen McConnell" <stephen.mcconnell@broadcom.com>
To: Dan Langille <dan@langille.org>, freebsd-scsi@freebsd.org
Subject: RE: terminated ioc 804b scsi 0 state c xfer 0
Date: Tue, 31 May 2016 10:17:41 -0600

> -----Original Message-----
> From: Dan Langille
> Sent: Monday, May 30, 2016 12:28 PM
> To: freebsd-scsi@freebsd.org
> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
>
> [...earlier quoted report and logs snipped for brevity...]
> Just for the record, this happened again this morning. Fixed by power cycle.
>
> May 30 03:22:08 knew kernel: mps1: mpssas_prepare_remove: Sending reset for target ID 17
> May 30 03:22:10 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0
> May 30 03:22:10 knew kernel: da7: s/n 13Q8PNBYS detached
> May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c 91 c0 00 00 08 00 length 4096 SMID 179 terminated ioc 804b scsi 0 state c xfer 0
> May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 6b bf db a0 00 00 f0 00 length 122880 SMID 938 terminated ioc 804b scsi 0 state c xf(da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c 91 c0 00 00 08 00
> May 30 03:22:10 knew kernel: er 122880

I just realized that you're using mps, not mpr. The fix went into the mpr driver, but not mps yet. It'll have to be ported over to mps.

Steve

> [...remainder of quoted message snipped for brevity...]
_______________________________________________
freebsd-scsi@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"

From owner-freebsd-scsi@freebsd.org Tue May 31 16:20:16 2016
From: Dan Langille <dan@langille.org>
To: Stephen McConnell <stephen.mcconnell@broadcom.com>
Cc: freebsd-scsi@freebsd.org
Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
Date: Tue, 31 May 2016 12:20:07 -0400

> On May 31, 2016, at 12:17 PM, Stephen McConnell wrote:
>
> [...quoted report and logs snipped for brevity...]
> I just realized that you're using mps, not mpr. The fix went into the mpr driver, but not mps yet. It'll have to be ported over to mps.
This hit me again last night. Same drive again. A power cycle cleared it. Now I'm wondering if it's heat or dud-drive related.

> Steve
>
> [...remainder of quoted message snipped for brevity...]
_______________________________________________
freebsd-scsi@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"

From owner-freebsd-scsi@freebsd.org Fri Jun 3 20:37:03 2016
From: list-news <list-news@mindpackstudios.com>
To: freebsd-scsi@freebsd.org
Subject: Avago LSI SAS 3008 & Intel SSD Timeouts
Date: Fri, 3 Jun 2016 15:36:58 -0500

Greetings!

For the last month I've been fighting with SAS 3008 issues on a SuperMicro 2028TP-DECR (a 2-node Twin server). The system is running 10.3-RELEASE #1 r298749.

The errors we see on this system look very similar to the errors posted here back in March:
https://lists.freebsd.org/pipermail/freebsd-scsi/2016-March/007000.html

The system hosts a large PostgreSQL database and an application which spends most of its days calculating log entries on a couple hundred GB of tables. Initially this system had 2 mirrored WD HDD boot drives and 8 Intel 730 240GB SSDs in a ZFS pool with 4 mirrored vdevs.
Shortly after install/setup, a couple of those Intel 730s began showing READ and WRITE errors in zpool status; additionally, mpr0 messages were showing drive timeouts when attempting READ(10), WRITE(10), and SYNCHRONIZE CACHE(10), such as this:

Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15 5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
Jun  2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
Jun  2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00 allocated tm 0xfffffe0000cdd3b0
Jun  2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15 Aborting command 0xfffffe0000cefb00
Jun  2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 16
[...snipped about 50 more lines for brevity]

As a solution, the immediate thought was that the drives were faulty. We decided to replace some of the drives with Intel 480GB DC3610s. After replacing 4 of the drives we noticed the 3610s were also throwing errors. Read/write error counts displayed with zpool status would climb into the thousands in a night of calculations, with dozens of mpr errors in the logs. This was happening to all of the drives in the array. Some are more likely than others, but given enough time, all drives would display the errors. This provoked some suspicion that the problem was backplane and/or controller related and the drives were fine (or possibly the drive firmware is the same in the 730s and 3610s and both exhibit this problem, but the backplane did seem more likely at the time).

We moved the Node A system sled (cpu/mem/sas/mb) into the Node B chassis slot to see if the errors would change at all (thinking maybe some SATA ports were bad on the backplane). We hit a bit of luck and noticed the boot drives, when inserted into Node B drive sleds 1 & 2, wouldn't fire up and were undetectable by the SAS controller. SuperMicro concluded this was likely a backplane issue and the backplane was replaced. The new backplane came in, and the drives in Node B sleds 1/2 are now visible to the controller during boot, confirming the backplane did have a problem. (This is important to mention, as it seems an internet-suggested fix for similar timeout errors is to replace the backplane and/or cables.)

Unfortunately, the luck ran out: once we put the system back under load, the errors came back. I rebuilt the pool using just four 3610s as 2x mirrored vdevs and copied all the data over. Here is the pool as of yesterday:

  pool: zp1
 state: ONLINE
  scan: resilvered 107M in 0h0m with 0 errors on Thu Jun  2 14:52:43 2016
config:

        NAME        STATE     READ WRITE CKSUM
        zp1         ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da5     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0

When performing logging calculations, the system looks like this pretty much continuously (via gstat):

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    7   8175   1374   7526    0.5   3000  48931    0.5   65.2| da3
    6   8337   1542   8025    0.5   2995  48947    0.5   68.1| da5
    2   8740   1529   8584    0.5   3138  52564    0.4   72.6| da7
    3   8637   1427   7961    0.5   3136  52564    0.4   65.6| da8

And zpool iostat:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
zp1          414G   474G  10.9K  8.94K  50.2M  51.8M
  mirror     207G   237G  5.44K  4.56K  25.0M  26.2M
    da3         -      -  2.50K  2.37K  12.3M  26.2M
    da5         -      -  2.65K  2.37K  13.3M  26.1M
  mirror     207G   237G  5.42K  4.39K  25.3M  25.6M
    da7         -      -  2.62K  2.40K  13.2M  25.6M
    da8         -      -  2.57K  2.40K  12.5M  25.6M

After a couple of minutes (usually between 2 and 10) with drive %busy and CPU load around 70%, the system hangs.
During the hang, one of two scenarios plays out (possibly more, but these are the ones officially witnessed):

Scenario 1) Processes continue to function as long as they don't touch the storage. One drive immediately hits 0 r/s 0 w/s, the rest of the drives quickly fall to 0 r/s 0 w/s (maybe within 1-5 seconds), and the lights on the chassis for the drives go black. Any commands I enter in the terminal hang after the newline.

During Scenario 1 - gstat output (notice the L(q) column for da7):

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| da3
    0      0      0      0    0.0      0      0    0.0    0.0| da5
    8      0      0      0    0.0      0      0    0.0    0.0| da7
    0      0      0      0    0.0      0      0    0.0    0.0| da8

Scenario 2) Some processes continue to function in a read-only state, but all writes for all drives fall to 0.

During Scenario 2 - gstat output (again notice the L(q) column, this time for da3):

dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   69      0      0      0    0.0      0      0    0.0    0.0| da3
    0   3207   3207  45727    0.2      0      0    0.0   37.8| da5
    0   2753   2753  24673    0.2      0      0    0.0   23.1| da7
    0   3027   3027  26425    0.2      0      0    0.0   28.6| da8

About 60 seconds later the system becomes responsive again (I assume kern.cam.da.default_timeout is the 60-second timer, although I haven't changed this to confirm), and a bunch of error logs appear for the drive that had a number in column L(q) > 0 (da7 in scenario 1 last night, da3 in scenario 2 today). Any commands I typed in the terminal finally fire off, and the cpu and disk usage quickly climb back to 70-80%+. If the errors are bad enough (or possibly if two drives hang simultaneously?) it'll show read and/or write errors in zpool status.

Failed attempts to repair: replaced the backplane. Full memory tests. Upgraded the 3610s to the latest firmware. Used Node B in place of Node A. Flashed the controller with the latest IT firmware from SuperMicro. Swapped power supplies. Pulled one proc off Node A and half the memory, put that into Node B, and tried the tests again. Set kern.cam.da.X.delete_method=DISABLE. Probably tuned a couple dozen other cam and zfs sysctl/loader variables in testing.

Nothing stopped these errors.

Until I ran this:

# camcontrol tags daX -N 1
(where X is 3, 5, 7 and 8)

With tags set to 1 for each device, all errors for the last 18 hours have ceased (the system ran all night at load). I've not spent the time yet to determine how much that affects performance, but it is definitely working.

To reproduce this bug, today I ran:

# camcontrol tags daX -N 8

on all drives. A few minutes later, a drive hung (see the gstat output for da3 above in scenario 2, which I had copied just seconds before the errors that follow):

Jun  3 13:34:10 s18 kernel: (da3:mpr0:0:12:0): READ(10). CDB: 28 00 09 cc 97 30 00 00 08 00 length 4096 SMID 961 command timeout cm 0xfffffe0000d2ad50 ccb 0xfffff807bfcef800 target 12, handle(0x000d)
Jun  3 13:34:10 s18 kernel: mpr0: At enclosure level 0, slot 4, connector name (    )
Jun  3 13:34:10 s18 kernel: mpr0: timedout cm 0xfffffe0000d2ad50 allocated tm 0xfffffe0000cdc150
Jun  3 13:34:10 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0000d2ad50
Jun  3 13:34:10 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 12
[...snipped about 50 more lines for brevity]

With 1 tag, gstat still shows L(q) values that jump into the 70s at times, but from what I understand of the source, devstat calculates the queued operations separately from camcontrol's simultaneous dev_openings. So I assume that's nothing to be concerned about, but I figured I'd mention it.

# camcontrol tags da7 -v
(pass7:mpr0:0:16:0): dev_openings  0
(pass7:mpr0:0:16:0): dev_active    1
(pass7:mpr0:0:16:0): allocated     1
(pass7:mpr0:0:16:0): queued        0
(pass7:mpr0:0:16:0): held          0
(pass7:mpr0:0:16:0): mintags       2
(pass7:mpr0:0:16:0): maxtags       255
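[For anyone wanting to try the same workaround, a small sketch of how Kyle's setting could be applied across the pool members; the device names and tag count come from his mail, the loop itself is just illustration:]

#!/bin/sh
# Limit CAM tagged openings to 1 for each pool member, then show the
# resulting queue settings for one device to confirm the change.
for d in da3 da5 da7 da8; do
        camcontrol tags "$d" -N 1
done
camcontrol tags da3 -v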
Here is a full error message from last night on da7 (the gstat scenario 1 output above was copied right before these lines). Apologies for the verbosity, but it may be helpful to some; logged with dev.mpr.0.debug_level=0x037F:

Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15 5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
Jun  2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
Jun  2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00 allocated tm 0xfffffe0000cdd3b0
Jun  2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15 Aborting command 0xfffffe0000cefb00
Jun  2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 16
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35 e0 00 00 08 00 length 4096 SMID 562 command timeout cm 0xfffffe0000d0a1a0 ccb 0xfffff80daeb84000 target 16, handle(0x0011)
Jun  2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
Jun  2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d0a1a0 for processing by tm 0xfffffe0000cdd3b0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 command timeout cm 0xfffffe0000d1e720 ccb 0xfffff80628c3d800 target 16, handle(0x0011)
Jun  2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
Jun  2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d1e720 for processing by tm 0xfffffe0000cdd3b0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 98 00 00 08 00 length 4096 SMID 571 command timeout cm 0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 target 16, handle(0x0011)
Jun  2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
Jun  2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d0ad70 for processing by tm 0xfffffe0000cdd3b0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 90 00 00 08 00 length 4096 SMID 540 command timeout cm 0xfffffe0000d084c0 ccb 0xfffff80288e58000 target 16, handle(0x0011)
Jun  2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
Jun  2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d084c0 for processing by tm 0xfffffe0000cdd3b0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 88 00 00 08 00 length 4096 SMID 817 command timeout cm 0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 target 16, handle(0x0011)
Jun  2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
Jun  2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d1f050 for processing by tm 0xfffffe0000cdd3b0
Jun  2 19:48:42 s18 kernel: mpr0: EventReply :
Jun  2 19:48:42 s18 kernel:     EventDataLength: 2
Jun  2 19:48:42 s18 kernel:     AckRequired: 0
Jun  2 19:48:42 s18 kernel:     Event: SasDiscovery (0x16)
Jun  2 19:48:42 s18 kernel:     EventContext: 0x0
Jun  2 19:48:42 s18 kernel:     Flags: 1
Jun  2 19:48:42 s18 kernel:     ReasonCode: Discovery Started
Jun  2 19:48:42 s18 kernel:     PhysicalPort: 0
Jun  2 19:48:42 s18 kernel:     DiscoveryStatus: 0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11 8d fe 50 00 00 10 00 length 8192 SMID 186 completed cm 0xfffffe0000ceb420 ccb 0xfffff80cd1d88800 during recovery ioc 804b scsi 0 state c xfer 0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11 8d fe 50 00 00 10 00 length 8192 SMID 186 terminated ioc 804b scsi 0 state c xfer 0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 98 00 00 08 00 length 4096 SMID 571 completed timedout cm 0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 during recovery ioc 804b scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 98 00 00 08 00 length 4096 SMID 571 terminated ioc 804b scsi 0 state c xfer 0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 90 00 00 08 00 length 4096 SMID 540 completed timedout cm 0xfffffe0000d084c0 ccb 0xfffff80288e58000 during recovery ioc 804b scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 90 00 00 08 00 length 4096 SMID 540 terminated ioc 804b scsi 0 state c xfer 0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 88 00 00 08 00 length 4096 SMID 817 completed timedout cm 0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 during recovery ioc 804b scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 88 00 00 08 00 length 4096 SMID 817 terminated ioc 804b scsi 0 state c xfer 0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15 5a 18 c0 00 00 08 00 length 4096 SMID 240 completed timedout cm 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 during recovery ioc 8048 scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35 e0 00 00 08 00 length 4096 SMID 562 completed timedout cm 0xfffffe0000d0a1a0 ccb 0xfffff80daeb84000 during recovery ioc 804b scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35 e0 00 00 08 00 length 4096 SMID 562 terminated ioc 804b scsi 0 state c xfer 0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 completed timedout cm 0xfffffe0000d1e720 ccb 0xfffff80628c3d800 during recovery ioc 804b scsi 0
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 terminated ioc 804b scsi 0 state c xfer 0
Jun  2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 abort TaskMID 240 status 0x0 code 0x0 count 7
Jun  2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 finished recovery after aborting TaskMID 240
Jun  2 19:48:42 s18 kernel: mpr0: Unfreezing devq for target ID 16
Jun  2 19:48:42 s18 kernel: mpr0: EventReply :
Jun  2 19:48:42 s18 kernel:     EventDataLength: 4
Jun  2 19:48:42 s18 kernel:     AckRequired: 0
Jun  2 19:48:42 s18 kernel:     Event: SasTopologyChangeList (0x1c)
Jun  2 19:48:42 s18 kernel:     EventContext: 0x0
Jun  2 19:48:42 s18 kernel:     EnclosureHandle: 0x2
Jun  2 19:48:42 s18 kernel:     ExpanderDevHandle: 0x9
Jun  2 19:48:42 s18 kernel:     NumPhys: 31
Jun  2 19:48:42 s18 kernel:     NumEntries: 1
Jun  2 19:48:42 s18 kernel:     StartPhyNum: 8
Jun  2 19:48:42 s18 kernel:     ExpStatus: Responding (0x3)
Jun  2 19:48:42 s18 kernel:     PhysicalPort: 0
Jun  2 19:48:42 s18 kernel:     PHY[8].AttachedDevHandle: 0x0011
Jun  2 19:48:42 s18 kernel:     PHY[8].LinkRate: 12.0Gbps (0xbb)
Jun  2 19:48:42 s18 kernel:     PHY[8].PhyStatus: PHYLinkStatusChange
Jun  2 19:48:42 s18 kernel: mpr0: (73)->(mprsas_fw_work) Working on Event: [16]
Jun  2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Event Free: [16]
Jun  2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Working on Event: [1c]
Jun  2 19:48:42 s18 kernel: mpr0: (75)->(mprsas_fw_work) Event Free: [1c]
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15 5a 18 c0 00 00 08 00
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): CAM status: Command timeout
Jun  2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): Retrying command
Jun  2 19:48:43 s18 kernel: mpr0: EventReply :
Jun  2 19:48:43 s18 kernel:     EventDataLength: 2
Jun  2 19:48:43 s18 kernel:     AckRequired: 0
Jun  2 19:48:43 s18 kernel:     Event: SasDiscovery (0x16)
Jun  2 19:48:43 s18 kernel:     EventContext: 0x0
Jun  2 19:48:43 s18 kernel:     Flags: 0
Jun  2 19:48:43 s18 kernel:     ReasonCode: Discovery Complete
Jun  2 19:48:43 s18 kernel:     PhysicalPort: 0
Jun  2 19:48:43 s18 kernel:     DiscoveryStatus: 0
Jun  2 19:48:43 s18 kernel: mpr0: (75)->(mprsas_fw_work) Working on Event: [16]
Jun  2 19:48:43 s18 kernel: mpr0: (76)->(mprsas_fw_work) Event Free: [16]
Jun  2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11 8d fe 50 00 00 10 00
Jun  2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): CAM status: SCSI Status Error
Jun  2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI status: Check Condition
Jun  2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): Retrying command (per sense data)

I'm not sure if this is an Intel, CAM, or Avago issue (or something else?), but I'd be happy to help figure it out if anyone has suggestions for me to test.

Thanks!
-Kyle

From owner-freebsd-scsi@freebsd.org Fri Jun 3 21:49:33 2016
From: Steven Hartland <killing@multiplay.co.uk>
To: freebsd-scsi@freebsd.org
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
Date: Fri, 3 Jun 2016 22:49:29 +0100

First thing would be to run gstat with -d to see if you're actually stacking up deletes, a symptom of which can be r/w dropping to zero. If you are seeing significant deletes, it could be a FW issue on the drives. If you can eliminate that, you could still be seeing a cabling / backplane issue.
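[A quick way to run that check: gstat's -d flag adds delete-operation statistics to the display; the -f regex limiting output to the pool members is just an illustrative convenience.]

# Watch delete (BIO_DELETE) activity alongside reads/writes; if delete
# ops climb while r/s and w/s fall to zero, deletes are stacking up.
gstat -d -f 'da(3|5|7|8)$'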
If you can eliminate that you could still be seeing a cabling / backplane issue. SSDs stress the signalling much more than HDDs, and on several occasions we've had to replace backplanes a few times until it was fixed.

The other thing to check is the FW version on the controller and on the drives: are you up to date? (A quick sketch for checking both is at the end of this message.)

On 03/06/2016 21:36, list-news wrote:
> Greetings!
>
> For the last month I've been fighting with SAS 3008 issues on a
> SuperMicro 2028TP-DECR (2-node Twin server). The system is running
> 10.3-RELEASE #1 r298749.
>
> The errors we see on this system look very similar to the errors
> posted here back in March:
> https://lists.freebsd.org/pipermail/freebsd-scsi/2016-March/007000.html
>
> The system hosts a large postgresql database and an application which
> spends most of its days calculating log entries on a couple hundred GB
> of tables. Initially this system had two mirrored WD HDD boot drives
> and eight Intel 730 240 GB SSDs in a zfs pool with 4 mirrored vdevs.
> Shortly after install/setup, a couple of those Intel 730s began
> showing READ & WRITE errors displayed with zpool status; additionally,
> mpr0 messages were showing drive timeouts when attempting READ(10),
> WRITE(10) and SYNCHRONIZE CACHE(10), such as this:
>
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15 5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00 allocated tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15 Aborting command 0xfffffe0000cefb00
> Jun 2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 16
> [...snipped about 50 more lines for brevity]
>
> As a solution, the immediate thought was that the drives were faulty.
> We decided to replace some of the drives with Intel 480 GB DC3610s.
> After replacing 4 of the drives we noticed the 3610s were also
> throwing errors. Read/write error counts displayed with zpool status
> would climb into the thousands in a night of calculations, with dozens
> of mpr errors in the logs. This was happening to all of the drives in
> the array. Some are more likely than others, but given enough time,
> all drives would display the errors. This provoked some suspicion
> that the problem was backplane and/or controller related and the
> drives were fine (or possibly the drive firmware is the same in the
> 730s and 3610s and both exhibit this problem, but the backplane did
> seem more likely at the time).
>
> We moved the Node A system sled (cpu/mem/sas/mb) into the Node B
> chassis slot to see if the errors would change at all (thinking maybe
> some SATA ports were bad on the backplane). We hit a bit of luck and
> noticed the boot drives, when inserted into Node B drive sleds 1&2,
> wouldn't fire up and were undetectable by the SAS controller.
> SuperMicro concluded this was likely a backplane issue and the
> backplane was replaced. The new backplane came in, and the drives in
> Node B sleds 1 / 2 are now visible to the controller during boot,
> confirming the backplane did have a problem. (This is important to
> mention as a commonly suggested internet fix for similar timeout
> errors is to replace the backplane and/or cables.)
>
> Unfortunately, the luck ran out: once we put the system back under
> load, the errors came back. I rebuilt the pool using just four 3610s
> as 2x mirrored vdevs and copied all the data over.
> Here is the pool as of yesterday:
>
>   pool: zp1
>  state: ONLINE
>   scan: resilvered 107M in 0h0m with 0 errors on Thu Jun 2 14:52:43 2016
> config:
>
>         NAME        STATE   READ WRITE CKSUM
>         zp1         ONLINE     0     0     0
>           mirror-0  ONLINE     0     0     0
>             da3     ONLINE     0     0     0
>             da5     ONLINE     0     0     0
>           mirror-1  ONLINE     0     0     0
>             da7     ONLINE     0     0     0
>             da8     ONLINE     0     0     0
>
> When performing logging calculations, the system looks like this
> pretty much continuously (via gstat):
>
> L(q)  ops/s   r/s   kBps  ms/r   w/s   kBps  ms/w  %busy Name
>    7   8175  1374   7526   0.5  3000  48931   0.5   65.2| da3
>    6   8337  1542   8025   0.5  2995  48947   0.5   68.1| da5
>    2   8740  1529   8584   0.5  3138  52564   0.4   72.6| da7
>    3   8637  1427   7961   0.5  3136  52564   0.4   65.6| da8
>
> And zpool iostat (alloc, free, read/write ops, read/write bandwidth):
>
> zp1         414G  474G  10.9K  8.94K  50.2M  51.8M
>   mirror    207G  237G  5.44K  4.56K  25.0M  26.2M
>     da3        -     -  2.50K  2.37K  12.3M  26.2M
>     da5        -     -  2.65K  2.37K  13.3M  26.1M
>   mirror    207G  237G  5.42K  4.39K  25.3M  25.6M
>     da7        -     -  2.62K  2.40K  13.2M  25.6M
>     da8        -     -  2.57K  2.40K  12.5M  25.6M
>
> After a couple of minutes (usually between 2 and 10) at the above
> %busy and a cpu load around 70%, the system hangs.
>
> During the hang, one of two scenarios plays out (possibly more, but
> these are the ones officially witnessed):
>
> Scenario 1) Processes continue to function as long as they don't touch
> the storage. One drive immediately hits 0 r/s and 0 w/s, the rest of
> the drives quickly fall to 0 r/s 0 w/s (maybe within 1-5 seconds), and
> the drive lights on the chassis go black. Any commands I enter in the
> terminal hang after the newline.
>
> During Scenario 1 - gstat output (notice the L(q) column for da7):
> L(q)  ops/s   r/s   kBps  ms/r   w/s   kBps  ms/w  %busy Name
>    0      0     0      0   0.0     0      0   0.0    0.0| da3
>    0      0     0      0   0.0     0      0   0.0    0.0| da5
>    8      0     0      0   0.0     0      0   0.0    0.0| da7
>    0      0     0      0   0.0     0      0   0.0    0.0| da8
>
> Scenario 2) Some processes continue to function in a read-only state,
> but all writes for all drives fall to 0.
>
> During Scenario 2 - gstat output (again notice the L(q) column, for da3):
> dT: 1.002s w: 1.000s
> L(q)  ops/s   r/s   kBps  ms/r   w/s   kBps  ms/w  %busy Name
>   69      0     0      0   0.0     0      0   0.0    0.0| da3
>    0   3207  3207  45727   0.2     0      0   0.0   37.8| da5
>    0   2753  2753  24673   0.2     0      0   0.0   23.1| da7
>    0   3027  3027  26425   0.2     0      0   0.0   28.6| da8
>
> About 60 seconds later the system becomes responsive again (I assume
> kern.cam.da.default_timeout is the 60-second timer, although I haven't
> changed this to confirm), and a bunch of error logs appear for the
> drive that had a number in column L(q) > 0 (da7 in scenario 1 last
> night, da3 in scenario 2 today). Any commands I typed in the terminal
> finally fire off, and the cpu and disk usage quickly climb back to
> 70-80%+. If the errors are bad enough (or possibly if two drives hang
> simultaneously?) it'll show read and/or write errors in zpool status.
>
> Failed attempts to repair: replaced the backplane. Full memory tests.
> Upgraded the 3610s to the latest firmware. Used Node B in place of
> Node A. Flashed the controller with the latest IT firmware from
> SuperMicro. Swapped power supplies. Pulled one proc off Node A and
> half the memory, put that into Node B and tried the tests again. Set
> kern.cam.da.X.delete_method=DISABLE. Probably tuned a couple dozen
> other cam and zfs sysctl/loader variables in testing.
>
> Nothing stopped these errors.
>
> Until I ran this:
> # camcontrol tags daX -N 1
> (where X is 3, 5, 7 and 8)
>
> With tags set to 1 for each device, all errors for the last 18 hours
> have ceased (the system ran all night at load).
> I've not spent the time yet to determine how much that affects
> performance, but it is definitely working.
>
> To reproduce this bug, today I ran:
> # camcontrol tags daX -N 8
>
> on all drives. A few minutes later, the drive hangs (see the gstat
> output for da3 above in scenario 2, which I had copied just seconds
> before the errors that follow):
>
> Jun 3 13:34:10 s18 kernel: (da3:mpr0:0:12:0): READ(10). CDB: 28 00 09 cc 97 30 00 00 08 00 length 4096 SMID 961 command timeout cm 0xfffffe0000d2ad50 ccb 0xfffff807bfcef800 target 12, handle(0x000d)
> Jun 3 13:34:10 s18 kernel: mpr0: At enclosure level 0, slot 4, connector name ( )
> Jun 3 13:34:10 s18 kernel: mpr0: timedout cm 0xfffffe0000d2ad50 allocated tm 0xfffffe0000cdc150
> Jun 3 13:34:10 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0000d2ad50
> Jun 3 13:34:10 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 12
> [...snipped about 50 more lines for brevity]
>
> With 1 tag, gstat still shows L(q) values that jump into the 70s at
> times, but from what I understand of the source, devstat calculates
> the queued operations separately from camcontrol's simultaneous
> dev_openings. So I assume that's nothing to be concerned about, but I
> figured I'd mention it.
>
> # camcontrol tags da7 -v
>
> (pass7:mpr0:0:16:0): dev_openings  0
> (pass7:mpr0:0:16:0): dev_active    1
> (pass7:mpr0:0:16:0): allocated     1
> (pass7:mpr0:0:16:0): queued        0
> (pass7:mpr0:0:16:0): held          0
> (pass7:mpr0:0:16:0): mintags       2
> (pass7:mpr0:0:16:0): maxtags     255
>
> Here is a full error message from last night on da7 (gstat scenario 1
> above was copied right before these lines). Apologies for the
> verbosity, but it may be helpful to some; logged with
> dev.mpr.0.debug_level=0x037F:
>
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15 5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00 allocated tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15 Aborting command 0xfffffe0000cefb00
> Jun 2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 16
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35 e0 00 00 08 00 length 4096 SMID 562 command timeout cm 0xfffffe0000d0a1a0 ccb 0xfffff80daeb84000 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d0a1a0 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 command timeout cm 0xfffffe0000d1e720 ccb 0xfffff80628c3d800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d1e720 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 98 00 00 08 00 length 4096 SMID 571 command timeout cm 0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d0ad70 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 90 00 00 08 00 length 4096 SMID 540 command timeout cm 0xfffffe0000d084c0 ccb 0xfffff80288e58000 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d084c0 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 88 00 00 08 00 length 4096 SMID 817 command timeout cm 0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d1f050 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: mpr0: EventReply :
> Jun 2 19:48:42 s18 kernel: EventDataLength: 2
> Jun 2 19:48:42 s18 kernel: AckRequired: 0
> Jun 2 19:48:42 s18 kernel: Event: SasDiscovery (0x16)
> Jun 2 19:48:42 s18 kernel: EventContext: 0x0
> Jun 2 19:48:42 s18 kernel: Flags: 1
> Jun 2 19:48:42 s18 kernel: ReasonCode: Discovery Started
> Jun 2 19:48:42 s18 kernel: PhysicalPort: 0
> Jun 2 19:48:42 s18 kernel: DiscoveryStatus: 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11 8d fe 50 00 00 10 00 length 8192 SMID 186 completed cm 0xfffffe0000ceb420 ccb 0xfffff80cd1d88800 during recovery ioc 804b scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11 8d fe 50 00 00 10 00 length 8192 SMID 186 terminated ioc 804b scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 98 00 00 08 00 length 4096 SMID 571 completed timedout cm 0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 during recovery ioc 804b scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 98 00 00 08 00 length 4096 SMID 571 terminated ioc 804b scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 90 00 00 08 00 length 4096 SMID 540 completed timedout cm 0xfffffe0000d084c0 ccb 0xfffff80288e58000 during recovery ioc 804b scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 90 00 00 08 00 length 4096 SMID 540 terminated ioc 804b scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 88 00 00 08 00 length 4096 SMID 817 completed timedout cm 0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 during recovery ioc 804b scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 88 00 00 08 00 length 4096 SMID 817 terminated ioc 804b scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15 5a 18 c0 00 00 08 00 length 4096 SMID 240 completed timedout cm 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 during recovery ioc 8048 scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35 e0 00 00 08 00 length 4096 SMID 562 completed timedout cm 0xfffffe0000d0a1a0 ccb 0xfffff80daeb84000 during recovery ioc 804b scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35 e0 00 00 08 00 length 4096 SMID 562 terminated ioc 804b scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 completed timedout cm 0xfffffe0000d1e720 ccb 0xfffff80628c3d800 during recovery ioc 804b scsi 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 terminated ioc 804b scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 abort TaskMID 240 status 0x0 code 0x0 count 7
> Jun 2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 finished recovery after aborting TaskMID 240
> Jun 2 19:48:42 s18 kernel: mpr0: Unfreezing devq for target ID 16
> Jun 2 19:48:42 s18 kernel: mpr0: EventReply :
> Jun 2 19:48:42 s18 kernel: EventDataLength: 4
> Jun 2 19:48:42 s18 kernel: AckRequired: 0
> Jun 2 19:48:42 s18 kernel: Event: SasTopologyChangeList (0x1c)
> Jun 2 19:48:42 s18 kernel: EventContext: 0x0
> Jun 2 19:48:42 s18 kernel: EnclosureHandle: 0x2
> Jun 2 19:48:42 s18 kernel: ExpanderDevHandle: 0x9
> Jun 2 19:48:42 s18 kernel: NumPhys: 31
> Jun 2 19:48:42 s18 kernel: NumEntries: 1
> Jun 2 19:48:42 s18 kernel: StartPhyNum: 8
> Jun 2 19:48:42 s18 kernel: ExpStatus: Responding (0x3)
> Jun 2 19:48:42 s18 kernel: PhysicalPort: 0
> Jun 2 19:48:42 s18 kernel: PHY[8].AttachedDevHandle: 0x0011
> Jun 2 19:48:42 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
> Jun 2 19:48:42 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
> Jun 2 19:48:42 s18 kernel: mpr0: (73)->(mprsas_fw_work) Working on Event: [16]
> Jun 2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Event Free: [16]
> Jun 2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Working on Event: [1c]
> Jun 2 19:48:42 s18 kernel: mpr0: (75)->(mprsas_fw_work) Event Free: [1c]
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15 5a 18 c0 00 00 08 00
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): CAM status: Command timeout
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): Retrying command
> Jun 2 19:48:43 s18 kernel: mpr0: EventReply :
> Jun 2 19:48:43 s18 kernel: EventDataLength: 2
> Jun 2 19:48:43 s18 kernel: AckRequired: 0
> Jun 2 19:48:43 s18 kernel: Event: SasDiscovery (0x16)
> Jun 2 19:48:43 s18 kernel: EventContext: 0x0
> Jun 2 19:48:43 s18 kernel: Flags: 0
> Jun 2 19:48:43 s18 kernel: ReasonCode: Discovery Complete
> Jun 2 19:48:43 s18 kernel: PhysicalPort: 0
> Jun 2 19:48:43 s18 kernel: DiscoveryStatus: 0
> Jun 2 19:48:43 s18 kernel: mpr0: (75)->(mprsas_fw_work) Working on Event: [16]
> Jun 2 19:48:43 s18 kernel: mpr0: (76)->(mprsas_fw_work) Event Free: [16]
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11 8d fe 50 00 00 10 00
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): CAM status: SCSI Status Error
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI status: Check Condition
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): Retrying command (per sense data)
>
> I'm not sure if this is an Intel, CAM, or Avago issue (or something
> else). I'd be happy to help figure it out if anyone has suggestions
> for me to test.
>
> Thanks!
>
> -Kyle
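(The sketch promised above for the firmware question; both commands appear later in this thread, so only the device name is an example:)

  # controller firmware and driver as exposed by mpr(4)
  sysctl dev.mpr.0.firmware_version dev.mpr.0.driver_version
  # drive firmware, to compare against the vendor's latest
  smartctl -a /dev/da3 | grep -i firmware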
From owner-freebsd-scsi@freebsd.org Fri Jun 3 22:33:20 2016
From: list-news <list-news@mindpackstudios.com>
To: freebsd-scsi@freebsd.org
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
Date: Fri, 3 Jun 2016 17:33:15 -0500
List-Id: SCSI subsystem

Thanks for the input Steven!

# gstat -do

L(q)  ops/s   r/s   kBps  ms/r   w/s   kBps  ms/w   d/s   kBps  ms/d   o/s  ms/o  %busy Name
   1   9671  2977  16275   0.3  3979  68439   0.3  2286  51067   1.8   429   0.5   77.8| da3
   0   9681  3039  19510   0.3  3927  66828   0.3  2286  51067   1.9   429   0.5   76.1| da5
   4   9727  3079  20295   0.3  3917  66485   0.3  2319  50943   1.8   412   0.5   77.4| da7
   1   9936  3246  18353   0.3  3958  66672   0.3  2319  50943   1.6   413   0.5   78.8| da8

# camcontrol tags daX -N 8
(5 minutes later)

# gstat -do (still running after the drives hang)

L(q)  ops/s   r/s   kBps  ms/r   w/s   kBps  ms/w   d/s   kBps  ms/d   o/s  ms/o  %busy Name
  69      0     0      0   0.0     0      0   0.0     0      0   0.0     0   0.0    0.0| da3
   0      0     0      0   0.0     0      0   0.0     0      0   0.0     0   0.0    0.0| da5
   0      0     0      0   0.0     0      0   0.0     0      0   0.0     0   0.0    0.0| da7
   0      0     0      0   0.0     0      0   0.0     0      0   0.0     0   0.0    0.0| da8

(60 seconds later)

Jun 3 17:01:36 s18 kernel: (da3:mpr0:0:12:0): READ(10). CDB: 28 00 1b 80 fc c0 00 00 10 00 length 8192 SMID 258 command timeout cm 0xfffffe0000cf12a0 ccb 0xfffff81010eae800 target 12, handle(0x000d)
Jun 3 17:01:36 s18 kernel: mpr0: At enclosure level 0, slot 4, connector name ( )
Jun 3 17:01:36 s18 kernel: mpr0: timedout cm 0xfffffe0000cf12a0 allocated tm 0xfffffe0000cdc2a0
Jun 3 17:01:36 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 2 Aborting command 0xfffffe0000cf12a0
Jun 3 17:01:36 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 12
[...]
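(Three knobs from this thread tie into what's shown above; a sketch, with da3 as an example device: the ~60-second gap between the hang and the error burst matches the da(4) default timeout, the verbose logs were captured by raising the mpr(4) debug level, and the delete_method override was one of the failed fixes listed earlier.)

  # the timeout that gates the ~60s recovery window
  sysctl kern.cam.da.default_timeout
  # verbose mpr(4) diagnostics, as used for the logs in this thread
  sysctl dev.mpr.0.debug_level=0x037F
  # per-device TRIM handling override tried earlier (runtime-only)
  sysctl kern.cam.da.3.delete_method=DISABLE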
# camcontrol tags daX -N 1
(20 minutes just passed, no errors)

# gstat -do (still running)

L(q)  ops/s   r/s   kBps  ms/r   w/s   kBps  ms/w   d/s   kBps  ms/d   o/s  ms/o  %busy Name
   2   6790  3647  67501   0.9  1627  21469   0.9  1319  24320   1.2   198   1.0   92.8| da3
   2   6496  3359  70132   0.9  1620  21473   1.0  1319  24320   1.2   198   1.0   93.4| da5
   4   6268  3236  58771   0.9  1455  23094   1.0  1389  25840   1.0   188   0.9   90.4| da7
   4   6675  3654  71142   0.8  1444  23070   0.9  1389  25840   1.0   188   1.0   89.0| da8

########## sysctl info

dev.mpr.0.driver_version: 09.255.01.00-fbsd
dev.mpr.0.firmware_version: 10.00.03.00

The controller was just flashed with SuperMicro's latest recommended firmware (10.00.03.00) about a week ago. I believe Avago may have something newer for the 3008, although I'm not certain how flashing something from Avago would affect the SuperMicro card. If need be, I can contact them and gather their thoughts on anything newer than 10.00.03. I'm not sure exactly how to upgrade the driver_version, or whether that's necessary/possible without upgrading to FreeBSD 11.

All the drives have been updated to the latest Intel firmware for the S3610:

# smartctl -a /dev/da3 | grep Firmware
Firmware Version: G2010150

Another point of interest: I have 6 of these nodes in operation (3 servers), all using Intel 730/3500/3610s, and all of them seem to be exhibiting similar errors over time. Arguably, they aren't under anywhere near the same I/O load, and they aren't seeing errors very often.

Here is a different server (same system type, 2028TP-DECR), using an older mpr firmware (v6), having an error on May 9th and then again on Jun 3. da4 is an Intel 730 in this server, and the OS is 10.2-RELEASE #0 r286795:

May 9 11:21:41 s15 kernel: (noperiph:mpr0:0:4294967295:0): SMID 2 Aborting command 0xfffffe0000c0b8e0
May 9 11:21:41 s15 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 12
May 9 11:21:41 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0b 5c 9b 9b 00 00 08 00 length 4096 SMID 795 terminated ioc 804b scsi 0 state c xfer 0
May 9 11:21:41 s15 kernel: (da4:mpr0:0:12:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 578 ter(da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0b 9a f3 49 00 00 0c 00
May 9 11:21:41 s15 kernel: minated ioc 804b scsi 0 state c xfer 0
May 9 11:21:41 s15 kernel: (da4:mpr0:0:12:0): CAM status: Command timeout
May 9 11:21:41 s15 kernel: mpr0: (da4:Unfreezing devq for target ID 12
May 9 11:21:41 s15 kernel: mpr0:0:12:0): Retrying command
May 9 11:21:42 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0b 9a f3 49 00 00 0c 00
May 9 11:21:42 s15 kernel: (da4:mpr0:0:12:0): CAM status: SCSI Status Error
May 9 11:21:42 s15 kernel: (da4:mpr0:0:12:0): SCSI status: Check Condition
May 9 11:21:42 s15 kernel: (da4:mpr0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
May 9 11:21:42 s15 kernel: (da4:mpr0:0:12:0): Retrying command (per sense data)

Jun 3 12:32:06 s15 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5 Aborting command 0xfffffe0000bf46c0
Jun 3 12:32:06 s15 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 12
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0d e9 08 9e 00 00 58 00 length 45056 SMID 273 terminated ioc 804b scsi 0 state c xfer 0
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0d d3 24 03 00 00 08 00 length 4096 SMID 155 terminated ioc 804b scsi 0 state c xfer 0
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 845 terminated ioc 804b scsi 0 state c xfer 0
Jun 3 12:32:06 s15 kernel: mpr0: Unfreezing devq for target ID 12
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0d cf 4f 64 00 00 02 00
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): CAM status: Command timeout
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): Retrying command
Jun 3 12:32:07 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0d e9 08 9e 00 00 58 00
Jun 3 12:32:07 s15 kernel: (da4:mpr0:0:12:0): CAM status: SCSI Status Error
Jun 3 12:32:07 s15 kernel: (da4:mpr0:0:12:0): SCSI status: Check Condition
Jun 3 12:32:07 s15 kernel: (da4:mpr0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 3 12:32:07 s15 kernel: (da4:mpr0:0:12:0): Retrying command (per sense data)

-Kyle

On 6/3/16 4:49 PM, Steven Hartland wrote:
> First thing would be to run gstat with -d to see if you're actually
> stacking up deletes, a symptom of which can be r/w dropping to zero.
>
> If you are seeing significant deletes it could be a FW issue on the
> drives.
>
> If you can eliminate that you could still be seeing a cabling /
> backplane issue. SSDs stress the signalling much more than HDDs, and
> on several occasions we've had to replace backplanes a few times until
> it was fixed.
>
> The other thing to check is the FW version on the controller and on
> the drives: are you up to date?
>
> On 03/06/2016 21:36, list-news wrote:
>> [original message snipped; quoted in full earlier in this thread]
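(To recap the one change that reliably stopped the errors reported in this thread, as a runnable sketch; note that camcontrol tag settings are runtime-only and do not persist across reboots:)

  # pin each pool member to a single outstanding command
  for d in da3 da5 da7 da8; do camcontrol tags "$d" -N 1; done

Raising the count back to 8 (camcontrol tags daX -N 8) reproduced the hang within minutes, per the test described above.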
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"