From owner-freebsd-scsi@freebsd.org Mon Apr 25 17:32:57 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D1B87B1C9CB for ; Mon, 25 Apr 2016 17:32:57 +0000 (UTC) (envelope-from scott4long@yahoo.com) Received: from nm23-vm8.bullet.mail.gq1.yahoo.com (nm23-vm8.bullet.mail.gq1.yahoo.com [98.136.217.87]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id ABC9410EE for ; Mon, 25 Apr 2016 17:32:57 +0000 (UTC) (envelope-from scott4long@yahoo.com) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1461605445; bh=1eyZoYwWqR4aQQTicpSZIZi7szLhyC4e/5ozQwpoDaE=; h=Subject:From:In-Reply-To:Date:Cc:References:To:From:Subject; b=CFeELlvO7riq5wmalhiae2VJ7B4KyqT6buLnmXySTT2Fo3k4uYKUQApGDtyvJtTK96nwKi2Xz0FqsAI21HcZ0wYvgrNpp6Jb1JMIj4ff3SKRoVKRfbDeNyqHZcmHvRoYxK5jmjBLPGRYbTteQxXgnHXosisXhbL8kmgIGQf6vhLmKwqIAXQSu7cjEf0OkWtFU5uC32y6DwwUwgvWdyhmPaBRkU44BZxfP0P5h1n0IbWx+So3Dnptkw7Qh1FeI8JfpFTOSXS4Kmr0fosVdzvO+zXg2IkzyX6F+CoyIGYyZ7QtU//n6/D4ghKqHzAw+ChtGddFNdNpsLjmK5q8Hs5B0g== Received: from [98.137.12.189] by nm23.bullet.mail.gq1.yahoo.com with NNFMP; 25 Apr 2016 17:30:45 -0000 Received: from [208.71.42.209] by tm10.bullet.mail.gq1.yahoo.com with NNFMP; 25 Apr 2016 17:30:45 -0000 Received: from [127.0.0.1] by smtp220.mail.gq1.yahoo.com with NNFMP; 25 Apr 2016 17:30:45 -0000 X-Yahoo-Newman-Id: 808136.73661.bm@smtp220.mail.gq1.yahoo.com X-Yahoo-Newman-Property: ymail-3 X-YMail-OSG: s1H13asVM1nR1lLpQ5XuVePBuoYy199BnDoHWphKCUNsqwP iC.LaxvB.X6fbTAxEZY09SutUwk94UPvN4.h.xkkm5kzmYbCwDaW2qjktMkC ETUBvpudXKaV721oxH4sotHdqlnR5gH2FPiGufHIynbo9MqUB8gsjS13h0eo nu73iCxHntNLL40y9PARobH.eYZU0J.6Fw27tp48uz_dKItp2LUfoaZBUo7h Po0mQeJf4iEShl0Y5zEs4Dpt.U3IB4TJORL7cg_QwP79wdolLbO9cO7tfJuF erW5NV3qZN12FHEQG0VHvqSfGHGAe4wIM6BOGa2JCd4VIxZBfsMs7ZAcs51O u7J2OZ3ml7B6dgNFvsgzXnheA9xWCx8ikSJEu5c3dAUOvxCwKgFCn92q6jQI 6wL9oPgTDfea.UP8FO4godlUXHpLaqCMDSOCti1IG.xNQmE8H7rrkcheRzOO 8FsUMv9UvmTTd4akJQDwDloQAkqsO9Ca7S2co1Zp9_7T0.Gar9b8IkNtdfCc jkdN0rZZPPZy.I5Ll5j0cm7gHcIJ9cJHFOEsQsy0- X-Yahoo-SMTP: clhABp.swBB7fs.LwIJpv3jkWgo2NU8- Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 From: Scott Long In-Reply-To: <0d7401d19f10$ee329300$ca97b900$@broadcom.com> Date: Mon, 25 Apr 2016 11:30:41 -0600 Cc: Dan Langille , freebsd-scsi@freebsd.org Content-Transfer-Encoding: quoted-printable Message-Id: <3D06CE25-4159-4F30-A8C5-8A188144681B@yahoo.com> References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> <0d7401d19f10$ee329300$ca97b900$@broadcom.com> To: Stephen McConnell X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 25 Apr 2016 17:32:57 -0000 > On Apr 25, 2016, at 10:38 AM, Stephen McConnell via freebsd-scsi = wrote: >=20 >=20 >=20 >> -----Original Message----- >> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd- >> scsi@freebsd.org] On Behalf Of Dan Langille >> Sent: Monday, April 25, 2016 9:40 AM >> To: freebsd-scsi@freebsd.org >> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 >>=20 >>> On Apr 25, 2016, at 8:17 AM, Dan Langille wrote: >>>=20 >>>>=20 >>>> On Apr 24, 2016, at 9:35 AM, Dan Langille wrote: >>>>=20 >>>> More of the pasted output is also at >> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae >> >> and added smartctl output. >>>>=20 >>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI = SAS2008 >> card. It's running a zfs root system. >>>>=20 >>>> This morning the system was unresponsive via ssh. Attempts to log = in at >> the console did not yield a password prompt. >>>>=20 >>>> A power cycle brought the system online. Inspecting = /var/log/messages, > I >> found about 63,000 entries similar to those which appear below. >>>>=20 >>>> zpool status of all are OK. A scrub is in progress for one pool = (since > before >> this issue arose). da7 is in that pool. >>>>=20 >>>>=20 >>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 = 00 >>>> 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b = scsi >>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): >>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614 >>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew >>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 = 20 >>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0 >>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 = 00 >>>> 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b >>>> scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel: = (da7:mps1:0:17:0): >>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674 >>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew >>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 = 18 >>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer >>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). = CDB: >>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc >>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel: >>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 >>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr >>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b >>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi = 0 >>>> state c xfer 0 >>>>=20 >>>> Is this a cabling issue? The drive is a SATA device (smartctl = output > in the >> URL above). Anyone familiar with these errors? >>>=20 >>> This morning: >>>=20 >>> 13410079654596185797 REMOVED 0 0 0 was /dev/da7p3 >>>=20 >>> At least I know i'm looking for Serial Number: 13Q8PNBYS >>>=20 >>> =46rom the logs: >>>=20 >>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun = 0 >>> Apr 25 05:34:50 knew kernel: da7: s/n >> 13Q8PNBYS detached >>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 >>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi = 0 >>> state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): >>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204 >>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). = CDB: > 28 >> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25 = 05:34:51 >> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue >> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). = CDB: 28 >> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b = scsi > 0 >> state c xfer(da7: 0 >>> Apr 25 05:34:51 knew kernel: mps1:0: (da7:mps1:0:17:0): READ(10). > CDB: >> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc = 804b > scsi 0 >> state c xfer 17:0 >>> Apr 25 05:34:51 knew kernel: 0): (da7:mps1:0:17:0): = WRITE(10). > CDB: 2a >> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b = scsi > 0 >> state c xfeError 5, Periph was invalidated >>> Apr 25 05:34:51 knew kernel: r 0 >>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 >>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel: >>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 = length >>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer >>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr = 25 >>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1: >>> mps1:0:IOCStatus =3D 0x4b while resetting device 0xa Apr 25 05:34:51 >>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25 >>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 = 05:34:51 >>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 = 00 >>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: >>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: >>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 >>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 = 00 >>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: >>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: >>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 >>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00 >>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: >> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: >> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 = knew >> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 = 00 > Apr >> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: = Unconditionally > Re- >> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error = 5, >> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR: = Device >> swap: provider da7p2 disconnected. >>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS = 'vdev > is >> removed, pool_guid=3D15378250086669402288 >> vdev_guid=3D13410079654596185797'' >>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr >>> 25 05:34:51 knew ZFS: vdev is removed, = pool_guid=3D15378250086669402288 >>> vdev_guid=3D13410079654596185797 >>=20 >> Current status: after powering off the box, reseating the cables for = that > drive, I >> powered up the system and a resilver commenced which completed in 30 >> minutes. >>=20 >> Seems OK now. I am not sure if the two events are related. >=20 > Recently, a bug was uncovered where a device is gets 'lost'. >=20 > Here's what happens: > The first message in your "failure on Monday" log is for > 'mpssas_prepare_remove'. This message is likely logged because the FW = sends > an event to the driver that the device is no longer responsive = (pulled, > cable issue, or something else). When the driver gets this event, it = sends > a reset to the device to clear out any pending I/O. This is where the > 'terminated ioc' messages come from. When the reset completes, the = driver > is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle = for > that disk is removed from FW's list. Then, when the device comes back > on-line, everything is fine. But, with this bug, before that = SAS_IO_UNIT > message is sent to FW, the driver exits the function where that = happens > (mpssas_remove_device). This happens where you see the log message, > "IOCStatus - 0x4b while resetting device 0x0a". The driver logs that > message and then exits. What the driver should do is log that message = and > continue on to send the SAS_IO_UNIT message to FW. The fix is to = remove the > two lines in the driver shown here with '>>': >=20 > if (le16toh(reply->IOCStatus) !=3D MPI2_IOCSTATUS_SUCCESS) { > mps_printf(sc, "IOCStatus =3D 0x%x while resetting = device > 0x%x\n", > le16toh(reply->IOCStatus), handle); >>> mpssas_free_tm(sc, tm); >>> return; > } >=20 > A reboot will solve the problem, as you saw, but the real fix is to = remove > the DevHandle as described above. This fix will go into the driver = before > the next scheduled release and then MFC'd to 10.x. >=20 Thanks for the diagnosis, Steve. I forgot about that case. We should = also make this chain of events more evident in the syslog, it=E2=80=99s very = confusing when it happens. I=E2=80=99m not exactly sure yet what it should look like. Scott