From owner-freebsd-scsi@freebsd.org Mon Apr 25 16:38:46 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1F11DB1CAC3 for ; Mon, 25 Apr 2016 16:38:46 +0000 (UTC) (envelope-from stephen.mcconnell@broadcom.com) Received: from mail-pf0-x229.google.com (mail-pf0-x229.google.com [IPv6:2607:f8b0:400e:c00::229]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id EE4441F9B for ; Mon, 25 Apr 2016 16:38:45 +0000 (UTC) (envelope-from stephen.mcconnell@broadcom.com) Received: by mail-pf0-x229.google.com with SMTP id n1so70522112pfn.2 for ; Mon, 25 Apr 2016 09:38:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=broadcom.com; s=google; h=from:to:references:in-reply-to:subject:date:message-id:mime-version :content-transfer-encoding:thread-index:content-language; bh=gtvsP/PRtyEWdBckYTQWmi311NvpJi+S6bsgV9uGNws=; b=DEHnjcFBrfC4ZNqcbBcxzLPXcnUMniCjB2gjA1VLHliiTDj5jvuj3/GaakZon192CO K5TLFh6lrlpn4jceqY2WiRfLmKfZCOnRzBaABkB3n6/CEMF2C2TmbrHkH+U1/D5tlC9i jMe4tQeO8FAqnOdS4QcAXe0hlzhrd3fO8MNKg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:from:to:references:in-reply-to:subject:date :message-id:mime-version:content-transfer-encoding:thread-index :content-language; bh=gtvsP/PRtyEWdBckYTQWmi311NvpJi+S6bsgV9uGNws=; b=Pm2VG2DVd+S43licRMT97gRVtijYuyNE7nMMiUB6s5xi9IFUbkr3c5oZnpafvZSAE9 mWCMCwYs6YvKwx3TfFoy5GPCAhnbGp5JgSmjYowsOGRiL6XjjKvNk3p801ZxelHte94m 9lH9/Gl5XQTjMxDmRVgnLsOkTLg7BXbcGahJahcLYC2HApYOn4os5SmJC/+1sodC0HPO mBTtnMszHcUEUINVAhBNZlA49t3mz41nNeOgYQVor8iGhhN/iQOGMmpS8OEtxRSHWn9E pjOl0DcS0kY94+8XppzEpwM0M//Thb5zYXGqDl6JsQnWagP96IqE7Tdzdzl/arxpixXe hTCQ== X-Gm-Message-State: AOPr4FWwwP9uMhysxHf8Uv2ztokLvSQrWzZmUv2r/BrDddaLoUnVIHO1G+BXNIwF0y984+nl X-Received: by 10.98.35.12 with SMTP id j12mr17614594pfj.73.1461602325005; Mon, 25 Apr 2016 09:38:45 -0700 (PDT) Received: from C5SDN12 ([192.19.220.253]) by smtp.gmail.com with ESMTPSA id d13sm29381315pfd.80.2016.04.25.09.38.42 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 25 Apr 2016 09:38:43 -0700 (PDT) From: "Stephen McConnell" To: "'Dan Langille'" , References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> In-Reply-To: <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> Subject: RE: terminated ioc 804b scsi 0 state c xfer 0 Date: Mon, 25 Apr 2016 10:38:41 -0600 Message-ID: <0d7401d19f10$ee329300$ca97b900$@broadcom.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Outlook 14.0 Thread-Index: AQJ4ucc4xV6h5VXTBEGxwBGwzw4I/wFGbfNtAl4WzZeeL4AMwA== Content-Language: en-us X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 25 Apr 2016 16:38:46 -0000 > -----Original Message----- > From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd- > scsi@freebsd.org] On Behalf Of Dan Langille > Sent: Monday, April 25, 2016 9:40 AM > To: freebsd-scsi@freebsd.org > Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 > > > On Apr 25, 2016, at 8:17 AM, Dan Langille wrote: > > > >> > >> On Apr 24, 2016, at 9:35 AM, Dan Langille wrote: > >> > >> More of the pasted output is also at > https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae > > and added smartctl output. > >> > >> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI SAS2008 > card. It's running a zfs root system. > >> > >> This morning the system was unresponsive via ssh. Attempts to log in at > the console did not yield a password prompt. > >> > >> A power cycle brought the system online. Inspecting /var/log/messages, I > found about 63,000 entries similar to those which appear below. > >> > >> zpool status of all are OK. A scrub is in progress for one pool (since before > this issue arose). da7 is in that pool. > >> > >> > >> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 > >> 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi > >> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): > >> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614 > >> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew > >> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20 > >> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0 > >> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 > >> 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b > >> scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): > >> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674 > >> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew > >> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 18 > >> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer > >> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: > >> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc > >> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel: > >> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 > >> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr > >> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b > >> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi 0 > >> state c xfer 0 > >> > >> Is this a cabling issue? The drive is a SATA device (smartctl output in the > URL above). Anyone familiar with these errors? > > > > This morning: > > > > 13410079654596185797 REMOVED 0 0 0 was /dev/da7p3 > > > > At least I know i'm looking for Serial Number: 13Q8PNBYS > > > > From the logs: > > > > Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0 > > Apr 25 05:34:50 knew kernel: da7: s/n > 13Q8PNBYS detached > > Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 > > d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0 > > state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): > > READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204 > > terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB: 28 > 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25 05:34:51 > knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue > Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 > 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b scsi 0 > state c xfer(da7: 0 > > Apr 25 05:34:51 knew kernel: mps1:0: (da7:mps1:0:17:0): READ(10). CDB: > 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b scsi 0 > state c xfer 17:0 > > Apr 25 05:34:51 knew kernel: 0): (da7:mps1:0:17:0): WRITE(10). CDB: 2a > 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b scsi 0 > state c xfeError 5, Periph was invalidated > > Apr 25 05:34:51 knew kernel: r 0 > > Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 > > d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel: > > (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length > > 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer > > (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr 25 > > 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1: > > mps1:0:IOCStatus = 0x4b while resetting device 0xa Apr 25 05:34:51 > > knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25 > > 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 05:34:51 > > knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00 > > 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: > > Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: > > (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 > > knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00 > > 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: > > Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: > > (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 > > knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00 > > 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: > Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: > (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 knew > kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 Apr > 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re- > queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, > Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR: Device > swap: provider da7p2 disconnected. > > Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS 'vdev is > removed, pool_guid=15378250086669402288 > vdev_guid=13410079654596185797'' > > Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr > > 25 05:34:51 knew ZFS: vdev is removed, pool_guid=15378250086669402288 > > vdev_guid=13410079654596185797 > > Current status: after powering off the box, reseating the cables for that drive, I > powered up the system and a resilver commenced which completed in 30 > minutes. > > Seems OK now. I am not sure if the two events are related. Recently, a bug was uncovered where a device is gets 'lost'. Here's what happens: The first message in your "failure on Monday" log is for 'mpssas_prepare_remove'. This message is likely logged because the FW sends an event to the driver that the device is no longer responsive (pulled, cable issue, or something else). When the driver gets this event, it sends a reset to the device to clear out any pending I/O. This is where the 'terminated ioc' messages come from. When the reset completes, the driver is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle for that disk is removed from FW's list. Then, when the device comes back on-line, everything is fine. But, with this bug, before that SAS_IO_UNIT message is sent to FW, the driver exits the function where that happens (mpssas_remove_device). This happens where you see the log message, "IOCStatus - 0x4b while resetting device 0x0a". The driver logs that message and then exits. What the driver should do is log that message and continue on to send the SAS_IO_UNIT message to FW. The fix is to remove the two lines in the driver shown here with '>>': if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) { mps_printf(sc, "IOCStatus = 0x%x while resetting device 0x%x\n", le16toh(reply->IOCStatus), handle); >> mpssas_free_tm(sc, tm); >> return; } A reboot will solve the problem, as you saw, but the real fix is to remove the DevHandle as described above. This fix will go into the driver before the next scheduled release and then MFC'd to 10.x. Steve > > -- > Dan Langille - BSDCan / PGCon > dan@langille.org > > > > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"