From: Jan Bramkamp <crest@rlwinm.de>
To: freebsd-scsi@freebsd.org
Subject: Re: multipath device never failing - loops over providers instead
Date: Tue, 14 Feb 2017 15:19:13 +0100
Message-ID: <7dfa461d-ade3-0410-3ff1-540631561393@rlwinm.de>
In-Reply-To: <20170211045605.GA43225@FreeBSD.org>

On 11/02/2017 05:56, John wrote:
> Hi Folks,
>
> Running 10.3-STABLE r308246 from Nov 3, 2016.
>
> I thought I saw a commit in this area a while back, but I
> cannot seem to find it, nor is Google helping.
>
> I have SAS drives behind 2 multiplexers (4 paths total), which
> are all configured similarly to the following:
>
> # gmultipath status Z76
>           Name   Status  Components
>  multipath/Z76  OPTIMAL  da92 (ACTIVE)
>                          da236 (PASSIVE)
>                          da428 (PASSIVE)
>                          da572 (PASSIVE)
>
> For each path on the components above, the following sequence occurs:
>
> kernel: (da92:mpr0:0:399:0): READ(10). CDB: 28 00 0b a7 20 c0 00 00 10 00
> kernel: (da92:mpr0:0:399:0): CAM status: SCSI Status Error
> kernel: (da92:mpr0:0:399:0): SCSI status: Check Condition
> kernel: (da92:mpr0:0:399:0): SCSI sense: HARDWARE FAILURE asc:32,0 (No defect spare location available)
> kernel: (da92:mpr0:0:399:0): Info: 0xba720c0
> kernel: (da92:mpr0:0:399:0): Field Replaceable Unit: 157
> kernel: (da92:mpr0:0:399:0): Command Specific Info: 0x80010000
> kernel: (da92:mpr0:0:399:0): Actual Retry Count: 255
> kernel: (da92:mpr0:0:399:0): Retrying command (per sense data)
>
> After each path has failed, the following is seen:
>
> kernel: GEOM_MULTIPATH: Error 5, da92 in Z76 marked FAIL
> kernel: GEOM_MULTIPATH: all paths in Z76 were marked FAIL, restore da572
> kernel: GEOM_MULTIPATH: all paths in Z76 were marked FAIL, restore da428
> kernel: GEOM_MULTIPATH: all paths in Z76 were marked FAIL, restore da236
> kernel: GEOM_MULTIPATH: da572 is now active path in Z76
>
> and the entire failure loop occurs again. The multipath device
> itself is never failed (so the ZFS pool can never go into degraded
> mode, the faulty drive replaced with a spare, etc.).
>
> Once I pulled the drive, the multipath device Z76 failed and
> things went as expected.
>
> It seems g_multipath_fault() in this instance should just fail the device.
>
> Does anyone have any pointers on this issue?

This is a known bug in GEOM multipath. There are at least two open PRs
describing exactly this problem. When I encountered it, it even prevented
my system from booting into single-user mode: as soon as GEOM multipath
found its metadata over one path, it consumed that path, but tasting the
new multipath provider triggered a read error, and GEOM multipath entered
an infinite retry loop over all known paths. Because of this bug, GEOM
multipath is unusable in production.

I suspect it wouldn't be too hard to fix if there were a way to attach
some state (e.g. a bitmap of already-failed paths) to each BIO request.
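
Something along these lines could break the loop. Below is a minimal
userland sketch of the idea; all names in it (struct request, pick_path(),
io_on_path(), MAX_PATHS) are made up for illustration and are not taken
from sys/geom/multipath/g_multipath.c:

/*
 * Sketch: attach a bitmap of already-failed paths to each request so
 * the retry logic can fail the request upward once every path has been
 * tried, instead of cycling over the same paths forever.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PATHS 4

struct request {
	uint32_t failed_paths;	/* bit i set => path i already failed */
};

/* Return the next untried path for this request, or -1 if none is left. */
static int
pick_path(const struct request *rq, int npaths)
{
	for (int i = 0; i < npaths; i++)
		if ((rq->failed_paths & (1u << i)) == 0)
			return (i);
	return (-1);
}

/* Simulate an I/O; here every path reports a medium error. */
static bool
io_on_path(int path)
{
	printf("I/O on path %d: error\n", path);
	return (false);
}

int
main(void)
{
	struct request rq = { .failed_paths = 0 };
	int path;

	while ((path = pick_path(&rq, MAX_PATHS)) != -1) {
		if (io_on_path(path))
			return (0);		/* success */
		rq.failed_paths |= 1u << path;	/* remember the failure */
	}
	/* Every path failed for this request: fail it upward with EIO. */
	printf("all paths failed: fail request with EIO\n");
	return (1);
}

With per-request state like this, the fault handling could fail the
request (and ultimately the multipath device) once the bitmap covers
every component, instead of restoring the paths and retrying forever.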