From owner-freebsd-scsi@freebsd.org Tue Feb 14 14:19:21 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1A509CDE399 for ; Tue, 14 Feb 2017 14:19:21 +0000 (UTC) (envelope-from crest@rlwinm.de) Received: from smtp.rlwinm.de (smtp.rlwinm.de [148.251.233.239]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id DB20818F3 for ; Tue, 14 Feb 2017 14:19:20 +0000 (UTC) (envelope-from crest@rlwinm.de) Received: from crest.local (unknown [87.253.189.132]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.rlwinm.de (Postfix) with ESMTPSA id 5FCF111156 for ; Tue, 14 Feb 2017 15:19:13 +0100 (CET) Subject: Re: multipath device never failing - loops over providers instead To: freebsd-scsi@freebsd.org References: <20170211045605.GA43225@FreeBSD.org> From: Jan Bramkamp Message-ID: <7dfa461d-ade3-0410-3ff1-540631561393@rlwinm.de> Date: Tue, 14 Feb 2017 15:19:13 +0100 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0) Gecko/20100101 Thunderbird/45.7.1 MIME-Version: 1.0 In-Reply-To: <20170211045605.GA43225@FreeBSD.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Feb 2017 14:19:21 -0000 On 11/02/2017 05:56, John wrote: > Hi Folks, > > Running 10.3-STABLE r308246 from Nov 3, 2016 > > I thought I saw a commit in this area a while back but I > cannot seem to find it nor is google helping.. > > I have SAS drives behind 2 multiplexers (4 paths total) which > are all configured similar to the following: > > # gmultipath status Z76 > Name Status Components > multipath/Z76 OPTIMAL da92 (ACTIVE) > da236 (PASSIVE) > da428 (PASSIVE) > da572 (PASSIVE) > > For each path on the components above, the following sequence occurs: > > kernel: (da92:mpr0:0:399:0): READ(10). CDB: 28 00 0b a7 20 c0 00 00 10 00 > kernel: (da92:mpr0:0:399:0): CAM status: SCSI Status Error > kernel: (da92:mpr0:0:399:0): SCSI status: Check Condition > kernel: (da92:mpr0:0:399:0): SCSI sense: HARDWARE FAILURE asc:32,0 (No defect spare location available) > kernel: (da92:mpr0:0:399:0): Info: 0xba720c0 > kernel: (da92:mpr0:0:399:0): Field Replaceable Unit: 157 > kernel: (da92:mpr0:0:399:0): Command Specific Info: 0x80010000 > kernel: (da92:mpr0:0:399:0): Actual Retry Count: 255 > kernel: (da92:mpr0:0:399:0): Retrying command (per sense data) > > After each path has failed, the following is seen: > > kernel: GEOM_MULTIPATH: Error 5, da92 in Z76 marked FAIL > kernel: GEOM_MULTIPATH: all paths in Z76 were marked FAIL, restore da572 > kernel: GEOM_MULTIPATH: all paths in Z76 were marked FAIL, restore da428 > kernel: GEOM_MULTIPATH: all paths in Z76 were marked FAIL, restore da236 > kernel: GEOM_MULTIPATH: da572 is now active path in Z76 > > and the entire failure loop occurs again. The multipath device > itself is never failed (so the zfs pool can never go into degraded mode, > the faulty drive replaced with a spare, etc). > > Once I pulled the drive the multipath device Z76 fails and > things sent as expected. > > It seems g_multipath_fault() in this instance should just fail the device. > > Does anyone have any pointers on this issue? This is a known bug in GEOM multipath. There are at least two open PRs mentioning exactly this problem. When i encountered it even prevented my system from booting into single user mode. As soon as GEOM multipath found the metadata over one path it consumed that path. Too bad that the tasting on of the new multipath provider triggered a read error, because GEOM multipath just entered an infinite retry loop over all known paths. Because of this bug GEOM multipath is unusable for production. I suspect that it wouldn't be too hard to fix if there is a way to attach some state (e.g. a bitmap of failed paths) to each BIO request. From owner-freebsd-scsi@freebsd.org Wed Feb 15 20:34:32 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3E7C7CE00A5; Wed, 15 Feb 2017 20:34:32 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citapm.icyb.net.ua (citapm.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 6335C7B6; Wed, 15 Feb 2017 20:34:27 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citapm.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id WAA05093; Wed, 15 Feb 2017 22:34:26 +0200 (EET) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1ce6HS-000Eeb-1M; Wed, 15 Feb 2017 22:34:26 +0200 To: freebsd-geom@FreeBSD.org, freebsd-scsi@FreeBSD.org From: Andriy Gapon Subject: cd + geom -> panic on media change Message-ID: <6a59787c-ecba-80db-a669-d0987466e3f8@FreeBSD.org> Date: Wed, 15 Feb 2017 22:33:05 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:45.0) Gecko/20100101 Thunderbird/45.6.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 15 Feb 2017 20:34:32 -0000 I had an Audio CD in a DVD drive. I booted with that CD in and this is how it was reported: cd0: Removable CD-ROM SCSI device cd0: 150.000MB/s transfers (SATA 1.x, UDMA5, ATAPI 12bytes, PIO 8192bytes) cd0: 392MB (200919 2048 byte sectors) I see that there is some special code in scsi_cd.c that sets sector size to 2352 for Audio CDs, but that was not reflected in the report quoted above. Later I popped out the CD (using the physical eject button, if that matters) and popped in a UDF formatted DVD disk. That resulted in the following panic: panic: wrong offset 472559440 for sectorsize 2048 KDB: stack backtrace: db_trace_self_wrapper() at 0xffffffff8043517b = db_trace_self_wrapper+0x2b/frame 0xfffffe0504bc8870 kdb_backtrace() at 0xffffffff80685289 = kdb_backtrace+0x39/frame 0xfffffe0504bc8920 vpanic() at 0xffffffff8064ce8c = vpanic+0x14c/frame 0xfffffe0504bc8960 panic() at 0xffffffff8064cbd3 = panic+0x43/frame 0xfffffe0504bc89c0 g_io_request() at 0xffffffff805c3b81 = g_io_request+0x3e1/frame 0xfffffe0504bc8a00 g_read_data() at 0xffffffff805c48e7 = g_read_data+0x77/frame 0xfffffe0504bc8a40 g_part_gpt_probe() at 0xffffffff805df0d1 = g_part_gpt_probe+0x111/frame 0xfffffe0504bc8a80 G_PART_PROBE() at 0xffffffff805daa0b = G_PART_PROBE+0x4b/frame 0xfffffe0504bc8aa0 g_part_probe() at 0xffffffff805da366 = g_part_probe+0xc6/frame 0xfffffe0504bc8af0 g_part_taste() at 0xffffffff805d8de7 = g_part_taste+0x147/frame 0xfffffe0504bc8b30 g_new_provider_event() at 0xffffffff805c7a0b = g_new_provider_event+0x10b/frame 0xfffffe0504bc8b50 one_event() at 0xffffffff805c2a4f = one_event+0xff/frame 0xfffffe0504bc8b70 g_run_events() at 0xffffffff805c2875 = g_run_events+0x65/frame 0xfffffe0504bc8b90 g_event_procbody() at 0xffffffff805c4f38 = g_event_procbody+0x58/frame 0xfffffe0504bc8ba0 fork_exit() at 0xffffffff80614800 = fork_exit+0xd0/frame 0xfffffe0504bc8bf0 fork_trampoline() at 0xffffffff80837a7e = fork_trampoline+0xe/frame 0xfffffe0504bc8bf0 The actual DVD has size 680368 x 2048. I did some basic math and 472559440 == 200919 * 2352 - 2048. Given that the code was probing GPT, the offset is consistent with mediasize still being the old size of the Audio CD while the sector size being 2048. Seems like the disk properties were not properly updated before posting the new provider event. Patches and suggestions are welcome :) Thank you! -- Andriy Gapon From owner-freebsd-scsi@freebsd.org Wed Feb 15 20:38:40 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A82F9CE012F; Wed, 15 Feb 2017 20:38:40 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citapm.icyb.net.ua (citapm.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id A92A2866; Wed, 15 Feb 2017 20:38:38 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citapm.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id WAA05104; Wed, 15 Feb 2017 22:38:37 +0200 (EET) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1ce6LV-000Eeq-BF; Wed, 15 Feb 2017 22:38:37 +0200 Subject: Re: cd + geom -> panic on media change To: freebsd-geom@FreeBSD.org, freebsd-scsi@FreeBSD.org References: <6a59787c-ecba-80db-a669-d0987466e3f8@FreeBSD.org> From: Andriy Gapon Message-ID: <07e08996-49d4-3817-cf2b-01504e25ca67@FreeBSD.org> Date: Wed, 15 Feb 2017 22:37:41 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:45.0) Gecko/20100101 Thunderbird/45.6.0 MIME-Version: 1.0 In-Reply-To: <6a59787c-ecba-80db-a669-d0987466e3f8@FreeBSD.org> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 15 Feb 2017 20:38:40 -0000 On 15/02/2017 22:33, Andriy Gapon wrote: > > I had an Audio CD in a DVD drive. > I booted with that CD in and this is how it was reported: > cd0: Removable CD-ROM SCSI device > cd0: 150.000MB/s transfers (SATA 1.x, UDMA5, ATAPI 12bytes, PIO 8192bytes) > cd0: 392MB (200919 2048 byte sectors) > > I see that there is some special code in scsi_cd.c that sets sector size to 2352 > for Audio CDs, but that was not reflected in the report quoted above. > > Later I popped out the CD (using the physical eject button, if that matters) and > popped in a UDF formatted DVD disk. That resulted in the following panic: > > panic: wrong offset 472559440 for sectorsize 2048 By the way, I see that all requests going down pass through g_io_check() which would reject such a request with EINVAL. Why do we have the KASSERT in g_io_request() then? > KDB: stack backtrace: > db_trace_self_wrapper() at 0xffffffff8043517b = db_trace_self_wrapper+0x2b/frame > 0xfffffe0504bc8870 > kdb_backtrace() at 0xffffffff80685289 = kdb_backtrace+0x39/frame 0xfffffe0504bc8920 > vpanic() at 0xffffffff8064ce8c = vpanic+0x14c/frame 0xfffffe0504bc8960 > panic() at 0xffffffff8064cbd3 = panic+0x43/frame 0xfffffe0504bc89c0 > g_io_request() at 0xffffffff805c3b81 = g_io_request+0x3e1/frame 0xfffffe0504bc8a00 > g_read_data() at 0xffffffff805c48e7 = g_read_data+0x77/frame 0xfffffe0504bc8a40 > g_part_gpt_probe() at 0xffffffff805df0d1 = g_part_gpt_probe+0x111/frame > 0xfffffe0504bc8a80 > G_PART_PROBE() at 0xffffffff805daa0b = G_PART_PROBE+0x4b/frame 0xfffffe0504bc8aa0 > g_part_probe() at 0xffffffff805da366 = g_part_probe+0xc6/frame 0xfffffe0504bc8af0 > g_part_taste() at 0xffffffff805d8de7 = g_part_taste+0x147/frame 0xfffffe0504bc8b30 > g_new_provider_event() at 0xffffffff805c7a0b = g_new_provider_event+0x10b/frame > 0xfffffe0504bc8b50 > one_event() at 0xffffffff805c2a4f = one_event+0xff/frame 0xfffffe0504bc8b70 > g_run_events() at 0xffffffff805c2875 = g_run_events+0x65/frame 0xfffffe0504bc8b90 > g_event_procbody() at 0xffffffff805c4f38 = g_event_procbody+0x58/frame > 0xfffffe0504bc8ba0 > fork_exit() at 0xffffffff80614800 = fork_exit+0xd0/frame 0xfffffe0504bc8bf0 > fork_trampoline() at 0xffffffff80837a7e = fork_trampoline+0xe/frame > 0xfffffe0504bc8bf0 > > The actual DVD has size 680368 x 2048. > I did some basic math and 472559440 == 200919 * 2352 - 2048. > Given that the code was probing GPT, the offset is consistent with mediasize > still being the old size of the Audio CD while the sector size being 2048. > Seems like the disk properties were not properly updated before posting the new > provider event. > > Patches and suggestions are welcome :) > Thank you! > -- Andriy Gapon