From owner-svn-src-head@FreeBSD.ORG Sat Jul 30 21:17:07 2011 Return-Path: Delivered-To: svn-src-head@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1125B106566B; Sat, 30 Jul 2011 21:17:07 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id CBEAB8FC13; Sat, 30 Jul 2011 21:17:05 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id AAA00500; Sun, 31 Jul 2011 00:17:03 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1QnGu7-0008PW-9c; Sun, 31 Jul 2011 00:17:03 +0300 Message-ID: <4E3474CE.5090301@FreeBSD.org> Date: Sun, 31 Jul 2011 00:17:02 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:5.0) Gecko/20110706 Thunderbird/5.0 MIME-Version: 1.0 To: Alexander Motin References: <201107292030.p6TKUSaf064895@svn.freebsd.org> <20110730163723.GZ17489@deviant.kiev.zoral.com.ua> <4E346DD9.5030902@FreeBSD.org> In-Reply-To: <4E346DD9.5030902@FreeBSD.org> X-Enigmail-Version: 1.2pre Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: 7bit Cc: Kostik Belousov , svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org Subject: Re: svn commit: r224496 - head/sys/cam X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 30 Jul 2011 21:17:07 -0000 on 30/07/2011 23:47 Alexander Motin said the following: > Kostik Belousov wrote: >> On Fri, Jul 29, 2011 at 08:30:28PM +0000, Alexander Motin wrote: >>> Author: mav >>> Date: Fri Jul 29 20:30:28 2011 >>> New Revision: 224496 >>> URL: http://svn.freebsd.org/changeset/base/224496 >>> >>> Log: >>> In some cases failed SATA disks may report their presence, but don't >>> respond to any commands. I've found that because of multiple command >>> retries, each of which cause 30s timeout, bus reset and another retry or >>> requeue for many commands, it may take ages to eventually drop the >>> failed device. The odd thing is that those retries continue even after >>> XPT considered device as dead and invalidated it. >>> >>> This patch makes cam_periph_error() to block any command retries after >>> periph was marked as invalid. With that patch all activity completes in >>> 1-2 minutes, just after several timeouts, required to consider device >>> death. This should make ZFS, gmirror, graid, etc. operation more robust. >>> >>> Reviewed by: mjacob@ on scsi@ >>> >>> Approved by: re (kib) >>> >>> Modified: >>> head/sys/cam/cam_periph.c >> Amusingly, this commit makes my test machine to not boot. >> This is Ibex Peak PCH, with two SATA disks on the channels 0 and 1. >> >> It seems that geom thread 100012 owns GEOM topology lock, while sleeping >> in adaclose->cam_periph_getccb() : >> >> db> bt 100012 >> Tracing pid 12 tid 100012 td 0xfffffe00028a2000 >> sched_switch() at 0xffffffff8034a0c7 = sched_switch+0x157 >> mi_switch() at 0xffffffff803291fb = mi_switch+0x2eb >> sleepq_switch() at 0xffffffff803631f3 = sleepq_switch+0x123 >> sleepq_wait() at 0xffffffff80363eed = sleepq_wait+0x4d >> _sleep() at 0xffffffff80329b59 = _sleep+0x3b9 >> cam_periph_getccb() at 0xffffffff817ffc50 = cam_periph_getccb+0xa0 >> adaclose() at 0xffffffff8182c484 = adaclose+0xc4 >> g_disk_access() at 0xffffffff802bea74 = g_disk_access+0x1e4 >> g_access() at 0xffffffff802c519a = g_access+0x1ba >> g_dev_attrchanged() at 0xffffffff802bd1f6 = g_dev_attrchanged+0x96 >> g_dev_taste() at 0xffffffff802bd574 = g_dev_taste+0x284 >> g_new_provider_event() at 0xffffffff802c4ecd = g_new_provider_event+0xad >> g_run_events() at 0xffffffff802c0750 = g_run_events+0x250 >> fork_exit() at 0xffffffff802f0d99 = fork_exit+0x189 >> fork_trampoline() at 0xffffffff804ee3be = fork_trampoline+0xe >> --- trap 0, rip = 0, rsp = 0xffffff800025fd00, rbp = 0 --- >> >> (gdb) list *cam_periph_getccb+0xa0 >> 0x1c50 is in cam_periph_getccb (/usr/home/kostik/work/build/bsd/DEV/src/sys/modules/cam/../../cam/cam_periph.c:883). >> 882 >> 883 while (SLIST_FIRST(&periph->ccb_list) == NULL) { >> 884 if (periph->immediate_priority > priority) >> >> Reverting the rev. or not loading ahci.ko allows machine to boot. > > After many experiments I believe that problem is not related to this > change. I've managed to reproduce it depending on GEOM modules > registration order. After I disabled all GEOM modules and only geom_dev > left, problem became persistent. Specifics of the geom_dev is that it > opens device and closes it back without doing any I/O. That caused race > condition between CCB allocation for FLUSHCACHE execution in adaclose() > and higher-priority commands of device initialization sequence. Any I/O > scheduled before adaclose() closed that race, making problem rare. The > problem is specific to the ada, as for no other driver initialization > and payload requests may intersect in time. Alexander, somewhat contradicting your conclusions... can the following issue be also caused by similar mechanics? http://lists.freebsd.org/pipermail/freebsd-current/2010-October/020336.html > Attached patch solved the problem for me. Please try it and approve > commit if it works. > -- Andriy Gapon