Date: Sat, 30 Jul 2011 23:47:21 +0300 From: Alexander Motin <mav@FreeBSD.org> To: Kostik Belousov <kostikbel@gmail.com> Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org Subject: Re: svn commit: r224496 - head/sys/cam Message-ID: <4E346DD9.5030902@FreeBSD.org> In-Reply-To: <20110730163723.GZ17489@deviant.kiev.zoral.com.ua> References: <201107292030.p6TKUSaf064895@svn.freebsd.org> <20110730163723.GZ17489@deviant.kiev.zoral.com.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
This is a multi-part message in MIME format. --------------020700080908040403090808 Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: 7bit Kostik Belousov wrote: > On Fri, Jul 29, 2011 at 08:30:28PM +0000, Alexander Motin wrote: >> Author: mav >> Date: Fri Jul 29 20:30:28 2011 >> New Revision: 224496 >> URL: http://svn.freebsd.org/changeset/base/224496 >> >> Log: >> In some cases failed SATA disks may report their presence, but don't >> respond to any commands. I've found that because of multiple command >> retries, each of which cause 30s timeout, bus reset and another retry or >> requeue for many commands, it may take ages to eventually drop the >> failed device. The odd thing is that those retries continue even after >> XPT considered device as dead and invalidated it. >> >> This patch makes cam_periph_error() to block any command retries after >> periph was marked as invalid. With that patch all activity completes in >> 1-2 minutes, just after several timeouts, required to consider device >> death. This should make ZFS, gmirror, graid, etc. operation more robust. >> >> Reviewed by: mjacob@ on scsi@ >> >> Approved by: re (kib) >> >> Modified: >> head/sys/cam/cam_periph.c > Amusingly, this commit makes my test machine to not boot. > This is Ibex Peak PCH, with two SATA disks on the channels 0 and 1. > > It seems that geom thread 100012 owns GEOM topology lock, while sleeping > in adaclose->cam_periph_getccb() : > > db> bt 100012 > Tracing pid 12 tid 100012 td 0xfffffe00028a2000 > sched_switch() at 0xffffffff8034a0c7 = sched_switch+0x157 > mi_switch() at 0xffffffff803291fb = mi_switch+0x2eb > sleepq_switch() at 0xffffffff803631f3 = sleepq_switch+0x123 > sleepq_wait() at 0xffffffff80363eed = sleepq_wait+0x4d > _sleep() at 0xffffffff80329b59 = _sleep+0x3b9 > cam_periph_getccb() at 0xffffffff817ffc50 = cam_periph_getccb+0xa0 > adaclose() at 0xffffffff8182c484 = adaclose+0xc4 > g_disk_access() at 0xffffffff802bea74 = g_disk_access+0x1e4 > g_access() at 0xffffffff802c519a = g_access+0x1ba > g_dev_attrchanged() at 0xffffffff802bd1f6 = g_dev_attrchanged+0x96 > g_dev_taste() at 0xffffffff802bd574 = g_dev_taste+0x284 > g_new_provider_event() at 0xffffffff802c4ecd = g_new_provider_event+0xad > g_run_events() at 0xffffffff802c0750 = g_run_events+0x250 > fork_exit() at 0xffffffff802f0d99 = fork_exit+0x189 > fork_trampoline() at 0xffffffff804ee3be = fork_trampoline+0xe > --- trap 0, rip = 0, rsp = 0xffffff800025fd00, rbp = 0 --- > > (gdb) list *cam_periph_getccb+0xa0 > 0x1c50 is in cam_periph_getccb (/usr/home/kostik/work/build/bsd/DEV/src/sys/modules/cam/../../cam/cam_periph.c:883). > 882 > 883 while (SLIST_FIRST(&periph->ccb_list) == NULL) { > 884 if (periph->immediate_priority > priority) > > Reverting the rev. or not loading ahci.ko allows machine to boot. After many experiments I believe that problem is not related to this change. I've managed to reproduce it depending on GEOM modules registration order. After I disabled all GEOM modules and only geom_dev left, problem became persistent. Specifics of the geom_dev is that it opens device and closes it back without doing any I/O. That caused race condition between CCB allocation for FLUSHCACHE execution in adaclose() and higher-priority commands of device initialization sequence. Any I/O scheduled before adaclose() closed that race, making problem rare. The problem is specific to the ada, as for no other driver initialization and payload requests may intersect in time. Attached patch solved the problem for me. Please try it and approve commit if it works. -- Alexander Motin --------------020700080908040403090808 Content-Type: text/plain; name="adaschedule.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="adaschedule.patch" --- cam/ata/ata_da.c.prev 2011-07-30 22:22:21.000000000 +0300 +++ cam/ata/ata_da.c 2011-07-30 23:12:22.000000000 +0300 @@ -488,12 +488,20 @@ static void adaschedule(struct cam_periph *periph) { struct ada_softc *softc = (struct ada_softc *)periph->softc; + uint32_t prio; + /* Check if cam_periph_getccb() was called. */ + prio = periph->immediate_priority; + + /* Check if we have more work to do. */ if (bioq_first(&softc->bio_queue) || (!softc->trim_running && bioq_first(&softc->trim_queue))) { - /* Have more work to do, so ensure we stay scheduled */ - xpt_schedule(periph, CAM_PRIORITY_NORMAL); + prio = CAM_PRIORITY_NORMAL; } + + /* Schedule CCB if any of above is true. */ + if (prio != CAM_PRIORITY_NONE) + xpt_schedule(periph, prio); } /* --------------020700080908040403090808--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4E346DD9.5030902>