From owner-freebsd-scsi@FreeBSD.ORG Mon Apr 25 21:27:56 2011 Return-Path: Delivered-To: scsi@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B0C961065781; Mon, 25 Apr 2011 21:27:56 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 6E5F48FC16; Mon, 25 Apr 2011 21:27:56 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 0F64E46B9C; Mon, 25 Apr 2011 17:27:56 -0400 (EDT) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 999038A027; Mon, 25 Apr 2011 17:27:55 -0400 (EDT) From: John Baldwin To: Andre Albsmeier Date: Mon, 25 Apr 2011 17:03:18 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110325; KDE/4.5.5; amd64; ; ) References: <201102041444.p14EixJP087709@svn.freebsd.org> <201104190920.25924.jhb@freebsd.org> <20110420053226.GA22854@curry.mchp.siemens.de> In-Reply-To: <20110420053226.GA22854@curry.mchp.siemens.de> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201104251703.18518.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.6 (bigwig.baldwin.cx); Mon, 25 Apr 2011 17:27:55 -0400 (EDT) Cc: "svn-src-stable-7@freebsd.org" , "scsi@freebsd.org" Subject: Re: svn commit: r218277 - in stable/7/sys: kern sys X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 25 Apr 2011 21:27:56 -0000 On Wednesday, April 20, 2011 1:32:26 am Andre Albsmeier wrote: > On Tue, 19-Apr-2011 at 15:20:25 +0200, John Baldwin wrote: > > On Tuesday, April 19, 2011 8:56:48 am Andre Albsmeier wrote: > > > On Mon, 18-Apr-2011 at 15:18:25 +0200, John Baldwin wrote: > > > > On Monday, April 18, 2011 7:36:57 am Andre Albsmeier wrote: > > > > > On Fri, 15-Apr-2011 at 18:35:05 +0200, John Baldwin wrote: > > > > > > On Friday, April 15, 2011 9:25:25 am Andre Albsmeier wrote: > > > > > > > On Fri, 04-Feb-2011 at 14:44:59 +0000, John Baldwin wrote: > > > > > > > > Author: jhb > > > > > > > > Date: Fri Feb 4 14:44:59 2011 > > > > > > > > New Revision: 218277 > > > > > > > > URL: http://svn.freebsd.org/changeset/base/218277 > > > > > > > > > > > > > > > > Log: > > > > > > > > MFC 217075: > > > > > > > > Retire PCONFIG and leave the priority of thread0 alone when waiting for > > > > > > > > interrupt config hooks to execute. > > > > > > > > > > > > > > > > To preserve the KBI, I did not renumber priorities but simply removed > > > > > > > > PCONFIG. > > > > > > > > > > > > > > > > Modified: > > > > > > > > stable/7/sys/kern/subr_autoconf.c > > > > > > > > stable/7/sys/sys/priority.h > > > > > > > > Directory Properties: > > > > > > > > stable/7/sys/ (props changed) > > > > > > > > stable/7/sys/cddl/contrib/opensolaris/ (props changed) > > > > > > > > stable/7/sys/contrib/dev/acpica/ (props changed) > > > > > > > > stable/7/sys/contrib/pf/ (props changed) > > > > > > > > > > > > > > > > Modified: stable/7/sys/kern/subr_autoconf.c > > > > > > > > > > > > > > ============================================================================== > > > > > > > > --- stable/7/sys/kern/subr_autoconf.c Fri Feb 4 14:44:42 2011 > > > > > > (r218276) > > > > > > > > +++ stable/7/sys/kern/subr_autoconf.c Fri Feb 4 14:44:59 2011 > > > > > > (r218277) > > > > > > > > @@ -108,7 +108,7 @@ run_interrupt_driven_config_hooks(dummy) > > > > > > > > warned = 0; > > > > > > > > while (!TAILQ_EMPTY(&intr_config_hook_list)) { > > > > > > > > if (msleep(&intr_config_hook_list, &intr_config_hook_lock, > > > > > > > > - PCONFIG, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > > > + 0, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > > > EWOULDBLOCK) { > > > > > > > > mtx_unlock(&intr_config_hook_lock); > > > > > > > > warned++; > > > > > > > > > > > > > > > > > > > > > This broke several of my machines in a somewhat strange way: > > > > > > > > > > > > > > After upgrading them (17) to a recent 7-STABLE (as of 2011-04-12) > > > > > > > I noticed that some (4) of them didn't start. All 4 didn't find > > > > > > > their boot device anymore. What they all got in common is: > > > > > > > > > > > > > > - an Adaptec 2940 Ultra SCSI adapter > > > > > > > - two SCSI harddisks (da0 and da1) of various brands > > > > > > > - one SCSI CDROM drive (cd0) > > > > > > > > > > > > > > To be exact, none of the three devices (da0, da1, cd0) were > > > > > > > detected at all. Other machines with a similar configuration > > > > > > > (2940 and da0/da1) but _without_ the CDROM drive didn't have > > > > > > > any problems. So I simply removed the CDROM drives on the 4 > > > > > > > machines in question and they all booted again. > > > > > > > > > > > > > > Today I decided to dig into this and after reverting(*) the > > > > > > > above change, they worked with the CDROM again. I have cross- > > > > > > > checked it 3 times. No idea what's happening here... > > > > > > > > > > > > > > -Andre > > > > > > > > > > > > > > (*) To be honest, I use this patch so I had to modify only one file: > > > > > > > > > > > > > > --- sys/kern/subr_autoconf.c.ORI 2011-02-05 13:14:11.000000000 +0100 > > > > > > > +++ sys/kern/subr_autoconf.c 2011-04-15 14:34:31.000000000 +0200 > > > > > > > @@ -108,7 +108,7 @@ > > > > > > > warned = 0; > > > > > > > while (!TAILQ_EMPTY(&intr_config_hook_list)) { > > > > > > > if (msleep(&intr_config_hook_list, &intr_config_hook_lock, > > > > > > > - 0, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > > + PRI_MIN_KERN + 32, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > > EWOULDBLOCK) { > > > > > > > mtx_unlock(&intr_config_hook_lock); > > > > > > > warned++; > > > > > > > > > > > > Do you get any warnings about CAM timeouts, etc. when these probe? A verbose > > > > > > dmesg might be nice to look at if possible. > > > > > > > > > > OK, I have set up a machine for testing. In my other mail > > > > > I was wrong saying that the pass devices appear when using > > > > > the problematic kernel... > > > > > > > > > > Here are the dmesgs: > > > > > > > > > > - dmesg_bad is the original kernel as of Friday > > > > > - dmesg_ok is the patched kernel (see above) as of Friday > > > > > - dmesg.diff is the diff between both > > > > > > > > > > If you want me to try something just tell me... > > > > > > > > Hmmm, what if you make SCSI_DELAY larger? Also, can you let it fail the > > > > mount and drop into ddb and then get 'ps' output? > > > > > > As soon as I include the debugger into the kernel the problem > > > is gone. I have double-checked it two times now: With debugger > > > the drives are detected, without debugger mostly (but not always) > > > not. > > > > > > I currently have it running in an endless rebooting loop hoping, > > > that it fails eventually... > > > > Hummm. This seems like it is a timing related race. :( > > Success! Sometimes at night it finally panic'ed even with the > debugger in the kernel. Here is the output of 'ps' and some other > commands I remembered (no idea if any of these make sense in this > context :-)). It is still in this state with the serial console > attached so just tell me what to type ;-). > > > KDB: enter: manual escape to debugger > [thread pid 1 tid 100001 ] > Stopped at kdb_enter_why+0x3b: xorl %eax,%eax > db> ps > pid ppid pgrp uid state wmesg wchan cmd > 35 0 0 0 RL [softdepflush] > 34 0 0 0 RL [syncer] > 33 0 0 0 RL [vnlru] > 32 0 0 0 RL [bufdaemon] > 31 0 0 0 RL [pagezero] > 30 0 0 0 RL [idlepoll] > 29 0 0 0 RL [vmdaemon] > 28 0 0 0 RL [pagedaemon] > 27 0 0 0 WL [irq1: atkbd0] > 26 0 0 0 WL [swi0: uart uart] > 25 0 0 0 SL - 0xc182a63c [fdc0] > 24 0 0 0 SL idle 0xc1829600 [aic_recovery0] > 23 0 0 0 WL [irq11: ahc0] > 22 0 0 0 SL idle 0xc1829600 [aic_recovery0] > 21 0 0 0 WL [irq10: fxp0] > 20 0 0 0 WL [irq9: acpi0 intsmb0] > 19 0 0 0 SL - 0xc181b800 [kqueue taskq] > 18 0 0 0 WL [swi6: task queue] > 17 0 0 0 WL [swi6: Giant taskq] > --More-- 9 0 0 0 RL [thread taskq] > 16 0 0 0 WL [swi5: Fast task queue] > 15 0 0 0 WL [swi2: cambio] > 8 0 0 0 SL ccb_scan 0xc0766714 [xpt_thrd] > 7 0 0 0 SL - 0xc181bd80 [acpi_task_2] > 6 0 0 0 SL - 0xc181bd80 [acpi_task_1] > 5 0 0 0 SL - 0xc181bd80 [acpi_task_0] > 14 0 0 0 SL - 0xc077be54 [yarrow] > 4 0 0 0 SL - 0xc077942c [g_down] > 3 0 0 0 SL - 0xc0779428 [g_up] > 2 0 0 0 SL - 0xc0779420 [g_event] > 13 0 0 0 WL [swi3: vm] > 12 0 0 0 LL *Giant 0xc1821dc0 [swi4: clock] > 11 0 0 0 WL [swi1: net] > 10 0 0 0 RL [idle] > 1 0 0 0 RL CPU 0 [swapper] > 0 0 0 0 SLs sched 0xc07794c0 [swapper] Hummm, I don't see any "important" threads in a runnable state that I would think were denied the ability to run by the priority change. I wonder what callout swi4 is trying to run (or what other callouts might be stuck waiting for Giant). I'd be tempted to add some KTR traces or debugging printfs to try to figure out what sequence of events is happening when (e.g., when the xpt thread kicks off each scan CCB, when xpt_rescan_done() is called, and probably some events in the ahc driver). -- John Baldwin