From: Andrew Boyer
Date: Fri, 12 Aug 2011 15:59:21 -0400
To: Andriy Gapon, Hans Petter Selasky
Cc: Vishal.Shah@netapp.com, freebsd-stable@freebsd.org, Steven Hartland, Eugene Grosbein, Jeremiah Lott
Subject: USB/coredump hangs in 8 and 9

Re: panic: bufwrite: buffer is not busy???
(originally on freebsd-net)
Re: debugging frequent kernel panics on 8.2-RELEASE (originally on freebsd-stable)
Re: System hang in USB umass module while processing panic (originally on freebsd-usb)

Hello Andriy and Hans,

Sorry for tying in so many discussions on this topic, but I think I have an explanation for the problems we have been reporting* with hanging coredumps on multicore systems on 8.2-RELEASE, and it has implications for Andriy's proposed scheduler patch** and for USB.

In today's 8.X and 9.X branches, nothing that I can find stops the other CPUs when the kernel panics, but many parts of the locking code get disabled (grep on 'panicstr'). The 'bufwrite: buffer is not busy???' panic is caused by the syncer encountering an error. If that happens when it's on the dumping CPU everything hangs. If it's running on a different CPU, it will be blocked and hidden by the panic_cpu spinlock in panic(), and the dump continues, polling every attached keyboard for a Ctl-C.

But, the new 8.X USB stack relies on multithreading. (The new stack is the variable that broke coredumps for us in the 7.1->8.2 transition, I think.) SVN 224223 fixes a hang that would happen when dumpsys() polls the USB keyboard (IPMI KVM, in our case). That helps, but it only gets as far as usb_process(), where it hangs in a loop around a cv_wait() call. This is easy to reproduce by adding code to the watchdog to break into the debugger if panicstr is set.

I am experimenting with Andriy's patch** to stop the scheduler and it seems to be most of the way there, stopping the CPUs and disabling the rest of locking. There are a few places that still reference panicstr, but that's minor.

These are the changes I made to the patch:

* Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is true, so that we don't hang up in USB.
  ukbd_yield() locks up in DROP_GIANT(), and if you skip ukbd_yield(), usbd_transfer_poll() locks up trying to drop mutexes.
* Changed the call to spinlock_enter() back to critical_enter(), so that interrupts stay enabled and the hardclock still functions.
* Added code in the beginning of panic() to switch to CPU 0, so that we're able to service the hardclock interrupts and so that watchdog panics get through.

This has worked 100% for me so far, although anyone using a USB keyboard or dump device would still be out of luck.

Thoughts? It seems like stopping all of the other CPUs is the right thing to do on a panic (what are they doing otherwise?). Are the USB issues fixable? If Andriy's patch gets committed it might just involve short-circuiting all of the locking in the polling path, but I haven't gotten that far yet. I bet dumping to NFS will have the same problem.

Thanks,
  Andrew

*  - http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/155421
** - http://people.freebsd.org/~avg/stop_scheduler_on_panic.8.x.diff

--------------------------------------------------
Andrew Boyer	aboyer@averesystems.com
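
For illustration only, the ukbd_do_poll() change from the first bullet can be modelled in userspace roughly as below. This is a sketch under assumptions, not the actual diff: the scheduler_stopped flag, the mock_* functions and the counters are hypothetical stand-ins, and the real ukbd_do_poll() in the kernel has a different signature. Only the shape of the early return is meant to match the description above.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the state that the stop-scheduler patch sets
 * during panic(). In the kernel, SCHEDULER_STOPPED() comes from the patch;
 * here it is just a plain flag so the sketch builds in userspace.
 */
static bool scheduler_stopped = false;
#define SCHEDULER_STOPPED() (scheduler_stopped)

/* Counters that let a caller see which mock paths were reached. */
static int yield_calls = 0;
static int transfer_polls = 0;

/* Mock of the yield step; the real ukbd_yield() goes through DROP_GIANT(). */
static void mock_ukbd_yield(void)
{
    yield_calls++;
}

/* Mock of the transfer poll; the real usbd_transfer_poll() drops mutexes. */
static void mock_usbd_transfer_poll(void)
{
    transfer_polls++;
}

/*
 * Mock of the polling entry point. Returns false when polling was skipped
 * because the scheduler is stopped, true when both steps ran.
 */
static bool mock_ukbd_do_poll(void)
{
    if (SCHEDULER_STOPPED())
        return false;       /* bail out before touching any locking */

    mock_ukbd_yield();
    mock_usbd_transfer_poll();
    return true;
}
```

With the flag clear, a call to mock_ukbd_do_poll() runs both steps; once the flag is set it returns without reaching either one, which is the trade-off noted above: the dump no longer hangs in USB, but a USB keyboard cannot interrupt it.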