From owner-freebsd-current@FreeBSD.ORG Mon Nov 26 15:51:14 2007
Delivered-To: freebsd-current@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 33F9516A418
	for ; Mon, 26 Nov 2007 15:51:14 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 07E0713C478
	for ; Mon, 26 Nov 2007 15:51:13 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id DD14247195;
	Mon, 26 Nov 2007 10:54:48 -0500 (EST)
Date: Mon, 26 Nov 2007 15:51:07 +0000 (GMT)
From: Robert Watson
X-X-Sender: robert@fledge.watson.org
To: Cristian KLEIN
In-Reply-To: <474AE170.7020309@net.utcluj.ro>
Message-ID: <20071126154611.W65286@fledge.watson.org>
References: <473AD98A.8050003@gmail.com>
	<20071114115254.GA55351@eos.sc1.parodius.com>
	<473AEED9.6070301@gmail.com> <200711161618.40441.jkim@FreeBSD.org>
	<474AE170.7020309@net.utcluj.ro>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Andrey, freebsd-current@FreeBSD.org, Jung-uk Kim
Subject: Re: RELENG_7 and HEAD: bge causes system hang
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Mon, 26 Nov 2007 15:51:14 -0000

On Mon, 26 Nov 2007, Cristian KLEIN wrote:

> Great to hear this problem was solved. I still have one big fat question.
> Why did the system hang and not allow the kernel debugger show up? I
> strongly believe that this bug would have been easily spotted suppose KDB
> would have responded. Is it perhaps possible to "harden" KDB, so that such
> issues are easier to find and fix in future?

I don't know the details of this particular situation, but I can speak to at
least one known issue in DDB: right now, getting into DDB from a serial
console is a very quick and straightforward path, requiring only the
delivery of the serial interrupt and execution of its fast handler. Keypresses
on the regular video console take a much more circuitous route: syscons
isn't MPSAFE, so the path includes the scheduling of an ithread and the
acquisition of Giant. As such, I've found breaking into the debugger much
easier from a serial console for several years. As Giant has been pushed
off larger and larger parts of the kernel, the syscons break path has gotten
a lot more reliable.

There will always be certain cases where a console break (serial or video)
will not work, and those include cases where interrupts are disabled on all
CPUs (such as if spinlocks are held on all CPUs, perhaps due to one being
leaked and then a cascading deadlock). In that situation, there's nothing
like a nice NMI button or IPMI NMI to get into the debugger :-).

We have a feature on i386 and amd64 called MP_WATCHDOG, which allows one CPU
to be dedicated to being a watchdog for the others. On lower-end hardware
this isn't so useful, as CPUs aren't plentiful, but as the number of cores
increases, it becomes more and more feasible to run this without disrupting
normal operation of the machine. When the watchdog notices the kernel is no
longer running callouts, it delivers an NMI to the other CPUs and kicks
(hopefully) one of them into DDB. There are a number of issues with the
implementation, not least that we do actually run some other code on the
watchdog CPU sometimes, as our interrupt routing and scheduler need a bit
more adaptation, but it can be quite useful nonetheless.

Robert N M Watson
Computer Laboratory
University of Cambridge