From: Eric van Gyzen
Date: Fri, 21 Aug 2015 10:19:47 -0500
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
To: Ryan Stone, Adrian Chadd
Cc: freebsd-current, "freebsd-arch@freebsd.org", Scott Long, Konstantin Belousov
Message-ID: <55D74193.4020008@FreeBSD.org>

I mentioned this to Adrian, but I'll mention it here for everyone
else's benefit.

Ryan is exactly right.  There was a thread a while ago, with a proposed
patch from Kostik:

https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html

As I recall, Scott Long also ran into this a few months ago.

It happens for any NMI: entering the debugger, a PCI Parity or System
Error, a hardware watchdog timeout, and probably other sources I'm not
remembering.

Eric

On 08/21/2015 09:23, Ryan Stone wrote:
> I have seen similar behaviour before.  The problem is that every CPU
> receives an NMI concurrently.  As I recall, one of them gets some kind of
> pseudo-spinlock and tries to stop the other CPUs with an NMI.  However,
> because they are already in an NMI handler, they don't get the second NMI
> and don't stop properly.
>
> The case that I saw actually had to do with a panic triggered by an NMI,
> not entering the debugger, but I believe that both cases use
> stop_cpus_hard() under the hood and have a similar issue.
>
> (I also recall seeing the exact situation that you describe while
> originally developing SR-IOV on an alpha version of the Fortville hardware
> and firmware with a very buggy SR-IOV implementation.  I've never seen it
> on ixgbe before, although I haven't used SR-IOV there very much at all)
>
>
> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd wrote:
>
>> Hi!
>>
>> This has started happening on -HEAD recently. No, I don't have any
>> more details yet than "recently."
>>
>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>> have any ideas?
>>
>>
>> -adrian
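
For anyone following along, the failure mode Ryan describes above can be
sketched as a small toy model. This is not FreeBSD kernel code and does not
reproduce stop_cpus_hard(); the names `Cpu`, `deliver_nmi`, `stop_others`
and `simulate` are hypothetical. The only assumption it encodes is the one
from the thread: a CPU that is already inside its NMI handler does not act
on a second NMI.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    """Hypothetical model of one CPU; not a kernel data structure."""
    cpu_id: int
    in_nmi_handler: bool = False  # True while the CPU is servicing an NMI
    stopped: bool = False

    def deliver_nmi(self, is_stop_request: bool) -> None:
        # Model assumption: an NMI arriving while the CPU is already in
        # its NMI handler is not acted on.
        if self.in_nmi_handler:
            return
        self.in_nmi_handler = True
        if is_stop_request:
            self.stopped = True


def stop_others(cpus, initiator_id):
    """Send a stop NMI to every other CPU; return the ids that did not stop."""
    failed = []
    for cpu in cpus:
        if cpu.cpu_id == initiator_id:
            continue
        cpu.deliver_nmi(is_stop_request=True)
        if not cpu.stopped:
            failed.append(cpu.cpu_id)
    return failed


def simulate(ncpus, broadcast_nmi):
    cpus = [Cpu(i) for i in range(ncpus)]
    if broadcast_nmi:
        # Every CPU receives the original NMI concurrently.
        for cpu in cpus:
            cpu.deliver_nmi(is_stop_request=False)
    else:
        # Only CPU 0 takes the original NMI.
        cpus[0].deliver_nmi(is_stop_request=False)
    # CPU 0 is taken to be the one that wins and tries to stop the rest.
    return stop_others(cpus, initiator_id=0)
```

In this model, `simulate(4, broadcast_nmi=True)` reports CPUs 1 to 3 as
failing to stop, which lines up with the "failed to stop cpu" messages in
Adrian's report, while `simulate(4, broadcast_nmi=False)` reports none.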