From owner-freebsd-amd64@FreeBSD.ORG Tue Sep 27 19:43:47 2005
Date: Tue, 27 Sep 2005 20:43:45 +0100 (BST)
From: Robert Watson
To: Rob Watt
Cc: freebsd-hackers@FreeBSD.org, mikep@hudson-trading.com,
    freebsd-amd64@FreeBSD.org, Jason Carroll
Subject: Re: freebsd-5.4-stable panics
Message-ID: <20050927203128.S61419@fledge.watson.org>
In-Reply-To: <20050927140535.G50334@daemon.mistermishap.net>
References: <20050925115912.H11229@fledge.watson.org>
    <20050927140535.G50334@daemon.mistermishap.net>
List-Id: Porting FreeBSD to the AMD64 platform

On Tue, 27 Sep 2005, Rob Watt wrote:

> Thanks for your quick response and suggestions. We have now experienced
> an additional type of crash. Type 3 is from 6.0-BETA5, it did not enter
> the debugger at all and we could not generate a core.

Is this an SMP box? If so, could you try compiling options KDB_STOP_NMI
into your kernel -- you'll also need to set
debug.kdb.stop_cpus_with_nmi=1 in either loader.conf or at runtime with
sysctls.
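A minimal sketch of that setup, assuming a custom kernel configuration file (the name MYKERNEL is hypothetical) and the standard /boot/loader.conf location; only the option name and the sysctl name come from the text above:

```
# Kernel configuration file, e.g. /usr/src/sys/amd64/conf/MYKERNEL
# (file name is hypothetical); rebuild and install the kernel afterwards.
options 	KDB_STOP_NMI

# /boot/loader.conf: apply the setting at boot
debug.kdb.stop_cpus_with_nmi=1

# Or set it at runtime from a root shell:
#   sysctl debug.kdb.stop_cpus_with_nmi=1
```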
This will probably become the default at some point -- in the meantime,
the default when entering the debugger on one CPU is to generate an IPI
to the other CPUs telling them "go into the debugger". This works fine
unless the CPU has interrupts disabled, such as if it's holding a spin
lock in the scheduler, in which case the system will deadlock because
that CPU won't acknowledge the IPI. With the above option, a
non-maskable interrupt is used to signal the other CPUs into the
debugger, which gets into the debugger much more reliably.

The trap information you've provided indicates that it is likely a data
NULL pointer dereference in the kernel (faulting address is a small
increment above NULL). The instruction pointer looks valid -- if you
have a debugging copy of the kernel, could you load it into gdb and show
me what line number / piece of code it's in? You can use
"l *0xffffffff803b88ca" to generate that, even without a live debugger
session or core.

If you can get into DDB with the above, a generally good starting set of
debugging information (ideally gathered with a serial console) is:

    trace               # current thread trace
    show pcpu           # current CPU data
    show pcpu 0         # CPU 0 data
    show pcpu 1         # CPU 1 data
    ...                 # Any other CPUs
    ps                  # process listing
    show lockedvnods    # VFS locking information

If you have WITNESS compiled in, also:

    show alllocks

> Unfortunately the 6-BETA crash was completely different from everything
> we've seen so far. The panic was related to a page fault and 'top' was
> the active process. We are trying again to run our tests on 6.0, but if
> we keep encountering other bugs, then those other bugs may prevent us
> from determining if multicast is the problem.

Let's see if we can get this first bug you're hitting fixed, and then
see if we can get on to the original problems.
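To illustrate the trap reading given earlier in this message: a faulting address a small increment above NULL usually means a structure field was accessed through a NULL pointer, so the fault address is simply that field's offset. A minimal Python sketch using ctypes follows; the structure ExampleObj and its fields are hypothetical and are not taken from the kernel or from this panic.

```python
import ctypes


class ExampleObj(ctypes.Structure):
    """Hypothetical structure for illustration; not a real kernel type."""
    _fields_ = [
        ("refcount", ctypes.c_uint64),  # offset 0
        ("flags", ctypes.c_uint64),     # offset 8
        ("state", ctypes.c_uint32),     # offset 16
    ]


def fault_address(base, field):
    """Address touched when `field` is accessed through a pointer equal to `base`."""
    return base + getattr(ExampleObj, field).offset


if __name__ == "__main__":
    # With a NULL pointer (base 0), the access faults at the field's own offset.
    for name, _ctype in ExampleObj._fields_:
        print(f"{name}: fault at {fault_address(0, name):#x}")
```

Going the other way, matching a small faulting address against the field offsets of the structures used near the faulting instruction is one way to guess which pointer was NULL.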
> We also ran our applications in 5-STABLE without reading from or writing
> to disk (i.e. we ran the multicast data streams on a remote machine, and
> we told our listener/rebroadcaster apps not to write to disk). In this
> configuration we were able to run for 4 days without crashing. A few
> hours before the crash we had introduced disk activity (bonnie in a
> constant loop with 1G test file size). This crash was a type 1, and we
> were not able to save a core. The longest we had gone before without a
> crash was 6 hours, so it is possible that either load or disk activity
> helps trigger the bugs we have seen.

I'm heading off on a vacation for two days, and will be offline for that
period, but if we can't easily get through solving 6.x problems on the
host, I can backport a subset of the multicast fixes to 5.x and we can
see if that fixes things up. It may make sense to do this anyway, but I
may not have an opportunity to go through the development and testing on
that until after 6.0 is out the door.

> files attached:
> kernel-conf.txt (6.0 kernel)
> type3-core.txt (copy of panic output to console)
>
> We will update you with more info from our 6.0 tests when we have it.
>
> We are in a bind right now. All modern hardware (i.e. EM64T/amd64) only
> seems to work with versions of freebsd that aren't stable when running
> our applications. Many vendors do not even sell server hardware that is
> purely i386. We never encountered these types of problems on freebsd
> 4.x, and many of our 120+ i386 class machines that are running 4.x are
> showing their age and need to be replaced. Assuming that the problems we
> are experiencing are purely related to the OS, we now don't have an OS
> to run on the newer hardware we've been buying. We really need to find a
> way to patch these problems or find a version of freebsd that supports
> our platform and is stable. Obviously we appreciate the hard work that
> all of you on the freebsd team do, and we are happy to do whatever we
> can to help squash these bugs.

Hopefully we can get this fixed up as soon as possible. Do you have a
testbed or set of test hosts set up so you can non-disruptively test
change sets, btw?

Thanks,

Robert N M Watson