From owner-freebsd-amd64@FreeBSD.ORG  Tue Sep 27 19:43:47 2005
Return-Path: <owner-freebsd-amd64@FreeBSD.ORG>
X-Original-To: freebsd-amd64@FreeBSD.org
Delivered-To: freebsd-amd64@FreeBSD.org
Date: Tue, 27 Sep 2005 20:43:45 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Rob Watt <rob@hudson-trading.com>
In-Reply-To: <20050927140535.G50334@daemon.mistermishap.net>
Message-ID: <20050927203128.S61419@fledge.watson.org>
References: <da4a53d805092310237d732554@mail.gmail.com>
	<20050925115912.H11229@fledge.watson.org>
	<20050927140535.G50334@daemon.mistermishap.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-hackers@FreeBSD.org, mikep@hudson-trading.com,
	freebsd-amd64@FreeBSD.org, Jason Carroll <jason@hudson-trading.com>
Subject: Re: freebsd-5.4-stable panics


On Tue, 27 Sep 2005, Rob Watt wrote:

> Thanks for your quick response and suggestions. We have now experienced 
> an additional type of crash. Type 3 is from 6.0-BETA5, it did not enter 
> the debugger at all and we could not generate a core.

Is this an SMP box?  If so, could you try compiling "options KDB_STOP_NMI" 
into your kernel?  You'll also need to set debug.kdb.stop_cpus_with_nmi=1 
either in loader.conf or at runtime with sysctl.  This will probably 
become the default at some point.  In the meantime, the default when 
entering the debugger on one CPU is to send an IPI to the other CPUs 
telling them to go into the debugger.  This works fine unless a CPU has 
interrupts disabled, such as when it's holding a spin lock in the 
scheduler, in which case the system will deadlock because that CPU won't 
acknowledge the IPI.  With the above option, a non-maskable interrupt is 
used to signal the other CPUs instead, which gets them into the debugger 
much more reliably.
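
As a rough sketch of that setup (the option and sysctl names are the ones 
given above; the kernel config file itself is whatever you already build 
from, and the loader.conf lines are shown as comments only):

```shell
# In your kernel configuration file, then rebuild and install the kernel:
#   options KDB_STOP_NMI
#
# In /boot/loader.conf, to enable the behaviour at boot:
#   debug.kdb.stop_cpus_with_nmi=1

# Or enable it at runtime, as root, once the new kernel is running:
sysctl debug.kdb.stop_cpus_with_nmi=1
```

Setting it in loader.conf is probably the safer choice here, since the 
panic may hit before anyone gets a chance to set it by hand.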

The trap information you've provided indicates that it is likely a NULL 
pointer dereference on a data access in the kernel (the faulting address 
is a small increment above NULL).  The instruction pointer looks valid -- 
if you have a debugging copy of the kernel, could you load it into gdb and 
show me what line number / piece of code it's in?  You can use 
"l *0xffffffff803b88ca" to generate that, even without a live debugger 
session or core.  If you can get into DDB with the KDB_STOP_NMI option 
above, a good starting set of debugging information (ideally gathered 
with a serial console) is:

   trace					# current thread trace
   show pcpu				# current CPU data
   show pcpu 0				# CPU 0 data
   show pcpu 1				# CPU 1 data
   ...					# Any other CPUs
   ps					# process listing
   show lockedvnods			# VFS locking information

If you have WITNESS compiled in, also:

   show alllocks
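
For the gdb lookup mentioned earlier, here is a small sketch of a helper 
that turns an address copied from the panic output into the gdb command 
to paste.  The function name gdb_list_cmd is made up for illustration, 
and it assumes the address may arrive either bare, with a 0x prefix, or 
in a selector:address form; the kernel.debug path is also an assumption 
and depends on where your build put it.

```shell
# Hypothetical helper: normalise a trap address into a gdb "list" command.
gdb_list_cmd() {
    addr=${1##*:}     # drop a leading "selector:" part, if present
    addr=${addr#0x}   # drop an existing 0x prefix, if present
    printf 'list *0x%s\n' "$addr"
}

# Print the command for the instruction pointer from this panic.
gdb_list_cmd ffffffff803b88ca

# Then, with the debugging kernel loaded (path is an assumption):
#   gdb /usr/obj/usr/src/sys/YOURKERNEL/kernel.debug
#   (gdb) list *0xffffffff803b88ca
```

The 0x prefix matters because gdb would otherwise try to read the bare 
hex string as a symbol name rather than as an address.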

> Unfortunately the 6-BETA crash was completely different from everything 
> we've seen so far. The panic was related to a page fault and 'top' was 
> the active process. We are trying again to run our tests on 6.0, but if 
> we keep encountering other bugs, then those other bugs may prevent us 
> from determining if multicast is the problem.

Let's see if we can get this first bug you're hitting fixed, and then get 
back to the original problems.

> We also ran our applications in 5-STABLE without reading from or writing 
> to disk (ie we ran the multicast data streams on a remote machine, and 
> we told our listener/rebroadcaster apps not to write to disk). In this 
> configuration we were able to run for 4 days without crashing. A few 
> hours before the crash we had introduced disk activity (bonnie in a 
> constant loop with 1G test file size). This crash was a type 1, and we 
> were not able to save a core. The longest we had gone before without a 
> crash was 6 hours, so it is possible that either load, or disk activity 
> help trigger the bugs we have seen.

I'm heading off on a vacation for two days, and will be offline for that 
period, but if we can't easily get through solving 6.x problems on the 
host, I can backport a subset of the multicast fixes to 5.x and we can see 
if that fixes things up.  It may make sense to do this anyway, but I may 
not have an opportunity to go through the development and testing on that 
until after 6.0 is out the door.

> files attached:
> kernel-conf.txt (6.0 kernel)
> type3-core.txt (copy of panic output to console)
>
> We will update you with more info from our 6.0 tests when we have it.
>
> We are in a bind right now. All modern hardware (i.e. EM64T/amd64) only 
> seems to work with versions of freebsd that aren't stable when running 
> our applications. Many vendors do not even sell server hardware that is 
> purely i386. We never encountered these types of problems on freebsd 
> 4.x, and many of our 120+ i386 class machines that are running 4.x are 
> showing their age and need to be replaced. Assuming that the problems we 
> are experiencing are purely related to the OS, we now don't have an OS 
> to run on the newer hardware we've been buying. We really need to find a 
> way to patch these problems or find a version of freebsd that supports 
> our platform and is stable. Obviously we appreciate the hard work that 
> all of you on the freebsd team do, and we are happy to do whatever we 
> can to help squash these bugs.

Hopefully we can get this fixed up as soon as possible.

Do you have a testbed or set of test hosts set up so you can 
non-disruptively test change sets, btw?

Thanks,

Robert N M Watson