From owner-freebsd-stable@FreeBSD.ORG  Wed Sep 29 08:41:42 2010
Message-Id: <201009290841.o8T8fVul061470@gw.catspoiler.org>
Date: Wed, 29 Sep 2010 01:41:31 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: freebsd@jdc.parodius.com
In-Reply-To: <20100929074748.GB83194@icarus.home.lan>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
Cc: stable@FreeBSD.org, sterling@camdensoftware.com
Subject: Re: CPU time accounting broken on 8-STABLE machine after a few
 hours of uptime

On 29 Sep, Jeremy Chadwick wrote:
> On Wed, Sep 29, 2010 at 12:39:49AM -0700, Don Lewis wrote:
>> On 29 Sep, Jeremy Chadwick wrote:
>> 
>> > Given all the information here, in addition to the other portion of the
>> > thread (indicating ntpd reports extreme offset between the system clock
>> > and its stratum 1 source), I would say the motherboard is faulty or
>> > there is a system device which is behaving badly (possibly something
>> > pertaining to interrupts, but I don't know how to debug this on a low
>> > level).
>> 
>> Possible, but I haven't run into any problems running -CURRENT on this
>> box with an SMP kernel.
>> 
>> > Can you boot verbosely and provide all of the output here or somewhere
>> > on the web?
>> 
>> <http://people.freebsd.org/~truckman/AN-M2_HD-8.1-STABLE-verbose.txt>
>> 
>> > If possible, I would start by replacing the mainboard.  The board looks
>> > to be a consumer-level board (I see an nfe(4) controller, for example).
>> 
>> It's an Abit AN-M2 HD.  The RAM is ECC.  I haven't seen any machine
>> check errors in the logs.  I'll run prime95 as soon as I have a chance.
> 
> Thanks for the verbose boot.  Since it works on -CURRENT, can you
> provide a verbose boot from that as well?  Possibly someone made some
> changes between RELENG_8 and HEAD which fixed an issue, which could be
> MFC'd.

Even when I saw the weird ntp stepping problem and the calcru messages,
the system was still stable enough to build hundreds of ports.  In the
most recent case, I built 800+ ports over several days without any other
hiccups.

It could also be a difference between SMP and !SMP.  I just found a bug
that causes an immediate panic if lock profiling is enabled on a !SMP
kernel.  This bug also exists in -CURRENT.  Here's the patch:

Index: sys/sys/mutex.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/mutex.h,v
retrieving revision 1.105.2.1
diff -u -r1.105.2.1 mutex.h
--- sys/sys/mutex.h	3 Aug 2009 08:13:06 -0000	1.105.2.1
+++ sys/sys/mutex.h	29 Sep 2010 06:58:52 -0000
@@ -251,8 +251,11 @@
 #define _rel_spin_lock(mp) do {						\
 	if (mtx_recursed((mp)))						\
 		(mp)->mtx_recurse--;					\
-	else								\
+	else {								\
 		(mp)->mtx_lock = MTX_UNOWNED;				\
+		LOCKSTAT_PROFILE_RELEASE_LOCK(LS_MTX_SPIN_UNLOCK_RELEASE, \
+			mp);						\
+	}                                                               \
 	spinlock_exit();						\
 } while (0)
 #endif /* SMP */
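
For anyone reading the patch without the kernel source handy, here is a
minimal userland sketch of what the change does structurally: the else
branch gains braces so it can hold two statements, and the release path
gains a profiling hook, presumably to balance the hook on the acquire
side.  Everything named toy_* or TOY_* is a hypothetical stand-in made up
for illustration; none of it is the real mutex.h or lockstat code, and
the counters only model the idea of acquire/release accounting.

```c
/*
 * Userland sketch only.  toy_mtx, the TOY_* macros and the counters are
 * illustrative assumptions, not the kernel's definitions.
 */
#define TOY_UNOWNED	0
#define TOY_OWNED	1

struct toy_mtx {
	int	mtx_lock;	/* TOY_UNOWNED or TOY_OWNED */
	int	mtx_recurse;	/* depth of nested acquires */
};

static int toy_profile_acquires;	/* acquire hook invocations */
static int toy_profile_releases;	/* release hook invocations */

#define TOY_PROFILE_ACQUIRE(mp)	((void)(mp), toy_profile_acquires++)
#define TOY_PROFILE_RELEASE(mp)	((void)(mp), toy_profile_releases++)

/* Acquire: recurse if already held, else take the lock and record it. */
#define toy_get_spin_lock(mp) do {					\
	if ((mp)->mtx_lock == TOY_OWNED)				\
		(mp)->mtx_recurse++;					\
	else {								\
		(mp)->mtx_lock = TOY_OWNED;				\
		TOY_PROFILE_ACQUIRE(mp);				\
	}								\
} while (0)

/*
 * Release, shaped like the patched macro: the braces keep both the
 * unlock and the profiling hook inside the else branch, so the hook
 * runs only on the final (non-recursed) release.
 */
#define toy_rel_spin_lock(mp) do {					\
	if ((mp)->mtx_recurse > 0)					\
		(mp)->mtx_recurse--;					\
	else {								\
		(mp)->mtx_lock = TOY_UNOWNED;				\
		TOY_PROFILE_RELEASE(mp);				\
	}								\
} while (0)

/*
 * Acquire the lock once plus `nested` recursive times, release it the
 * same number of times, and return acquire hooks minus release hooks.
 */
int
toy_imbalance(int nested)
{
	struct toy_mtx m = { TOY_UNOWNED, 0 };
	int i;

	toy_profile_acquires = 0;
	toy_profile_releases = 0;
	for (i = 0; i <= nested; i++)
		toy_get_spin_lock(&m);
	for (i = 0; i <= nested; i++)
		toy_rel_spin_lock(&m);
	return (toy_profile_acquires - toy_profile_releases);
}
```

If the TOY_PROFILE_RELEASE() line is dropped from toy_rel_spin_lock(),
the sketch's counters stop balancing, which is the kind of mismatch the
patch removes.  Whether that mismatch is exactly what triggers the panic
in the real kernel is not something this sketch demonstrates.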


After applying the above patch, I enabled lock profiling and got the
following results when I ran "make index":
<http://people.freebsd.org/~truckman/AN-M2_HD-8.1-STABLE_lock_profile.txt>

I didn't see anything strange happening this time.  I don't know if I
got lucky, or the change in kernel options "fixed" the bug.