Date: Wed, 29 Sep 2010 01:41:31 -0700 (PDT)
From: Don Lewis
To: freebsd@jdc.parodius.com
Cc: stable@FreeBSD.org, sterling@camdensoftware.com
Subject: Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime
Message-Id: <201009290841.o8T8fVul061470@gw.catspoiler.org>
In-Reply-To: <20100929074748.GB83194@icarus.home.lan>

On 29 Sep, Jeremy Chadwick wrote:
> On Wed, Sep 29, 2010 at 12:39:49AM -0700, Don Lewis wrote:
>> On 29 Sep, Jeremy Chadwick wrote:
>>
>> > Given all the information here, in addition to the other portion of the
>> > thread (indicating ntpd reports extreme offset between the system clock
>> > and its stratum 1 source), I would say the motherboard is faulty or
>> > there is a system device which is behaving badly (possibly something
>> > pertaining to interrupts, but I don't know how to debug this on a low
>> > level).
>>
>> Possible, but I haven't run into any problems running -CURRENT on this
>> box with an SMP kernel.
>>
>> > Can you boot verbosely and provide all of the output here or somewhere
>> > on the web?
>>
>>
>>
>> > If possible, I would start by replacing the mainboard. The board looks
>> > to be a consumer-level board (I see an nfe(4) controller, for example).
>>
>> It's an Abit AN-M2 HD. The RAM is ECC. I haven't seen any machine
>> check errors in the logs. I'll run prime95 as soon as I have a chance.
>
> Thanks for the verbose boot. Since it works on -CURRENT, can you
> provide a verbose boot from that as well? Possibly someone made some
> changes between RELENG_8 and HEAD which fixed an issue, which could be
> MFC'd.

Even when I saw the weird ntp stepping problem and the calcru messages,
the system was still stable enough to build hundreds of ports. In the
most recent case, I built 800+ ports over several days without any other
hiccups.

It could also be a difference between SMP and !SMP. I just found a bug
that causes an immediate panic if lock profiling is enabled on a !SMP
kernel. This bug also exists in -CURRENT. Here's the patch:

Index: sys/sys/mutex.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/mutex.h,v
retrieving revision 1.105.2.1
diff -u -r1.105.2.1 mutex.h
--- sys/sys/mutex.h	3 Aug 2009 08:13:06 -0000	1.105.2.1
+++ sys/sys/mutex.h	29 Sep 2010 06:58:52 -0000
@@ -251,8 +251,11 @@
 #define _rel_spin_lock(mp) do {                                         \
         if (mtx_recursed((mp)))                                         \
                 (mp)->mtx_recurse--;                                    \
-        else                                                            \
+        else {                                                          \
                 (mp)->mtx_lock = MTX_UNOWNED;                           \
+                LOCKSTAT_PROFILE_RELEASE_LOCK(LS_MTX_SPIN_UNLOCK_RELEASE, \
+                        mp);                                            \
+        }                                                               \
         spinlock_exit();                                                \
 } while (0)
 #endif /* SMP */

After applying the above patch, I enabled lock profiling and got the
following results when I ran "make index":

I didn't see anything strange happening this time. I don't know if I
got lucky, or the change in kernel options "fixed" the bug.
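
For illustration only, the shape of the patch above can be sketched in
user space. Everything in this sketch is a hypothetical stand-in
(toy_mtx, prof_acquire, prof_release, TOY_UNLOCK_*, toy_cycle); none of
it is the kernel's code. It assumes the acquire path notifies the
profiler, and it only shows the bookkeeping imbalance left behind when
the release path has no matching hook. It does not reproduce the panic,
and the real failure mechanism in the kernel may be different.

```c
#include <assert.h>

/* Hypothetical stand-ins; not the kernel's real definitions. */
struct toy_mtx {
	int locked;	/* 1 while owned */
	int recurse;	/* recursion depth beyond the first acquire */
};

static int prof_held;	/* locks the toy profiler believes are held */

static void prof_acquire(void) { prof_held++; }
static void prof_release(void) { prof_held--; }

/* Acquire path: assumed to notify the profiler on first acquire. */
static void
toy_lock(struct toy_mtx *m)
{
	if (m->locked)
		m->recurse++;
	else {
		m->locked = 1;
		prof_acquire();
	}
}

/* Release without a profiling hook (shape of the unpatched macro). */
#define TOY_UNLOCK_UNPATCHED(m) do {		\
	if ((m)->recurse)			\
		(m)->recurse--;			\
	else					\
		(m)->locked = 0;		\
} while (0)

/* Release with the hook inside a braced else (shape of the patch). */
#define TOY_UNLOCK_PATCHED(m) do {		\
	if ((m)->recurse)			\
		(m)->recurse--;			\
	else {					\
		(m)->locked = 0;		\
		prof_release();			\
	}					\
} while (0)

/*
 * Lock twice (one recursion), unlock twice, and return how many locks
 * the toy profiler still thinks are held afterwards.
 */
static int
toy_cycle(int patched)
{
	struct toy_mtx m = { 0, 0 };
	int i;

	prof_held = 0;
	toy_lock(&m);
	toy_lock(&m);
	for (i = 0; i < 2; i++) {
		if (patched)
			TOY_UNLOCK_PATCHED(&m);
		else
			TOY_UNLOCK_UNPATCHED(&m);
	}
	/* The lock itself is fully released in both variants. */
	assert(!m.locked && m.recurse == 0);
	return (prof_held);
}
```

Calling toy_cycle(0) from a small main() returns 1: the lock is free but
the toy profiler still counts it as held. toy_cycle(1) returns 0. The
braces matter because an else without them would govern only the first
statement, so the hook would also run on the recursive release path.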