Date:      Fri, 11 Sep 2009 11:02:14 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-hackers@freebsd.org
Cc:        Linda Messerschmidt <linda.messerschmidt@gmail.com>
Subject:   Re: Intermittent system hangs on 7.2-RELEASE-p1
Message-ID:  <200909111102.14503.jhb@freebsd.org>
In-Reply-To: <237c27100909101834g49438707l96fa58df5f717945@mail.gmail.com>
References:  <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <237c27100909101207q73f0c513r60dd5ab83fdfd083@mail.gmail.com> <237c27100909101834g49438707l96fa58df5f717945@mail.gmail.com>

On Thursday 10 September 2009 9:34:30 pm Linda Messerschmidt wrote:
> Just to follow up, I've been doing some testing with masking for
> KTR_LOCK rather than KTR_SCHED.
> 
> I'm having trouble with this because I have the KTR buffer size set to
> 1048576 entries, and with only KTR_LOCK enabled, this isn't enough for
> even a full second of tracing; the sample I'm working with now is just
> under 0.9s.  It's an average of one entry every 2001 TSC ticks.  That
> *seems* like a lot of locking activity, but some of the lock points
> are only a couple of lines apart, so maybe it's just incredibly
> verbose.
> 
> Since it's so much data and I'm still working on a way to correlate it
> (lockgraph.py?), all I've got so far is a list of what trace points
> are coming up the most:
> 
> 51927 src/sys/kern/kern_lock.c:215  (_lockmgr UNLOCK mtx_unlock() when
> flags & LK_INTERLOCK)
> 48033 src/sys/kern/vfs_subr.c:2284  (vdropl UNLOCK)
> 41548 src/sys/kern/vfs_subr.c:2187  (vput VI_LOCK)
> 29359 src/sys/kern/vfs_subr.c:2067  (vget VI_LOCK)
> 29358 src/sys/kern/vfs_subr.c:2079  (vget VI_UNLOCK)
> 23799 src/sys/nfsclient/nfs_subs.c:755  (nfs_getattrcache mtx_lock)
> 23460 src/sys/nfsclient/nfs_vnops.c:645  (nfs_getattr mtx_unlock)
> 23460 src/sys/nfsclient/nfs_vnops.c:642  (nfs_getattr mtx_lock)
> 23460 src/sys/nfsclient/nfs_subs.c:815  (nfs_getattrcache mtx_unlock)
> 23138 src/sys/kern/vfs_cache.c:345  (cache_lookup CACHE_LOCK)
> 
> Unfortunately, it kind of sounds like I'm on my way to answering "why
> is this system slow?" even though it really isn't slow.  (And I rush
> to point out that the Apache process in question doesn't at any point
> in its life touch NFS, though some of the other ones on the machine
> do.)
> 
> In order to be the cause of my Apache problem, all this goobering
> around with NFS would have to be relatively infrequent but so intense
> that it shoves everything else out of the way.  I'm skeptical, but I'm
> sure one of you guys can offer a more informed opinion.
> 
> The only other thing I can think of is maybe all this is running me
> out of something I need (vnodes?) so everybody else blocks until it
> finishes and lets go of whatever finite resource it's using up?  But
> that doesn't make a ton of sense either, because why would a lack of
> vnodes cause stalls in accept() or select() in unrelated processes?
> 
> Not sure if I'm going in the right direction here or not.
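
For reference, the KTR setup you describe corresponds roughly to a kernel
config along these lines (the option names are from ktr(4); the exact mask
to compile in and enable is of course up to you):

	options KTR
	options KTR_ENTRIES=1048576	# circular buffer size, a power of two
	options KTR_COMPILE=(KTR_LOCK|KTR_SCHED)
	options KTR_MASK=KTR_LOCK	# classes enabled at boot

The enabled classes can also be changed at runtime via the debug.ktr.mask
sysctl, which saves a reboot when switching between KTR_LOCK and KTR_SCHED.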

Try turning off KTR_LOCK tracing for spin mutexes (just force LO_QUIET on in 
mtx_init() if MTX_SPIN is set) and use the schedgraph.py from the latest 
RELENG_7.  It knows how to parse KTR_LOCK events and draw event "bars" for 
locks showing when they are held.  A more recent schedgraph.py might also fix 
the bugs you were seeing where the idle threads looked like they were running 
too long (especially at the start and end of the graphs).
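
Something along these lines in mtx_init() is what I mean (an untested
sketch; the surrounding code in sys/kern/kern_mutex.c may not match 7.2
exactly):

	/* Determine lock class and lock flags. */
	if (opts & MTX_SPIN)
		class = &lock_class_mtx_spin;
	else
		class = &lock_class_mtx_sleep;
	flags = 0;
	if (opts & MTX_QUIET)
		flags |= LO_QUIET;
	/*
	 * Quiet all spin mutexes so that KTR_LOCK tracing does not
	 * flood the trace buffer with spin lock operations.
	 */
	if (opts & MTX_SPIN)
		flags |= LO_QUIET;

After rebuilding, dump the buffer with something like 'ktrdump -ct > ktr.out'
and feed that file to schedgraph.py.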

-- 
John Baldwin


