Date:      Thu, 10 Sep 2009 21:34:30 -0400
From:      Linda Messerschmidt <linda.messerschmidt@gmail.com>
To:        freebsd-hackers@freebsd.org
Subject:   Re: Intermittent system hangs on 7.2-RELEASE-p1
Message-ID:  <237c27100909101834g49438707l96fa58df5f717945@mail.gmail.com>
In-Reply-To: <237c27100909101207q73f0c513r60dd5ab83fdfd083@mail.gmail.com>
References:  <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200908261642.59419.jhb@freebsd.org> <237c27100908271237y66219ef4o4b1b8a6e13ab2f6c@mail.gmail.com> <200908271729.55213.jhb@freebsd.org> <237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com> <bc2d970909100957y6d7fd707g9f3184165f8cb766@mail.gmail.com> <237c27100909101129y28771061o86db3c6a50a640eb@mail.gmail.com> <4AA94995.6030700@elischer.org> <237c27100909101207q73f0c513r60dd5ab83fdfd083@mail.gmail.com>

Just to follow up, I've been doing some testing with masking for
KTR_LOCK rather than KTR_SCHED.

I'm having trouble with this because I have the KTR buffer size set to
1048576 entries, and with only KTR_LOCK enabled that isn't enough for
even a full second of tracing; the sample I'm working with now covers
just under 0.9s, which works out to an average of one entry every
2001 TSC ticks.  That *seems* like a lot of locking activity, but some
of the lock points are only a couple of lines apart, so maybe it's just
incredibly verbose.
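
For what it's worth, a quick back-of-the-envelope check (Python, just to
show the arithmetic; the TSC rate it spits out is only what these figures
imply, not something measured) says the numbers hang together:

# Sanity check of the KTR figures quoted above, using only the numbers
# from the text; the TSC frequency is merely what those numbers imply.
entries = 1048576        # KTR buffer size, in entries
sample_seconds = 0.9     # roughly the wall time the full buffer covers
ticks_per_entry = 2001   # reported average spacing, in TSC ticks

entries_per_second = entries / sample_seconds
implied_tsc_hz = entries_per_second * ticks_per_entry

print("~%.0f KTR_LOCK entries per second" % entries_per_second)
print("implied TSC rate: ~%.2f GHz" % (implied_tsc_hz / 1e9))

That comes out to roughly 1.17 million entries per second and an implied
TSC of about 2.33 GHz, so a 1M-entry buffer filling in just under a
second is exactly what you'd expect.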

Since there's so much data and I'm still working out a way to correlate
it (lockgraph.py?), all I've got so far is a list of which trace points
come up the most:

51927 src/sys/kern/kern_lock.c:215  (_lockmgr UNLOCK mtx_unlock() when flags & LK_INTERLOCK)
48033 src/sys/kern/vfs_subr.c:2284  (vdropl UNLOCK)
41548 src/sys/kern/vfs_subr.c:2187  (vput VI_LOCK)
29359 src/sys/kern/vfs_subr.c:2067  (vget VI_LOCK)
29358 src/sys/kern/vfs_subr.c:2079  (vget VI_UNLOCK)
23799 src/sys/nfsclient/nfs_subs.c:755  (nfs_getattrcache mtx_lock)
23460 src/sys/nfsclient/nfs_vnops.c:645  (nfs_getattr mtx_unlock)
23460 src/sys/nfsclient/nfs_vnops.c:642  (nfs_getattr mtx_lock)
23460 src/sys/nfsclient/nfs_subs.c:815  (nfs_getattrcache mtx_unlock)
23138 src/sys/kern/vfs_cache.c:345  (cache_lookup CACHE_LOCK)
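
For anyone who wants to run the same sort of tally against their own
dump, a sketch along these lines should do it.  It assumes the dump text
includes the file and line of each entry (ktrdump has a flag for that,
-f, if your version supports it) and just pattern-matches anything that
looks like a source path, so the exact column layout shouldn't matter:

#!/usr/bin/env python
# Tally KTR trace points by file:line from a ktrdump text dump.
# Assumes file/line info is present in the dump; only pattern-matches
# source paths, so the surrounding column layout is irrelevant.
import re
import sys

TRACEPOINT = re.compile(r'(\S+\.[ch]:\d+)')

counts = {}
with open(sys.argv[1]) as dump:
    for line in dump:
        m = TRACEPOINT.search(line)
        if m:
            point = m.group(1)
            counts[point] = counts.get(point, 0) + 1

for point, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print("%d %s" % (n, point))

Run it as "python tally.py ktr.out" and you get a top-ten list shaped
like the one above.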

Unfortunately, it kind of sounds like I'm on my way to answering "why
is this system slow?" even though it really isn't slow.  (And I rush
to point out that the Apache process in question doesn't at any point
in its life touch NFS, though some of the other ones on the machine
do.)

In order to be the cause of my Apache problem, all this goobering
around with NFS would have to be relatively infrequent but so intense
that it shoves everything else out of the way.  I'm skeptical, but I'm
sure one of you guys can offer a more informed opinion.

The only other thing I can think of is that maybe all this is running me
out of something I need (vnodes?), so everybody else blocks until it
finishes and lets go of whatever finite resource it's using up.  But
that doesn't make a ton of sense either, because why would a lack of
vnodes cause stalls in accept() or select() in unrelated processes?
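
If vnode exhaustion really were the mechanism, it ought to show up by
watching vnode usage against the ceiling while a stall is in progress.
A rough sketch, assuming the usual vfs.numvnodes / kern.maxvnodes
sysctls and just shelling out to sysctl(8):

#!/usr/bin/env python
# Poll vnode usage and flag when it approaches the kern.maxvnodes ceiling.
# Just a sketch: shells out to sysctl(8) and checks once a second.
import subprocess
import time

def sysctl(name):
    return int(subprocess.check_output(['sysctl', '-n', name]).strip())

maxvnodes = sysctl('kern.maxvnodes')
while True:
    numvnodes = sysctl('vfs.numvnodes')
    pct = 100.0 * numvnodes / maxvnodes
    print("%d / %d vnodes in use (%.1f%%)" % (numvnodes, maxvnodes, pct))
    if pct > 90:
        print("  -> close to kern.maxvnodes; vnode pressure looks plausible")
    time.sleep(1)

If numvnodes never gets anywhere near maxvnodes during a stall, that
would pretty much rule this theory out.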

Not sure whether I'm going in the right direction here.


