Date:      Fri, 11 Sep 2009 21:06:12 -0700
From:      Julian Elischer <julian@elischer.org>
To:        Linda Messerschmidt <linda.messerschmidt@gmail.com>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Intermittent system hangs on 7.2-RELEASE-p1
Message-ID:  <4AAB1E34.2060908@elischer.org>
In-Reply-To: <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com>
References:  <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com>	<200909111102.14503.jhb@freebsd.org>	<237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com>	<200909111506.47309.jhb@freebsd.org>	<237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com>

Linda Messerschmidt wrote:
> OK, I have learned that ktrdump looks up the name of the process
> associated with a particular KSE at the time of the dump, so if
> it's changed since tracing stopped, it will blissfully blame the wrong
> process.  I understand why that's the case, but it still sucks for
> troubleshooting. :(
> 
> This time, "pf task mtx" and "vnode_free_list" are the locks getting
> the blame.  The processes fingered are an httpd (the root "parent"
> of the one doing the work, which does nothing but select() for 1s and
> wait to see if its children died), and vnlru.  No correlation at all
> to the previous results, and this machine is now utterly quiescent
> except for the httpd process and the PHP exerciser.  Hard to imagine
> vnlru has 1s worth of running to do on a machine with 949 total vnodes
> in use.
> 
> A third run produced a 997ms "lock acquire" for "buffer daemon lock,"
> a 497ms one for ip6qlock (no, there's no IPv6 in use on this machine),
> and an 8s (!!!) one on unp_mtx. bufdaemon had a 997s "running" bar,
> but according to the raw TSC values, that happened on the same CPU
> 1.999s *after* the 997ms buffer daemon lock acquire.
> 
> I really don't know where to go from here.  There's so little
> consistency that I'm just not sure if the data is bad, the tool is
> bad, the operator is bad, or there's some problem so fundamentally
> horrible that all I'm seeing is random side effects.

Does the system have a serial console? How about a normal
console/keyboard?

How often does it hang, and for how long?
Is there a chance that you could notice when it is hung and hit
<CTL><ALT><ESC> to drop it into the debugger in the hung state?

If you have a serial port, it is possible to write a program that sends
a char back and forth and, when the machine hangs, sends the magic
sequence. (I think it's CR<tilde><CTL-B> for the serial debugger break,
but I'm sure you can look up the kernel options and the chars on Google.)
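
A rough sketch of such a watchdog, to run on a second machine attached to
the serial line. It rests on several assumptions that are not from this
thread: something on the target echoes each byte back, the serial device is
already opened and configured (baud rate, raw mode) by the caller, and the
break sequence is CR, tilde, Ctrl-B, which should be checked against the
kernel's debugger options. The names probe, watch and BREAK_SEQUENCE are
made up for illustration.

```python
import os
import select
import time

# Assumption: CR, '~', Ctrl-B is the alternate break sequence on the target.
# Verify this against the kernel options before relying on it.
BREAK_SEQUENCE = b"\r~\x02"


def probe(fd, timeout=2.0, ping=b"."):
    """Send one byte and wait up to `timeout` seconds for any reply."""
    os.write(fd, ping)
    readable, _, _ = select.select([fd], [], [], timeout)
    if not readable:
        return False          # nothing came back in time: target looks hung
    os.read(fd, 64)           # drain the reply so the next probe starts clean
    return True


def watch(fd, interval=1.0, timeout=2.0, max_probes=None):
    """Probe repeatedly; on the first missed reply, send the break sequence.

    Returns True if the break sequence was sent, False if `max_probes`
    probes were all answered.
    """
    count = 0
    while max_probes is None or count < max_probes:
        if not probe(fd, timeout):
            os.write(fd, BREAK_SEQUENCE)
            return True
        count += 1
        time.sleep(interval)
    return False
```

The caller would open the serial device (for example with os.open on
whatever device node the cable is attached to) and pass the descriptor to
watch(). The timeout should be well under the length of a hang, so the
break arrives while the machine is still stuck.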

If you can drop the machine into DDB (the kernel debugger) in the
hung state, then there are lots of commands you can use to find out
what is wrong. jhb actually gave a short talk on the topic that I
videoed and put on YouTube.

ps will show you what is actually running on which CPU, and you can see
what locks all the other processes are waiting on.
Then you can examine those locks and see who owns them.




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4AAB1E34.2060908>