Date:      Sat, 12 Sep 2009 02:52:51 -0400
From:      Linda Messerschmidt <linda.messerschmidt@gmail.com>
To:        Julian Elischer <julian@elischer.org>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Intermittent system hangs on 7.2-RELEASE-p1
Message-ID:  <237c27100909112352k5504357dge725c8f905ee650a@mail.gmail.com>
In-Reply-To: <4AAB35E0.3000908@elischer.org>
References:  <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> <200909111506.47309.jhb@freebsd.org> <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com> <4AAB1E34.2060908@elischer.org> <237c27100909112147h64f71585p2a97f2b48a510985@mail.gmail.com> <4AAB35E0.3000908@elischer.org>

On Sat, Sep 12, 2009 at 1:47 AM, Julian Elischer <julian@elischer.org> wrote:
> ok now we need to describe the hang..  if you can predictably get a hang
> every 7 seconds does this mean that it doesn't respond to keyboard for a
> moment every 7 seconds?

It's possible.

> or that it doesn't accept packets every 7 seconds?

It appears that it accepts & responds to at least pings; I was able to
do an every-0.1-seconds ping through a bevy of 300-1900ms stalls with:

2323 packets transmitted, 2323 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.120/1.019/5.979/0.288 ms

As best as I could tell, schedgraph also showed that the clock
interrupt and the em0 interrupt always got serviced on time.  Pretty
much seems like it's userspace that's getting put on hold.

> Or is it just the apache process that hangs?

This is where I started from.  In the original post (way long ago
now), I described how pretty much every process on the system went
into the kernel for something and stalled there, and then when the
stall ended, they all unblocked at once.  I posted some examples via
ktrace that I sadly no longer have the source data for.

> Does the watching process that you refer to below also hang?

I don't think I can say for sure.  When I have it print every request,
I see visual stalls in the output from time to time even though no
stalled request is reported.  That could mean either that a stall
occurred outside the request, or that my shoddy Internet connection
has 100ms latency and consistent 1% packet loss, which it does.

I did write a short C program that just select()s on stdin for 100ms
over and over and aborts if it takes more than 125ms to go through the
loop; it never aborts, even through 1s+ stalls and the loop times it
reports are consistently 110ms regardless of what else is going on,
which I don't think is unexpected.  However, I'm not sure why that
differs from the behavior of the "master" Apache processes, which
select() for 1 second all day long, but do appear to be affected.
Maybe because they are selecting a network socket instead of a tty?  I
don't know.

Also, if I disable NTP, the system does not appear to lose time during
the stalls, which fits with the consistent clock interrupts I saw.

> would it hang if it tried to access the disk?

By using the md device, I believe I have removed the disk from the
equation; neither process is accessing it.

Even without doing that, if I leave iostat -w 1 running alongside the
test, there's no correlation between the tiny amount of disk activity
there is and observed stalls.

> if the watching process is on the same machine, does it only trigger AFTER
> the request has taken a long time or could it time out with a select DURING
> the delayed response? (another way of asking "how hung
> is 'hung'?")

It's just a PHP script using libcurl to request the file.  I only
moved it to the same machine in order to have it be able to write the
sysctl to stop the KTR traces I was doing.

If you're asking could the check script be modified to time out after,
say, 1 second, and if so, would it return during the hang or after it?
 I don't know.  My guess based on the earlier ktrace output is that it
would time out, but not return until the hang ended.  I'll see if
the curl lib exposes a configurable timeout and try it.


