Date: Sat, 12 Sep 2009 02:52:51 -0400
From: Linda Messerschmidt <linda.messerschmidt@gmail.com>
To: Julian Elischer <julian@elischer.org>
Cc: freebsd-hackers@freebsd.org
Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1
Message-ID: <237c27100909112352k5504357dge725c8f905ee650a@mail.gmail.com>
In-Reply-To: <4AAB35E0.3000908@elischer.org>
References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200909111102.14503.jhb@freebsd.org> <237c27100909111035y544e8c91hc7726fd6ef16e351@mail.gmail.com> <200909111506.47309.jhb@freebsd.org> <237c27100909111905y244924c1n93b4e4d9ceda44be@mail.gmail.com> <237c27100909112055i35612b4btbfbecb8b5dd1568c@mail.gmail.com> <4AAB1E34.2060908@elischer.org> <237c27100909112147h64f71585p2a97f2b48a510985@mail.gmail.com> <4AAB35E0.3000908@elischer.org>
On Sat, Sep 12, 2009 at 1:47 AM, Julian Elischer <julian@elischer.org> wrote:
> ok now we need to describe the hang..  if you can predictably get a hang
> every 7 seconds does this mean that it doesn't respond to keyboard for a
> moment every 7 seconds?

It's possible.

> or that it doesn't accept packets every 7 seconds?

It appears that it accepts & responds to at least pings; I was able to do an every-0.1-seconds ping through a bevy of 300-1900ms stalls with:

2323 packets transmitted, 2323 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.120/1.019/5.979/0.288 ms

As best as I could tell, schedgraph also showed that the clock interrupt and the em0 interrupt always got serviced on time. It pretty much seems like it's userspace that's getting put on hold.

> Or is it just the apache process that hangs?

This is where I started from. In the original post (way long ago now), I described how pretty much every process on the system went into the kernel for something and stalled there, and then when the stall ended, they all unblocked at once. I posted some examples via ktrace that I sadly no longer have the source data for.

> Does the watching process that you refer to below also hang?

I don't think I can say for sure. If I have it show every request, I observe visual stalls in the output from time to time where no stall is reported, which could indicate either that a stall occurred outside the request or that my shoddy Internet connection has 100ms latency and consistent 1% packet loss, which it does.

I did write a short C program that just select()s on stdin for 100ms over and over and aborts if it takes more than 125ms to go through the loop; it never aborts, even through 1s+ stalls, and the loop times it reports are consistently 110ms regardless of what else is going on, which I don't think is unexpected.
However, I'm not sure why that differs from the behavior of the "master" Apache processes, which select() for 1 second all day long, but do appear to be affected. Maybe because they are selecting on a network socket instead of a tty? I don't know.

Also, if I disable NTP, the system does not appear to lose time during the stalls, which fits with the consistent clock interrupts I saw.

> would it hang if it tried to access the disk?

By using the md device, I believe I have removed the disk from the equation; neither process is accessing it. Even without doing that, if I leave iostat -w 1 running alongside the test, there's no correlation between the tiny amount of disk activity there is and observed stalls.

> if the watching process is on the same machine, does it only trigger AFTER
> the request has taken a long time or could it time out with a select DURING
> the delayed response? (another way of asking "how hung
> is 'hung'?")

It's just a PHP script using libcurl to request the file. I only moved it to the same machine in order to have it be able to write the sysctl to stop the KTR traces I was doing.

If you're asking whether the check script could be modified to time out after, say, 1 second, and if so, whether it would return during the hang or after it: I don't know. My guess based on the earlier ktrace output is that it would time out, but not return until the hang ended. I'll see if the curl lib exposes a configurable timeout and try it.