Date:      Thu, 7 Sep 2006 13:34:08 -0400
From:      Tom Ierna <tom@shockergroup.com>
To:        freebsd-questions@freebsd.org
Subject:   rpc.lockd stalls
Message-ID:  <56C924ED-9AF8-4575-8A2F-9BD523AF117F@shockergroup.com>

Hello, list.

To simplify software and hardware management, I'm attempting to run
a set of PXE-booted Client machines as web/db or mail servers.

The NFS/DHCP/YP servers are running on a 5.4-STABLE Server. I mostly  
followed the PXE guide when building these systems.

All of the disk (except for swap) sits on the master Server (which  
has a bunch of external drive sleds), and all of the Client machines  
boot via Gig-E.

Client machines are running 5.4-STABLE as well, but with a different
kernel configuration from the master Server, as the hardware is
slightly different. Client machines share userland with the Server.

At the moment I have one Client machine running about 40 domains of  
web and db, with reasonably low traffic (less than 3Mbit/sec total)  
and one Client machine booted from the master Server, but not doing  
anything.

Resource utilization on the master Server seems pretty low.

Sporadically, there appear to be stalls on some locks with rpc.lockd.
These lock stalls exhibit "interesting" behavior on the Client
machines: Slots will fill up on Apache in the "W" state. SSH login
attempts to the Client machine (passwd files get some user data via
YP) will hang and time out. When I find a file (via Apache's extended
status) which appears to be one of the stalled locks, and I attempt
to do anything with the file via a shell on the Client machine, such
as "cat" it, that shell will become unresponsive. Any process which
is stalled on one of these files cannot be killed.
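
To catch a stall while it is happening, rather than noticing it through
Apache, a small lock probe run on the Client might help. What follows is
only a sketch under assumptions of mine: it assumes Python 3 is available
on the Client, that the stalled requests are fcntl-style advisory locks
(the kind I understand to be forwarded to rpc.lockd on NFS), and that the
file being probed exists and is writable. The name probe_lock and the
timeout value are hypothetical. The lock attempt runs in a child process
so that a stuck lock hangs only the child and not the probe itself.

```python
import subprocess
import sys
import time

# Child program: open the file and block until an exclusive fcntl lock
# is granted, then release it and exit.
CHILD = (
    "import fcntl, sys\n"
    "with open(sys.argv[1], 'r+b') as f:\n"
    "    fcntl.lockf(f, fcntl.LOCK_EX)\n"
    "    fcntl.lockf(f, fcntl.LOCK_UN)\n"
)


def probe_lock(path, timeout=10.0):
    """Return the seconds taken to lock `path`, or None on a timeout."""
    start = time.monotonic()
    child = subprocess.Popen([sys.executable, "-c", CHILD, path])
    try:
        rc = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The kill may have no effect if the child is stuck in the kernel,
        # so we deliberately do not wait for it afterwards.
        child.kill()
        return None
    if rc != 0:
        raise RuntimeError("lock probe child exited with status %d" % rc)
    return time.monotonic() - start


if __name__ == "__main__":
    elapsed = probe_lock(sys.argv[1])
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if elapsed is None:
        print(stamp, "lock request stalled on", sys.argv[1])
    else:
        print(stamp, "lock granted in %.2fs" % elapsed)
```

Run from cron against a scratch file on the NFS mount, this would at
least give timestamps for when lock requests stop being answered.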

On the Server, the only symptom I've witnessed is that rpc.lockd
starts using a bit more CPU than it usually does. Normal utilization
is 0.0, and when the problem is happening, it might go up to 3.0 or
so. "cat"ing a file on the Server which appears stalled on the
Client works fine.
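
Since the rise in rpc.lockd CPU is the only Server-side signal so far, it
could be logged with timestamps and compared against the Client-side
stalls. The following is a rough sketch, not something in place here: it
assumes Python 3 on the Server, that "ps -axo pcpu,comm" reports the
daemon under the command name "rpc.lockd", and the threshold and interval
values are arbitrary placeholders.

```python
import subprocess
import time


def lockd_cpu(ps_output):
    """Sum the %CPU column of every rpc.lockd line in ps output."""
    total = 0.0
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].strip() == "rpc.lockd":
            total += float(fields[0])
    return total


def watch(threshold=1.0, interval=60):
    """Print a timestamped line whenever rpc.lockd CPU reaches threshold."""
    while True:
        out = subprocess.run(
            ["ps", "-axo", "pcpu,comm"],
            capture_output=True, text=True, check=True,
        ).stdout
        cpu = lockd_cpu(out)
        if cpu >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, "rpc.lockd at %.1f%% CPU" % cpu, flush=True)
        time.sleep(interval)


if __name__ == "__main__":
    watch()
```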

Stopping and starting nfslocking on the Server seems to clear things
up. Apache on the Client will also recover on its own, I'm guessing
after each stalled lock reaches a timeout. I usually gracefully
restart Apache, which forces the recovery to happen faster.

As for timing, it doesn't appear to be periodic. It doesn't appear to
be load related either - I suffered through a Digg of one of the
sites, and while the Client machine served more bandwidth in that
couple of days than it had in a month, this particular problem did
not occur.

Over the past three months or so, this issue has probably cropped up  
three or four times.

What can I do to troubleshoot this? I would like to add more client  
machines, but I can't until this problem is resolved.

Changing OS builds at this point, unless absolutely necessary, is not  
something I want to do.

Thanks for any insight!

--
Tom Ierna
President
Shockergroup, Inc.



