Date:      Mon, 12 May 2003 22:55:09 -0400 (EDT)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Don Lewis <truckman@FreeBSD.org>
Cc:        current@FreeBSD.org
Subject:   Re: rpc.lockd spinning; much breakage
Message-ID:  <Pine.NEB.3.96L.1030512223339.4858G-100000@fledge.watson.org>
In-Reply-To: <200305122154.h4CLsXM7048693@gw.catspoiler.org>

On Mon, 12 May 2003, Don Lewis wrote:

> On 12 May, Robert Watson wrote:
> 
> >> Why not attach the process in gdb and step through the code to find the
> >> loop? 
> > 
> > Well, I guess the problem is I'm not familiar with the NFS lock manager
> > protocol, and what I'm looking for more is debugging advice: is the best
> > approach to attach to the client or server rpc.lockd?  I had a lot of
> > trouble getting ethereal to work well for debugging NLM stuff as it tended
> > to crash. :-)  Things are somewhat complicated by the fact that once you
> > lose the rpc.lockd on a client, lots of programs begin to hang and stack
> > up... 
> 
> Since the server spins in the absence of traffic, I'd look there first. 
> It should reject any corrupt requests, and if it loops forever when it
> gets a request, whether corrupt or legal, it must have a bug (think
> DoS). If it loops on a legal request, then the clients will all get
> hosed, whether or not they are working correctly.  If it's an illegal
> request that is causing the server to loop, you may be able to gather
> some more information about what's wrong with the request. 

Ok, well, I seem to be running into several classes of problems with
rpc.lockd:

(1) When O_NONBLOCK is passed to open() along with O_EXLOCK, EAGAIN is
    normally returned if the lock is contended.  However, when using
    rpc.lockd, I get back EACCES instead.  You can reproduce this by
    trying to acquire the same lock twice using lockf(1) on the same
    client, or different clients, or the server and the client (...).  The
    local test I use is:

       server% lockf -k test sleep 20

       client% lockf -k test echo hi
       lockf: cannot open test: Permission denied

    Normally, lockf(1) gets back EAGAIN and polls for the lock to be
    released.  I'm not sure which case in the client rpc.lockd(8) it is.
    Since Solaris doesn't support O_EXLOCK or lockf(1), it wasn't easy for
    me to test there, but since the server returns the same result whether
    it's FreeBSD or Solaris, I'm guessing it's a client-side mapping
    problem.
    Presumably some instance of nlm_denied should return EAGAIN instead.

(2) When O_NONBLOCK is not passed to open() along with O_EXLOCK, and a
    lock is contended, wakeups are not properly delivered from the
    client-side rpc.lockd to the process.  I'm not sure of the exact
    sequence of events, but in this test I ran a locktest client on two
    FreeBSD 5.x nodes with rpc.lockd, and a Solaris server.  There was
    almost a 15-second lag between the release of the lock by one client
    and the wakeup of the waiting process on the other.  Here's the
    rpc.lockd debugging log on the client showing the lag:

<server acknowledges lock request to cboss client>
  May 12 22:43:59 cboss rpc.lockd: nlm4_lock_res from raven.rv.nailabs.com

<server notifies client lock has been released by other client and
 granted>
  May 12 22:44:07 cboss rpc.lockd: nlm4_granted_msg from raven.rv.nailabs.com

<server sends another message to the client>
  May 12 22:44:19 cboss rpc.lockd: nlm4_lock_res from raven.rv.nailabs.com

   The waiting process is woken up at 22:44:19 instead of at 22:44:07, so
   there's a 12-second lag before the process is woken up, even though the
   client-side rpc.lockd apparently received the notification
   substantially earlier.

(3) Sometimes rpc.lockd on 5.x acting as a server gets really confused
    when you mix local and remote locks.  I haven't quite figured out the
    circumstances, but occasionally I run into a situation where a client
    contends against an existing lock on the server, and the client never
    receives a notification from the server that the lock has been
    released.  It looks like the server records state that the lock is
    contended, but perhaps never properly re-polls the kernel to see
    whether the local lock has since been released:

May 12 22:53:35 cboss rpc.lockd: nlm_lock_msg from crash1.cboss.tislabs.com
May 12 22:53:35 cboss rpc.lockd: Entering getlock... 
May 12 22:53:35 cboss rpc.lockd: recieved fhandle size 32, local size 28
May 12 22:53:35 cboss rpc.lockd: Pointer to new lock is 0x8090800 
May 12 22:53:35 cboss rpc.lockd: Entering do_lock... 
May 12 22:53:35 cboss rpc.lockd: Entering lock_partialfilelock 
May 12 22:53:35 cboss rpc.lockd: Entering lock_nfslock... 
May 12 22:53:35 cboss rpc.lockd: Entering test_nfslock 
May 12 22:53:35 cboss rpc.lockd: Entering lock search loop 
May 12 22:53:35 cboss rpc.lockd: *********************************** 
May 12 22:53:35 cboss rpc.lockd: Dumping match filelock 
May 12 22:53:35 cboss rpc.lockd: *********************************** 
May 12 22:53:35 cboss rpc.lockd: *********************************** 
May 12 22:53:35 cboss rpc.lockd: Top of lock loop 
May 12 22:53:35 cboss rpc.lockd: Pointer to file lock: 0x8072800 
May 12 22:53:35 cboss rpc.lockd: *********************************** 
May 12 22:53:35 cboss rpc.lockd: Dumping test filelock 
May 12 22:53:35 cboss rpc.lockd: *********************************** 
May 12 22:53:35 cboss rpc.lockd: *********************************** 
May 12 22:53:35 cboss rpc.lockd: test_nfslock: filehandle match found 
May 12 22:53:35 cboss rpc.lockd: Entering region overlap with vals: 0:0--0:0 
May 12 22:53:35 cboss rpc.lockd: Exiting region overlap with val: 4 
May 12 22:53:35 cboss rpc.lockd: test_nfslock: Region overlap found 0 : 0 -- 0 : 0 
May 12 22:53:35 cboss rpc.lockd: test_nfslock: Exclusivity failure: 1 1 
May 12 22:53:35 cboss rpc.lockd: Checking filelock identity 
May 12 22:53:35 cboss rpc.lockd: Exiting checking filelock identity: retval: 0 
May 12 22:53:35 cboss rpc.lockd: test_nfslock: Lock attempt failed 
May 12 22:53:35 cboss rpc.lockd: Desired lock 
May 12 22:53:35 cboss rpc.lockd: Conflicting lock 
May 12 22:53:35 cboss rpc.lockd: Dumping file locks 
May 12 22:53:35 cboss rpc.lockd: Exiting test_nfslock 
May 12 22:53:35 cboss rpc.lockd: Exiting lock_nfslock... 
May 12 22:53:35 cboss rpc.lockd: BLOCKING LOCK RECEIVED 
May 12 22:53:35 cboss rpc.lockd: Entering add_blockingfilelock 
May 12 22:53:35 cboss rpc.lockd: Exiting add_blockingfilelock 
May 12 22:53:35 cboss rpc.lockd: Exiting lock_partialfilelock 
May 12 22:53:35 cboss rpc.lockd: PFL_NFS blocking lock denied.  Queued. 
May 12 22:53:35 cboss rpc.lockd: Exiting do_lock... 
May 12 22:53:35 cboss rpc.lockd: Pointer to new lock is 0x8090800 
May 12 22:53:35 cboss rpc.lockd: Exiting getlock... 

   As a result, the client is never granted the lock by the server, and
   waits indefinitely.  Once in this state, rpc.lockd occasionally starts
   to spin on the server, but I'm still trying to figure out how to
   reproduce that.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories



