Date: Mon, 12 May 2003 22:55:09 -0400 (EDT)
From: Robert Watson <robert@fledge.watson.org>
To: Don Lewis
Cc: alfred@FreeBSD.org, current@FreeBSD.org
Subject: Re: rpc.lockd spinning; much breakage
In-Reply-To: <200305122154.h4CLsXM7048693@gw.catspoiler.org>

On Mon, 12 May 2003, Don Lewis wrote:

> On 12 May, Robert Watson wrote:
>
> >> Why not attach the process in gdb and step through the code to find
> >> the loop?
> >
> > Well, I guess the problem is I'm not familiar with the NFS lock manager
> > protocol, and what I'm looking for is more debugging advice: is the
> > best approach to attach to the client or the server rpc.lockd?  I had
> > a lot of trouble getting ethereal to work well for debugging NLM
> > stuff, as it tended to crash. :-)  Things are somewhat complicated by
> > the fact that once you lose the rpc.lockd on a client, lots of
> > programs begin to hang and stack up...
>
> Since the server spins in the absence of traffic, I'd look there first.
> It should reject any corrupt requests, and if it loops forever when it
> gets a request, whether corrupt or legal, it must have a bug (think
> DoS).  If it loops on a legal request, then the clients will all get
> hosed, whether or not they are working correctly.  If it's an illegal
> request that is causing the server to loop, you may be able to gather
> some more information about what's wrong with the request.

Ok, well, I seem to be running into several classes of problems with
rpc.lockd:

(1) When O_NONBLOCK is passed to open() along with O_EXLOCK, EAGAIN is
normally returned if the lock is contended.  However, when using
rpc.lockd, I get back EACCES instead.  You can reproduce this by trying
to acquire the same lock twice using lockf(1) on the same client, on
different clients, or on the server and a client (...).  The local test
I use is:

  server% lockf -k test sleep 20
  client% lockf -k test echo hi
  lockf: cannot open test: Permission denied

Normally, lockf(1) gets back EAGAIN and polls for the lock to be
released; I'm not sure which case in the client-side rpc.lockd(8)
produces the EACCES.  Solaris doesn't support O_EXLOCK or lockf(1), so
it wasn't easy for me to test there, but since the server returns the
same result whether it is FreeBSD or Solaris, I'm guessing it's a
client-side mapping problem.  Presumably some instance of nlm_denied
should return EAGAIN instead.
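
For reference, a minimal C reproduction of the open() behaviour in (1)
looks something like the following.  This is just a sketch, not the
lockf(1) source; "test" is assumed to be a file on the NFS mount being
tested:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int fd;

    /* A second instance contends with the first, which holds the lock. */
    fd = open("test", O_RDWR | O_CREAT | O_EXLOCK | O_NONBLOCK, 0644);
    if (fd == -1) {
        /* Expected on contention: EAGAIN; observed via rpc.lockd: EACCES. */
        printf("open: %s (errno %d)\n", strerror(errno), errno);
        return (1);
    }
    printf("lock acquired, holding for 20 seconds\n");
    sleep(20);
    close(fd);
    return (0);
}

Run it once to take the lock, then again (locally or from another NFS
client) and see which errno comes back from the second open().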
(2) When O_NONBLOCK is not passed to open() along with O_EXLOCK, and a
lock is contended, wakeups are not properly delivered from the
client-side rpc.lockd to the process.  I'm not sure of the exact
sequence of events, but in this test I ran a locktest client on two
FreeBSD 5.x nodes running rpc.lockd, against a Solaris server.  There
was almost a 15-second lag between the release of the lock by one
client and the waking of the blocked process on the other.  Here's the
rpc.lockd debugging log on the client showing the lag:

May 12 22:43:59 cboss rpc.lockd: nlm4_lock_res from raven.rv.nailabs.com
May 12 22:44:07 cboss rpc.lockd: nlm4_granted_msg from raven.rv.nailabs.com
May 12 22:44:19 cboss rpc.lockd: nlm4_lock_res from raven.rv.nailabs.com

The waiting process is woken up at 22:44:19 instead of at 22:44:07, so
there's a 12-second lag before the process is woken up, even though a
first glance suggests that the client-side rpc.lockd got the
notification substantially earlier.
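
The blocking side of the test is essentially the following (a minimal
sketch, not the actual locktest source); timestamping around the open()
is enough to see the gap between nlm4_granted_msg arriving and the
process waking up:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
    int fd;
    time_t before, after;

    time(&before);
    /* No O_NONBLOCK: open() blocks until rpc.lockd grants us the lock. */
    fd = open("test", O_RDWR | O_CREAT | O_EXLOCK, 0644);
    time(&after);
    if (fd == -1) {
        perror("open");
        return (1);
    }
    printf("lock acquired after %ld seconds\n", (long)(after - before));
    sleep(20);    /* hold the lock so the other node contends */
    close(fd);
    return (0);
}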
(3) Sometimes rpc.lockd on 5.x acting as a server gets really confused
when you mix local and remote locks.  I haven't quite figured out the
circumstances, but occasionally I run into a situation where a client
contends against an existing lock on the server, and the client never
receives a notification from the server that the lock has been
released.  It looks like the server stores state indicating that the
lock is contended, but perhaps never properly re-polls the kernel to
see whether the lock has since been released locally:

May 12 22:53:35 cboss rpc.lockd: nlm_lock_msg from crash1.cboss.tislabs.com
May 12 22:53:35 cboss rpc.lockd: Entering getlock...
May 12 22:53:35 cboss rpc.lockd: recieved fhandle size 32, local size 28
May 12 22:53:35 cboss rpc.lockd: Pointer to new lock is 0x8090800
May 12 22:53:35 cboss rpc.lockd: Entering do_lock...
May 12 22:53:35 cboss rpc.lockd: Entering lock_partialfilelock
May 12 22:53:35 cboss rpc.lockd: Entering lock_nfslock...
May 12 22:53:35 cboss rpc.lockd: Entering test_nfslock
May 12 22:53:35 cboss rpc.lockd: Entering lock search loop
May 12 22:53:35 cboss rpc.lockd: ***********************************
May 12 22:53:35 cboss rpc.lockd: Dumping match filelock
May 12 22:53:35 cboss rpc.lockd: ***********************************
May 12 22:53:35 cboss rpc.lockd: ***********************************
May 12 22:53:35 cboss rpc.lockd: Top of lock loop
May 12 22:53:35 cboss rpc.lockd: Pointer to file lock: 0x8072800
May 12 22:53:35 cboss rpc.lockd: ***********************************
May 12 22:53:35 cboss rpc.lockd: Dumping test filelock
May 12 22:53:35 cboss rpc.lockd: ***********************************
May 12 22:53:35 cboss rpc.lockd: ***********************************
May 12 22:53:35 cboss rpc.lockd: test_nfslock: filehandle match found
May 12 22:53:35 cboss rpc.lockd: Entering region overlap with vals: 0:0--0:0
May 12 22:53:35 cboss rpc.lockd: Exiting region overlap with val: 4
May 12 22:53:35 cboss rpc.lockd: test_nfslock: Region overlap found 0 : 0 -- 0 : 0
May 12 22:53:35 cboss rpc.lockd: test_nfslock: Exclusivity failure: 1 1
May 12 22:53:35 cboss rpc.lockd: Checking filelock identity
May 12 22:53:35 cboss rpc.lockd: Exiting checking filelock identity: retval: 0
May 12 22:53:35 cboss rpc.lockd: test_nfslock: Lock attempt failed
May 12 22:53:35 cboss rpc.lockd: Desired lock
May 12 22:53:35 cboss rpc.lockd: Conflicting lock
May 12 22:53:35 cboss rpc.lockd: Dumping file locks
May 12 22:53:35 cboss rpc.lockd: Exiting test_nfslock
May 12 22:53:35 cboss rpc.lockd: Exiting lock_nfslock...
May 12 22:53:35 cboss rpc.lockd: BLOCKING LOCK RECEIVED
May 12 22:53:35 cboss rpc.lockd: Entering add_blockingfilelock
May 12 22:53:35 cboss rpc.lockd: Exiting add_blockingfilelock
May 12 22:53:35 cboss rpc.lockd: Exiting lock_partialfilelock
May 12 22:53:35 cboss rpc.lockd: PFL_NFS blocking lock denied.  Queued.
May 12 22:53:35 cboss rpc.lockd: Exiting do_lock...
May 12 22:53:35 cboss rpc.lockd: Pointer to new lock is 0x8090800
May 12 22:53:35 cboss rpc.lockd: Exiting getlock...

As a result, the client is never granted the lock by the server and
waits indefinitely.  Once in this state, rpc.lockd occasionally starts
to spin on the server, but I'm still trying to figure out how to
reproduce that.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories