From owner-freebsd-bugs@FreeBSD.ORG Thu Feb 8 01:20:25 2007
Date: Thu, 8 Feb 2007 01:20:25 GMT
Message-Id: <200702080120.l181KPhp025149@freefall.freebsd.org>
To: freebsd-bugs@FreeBSD.org
From: Doug Rudoff
Subject: Re: kern/107555: [rpc] 30 second delay in NFS lock of file after waiting for lock
Reply-To: Doug Rudoff

The following reply was made to PR kern/107555; it has been noted by GNATS.

From: Doug Rudoff
To: bug-followup@FreeBSD.org
Subject: Re: kern/107555: [rpc] 30 second delay in NFS lock of file after waiting for lock
Date: Wed, 7 Feb 2007 16:48:01 -0800 (PST)

 I've discovered what's happening. On this particular Linux client, a "rpcinfo -p" showed no registered nfs rpc services, including the important "nlockmgr". This was despite nfs and lockd running on the Linux client.

 On the FreeBSD side, when the original lock is released by the first client app, lockd attempts to send an NLM_GRANTED to the waiting second client app. But with nlockmgr not a registered rpc service, lockd is not able to create an rpc client handle and thus is unable to send the message. However, lockd does no error checking after attempting to send the granted message and assumes the message was sent successfully. At this point lockd has the file locked by a client that is unaware it has a lock.

 The waiting Linux client app gives up waiting for the NLM_GRANTED from FreeBSD's lockd after a set period and sends a new lock request. Since lockd is already holding the lock on the file for the client, the lock is granted.

 When I restarted nfs on the Linux client, nlockmgr was listed as an rpc service, and the 30 second delay in getting a lock did not occur.

 You may wonder how any other messages are returned to the client if the client's rpc services aren't registered. When lockd receives a message, it knows the client handle that sent the message and can immediately reply to the same handle. But for the NLM_GRANTED message, the client handle isn't stored alongside the waiting lock request, so lockd has to ask the client host for the handle through the rpc services that are registered.

 To sum things up:

 1) The problem was due to the missing nlockmgr rpc service on the Linux client.

 2) FreeBSD's lockd assumes it sent an NLM_GRANTED to a client waiting for a lock, even if it's unable to send the message.

 3) Since lockd assumes it sent the message, lockd holds the lock for the client, with the client being unaware it has the lock.

 4) Since the client never received the NLM_GRANTED while it was waiting for a lock, after 30 seconds it asks for the lock again, and receives it because lockd is already holding the lock for the client.

 In send_granted(), if the client handle can't be obtained, there's this comment: "We fail to notify remote that the lock has been granted.
 The client will timeout and retry, the lock will be granted at this time." So, it was clearly intentional to not care if the client received the NLM_GRANTED message. This is further shown by the fact that lockd does not look for the reply from the client confirming that it has accepted the granted lock.

 I'm going to suggest that if it is known for certain that the client didn't receive the granted message, then the lock should not be granted. Now this won't affect the problematic behavior. It will still take 30 seconds for the client to time out and request the lock again. But during those 30 seconds another client could successfully grab a lock. Without this change, if the waiting client dies, lockd will still be holding the lock, unaware that the client is gone, and no other client will be able to get the lock.

 My suggestion on how to fix this:

 In lockd_lock.c, send_granted() is defined with a void return. Change it to an int return, with -1 returned if the client handle was not obtained, and 0 if the message was sent. In retry_blockingfilelocklist(), if send_granted() returns -1, then the initial lock request is denied and the client will have to ask for the lock again.

 I was thinking an alternative fix would be to add the client handle to struct file_lock. But the comments before get_client() in lock_proc.c give good reasons why you don't want to do that (e.g. the client host reboots and the client handle is no longer valid).

 I have created a patch, but until I can get a Linux client into the state where nfs and lockd are running on it but not listed in the rpc registry, I won't be able to test it exactly (although I could do a test by altering the code so that send_granted() always failed).
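
 The return-value change suggested above can be sketched roughly as follows. This is a simplified stand-alone model and not the actual patch: the struct file_lock here is a cut-down stand-in with hypothetical fields (client_name, client_reachable, granted), the failure to obtain a client handle is simulated by a flag rather than a real get_client() call, and retry_one_blocked_lock() is a hypothetical helper standing in for the per-lock step inside retry_blockingfilelocklist().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Cut-down stand-in for lockd's struct file_lock (hypothetical fields;
 * the real structure in lockd_lock.c is different and larger). */
struct file_lock {
    const char *client_name;   /* host that is waiting for the lock */
    bool client_reachable;     /* simulates whether an rpc client handle can be obtained */
    bool granted;              /* true if lockd considers the lock held by this client */
};

/* Suggested convention: return 0 if NLM_GRANTED was sent,
 * -1 if no client handle could be obtained. */
int send_granted(const struct file_lock *fl)
{
    assert(fl != NULL);
    if (!fl->client_reachable) {
        /* Simulated failure: in lockd this is where obtaining the
         * client handle would have failed. */
        return -1;
    }
    printf("NLM_GRANTED sent to %s\n", fl->client_name);
    return 0;
}

/* Hypothetical helper modelling one step of retry_blockingfilelocklist():
 * tentatively grant the lock, then undo the grant if the client could
 * not be notified. Returns true if the client now holds the lock. */
bool retry_one_blocked_lock(struct file_lock *fl)
{
    assert(fl != NULL);
    fl->granted = true;
    if (send_granted(fl) == -1) {
        /* The client cannot know it holds the lock, so do not keep
         * the lock on its behalf; it has to ask again later. */
        fl->granted = false;
        return false;
    }
    return true;
}
```

 In this model, a waiter with client_reachable set to false ends up with granted == false, so another client could take the lock during the 30 seconds before the first one retries.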