Message-Id: <200305150221.h4F2LgM7054256@gw.catspoiler.org>
Date: Wed, 14 May 2003 19:21:42 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: robert@fledge.watson.org
In-Reply-To: <Pine.NEB.3.96L.1030514095118.8018B-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
cc: bsder@allcaps.org
cc: alfred@FreeBSD.org
cc: current@FreeBSD.org
Subject: Re: rpc.lockd spinning; much breakage

On 14 May, Robert Watson wrote:
>
> On Tue, 13 May 2003, Don Lewis wrote:
>
>> I don't know if the client will retry in the blocking case or if the
>> server side will have to grow the code to poll any local locks that it
>> might encounter.
>
> Based on earlier experience with the wakeups getting "lost", it sounds
> like the re-polling takes place once every ten seconds on the client for
> blocking locks.

That seems to make sense.  It looks like the client side more or less just
tosses the "blocked" response and waits for the grant message to arrive.
I guess it periodically polls while it waits.

> Speaking of re-polling, here's another bug:  Open two pty's on the NFS
> client.  On pty1, grab and hold an exclusive lock on a file; sleep.  On
> pty2, do a blocking lock attempt on open, but Ctrl-C the process before
> the pty1 process wakes up, meaning that the lock attempt is effectively
> aborted.  Now kill the first process, releasing the lock, and attempt to
> grab the lock on the file: you'll hang forever.  The client rpc.lockd has
> left a blocking lock request registered with the server, but never
> released that lock for the now missing process.

> It looks like rpc.statd on the client needs to remember that it requested
> the lock, and when it discovers that the process requesting the lock has
> evaporated, it should immediately release the lock on its behalf.  It's
> not clear to me how that should be accomplished: perhaps when it tries to
> wake up the process and discovers it is missing, it should do it, or if
> the lock attempt is aborted early due to a signal, a further message
> should be sent from the kernel to the userland rpc.lockd to notify it that
> the lock instance is no longer of interest.  Note that if we're only using
> the pid to identify a process, not a pid and some sort of generation
> number, there's the potential for pid reuse and a resulting race. 
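
To make the pid-reuse concern above concrete, here is a toy Python model
(not the real rpc.lockd code). The Owner tuple, its "generation" field,
and the PendingLocks table are hypothetical names made up for this
sketch; it only shows why a pending request keyed on the pid alone could
be matched to an unrelated process that later gets the same pid.

```python
from typing import Dict, NamedTuple, Optional


class Owner(NamedTuple):
    pid: int
    generation: int  # hypothetical counter that changes when a pid is reused


class PendingLocks:
    """Toy table of outstanding blocking lock requests on the client."""

    def __init__(self, use_generation: bool) -> None:
        self.use_generation = use_generation
        self.pending: Dict[object, str] = {}

    def _key(self, owner: Owner) -> object:
        # Key on (pid, generation) or on the bare pid, depending on the mode.
        return owner if self.use_generation else owner.pid

    def request(self, owner: Owner, path: str) -> None:
        self.pending[self._key(owner)] = path

    def grant_target(self, owner_now: Owner) -> Optional[str]:
        # Path whose grant would be delivered to the process running now.
        return self.pending.get(self._key(owner_now))


def misdelivered(use_generation: bool) -> bool:
    table = PendingLocks(use_generation)
    table.request(Owner(pid=1234, generation=1), "/nfs/file")
    # The requester dies and pid 1234 is reused by an unrelated process.
    return table.grant_target(Owner(pid=1234, generation=2)) is not None
```

In this model, misdelivered(False) reports that the stale request matches
the new process, while misdelivered(True) does not.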

I saw something in the code about a cancel message (nlm4_cancel,
nlm4_cancel_msg).  I think what is supposed to happen is that when
process #2 is killed, the descriptor waiting for the lock will be closed,
which should get rid of its lock request.  rpc.lockd on the client
should notice this and send a cancel message to the server.  When process
#1 releases the lock, the second lock will no longer be queued on the
server and process #3 should be able to grab the lock.
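
The sequence I have in mind can be sketched with a toy Python model of the
server-side state for one file. This is only an illustration under my own
assumptions, not the actual rpc.lockd logic: LockServer, its methods, and
scenario() are invented names, and cancel() merely stands in for the
effect an nlm4_cancel-style message is supposed to have.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class LockServer:
    """Toy model of one exclusive lock plus its queue of blocked requests."""
    holder: Optional[str] = None
    waiters: Deque[str] = field(default_factory=deque)

    def lock(self, owner: str, block: bool = True) -> str:
        if self.holder is None:
            self.holder = owner
            return "granted"
        if block:
            self.waiters.append(owner)  # remembered until granted or cancelled
            return "blocked"
        return "denied"

    def cancel(self, owner: str) -> None:
        # Stand-in for the cancel message: drop the queued request.
        if owner in self.waiters:
            self.waiters.remove(owner)

    def unlock(self, owner: str) -> None:
        if self.holder != owner:
            return
        # Hand the lock to the next queued request, if there is one.
        self.holder = self.waiters.popleft() if self.waiters else None


def scenario(send_cancel: bool) -> str:
    server = LockServer()
    server.lock("proc1")          # process #1 takes the lock
    server.lock("proc2")          # process #2 blocks, then is killed
    if send_cancel:
        server.cancel("proc2")    # client tells the server to forget it
    server.unlock("proc1")        # process #1 exits
    return server.lock("proc3", block=False)
```

With the cancel, scenario(True) lets process #3 take the lock. Without it,
scenario(False) is denied because the lock was handed to the vanished
process #2; a blocking attempt would queue behind it forever, which
matches the hang Robert describes.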

This bug could be in the client rpc.lockd, the client kernel, or the
server rpc.lockd.