From: Terry Lambert
Date: Mon, 17 Mar 2003 19:02:31 -0800
To: "Andrew P. Lentvorski, Jr."
Cc: Steve Sizemore, Dan Nelson, current@FreeBSD.ORG
Subject: Re: NFS file unlocking problem
Message-ID: <3E768C47.229C1DF0@mindspring.com>

"Andrew P. Lentvorski, Jr." wrote:
> The dump doesn't seem to be attached.  However, I note that the request
> being sent is SETLKW which is a blocking wait until lock is granted.  If
> the server thinks the file is already locked, it will hang *and* that is
> the proper behavior.

It is, to ensure FIFO ordering of request grants.  You could also
implement this as a retry.  If you do it the first way, you end up
potentially deadlocking the server when a single client has badly
behaved code that locks against itself.
If you do it the second way, you end up with timing-dependent starvation
deadlocks for individual client processes.

Note that the first deadlock is normal -- it would happen if the file
were local, as well... no help for badly written code -- but I mention
it as important because we are talking about blocking multiple clients.

I don't know what the process is, but a threaded process can cause
a deadlock when it should be a grant/upgrade/downgrade of an existing
lock overlap.  This is because there is no such thing as a thread ID
in the NFS protocol, and if process IDs are different for different
threads, and the requests come from the same system ID, then you
can get a deadlock when none should be present.

To avoid this, either manage all locks in an "apartment" or "rental"
model (queue all requests to a single thread, and have it do the
locking by proxy) OR make sure that all requests from any thread in
a given process are in fact given the same proxy process ID on the
wire.

[ ... This last is not likely your problem, but I mention it, in
  case you are using rfork() or Linux threads ... ]

> What is the result of running this locally on the NFS server and
> attempting to lock the underlying file?  If rpc.lockd is hanging onto a
> lock, running that perl script locally on the actual file (not an NFS
> mounted image of it) should also hang.

That was my next question, as well: does it happen on a local FS
as well as an NFS FS?  Personally, I would *NOT* recommend running
it on the server, but mount a local FS on the client instead; the
fewer variables, the better.

On the other hand, this is clearly a deadlock that requires an
existing, conflicting lock -- IFF you are correct about the
delayed locking behaviour.

> As a side note, you probably want to create a C executable to do this kind
> of fcntl fiddling when attempting to test NFS.  That way you can use a
> locally mounted binary and you won't wind up with all of the Perl access
> calls on the NFS wire.

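A C test of the kind suggested in the quoted text might look roughly like
the sketch below.  It is only an illustration, not the original Perl
script translated: the helper names lock_whole_file() and
lock_unlock_test() are made up for this example, and it assumes the test
is a whole-file write lock taken with F_SETLKW and then released.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Apply one lock operation to the whole file (hypothetical helper).
 * cmd is F_SETLK (fail at once) or F_SETLKW (block until granted);
 * type is F_RDLCK, F_WRLCK or F_UNLCK.
 */
static int lock_whole_file(int fd, int cmd, short type)
{
    struct flock fl = {0};

    assert(cmd == F_SETLK || cmd == F_SETLKW);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */
    return fcntl(fd, cmd, &fl);
}

/*
 * Open path, take a blocking write lock, then release it.
 * Returns 0 on success and -1 on failure (hypothetical helper).
 */
static int lock_unlock_test(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0) {
        perror("open");
        return -1;
    }
    /* This is the call that hangs if a conflicting lock is held. */
    if (lock_whole_file(fd, F_SETLKW, F_WRLCK) < 0) {
        perror("fcntl(F_SETLKW)");
        close(fd);
        return -1;
    }
    if (lock_whole_file(fd, F_SETLK, F_UNLCK) < 0) {
        perror("fcntl(F_UNLCK)");
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Calling lock_unlock_test() once on a path on the NFS mount and once on a
path on a local FS gives the comparison asked about above, with only the
fcntl traffic on the wire.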
Or, at least, use a local copy of Perl.

I recommend a pared down test case.

I suspect that the problem is that something that is expected
to have the same ID is locking against itself.  Does the failure
occur with the same values in all cases in the F_RSETLKW?

If so, I suggest you capture *all* locking packets on your wire,
and then find who is conflicting.  This may be a simple lock
order reversal (deadly embrace deadlock) due to poor application
performance.  You may also find that you have multiple process IDs,
when it should be a single process ID, for the proxy PID for the
conflicting request.  At worst, it would be nice to know the system
that caused it.

Actually, for a lock you know is there, you *can* diagnose the
problem (somewhat) by writing a program on the server, and using
F_GETLK on the range for the hanging lock on the server -- this
will return a struct flock, which will give you range and PID
information.  Do it on the Solaris box, though.

The reason you want to do this on the Solaris box is that the
struct flock on FreeBSD fails to include the l_rsysid -- the remote
system ID.  Actually, given this, I don't understand how FreeBSD
server side proxy locking can actually work at all; it would
incorrectly coalesce locks with local locks when the l_pid matched,
which would be *all* locks in the lockd, and then incorrectly
release them when a local process exited, or any process on any
remote system unlocked an overlapping range (possibly in error).

You are using FreeBSD as the NFS client in this case, right?  If
so, that's probably not an issue for you...

-- Terry
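The F_GETLK diagnosis described above could be sketched along these
lines.  This is a minimal illustration under stated assumptions: the
helper names find_conflicting_lock() and print_lock_holder() are
invented here, only the POSIX fields of struct flock are used, and any
remote system ID field is platform-specific and deliberately left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask whether a write lock on [start, start+len) would be blocked
 * (hypothetical helper; len == 0 means "through end of file").
 * Returns 1 and fills *out if a conflicting lock exists,
 * 0 if the range is free, -1 on error.
 */
static int find_conflicting_lock(int fd, off_t start, off_t len,
                                 struct flock *out)
{
    struct flock fl = {0};

    assert(out != NULL);
    fl.l_type = F_WRLCK;        /* conflicts with read and write locks */
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    if (fcntl(fd, F_GETLK, &fl) < 0)
        return -1;
    *out = fl;
    /* F_GETLK sets l_type to F_UNLCK when nothing would block us. */
    return fl.l_type != F_UNLCK;
}

/* Print the range and PID that F_GETLK reported (hypothetical helper). */
static void print_lock_holder(const struct flock *fl)
{
    printf("type=%s start=%lld len=%lld pid=%ld\n",
           fl->l_type == F_RDLCK ? "read" : "write",
           (long long)fl->l_start, (long long)fl->l_len,
           (long)fl->l_pid);
}
```

F_GETLK does not report locks held by the calling process itself, so
this has to be run from a separate process on the machine that holds
the underlying file.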