Date: Thu, 25 Sep 2003 13:10:07 -0400 (EDT)
From: Robert Watson <rwatson@FreeBSD.org>
To: fs@FreeBSD.org
Subject: Looking for someone to work on NFS advisory locking...
Message-ID: <Pine.NEB.3.96L.1030925130030.50146O-100000@fledge.watson.org>
FreeBSD 5.x includes an rpc.lockd that is substantially more functional at
distributed locking than previous versions. Between 5.0 and 5.1, a couple of
people worked to resolve some of the outstanding issues in rpc.lockd -- we got
a lot of them fixed. However, I'm still running into some substantial problems
using rpc.lockd on a daily basis, especially with my crash machines (they like
to exercise the locking protocol recovery cases :-). Currently, I'm aware of
at least the following problems:

(1) rpc.lockd does not handle "aborts" on lock requests. Normally, FreeBSD
permits signals to interrupt advisory lock acquisition for local files -- when
this happens, the process/thread is simply removed from the list of contenders
for the lock when it is released. This is currently disabled for NFS locking
(no PCATCH in the tsleep) because the kernel code and rpc.lockd have no
protocol expression of the notion that a request has been aborted. If PCATCH
is added back, the lock request "gets lost" -- i.e., rpc.lockd does acquire
the lock on behalf of the process, and then it gets dropped on the floor
somewhere where it can't get unlocked. We need to add the ability to express
lock request aborts to the kernel<->rpc.lockd protocol. We then need to figure
out what to do in rpc.lockd to have the right thing happen with the wire
locking protocol. I don't have the NFS locking spec here, so I'm not sure what
that means.

(2) Recovery in the event of a failure is currently problematic. The kernel
maintains no state about what locks are held; rpc.lockd does maintain state
but reacts poorly in the event of an unexpected wire message. This is
especially visible in the context of (1) above: if rpc.lockd gets a lock grant
it doesn't expect, it appears to drop it and never release the lock. I haven't
confirmed the details here. If you kill rpc.lockd and restart it, you're
hosed.
(3) There may be a problem involving rpc.statd: on a reboot of the client
while holding a lock, the lock never appears to be released on the server, so
when the client boots up and tries to grab it again (i.e., a mail queue lock),
the client will wait forever.

I don't currently have time to chase down these issues, although I can
reproduce a number of them pretty easily. I was wondering if anyone out there
wants to grab rpc.lockd by its horns and attempt to address some of these. In
the past, Alfred has suggested that the right answer for the state management
issue may be to move the NFS locking state machines into the kernel, which
would at least eliminate the issue of synchronizing kernel and rpc.lockd
state. On the other hand, it might make integration with rpc.statd harder, as
well as perhaps make it harder to debug or implement. In any case, we do need
to work on this more, so qualified volunteers would be welcome :-). I also
suspect we need to do more in the way of interop testing with other locking
implementations -- specifically, the Linux implementation.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories