From owner-freebsd-hackers Sat Dec 12 16:11:40 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id QAA15934 for freebsd-hackers-outgoing; Sat, 12 Dec 1998 16:11:40 -0800 (PST) (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id QAA15923 for ; Sat, 12 Dec 1998 16:11:37 -0800 (PST) (envelope-from tlambert@usr01.primenet.com)
Received: (from daemon@localhost) by smtp02.primenet.com (8.8.8/8.8.8) id RAA29953; Sat, 12 Dec 1998 17:11:35 -0700 (MST)
Received: from usr01.primenet.com(206.165.6.201) via SMTP by smtp02.primenet.com, id smtpd029884; Sat Dec 12 17:11:26 1998
Received: (from tlambert@localhost) by usr01.primenet.com (8.8.5/8.8.5) id RAA12494; Sat, 12 Dec 1998 17:11:20 -0700 (MST)
From: Terry Lambert
Message-Id: <199812130011.RAA12494@usr01.primenet.com>
Subject: Re: Is it possible?
To: jkh@zippy.cdrom.com (Jordan K. Hubbard)
Date: Sun, 13 Dec 1998 00:11:19 +0000 (GMT)
Cc: vmg@novator.com, hackers@FreeBSD.ORG
In-Reply-To: <85152.913104877@zippy.cdrom.com> from "Jordan K. Hubbard" at Dec 8, 98 00:14:37 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> > I have run into the proverbial brick wall.  I am the administrator of
> > a fairly busy electronic commerce Web site, www.ftd.com.  Because of
> > the demand placed on a single server, I implemented a load balancing
> > solution that utilizes NFS in the back end.  The versions of FreeBSD

> Hmmm.  Well, as you've already noted, NFS is not really sufficient to
> this task and never has been.  There has never been any locking with
> our NFS and, as evidence would tend to suggest, never a degree of
> interest on anyone's part sufficient to actually motivate them to
> implement the functionality.

This isn't true.  Actually, Jordan was going to do this as a project
in a class he was taking, taught by Kirk McKusick...

> Even with working NFS locks, it's also probably an inferior solution
> to what many folks are doing and that's load balancing at the IP
> level.  Something like the Coyote Point Systems Equalizer package
> (which is also based on FreeBSD, BTW) which takes n boxes and switches
> the traffic for them from one FreeBSD box using load metrics and other
> heuristics to determine the best match for a request would be a fine
> solution, as would any of the several other similar products on the
> market.

This is potentially true.

> Unless you're up for doing an NFS lock implementation, that is.
> Terry's patches only address some purported bugs in the general NFS
> code, they don't actually implement the lock daemon and other
> functionality you'd need to have truly working NFS locks.  Evidently,
> this isn't something which has actually interested Terry enough to do
> either. :-)

Actually, my patches addressed all of the kernel locking issues not
related to implementation of the NFS client RPC code, and not related
to the requisite rpc.lockd code.

I didn't do the rpc.lockd code because you were going to.  I didn't do
the NFS client RPC code because I didn't have a working rpc.lockd on
which to base an implementation.
The patches were *not* gratuitous reorganization, as I believe I can
prove; they addressed architectural issues only insofar as it was
required to address them for (1) binary compatibility with previous
fcntl(2)-based non-proxy locking, (2) support of the concept of proxy
locking at all, and (3) dealing with the issue of a stacking VFS
consuming an NFS client VFS layer, the necessity of splitting lock
assertions across one or more inferior VFS's, and the corresponding
need to be able to abort a lock coalesce on a first VFS if the
operation could not be completed on the second.

Here is my architecture document, which should describe the patches
I've done (basically, all the generic kernel work), and the small
amount of work that remains to be done in user space and in the NFS
client code.  Hopefully, someone with commit privileges will take
these ideas up, since I've personally approached them three times
without success in getting them committed.

PS: I'm pretty sure BSDI examined my code before engaging in their own
implementation, given the emails I exchanged with them over it.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

==========================================================================

				NFS LOCKING

1.0.0.0 Introduction

NFS locking is generally desirable.  BSDI has implemented NFS locking,
purportedly using some of my FreeBSD patches as a starting point, to
achieve the first implementation of NFS locking not derived from Sun
source code.  What's unfortunate about this is that they neglected to
release the code as open source (so far).

2.0.0.0 Server side locking

Server side locking is, by far, the easiest NFS locking problem to
solve.  Server side locking is support for allowing clients to assert
locks against files on an NFS server.

2.1.0.0 Theory of operation

Server side locking is implemented by allowing the client to make RPC
requests which are proxied to the server file space via one or more
processes (generally, two: rpc.lockd and rpc.statd).  Operations are
proxied into the local collision domain, and enforced both against and
by local locks, depending on order of operation.

2.2.0.0 rpc.statd

The purpose of rpc.statd is to provide host status data to machines
that it is monitoring.  This is generally used to allow client machines
to reassert locks (since the NFS protocol is nominally stateless)
following a server restart.  This means we can generally ignore
rpc.statd for the purposes of this discussion.

2.3.0.0 rpc.lockd

The purpose of rpc.lockd is to provide responses for file and record
locking requests to client machines.  Because NFS is nominally
stateless, but locks themselves are nominally stateful, there must be a
container for the lock state.  In a UNIX system, containers for lock
state are called "processes".  They provide an ownership context for
the locks, such that the locks can be discarded when the NFS services
are discontinued.  As such, the rpc.lockd is an essential part of the
resource and state tracking mechanism for NFS locks.

The current FreeBSD rpc.lockd unconditionally grants lock requests;
this is sufficient for Solaris interoperability, since Solaris will
complain bitterly if there is not a lockd for a Solaris client to talk
to, but it is of rather limited utility otherwise, since locks are not
enforced, even in the NFS collision domain, let alone between that
domain and other processes on the FreeBSD machine.
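For reference, this is what purely local advisory locking looks like
through the existing fcntl(2) interface; everything proposed below
extends exactly this model so that a single process (rpc.lockd) can
hold locks on behalf of many remote owners.  A minimal local example
(standard POSIX only, nothing proposed in this document):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct flock fl;
	int fd;

	if ((fd = open("/tmp/example.dat", O_RDWR | O_CREAT, 0644)) < 0) {
		perror("open");
		return (1);
	}

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;		/* exclusive (write) lock */
	fl.l_whence = SEEK_SET;		/* l_start is from start of file */
	fl.l_start = 0;			/* first byte... */
	fl.l_len = 100;			/* ...through byte 99 */

	/* Non-blocking request; F_SETLKW would sleep until granted. */
	if (fcntl(fd, F_SETLK, &fl) < 0)
		perror("fcntl(F_SETLK)");

	/*
	 * The lock is owned by this pid, and closing the descriptor (or
	 * exiting) drops it -- the very POSIX semantic that becomes a
	 * problem for a proxy like rpc.lockd, as described below.
	 */
	close(fd);
	return (0);
}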
Note that it is possible to enforce the NFS locks within the NFS
collision domain solely in the rpc.lockd itself, but this is generally
not a sufficient answer, both because of architectural issues having to
do with the current rpc.lockd implementation's handling of blocked
requests (it has none), and because locks enforced only there would
still be invisible to local processes outside the NFS collision domain.

2.3.1.0 Interface problems in FreeBSD

FreeBSD has a number of interface problems that prevent implementation
of a functional rpc.lockd that enforces locks within both collision
domains.

2.3.1.1 FreeBSD problem #1: Conversion of NFS handles to FD's

Historically, NFS locks have been asserted by converting an NFS file
handle into an open file descriptor, and then asserting the proxy lock
against the descriptor.

SOLUTION

FreeBSD must implement an F_CNVT interface, to allow the rpc.lockd to
convert an NFS handle into an open file descriptor.  This is the first
step in asserting a lock: get a file descriptor for use as a handle to
the local locking mechanisms, to perform operations on behalf of client
machines.

2.3.1.2 FreeBSD problem #2: POSIX lock-release-on-close semantics

The second problem FreeBSD faces is that, under POSIX locking
semantics, closing any descriptor for a file implicitly releases all of
the locks the process holds on that file.  This will not work for the
rpc.lockd proxy, since the same process proxies locks for multiple
remote processes, and the semantics need to be enforced on a
per-remote-process basis, not on a per-rpc.lockd basis.

SOLUTION

FreeBSD must implement the fcntl option F_NONPOSIX for flagging
descriptors on which POSIX unlock semantics must not be enforced.  This
resolves the proxy dissolution problem: a lock release by one remote
client's process will no longer destroy the locks held by all other
remote clients' processes, as would happen if POSIX semantics were
enforced on that descriptor.  It also resolves the case where multiple
locks are being proxied using one descriptor ("descriptor caching").

The rpc.lockd engages in descriptor caching by creating a hash based on
the device/inode pair for each fd that results from a converted NFS
file handle.  The purpose of this is twofold: First, it allows a single
descriptor to be reference counted for multiple clients, such that
descriptors are conserved.  Second, since the file handle presented by
one client may not match the file handle presented by another, either
because of intentional NFS server drift to prevent session hijacking,
or because of local FS semantics, such as loopback mounts, union
mounts, etc., it provides a common rendezvous point for the rpc.lockd.
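A sketch of how an rpc.lockd might use these two proposed facilities
together.  To be clear, F_CNVT and F_NONPOSIX are the commands proposed
above, not existing fcntl(2) commands; the placeholder command values
and the calling convention shown (handle passed as the fcntl argument,
new descriptor returned) are assumptions made purely for illustration.

#include <fcntl.h>
#include <unistd.h>

#ifndef F_CNVT
#define F_CNVT		100	/* placeholder value for the proposed command */
#endif
#ifndef F_NONPOSIX
#define F_NONPOSIX	101	/* placeholder value for the proposed command */
#endif

/*
 * Convert a client's NFS file handle into a descriptor that rpc.lockd
 * can use as its handle to the local locking mechanisms.
 */
int
lockd_handle_to_fd(const void *fh)
{
	int fd;

	/*
	 * Proposed F_CNVT: the descriptor argument is not meaningful for
	 * this command, so 0 is passed; the NFS handle goes in through the
	 * argument pointer and a fresh descriptor comes back.
	 */
	if ((fd = fcntl(0, F_CNVT, (void *)fh)) < 0)
		return (-1);

	/*
	 * Proposed F_NONPOSIX: do not apply the POSIX "closing any
	 * descriptor drops all of this process's locks on the file" rule
	 * to this descriptor, since rpc.lockd proxies locks for many
	 * remote owners through it.
	 */
	if (fcntl(fd, F_NONPOSIX, 1) < 0) {
		close(fd);
		return (-1);
	}
	return (fd);
}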
2.3.1.3 FreeBSD problem #3: Lack of support for proxy operations

The FreeBSD fcntl(2) interface lacks the ability to note the use of a
descriptor as a proxy, as well as the identity of the proxied host id
and process id.  In general, what this means is that there is no
support for proxying locks into the kernel.  SunOS 4.1.3 solved this
problem once; since that is the reference implementation for NFS
locking, even today, inside Sun Microsystems, there is no need to
reinvent the wheel (if someone feels the need, at least this time, make
it round).

SOLUTION

FreeBSD must implement F_RGETLK, F_RSETLK, and F_RSETLKW.  In addition,
the flock structure must be extended, as follows:

/* old flock structure -- required for binary compatibility */
struct oflock {
	off_t	l_start;	/* starting offset */
	off_t	l_len;		/* len = 0 means until end of file */
	pid_t	l_pid;		/* lock owner */
	short	l_type;		/* lock type: read/write, etc. */
	short	l_whence;	/* type of l_start */
};

/* new flock structure -- required for NFS/SAMBA */
struct flock {
	off_t	l_start;	/* starting offset */
	off_t	l_len;		/* len = 0 means until end of file */
	pid_t	l_pid;		/* lock owner */
	short	l_type;		/* lock type: read/write, etc. */
	short	l_whence;	/* type of l_start */
	short	l_version;	/* avoid future compat. problems */
	long	l_rsys;		/* remote system id */
	pid_t	l_rpid;		/* remote lock owner */
};

The use of an overlay structure solves the pending binary compatibility
problem easily and elegantly: the l_version, l_rpid, and l_rsys fields
are defaulted for the F_GETLK, F_SETLK, and F_SETLKW commands.  This
means that those commands copy in a structure of the same size as they
previously used, and binary compatibility is maintained.  For the
F_RGETLK, F_RSETLK, and F_RSETLKW commands, since they did not
previously exist, binary compatibility is unnecessary, and they can
copy in the non-default l_version, l_rpid, and l_rsys identifiers.

By fiat, the oflock l_version is 0, and the flock l_version is 1.  Also
by fiat, the value of l_rsys is -1 for local locks.  In particular,
l_rsys is the IPv4 address of the requester, and -1 is illegal as an
address, and therefore useful as a cookie for "localhost".

This provides the framework whereby proxy operations can be supported
by FreeBSD.

2.3.1.4 FreeBSD problem #4: No support for l_rsys and l_rpid

Having an interface is only part of the battle.  FreeBSD also fails to
support l_rsys and l_rpid internally.  These values must be used as
uniquifiers; that is, the value of l_pid alone is not sufficient.  When
l_rsys is not -1 (localhost), the values of l_rsys and l_rpid must also
be considered in determining whether or not locks may be coalesced.

SOLUTION

Add support to the FreeBSD locking subsystem for these values, to be
used in preventing coalescence and in determining lock equality.  This
work is rather trivial, but important.  As we shall see in section 3,
"Client side locking", we will want to defer our modifications until we
have a complete picture of *both* the client and the server
requirements.

2.3.1.5 FreeBSD problem #5: Not all local FS's support locking

We can say that any local FS that we may wish to mount really wants to
be NFS exportable.  Without getting into the issues of the FreeBSD VFS
mount code, mount handling, and mapping of mounted FS's into the user
visible hierarchy, it is very easy to see that one requirement for
supporting locking is that the underlying FS's must also support
locking.

SOLUTION

Make all underlying FS's support locking by taking the locking out of
the FS and placing it at a higher layer.  Specifically, hang the lock
list off the generic vnode, not off the FS specific inode.  This is an
obvious simplification that reaps many benefits.  However, as we will
discover in section 3, "Client side locking", we will want to defer our
modifications until we have a complete picture of *both* the client and
the server requirements.

Specifically, for VFS stacking to function correctly where an inferior
VFS happens to be the NFS client VFS, we must retain the VOP_ADVLOCK
interface, but as a veto-based mechanism, where local media FS's never
veto the operation (deferring to the upper level code that manages the
lock list off the vnode), whereas the NFS client code may, in fact,
veto the operation (as could any other VFS that proxies operations,
e.g., an SMBFS).
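A sketch of what that veto split might look like.  The VOP_ADVLOCK and
lf_advlock() signatures follow the existing 4.4BSD code, but the
division of labor, the v_lockf field on the vnode, and the "return 1
means no veto" convention are part of the proposal being described
here (and are assumptions of this sketch), not current FreeBSD
behavior.

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/fcntl.h>
#include <sys/vnode.h>
#include <sys/lockf.h>

/*
 * A local media FS never vetoes; the common upper-level code does the
 * real work.  The document describes this as the per-FS VOP reducing
 * to a plain "return(1);".
 */
static int
localfs_advlock(struct vop_advlock_args *ap)
{
	return (1);			/* "no veto" */
}

/*
 * Invented upper-level entry point: ask the layer for a veto first, and
 * only then manipulate (and possibly coalesce) the lock list, which in
 * this proposal hangs off the generic vnode (v_lockf) rather than the
 * FS specific inode.
 */
int
vfs_advlock_common(struct vnode *vp, caddr_t id, int op, struct flock *fl,
    int flags)
{
	struct vop_advlock_args a;

	a.a_vp = vp;
	a.a_id = id;
	a.a_op = op;
	a.a_fl = fl;
	a.a_flags = flags;

	/* A proxy layer (the NFS client, SMBFS, ...) may veto here. */
	if (VOP_ADVLOCK(vp, id, op, fl, flags) != 1)
		return (EAGAIN);	/* vetoed; e.g. the server refused */

	/* Not vetoed: commit against the vnode's (proposed) lock list. */
	return (lf_advlock(&a, &vp->v_lockf, (u_quad_t)0 /* size elided */));
}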
2.3.2.0 Requirements for rpc.lockd

Once the FreeBSD interface issues have been addressed, it is necessary
to address the rpc.lockd itself.  These issues are primarily
algorithmic in nature.

2.3.2.1 When any request is made

When a client makes a request, the first thing that the rpc.lockd must
do is check the client handle hash list to determine if the rpc.lockd
already has a descriptor open on that file *for that handle*.  If a
descriptor is not open for the handle, the rpc.lockd must convert the
NFS file handle into a file descriptor.  The rpc.lockd then fstats the
descriptor to obtain the dev_t and ino_t fields.  This uniquely
identifies the file to the FreeBSD system in a way that, for security
reasons, the handle alone can not.

Note: If the FreeBSD system chose to avoid some of the anti-hijack
precautions it takes, this step could be avoided, and the handle itself
used as a unique identifier.

The POSIX lock-release-on-close semantics are disabled via an fcntl
using the F_NONPOSIX command.

Given the unique identifier, a hash is computed to determine if some
other client somewhere has the file open.  If so, the reference count
on the structure referencing the already open FD is incremented, and
the new FD is closed.  The client handle hash is updated so that
subsequent operations on the same handle do not have to repeat the
conversion.

So there are two hash tables involved: the client handle hash, and the
open file hash.  Use of these hashes guarantees the minimum descriptor
footprint possible for the rpc.lockd.  Since this is the most scarce
resource on the server, this is what we must optimize.  We note at this
point what we noted earlier: we must have at least one descriptor per
file against which locks are being asserted, since we are the process
container for the locks.

2.3.2.2 F_RGETLK

This is a straightforward request.  The request is not a blocking
request, so it is made, and the result is returned.  The rpc.lockd
fills out the l_rpid and l_rsys fields as necessary to make the
request.

2.3.2.3 F_RSETLK

This is likewise non-blocking, and therefore likewise relatively
trivial.

2.3.2.4 F_RSETLKW

This operation is the tough one.  Because the operation would block, we
have an implementation decision to make.  To reduce overhead, we first
try F_RSETLK; if it succeeds, we return success.  This is by far the
most common outcome, given most lock contention mechanisms in most well
written FS client software (note: FS, not NFS: programs are clients of
FS services, even for local FS's).

If this returns EAGAIN, then we must decide how to perform the
operation.  We can either fork, and have the forked process close all
its copies of the descriptors, except the one of interest, and then
implement F_RSETLKW as a blocking operation, or we can implement
F_RSETLKW as a queued operation.  Finally, we could set up a timer, and
use F_RSETLK exclusively, until it succeeds.  This last is
unacceptable, since it does not guarantee that order of grant equals
order of enqueueing, and thus may break program expectations on
semantics, resulting in deadly embrace deadlocks between processes.

Given that FreeBSD supports the concept of sharing a descriptor table
between processes (via vfork(2)), the fork option is by far the most
attractive, with the caveat that we use the vfork to get the descriptor
table shared, so as to not double the fd footprint, even for a short
period of time.  We can likewise enqueue state, and process SIGCHLD to
ensure that the parent rpc.lockd knows about all pending and successful
requests (necessary for proper operation of the rpc.statd daemon).
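Tying the per-request steps in 2.3.2.1 together, here is a user-space
sketch of the two hash tables and the reference counting they imply.
Every structure and function name is invented for illustration, and
plain linked lists stand in for real hash tables; lockd_handle_to_fd()
is the earlier F_CNVT/F_NONPOSIX sketch, and the fixed 64-byte handle
size is an assumption.

#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* From the earlier sketch: proposed F_CNVT + F_NONPOSIX conversion. */
int lockd_handle_to_fd(const void *fh);

struct file_rec {		/* one open descriptor per dev/ino pair */
	dev_t		 dev;
	ino_t		 ino;
	int		 fd;
	int		 refs;
	struct file_rec	*next;
};

struct handle_rec {		/* one entry per distinct client handle */
	char		  fh[64];	/* opaque NFS handle bytes */
	struct file_rec	 *file;
	struct handle_rec *next;
};

static struct file_rec	 *files;	/* the "open file hash" */
static struct handle_rec *handles;	/* the "client handle hash" */

/* Find (or create) the descriptor record backing a client file handle. */
static struct file_rec *
lockd_lookup(const char *fh)
{
	struct handle_rec *h;
	struct file_rec *f;
	struct stat sb;
	int fd;

	for (h = handles; h != NULL; h = h->next)
		if (memcmp(h->fh, fh, sizeof(h->fh)) == 0)
			return (h->file);	/* handle already known */

	if ((fd = lockd_handle_to_fd(fh)) < 0)
		return (NULL);
	if (fstat(fd, &sb) < 0) {		/* dev/ino uniquifier */
		close(fd);
		return (NULL);
	}

	for (f = files; f != NULL; f = f->next)
		if (f->dev == sb.st_dev && f->ino == sb.st_ino)
			break;
	if (f != NULL) {
		close(fd);		/* same file already open: share it */
	} else {
		if ((f = calloc(1, sizeof(*f))) == NULL) {
			close(fd);
			return (NULL);
		}
		f->dev = sb.st_dev;
		f->ino = sb.st_ino;
		f->fd = fd;
		f->next = files;
		files = f;
	}
	f->refs++;			/* one more handle referencing it */

	if ((h = calloc(1, sizeof(*h))) == NULL)
		return (f);		/* degraded: no handle caching */
	memcpy(h->fh, fh, sizeof(h->fh));
	h->file = f;
	h->next = handles;
	handles = h;
	return (f);
}

Unlock processing would decrement refs and, per the clock-list
discussion below, defer the actual close.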
2.3.2.5 Back to the general case

Now we can go back to discussing the general implementation.

The rpc.lockd must decrement the reference count when the locks held by
a given process are removed.  It can either do this by maintaining a
shadow copy of the lock state or, preferably, by performing an F_RGETLK
after a lock is released.  This is part of the resource tracking for
open descriptors in the rpc.lockd.  If the request indicates that there
are no more locks held by that l_rsys/l_rpid pair, then the fd
reference count is decremented, and the per handle hash entry is
removed from the list.  If the reference count goes to zero, then the
descriptor is closed.

DISCUSSION

It is useful to implement late-binding closes.  Specifically, it is
useful to not actually delete the reference immediately.

SOLUTION

The handle references, instead of being deleted, are thrown onto a
clock list.  If the handles are rereferenced within a tunable time
frame, then they are removed from the list and placed back into use;
otherwise, after sufficient time has elapsed, they are inactivated as
above.  This resolves the case of a single client generating a lot of
unnecessary rpc.lockd activity by issuing lock-unlock pairs that would
cause the references to bounce up and down, requiring a lot of system
calls.  It preserves the NFS handle hash for a time after the operation
nominally completes, in the expectation of future operations by that
client.

3.0.0.0 Client side locking

Client side locking is much harder than server side locking.  Client
side locking allows clients to request locks from remote NFS servers on
behalf of local processes running on the client machine.

3.1.0.0 Theory of operation

Client side locking is implemented by the client NFS code in the kernel
making RPC requests against the server, much in the same way that NFS
clients operate when making FS operation requests against NFS servers.
It is simultaneously more difficult, because the code is located in the
kernel, and less difficult, since there is a process context (the
requesting process) to act as a container for the operation until it is
completed by the server.

Recall that server side locking is implemented by allowing the client
to make RPC requests which are proxied to the server file space via one
or more processes (generally, two: rpc.lockd and rpc.statd), and that
operations are proxied into the local collision domain, and enforced
both against and by local locks, depending on order of operation.

3.1.1.0 Interface problems in FreeBSD

FreeBSD has a number of interface problems that prevent implementation
of functional NFS client locking.

3.1.1.1 FreeBSD problem #1: VFS stacking and coalescence

Locks, when asserted, are coalesced by l_pid.  If they are asserted by
a hosted OS API, e.g., an NFS, AppleTalk, or SAMBA server, they are
coalesced by l_rsys and l_rpid as well; we can ignore all but l_pid in
the general case, since exporting an exported FS is foolish and
dangerous.

When locks are asserted, then, the locks are coalesced if the lock is
successful.  Thus, if a process had a file

	[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]

protected by the locks

	[111111111]          [2222222222]
	[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]

and asserted a third lock

	       [333333333333333333]
	[111111111]          [2222222222]
	[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]

that lock would be coalesced:

	[111111111111111111111111111111111]
	[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]

For a local media FS, this is not a problem, since the operation occurs
locally, and is serialized by virtue of that fact.  But for an NFS
client, the lock behaviour is less serialized.
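For concreteness, the coalesce step pictured above is just per-owner
interval merging.  A user-space sketch with invented names (the real
kernel code is the lf_* family in kern/kern_lockf.c, which also handles
lock types and blocked-lock wakeups that are omitted here):

#include <stdlib.h>

struct range {			/* one granted lock for a single owner */
	long		 start;
	long		 end;	/* inclusive */
	struct range	*next;	/* list kept sorted by start */
};

/*
 * Insert [start, end] for one owner and merge it with every range it
 * overlaps or abuts -- the "third lock swallows locks 1 and 2" picture
 * above.  Allocation failure handling is elided.
 */
static struct range *
coalesce(struct range *head, long start, long end)
{
	struct range *r, **prev = &head;

	/* Skip ranges that end before the new range and do not abut it. */
	while ((r = *prev) != NULL && r->end + 1 < start)
		prev = &r->next;

	/* Absorb every range that overlaps or abuts the new one. */
	while ((r = *prev) != NULL && r->start <= end + 1) {
		if (r->start < start)
			start = r->start;	/* grow left */
		if (r->end > end)
			end = r->end;		/* grow right */
		*prev = r->next;
		free(r);
	}

	/* Insert the single merged range in place of what was absorbed. */
	r = malloc(sizeof(*r));
	r->start = start;
	r->end = end;
	r->next = *prev;
	*prev = r;
	return (head);
}

Once this merge has happened, the boundaries of the original locks 1,
2, and 3 are gone; nothing short of a separately kept, uncoalesced
record can reconstruct them, which is the crux of what follows.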
Consider the case of a VFS stacking layer that stacks two filesystems,
and makes the files within them appear to be two extents of a single
file.  We can imagine that this would be useful for combined log files
for a cluster of machines, and for other reasons (there are many other
examples; this is merely the simplest).  So we have:

	[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]

Let's perform the same locks:

	[111111111]          [2222222222]
	[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]

So far, so good.  Now the third lock:

	       [333333333333333333]
	[111111111]          [2222222222]
	[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]

Coalesce, phase one:

	                  [33333333]
	[1111111111111111]   [2222222222]
	[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]

Oops!  The second phase fails, because some other client has the lock:

	                      [XX]

Now we need to back out the operation on the first FS:

	       [33333333]
	[111111111]
	[ffffffffffffffff]

Leaving:

	[1111111]
	[ffffffffffffffff]

Uh-oh: looks like we're screwed.

SOLUTION

Delayed coalescing.  The locks are asserted, but they are not committed
(coalesced) until all the operations have been deemed successful.  By
dividing the assertion phase from the commit phase, we can delay the
coalescing until we know that all locks have been successfully
asserted.

How do we do this?  Very simply, we convert VOP_ADVLOCK into a veto
mechanism, instead of the mechanism by which the lock code is actually
called, and we move the locking operations to upper level (common)
code.  At the same time, we make the OS more robust, since there is
only one place, instead of many, where the code is called.

For stacking layers that stack on more than one VFS, and for proxy
layers, such as NFS, SMB, or AppleTalk client layers, the operation is
a veto, where the operation is proxied, and if the proxy fails, then
the operation is vetoed.  So in general, VOP_ADVLOCK becomes a
"return(1);" for most of the VFS layers, with specific exceptions for
particular layer types, which *may* veto the operation requested by the
upper level code.  If the operation is not vetoed, then the upper level
code commits the operation, and the lock ranges are coalesced.

3.1.1.2 FreeBSD problem #2: What if the NFS layer is first?

If the NFS layer is first, and the operation is subsequently vetoed,
how is the NFS coalesce backed out?

SOLUTION

The shadow graph.  The NFS client, for each given vnode (nfsnode), must
separately maintain the locks against the node on a per process basis.
What this means is that when a process asserts a lock on an NFS
accessed file, the NFS client locking code must maintain an uncoalesced
lock graph.  This is because the lock graph *will* be coalesced on the
server.  In order to back out the operation

	       [33333333]
	[111111111]
	[ffffffffffffffff]
	         |
	         v
	[1111111111111111]
	[ffffffffffffffff]

the client must keep knowledge of the fact that these locks are
separate.  This implies that locks that result in type demotions are
not type demoted to the server (i.e., locks against the server are only
asserted in promote-only mode, so that if they are backed out, there
will not have been a demotion, for example, from write to read, on the
server).  There is currently code in SAMBA which models this, since
SAMBA's consumption of the host FS is similar to an NFS client's
consumption of an NFS server's FS.
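A sketch of the two-phase (assert, then commit) shape this gives the
upper level code for a layer stacked over several inferior vnodes.
Everything here is illustrative: the "return 1 means no veto"
convention follows the document, splitting the requested range per
extent is elided, and lf_commit()/lf_backout() are hypothetical names;
for a proxy layer such as the NFS client, the back-out is only safe
because of the uncoalesced shadow graph just described.

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/fcntl.h>
#include <sys/vnode.h>

/* Hypothetical operations used only by this sketch. */
int lf_commit(struct vnode *vp, caddr_t id, struct flock *fl);
int lf_backout(struct vnode *vp, caddr_t id, struct flock *fl);

int
stackfs_advlock(struct vnode **lower, int nlower, caddr_t id, int op,
    struct flock *fl, int flags)
{
	int i;

	/*
	 * Phase one: offer the lock to every inferior VFS.  Local media
	 * FS's never veto (they just return 1); proxy layers (the NFS
	 * client, SMBFS, AppleTalk) forward the request and may veto it.
	 */
	for (i = 0; i < nlower; i++) {
		if (VOP_ADVLOCK(lower[i], id, op, fl, flags) != 1) {
			/*
			 * Vetoed.  Back out the assertions already made on
			 * earlier extents; nothing has been committed
			 * (coalesced) yet, so local state is undamaged, and
			 * a proxy layer's shadow graph lets its remote
			 * assertion be undone without touching pre-existing
			 * locks.
			 */
			while (--i >= 0)
				(void)lf_backout(lower[i], id, fl);
			return (EAGAIN);
		}
	}

	/*
	 * Phase two: no layer objected.  Only now does the upper level
	 * code merge the new range into the coalesced lock list kept for
	 * each vnode.
	 */
	for (i = 0; i < nlower; i++)
		(void)lf_commit(lower[i], id, fl);
	return (0);
}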
3.2.0.0 The client NFS VFS layer's RPC calls

So far, no one has implemented this.  In general, it is more important
to be a server than it is to be a client, at this time.

The amount of effort to implement this, if one has the ISO documents,
or, more obliquely and therefore with more difficulty, the rpc.lockd
code in the FreeBSD source tree, is pretty small.  This would make a
good one quarter project for a Bachelor of Science in Computer Science
independent study credit.

3.3.0.0 Discussion

In general, all of the issues for an NFS client in FreeBSD apply
equally to the idea of an AppleTalk or SMB client in FreeBSD.  It is
likely that FreeBSD will want to support the ability to operate as a
desktop (and therefore client) OS, even if this is not the primary
niche into which it is currently being driven by the developers.

4.0.0.0 End Of Document

==========================================================================

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message