From owner-freebsd-hackers Sat Dec 12 16:11:40 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id QAA15934 for freebsd-hackers-outgoing; Sat, 12 Dec 1998 16:11:40 -0800 (PST) (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id QAA15923 for ; Sat, 12 Dec 1998 16:11:37 -0800 (PST) (envelope-from tlambert@usr01.primenet.com)
Received: (from daemon@localhost) by smtp02.primenet.com (8.8.8/8.8.8) id RAA29953; Sat, 12 Dec 1998 17:11:35 -0700 (MST)
Received: from usr01.primenet.com(206.165.6.201) via SMTP by smtp02.primenet.com, id smtpd029884; Sat Dec 12 17:11:26 1998
Received: (from tlambert@localhost) by usr01.primenet.com (8.8.5/8.8.5) id RAA12494; Sat, 12 Dec 1998 17:11:20 -0700 (MST)
From: Terry Lambert
Message-Id: <199812130011.RAA12494@usr01.primenet.com>
Subject: Re: Is it possible?
To: jkh@zippy.cdrom.com (Jordan K. Hubbard)
Date: Sun, 13 Dec 1998 00:11:19 +0000 (GMT)
Cc: vmg@novator.com, hackers@FreeBSD.ORG
In-Reply-To: <85152.913104877@zippy.cdrom.com> from "Jordan K. Hubbard" at Dec 8, 98 00:14:37 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> > I have run into the proverbial brick wall.  I am the administrator of
> > a fairly busy electronic commerce Web site, www.ftd.com.  Because of
> > the demand placed on a single server, I implemented a load balancing
> > solution that utilizes NFS in the back end.  The versions of FreeBSD

> Hmmm.  Well, as you've already noted, NFS is not really sufficient to
> this task and never has been.  There has never been any locking with
> our NFS and, as evidence would tend to suggest, never a degree of
> interest on anyone's part sufficient to actually motivate them to
> implement the functionality.

This isn't true.  Actually, Jordan was going to do this as a project
in a class he was taking, taught by Kirk McKusick...

> Even with working NFS locks, it's also probably an inferior solution
> to what many folks are doing and that's load balancing at the IP
> level.  Something like the Coyote Point Systems Equalizer package
> (which is also based on FreeBSD, BTW) which takes n boxes and switches
> the traffic for them from one FreeBSD box using load metrics and other
> heuristics to determine the best match for a request would be a fine
> solution, as would any of the several other similar products on the
> market.

This is potentially true.

> Unless you're up for doing an NFS lock implementation, that is.
> Terry's patches only address some purported bugs in the general NFS
> code, they don't actually implement the lock daemon and other
> functionality you'd need to have truly working NFS locks.  Evidently,
> this isn't something which has actually interested Terry enough to do
> either. :-)

Actually, my patches addressed all of the kernel locking issues not
related to implementation of the NFS client RPC code, and not related
to the requisite rpc.lockd code.

I didn't do the rpc.lockd code because you were going to.  I didn't do
the NFS client RPC code because I didn't have a working rpc.lockd on
which to base an implementation.
The patches were *not* gratuitous reorganization, as I believe I can
prove; they addressed architectural issues only insofar as it was
required to address them for (1) binary compatibility with previous
fcntl(2)-based non-proxy locking, (2) support of the concept of proxy
locking at all, and (3) dealing with the issue of a stacking VFS
consuming an NFS client VFS layer, the necessity of splitting lock
assertions across one or more inferior VFS's, and the corresponding
need to be able to abort a lock coalesce on a first VFS if the
operation could not be completed on the second.

Here is my architecture document, which should describe the patches
I've done (basically, all the generic kernel work), and the small
amount of work that remains to be done in user space and in the NFS
client code.  Hopefully, someone with commit privileges will take
these ideas up, since I've personally approached them three times
without success in getting them committed.

PS: I'm pretty sure BSDI examined my code before engaging in their own
implementation, given the emails I exchanged with them over it.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

==========================================================================

				NFS LOCKING

1.0.0.0 Introduction

NFS locking is generally desirable.  BSDI has implemented NFS locking,
purportedly using some of my FreeBSD patches as a starting point, to
achieve the first implementation of NFS locking not derived from Sun
source code.  What's unfortunate about this is that they neglected to
release the code as open source (so far).

2.0.0.0 Server side locking

Server side locking is, by far, the easiest NFS locking problem to
solve.  Server side locking is support for allowing clients to assert
locks against files on an NFS server.

2.1.0.0 Theory of operation

Server side locking is implemented by allowing the client to make RPC
requests which are proxied to the server file space via one or more
processes (generally, two: rpc.lockd and rpc.statd).  Operations are
proxied into the local collision domain, and enforced both against and
by local locks, depending on order of operation.

2.2.0.0 rpc.statd

The purpose of rpc.statd is to provide host status data to machines
that it is monitoring.  This is generally used to allow client machines
to reassert locks (since the NFS protocol is nominally stateless)
following a server restart.  This means we can generally ignore
rpc.statd for the purposes of this discussion.

2.3.0.0 rpc.lockd

The purpose of rpc.lockd is to provide responses for file and record
locking requests to client machines.  Because NFS is nominally
stateless, but locks themselves are nominally stateful, there must be a
container for the lock state.  In a UNIX system, containers for lock
state are called "processes".  They provide an ownership context for
the locks, such that the locks can be discarded when the NFS services
are discontinued.  As such, the rpc.lockd is an essential part of the
resource and state tracking mechanism for NFS locks.

The current FreeBSD rpc.lockd unconditionally grants lock requests;
this is sufficient for Solaris interoperability, since Solaris will
complain bitterly if there is not a lockd for a Solaris client to talk
to, but it is of rather limited utility otherwise, since locks are not
enforced, even in the NFS collision domain, let alone between that
domain and other processes on the FreeBSD machine.
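For reference, this is what purely local advisory locking looks like
through the existing fcntl(2) interface; everything proposed below
extends exactly this model so that a single process (rpc.lockd) can
hold locks on behalf of many remote owners.  A minimal local example
(standard POSIX only, nothing proposed in this document):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct flock fl;
	int fd;

	if ((fd = open("/tmp/example.dat", O_RDWR | O_CREAT, 0644)) < 0) {
		perror("open");
		return (1);
	}

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;		/* exclusive (write) lock */
	fl.l_whence = SEEK_SET;		/* l_start is from start of file */
	fl.l_start = 0;			/* first byte... */
	fl.l_len = 100;			/* ...through byte 99 */

	/* Non-blocking request; F_SETLKW would sleep until granted. */
	if (fcntl(fd, F_SETLK, &fl) < 0)
		perror("fcntl(F_SETLK)");

	/*
	 * The lock is owned by this pid, and closing the descriptor (or
	 * exiting) drops it -- the very POSIX semantic that becomes a
	 * problem for a proxy like rpc.lockd, as described below.
	 */
	close(fd);
	return (0);
}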
Note that it is possible to enforce the NFS locks within the NFS
collision domain solely in the rpc.lockd itself, but this is generally
not a sufficient answer, both because of architectural issues having to
do with the current rpc.lockd implementation's handling of blocked
requests (it has none), and because locks enforced only there would
still be invisible to local processes outside the NFS collision domain.

2.3.1.0 Interface problems in FreeBSD

FreeBSD has a number of interface problems that prevent implementation
of a functional rpc.lockd that enforces locks within both collision
domains.

2.3.1.1 FreeBSD problem #1: Conversion of NFS handles to FD's

Historically, NFS locks have been asserted by converting an NFS file
handle into an open file descriptor, and then asserting the proxy lock
against the descriptor.

SOLUTION

FreeBSD must implement an F_CNVT interface, to allow the rpc.lockd to
convert an NFS handle into an open file descriptor.  This is the first
step in asserting a lock: get a file descriptor for use as a handle to
the local locking mechanisms, to perform operations on behalf of client
machines.

2.3.1.2 FreeBSD problem #2: POSIX lock-release-on-close semantics

The second problem FreeBSD faces is that, under POSIX locking
semantics, closing any descriptor for a file implicitly releases all of
the locks the process holds on that file.  This will not work for the
rpc.lockd proxy, since the same process proxies locks for multiple
remote processes, and the semantics need to be enforced on a
per-remote-process basis, not on a per-rpc.lockd basis.

SOLUTION

FreeBSD must implement the fcntl option F_NONPOSIX for flagging
descriptors on which POSIX unlock semantics must not be enforced.  This
resolves the proxy dissolution problem: a lock release by one remote
client's process will no longer destroy the locks held by all other
remote clients' processes, as would happen if POSIX semantics were
enforced on that descriptor.  It also resolves the case where multiple
locks are being proxied using one descriptor ("descriptor caching").

The rpc.lockd engages in descriptor caching by creating a hash based on
the device/inode pair for each fd that results from a converted NFS
file handle.  The purpose of this is twofold: First, it allows a single
descriptor to be reference counted for multiple clients, such that
descriptors are conserved.  Second, since the file handle presented by
one client may not match the file handle presented by another, either
because of intentional NFS server drift to prevent session hijacking,
or because of local FS semantics, such as loopback mounts, union
mounts, etc., it provides a common rendezvous point for the rpc.lockd.
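A sketch of how an rpc.lockd might use these two proposed facilities
together.  To be clear, F_CNVT and F_NONPOSIX are the commands proposed
above, not existing fcntl(2) commands; the placeholder command values
and the calling convention shown (handle passed as the fcntl argument,
new descriptor returned) are assumptions made purely for illustration.

#include <fcntl.h>
#include <unistd.h>

#ifndef F_CNVT
#define F_CNVT		100	/* placeholder value for the proposed command */
#endif
#ifndef F_NONPOSIX
#define F_NONPOSIX	101	/* placeholder value for the proposed command */
#endif

/*
 * Convert a client's NFS file handle into a descriptor that rpc.lockd
 * can use as its handle to the local locking mechanisms.
 */
int
lockd_handle_to_fd(const void *fh)
{
	int fd;

	/*
	 * Proposed F_CNVT: the descriptor argument is not meaningful for
	 * this command, so 0 is passed; the NFS handle goes in through the
	 * argument pointer and a fresh descriptor comes back.
	 */
	if ((fd = fcntl(0, F_CNVT, (void *)fh)) < 0)
		return (-1);

	/*
	 * Proposed F_NONPOSIX: do not apply the POSIX "closing any
	 * descriptor drops all of this process's locks on the file" rule
	 * to this descriptor, since rpc.lockd proxies locks for many
	 * remote owners through it.
	 */
	if (fcntl(fd, F_NONPOSIX, 1) < 0) {
		close(fd);
		return (-1);
	}
	return (fd);
}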
2.3.1.3 FreeBSD problem #3: Lack of support for proxy operations

The FreeBSD fcntl(2) interface lacks the ability to note the use of a
descriptor as a proxy, as well as the identity of the proxied host id
and process id.  In general, what this means is that there is no
support for proxying locks into the kernel.  SunOS 4.1.3 solved this
problem once; since that is the reference implementation for NFS
locking, even today, inside Sun Microsystems, there is no need to
reinvent the wheel (if someone feels the need, at least this time, make
it round).

SOLUTION

FreeBSD must implement F_RGETLK, F_RSETLK, and F_RSETLKW.  In addition,
the flock structure must be extended, as follows:

/* old flock structure -- required for binary compatibility */
struct oflock {
	off_t	l_start;	/* starting offset */
	off_t	l_len;		/* len = 0 means until end of file */
	pid_t	l_pid;		/* lock owner */
	short	l_type;		/* lock type: read/write, etc. */
	short	l_whence;	/* type of l_start */
};

/* new flock structure -- required for NFS/SAMBA */
struct flock {
	off_t	l_start;	/* starting offset */
	off_t	l_len;		/* len = 0 means until end of file */
	pid_t	l_pid;		/* lock owner */
	short	l_type;		/* lock type: read/write, etc. */
	short	l_whence;	/* type of l_start */
	short	l_version;	/* avoid future compat. problems */
	long	l_rsys;		/* remote system id */
	pid_t	l_rpid;		/* remote lock owner */
};

The use of an overlay structure solves the pending binary compatibility
problem easily and elegantly: the l_version, l_rpid, and l_rsys fields
are defaulted for the F_GETLK, F_SETLK, and F_SETLKW commands.  This
means that those commands copy in a structure of the same size as they
previously used, and binary compatibility is maintained.  For the
F_RGETLK, F_RSETLK, and F_RSETLKW commands, since they did not
previously exist, binary compatibility is unnecessary, and they can
copy in the non-default l_version, l_rpid, and l_rsys identifiers.

By fiat, the oflock l_version is 0, and the flock l_version is 1.  Also
by fiat, the value of l_rsys is -1 for local locks.  In particular,
l_rsys is the IPv4 address of the requester, and -1 is illegal as an
address, and therefore useful as a cookie for "localhost".

This provides the framework whereby proxy operations can be supported
by FreeBSD.

2.3.1.4 FreeBSD problem #4: No support for l_rsys and l_rpid

Having an interface is only part of the battle.  FreeBSD also fails to
support l_rsys and l_rpid internally.  These values must be used as
uniquifiers; that is, the value of l_pid alone is not sufficient.  When
l_rsys is not -1 (localhost), the values of l_rsys and l_rpid must also
be considered in determining whether or not locks may be coalesced.

SOLUTION

Add support to the FreeBSD locking subsystem for these values, to be
used in preventing coalescence and in determining lock equality.  This
work is rather trivial, but important.  As we shall see in section 3,
"Client side locking", we will want to defer our modifications until we
have a complete picture of *both* the client and the server
requirements.

2.3.1.5 FreeBSD problem #5: Not all local FS's support locking

We can say that any local FS that we may wish to mount really wants to
be NFS exportable.  Without getting into the issues of the FreeBSD VFS
mount code, mount handling, and mapping of mounted FS's into the user
visible hierarchy, it is very easy to see that one requirement for
supporting locking is that the underlying FS's must also support
locking.

SOLUTION

Make all underlying FS's support locking by taking the locking out of
the FS and placing it at a higher layer.  Specifically, hang the lock
list off the generic vnode, not off the FS specific inode.  This is an
obvious simplification that reaps many benefits.  However, as we will
discover in section 3, "Client side locking", we will want to defer our
modifications until we have a complete picture of *both* the client and
the server requirements.

Specifically, for VFS stacking to function correctly where an inferior
VFS happens to be the NFS client VFS, we must retain the VOP_ADVLOCK
interface, but as a veto-based mechanism, where local media FS's never
veto the operation (deferring to the upper level code that manages the
lock list off the vnode), whereas the NFS client code may, in fact,
veto the operation (as could any other VFS that proxies operations,
e.g., an SMBFS).
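A sketch of what that veto split might look like.  The VOP_ADVLOCK and
lf_advlock() signatures follow the existing 4.4BSD code, but the
division of labor, the v_lockf field on the vnode, and the "return 1
means no veto" convention are part of the proposal being described
here (and are assumptions of this sketch), not current FreeBSD
behavior.

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/fcntl.h>
#include <sys/vnode.h>
#include <sys/lockf.h>

/*
 * A local media FS never vetoes; the common upper-level code does the
 * real work.  The document describes this as the per-FS VOP reducing
 * to a plain "return(1);".
 */
static int
localfs_advlock(struct vop_advlock_args *ap)
{
	return (1);			/* "no veto" */
}

/*
 * Invented upper-level entry point: ask the layer for a veto first, and
 * only then manipulate (and possibly coalesce) the lock list, which in
 * this proposal hangs off the generic vnode (v_lockf) rather than the
 * FS specific inode.
 */
int
vfs_advlock_common(struct vnode *vp, caddr_t id, int op, struct flock *fl,
    int flags)
{
	struct vop_advlock_args a;

	a.a_vp = vp;
	a.a_id = id;
	a.a_op = op;
	a.a_fl = fl;
	a.a_flags = flags;

	/* A proxy layer (the NFS client, SMBFS, ...) may veto here. */
	if (VOP_ADVLOCK(vp, id, op, fl, flags) != 1)
		return (EAGAIN);	/* vetoed; e.g. the server refused */

	/* Not vetoed: commit against the vnode's (proposed) lock list. */
	return (lf_advlock(&a, &vp->v_lockf, (u_quad_t)0 /* size elided */));
}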
2.3.2.0 Requirements for rpc.lockd

Once the FreeBSD interface issues have been addressed, it is necessary
to address the rpc.lockd itself.  These issues are primarily
algorithmic in nature.

2.3.2.1 When any request is made

When a client makes a request, the first thing that the rpc.lockd must
do is check the client handle hash list to determine if the rpc.lockd
already has a descriptor open on that file *for that handle*.  If a
descriptor is not open for the handle, the rpc.lockd must convert the
NFS file handle into a file descriptor.  The rpc.lockd then fstats the
descriptor to obtain the dev_t and ino_t fields.  This uniquely
identifies the file to the FreeBSD system in a way that, for security
reasons, the handle alone can not.

Note: If the FreeBSD system chose to avoid some of the anti-hijack
precautions it takes, this step could be avoided, and the handle itself
used as a unique identifier.

The POSIX lock-release-on-close semantics are disabled via an fcntl
using the F_NONPOSIX command.

Given the unique identifier, a hash is computed to determine if some
other client somewhere has the file open.  If so, the reference count
on the structure referencing the already open FD is incremented, and
the new FD is closed.  The client handle hash is updated so that
subsequent operations on the same handle do not have to repeat the
conversion.

So there are two hash tables involved: the client handle hash, and the
open file hash.  Use of these hashes guarantees the minimum descriptor
footprint possible for the rpc.lockd.  Since this is the most scarce
resource on the server, this is what we must optimize.  We note at this
point what we noted earlier: we must have at least one descriptor per
file against which locks are being asserted, since we are the process
container for the locks.

2.3.2.2 F_RGETLK

This is a straightforward request.  The request is not a blocking
request, so it is made, and the result is returned.  The rpc.lockd
fills out the l_rpid and l_rsys fields as necessary to make the
request.

2.3.2.3 F_RSETLK

This is likewise non-blocking, and therefore likewise relatively
trivial.

2.3.2.4 F_RSETLKW

This operation is the tough one.  Because the operation would block, we
have an implementation decision to make.  To reduce overhead, we first
try F_RSETLK; if it succeeds, we return success.  This is by far the
most common outcome, given most lock contention mechanisms in most well
written FS client software (note: FS, not NFS: programs are clients of
FS services, even for local FS's).

If this returns EAGAIN, then we must decide how to perform the
operation.  We can either fork, and have the forked process close all
its copies of the descriptors, except the one of interest, and then
implement F_RSETLKW as a blocking operation, or we can implement
F_RSETLKW as a queued operation.  Finally, we could set up a timer, and
use F_RSETLK exclusively, until it succeeds.  This last is
unacceptable, since it does not guarantee that order of grant equals
order of enqueueing, and thus may break program expectations on
semantics, resulting in deadly embrace deadlocks between processes.

Given that FreeBSD supports the concept of sharing a descriptor table
between processes (via vfork(2)), the fork option is by far the most
attractive, with the caveat that we use the vfork to get the descriptor
table shared, so as to not double the fd footprint, even for a short
period of time.  We can likewise enqueue state, and process SIGCHLD to
ensure that the parent rpc.lockd knows about all pending and successful
requests (necessary for proper operation of the rpc.statd daemon).
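Tying the per-request steps in 2.3.2.1 together, here is a user-space
sketch of the two hash tables and the reference counting they imply.
Every structure and function name is invented for illustration, and
plain linked lists stand in for real hash tables; lockd_handle_to_fd()
is the earlier F_CNVT/F_NONPOSIX sketch, and the fixed 64-byte handle
size is an assumption.

#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* From the earlier sketch: proposed F_CNVT + F_NONPOSIX conversion. */
int lockd_handle_to_fd(const void *fh);

struct file_rec {		/* one open descriptor per dev/ino pair */
	dev_t		 dev;
	ino_t		 ino;
	int		 fd;
	int		 refs;
	struct file_rec	*next;
};

struct handle_rec {		/* one entry per distinct client handle */
	char		  fh[64];	/* opaque NFS handle bytes */
	struct file_rec	 *file;
	struct handle_rec *next;
};

static struct file_rec	 *files;	/* the "open file hash" */
static struct handle_rec *handles;	/* the "client handle hash" */

/* Find (or create) the descriptor record backing a client file handle. */
static struct file_rec *
lockd_lookup(const char *fh)
{
	struct handle_rec *h;
	struct file_rec *f;
	struct stat sb;
	int fd;

	for (h = handles; h != NULL; h = h->next)
		if (memcmp(h->fh, fh, sizeof(h->fh)) == 0)
			return (h->file);	/* handle already known */

	if ((fd = lockd_handle_to_fd(fh)) < 0)
		return (NULL);
	if (fstat(fd, &sb) < 0) {		/* dev/ino uniquifier */
		close(fd);
		return (NULL);
	}

	for (f = files; f != NULL; f = f->next)
		if (f->dev == sb.st_dev && f->ino == sb.st_ino)
			break;
	if (f != NULL) {
		close(fd);		/* same file already open: share it */
	} else {
		if ((f = calloc(1, sizeof(*f))) == NULL) {
			close(fd);
			return (NULL);
		}
		f->dev = sb.st_dev;
		f->ino = sb.st_ino;
		f->fd = fd;
		f->next = files;
		files = f;
	}
	f->refs++;			/* one more handle referencing it */

	if ((h = calloc(1, sizeof(*h))) == NULL)
		return (f);		/* degraded: no handle caching */
	memcpy(h->fh, fh, sizeof(h->fh));
	h->file = f;
	h->next = handles;
	handles = h;
	return (f);
}

Unlock processing would decrement refs and, per the clock-list
discussion below, defer the actual close.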
2.3.2.5 Back to the general case

Now we can go back to discussing the general implementation.

The rpc.lockd must decrement the reference count when the locks held by
a given process are removed.  It can either do this by maintaining a
shadow copy of the lock state or, preferably, by performing an F_RGETLK
after a lock is released.  This is part of the resource tracking for
open descriptors in the rpc.lockd.  If the request indicates that there
are no more locks held by that l_rsys/l_rpid pair, then the fd
reference count is decremented, and the per handle hash entry is
removed from the list.  If the reference count goes to zero, then the
descriptor is closed.

DISCUSSION

It is useful to implement late-binding closes.  Specifically, it is
useful to not actually delete the reference immediately.

SOLUTION

The handle references, instead of being deleted, are thrown onto a
clock list.  If the handles are rereferenced within a tunable time
frame, then they are removed from the list and placed back into use;
otherwise, after sufficient time has elapsed, they are inactivated as
above.  This resolves the case of a single client generating a lot of
unnecessary rpc.lockd activity by issuing lock-unlock pairs that would
cause the references to bounce up and down, requiring a lot of system
calls.  It preserves the NFS handle hash for a time after the operation
nominally completes, in the expectation of future operations by that
client.

3.0.0.0 Client side locking

Client side locking is much harder than server side locking.  Client
side locking allows clients to request locks from remote NFS servers on
behalf of local processes running on the client machine.

3.1.0.0 Theory of operation

Client side locking is implemented by the client NFS code in the kernel
making RPC requests against the server, much in the same way that NFS
clients operate when making FS operation requests against NFS servers.
It is simultaneously more difficult, because the code is located in the
kernel, and less difficult, since there is a process context (the
requesting process) to act as a container for the operation until it is
completed by the server.

Recall that server side locking is implemented by allowing the client
to make RPC requests which are proxied to the server file space via one
or more processes (generally, two: rpc.lockd and rpc.statd), and that
operations are proxied into the local collision domain, and enforced
both against and by local locks, depending on order of operation.

3.1.1.0 Interface problems in FreeBSD

FreeBSD has a number of interface problems that prevent implementation
of functional NFS client locking.

3.1.1.1 FreeBSD problem #1: VFS stacking and coalescence

Locks, when asserted, are coalesced by l_pid.  If they are asserted by
a hosted OS API, e.g., an NFS, AppleTalk, or SAMBA server, they are
coalesced by l_rsys and l_rpid as well; we can ignore all but l_pid in
the general case, since exporting an exported FS is foolish and
dangerous.

When locks are asserted, then, the locks are coalesced if the lock is
successful.  Thus, if a process had a file

	[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]

protected by the locks

	[111111111]          [2222222222]
	[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]

and asserted a third lock

	       [333333333333333333]
	[111111111]          [2222222222]
	[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]

that lock would be coalesced:

	[111111111111111111111111111111111]
	[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]

For a local media FS, this is not a problem, since the operation occurs
locally, and is serialized by virtue of that fact.  But for an NFS
client, the lock behaviour is less serialized.
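For concreteness, the coalesce step pictured above is just per-owner
interval merging.  A user-space sketch with invented names (the real
kernel code is the lf_* family in kern/kern_lockf.c, which also handles
lock types and blocked-lock wakeups that are omitted here):

#include <stdlib.h>

struct range {			/* one granted lock for a single owner */
	long		 start;
	long		 end;	/* inclusive */
	struct range	*next;	/* list kept sorted by start */
};

/*
 * Insert [start, end] for one owner and merge it with every range it
 * overlaps or abuts -- the "third lock swallows locks 1 and 2" picture
 * above.  Allocation failure handling is elided.
 */
static struct range *
coalesce(struct range *head, long start, long end)
{
	struct range *r, **prev = &head;

	/* Skip ranges that end before the new range and do not abut it. */
	while ((r = *prev) != NULL && r->end + 1 < start)
		prev = &r->next;

	/* Absorb every range that overlaps or abuts the new one. */
	while ((r = *prev) != NULL && r->start <= end + 1) {
		if (r->start < start)
			start = r->start;	/* grow left */
		if (r->end > end)
			end = r->end;		/* grow right */
		*prev = r->next;
		free(r);
	}

	/* Insert the single merged range in place of what was absorbed. */
	r = malloc(sizeof(*r));
	r->start = start;
	r->end = end;
	r->next = *prev;
	*prev = r;
	return (head);
}

Once this merge has happened, the boundaries of the original locks 1,
2, and 3 are gone; nothing short of a separately kept, uncoalesced
record can reconstruct them, which is the crux of what follows.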
Consider the case of a VFS stacking layer that stacks two filesystems,
and makes the files within them appear to be two extents of a single
file.  We can imagine that this would be useful for combined log files
for a cluster of machines, and for other reasons (there are many other
examples; this is merely the simplest).  So we have:

	[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]

Let's perform the same locks:

	[111111111]          [2222222222]
	[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]

So far, so good.  Now the third lock:

	       [333333333333333333]
	[111111111]          [2222222222]
	[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]

Coalesce, phase one:

	                  [33333333]
	[1111111111111111]   [2222222222]
	[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]

Oops!  The second phase fails, because some other client has the lock:

	                      [XX]

Now we need to back out the operation on the first FS:

	       [33333333]
	[111111111]
	[ffffffffffffffff]

Leaving:

	[1111111]
	[ffffffffffffffff]

Uh-oh: looks like we're screwed.

SOLUTION

Delayed coalescing.  The locks are asserted, but they are not committed
(coalesced) until all the operations have been deemed successful.  By
dividing the assertion phase from the commit phase, we can delay the
coalescing until we know that all locks have been successfully
asserted.

How do we do this?  Very simply, we convert VOP_ADVLOCK into a veto
mechanism, instead of the mechanism by which the lock code is actually
called, and we move the locking operations to upper level (common)
code.  At the same time, we make the OS more robust, since there is
only one place, instead of many, where the code is called.

For stacking layers that stack on more than one VFS, and for proxy
layers, such as NFS, SMB, or AppleTalk client layers, the operation is
a veto, where the operation is proxied, and if the proxy fails, then
the operation is vetoed.  So in general, VOP_ADVLOCK becomes a
"return(1);" for most of the VFS layers, with specific exceptions for
particular layer types, which *may* veto the operation requested by the
upper level code.  If the operation is not vetoed, then the upper level
code commits the operation, and the lock ranges are coalesced.

3.1.1.2 FreeBSD problem #2: What if the NFS layer is first?

If the NFS layer is first, and the operation is subsequently vetoed,
how is the NFS coalesce backed out?

SOLUTION

The shadow graph.  The NFS client, for each given vnode (nfsnode), must
separately maintain the locks against the node on a per process basis.
What this means is that when a process asserts a lock on an NFS
accessed file, the NFS client locking code must maintain an uncoalesced
lock graph.  This is because the lock graph *will* be coalesced on the
server.  In order to back out the operation

	       [33333333]
	[111111111]
	[ffffffffffffffff]
	         |
	         v
	[1111111111111111]
	[ffffffffffffffff]

the client must keep knowledge of the fact that these locks are
separate.  This implies that locks that result in type demotions are
not type demoted to the server (i.e., locks against the server are only
asserted in promote-only mode, so that if they are backed out, there
will not have been a demotion, for example, from write to read, on the
server).  There is currently code in SAMBA which models this, since
SAMBA's consumption of the host FS is similar to an NFS client's
consumption of an NFS server's FS.
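A sketch of the two-phase (assert, then commit) shape this gives the
upper level code for a layer stacked over several inferior vnodes.
Everything here is illustrative: the "return 1 means no veto"
convention follows the document, splitting the requested range per
extent is elided, and lf_commit()/lf_backout() are hypothetical names;
for a proxy layer such as the NFS client, the back-out is only safe
because of the uncoalesced shadow graph just described.

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/fcntl.h>
#include <sys/vnode.h>

/* Hypothetical operations used only by this sketch. */
int lf_commit(struct vnode *vp, caddr_t id, struct flock *fl);
int lf_backout(struct vnode *vp, caddr_t id, struct flock *fl);

int
stackfs_advlock(struct vnode **lower, int nlower, caddr_t id, int op,
    struct flock *fl, int flags)
{
	int i;

	/*
	 * Phase one: offer the lock to every inferior VFS.  Local media
	 * FS's never veto (they just return 1); proxy layers (the NFS
	 * client, SMBFS, AppleTalk) forward the request and may veto it.
	 */
	for (i = 0; i < nlower; i++) {
		if (VOP_ADVLOCK(lower[i], id, op, fl, flags) != 1) {
			/*
			 * Vetoed.  Back out the assertions already made on
			 * earlier extents; nothing has been committed
			 * (coalesced) yet, so local state is undamaged, and
			 * a proxy layer's shadow graph lets its remote
			 * assertion be undone without touching pre-existing
			 * locks.
			 */
			while (--i >= 0)
				(void)lf_backout(lower[i], id, fl);
			return (EAGAIN);
		}
	}

	/*
	 * Phase two: no layer objected.  Only now does the upper level
	 * code merge the new range into the coalesced lock list kept for
	 * each vnode.
	 */
	for (i = 0; i < nlower; i++)
		(void)lf_commit(lower[i], id, fl);
	return (0);
}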
3.2.0.0 The client NFS VFS layer's RPC calls

So far, no one has implemented this.  In general, it is more important
to be a server than it is to be a client, at this time.

The amount of effort to implement this, if one has the ISO documents,
or, more obliquely and therefore with more difficulty, the rpc.lockd
code in the FreeBSD source tree, is pretty small.  This would make a
good one quarter project for a Bachelor of Science in Computer Science
independent study credit.

3.3.0.0 Discussion

In general, all of the issues for an NFS client in FreeBSD apply
equally to the idea of an AppleTalk or SMB client in FreeBSD.  It is
likely that FreeBSD will want to support the ability to operate as a
desktop (and therefore client) OS, even if this is not the primary
niche into which it is currently being driven by the developers.

4.0.0.0 End Of Document

==========================================================================

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message