Date:      Fri, 19 Apr 1996 18:17:35 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        current@freebsd.org
Cc:        terry@phaeton.artisoft.com (Terry Lambert)
Subject:   KERNEL SUPPORT FOR NFS LOCKING (server side rpc.lockd support)
Message-ID:  <199604200117.SAA11465@phaeton.artisoft.com>

Jordan, et al.:


Here are the patches for the kernel to provide the support code for
Andrew's rpc.lockd in the -current source tree.

This is kernel support for server-side locking.  You will be able
to NFS lock files on a FreeBSD NFS server, but a FreeBSD NFS client
will not be able to lock files on a remote NFS server.


With the exception of vfs_syscalls.c, which is a context diff
because my FS changes have not been integrated and CVS mirroring
is inadequate for people without commit privileges, these are a
set of CVS diffs to the files:

kern/kern_descrip.c
kern/kern_lockf.c
sys/fcntl.h
sys/lockf.h

(If you don't like where I put my parentheses, learn to run "indent").


These changes maintain binary backward compatibility because
of the way local flock calls copy in and out only the local (old)
flock data.  The large reserve area is intentional, and is there
so that future extensions can be integrated without requiring a
system call number change (the same reason SunOS provides a reserve
area).  Expected usage is for, among other applications, mandatory
and non-coalescing record/range locking.
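
As a rough user-space illustration of the compatibility argument
(not kernel code): the old structure is a prefix of the new one, so
the local path can copy in only the old-sized prefix and then fill
in the proxy fields itself.  The demo_* names are hypothetical and
memcpy() stands in for copyin(); the layouts mirror the sys/fcntl.h
diff below.

```c
#include <string.h>
#include <sys/types.h>

/* Illustrative copy of the old (pre-patch) lock structure. */
struct demo_oflock {
	off_t	l_start;	/* starting offset */
	off_t	l_len;		/* len = 0 means until end of file */
	pid_t	l_pid;		/* lock owner */
	short	l_type;		/* lock type: read/write, etc. */
	short	l_whence;	/* type of l_start */
};

/* Illustrative copy of the extended structure: old fields first. */
struct demo_flock {
	off_t	l_start;
	off_t	l_len;
	pid_t	l_pid;
	short	l_type;
	short	l_whence;
	long	l_rsys;		/* remote system id */
	pid_t	l_rpid;		/* pid from remote system */
	long	l_reserved[4];	/* future use */
};

#define	DEMO_LOCAL_LOCK	0	/* stands in for FLOCK_LOCAL_LOCK */

/*
 * Mimic the local-lock path: copy only the old-sized prefix from the
 * caller's buffer (memcpy here, copyin in the kernel), then mark the
 * lock as local.  An old binary's smaller buffer is never over-read.
 */
static void
demo_copyin_local(struct demo_flock *dst, const void *user_buf)
{
	memcpy(dst, user_buf, sizeof(struct demo_oflock));
	dst->l_rsys = DEMO_LOCAL_LOCK;
	dst->l_rpid = DEMO_LOCAL_LOCK;
}
```

An old binary that passes a struct demo_oflock sized buffer is
handled without the copy touching memory past the end of its buffer.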

The call interface is identical to the one used by SunOS to support
NFS server side locking.  This is intentional because NFS is not
the only program that consumes the proxy locking interface.

---
::NOTES::


HOW IT WORKS

This is a proxy locking facility.  It assumes a pid_t sized (or smaller)
remote process id, and a long (or smaller) remote system id.  The system
id is a signed long, per SunOS (programs like the NetWare for UNIX server
depend on the interface not changing).
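
A minimal sketch of what "proxy" ownership means here, assuming the
same three-way comparison as the LOCK_COMPARE() macro in the
sys/lockf.h diff below.  The demo_* names are hypothetical, and this
is user-space illustration rather than the kernel code itself.

```c
#include <stdbool.h>
#include <sys/types.h>

#define	DEMO_LOCAL_LOCK	0	/* stands in for FLOCK_LOCAL_LOCK */

/* Identity of a lock holder: local owner plus remote system/pid. */
struct demo_owner {
	const void *id;		/* local owner (process or file) */
	long	rsys;		/* remote system id, 0 if local */
	pid_t	rpid;		/* pid on the remote system, 0 if local */
};

/*
 * Two locks belong to the same holder only if the local owner AND
 * the remote system id AND the remote pid all match.  Every proxy
 * lock shares one local owner (the lock daemon), so the remote
 * fields are what tell the real holders apart.
 */
static bool
demo_same_owner(const struct demo_owner *a, const struct demo_owner *b)
{
	return (a->id == b->id && a->rsys == b->rsys && a->rpid == b->rpid);
}
```

Two requests proxied by the same daemon on behalf of different
remote hosts therefore compare as different owners and can conflict
with each other.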


CAVEAT: l_rsys type

This probably wants to be a defined sysid_t in the types file, but
I didn't want to screw with sys/types.h and the existing rpc.lockd
code to make it work.


CAVEAT: GRACELESS SHUTDOWN

Resource tracking cleanup (ie: on graceless exit of the rpc.lockd) is
not supported.  As noted in the code, the advisory locking facility
that resulted from the 4.4BSD integration of the Heidemann code has
broken advisory lock layering.  This is also the reason that NFS
client locking code was not provided at this time.

When the rpc.lockd goes down, only locks held locally by rpc.lockd
itself (ie: there should be none anyway) would be cleaned up.

To support this correctly, it would be necessary to add a "local vs.
remote" flag to the VOP_ADVLOCK() call, which would require modification
of every existing file system implementation.

To support NFS client locking with overlay, loopback, and other FS
stacking constructs in the Heidemann framework, it would be necessary
to convert the VOP_ADVLOCK() architecture from "call-through" to "veto".
This would further necessitate changes to loopback and union FS types
to implement call-through from superior to inferior stack veto, only
in those cases (the corrected top level code can only operate on
top level FS's exposed at the VFS layer).  This has been discussed in
detail on various lists.  It is mentioned here because it bears on
allowable user space (rpc.lockd, NetWare daemon file locking)
assumptions; a crashed rpc.lockd *will* leave proxy state hanging
from the node.

The conversion process is described in greater detail in the WARNING
comment in the kern/kern_lockf.c file in the F_UNLKSYS case.
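
For illustration only, here is a user-space sketch of the kind of
sweep that "remove all locks by system" (the F_UNLKSYS comment in
sys/fcntl.h) implies: drop every lock whose remote system id
matches, regardless of remote pid.  The list layout and demo_* names
are hypothetical; this is not how kern_lockf.c stores or walks its
locks.

```c
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical per-file lock list entry. */
struct demo_lock {
	struct demo_lock *next;
	long	rsys;		/* remote system id */
	pid_t	rpid;		/* pid from remote system */
};

/* Push a new lock on the list; returns NULL if allocation fails. */
static struct demo_lock *
demo_add_lock(struct demo_lock **head, long rsys, pid_t rpid)
{
	struct demo_lock *l = malloc(sizeof(*l));

	if (l == NULL)
		return (NULL);
	l->rsys = rsys;
	l->rpid = rpid;
	l->next = *head;
	*head = l;
	return (l);
}

/*
 * Unlink and free every lock held on behalf of remote system rsys.
 * Returns the number of locks removed.
 */
static int
demo_unlock_sys(struct demo_lock **head, long rsys)
{
	struct demo_lock **pp = head;
	int removed = 0;

	while (*pp != NULL) {
		struct demo_lock *cur = *pp;

		if (cur->rsys == rsys) {
			*pp = cur->next;	/* unlink, stay on same slot */
			free(cur);
			removed++;
		} else {
			pp = &cur->next;
		}
	}
	return (removed);
}
```

The point of the sketch is that the sweep keys on the remote system
id alone, which is exactly the comparison a plain close() does not
perform.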


CAVEAT: INTERFACE AMBIGUITY

Because it was unclear from the rpc.lockd code itself (which is not
complete in this regard), the F_CNVT fcntl() call is not fully
implemented; instead it is stubbed in kern/kern_descrip.c.

This fcntl() call takes as arguments *any* file fd to get it into
the VFS (this can be *any* fd if the "struct fileops" garbage
used by the pipe and socket code is reworked, like it should
be), the command "F_CNVT", and a pointer to an NFS handle.

The fcntl() is to return an error, or to return an fd, allocated in
the open file table of the calling process.


It is expected that the rpc.lockd will maintain a reference count and
a hash, so that if the same file is locked by multiple servers,
only one rpc.lockd fd (and therefore one system open file table entry)
is consumed.

The suggested rpc.lockd implementation is to stat the resulting fd,
and hash based on the stat results to locate an existing fd for the
same file, if any.  If one exists, the hash reference structure
reference count is to be incremented, and the second (duplicate) fd
closed.  Because there is no kernel support for asynchronous calls
to the fcntl() system call, it is expected that F_RSETLKW will not
typically be used; instead, F_RSETLK will be used, and if EWOULDBLOCK
is returned, a retry is to be queued to a timer-kicked retry list in
the rpc.lockd.  It is expected that there will eventually be a thread
spawned for a blocking request to allow it to run to completion, OR
an async call trap gate will be defined so that the rpc.lockd will
not hang pending completion of blocked F_RSETLKW calls.  Unless
an async call or *kernel* threading mechanism is used to
implement the blocking lock requests, a race condition is introduced
by the queue service latency that can cause starvation deadlock on
NFS clients using the lock services.
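
The non-blocking-plus-retry scheme might be sketched as follows.
This is only an outline under stated assumptions: the demo_* names
are hypothetical, the lock attempt is abstracted behind a function
pointer (in the daemon it would wrap the F_RSETLK fcntl()), and
demo_countdown_try() is a stand-in that merely simulates contention.

```c
#include <errno.h>
#include <stddef.h>

#define	DEMO_MAX_PENDING	32	/* arbitrary queue size */

/*
 * One non-blocking lock attempt: returns 0 if granted, otherwise an
 * errno value (EWOULDBLOCK/EAGAIN when the lock is currently held).
 */
typedef int (*demo_try_fn)(void *request);

struct demo_pending {
	void	*request;
	int	active;
};

static struct demo_pending demo_queue[DEMO_MAX_PENDING];

/*
 * Try once; on contention park the request for the timer to retry.
 * Returns 0 if granted, 1 if queued, -1 on hard error or full queue.
 */
static int
demo_lock_or_queue(demo_try_fn try_lock, void *request)
{
	int err = try_lock(request);
	size_t i;

	if (err == 0)
		return (0);
	if (err != EWOULDBLOCK && err != EAGAIN)
		return (-1);
	for (i = 0; i < DEMO_MAX_PENDING; i++) {
		if (!demo_queue[i].active) {
			demo_queue[i].request = request;
			demo_queue[i].active = 1;
			return (1);
		}
	}
	return (-1);
}

/*
 * Timer tick: retry every parked request once.  Granted requests
 * leave the queue; requests failing with a hard error are dropped
 * (a real daemon would report the error to the client).  Returns
 * the number of requests granted on this pass.
 */
static int
demo_retry_tick(demo_try_fn try_lock)
{
	int granted = 0;
	size_t i;

	for (i = 0; i < DEMO_MAX_PENDING; i++) {
		int err;

		if (!demo_queue[i].active)
			continue;
		err = try_lock(demo_queue[i].request);
		if (err == 0) {
			demo_queue[i].active = 0;
			granted++;
		} else if (err != EWOULDBLOCK && err != EAGAIN) {
			demo_queue[i].active = 0;
		}
	}
	return (granted);
}

/* Stand-in lock attempt: contended until *request counts down to 0. */
static int
demo_countdown_try(void *request)
{
	int *remaining = request;

	if (*remaining > 0) {
		(*remaining)--;
		return (EWOULDBLOCK);
	}
	return (0);
}
```

Because a parked request only gets another chance on the next tick,
a competing locker can repeatedly slip in between ticks; that window
is the queue service latency problem described above.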


The implementation was left as incomplete because:

1)	It requires knowledge of the handle format used by the
	rpc.lockd (which is hopefully the same as that used by
	NFS).
2)	It requires registration of the handle conversion lookup
	function from the NFS code (nfsrv_fhtovp() in nfs/nfs_subs.c)
	in the same way that the LEASE functions are currently
	registered (or NFS will have to be statically linked at
	all times).
3)	The LEASE function registration code is bogus; like the
	search continuation ("cookie") code, it is fundamentally
	broken, and should be moved to a "server as file system
	consumer presentation layer" to get rid of the common
	overhead case.

Given #1 & #2, it is possible to kludge a workable solution.

#3 would require integration of previously submitted FS layering
patches (specifically, the namei/nameifree and the "EXCLUDE" flag
changes to kern/vfs_lookup.c).
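
The registration in #2 amounts to a function pointer hook that the
NFS code fills in when it loads.  A user-space sketch of the idea
follows; all demo_* names are hypothetical, the hook signature is
invented for illustration and is not that of nfsrv_fhtovp(), and
returning EOPNOTSUPP when nothing is registered is this sketch's
choice (the stub in the patch simply returns EBADF).

```c
#include <errno.h>
#include <stddef.h>

/* Placeholder for the kernel's vnode; contents do not matter here. */
struct demo_vnode {
	int	unused;
};

/* Hypothetical signature for a "file handle to vnode" converter. */
typedef int (*demo_fhtovp_fn)(const void *fh, struct demo_vnode **vpp);

/* NULL until the NFS code (possibly a loadable module) registers. */
static demo_fhtovp_fn demo_fhtovp_hook = NULL;

static void
demo_register_fhtovp(demo_fhtovp_fn fn)
{
	demo_fhtovp_hook = fn;
}

static void
demo_unregister_fhtovp(void)
{
	demo_fhtovp_hook = NULL;
}

/*
 * What the F_CNVT path might do: fail cleanly when NFS is not
 * loaded, otherwise call through the registered converter.
 */
static int
demo_cnvt(const void *fh, struct demo_vnode **vpp)
{
	if (demo_fhtovp_hook == NULL)
		return (EOPNOTSUPP);
	return (demo_fhtovp_hook(fh, vpp));
}

/* Stand-in converter that always "finds" the same vnode. */
static int
demo_fake_fhtovp(const void *fh, struct demo_vnode **vpp)
{
	static struct demo_vnode the_vnode;

	(void)fh;
	*vpp = &the_vnode;
	return (0);
}
```

With a hook like this the generic descriptor code never references
an NFS symbol directly, which is what would let NFS remain loadable
rather than statically linked.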


CAVEAT: DOCUMENTATION

I have not modified the fcntl() man page to include the proxy
extensions to the fcntl() locking interface.  In addition, the
behaviour of the interface is not well documented, except by
the code, since it is derived from user-space experimentation
on binary systems (lack of this kind of documentation is why there
has not been a public rpc.lockd implementation prior to this time).


Patches follow my signature.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
== 8< cut here 8< ==========================================================
*** /sys/kern/vfs_syscalls.c.nonfs	Fri Apr 19 14:38:32 1996
--- /sys/kern/vfs_syscalls.c	Fri Apr 19 15:34:18 1996
***************
*** 6,11 ****
--- 6,12 ----
   * to the University of California by American Telephone and Telegraph
   * Co. or Unix System Laboratories, Inc. and are reproduced herein with
   * the permission of UNIX System Laboratories, Inc.
+  * Copyright (c) 1996 Terrence R. Lambert.  All rights reserved.
   *
   * Redistribution and use in source and binary forms, with or without
   * modification, are permitted provided that the following conditions
***************
*** 19,24 ****
--- 20,26 ----
   *    must display the following acknowledgement:
   *	This product includes software developed by the University of
   *	California, Berkeley and its contributors.
+  *	This product includes software developed by Terrence R. Lambert.
   * 4. Neither the name of the University nor the names of its contributors
   *    may be used to endorse or promote products derived from this software
   *    without specific prior written permission.
***************
*** 741,746 ****
--- 743,750 ----
  				lf.l_type = F_WRLCK;
  			else
  				lf.l_type = F_RDLCK;
+ 			lf.l_rsys = FLOCK_LOCAL_LOCK;
+ 			lf.l_rpid = FLOCK_LOCAL_LOCK;
  			type = F_FLOCK;
  			if ((flags & FNONBLOCK) == 0)
  				type |= F_WAIT;
== 8< cut here 8< ==========================================================
Index: kern_descrip.c
===================================================================
RCS file: /b/cvstree/ncvs/src/sys/kern/kern_descrip.c,v
retrieving revision 1.28
diff -r1.28 kern_descrip.c
8a9
>  * Copyright (c) 1996 Terrence R. Lambert.  All rights reserved.
21a23
>  *	This product includes software developed by Terrence R. Lambert.
210c212
< 	struct flock fl;
---
> 	struct flock flk;
282a285
> 	case F_RSETLKW:
286a290
> 	case F_RSETLK:
291c295,318
< 		error = copyin((caddr_t)uap->arg, (caddr_t)&fl, sizeof (fl));
---
> 		if( ( uap->cmd == F_SETLKW) || ( uap->cmd == F_SETLK)) {
> 			/*
> 			 * Local lock.  Only copy in old lock structure
> 			 * to ensure backward binary compatibility (the
> 			 * lock structure may butt up against memory
> 			 * for which a copyin would be invalid).
> 			 */
> 			error = copyin( (caddr_t)uap->arg,
> 					(caddr_t)&flk,
> 					sizeof(struct oflock));
> 			flk.l_rsys	= FLOCK_LOCAL_LOCK;
> 			flk.l_rpid	= FLOCK_LOCAL_LOCK;
> 		} else {
> 			/*
> 			 * Remote lock proxy request.  Only root is
> 			 * allowed to make proxy requests.  Copy
> 			 * in full flock structure.
> 			 */
> 			if( suser(p->p_ucred, &p->p_acflag))
> 				return( EPERM);
> 			error = copyin( (caddr_t)uap->arg,
> 					(caddr_t)&flk,
> 					sizeof(struct flock));
> 		}
294,296c321,323
< 		if (fl.l_whence == SEEK_CUR)
< 			fl.l_start += fp->f_offset;
< 		switch (fl.l_type) {
---
> 		if (flk.l_whence == SEEK_CUR)
> 			flk.l_start += fp->f_offset;
> 		switch (flk.l_type) {
302c329
< 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_SETLK, &fl, flg));
---
> 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_SETLK, &flk, flg));
308c335
< 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_SETLK, &fl, flg));
---
> 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_SETLK, &flk, flg));
311c338,342
< 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_UNLCK, &fl,
---
> 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_UNLCK, &flk,
> 				F_POSIX));
> 
> 		case F_UNLKSYS:
> 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_UNLKSYS, &flk,
318a350
> 	case F_RGETLK:
323c355,378
< 		error = copyin((caddr_t)uap->arg, (caddr_t)&fl, sizeof (fl));
---
> 		if(  uap->cmd == F_GETLK) {
> 			/*
> 			 * Local lock.  Only copy in old lock structure
> 			 * to ensure backward binary compatibility (the
> 			 * lock structure may butt up against memory
> 			 * for which a copyin would be invalid).
> 			 */
> 			error = copyin( (caddr_t)uap->arg,
> 					(caddr_t)&flk,
> 					sizeof(struct oflock));
> 			flk.l_rsys	= FLOCK_LOCAL_LOCK;
> 			flk.l_rpid	= FLOCK_LOCAL_LOCK;
> 		} else {
> 			/*
> 			 * Remote lock proxy request.  Only root is
> 			 * allowed to make proxy requests.  Copy
> 			 * in full flock structure.
> 			 */
> 			if( suser(p->p_ucred, &p->p_acflag))
> 				return( EPERM);
> 			error = copyin( (caddr_t)uap->arg,
> 					(caddr_t)&flk,
> 					sizeof(struct flock));
> 		}
326,328c381,383
< 		if (fl.l_whence == SEEK_CUR)
< 			fl.l_start += fp->f_offset;
< 		if ((error = VOP_ADVLOCK(vp,(caddr_t)p,F_GETLK,&fl,F_POSIX)))
---
> 		if (flk.l_whence == SEEK_CUR)
> 			flk.l_start += fp->f_offset;
> 		if ((error = VOP_ADVLOCK(vp,(caddr_t)p,F_GETLK,&flk,F_POSIX)))
330c385,421
< 		return (copyout((caddr_t)&fl, (caddr_t)uap->arg, sizeof (fl)));
---
> 		/*
> 		 * Only copy out full structure if not F_GETLK to ensure
> 		 * we do not (incorrectly) fault the caller.
> 		 */
> 		if(  uap->cmd == F_GETLK) {
> 			/* old lock structure is enough...*/
> 			return( copyout((caddr_t)&flk,
> 					(caddr_t)uap->arg,
> 					sizeof(struct oflock)));
> 		} else {	/* F_RGETLK*/
> 			/* new lock structure is required...*/
> 			return( copyout((caddr_t)&flk,
> 					(caddr_t)uap->arg,
> 					sizeof(struct flock)));
> 		}
> 		/* NOTREACHED*/
> 
> 	case F_CNVT:
> 		/*
> 		 * Convert an NFS file handle into a descriptor open
> 		 * in the calling process context (used by the lockd
> 		 * to establish open file state as a holder for a
> 		 * lock).
> 		 */
> 		if( suser(p->p_ucred, &p->p_acflag))
> 			return( EPERM);
> 
> 		/*
> 		 * XXX requires knowledge of the handle format to
> 		 *     be presented via this interface.
> 		 * XXX requires registration of the conversion
> 		 *     function (which already exists in the
> 		 *     NFS code!) via a mechanism similar to
> 		 *     that used by the existing LEASES code
> 		 *     to allow continued use of NFS as an LKM.
> 		 */
> 		return( EBADF);
857c948
< 	struct flock lf;
---
> 	struct flock flk;
871,874c962,967
< 		lf.l_whence = SEEK_SET;
< 		lf.l_start = 0;
< 		lf.l_len = 0;
< 		lf.l_type = F_UNLCK;
---
> 		flk.l_whence = SEEK_SET;
> 		flk.l_start = 0;
> 		flk.l_len = 0;
> 		flk.l_type = F_UNLCK;
> 		flk.l_rsys = FLOCK_LOCAL_LOCK;
> 		flk.l_rpid = FLOCK_LOCAL_LOCK;
876c969
< 		(void) VOP_ADVLOCK(vp, (caddr_t)p, F_UNLCK, &lf, F_POSIX);
---
> 		(void) VOP_ADVLOCK(vp, (caddr_t)p, F_UNLCK, &flk, F_POSIX);
883,886c976,981
< 		lf.l_whence = SEEK_SET;
< 		lf.l_start = 0;
< 		lf.l_len = 0;
< 		lf.l_type = F_UNLCK;
---
> 		flk.l_whence = SEEK_SET;
> 		flk.l_start = 0;
> 		flk.l_len = 0;
> 		flk.l_type = F_UNLCK;
> 		flk.l_rsys = FLOCK_LOCAL_LOCK;
> 		flk.l_rpid = FLOCK_LOCAL_LOCK;
888c983
< 		(void) VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &lf, F_FLOCK);
---
> 		(void) VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &flk, F_FLOCK);
920c1015
< 	struct flock lf;
---
> 	struct flock flk;
928,930c1023,1027
< 	lf.l_whence = SEEK_SET;
< 	lf.l_start = 0;
< 	lf.l_len = 0;
---
> 	flk.l_whence = SEEK_SET;
> 	flk.l_start = 0;
> 	flk.l_len = 0;
> 	flk.l_rsys = FLOCK_LOCAL_LOCK;
> 	flk.l_rpid = FLOCK_LOCAL_LOCK;
932c1029
< 		lf.l_type = F_UNLCK;
---
> 		flk.l_type = F_UNLCK;
934c1031
< 		return (VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &lf, F_FLOCK));
---
> 		return (VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &flk, F_FLOCK));
937c1034
< 		lf.l_type = F_WRLCK;
---
> 		flk.l_type = F_WRLCK;
939c1036
< 		lf.l_type = F_RDLCK;
---
> 		flk.l_type = F_RDLCK;
944,945c1041,1042
< 		return (VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &lf, F_FLOCK));
< 	return (VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &lf, F_FLOCK|F_WAIT));
---
> 		return (VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &flk, F_FLOCK));
> 	return (VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &flk, F_FLOCK|F_WAIT));
Index: kern_lockf.c
===================================================================
RCS file: /b/cvstree/ncvs/src/sys/kern/kern_lockf.c,v
retrieving revision 1.5
diff -r1.5 kern_lockf.c
3a4
>  * Copyright (c) 1996 Terrence R. Lambert.  All rights reserved.
19a21
>  *	This product includes software developed by Terrence R. Lambert.
142a145,146
> 	lock->lf_rsys = fl->l_rsys;
> 	lock->lf_rpid = fl->l_rpid;
149a154,178
> 	case F_UNLKSYS:
> 		/*
> 		 * WARNING: lf_overlap() uses LOCK_COMPARE, which will
> 		 * not match true a remote system id/pid given a lock
> 		 * removal on close.  If the rpc.lockd implementation
> 		 * crashes (close on resource cleanup), locks held
> 		 * by a remote system will *NOT* be cleaned!  To fix
> 		 * this, we would need a "proxy" flag to this routine
> 		 * which is propagated down via lf_clearlock() to
> 		 * lf_findoverlap() and flags LOCK_COMPARE() to
> 		 * ignore remote system id/pid in comparing.  Since
> 		 * we believe that VOP_ADVLOCK() is improperly
> 		 * implemented, rather than changing every file
> 		 * system now, we will wait until we do a global
> 		 * clean up of the advisory locking code.  Note
> 		 * that this means the rpc.lockd can not "optimize"
> 		 * by closing the fd to free locks for the final
> 		 * reference instance for the file;  it *must*
> 		 * call with F_UNLKSYS (see closef() in kern_descrip.c).
> 		 * Resource track cleanup on crash will only free locks
> 		 * held by rpc.lockd itself (ie: none), not those held
> 		 * by proxy.
> 		 *
> 		 * -- Terry Lambert
> 		 */
504a534,535
> 		fl->l_rsys = block->lf_rsys;
> 		fl->l_rpid = block->lf_rpid;
565,566c596,597
< 		if (((type & SELF) && lf->lf_id != lock->lf_id) ||
< 		    ((type & OTHERS) && lf->lf_id == lock->lf_id)) {
---
> 		if (((type & SELF) && !LOCK_COMPARE(lf, lock)) ||
> 		    ((type & OTHERS) && LOCK_COMPARE(lf, lock))) {
772a804
> 		lock->lf_type == F_UNLKSYS ? "unlock_sys" :
774c806
< 	if (lock->lf_block)
---
> 	if (lock->lf_block) {
776,777c808,812
< 	else
< 		printf("\n");
---
> 		if( lock->lf_rsys == FLOCK_LOCAL_LOCK)
> 			printf( " (LCL)\n");
> 		else	printf( " (NFS=0x%08x:0x%08x)\n",
> 					lock->lf_rsys, lock->lf_rpid);
> 	} else	printf("\n");
800a836
> 			lf->lf_type == F_UNLKSYS ? "unlock_sys" :
802c838
< 		if (lf->lf_block)
---
> 		if (lf->lf_block) {
804,805c840,844
< 		else
< 			printf("\n");
---
> 			if( lf->lf_rsys == FLOCK_LOCAL_LOCK)
> 				printf( " (LCL)\n");
> 			else	printf( " (NFS=0x%08x:0x%08x)\n",
> 						lf->lf_rsys, lf->lf_rpid);
> 		} else	printf("\n");
Index: fcntl.h
===================================================================
RCS file: /b/cvstree/ncvs/src/sys/sys/fcntl.h,v
retrieving revision 1.3
diff -r1.3 fcntl.h
8a9
>  * Copyright (c) 1996 Terrence R. Lambert.  All rights reserved.
21a23
>  *	This product includes software developed by Terrence R. Lambert.
141a144,149
> #ifndef _POSIX_SOURCE
> #define	F_RGETLK	10		/* remote F_GETLK*/
> #define	F_RSETLK	11		/* remote F_SETLK*/
> #define	F_CNVT		12		/* open a file by NFS file handle*/
> #define	F_RSETLKW	13		/* remote F_SETLKW*/
> #endif	/* _POSIX_SOURCE*/
149a158,160
> #ifndef _POSIX_SOURCE
> #define	F_UNLKSYS	4		/* remove all locks by system*/
> #endif	/* _POSIX_SOURCE*/
165a177,179
> 	long	l_rsys;		/* remote system id*/
> 	pid_t	l_rpid;		/* pid from remote system*/
> 	long	l_reserved[ 4];	/* future use -- avoid syscall id change*/
166a181,198
> 
> #ifdef KERNEL
> /*
>  * Backward compatibility advisory file segment locking data type; used
>  * only by kernel to ensure binary compatibility.  We only copy in this
>  * size of structure for F_GETLK, F_SETLK, and F_SETLKW.
>  */
> struct oflock {
> 	off_t	l_start;	/* starting offset */
> 	off_t	l_len;		/* len = 0 means until end of file */
> 	pid_t	l_pid;		/* lock owner */
> 	short	l_type;		/* lock type: read/write, etc. */
> 	short	l_whence;	/* type of l_start */
> };
> 
> /* token for l_rsys, l_rpid*/
> #define	FLOCK_LOCAL_LOCK	0	/* a non-NFS lock*/
> #endif	/* KERNEL*/
Index: lockf.h
===================================================================
RCS file: /b/cvstree/ncvs/src/sys/sys/lockf.h,v
retrieving revision 1.3
diff -r1.3 lockf.h
3a4
>  * Copyright (c) 1996 Terrence R. Lambert.  All rights reserved.
19a21
>  *	This product includes software developed by Terrence R. Lambert.
54a57,58
> 	long	lf_rsys;	 /* remote system id*/
> 	pid_t	lf_rpid;	 /* pid from remote system*/
58a63,71
> 
> /*
>  * Compare credentials on two locks.  Returns TRUE if credentials are
>  * equivalent, FALSE otherwise.
>  */
> #define	LOCK_COMPARE(l1, l2)	(((l1)->lf_id == (l2)->lf_id) &&	\
> 				 ((l1)->lf_rsys == (l2)->lf_rsys) &&	\
> 				 ((l1)->lf_rpid == (l2)->lf_rpid)	\
> 				)
== 8< cut here 8< ==========================================================


