From owner-freebsd-current Fri Apr 19 18:20:04 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3) id SAA15584 for current-outgoing; Fri, 19 Apr 1996 18:20:04 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id SAA15523 for ; Fri, 19 Apr 1996 18:19:57 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id SAA11465; Fri, 19 Apr 1996 18:17:35 -0700
From: Terry Lambert
Message-Id: <199604200117.SAA11465@phaeton.artisoft.com>
Subject: KERNEL SUPPORT FOR NFS LOCKING (server side rpc.lockd support)
To: current@freebsd.org
Date: Fri, 19 Apr 1996 18:17:35 -0700 (MST)
Cc: terry@phaeton.artisoft.com (Terry Lambert)
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Jordan, et al.:

Here are the patches for the kernel to provide the support code for Andrew's rpc.lockd in the -current source tree.

This is kernel support for server-side locking. You will be able to NFS lock files on a FreeBSD NFS server, but a FreeBSD NFS client will not be able to lock files on a remote NFS server.

With the exception of vfs_syscalls.c, which is a context diff because my FS changes have not been integrated and CVS mirroring is inadequate for people without commit privileges, these are a set of CVS diffs to the files:

	kern/kern_descrip.c
	kern/kern_lockf.c
	sys/fcntl.h
	sys/lockf.h

(If you don't like where I put my parentheses, learn to run "indent").

These changes maintain binary backward compatibility because of the way local flock calls copy in and out only the local (old) flock data.

The large reserve area is intentional, and is there so that future extensions can be integrated without requiring a system call number change (the same reason SunOS provides a reserve area).
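To make the compatibility argument concrete, here is a minimal user-space sketch of the prefix-copy idea. It is not the kernel code: the struct and function names ending in `_sketch` are hypothetical stand-ins, `memcpy()` stands in for `copyin()`, and the field layout simply mirrors the `struct oflock` and extended `struct flock` declarations in the fcntl.h diff.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for the old (pre-patch) lock structure. */
struct oflock_sketch {
	off_t	l_start;	/* starting offset */
	off_t	l_len;		/* len = 0 means until end of file */
	pid_t	l_pid;		/* lock owner */
	short	l_type;		/* lock type */
	short	l_whence;	/* type of l_start */
};

/* Hypothetical stand-in for the extended structure: same prefix, new tail. */
struct flock_sketch {
	off_t	l_start;
	off_t	l_len;
	pid_t	l_pid;
	short	l_type;
	short	l_whence;
	long	l_rsys;		/* remote system id */
	pid_t	l_rpid;		/* pid from remote system */
	long	l_reserved[4];	/* room for future extensions */
};

#define SKETCH_LOCAL_LOCK	0	/* token for "not a proxy lock" */

/*
 * Local-lock path: read only the old-sized prefix from the caller's
 * buffer, so an old binary that allocated just the old structure is
 * never read past its end, then stamp the new fields as local.
 */
void
copyin_local_sketch(struct flock_sketch *dst, const void *user_arg)
{
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, user_arg, sizeof(struct oflock_sketch));
	dst->l_rsys = SKETCH_LOCAL_LOCK;
	dst->l_rpid = SKETCH_LOCAL_LOCK;
}
```

An old caller passes a buffer of the old size and never sees the new fields; only the proxy commands touch the full structure.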
Expected usage is for, among other applications, mandatory and non-coalescing record/range locking.

The call interface is identical to the one used by SunOS to support NFS server side locking. This is intentional because NFS is not the only program that consumes the proxy locking interface.

---
::NOTES::

HOW IT WORKS

This is a proxy locking facility. It assumes a pid_t sized (or smaller) remote process id, and a long (or smaller) remote system id. The system id is a signed long, per SunOS (programs like the NetWare for UNIX server depend on the interface not changing).

CAVEAT: l_rsys type

This probably wants to be a defined sysid_t in the types file, but I didn't want to screw with sys/types.h and the existing rpc.lockd code to make it work.

CAVEAT: GRACELESS SHUTDOWN

Resource tracking cleanup (i.e., on graceless exit of the rpc.lockd) is not supported. As noted in the code, the advisory locking facility that resulted from the 4.4BSD integration of the Heidemann code has broken advisory lock layering. This is also the reason that NFS client locking code was not provided at this time.

When the rpc.lockd goes down, only locks held locally by rpc.lockd itself (i.e., there should be none anyway) would be cleaned. To support this correctly, it would be necessary to add a "local vs. remote" flag to the VOP_ADVLOCK() call, which would require modification of every existing file system implementation.

To support NFS client locking with overlay, loopback, and other FS stacking constructs in the Heidemann framework, it would be necessary to convert the VOP_ADVLOCK() architecture from "call-through" to "veto". This would further necessitate changes to the loopback and union FS types to implement call-through from superior to inferior stack veto, only in those cases (the corrected top level code can only operate on top level FS's exposed at the VFS layer). This has been discussed in detail on various lists.
It is mentioned here because it bears on allowable user space (rpc.lockd, NetWare daemon file locking) assumptions; a crashed rpc.lockd *will* leave proxy state hanging from the node. The conversion process is described in greater detail in the WARNING comment in the kern/kern_lockf.c file in the F_UNLKSYS case.

CAVEAT: INTERFACE AMBIGUITY

Because it was unclear from the rpc.lockd code itself (which is not complete in this regard), the F_CNVT fcntl() call is not fully implemented; instead it is stubbed in kern/kern_descrip.c.

This fcntl() call takes as arguments *any* file fd to get it into the VFS (this can be *any* fd, if the "struct fileops" garbage used by the pipe and socket code is reworked, like it should be), the command "F_CNVT", and a pointer to an NFS handle.

The fcntl() is to return an error, or to return an fd, allocated in the open file table of the calling process.

It is expected that the rpc.lockd will maintain a reference count and a hash, so that if the same file is locked by multiple servers, only one rpc.lockd fd (and therefore one system open file table entry) is consumed. The suggested rpc.lockd implementation is to stat the resulting fd, and hash based on the stat results to locate an existing fd for the same file, if any. If one exists, the reference count in the hash reference structure is to be incremented, and the second (duplicate) fd closed.

Because there is no kernel support for asynchronous calls to the fcntl() system call, it is expected that F_RSETLKW will not typically be used; instead, F_RSETLK will be used, and if EWOULDBLOCK is returned, a retry is to be queued to a timer-kicked retry list in the rpc.lockd. It is expected that there will eventually be a thread spawned for a blocking request to allow it to run to completion, OR an async call trap gate will be defined so that the rpc.lockd will not hang pending completion of blocked F_RSETLKW calls.
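The reference-counted fd bookkeeping suggested above for rpc.lockd could be sketched roughly as follows. This is only an illustration under stated assumptions: every name (`lockd_fd_acquire`, `lockd_fd_release`, `lockd_fd_refs`, `FDTAB_SIZE`) is hypothetical, a small linear table stands in for the hash, and the descriptor is assumed to have been obtained already (for example from a completed F_CNVT, which the patch only stubs).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>	/* open(), for callers and tests */
#include <unistd.h>

#define FDTAB_SIZE	64	/* hypothetical fixed capacity */

struct fd_ref {
	dev_t	dev;	/* st_dev of the file */
	ino_t	ino;	/* st_ino of the file */
	int	fd;	/* the one descriptor kept for this file */
	int	refs;	/* 0 means the slot is free */
};

static struct fd_ref fdtab[FDTAB_SIZE];

/*
 * Register newfd.  If the same file (by st_dev/st_ino) is already
 * tracked, bump its count, close the duplicate, and return the
 * existing fd.  Returns -1 on fstat() failure or a full table.
 */
int
lockd_fd_acquire(int newfd)
{
	struct stat st;
	struct fd_ref *e, *free_slot = NULL;
	size_t i;

	if (fstat(newfd, &st) == -1)
		return (-1);
	for (i = 0; i < FDTAB_SIZE; i++) {
		e = &fdtab[i];
		if (e->refs == 0) {
			if (free_slot == NULL)
				free_slot = e;
			continue;
		}
		if (e->dev == st.st_dev && e->ino == st.st_ino) {
			e->refs++;
			if (e->fd != newfd)
				(void)close(newfd);
			return (e->fd);
		}
	}
	if (free_slot == NULL)
		return (-1);
	free_slot->dev = st.st_dev;
	free_slot->ino = st.st_ino;
	free_slot->fd = newfd;
	free_slot->refs = 1;
	return (newfd);
}

/* Current reference count for fd, or 0 if it is not tracked. */
int
lockd_fd_refs(int fd)
{
	size_t i;

	for (i = 0; i < FDTAB_SIZE; i++)
		if (fdtab[i].refs > 0 && fdtab[i].fd == fd)
			return (fdtab[i].refs);
	return (0);
}

/*
 * Drop one reference; close the fd when the last one goes away.
 * Per the WARNING in kern_lockf.c, a real daemon would have to issue
 * F_UNLKSYS before this final close rather than rely on close().
 * Returns the remaining count, or -1 if fd is not tracked.
 */
int
lockd_fd_release(int fd)
{
	size_t i;

	for (i = 0; i < FDTAB_SIZE; i++) {
		if (fdtab[i].refs > 0 && fdtab[i].fd == fd) {
			if (--fdtab[i].refs == 0)
				(void)close(fd);
			return (fdtab[i].refs);
		}
	}
	return (-1);
}
```

Keying on the stat results rather than on the handle bytes means two handles that resolve to the same file share one open file table entry, which is the point of the scheme.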
Unless an async call or *kernel* threading mechanism is used to implement the blocking lock requests, a race condition is introduced by the queue service latency that can cause starvation deadlock on NFS clients using the lock services.

The implementation was left incomplete because:

1) It requires knowledge of the handle format used by the rpc.lockd (which is hopefully the same as that used by NFS).

2) It requires registration of the handle conversion lookup function from the NFS code (nfsrv_fhtovp() in nfs/nfs_subs.c) in the same way that the LEASE functions are currently registered (or NFS will have to be statically linked at all times).

3) The LEASE function registration code is bogus; like the search continuation ("cookie") code, it is fundamentally broken, and should be moved to a "server as file system consumer presentation layer" to get rid of the common overhead case.

Given #1 & #2, it is possible to kludge a workable solution. #3 would require integration of previously submitted FS layering patches (specifically, the namei/nameifree and the "EXCLUDE" flag changes to kern/vfs_lookup.c).

CAVEAT: DOCUMENTATION

I have not modified the fcntl() man page to include the proxy extensions to the fcntl() locking interface. In addition, the behaviour of the interface is not well documented, except by the code, since it is derived from user-space experimentation on binary systems (lack of this kind of documentation is why there has not been a public rpc.lockd implementation prior to this time).

Patches follow my signature.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present or previous employers.

== 8< cut here 8< ==========================================================
*** /sys/kern/vfs_syscalls.c.nonfs	Fri Apr 19 14:38:32 1996
--- /sys/kern/vfs_syscalls.c	Fri Apr 19 15:34:18 1996
***************
*** 6,11 ****
--- 6,12 ----
   * to the University of California by American Telephone and Telegraph
   * Co.
or Unix System Laboratories, Inc. and are reproduced herein with
   * the permission of UNIX System Laboratories, Inc.
+  * Copyright (c) 1996 Terrence R. Lambert. All rights reserved.
   *
   * Redistribution and use in source and binary forms, with or without
   * modification, are permitted provided that the following conditions
***************
*** 19,24 ****
--- 20,26 ----
   *    must display the following acknowledgement:
   *	This product includes software developed by the University of
   *	California, Berkeley and its contributors.
+  *	This product includes software developed by Terrence R. Lambert.
   * 4. Neither the name of the University nor the names of its contributors
   *    may be used to endorse or promote products derived from this software
   *    without specific prior written permission.
***************
*** 741,746 ****
--- 743,750 ----
  		lf.l_type = F_WRLCK;
  	else
  		lf.l_type = F_RDLCK;
+ 	lf.l_rsys = FLOCK_LOCAL_LOCK;
+ 	lf.l_rpid = FLOCK_LOCAL_LOCK;
  	type = F_FLOCK;
  	if ((flags & FNONBLOCK) == 0)
  		type |= F_WAIT;
== 8< cut here 8< ==========================================================
Index: kern_descrip.c
===================================================================
RCS file: /b/cvstree/ncvs/src/sys/kern/kern_descrip.c,v
retrieving revision 1.28
diff -r1.28 kern_descrip.c
8a9
>  * Copyright (c) 1996 Terrence R. Lambert. All rights reserved.
21a23
>  *	This product includes software developed by Terrence R. Lambert.
210c212
< 	struct flock fl;
---
> 	struct flock flk;
282a285
> 	case F_RSETLKW:
286a290
> 	case F_RSETLK:
291c295,318
< 		error = copyin((caddr_t)uap->arg, (caddr_t)&fl, sizeof (fl));
---
> 		if( ( uap->cmd == F_SETLKW) || ( uap->cmd == F_SETLK)) {
> 			/*
> 			 * Local lock. Only copy in old lock structure
> 			 * to ensure backward binary compatibility (the
> 			 * lock structure may butt up against memory
> 			 * for which a copyin would be invalid).
> 			 */
> 			error = copyin( (caddr_t)uap->arg,
> 					(caddr_t)&flk,
> 					sizeof(struct oflock));
> 			flk.l_rsys = FLOCK_LOCAL_LOCK;
> 			flk.l_rpid = FLOCK_LOCAL_LOCK;
> 		} else {
> 			/*
> 			 * Remote lock proxy request. Only root is
> 			 * allowed to make proxy requests. Copy
> 			 * in full flock structure.
> 			 */
> 			if( suser(p->p_ucred, &p->p_acflag))	/* suser() returns 0 for root */
> 				return( EPERM);
> 			error = copyin( (caddr_t)uap->arg,
> 					(caddr_t)&flk,
> 					sizeof(struct flock));
> 		}
294,296c321,323
< 		if (fl.l_whence == SEEK_CUR)
< 			fl.l_start += fp->f_offset;
< 		switch (fl.l_type) {
---
> 		if (flk.l_whence == SEEK_CUR)
> 			flk.l_start += fp->f_offset;
> 		switch (flk.l_type) {
302c329
< 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_SETLK, &fl, flg));
---
> 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_SETLK, &flk, flg));
308c335
< 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_SETLK, &fl, flg));
---
> 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_SETLK, &flk, flg));
311c338,342
< 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_UNLCK, &fl,
---
> 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_UNLCK, &flk,
> 				F_POSIX));
>
> 		case F_UNLKSYS:
> 			return (VOP_ADVLOCK(vp, (caddr_t)p, F_UNLKSYS, &flk,
318a350
> 	case F_RGETLK:
323c355,378
< 		error = copyin((caddr_t)uap->arg, (caddr_t)&fl, sizeof (fl));
---
> 		if( uap->cmd == F_GETLK) {
> 			/*
> 			 * Local lock. Only copy in old lock structure
> 			 * to ensure backward binary compatibility (the
> 			 * lock structure may butt up against memory
> 			 * for which a copyin would be invalid).
> 			 */
> 			error = copyin( (caddr_t)uap->arg,
> 					(caddr_t)&flk,
> 					sizeof(struct oflock));
> 			flk.l_rsys = FLOCK_LOCAL_LOCK;
> 			flk.l_rpid = FLOCK_LOCAL_LOCK;
> 		} else {
> 			/*
> 			 * Remote lock proxy request. Only root is
> 			 * allowed to make proxy requests. Copy
> 			 * in full flock structure.
> 			 */
> 			if( suser(p->p_ucred, &p->p_acflag))	/* suser() returns 0 for root */
> 				return( EPERM);
> 			error = copyin( (caddr_t)uap->arg,
> 					(caddr_t)&flk,
> 					sizeof(struct flock));
> 		}
326,328c381,383
< 		if (fl.l_whence == SEEK_CUR)
< 			fl.l_start += fp->f_offset;
< 		if ((error = VOP_ADVLOCK(vp,(caddr_t)p,F_GETLK,&fl,F_POSIX)))
---
> 		if (flk.l_whence == SEEK_CUR)
> 			flk.l_start += fp->f_offset;
> 		if ((error = VOP_ADVLOCK(vp,(caddr_t)p,F_GETLK,&flk,F_POSIX)))
330c385,421
< 		return (copyout((caddr_t)&fl, (caddr_t)uap->arg, sizeof (fl)));
---
> 		/*
> 		 * Only copy out full structure if not F_GETLK to ensure
> 		 * we do not (incorrectly) fault the caller.
> 		 */
> 		if( uap->cmd == F_GETLK) {
> 			/* old lock structure is enough...*/
> 			return( copyout((caddr_t)&flk,
> 					(caddr_t)uap->arg,
> 					sizeof(struct oflock)));
> 		} else {	/* F_RGETLK*/
> 			/* new lock structure is required...*/
> 			return( copyout((caddr_t)&flk,
> 					(caddr_t)uap->arg,
> 					sizeof(struct flock)));
> 		}
> 		/* NOTREACHED*/
>
> 	case F_CNVT:
> 		/*
> 		 * Convert an NFS file handle into a descriptor open
> 		 * in the calling process context (used by the lockd
> 		 * to establish open file state as a holder for a
> 		 * lock).
> 		 */
> 		if( suser(p->p_ucred, &p->p_acflag))	/* suser() returns 0 for root */
> 			return( EPERM);
>
> 		/*
> 		 * XXX requires knowledge of the handle format to
> 		 * be presented via this interface.
> 		 * XXX requires registration of the conversion
> 		 * function (which already exists in the
> 		 * NFS code!) via a mechanism similar to
> 		 * that used by the existing LEASES code
> 		 * to allow continued use of NFS as an LKM.
> 		 */
> 		return( EBADF);
857c948
< 	struct flock lf;
---
> 	struct flock flk;
871,874c962,967
< 		lf.l_whence = SEEK_SET;
< 		lf.l_start = 0;
< 		lf.l_len = 0;
< 		lf.l_type = F_UNLCK;
---
> 		flk.l_whence = SEEK_SET;
> 		flk.l_start = 0;
> 		flk.l_len = 0;
> 		flk.l_type = F_UNLCK;
> 		flk.l_rsys = FLOCK_LOCAL_LOCK;
> 		flk.l_rpid = FLOCK_LOCAL_LOCK;
876c969
< 		(void) VOP_ADVLOCK(vp, (caddr_t)p, F_UNLCK, &lf, F_POSIX);
---
> 		(void) VOP_ADVLOCK(vp, (caddr_t)p, F_UNLCK, &flk, F_POSIX);
883,886c976,981
< 		lf.l_whence = SEEK_SET;
< 		lf.l_start = 0;
< 		lf.l_len = 0;
< 		lf.l_type = F_UNLCK;
---
> 		flk.l_whence = SEEK_SET;
> 		flk.l_start = 0;
> 		flk.l_len = 0;
> 		flk.l_type = F_UNLCK;
> 		flk.l_rsys = FLOCK_LOCAL_LOCK;
> 		flk.l_rpid = FLOCK_LOCAL_LOCK;
888c983
< 		(void) VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &lf, F_FLOCK);
---
> 		(void) VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &flk, F_FLOCK);
920c1015
< 	struct flock lf;
---
> 	struct flock flk;
928,930c1023,1027
< 	lf.l_whence = SEEK_SET;
< 	lf.l_start = 0;
< 	lf.l_len = 0;
---
> 	flk.l_whence = SEEK_SET;
> 	flk.l_start = 0;
> 	flk.l_len = 0;
> 	flk.l_rsys = FLOCK_LOCAL_LOCK;
> 	flk.l_rpid = FLOCK_LOCAL_LOCK;
932c1029
< 		lf.l_type = F_UNLCK;
---
> 		flk.l_type = F_UNLCK;
934c1031
< 		return (VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &lf, F_FLOCK));
---
> 		return (VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &flk, F_FLOCK));
937c1034
< 		lf.l_type = F_WRLCK;
---
> 		flk.l_type = F_WRLCK;
939c1036
< 		lf.l_type = F_RDLCK;
---
> 		flk.l_type = F_RDLCK;
944,945c1041,1042
< 		return (VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &lf, F_FLOCK));
< 	return (VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &lf, F_FLOCK|F_WAIT));
---
> 		return (VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &flk, F_FLOCK));
> 	return (VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &flk, F_FLOCK|F_WAIT));
Index: kern_lockf.c
===================================================================
RCS file: /b/cvstree/ncvs/src/sys/kern/kern_lockf.c,v
retrieving revision 1.5
diff -r1.5 kern_lockf.c
3a4
>  * Copyright (c) 1996 Terrence R. Lambert. All rights reserved.
19a21
>  *	This product includes software developed by Terrence R. Lambert.
142a145,146
> 	lock->lf_rsys = fl->l_rsys;
> 	lock->lf_rpid = fl->l_rpid;
149a154,178
> 	case F_UNLKSYS:
> 		/*
> 		 * WARNING: lf_overlap() uses LOCK_COMPARE, which will
> 		 * not match true a remote system id/pid given a lock
> 		 * removal on close. If the rpc.lockd implementation
> 		 * crashes (close on resource cleanup), locks held
> 		 * by a remote system will *NOT* be cleaned! To fix
> 		 * this, we would need a "proxy" flag to this routine
> 		 * which is propagated down via lf_clearlock() to
> 		 * lf_findoverlap() and flags LOCK_COMPARE() to
> 		 * ignore remote system id/pid in comparing. Since
> 		 * we believe that VOP_ADVLOCK() is improperly
> 		 * implemented, rather than changing every file
> 		 * system now, we will wait until we do a global
> 		 * clean up of the advisory locking code. Note
> 		 * that this means the rpc.lockd can not "optimize"
> 		 * by closing the fd to free locks for the final
> 		 * reference instance for the file; it *must*
> 		 * call with F_UNLKSYS (see closef() in kern_descrip.c).
> 		 * Resource track cleanup on crash will only free locks
> 		 * held by rpc.lockd itself (ie: none), not those held
> 		 * by proxy.
> 		 *
> 		 * -- Terry Lambert
> 		 */
504a534,535
> 		fl->l_rsys = block->lf_rsys;
> 		fl->l_rpid = block->lf_rpid;
565,566c596,597
< 		if (((type & SELF) && lf->lf_id != lock->lf_id) ||
< 		    ((type & OTHERS) && lf->lf_id == lock->lf_id)) {
---
> 		if (((type & SELF) && !LOCK_COMPARE(lf, lock)) ||
> 		    ((type & OTHERS) && LOCK_COMPARE(lf, lock))) {
772a804
> 		lock->lf_type == F_UNLKSYS ? "unlock_sys" :
774c806
< 	if (lock->lf_block)
---
> 	if (lock->lf_block) {
776,777c808,812
< 	else
< 		printf("\n");
---
> 		if( lock->lf_rsys == FLOCK_LOCAL_LOCK)
> 			printf( " (LCL)\n");
> 		else printf( " (NFS=0x%08x:0x%08x)\n",
> 			lock->lf_rsys, lock->lf_rpid);
> 	} else printf("\n");
800a836
> 			lf->lf_type == F_UNLKSYS ? "unlock_sys" :
802c838
< 		if (lf->lf_block)
---
> 		if (lf->lf_block) {
804,805c840,844
< 		else
< 			printf("\n");
---
> 			if( lock->lf_rsys == FLOCK_LOCAL_LOCK)
> 				printf( " (LCL)\n");
> 			else printf( " (NFS=0x%08x:0x%08x)\n",
> 				lock->lf_rsys, lock->lf_rpid);
> 		} else printf("\n");
Index: fcntl.h
===================================================================
RCS file: /b/cvstree/ncvs/src/sys/sys/fcntl.h,v
retrieving revision 1.3
diff -r1.3 fcntl.h
8a9
>  * Copyright (c) 1996 Terrence R. Lambert. All rights reserved.
21a23
>  *	This product includes software developed by Terrence R. Lambert.
141a144,149
> #ifndef _POSIX_SOURCE
> #define	F_RGETLK	10	/* remote F_GETLK*/
> #define	F_RSETLK	11	/* remote F_SETLK*/
> #define	F_CNVT		12	/* open a file by NFS file handle*/
> #define	F_RSETLKW	13	/* remote F_SETLKW*/
> #endif	/* _POSIX_SOURCE*/
149a158,160
> #ifndef _POSIX_SOURCE
> #define	F_UNLKSYS	4	/* remove all locks by system*/
> #endif	/* _POSIX_SOURCE*/
165a177,179
> 	long	l_rsys;		/* remote system id*/
> 	pid_t	l_rpid;		/* pid from remote system*/
> 	long	l_reserved[ 4];	/* future use -- avoid syscall id change*/
166a181,198
>
> #ifdef KERNEL
> /*
>  * Backward compatibility advisory file segment locking data type; used
>  * only by kernel to ensure binary compatibility. We only copy in this
>  * size of structure for F_GETLK, F_SETLK, and F_SETLKW.
>  */
> struct oflock {
> 	off_t	l_start;	/* starting offset */
> 	off_t	l_len;		/* len = 0 means until end of file */
> 	pid_t	l_pid;		/* lock owner */
> 	short	l_type;		/* lock type: read/write, etc. */
> 	short	l_whence;	/* type of l_start */
> };
>
> /* token for l_rsys, l_rpid*/
> #define	FLOCK_LOCAL_LOCK	0	/* a non-NFS lock*/
> #endif	/* KERNEL*/
Index: lockf.h
===================================================================
RCS file: /b/cvstree/ncvs/src/sys/sys/lockf.h,v
retrieving revision 1.3
diff -r1.3 lockf.h
3a4
>  * Copyright (c) 1996 Terrence R. Lambert. All rights reserved.
19a21
>  *	This product includes software developed by Terrence R. Lambert.
54a57,58
> 	long	lf_rsys;	/* remote system id*/
> 	pid_t	lf_rpid;	/* pid from remote system*/
58a63,71
>
> /*
>  * Compare credentials on two locks. Returns TRUE if credentials are
>  * equivalent, FALSE otherwise.
>  */
> #define LOCK_COMPARE(l1, l2)	(( l1->lf_id == l2->lf_id) && \
> 				( l1->lf_rsys == l2->lf_rsys) && \
> 				( l1->lf_rpid == l2->lf_rpid) \
> 				)
== 8< cut here 8< ==========================================================
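
As a reading aid for the LOCK_COMPARE macro in the lockf.h diff, here is a small user-space sketch of the same three-way ownership test. The struct and function names are hypothetical, and a plain pointer stands in for lf_id. It only illustrates why two proxy locks taken through the same rpc.lockd process on behalf of different remote systems count as different owners.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical stand-in for the ownership fields of struct lockf. */
struct lock_owner_sketch {
	const void	*id;	/* local owner, stands in for lf_id */
	long		 rsys;	/* remote system id (0 for a local lock) */
	pid_t		 rpid;	/* pid on the remote system (0 if local) */
};

/*
 * Two locks have the same owner only if the local id, the remote
 * system id and the remote pid all match, as in LOCK_COMPARE.
 */
bool
lock_owner_equal(const struct lock_owner_sketch *a,
    const struct lock_owner_sketch *b)
{
	return (a->id == b->id &&
	    a->rsys == b->rsys &&
	    a->rpid == b->rpid);
}
```

With only the local id compared (the pre-patch test), every proxy lock would appear to belong to the one rpc.lockd process and could never conflict with another proxy lock.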