From owner-freebsd-fs Mon Aug 5 19:00:17 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id TAA08217 for fs-outgoing; Mon, 5 Aug 1996 19:00:17 -0700 (PDT) Received: from parkplace.cet.co.jp (parkplace.cet.co.jp [202.32.64.1]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id TAA08209 for ; Mon, 5 Aug 1996 19:00:11 -0700 (PDT) Received: from localhost (michaelh@localhost) by parkplace.cet.co.jp (8.7.5/CET-v2.1) with SMTP id BAA24468; Tue, 6 Aug 1996 01:59:09 GMT Date: Tue, 6 Aug 1996 10:59:09 +0900 (JST) From: Michael Hancock Reply-To: Michael Hancock To: Terry Lambert cc: dfr@render.com, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org Subject: Per fs vnode pools (was Re: NFS Diskless Dispare...) In-Reply-To: <199608051859.LAA11723@phaeton.artisoft.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk [Moved to fs from current] Thanks. Previously, I didn't see how an inactive nfs vnode would be reclaimed and moved to a ffs vnode pool, cleanly. The generic interfaces takes care of all this cleanly. It looks like a win in terms of performance and new fs development ease at the expense of a little space. Regards, Mike Hancock On Mon, 5 Aug 1996, Terry Lambert wrote: > > I think what he's is saying is that when the vnodes are in the global pool > > the chances of reusing a vnode that was used previously by a particular fs > > is less than having a per fs vnode pool. > > No, it's not. > > > The problem with the per fs vnode pool is the management overhead. When > > you need to start reusing vnodes you need to search through all the > > different fs pools to find a vnode. > > > > I don't know which is a better trade-off. > > This isn't how per FS vnode pools should work. > > When you want a vnode, you call the generic "getnewvnode()" from the > XXX_vget routine via VFS_VGET (sys/mount.h). > > This function returns a vnode with an FS specific inode. > > In reality, you never care to have a vnode without an FS specific inode, > since there is no way to access or write buffers hung off the critter > because of the way vclean works. > > > What I'm suggesting is that there needs to be both a VFS_VGET and > a VFS_VPUT (or VFS_VRELE). With the additional per fs release > mechanism, each FS instance can allocate an inode pool at its > instantiation (or do it on a per instance basis, the current > method which makes inode allocation so slow...). > > Consider UFS: the in core inode struct consists of a bunch of in core > data elements (which should probably be in their own structure) and > a "struct dinode i_din" for the on disk inode. > > You could modify this as: > > struct inode { > struct icinode i_ic; /* in core inode*/ > struct vnode i_iv; /* vnode for inode*/ > struct dinode i_din; /* on disk inode*/ > }; > > > Essentially, allocation of an inode would allocate a vnode. There > would never be an inode without a vnode. > > > The VFS_VPUT would put the vnode into a pool maintained by the > FS per fs instance (the in core fs structure would need an > additional structure element to point to the maintenance data). > > The FS itself would use generic maintenance routines shared by > all FS's... and capable of taking a structure size for i_ic and > i_din element size variations between FS types. This would > maintain all common code in the common interface. 
> > > The use of the vget to associate naked vnodes with the FS's would > go away; in no case is a naked vnode ever useful, since using vnode > buffer elements requires an FS context. > > > In effect, the ihash would become a vnhash and LRU for use in > reclaiming vnode/inode pairs. This would be much more efficient > than the current dual allocation sequence. > > > This would allow the discard of the vclean interface, and of the > lock used to ensure it operates (a lock which has to be reimplemented > and reimplemented correctly on a per FS basis in the XXX_LOCK and > XXX_UNLOCK FS specific routines). > > > The vnode locking could then be done in common code: > > > vn_lock( vp, flags, p) > struct vnode *vp; > int flags; > struct proc *p; > { > /* actual lock*/ > if( ( st = ...) == SUCCESS) { > if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) { > /* lock was vetoed, undo actual lock*/ > ... > } > } > return( st); > } > > > The point here is that the lock contention (if any) can be resolved > without ever hitting the FS itsef in the failure case. > > > > The generic case of the per FS lock is now: > > > int > XXX_lock(ap) > struct vop_lock_args /* { > struct vnode *a_vp; > int a_flags; > struct proc *a_p; > } */ *ap; > { > return( SUCCESS); > } > > > This is much harder to screw up when writing a new FS, and makes for much > smaller intermediate layers. > > > For NFS and unions, there isn't an i_din... but they also require data > hung off the vnode, so the same allocation rules apply. It's a win > either way, and has the side benefit of unmunging the vn. > > > I believe that John Heidemann's thesis had this in mind when it refers > to using an RPC layer to use remote file system layers as intermediates > in a local VFS stack. > > > Terry Lambert > terry@lambert.org > --- > Any opinions in this posting are my own and not those of my present > or previous employers. > -- michaelh@cet.co.jp http://www.cet.co.jp CET Inc., Daiichi Kasuya BLDG 8F 2-5-12, Higashi Shinbashi, Minato-ku, Tokyo 105 Japan Tel: +81-3-3437-1761 Fax: +81-3-3437-1766 From owner-freebsd-fs Tue Aug 6 08:50:45 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id IAA25900 for fs-outgoing; Tue, 6 Aug 1996 08:50:45 -0700 (PDT) Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id IAA25887 for ; Tue, 6 Aug 1996 08:50:39 -0700 (PDT) Received: from minnow.render.com (minnow.render.com [193.195.178.1]) by minnow.render.com (8.6.12/8.6.9) with SMTP id QAA15586; Tue, 6 Aug 1996 16:50:34 +0100 Date: Tue, 6 Aug 1996 16:50:33 +0100 (BST) From: Doug Rabson To: Terry Lambert cc: Michael Hancock , jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org Subject: Re: NFS Diskless Dispare... In-Reply-To: <199608051859.LAA11723@phaeton.artisoft.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk [moved to freebsd-fs] On Mon, 5 Aug 1996, Terry Lambert wrote: > What I'm suggesting is that there needs to be both a VFS_VGET and > a VFS_VPUT (or VFS_VRELE). With the additional per fs release > mechanism, each FS instance can allocate an inode pool at its > instantiation (or do it on a per instance basis, the current > method which makes inode allocation so slow...). Not really sure how this would work for filesystems without a flat namespace? VFS_VGET is not supported for msdosfs, cd9660, nfs and probably others. 
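For concreteness, the per-FS pool plus VFS_VGET/VFS_VPUT pairing Terry describes above can be modeled in a few lines of userland C. This is only a sketch under that description; the names (fspool, fspool_get, fspool_put, vnode_hdr) are invented for the illustration and are not a kernel interface.

/*
 * Minimal userland model of a per-filesystem vnode/inode pool, as a
 * sketch of the VFS_VGET/VFS_VPUT idea discussed above.  All names
 * here are hypothetical; this is not the kernel interface.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct vnode_hdr {              /* stand-in for struct vnode */
    struct vnode_hdr *v_freelist;
    int v_usecount;
};

struct fspool {                 /* one pool per mounted filesystem */
    size_t fp_objsize;          /* vnode hdr + in core + on disk inode */
    struct vnode_hdr *fp_free;  /* released, reusable vnode/inode pairs */
};

static void
fspool_init(struct fspool *fp, size_t fs_private_size)
{
    fp->fp_objsize = sizeof(struct vnode_hdr) + fs_private_size;
    fp->fp_free = NULL;
}

/* VFS_VGET analogue: return a vnode with its FS-specific inode attached. */
static struct vnode_hdr *
fspool_get(struct fspool *fp)
{
    struct vnode_hdr *vp;

    if ((vp = fp->fp_free) != NULL)
        fp->fp_free = vp->v_freelist;       /* reuse from this FS's pool */
    else if ((vp = malloc(fp->fp_objsize)) == NULL)
        return (NULL);
    memset(vp, 0, fp->fp_objsize);
    vp->v_usecount = 1;
    return (vp);
}

/* VFS_VPUT analogue: return the pair to the owning filesystem's pool. */
static void
fspool_put(struct fspool *fp, struct vnode_hdr *vp)
{
    vp->v_freelist = fp->fp_free;
    fp->fp_free = vp;
}

int
main(void)
{
    struct fspool ffs_pool;
    struct vnode_hdr *vp;

    fspool_init(&ffs_pool, 256);    /* pretend 256 bytes of UFS inode state */
    vp = fspool_get(&ffs_pool);
    printf("got vnode/inode pair %p\n", (void *)vp);
    fspool_put(&ffs_pool, vp);
    return (0);
}

The point of passing the object size at pool creation is that one generic allocator can serve UFS, NFS, or any other FS whose private per-node data differs in size, which is what keeps the maintenance code common.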
> > Consider UFS: the in core inode struct consists of a bunch of in core > data elements (which should probably be in their own structure) and > a "struct dinode i_din" for the on disk inode. > > You could modify this as: > > struct inode { > struct icinode i_ic; /* in core inode*/ > struct vnode i_iv; /* vnode for inode*/ > struct dinode i_din; /* on disk inode*/ > }; > > > Essentially, allocation of an inode would allocate a vnode. There > would never be an inode without a vnode. > > > The VFS_VPUT would put the vnode into a pool maintained by the > FS per fs instance (the in core fs structure would need an > additional structure element to point to the maintenance data). > > The FS itself would use generic maintenance routines shared by > all FS's... and capable of taking a structure size for i_ic and > i_din element size variations between FS types. This would > maintain all common code in the common interface. > > > The use of the vget to associate naked vnodes with the FS's would > go away; in no case is a naked vnode ever useful, since using vnode > buffer elements requires an FS context. > > > In effect, the ihash would become a vnhash and LRU for use in > reclaiming vnode/inode pairs. This would be much more efficient > than the current dual allocation sequence. > > > This would allow the discard of the vclean interface, and of the > lock used to ensure it operates (a lock which has to be reimplemented > and reimplemented correctly on a per FS basis in the XXX_LOCK and > XXX_UNLOCK FS specific routines). Wait a minute. The VOP_LOCK is not there just for vclean to work. If you took it out, a lot of the VOPs in ufs would break due to unexpected reentry. The VOP_LOCK is there to ensure that operations which modify the vnode are properly sequenced even if the process has to sleep during the operation. > > > The vnode locking could then be done in common code: > > > vn_lock( vp, flags, p) > struct vnode *vp; > int flags; > struct proc *p; > { > /* actual lock*/ > if( ( st = ...) == SUCCESS) { > if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) { > /* lock was vetoed, undo actual lock*/ > ... > } > } > return( st); > } > > > The point here is that the lock contention (if any) can be resolved > without ever hitting the FS itsef in the failure case. > You can't do this for NFS. If you use exclusive locks in NFS and a server dies, you easily can end up holding onto a lock for the root vnode until the server reboots. To make it work for NFS, you would have to make the lock interruptable which forces you to fix code which does not check the error return from VOP_LOCK all over the place. I hope we are not talking at cross purposes. We are talking about the vnode lock, not the advisory record locking aren't we? -- Doug Rabson, Microsoft RenderMorphics Ltd. Mail: dfr@render.com Phone: +44 171 251 4411 FAX: +44 171 251 0939 From owner-freebsd-fs Tue Aug 6 10:32:24 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA02217 for fs-outgoing; Tue, 6 Aug 1996 10:32:24 -0700 (PDT) Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA02212 for ; Tue, 6 Aug 1996 10:32:22 -0700 (PDT) Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id KAA13564; Tue, 6 Aug 1996 10:28:47 -0700 From: Terry Lambert Message-Id: <199608061728.KAA13564@phaeton.artisoft.com> Subject: Re: NFS Diskless Dispare... 
To: dfr@render.com (Doug Rabson) Date: Tue, 6 Aug 1996 10:28:47 -0700 (MST) Cc: terry@lambert.org, michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org In-Reply-To: from "Doug Rabson" at Aug 6, 96 04:50:33 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk > [moved to freebsd-fs] > > On Mon, 5 Aug 1996, Terry Lambert wrote: > > > What I'm suggesting is that there needs to be both a VFS_VGET and > > a VFS_VPUT (or VFS_VRELE). With the additional per fs release > > mechanism, each FS instance can allocate an inode pool at its > > instantiation (or do it on a per instance basis, the current > > method which makes inode allocation so slow...). > > Not really sure how this would work for filesystems without a flat > namespace? VFS_VGET is not supported for msdosfs, cd9660, nfs and > probably others. Conceptually, it's pretty tribial to support; it's not supported because the stacking is not correctly implemented for these FS's. Look at the /sys/miscfs/nullfs use of VOP_VGET. > Wait a minute. The VOP_LOCK is not there just for vclean to work. If you > took it out, a lot of the VOPs in ufs would break due to unexpected > reentry. The VOP_LOCK is there to ensure that operations which modify the > vnode are properly sequenced even if the process has to sleep during the > operation. That's why the vn_lock would be called. The VOP_LOCK is a transparent veto/allow interface in that case, but that doesn't mean a counting reference isn't held by PID (like it had to be). The actual Lite2 routine for "actual lock" is called lockmgr() and lives in kern_lock.c in the Lite2 sources. Lite2 already moves in this direction -- it just hasn't gone far enough. > > The vnode locking could then be done in common code: > > > > > > vn_lock( vp, flags, p) > > struct vnode *vp; > > int flags; > > struct proc *p; > > { > > /* actual lock*/ > > if( ( st = ...) == SUCCESS) { > > if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) { > > /* lock was vetoed, undo actual lock*/ > > ... > > } > > } > > return( st); > > } > > > > > > The point here is that the lock contention (if any) can be resolved > > without ever hitting the FS itsef in the failure case. > > > > You can't do this for NFS. If you use exclusive locks in NFS and a > server dies, you easily can end up holding onto a lock for the root vnode > until the server reboots. To make it work for NFS, you would have to make > the lock interruptable which forces you to fix code which does not check > the error return from VOP_LOCK all over the place. This is one of the "flags" fields, and it only applies to the NFS client code. Actually, since the NFSnode is not transiently destroyed as a result of server reboot (statelessness *is* a win, no matter what the RFS advocates would have you believe), there isn't a problem with holding the reference. One of the things Sun recommends is not making the mounts on mount points in the root directory; to avoid exactly this scenario (it really doesn't matter in the diskless/dataless case, since you will hang on swap or page-in from image-file-as-swap-store anyway). The root does not need to be locked for the node lookup for the root for a covering node in any case; this is an error in the "node x covers node y" case in the lookup case. You can see that the lookup code documents a race where it frees and relocks the parent node to avoid exactly this scenario, actually. 
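A minimal, compilable model of the veto-style vn_lock sketched earlier in this message, assuming lockmgr is reduced to a flag test: the common layer takes the actual lock, and the per-FS hook can only veto it. All names below (vnode_model, lockmgr_model, vn_lock_model, xxx_lock_model) are placeholders for the illustration, not the Lite2 code.

/*
 * Userland sketch of the veto-style vn_lock described above: the common
 * code takes the actual lock, then offers the FS a chance to veto.  The
 * lock is modeled with a plain flag; "lockmgr" here is only a stand-in
 * for the Lite2 routine of that name.
 */
#include <stdio.h>

#define SUCCESS 0

struct vnode_model {
    int v_locked;                               /* the "actual lock" */
    int (*v_op_lock)(struct vnode_model *);     /* per-FS veto hook, may be NULL */
};

static int
lockmgr_model(struct vnode_model *vp)
{
    if (vp->v_locked)
        return (1);             /* would sleep and retry in the real thing */
    vp->v_locked = 1;
    return (SUCCESS);
}

static int
vn_lock_model(struct vnode_model *vp)
{
    int st;

    if ((st = lockmgr_model(vp)) == SUCCESS) {
        /* Only now consult the FS; a NULL hook means "always allow". */
        if (vp->v_op_lock != NULL &&
            (st = vp->v_op_lock(vp)) != SUCCESS)
            vp->v_locked = 0;   /* lock was vetoed, undo actual lock */
    }
    return (st);
}

/* The generic per-FS case collapses to an always-allow veto. */
static int
xxx_lock_model(struct vnode_model *vp)
{
    (void)vp;
    return (SUCCESS);
}

int
main(void)
{
    struct vnode_model vn = { 0, xxx_lock_model };

    printf("first lock: %d\n", vn_lock_model(&vn));   /* 0 = SUCCESS */
    printf("second lock: %d\n", vn_lock_model(&vn));  /* non-zero: already held */
    return (0);
}

Note that a NULL hook lets the common code skip the FS call entirely, which is what the stack-collapse argument later in the thread relies on.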
A lock does not need to be held in the lookup for the parent in the NFS lookup case for the mount point traversal. I believe this is an error in the current code. The issue is more interesting in the client case; a reference is not a lock, per se, it's an increment of the reference count. The server holds the lock mid path traversal. This is resolved by setting the "interruptable" flag on the vn_lock into the underlying FS on the server. The easiest way to think of this is in terms of provider interfaces and consumer interfaces. There are many FS provider interfaces. The FS consumer interfaces are the syscall layer (the vfs_subr.c) and the NFS client. This goes hand in hand with the discussion we had about the VOP_READDIR interface needing to be split into "get buffer/reference buffer element" (remember the conversation about killing off the cookie interface about a year ago?!?!). > I hope we are not talking at cross purposes. We are talking about the > vnode lock, not the advisory record locking aren't we? Yes. The VOP_ADVLOCK is also (ideally) a veto interface. This allows lock contention from several processes on the same client to be resolved locally without hitting the wire, and gives a one client pseudo-flock that works without fully implementing the NFS locking code. This is really irrelevant to the VOP_LOCK code, which deals with asserting the lock only in the exception cases. In the NFS client case, the VOP_LOCK and VOP_ADVLOCK are non-null. I didn't show the sleep interface in the vn_lock in the case of the failure. The sleep puts a loop around the "actual lock" code so a sleep occurs above, at the higher code level. Intermediate locks on per layer vnodes (if any are truly needed; see below) are automatically wound and unwound for retry in the blocking case. In the NFS case, the lock is asserted to the underlying FS, and the sleep target is returned to the top of the loop by the FS layer where the contention occurred (basically, a vnodep is returned in the != SUCCESS case (SUCCESS == 0); this is used as the sleep target. If a lock in the NFS server code fails, and it fails for the UFS lock case for the underlying FS, then it should sleep on the UFS vnode being unlocked. The veto interface actually implies a couple of semantic changes; the real implementation would probably be as a NULL lock entry to allow the routine to not be called at all, saving the vnode_if parameter list deconstruction/reconstruction. This allows the substitution of a chaining interface for a file system stacking layer. Now you are probably asking "but how can this work when an intermediate non-NULL layer fans out or in from multiple vnodes?". The union FS case is one of the most interesting cases for this, since what you want to do is conditionally assert a lock on two or more underlying FS's, either of which could have NULL or non-NULL veto code. The reason it is interesting is stack operand collapse in a stacking instance. I could have the following simple case: (syscalls or NFS or AFP or SMB or NetWare kernel server) consumer vn_lock | ^ | ^ v | v | quota layer quota VOP_LOCK (NULL) | ^ | ^ v | v | uid mapping layer uid VOP_LOCK (NULL) | ^ | ^ v | v | FFS FFS VOP_LOCK (NULL) Really, you want to collapse NULL layer entries. But since the stack could be reentered from the top, how can you do this without endangering the locking of terminal nodes based on intermediate nodes? 
It turns out that the function collapse for the VOP_LOCK's in this case is NULL; but say we replace FFS with the NFS client, where the last layer is non-NULL? We would want to collapse to the NFS VOP_LOCK call, since the intermediate chainings are NULL, but the terminal chaining is not. Similar collapse could remove the uid mapping layer's VOP_LOOKUP, leaving the quota VOP_LOOKUP (which has to be there to hide the quota file and protect it) followed by the FFS VOP_LOOKUP. The call-down chain is abbreviated. This is a general win in the veto interface cases. The only place you are required to propagate is the non-NULL cases, and the non-NULL case will only occur when a fan-out or fan-in of vnodes occurs between layers. Currently collapse is not implemented. Part of the support for collapse without full kernel recompilation on VOP addition was the 0->1 FS instance count changes to the vfs_init.c code and the addition of the structure sizing field in the vnode_if.c generation in my big patch set (where the vnode_if.c generated had the structure vfs_op_descs size computed in the vnode_if.c file. The change did not simply allow the transition from 0->N loadable FS's (part of the necessary work for discardable fallback drivers for the FS, assuming kernel paging at some point in the future), and it did not just allow you to add VFS OPS to the vnode_if without having to recompile all FS modules and LKM's (it's stated intent). The change also allows (with the inclusion of a structure sort, since the init causes a structure copy anyway to get it into a stack instantiation) the simplification of the vnode_if call to eliminate the intermediate functioncall stub: a necessary step towards call graph collapse. You want this so that if you have 10 FS layers in a stack, you only have to call one or two veto functions out of the 10... and if they are all NULL, the one is synthetic anyway. This is a big win in reducing the current code duplication, which you want to do not only to reduce code size, but to make FS's more robust. The common behaviours of FS's *should* be implemented in common code. The Lite2 code recognizes this at the VOP_LOCK level in a primitive fashion by introducing the lockmgr() call, but since the model is not uniformly applied, and deadly-embrace or two caller starvation deadlocks can still occur in the Lite2 model. Going to the next step, a veto model, both increases the code robustness considerably, as well as resolving the state wind/unwind problems inherent in fan out. The fan out problem is *the* problem with the unionfs, at this point. Regards, Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. From owner-freebsd-fs Wed Aug 7 22:20:02 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id WAA08289 for fs-outgoing; Wed, 7 Aug 1996 22:20:02 -0700 (PDT) Received: from parkplace.cet.co.jp (parkplace.cet.co.jp [202.32.64.1]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id WAA08249 for ; Wed, 7 Aug 1996 22:19:58 -0700 (PDT) Received: from localhost (michaelh@localhost) by parkplace.cet.co.jp (8.7.5/CET-v2.1) with SMTP id FAA11948; Thu, 8 Aug 1996 05:19:30 GMT Date: Thu, 8 Aug 1996 14:19:30 +0900 (JST) From: Michael Hancock Reply-To: Michael Hancock To: Terry Lambert cc: dfr@render.com, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org Subject: Re: Per fs vnode pools (was Re: NFS Diskless Dispare...) 
In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk On Tue, 6 Aug 1996, I wrote: > > In effect, the ihash would become a vnhash and LRU for use in > > reclaiming vnode/inode pairs. This would be much more efficient > > than the current dual allocation sequence. Would you want this to be LRU vnodes with no buffer pages first? The buffer cache is being reclaimed, with some kind of algorithm, independent of the vnodes. You want to keep the vnodes with data still hanging off of them in the fs pool longer. BTW, is the incore inode table fixed or dynamic? Mike Hancock From owner-freebsd-fs Thu Aug 8 10:41:07 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA01046 for fs-outgoing; Thu, 8 Aug 1996 10:41:07 -0700 (PDT) Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA01031 for ; Thu, 8 Aug 1996 10:41:04 -0700 (PDT) Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id KAA17340; Thu, 8 Aug 1996 10:33:44 -0700 From: Terry Lambert Message-Id: <199608081733.KAA17340@phaeton.artisoft.com> Subject: Re: Per fs vnode pools (was Re: NFS Diskless Dispare...) To: michaelh@cet.co.jp Date: Thu, 8 Aug 1996 10:33:44 -0700 (MST) Cc: terry@lambert.org, dfr@render.com, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org In-Reply-To: from "Michael Hancock" at Aug 8, 96 02:19:30 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk > On Tue, 6 Aug 1996, Michael wrote: > > > > In effect, the ihash would become a vnhash and LRU for use in > > > reclaiming vnode/inode pairs. This would be much more efficient > > > than the current dual allocation sequence. > > Would you want this to be LRU vnodes with no buffer pages first? Yes. Minimally, you'd want a dual insertion point for LRU pages: head | vnodes without buffer pages | vnodes with buffer pages | tail insertion points ---^ ---^ > The > buffer cache is being reclaimed, with some kind of algorithm, independent > of the vnodes. You want to keep the vnodes with data still hanging off of > them in the fs pool longer. Actually, you want to be able to impose a working set quota on a per vnode basis using the cache reclaim algorithm. This avoids large mmap's from thrashing the cache. You could have supervisor, or even user, overrides for the behaviour. head | buffer reclaimation list | tail ^ ^--- insert here if vnode buffer count | is below working set quota insert here if vnode buffer count equals working set quota So truly, it does not want to be independent of the vnodes. A vnode quota is better than a process quota, since a process can use vnodes in common with other processes; you don't want to have a process with a low working set quota able to interfere with locality for another otherwise unrelated process. > BTW, is the incore inode table fixed or dynamic? Currently dynamic in FFS, and FS implementation dependent in principle. Potentially you will want to be able to install soft usage limits via mount options, independent of FS, assuming a common subsystem is being used to implement the allocation and LRU maintenance for each FS. This would imply a need to be able to force a reclaim, or allocation balancing at a minimum, in low memory situations. 
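The two-insertion-point reclaim list with a per-vnode working set quota described above might look roughly like the following userland sketch. The structure and function names are invented, and the real buffer cache obviously carries far more state; buffers of a vnode that has reached its quota go to the head (reclaimed first), buffers of vnodes still under quota go to the tail.

/*
 * Sketch of the two-insertion-point reclaim list described above.
 * Plain userland C; purely illustrative.
 */
#include <stdio.h>

struct vnode_ws {
    int ws_bufcount;        /* buffers currently cached for this vnode */
    int ws_quota;           /* per-vnode working set quota */
};

struct buf_ws {
    struct vnode_ws *b_vp;
    struct buf_ws *b_next, *b_prev;
};

struct reclaim_list {
    struct buf_ws head, tail;   /* sentinels; reclaim proceeds from head */
};

static void
reclaim_init(struct reclaim_list *rl)
{
    rl->head.b_prev = NULL;
    rl->head.b_next = &rl->tail;
    rl->tail.b_prev = &rl->head;
    rl->tail.b_next = NULL;
}

static void
insert_after(struct buf_ws *pos, struct buf_ws *bp)
{
    bp->b_prev = pos;
    bp->b_next = pos->b_next;
    pos->b_next->b_prev = bp;
    pos->b_next = bp;
}

static void
reclaim_insert(struct reclaim_list *rl, struct buf_ws *bp)
{
    bp->b_vp->ws_bufcount++;
    if (bp->b_vp->ws_bufcount >= bp->b_vp->ws_quota)
        insert_after(&rl->head, bp);        /* at quota: first to go */
    else
        insert_after(rl->tail.b_prev, bp);  /* under quota: last to go */
}

/* Take the reclaim candidate from the head of the list, if any. */
static struct buf_ws *
reclaim_one(struct reclaim_list *rl)
{
    struct buf_ws *bp = rl->head.b_next;

    if (bp == &rl->tail)
        return (NULL);
    bp->b_prev->b_next = bp->b_next;
    bp->b_next->b_prev = bp->b_prev;
    bp->b_vp->ws_bufcount--;
    return (bp);
}

int
main(void)
{
    struct reclaim_list rl;
    struct vnode_ws vp = { 0, 2 };      /* quota of two cached buffers */
    struct buf_ws b1 = { &vp }, b2 = { &vp }, b3 = { &vp };

    reclaim_init(&rl);
    reclaim_insert(&rl, &b1);           /* under quota: tail */
    reclaim_insert(&rl, &b2);           /* hits quota: head */
    reclaim_insert(&rl, &b3);           /* over quota: head */
    printf("first reclaim victim is b3? %d\n", reclaim_one(&rl) == &b3);
    return (0);
}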
This is actually a consequence of the buffer cache information not being indexed by device/offset for data which is not referred by vnode: inode information, etc.. If I had my preferences, the cache would be indexible by dev/offset as well (I would *not* eliminate the vnode/offset indexing currently present, since it avoids a bmap on every call that deals with file data). One major win here is that getting one on disk inode vs. another on disk inode in the same directory has a high probability of locality (the FFS paper makes this clear when looking at the directory/inode/cylinder group allocation policy). Instead of copying to an in core inode buffer, the on disk inode could be a page ref to the page containing the inode data, and a pointer. This would save all of the internal copies required for stat and other operations. Since multiple inodes could be in a device mapped page (as opposed to a strict vnode mapping), this could save a significant amount of I/O (16 disk inodes @ 128 bytes each per page). I'd like to keep the table dynamic in a modified slab basis: using a power of two allocation-ahead; this is open to discussion. John Dyson, in particular, has some interesting VM plans that would bear directly on how you'd want to do this. Clearly, if you had page mapping for the device for the on disk inode data, the allocated in core object would be the vnode, in core inode data (local FS state for an inode that is referenced), and a pointer to the page containing the disk inode data (with an implied page ref, and an implied limit of one page on the in core data -- you could overcome this by adding more page references in a table to the in core inode data and handling the inode dereference in the FS: you have to do that anyway, since the reference is implict, not explicit). A direct implication of this is that buffer reclaim for non-vnode unreferenced pages would need to be handled seperately; this is only a minor complication... you could do this by tracking number of items on a FS independent per device managed global LRU list vs. the number of items in the FS LRU's and establishing a high water mark for free pages so that the reclaim will occur on deallocation that pushes the LRU above the high water mark. Then you reclaim pages down to the low water mark (the page just freed, being below the low water mark, is left on the list to ensure locality). FS mechanics are one of the funnest things you can discuss. 8-). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. From owner-freebsd-fs Thu Aug 8 10:48:08 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA02098 for fs-outgoing; Thu, 8 Aug 1996 10:48:08 -0700 (PDT) Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA02079 for ; Thu, 8 Aug 1996 10:48:01 -0700 (PDT) Received: from minnow.render.com (minnow.render.com [193.195.178.1]) by minnow.render.com (8.6.12/8.6.9) with SMTP id SAA21945; Thu, 8 Aug 1996 18:47:45 +0100 Date: Thu, 8 Aug 1996 18:47:44 +0100 (BST) From: Doug Rabson To: Terry Lambert cc: michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org Subject: Re: NFS Diskless Dispare... 
In-Reply-To: <199608061728.KAA13564@phaeton.artisoft.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk On Tue, 6 Aug 1996, Terry Lambert wrote: > > [moved to freebsd-fs] > > > > On Mon, 5 Aug 1996, Terry Lambert wrote: > > > > > What I'm suggesting is that there needs to be both a VFS_VGET and > > > a VFS_VPUT (or VFS_VRELE). With the additional per fs release > > > mechanism, each FS instance can allocate an inode pool at its > > > instantiation (or do it on a per instance basis, the current > > > method which makes inode allocation so slow...). > > > > Not really sure how this would work for filesystems without a flat > > namespace? VFS_VGET is not supported for msdosfs, cd9660, nfs and > > probably others. > > > Conceptually, it's pretty tribial to support; it's not supported > because the stacking is not correctly implemented for these FS's. > Look at the /sys/miscfs/nullfs use of VOP_VGET. VFS_VGET is not implemented in NFS because the concept just doesn't apply. VFS_VGET is only relavent for local filesystems. NFS does have a flat namespace in terms of filehandles but not one which you could squeeze into the VFS_VGET interface. > > > Wait a minute. The VOP_LOCK is not there just for vclean to work. If you > > took it out, a lot of the VOPs in ufs would break due to unexpected > > reentry. The VOP_LOCK is there to ensure that operations which modify the > > vnode are properly sequenced even if the process has to sleep during the > > operation. > > That's why the vn_lock would be called. The VOP_LOCK is a transparent > veto/allow interface in that case, but that doesn't mean a counting > reference isn't held by PID (like it had to be). The actual Lite2 > routine for "actual lock" is called lockmgr() and lives in kern_lock.c > in the Lite2 sources. Lite2 already moves in this direction -- it just > hasn't gone far enough. > > > > > The vnode locking could then be done in common code: > > > > > > > > > vn_lock( vp, flags, p) > > > struct vnode *vp; > > > int flags; > > > struct proc *p; > > > { > > > /* actual lock*/ > > > if( ( st = ...) == SUCCESS) { > > > if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) { > > > /* lock was vetoed, undo actual lock*/ > > > ... > > > } > > > } > > > return( st); > > > } > > > > > > > > > The point here is that the lock contention (if any) can be resolved > > > without ever hitting the FS itsef in the failure case. > > > > > > > You can't do this for NFS. If you use exclusive locks in NFS and a > > server dies, you easily can end up holding onto a lock for the root vnode > > until the server reboots. To make it work for NFS, you would have to make > > the lock interruptable which forces you to fix code which does not check > > the error return from VOP_LOCK all over the place. > > This is one of the "flags" fields, and it only applies to the NFS client > code. Actually, since the NFSnode is not transiently destroyed as a > result of server reboot (statelessness *is* a win, no matter what the > RFS advocates would have you believe), there isn't a problem with holding > the reference. So the NFS code would degrade the exclusive lock back to a shared lock? Hmm. I don't think that would work since you can't get the exclusive lock until all the shared lockers release their locks. 
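A toy shared/exclusive lock makes the objection concrete: with any shared holders outstanding, an exclusive request cannot be granted, so degrading an exclusive lock and later re-upgrading it is not free. This is a deliberate simplification, not the Lite2 lockmgr.

/*
 * Toy shared/exclusive lock illustrating the point above: an exclusive
 * request cannot be granted while any shared holders remain.
 */
#include <stdio.h>

struct sxlock {
    int sharecount;     /* number of shared holders */
    int exclusive;      /* nonzero while exclusively held */
};

static int
sx_try_shared(struct sxlock *lk)
{
    if (lk->exclusive)
        return (0);
    lk->sharecount++;
    return (1);
}

static int
sx_try_exclusive(struct sxlock *lk)
{
    if (lk->exclusive || lk->sharecount > 0)
        return (0);             /* must wait for shared holders to drain */
    lk->exclusive = 1;
    return (1);
}

int
main(void)
{
    struct sxlock lk = { 0, 0 };

    sx_try_shared(&lk);
    printf("exclusive while shared held: %d\n", sx_try_exclusive(&lk));  /* 0 */
    lk.sharecount--;
    printf("exclusive after release: %d\n", sx_try_exclusive(&lk));      /* 1 */
    return (0);
}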
> > One of the things Sun recommends is not making the mounts on mount > points in the root directory; to avoid exactly this scenario (it really > doesn't matter in the diskless/dataless case, since you will hang on > swap or page-in from image-file-as-swap-store anyway). It doesn't matter if they are on mount points in root. If a lock is stuck in a sub-filesystem, then the 'sticking' can propagate across the mount point. > > The root does not need to be locked for the node lookup for the root > for a covering node in any case; this is an error in the "node x covers > node y" case in the lookup case. You can see that the lookup code > documents a race where it frees and relocks the parent node to avoid > exactly this scenario, actually. A lock does not need to be held > in the lookup for the parent in the NFS lookup case for the mount > point traversal. I believe this is an error in the current code. Have to think about this some more. Are you saying that when lookup is crossing a mountpoint, it does not need any locks in the parent filesystem? > > > The issue is more interesting in the client case; a reference is not > a lock, per se, it's an increment of the reference count. The server > holds the lock mid path traversal. > > This is resolved by setting the "interruptable" flag on the vn_lock > into the underlying FS on the server. > > > The easiest way to think of this is in terms of provider interfaces > and consumer interfaces. There are many FS provider interfaces. The > FS consumer interfaces are the syscall layer (the vfs_subr.c) and the > NFS client. This goes hand in hand with the discussion we had about ^^^^^^ Do you mean NFS server here? > the VOP_READDIR interface needing to be split into "get buffer/reference > buffer element" (remember the conversation about killing off the cookie > interface about a year ago?!?!). I remember that. I think I ended up agreeing with you about it. The details are a bit vague.. > [advlock digression ...] > > In the NFS client case, the VOP_LOCK and VOP_ADVLOCK are non-null. I > didn't show the sleep interface in the vn_lock in the case of the > failure. The sleep puts a loop around the "actual lock" code so a > sleep occurs above, at the higher code level. Intermediate locks > on per layer vnodes (if any are truly needed; see below) are > automatically wound and unwound for retry in the blocking case. > > > In the NFS case, the lock is asserted to the underlying FS, and the sleep > target is returned to the top of the loop by the FS layer where the > contention occurred (basically, a vnodep is returned in the != SUCCESS > case (SUCCESS == 0); this is used as the sleep target. > > If a lock in the NFS server code fails, and it fails for the UFS lock > case for the underlying FS, then it should sleep on the UFS vnode > being unlocked. > > The veto interface actually implies a couple of semantic changes; the > real implementation would probably be as a NULL lock entry to allow > the routine to not be called at all, saving the vnode_if parameter > list deconstruction/reconstruction. > > This allows the substitution of a chaining interface for a file system > stacking layer. > > Now you are probably asking "but how can this work when an intermediate > non-NULL layer fans out or in from multiple vnodes?". > > > The union FS case is one of the most interesting cases for this, since > what you want to do is conditionally assert a lock on two or more > underlying FS's, either of which could have NULL or non-NULL veto code. 
> The reason it is interesting is stack operand collapse in a stacking > instance. > > I could have the following simple case: > > > (syscalls or NFS or AFP or SMB or NetWare kernel server) > > consumer vn_lock > | ^ | ^ > v | v | > quota layer quota VOP_LOCK (NULL) > | ^ | ^ > v | v | > uid mapping layer uid VOP_LOCK (NULL) > | ^ | ^ > v | v | > FFS FFS VOP_LOCK (NULL) > > Really, you want to collapse NULL layer entries. But since the stack > could be reentered from the top, how can you do this without endangering > the locking of terminal nodes based on intermediate nodes? > > It turns out that the function collapse for the VOP_LOCK's in this > case is NULL; but say we replace FFS with the NFS client, where the > last layer is non-NULL? > > We would want to collapse to the NFS VOP_LOCK call, since the > intermediate chainings are NULL, but the terminal chaining is not. > Similar collapse could remove the uid mapping layer's VOP_LOOKUP, > leaving the quota VOP_LOOKUP (which has to be there to hide the > quota file and protect it) followed by the FFS VOP_LOOKUP. The > call-down chain is abbreviated. This is a general win in the veto > interface cases. The only place you are required to propagate is > the non-NULL cases, and the non-NULL case will only occur when a > fan-out or fan-in of vnodes occurs between layers. > > Currently collapse is not implemented. Part of the support for > collapse without full kernel recompilation on VOP addition was the > 0->1 FS instance count changes to the vfs_init.c code and the > addition of the structure sizing field in the vnode_if.c generation > in my big patch set (where the vnode_if.c generated had the structure > vfs_op_descs size computed in the vnode_if.c file. The change did > not simply allow the transition from 0->N loadable FS's (part of > the necessary work for discardable fallback drivers for the FS, > assuming kernel paging at some point in the future), and it did not > just allow you to add VFS OPS to the vnode_if without having to > recompile all FS modules and LKM's (it's stated intent). The change > also allows (with the inclusion of a structure sort, since the init > causes a structure copy anyway to get it into a stack instantiation) > the simplification of the vnode_if call to eliminate the intermediate > functioncall stub: a necessary step towards call graph collapse. You > want this so that if you have 10 FS layers in a stack, you only have > to call one or two veto functions out of the 10... and if they are > all NULL, the one is synthetic anyway. This is interesting. It is similar to the internal driver architecture we use in our 3D graphics system (was Reality Lab, now Microsoft's Direct3D). The driver is split up into different modules depending on functionality. The consumer (Direct3D) has a stack which it pushes driver modules onto for all the required functionality. This used to be useful for reconfiguring the stack at runtime to select different rendering algorithms etc. Direct3D broke that unfortunately but that is another story. It communicates with the drivers by sending service calls to the top driver in the stack. Each service call has a well defined number. If that module understands the service, it implements it and returns a result. Otherwise, it passes the service call down to the next driver in the stack. Some modules override service calls in lower layers and they typically do their own work and then pass the service onto the next layer in the stack. 
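The pass-down dispatch described here (before the service-table optimization covered next) can be modeled as a simple chain of modules keyed by call number. The names below are invented for the illustration and do not correspond to the Direct3D driver code.

/*
 * Sketch of a pass-down service-call chain: each module either handles
 * a numbered service call or forwards it to the module below it.
 */
#include <stdio.h>

struct module {
    const char *name;
    /* Return nonzero if handled; otherwise the call is passed down. */
    int (*service)(struct module *self, int callnum, void *arg);
    struct module *below;
};

static int
dispatch(struct module *top, int callnum, void *arg)
{
    struct module *m;

    for (m = top; m != NULL; m = m->below)
        if (m->service != NULL && m->service(m, callnum, arg))
            return (1);
    return (0);                 /* nobody in the stack implements it */
}

static int
quota_service(struct module *self, int callnum, void *arg)
{
    (void)arg;
    if (callnum != 7)
        return (0);             /* not ours; let it pass down */
    printf("%s handled call %d\n", self->name, callnum);
    return (1);
}

int
main(void)
{
    struct module ffs = { "ffs", NULL, NULL };
    struct module quota = { "quota", quota_service, &ffs };

    dispatch(&quota, 7, NULL);                  /* handled by quota layer */
    printf("call 9 handled: %d\n", dispatch(&quota, 9, NULL));
    return (0);
}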
To optimise the system, we added a service call table in the stack head. When a module is pushed onto the stack, it is called to 'bid' some of its services into the service call table. Each module in turn going up the stack puts a function pointer into the table for each of the services it wants to implement. If it is overriding a lower module, it just overwrites the pointer. If you add service calls, nothing needs to recompile (as long as the service call table is large enough) because the new services just go after the existing ones. > > > This is a big win in reducing the current code duplication, which you > want to do not only to reduce code size, but to make FS's more robust. > The common behaviours of FS's *should* be implemented in common code. Agreed. The worst candidates at the moment are VOP_LOOKUP, VOP_RENAME and the general locking/releasing protocol, IMHO. > > The Lite2 code recognizes this at the VOP_LOCK level in a primitive > fashion by introducing the lockmgr() call, but since the model is not > uniformly applied, and deadly-embrace or two caller starvation deadlocks > can still occur in the Lite2 model. Going to the next step, a veto > model, both increases the code robustness considerably, as well as > resolving the state wind/unwind problems inherent in fan out. The > fan out problem is *the* problem with the unionfs, at this point. Well at the moment, I think we have to just grit our teeth and merge in the lite2 code as it stands. We have to at least try to converge with the other strains of 4.4, if only to try and share the load of maintaining the filesystem code. I strongly believe that there should be a consensus between the different 4.4 groups over FS development or we just end up with chaos. -- Doug Rabson, Microsoft RenderMorphics Ltd. Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426 From owner-freebsd-fs Thu Aug 8 14:54:40 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id OAA16913 for fs-outgoing; Thu, 8 Aug 1996 14:54:40 -0700 (PDT) Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id OAA16885 for ; Thu, 8 Aug 1996 14:54:29 -0700 (PDT) Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id OAA17616; Thu, 8 Aug 1996 14:48:29 -0700 From: Terry Lambert Message-Id: <199608082148.OAA17616@phaeton.artisoft.com> Subject: Re: NFS Diskless Dispare... To: dfr@render.com (Doug Rabson) Date: Thu, 8 Aug 1996 14:48:28 -0700 (MST) Cc: terry@lambert.org, michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org In-Reply-To: from "Doug Rabson" at Aug 8, 96 06:47:44 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk > > Conceptually, it's pretty tribial to support; it's not supported > > because the stacking is not correctly implemented for these FS's. > > Look at the /sys/miscfs/nullfs use of VOP_VGET. > > VFS_VGET is not implemented in NFS because the concept just doesn't apply. > VFS_VGET is only relavent for local filesystems. NFS does have a flat > namespace in terms of filehandles but not one which you could squeeze into > the VFS_VGET interface. The flat name space is the nfsnodes, not the file handles. In the NFS case, you would simply *not* implement recovery without reallocation. 
The allocation time is small compared to the wire time, and the actions could be interleaved by assuming a success response, with an additional dealloc overhead for the failure case.
> > > You can't do this for NFS. If you use exclusive locks in NFS and a > > > server dies, you easily can end up holding onto a lock for the root vnode > > > until the server reboots. To make it work for NFS, you would have to make > > > the lock interruptable which forces you to fix code which does not check > > > the error return from VOP_LOCK all over the place. > > > > This is one of the "flags" fields, and it only applies to the NFS client > > code. Actually, since the NFSnode is not transiently destroyed as a > > result of server reboot (statelessness *is* a win, no matter what the > > RFS advocates would have you believe), there isn't a problem with holding > > the reference. > > So the NFS code would degrade the exclusive lock back to a shared lock? > Hmm. I don't think that would work since you can't get the exclusive lock > until all the shared lockers release their locks.
You would unhold the lock and set reassert pending availability from the rpc.mount negotiation succeeding. Do this by setting up a fake sleep address. The trade off is between blocking a process (which you will have to do anyway) and hanging the kernel. The locks are local. The only possible race condition is local stacking on top of the NFS on the client side. You can either not allow it, or you can accept the fact that someone might win the thundering herd race (in which case you just get delayed a bit), or you can FIFO the request list with an array and a request entrancy limit when the array is full, where you degrade to thundering herd to get into the FIFO list. It's unlikely that someone will be running hundreds of processes from an NFS server that crashes, and care who gets their page requests satisfied first. The delay from misordering is going to be *nothing* compared with the delay for a network resource which is unavailable long enough to have the request list fill up. I think it's a non-problem to unwind the state, and the collision avoidance is well worth the worst case being slightly degraded. Currently in BSD and SunOS, if the server can't satisfy a page request from one local process, it blocks and the whole system goes to hell. This way, only the processes which are relying on the unreliable resource go to hell. Even so, I still vote for flagging the NFS mount to force a copy to swap of any file being used as swap store from an unreliable server. It's a better long term solution anyway.
> > One of the things Sun recommends is not making the mounts on mount > > points in the root directory; to avoid exactly this scenario (it really > > doesn't matter in the diskless/dataless case, since you will hang on > > swap or page-in from image-file-as-swap-store anyway). > > It doesn't matter if they are on mount points in root. If a lock is stuck > in a sub-filesystem, then the 'sticking' can propagate across the mount > point.
Well, yes, I suppose. There are better ways to fix that; specifically, lock the node that is covered before you lock the covering node in a mount point traversal. The issue is resolved locally after the second process waiting for the node, without propagating up past the mount point. I'm more concerned with interaction between multiple mounts of CDROM's on a changer device. It's more likely, if you ask me.
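The bounded FIFO of lock waiters with a thundering-herd fallback mentioned above could be modeled as simply as this. It is purely an illustrative sketch with invented names: a fixed array queues waiters in order, and once it is full, further requesters are refused and must spin and retry.

/*
 * Sketch of a bounded FIFO of lock waiters with a retry fallback.
 */
#include <stdio.h>

#define WAITQ_SIZE 4

struct waitq {
    int slots[WAITQ_SIZE];      /* queued requester ids */
    int head, count;
};

static int
waitq_enter(struct waitq *wq, int id)
{
    if (wq->count == WAITQ_SIZE)
        return (0);             /* full: caller falls back to retrying */
    wq->slots[(wq->head + wq->count) % WAITQ_SIZE] = id;
    wq->count++;
    return (1);
}

static int
waitq_next(struct waitq *wq)
{
    int id;

    if (wq->count == 0)
        return (-1);
    id = wq->slots[wq->head];
    wq->head = (wq->head + 1) % WAITQ_SIZE;
    wq->count--;
    return (id);
}

int
main(void)
{
    struct waitq wq = { {0}, 0, 0 };
    int i;

    for (i = 1; i <= 6; i++)
        if (!waitq_enter(&wq, i))
            printf("requester %d must spin and retry\n", i);
    printf("first queued requester to run: %d\n", waitq_next(&wq));
    return (0);
}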
Nevertheless, if you are running something off an NFS server, and it can't run, then it can't run. FreeBSD is no less graceful about that than any commercial OS. > > The root does not need to be locked for the node lookup for the root > > for a covering node in any case; this is an error in the "node x covers > > node y" case in the lookup case. You can see that the lookup code > > documents a race where it frees and relocks the parent node to avoid > > exactly this scenario, actually. A lock does not need to be held > > in the lookup for the parent in the NFS lookup case for the mount > > point traversal. I believe this is an error in the current code. > > Have to think about this some more. Are you saying that when lookup is > crossing a mountpoint, it does not need any locks in the parent > filesystem? It needs locks on the covered node, but it does not need to propagate the collision to root. The only case this fails is when / is NFS mounted and the server goes down. You have worse problems at that point, and hanging for the server to come back up is most likely the right thing to do in that case anyway. > > The easiest way to think of this is in terms of provider interfaces > > and consumer interfaces. There are many FS provider interfaces. The > > FS consumer interfaces are the syscall layer (the vfs_subr.c) and the > > NFS client. This goes hand in hand with the discussion we had about > ^^^^^^ > Do you mean NFS server here? Yes, thanks; sorry about that. > > the VOP_READDIR interface needing to be split into "get buffer/reference > > buffer element" (remember the conversation about killing off the cookie > > interface about a year ago?!?!). > > I remember that. I think I ended up agreeing with you about it. The > details are a bit vague.. I saved them; I can forward them if need be. The details were vague because I wanted an interface that let me tell it what I wanted back, but a struct direct only return would be acceptable for an interim implementation. That's one that could be broken up without too much trouble. [ ... ] > > Currently collapse is not implemented. Part of the support for > > collapse without full kernel recompilation on VOP addition was the > > 0->1 FS instance count changes to the vfs_init.c code and the > > addition of the structure sizing field in the vnode_if.c generation > > in my big patch set (where the vnode_if.c generated had the structure > > vfs_op_descs size computed in the vnode_if.c file. The change did > > not simply allow the transition from 0->N loadable FS's (part of > > the necessary work for discardable fallback drivers for the FS, > > assuming kernel paging at some point in the future), and it did not > > just allow you to add VFS OPS to the vnode_if without having to > > recompile all FS modules and LKM's (it's stated intent). The change > > also allows (with the inclusion of a structure sort, since the init > > causes a structure copy anyway to get it into a stack instantiation) > > the simplification of the vnode_if call to eliminate the intermediate > > functioncall stub: a necessary step towards call graph collapse. You > > want this so that if you have 10 FS layers in a stack, you only have > > to call one or two veto functions out of the 10... and if they are > > all NULL, the one is synthetic anyway. > > This is interesting. It is similar to the internal driver architecture we > use in our 3D graphics system (was Reality Lab, now Microsoft's Direct3D). > The driver is split up into different modules depending on functionality. 
> The consumer (Direct3D) has a stack which it pushes driver modules onto > for all the required functionality. This used to be useful for > reconfiguring the stack at runtime to select different rendering > algorithms etc. Direct3D broke that unfortunately but that is another > story.
I cheated for this; the two "competing" vnode stacking architectures are Heidemann's (the one we are using) and Rosenthal's (which lost out). Rosenthal alludes to stack collapse in his Usenix paper on "A file system stacking architecture". The general problem with Rosenthal's support is the same problem Novell was having in their "Advanced File System Design": personal views. A personal view allows the FS to have a canonical form, and each user can choose his view on the FS. The problem with this is the same problem Windows95 has now with desktop themes: support is impossible. Imagine the user who is told to "drag that icon to the wastebasket to fix the problem"... he may have a beaker of acid, or a black hole or a trash compactor, or whatever... there are no Schelling points in common that the user and the technical support person agree on so that they can communicate effectively. You can steal the personal view idea of a canonical form for a directory structure by specifying a canonicalization name space for files regardless of their name in the real name space. This was the basis of some of my internationalization work about two years ago (the numeric name space suggestion that allowed you to rename system critical files like /etc/passwd to Japanese equivalents and have NIS and login keep working). Rosenthal needed stack collapse to reduce the memory requirements per view instance so he could have views at all.
> It communicates with the drivers by sending service calls to the top > driver in the stack. Each service call has a well defined number. If > that module understands the service, it implements it and returns a > result. Otherwise, it passes the service call down to the next driver in > the stack. Some modules override service calls in lower layers and they > typically do their own work and then pass the service onto the next layer > in the stack.
Yes. This is exactly how the Heidemann thesis wants the VFS stacking to work. It fails because of the way the integration into the Lite code occurred in a rush as a result of the USL lawsuit and settlement. Specifically, there's no concept of adding a new VFS OP without rebuilding FFS (which is used to get the max number of VFS OPs allowed, in the current FreeBSD/NetBSD code).
> To optimise the system, we added a service call table in the stack head. > When a module is pushed onto the stack, it is called to 'bid' some of its > services into the service call table. Each module in turn going up the > stack puts a function pointer into the table for each of the services it > wants to implement. If it is overriding a lower module, it just > overwrites the pointer.
This is not quite the same inheritance model. Basically, you still need to be able to call each inferior layer. Consider the unionfs that unions two NFS mounts. Any fan in/fan out layer must be non-null.
> If you add service calls, nothing needs to recompile (as long as the > service call table is large enough) because the new services just go after > the existing ones.
Yes. The vnode_if.c structure sizing is how I ensured the table was large enough: that's where the table is defined, so it should be where the size is defined.
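The "size is defined where the table is defined" point can be illustrated with an operation table that carries its own element count plus spare slots, so that registering a new operation does not require recompiling the table's consumers. A hypothetical sketch, not the actual vnode_if generator:

/*
 * The table's count derives from the table itself, and spare slots
 * leave room for dynamically registered operations.  Names invented.
 */
#include <stdio.h>

struct op_desc {
    const char *opname;
    int (*opfunc)(void *arg);
};

#define SPARE_OPS 8

static int default_lookup(void *arg) { (void)arg; return (0); }
static int default_lock(void *arg)   { (void)arg; return (0); }

static struct op_desc op_table[ /* known ops */ 2 + SPARE_OPS ] = {
    { "lookup", default_lookup },
    { "lock",   default_lock },
};

/* Count derives from the table itself, not from some FS's private copy. */
static const int op_table_size = (int)(sizeof(op_table) / sizeof(op_table[0]));

static int
register_op(const char *name, int (*func)(void *))
{
    int i;

    for (i = 0; i < op_table_size; i++)
        if (op_table[i].opname == NULL) {
            op_table[i].opname = name;
            op_table[i].opfunc = func;
            return (i);         /* new op number */
        }
    return (-1);                /* table full: would need a rebuild */
}

int
main(void)
{
    printf("table has %d slots, %d spare\n", op_table_size, SPARE_OPS);
    printf("new op registered at slot %d\n",
        register_op("readdirents", default_lookup));
    return (0);
}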
The use of the FFS table to do the sizing in the init was what broke the ability to add service calls dynamically in the Heidemann code as integrated into 4.4Lite.
> > This is a big win in reducing the current code duplication, which you > > want to do not only to reduce code size, but to make FS's more robust. > > The common behaviours of FS's *should* be implemented in common code. > > Agreed. The worst candidates at the moment are VOP_LOOKUP, VOP_RENAME and > the general locking/releasing protocol, IMHO.
The lookup is more difficult because of the way directory management is dependent on file management. It's not possible to remove the FFS directory management code and replace it with an override with the current code arrangement. Specifically, I can't replace the FFS directory structure code with a btree (for instance). The lookup path buffer deallocation patches, which pushed the deallocation up into the consumer interface where the allocation took place, were a move toward severability of the directory interface. They had a side effect of moving toward the ability to support multiple name spaces of FS's that require it (VFAT/NTFS/UMSDOS/NETWARE/HFS), and of abstracting the component representation type (for Unicode support and more internationalization). This doesn't resolve the separability problem of the directory code, but it goes a long way toward freeing up the dependencies to allow incremental changes. I seriously dislike the relookup for the rename code, and think that it needs to be rethought. But separability was a necessary first step.
> > The Lite2 code recognizes this at the VOP_LOCK level in a primitive > > fashion by introducing the lockmgr() call, but since the model is not > > uniformly applied, and deadly-embrace or two caller starvation deadlocks > > can still occur in the Lite2 model. Going to the next step, a veto > > model, both increases the code robustness considerably, as well as > > resolving the state wind/unwind problems inherent in fan out. The > > fan out problem is *the* problem with the unionfs, at this point. > > Well at the moment, I think we have to just grit our teeth and merge in > the lite2 code as it stands. We have to at least try to converge with the > other strains of 4.4, if only to try and share the load of maintaining the > filesystem code. I strongly believe that there should be a consensus > between the different 4.4 groups over FS development or we just end up > with chaos.
The Lite 2 code merge *needs* to take place. I need to spend more time on it now that I'm good for work + 1 hour or so a day of sitting. I'll subscribe to that list pretty soon now. As to maintenance and design... well, I think we have a problem no matter what we do. The Heidemann thesis, and the other FICUS documents, are *the* design documents, IMO. The problem is that the current code in the 4.4 camps does not conform to the design documents. I think that no matter what, that needs to be corrected. Then there are issues of kludges for the interface design, or for missing technology pieces that simply have not been considered in the 4.4 code. The biggest kludge is that there is no documented bottom-end interface. We already have an unresolvable discrepancy because of VM differences. The second biggest kludge is the workaround for the directory structure size differences... the origin of the "cookie" crap in the VOP_READDIR interface. NetBSD and FreeBSD solved this problem in a bad way, and are in fact not interoperable at this point because of that.
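For reference, the "get buffer / reference buffer element" split of VOP_READDIR mentioned earlier in the thread might look roughly like this two-call iterator. The interface names are invented and this is not the proposed kernel API; the point is only that the consumer steps through an opaque block and never needs per-entry cookies.

/*
 * Rough model of a split directory-read interface: one call fills a
 * block of entries, a second call steps through them.
 */
#include <stdio.h>
#include <string.h>

struct dirblock {
    char data[512];
    int nbytes;
    int offset;             /* iteration cursor within data */
};

struct dirent_model {
    int namelen;
    char name[32];
};

/* "Get buffer": fill a block of entries starting at a directory offset. */
static int
dir_getblock(const struct dirent_model *entries, int nentries,
    int start, struct dirblock *db)
{
    int i, used = 0;

    for (i = start; i < nentries; i++) {
        if (used + (int)sizeof(entries[i]) > (int)sizeof(db->data))
            break;
        memcpy(db->data + used, &entries[i], sizeof(entries[i]));
        used += sizeof(entries[i]);
    }
    db->nbytes = used;
    db->offset = 0;
    return (i);             /* next start offset for the following call */
}

/* "Reference buffer element": step to the next entry in the block. */
static const struct dirent_model *
dir_nextent(struct dirblock *db)
{
    static struct dirent_model ent;

    if (db->offset + (int)sizeof(ent) > db->nbytes)
        return (NULL);
    memcpy(&ent, db->data + db->offset, sizeof(ent));
    db->offset += sizeof(ent);
    return (&ent);
}

int
main(void)
{
    struct dirent_model dir[] = { { 1, "." }, { 2, ".." }, { 5, "hello" } };
    struct dirblock db;
    const struct dirent_model *dp;

    dir_getblock(dir, 3, 0, &db);
    while ((dp = dir_nextent(&db)) != NULL)
        printf("%s\n", dp->name);
    return (0);
}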
Finally, there's the fact that 4.4 as shipped didn't support kernel module loading of any kind, and so there was no effort to limit the recompilation necessary for adding VOP's in the default vfs_init, the vfs_fs_init, or in the vnode_if method generation code. Short of starting up CSRG again, I don't see a common source for solving the Lite/Lite2 VFS problems. Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.