From owner-freebsd-fs Mon Aug 5 19:00:17 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id TAA08217 for fs-outgoing; Mon, 5 Aug 1996 19:00:17 -0700 (PDT) Received: from parkplace.cet.co.jp (parkplace.cet.co.jp [202.32.64.1]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id TAA08209 for ; Mon, 5 Aug 1996 19:00:11 -0700 (PDT) Received: from localhost (michaelh@localhost) by parkplace.cet.co.jp (8.7.5/CET-v2.1) with SMTP id BAA24468; Tue, 6 Aug 1996 01:59:09 GMT Date: Tue, 6 Aug 1996 10:59:09 +0900 (JST) From: Michael Hancock Reply-To: Michael Hancock To: Terry Lambert cc: dfr@render.com, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org Subject: Per fs vnode pools (was Re: NFS Diskless Dispare...) In-Reply-To: <199608051859.LAA11723@phaeton.artisoft.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk [Moved to fs from current] Thanks. Previously, I didn't see how an inactive nfs vnode would be reclaimed and moved to a ffs vnode pool, cleanly. The generic interfaces takes care of all this cleanly. It looks like a win in terms of performance and new fs development ease at the expense of a little space. Regards, Mike Hancock On Mon, 5 Aug 1996, Terry Lambert wrote: > > I think what he's is saying is that when the vnodes are in the global pool > > the chances of reusing a vnode that was used previously by a particular fs > > is less than having a per fs vnode pool. > > No, it's not. > > > The problem with the per fs vnode pool is the management overhead. When > > you need to start reusing vnodes you need to search through all the > > different fs pools to find a vnode. > > > > I don't know which is a better trade-off. > > This isn't how per FS vnode pools should work. > > When you want a vnode, you call the generic "getnewvnode()" from the > XXX_vget routine via VFS_VGET (sys/mount.h). > > This function returns a vnode with an FS specific inode. > > In reality, you never care to have a vnode without an FS specific inode, > since there is no way to access or write buffers hung off the critter > because of the way vclean works. > > > What I'm suggesting is that there needs to be both a VFS_VGET and > a VFS_VPUT (or VFS_VRELE). With the additional per fs release > mechanism, each FS instance can allocate an inode pool at its > instantiation (or do it on a per instance basis, the current > method which makes inode allocation so slow...). > > Consider UFS: the in core inode struct consists of a bunch of in core > data elements (which should probably be in their own structure) and > a "struct dinode i_din" for the on disk inode. > > You could modify this as: > > struct inode { > struct icinode i_ic; /* in core inode*/ > struct vnode i_iv; /* vnode for inode*/ > struct dinode i_din; /* on disk inode*/ > }; > > > Essentially, allocation of an inode would allocate a vnode. There > would never be an inode without a vnode. > > > The VFS_VPUT would put the vnode into a pool maintained by the > FS per fs instance (the in core fs structure would need an > additional structure element to point to the maintenance data). > > The FS itself would use generic maintenance routines shared by > all FS's... and capable of taking a structure size for i_ic and > i_din element size variations between FS types. This would > maintain all common code in the common interface. 
> > > The use of the vget to associate naked vnodes with the FS's would > go away; in no case is a naked vnode ever useful, since using vnode > buffer elements requires an FS context. > > > In effect, the ihash would become a vnhash and LRU for use in > reclaiming vnode/inode pairs. This would be much more efficient > than the current dual allocation sequence. > > > This would allow the discard of the vclean interface, and of the > lock used to ensure it operates (a lock which has to be reimplemented > and reimplemented correctly on a per FS basis in the XXX_LOCK and > XXX_UNLOCK FS specific routines). > > > The vnode locking could then be done in common code: > > > vn_lock( vp, flags, p) > struct vnode *vp; > int flags; > struct proc *p; > { > /* actual lock*/ > if( ( st = ...) == SUCCESS) { > if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) { > /* lock was vetoed, undo actual lock*/ > ... > } > } > return( st); > } > > > The point here is that the lock contention (if any) can be resolved > without ever hitting the FS itsef in the failure case. > > > > The generic case of the per FS lock is now: > > > int > XXX_lock(ap) > struct vop_lock_args /* { > struct vnode *a_vp; > int a_flags; > struct proc *a_p; > } */ *ap; > { > return( SUCCESS); > } > > > This is much harder to screw up when writing a new FS, and makes for much > smaller intermediate layers. > > > For NFS and unions, there isn't an i_din... but they also require data > hung off the vnode, so the same allocation rules apply. It's a win > either way, and has the side benefit of unmunging the vn. > > > I believe that John Heidemann's thesis had this in mind when it refers > to using an RPC layer to use remote file system layers as intermediates > in a local VFS stack. > > > Terry Lambert > terry@lambert.org > --- > Any opinions in this posting are my own and not those of my present > or previous employers. > -- michaelh@cet.co.jp http://www.cet.co.jp CET Inc., Daiichi Kasuya BLDG 8F 2-5-12, Higashi Shinbashi, Minato-ku, Tokyo 105 Japan Tel: +81-3-3437-1761 Fax: +81-3-3437-1766 From owner-freebsd-fs Tue Aug 6 08:50:45 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id IAA25900 for fs-outgoing; Tue, 6 Aug 1996 08:50:45 -0700 (PDT) Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id IAA25887 for ; Tue, 6 Aug 1996 08:50:39 -0700 (PDT) Received: from minnow.render.com (minnow.render.com [193.195.178.1]) by minnow.render.com (8.6.12/8.6.9) with SMTP id QAA15586; Tue, 6 Aug 1996 16:50:34 +0100 Date: Tue, 6 Aug 1996 16:50:33 +0100 (BST) From: Doug Rabson To: Terry Lambert cc: Michael Hancock , jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org Subject: Re: NFS Diskless Dispare... In-Reply-To: <199608051859.LAA11723@phaeton.artisoft.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk [moved to freebsd-fs] On Mon, 5 Aug 1996, Terry Lambert wrote: > What I'm suggesting is that there needs to be both a VFS_VGET and > a VFS_VPUT (or VFS_VRELE). With the additional per fs release > mechanism, each FS instance can allocate an inode pool at its > instantiation (or do it on a per instance basis, the current > method which makes inode allocation so slow...). Not really sure how this would work for filesystems without a flat namespace? VFS_VGET is not supported for msdosfs, cd9660, nfs and probably others. 
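For concreteness, the per-FS pool plus VFS_VGET/VFS_VPUT pairing Terry describes above can be modeled in a few lines of userland C. This is only a sketch under that description; the names (fspool, fspool_get, fspool_put, vnode_hdr) are invented for the illustration and are not a kernel interface.

/*
 * Minimal userland model of a per-filesystem vnode/inode pool, as a
 * sketch of the VFS_VGET/VFS_VPUT idea discussed above.  All names
 * here are hypothetical; this is not the kernel interface.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct vnode_hdr {              /* stand-in for struct vnode */
    struct vnode_hdr *v_freelist;
    int v_usecount;
};

struct fspool {                 /* one pool per mounted filesystem */
    size_t fp_objsize;          /* vnode hdr + in core + on disk inode */
    struct vnode_hdr *fp_free;  /* released, reusable vnode/inode pairs */
};

static void
fspool_init(struct fspool *fp, size_t fs_private_size)
{
    fp->fp_objsize = sizeof(struct vnode_hdr) + fs_private_size;
    fp->fp_free = NULL;
}

/* VFS_VGET analogue: return a vnode with its FS-specific inode attached. */
static struct vnode_hdr *
fspool_get(struct fspool *fp)
{
    struct vnode_hdr *vp;

    if ((vp = fp->fp_free) != NULL)
        fp->fp_free = vp->v_freelist;       /* reuse from this FS's pool */
    else if ((vp = malloc(fp->fp_objsize)) == NULL)
        return (NULL);
    memset(vp, 0, fp->fp_objsize);
    vp->v_usecount = 1;
    return (vp);
}

/* VFS_VPUT analogue: return the pair to the owning filesystem's pool. */
static void
fspool_put(struct fspool *fp, struct vnode_hdr *vp)
{
    vp->v_freelist = fp->fp_free;
    fp->fp_free = vp;
}

int
main(void)
{
    struct fspool ffs_pool;
    struct vnode_hdr *vp;

    fspool_init(&ffs_pool, 256);    /* pretend 256 bytes of UFS inode state */
    vp = fspool_get(&ffs_pool);
    printf("got vnode/inode pair %p\n", (void *)vp);
    fspool_put(&ffs_pool, vp);
    return (0);
}

The point of passing the object size at pool creation is that one generic allocator can serve UFS, NFS, or any other FS whose private per-node data differs in size, which is what keeps the maintenance code common.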
> > Consider UFS: the in core inode struct consists of a bunch of in core > data elements (which should probably be in their own structure) and > a "struct dinode i_din" for the on disk inode. > > You could modify this as: > > struct inode { > struct icinode i_ic; /* in core inode*/ > struct vnode i_iv; /* vnode for inode*/ > struct dinode i_din; /* on disk inode*/ > }; > > > Essentially, allocation of an inode would allocate a vnode. There > would never be an inode without a vnode. > > > The VFS_VPUT would put the vnode into a pool maintained by the > FS per fs instance (the in core fs structure would need an > additional structure element to point to the maintenance data). > > The FS itself would use generic maintenance routines shared by > all FS's... and capable of taking a structure size for i_ic and > i_din element size variations between FS types. This would > maintain all common code in the common interface. > > > The use of the vget to associate naked vnodes with the FS's would > go away; in no case is a naked vnode ever useful, since using vnode > buffer elements requires an FS context. > > > In effect, the ihash would become a vnhash and LRU for use in > reclaiming vnode/inode pairs. This would be much more efficient > than the current dual allocation sequence. > > > This would allow the discard of the vclean interface, and of the > lock used to ensure it operates (a lock which has to be reimplemented > and reimplemented correctly on a per FS basis in the XXX_LOCK and > XXX_UNLOCK FS specific routines). Wait a minute. The VOP_LOCK is not there just for vclean to work. If you took it out, a lot of the VOPs in ufs would break due to unexpected reentry. The VOP_LOCK is there to ensure that operations which modify the vnode are properly sequenced even if the process has to sleep during the operation. > > > The vnode locking could then be done in common code: > > > vn_lock( vp, flags, p) > struct vnode *vp; > int flags; > struct proc *p; > { > /* actual lock*/ > if( ( st = ...) == SUCCESS) { > if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) { > /* lock was vetoed, undo actual lock*/ > ... > } > } > return( st); > } > > > The point here is that the lock contention (if any) can be resolved > without ever hitting the FS itsef in the failure case. > You can't do this for NFS. If you use exclusive locks in NFS and a server dies, you easily can end up holding onto a lock for the root vnode until the server reboots. To make it work for NFS, you would have to make the lock interruptable which forces you to fix code which does not check the error return from VOP_LOCK all over the place. I hope we are not talking at cross purposes. We are talking about the vnode lock, not the advisory record locking aren't we? -- Doug Rabson, Microsoft RenderMorphics Ltd. Mail: dfr@render.com Phone: +44 171 251 4411 FAX: +44 171 251 0939 From owner-freebsd-fs Tue Aug 6 10:32:24 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA02217 for fs-outgoing; Tue, 6 Aug 1996 10:32:24 -0700 (PDT) Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA02212 for ; Tue, 6 Aug 1996 10:32:22 -0700 (PDT) Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id KAA13564; Tue, 6 Aug 1996 10:28:47 -0700 From: Terry Lambert Message-Id: <199608061728.KAA13564@phaeton.artisoft.com> Subject: Re: NFS Diskless Dispare... 
To: dfr@render.com (Doug Rabson) Date: Tue, 6 Aug 1996 10:28:47 -0700 (MST) Cc: terry@lambert.org, michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org In-Reply-To: from "Doug Rabson" at Aug 6, 96 04:50:33 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk > [moved to freebsd-fs] > > On Mon, 5 Aug 1996, Terry Lambert wrote: > > > What I'm suggesting is that there needs to be both a VFS_VGET and > > a VFS_VPUT (or VFS_VRELE). With the additional per fs release > > mechanism, each FS instance can allocate an inode pool at its > > instantiation (or do it on a per instance basis, the current > > method which makes inode allocation so slow...). > > Not really sure how this would work for filesystems without a flat > namespace? VFS_VGET is not supported for msdosfs, cd9660, nfs and > probably others. Conceptually, it's pretty tribial to support; it's not supported because the stacking is not correctly implemented for these FS's. Look at the /sys/miscfs/nullfs use of VOP_VGET. > Wait a minute. The VOP_LOCK is not there just for vclean to work. If you > took it out, a lot of the VOPs in ufs would break due to unexpected > reentry. The VOP_LOCK is there to ensure that operations which modify the > vnode are properly sequenced even if the process has to sleep during the > operation. That's why the vn_lock would be called. The VOP_LOCK is a transparent veto/allow interface in that case, but that doesn't mean a counting reference isn't held by PID (like it had to be). The actual Lite2 routine for "actual lock" is called lockmgr() and lives in kern_lock.c in the Lite2 sources. Lite2 already moves in this direction -- it just hasn't gone far enough. > > The vnode locking could then be done in common code: > > > > > > vn_lock( vp, flags, p) > > struct vnode *vp; > > int flags; > > struct proc *p; > > { > > /* actual lock*/ > > if( ( st = ...) == SUCCESS) { > > if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) { > > /* lock was vetoed, undo actual lock*/ > > ... > > } > > } > > return( st); > > } > > > > > > The point here is that the lock contention (if any) can be resolved > > without ever hitting the FS itsef in the failure case. > > > > You can't do this for NFS. If you use exclusive locks in NFS and a > server dies, you easily can end up holding onto a lock for the root vnode > until the server reboots. To make it work for NFS, you would have to make > the lock interruptable which forces you to fix code which does not check > the error return from VOP_LOCK all over the place. This is one of the "flags" fields, and it only applies to the NFS client code. Actually, since the NFSnode is not transiently destroyed as a result of server reboot (statelessness *is* a win, no matter what the RFS advocates would have you believe), there isn't a problem with holding the reference. One of the things Sun recommends is not making the mounts on mount points in the root directory; to avoid exactly this scenario (it really doesn't matter in the diskless/dataless case, since you will hang on swap or page-in from image-file-as-swap-store anyway). The root does not need to be locked for the node lookup for the root for a covering node in any case; this is an error in the "node x covers node y" case in the lookup case. You can see that the lookup code documents a race where it frees and relocks the parent node to avoid exactly this scenario, actually. 
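A minimal, compilable model of the veto-style vn_lock sketched earlier in this message, assuming lockmgr is reduced to a flag test: the common layer takes the actual lock, and the per-FS hook can only veto it. All names below (vnode_model, lockmgr_model, vn_lock_model, xxx_lock_model) are placeholders for the illustration, not the Lite2 code.

/*
 * Userland sketch of the veto-style vn_lock described above: the common
 * code takes the actual lock, then offers the FS a chance to veto.  The
 * lock is modeled with a plain flag; "lockmgr" here is only a stand-in
 * for the Lite2 routine of that name.
 */
#include <stdio.h>

#define SUCCESS 0

struct vnode_model {
    int v_locked;                               /* the "actual lock" */
    int (*v_op_lock)(struct vnode_model *);     /* per-FS veto hook, may be NULL */
};

static int
lockmgr_model(struct vnode_model *vp)
{
    if (vp->v_locked)
        return (1);             /* would sleep and retry in the real thing */
    vp->v_locked = 1;
    return (SUCCESS);
}

static int
vn_lock_model(struct vnode_model *vp)
{
    int st;

    if ((st = lockmgr_model(vp)) == SUCCESS) {
        /* Only now consult the FS; a NULL hook means "always allow". */
        if (vp->v_op_lock != NULL &&
            (st = vp->v_op_lock(vp)) != SUCCESS)
            vp->v_locked = 0;   /* lock was vetoed, undo actual lock */
    }
    return (st);
}

/* The generic per-FS case collapses to an always-allow veto. */
static int
xxx_lock_model(struct vnode_model *vp)
{
    (void)vp;
    return (SUCCESS);
}

int
main(void)
{
    struct vnode_model vn = { 0, xxx_lock_model };

    printf("first lock: %d\n", vn_lock_model(&vn));   /* 0 = SUCCESS */
    printf("second lock: %d\n", vn_lock_model(&vn));  /* non-zero: already held */
    return (0);
}

Note that a NULL hook lets the common code skip the FS call entirely, which is what the stack-collapse argument later in the thread relies on.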
A lock does not need to be held in the lookup for the parent in the NFS lookup case for the mount point traversal. I believe this is an error in the current code. The issue is more interesting in the client case; a reference is not a lock, per se, it's an increment of the reference count. The server holds the lock mid path traversal. This is resolved by setting the "interruptable" flag on the vn_lock into the underlying FS on the server. The easiest way to think of this is in terms of provider interfaces and consumer interfaces. There are many FS provider interfaces. The FS consumer interfaces are the syscall layer (the vfs_subr.c) and the NFS client. This goes hand in hand with the discussion we had about the VOP_READDIR interface needing to be split into "get buffer/reference buffer element" (remember the conversation about killing off the cookie interface about a year ago?!?!). > I hope we are not talking at cross purposes. We are talking about the > vnode lock, not the advisory record locking aren't we? Yes. The VOP_ADVLOCK is also (ideally) a veto interface. This allows lock contention from several processes on the same client to be resolved locally without hitting the wire, and gives a one client pseudo-flock that works without fully implementing the NFS locking code. This is really irrelevant to the VOP_LOCK code, which deals with asserting the lock only in the exception cases. In the NFS client case, the VOP_LOCK and VOP_ADVLOCK are non-null. I didn't show the sleep interface in the vn_lock in the case of the failure. The sleep puts a loop around the "actual lock" code so a sleep occurs above, at the higher code level. Intermediate locks on per layer vnodes (if any are truly needed; see below) are automatically wound and unwound for retry in the blocking case. In the NFS case, the lock is asserted to the underlying FS, and the sleep target is returned to the top of the loop by the FS layer where the contention occurred (basically, a vnodep is returned in the != SUCCESS case (SUCCESS == 0); this is used as the sleep target. If a lock in the NFS server code fails, and it fails for the UFS lock case for the underlying FS, then it should sleep on the UFS vnode being unlocked. The veto interface actually implies a couple of semantic changes; the real implementation would probably be as a NULL lock entry to allow the routine to not be called at all, saving the vnode_if parameter list deconstruction/reconstruction. This allows the substitution of a chaining interface for a file system stacking layer. Now you are probably asking "but how can this work when an intermediate non-NULL layer fans out or in from multiple vnodes?". The union FS case is one of the most interesting cases for this, since what you want to do is conditionally assert a lock on two or more underlying FS's, either of which could have NULL or non-NULL veto code. The reason it is interesting is stack operand collapse in a stacking instance. I could have the following simple case: (syscalls or NFS or AFP or SMB or NetWare kernel server) consumer vn_lock | ^ | ^ v | v | quota layer quota VOP_LOCK (NULL) | ^ | ^ v | v | uid mapping layer uid VOP_LOCK (NULL) | ^ | ^ v | v | FFS FFS VOP_LOCK (NULL) Really, you want to collapse NULL layer entries. But since the stack could be reentered from the top, how can you do this without endangering the locking of terminal nodes based on intermediate nodes? 
It turns out that the function collapse for the VOP_LOCK's in this case is NULL; but say we replace FFS with the NFS client, where the last layer is non-NULL? We would want to collapse to the NFS VOP_LOCK call, since the intermediate chainings are NULL, but the terminal chaining is not. Similar collapse could remove the uid mapping layer's VOP_LOOKUP, leaving the quota VOP_LOOKUP (which has to be there to hide the quota file and protect it) followed by the FFS VOP_LOOKUP. The call-down chain is abbreviated. This is a general win in the veto interface cases. The only place you are required to propagate is the non-NULL cases, and the non-NULL case will only occur when a fan-out or fan-in of vnodes occurs between layers. Currently collapse is not implemented. Part of the support for collapse without full kernel recompilation on VOP addition was the 0->1 FS instance count changes to the vfs_init.c code and the addition of the structure sizing field in the vnode_if.c generation in my big patch set (where the vnode_if.c generated had the structure vfs_op_descs size computed in the vnode_if.c file. The change did not simply allow the transition from 0->N loadable FS's (part of the necessary work for discardable fallback drivers for the FS, assuming kernel paging at some point in the future), and it did not just allow you to add VFS OPS to the vnode_if without having to recompile all FS modules and LKM's (it's stated intent). The change also allows (with the inclusion of a structure sort, since the init causes a structure copy anyway to get it into a stack instantiation) the simplification of the vnode_if call to eliminate the intermediate functioncall stub: a necessary step towards call graph collapse. You want this so that if you have 10 FS layers in a stack, you only have to call one or two veto functions out of the 10... and if they are all NULL, the one is synthetic anyway. This is a big win in reducing the current code duplication, which you want to do not only to reduce code size, but to make FS's more robust. The common behaviours of FS's *should* be implemented in common code. The Lite2 code recognizes this at the VOP_LOCK level in a primitive fashion by introducing the lockmgr() call, but since the model is not uniformly applied, and deadly-embrace or two caller starvation deadlocks can still occur in the Lite2 model. Going to the next step, a veto model, both increases the code robustness considerably, as well as resolving the state wind/unwind problems inherent in fan out. The fan out problem is *the* problem with the unionfs, at this point. Regards, Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. From owner-freebsd-fs Wed Aug 7 22:20:02 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id WAA08289 for fs-outgoing; Wed, 7 Aug 1996 22:20:02 -0700 (PDT) Received: from parkplace.cet.co.jp (parkplace.cet.co.jp [202.32.64.1]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id WAA08249 for ; Wed, 7 Aug 1996 22:19:58 -0700 (PDT) Received: from localhost (michaelh@localhost) by parkplace.cet.co.jp (8.7.5/CET-v2.1) with SMTP id FAA11948; Thu, 8 Aug 1996 05:19:30 GMT Date: Thu, 8 Aug 1996 14:19:30 +0900 (JST) From: Michael Hancock Reply-To: Michael Hancock To: Terry Lambert cc: dfr@render.com, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org Subject: Re: Per fs vnode pools (was Re: NFS Diskless Dispare...) 
In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk On Tue, 6 Aug 1996, I wrote: > > In effect, the ihash would become a vnhash and LRU for use in > > reclaiming vnode/inode pairs. This would be much more efficient > > than the current dual allocation sequence. Would you want this to be LRU vnodes with no buffer pages first? The buffer cache is being reclaimed, with some kind of algorithm, independent of the vnodes. You want to keep the vnodes with data still hanging off of them in the fs pool longer. BTW, is the incore inode table fixed or dynamic? Mike Hancock From owner-freebsd-fs Thu Aug 8 10:41:07 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA01046 for fs-outgoing; Thu, 8 Aug 1996 10:41:07 -0700 (PDT) Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA01031 for ; Thu, 8 Aug 1996 10:41:04 -0700 (PDT) Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id KAA17340; Thu, 8 Aug 1996 10:33:44 -0700 From: Terry Lambert Message-Id: <199608081733.KAA17340@phaeton.artisoft.com> Subject: Re: Per fs vnode pools (was Re: NFS Diskless Dispare...) To: michaelh@cet.co.jp Date: Thu, 8 Aug 1996 10:33:44 -0700 (MST) Cc: terry@lambert.org, dfr@render.com, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org In-Reply-To: from "Michael Hancock" at Aug 8, 96 02:19:30 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk > On Tue, 6 Aug 1996, Michael wrote: > > > > In effect, the ihash would become a vnhash and LRU for use in > > > reclaiming vnode/inode pairs. This would be much more efficient > > > than the current dual allocation sequence. > > Would you want this to be LRU vnodes with no buffer pages first? Yes. Minimally, you'd want a dual insertion point for LRU pages: head | vnodes without buffer pages | vnodes with buffer pages | tail insertion points ---^ ---^ > The > buffer cache is being reclaimed, with some kind of algorithm, independent > of the vnodes. You want to keep the vnodes with data still hanging off of > them in the fs pool longer. Actually, you want to be able to impose a working set quota on a per vnode basis using the cache reclaim algorithm. This avoids large mmap's from thrashing the cache. You could have supervisor, or even user, overrides for the behaviour. head | buffer reclaimation list | tail ^ ^--- insert here if vnode buffer count | is below working set quota insert here if vnode buffer count equals working set quota So truly, it does not want to be independent of the vnodes. A vnode quota is better than a process quota, since a process can use vnodes in common with other processes; you don't want to have a process with a low working set quota able to interfere with locality for another otherwise unrelated process. > BTW, is the incore inode table fixed or dynamic? Currently dynamic in FFS, and FS implementation dependent in principle. Potentially you will want to be able to install soft usage limits via mount options, independent of FS, assuming a common subsystem is being used to implement the allocation and LRU maintenance for each FS. This would imply a need to be able to force a reclaim, or allocation balancing at a minimum, in low memory situations. 
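The two-insertion-point reclaim list with a per-vnode working set quota described above might look roughly like the following userland sketch. The structure and function names are invented, and the real buffer cache obviously carries far more state; buffers of a vnode that has reached its quota go to the head (reclaimed first), buffers of vnodes still under quota go to the tail.

/*
 * Sketch of the two-insertion-point reclaim list described above.
 * Plain userland C; purely illustrative.
 */
#include <stdio.h>

struct vnode_ws {
    int ws_bufcount;        /* buffers currently cached for this vnode */
    int ws_quota;           /* per-vnode working set quota */
};

struct buf_ws {
    struct vnode_ws *b_vp;
    struct buf_ws *b_next, *b_prev;
};

struct reclaim_list {
    struct buf_ws head, tail;   /* sentinels; reclaim proceeds from head */
};

static void
reclaim_init(struct reclaim_list *rl)
{
    rl->head.b_prev = NULL;
    rl->head.b_next = &rl->tail;
    rl->tail.b_prev = &rl->head;
    rl->tail.b_next = NULL;
}

static void
insert_after(struct buf_ws *pos, struct buf_ws *bp)
{
    bp->b_prev = pos;
    bp->b_next = pos->b_next;
    pos->b_next->b_prev = bp;
    pos->b_next = bp;
}

static void
reclaim_insert(struct reclaim_list *rl, struct buf_ws *bp)
{
    bp->b_vp->ws_bufcount++;
    if (bp->b_vp->ws_bufcount >= bp->b_vp->ws_quota)
        insert_after(&rl->head, bp);        /* at quota: first to go */
    else
        insert_after(rl->tail.b_prev, bp);  /* under quota: last to go */
}

/* Take the reclaim candidate from the head of the list, if any. */
static struct buf_ws *
reclaim_one(struct reclaim_list *rl)
{
    struct buf_ws *bp = rl->head.b_next;

    if (bp == &rl->tail)
        return (NULL);
    bp->b_prev->b_next = bp->b_next;
    bp->b_next->b_prev = bp->b_prev;
    bp->b_vp->ws_bufcount--;
    return (bp);
}

int
main(void)
{
    struct reclaim_list rl;
    struct vnode_ws vp = { 0, 2 };      /* quota of two cached buffers */
    struct buf_ws b1 = { &vp }, b2 = { &vp }, b3 = { &vp };

    reclaim_init(&rl);
    reclaim_insert(&rl, &b1);           /* under quota: tail */
    reclaim_insert(&rl, &b2);           /* hits quota: head */
    reclaim_insert(&rl, &b3);           /* over quota: head */
    printf("first reclaim victim is b3? %d\n", reclaim_one(&rl) == &b3);
    return (0);
}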
This is actually a consequence of the buffer cache information not being indexed by device/offset for data which is not referred by vnode: inode information, etc.. If I had my preferences, the cache would be indexible by dev/offset as well (I would *not* eliminate the vnode/offset indexing currently present, since it avoids a bmap on every call that deals with file data). One major win here is that getting one on disk inode vs. another on disk inode in the same directory has a high probability of locality (the FFS paper makes this clear when looking at the directory/inode/cylinder group allocation policy). Instead of copying to an in core inode buffer, the on disk inode could be a page ref to the page containing the inode data, and a pointer. This would save all of the internal copies required for stat and other operations. Since multiple inodes could be in a device mapped page (as opposed to a strict vnode mapping), this could save a significant amount of I/O (16 disk inodes @ 128 bytes each per page). I'd like to keep the table dynamic in a modified slab basis: using a power of two allocation-ahead; this is open to discussion. John Dyson, in particular, has some interesting VM plans that would bear directly on how you'd want to do this. Clearly, if you had page mapping for the device for the on disk inode data, the allocated in core object would be the vnode, in core inode data (local FS state for an inode that is referenced), and a pointer to the page containing the disk inode data (with an implied page ref, and an implied limit of one page on the in core data -- you could overcome this by adding more page references in a table to the in core inode data and handling the inode dereference in the FS: you have to do that anyway, since the reference is implict, not explicit). A direct implication of this is that buffer reclaim for non-vnode unreferenced pages would need to be handled seperately; this is only a minor complication... you could do this by tracking number of items on a FS independent per device managed global LRU list vs. the number of items in the FS LRU's and establishing a high water mark for free pages so that the reclaim will occur on deallocation that pushes the LRU above the high water mark. Then you reclaim pages down to the low water mark (the page just freed, being below the low water mark, is left on the list to ensure locality). FS mechanics are one of the funnest things you can discuss. 8-). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. From owner-freebsd-fs Thu Aug 8 10:48:08 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA02098 for fs-outgoing; Thu, 8 Aug 1996 10:48:08 -0700 (PDT) Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA02079 for ; Thu, 8 Aug 1996 10:48:01 -0700 (PDT) Received: from minnow.render.com (minnow.render.com [193.195.178.1]) by minnow.render.com (8.6.12/8.6.9) with SMTP id SAA21945; Thu, 8 Aug 1996 18:47:45 +0100 Date: Thu, 8 Aug 1996 18:47:44 +0100 (BST) From: Doug Rabson To: Terry Lambert cc: michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org Subject: Re: NFS Diskless Dispare... 
In-Reply-To: <199608061728.KAA13564@phaeton.artisoft.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk On Tue, 6 Aug 1996, Terry Lambert wrote: > > [moved to freebsd-fs] > > > > On Mon, 5 Aug 1996, Terry Lambert wrote: > > > > > What I'm suggesting is that there needs to be both a VFS_VGET and > > > a VFS_VPUT (or VFS_VRELE). With the additional per fs release > > > mechanism, each FS instance can allocate an inode pool at its > > > instantiation (or do it on a per instance basis, the current > > > method which makes inode allocation so slow...). > > > > Not really sure how this would work for filesystems without a flat > > namespace? VFS_VGET is not supported for msdosfs, cd9660, nfs and > > probably others. > > > Conceptually, it's pretty tribial to support; it's not supported > because the stacking is not correctly implemented for these FS's. > Look at the /sys/miscfs/nullfs use of VOP_VGET. VFS_VGET is not implemented in NFS because the concept just doesn't apply. VFS_VGET is only relavent for local filesystems. NFS does have a flat namespace in terms of filehandles but not one which you could squeeze into the VFS_VGET interface. > > > Wait a minute. The VOP_LOCK is not there just for vclean to work. If you > > took it out, a lot of the VOPs in ufs would break due to unexpected > > reentry. The VOP_LOCK is there to ensure that operations which modify the > > vnode are properly sequenced even if the process has to sleep during the > > operation. > > That's why the vn_lock would be called. The VOP_LOCK is a transparent > veto/allow interface in that case, but that doesn't mean a counting > reference isn't held by PID (like it had to be). The actual Lite2 > routine for "actual lock" is called lockmgr() and lives in kern_lock.c > in the Lite2 sources. Lite2 already moves in this direction -- it just > hasn't gone far enough. > > > > > The vnode locking could then be done in common code: > > > > > > > > > vn_lock( vp, flags, p) > > > struct vnode *vp; > > > int flags; > > > struct proc *p; > > > { > > > /* actual lock*/ > > > if( ( st = ...) == SUCCESS) { > > > if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) { > > > /* lock was vetoed, undo actual lock*/ > > > ... > > > } > > > } > > > return( st); > > > } > > > > > > > > > The point here is that the lock contention (if any) can be resolved > > > without ever hitting the FS itsef in the failure case. > > > > > > > You can't do this for NFS. If you use exclusive locks in NFS and a > > server dies, you easily can end up holding onto a lock for the root vnode > > until the server reboots. To make it work for NFS, you would have to make > > the lock interruptable which forces you to fix code which does not check > > the error return from VOP_LOCK all over the place. > > This is one of the "flags" fields, and it only applies to the NFS client > code. Actually, since the NFSnode is not transiently destroyed as a > result of server reboot (statelessness *is* a win, no matter what the > RFS advocates would have you believe), there isn't a problem with holding > the reference. So the NFS code would degrade the exclusive lock back to a shared lock? Hmm. I don't think that would work since you can't get the exclusive lock until all the shared lockers release their locks. 
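A toy shared/exclusive lock makes the objection concrete: with any shared holders outstanding, an exclusive request cannot be granted, so degrading an exclusive lock and later re-upgrading it is not free. This is a deliberate simplification, not the Lite2 lockmgr.

/*
 * Toy shared/exclusive lock illustrating the point above: an exclusive
 * request cannot be granted while any shared holders remain.
 */
#include <stdio.h>

struct sxlock {
    int sharecount;     /* number of shared holders */
    int exclusive;      /* nonzero while exclusively held */
};

static int
sx_try_shared(struct sxlock *lk)
{
    if (lk->exclusive)
        return (0);
    lk->sharecount++;
    return (1);
}

static int
sx_try_exclusive(struct sxlock *lk)
{
    if (lk->exclusive || lk->sharecount > 0)
        return (0);             /* must wait for shared holders to drain */
    lk->exclusive = 1;
    return (1);
}

int
main(void)
{
    struct sxlock lk = { 0, 0 };

    sx_try_shared(&lk);
    printf("exclusive while shared held: %d\n", sx_try_exclusive(&lk));  /* 0 */
    lk.sharecount--;
    printf("exclusive after release: %d\n", sx_try_exclusive(&lk));      /* 1 */
    return (0);
}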
> > One of the things Sun recommends is not making the mounts on mount > points in the root directory; to avoid exactly this scenario (it really > doesn't matter in the diskless/dataless case, since you will hang on > swap or page-in from image-file-as-swap-store anyway). It doesn't matter if they are on mount points in root. If a lock is stuck in a sub-filesystem, then the 'sticking' can propagate across the mount point. > > The root does not need to be locked for the node lookup for the root > for a covering node in any case; this is an error in the "node x covers > node y" case in the lookup case. You can see that the lookup code > documents a race where it frees and relocks the parent node to avoid > exactly this scenario, actually. A lock does not need to be held > in the lookup for the parent in the NFS lookup case for the mount > point traversal. I believe this is an error in the current code. Have to think about this some more. Are you saying that when lookup is crossing a mountpoint, it does not need any locks in the parent filesystem? > > > The issue is more interesting in the client case; a reference is not > a lock, per se, it's an increment of the reference count. The server > holds the lock mid path traversal. > > This is resolved by setting the "interruptable" flag on the vn_lock > into the underlying FS on the server. > > > The easiest way to think of this is in terms of provider interfaces > and consumer interfaces. There are many FS provider interfaces. The > FS consumer interfaces are the syscall layer (the vfs_subr.c) and the > NFS client. This goes hand in hand with the discussion we had about ^^^^^^ Do you mean NFS server here? > the VOP_READDIR interface needing to be split into "get buffer/reference > buffer element" (remember the conversation about killing off the cookie > interface about a year ago?!?!). I remember that. I think I ended up agreeing with you about it. The details are a bit vague.. > [advlock digression ...] > > In the NFS client case, the VOP_LOCK and VOP_ADVLOCK are non-null. I > didn't show the sleep interface in the vn_lock in the case of the > failure. The sleep puts a loop around the "actual lock" code so a > sleep occurs above, at the higher code level. Intermediate locks > on per layer vnodes (if any are truly needed; see below) are > automatically wound and unwound for retry in the blocking case. > > > In the NFS case, the lock is asserted to the underlying FS, and the sleep > target is returned to the top of the loop by the FS layer where the > contention occurred (basically, a vnodep is returned in the != SUCCESS > case (SUCCESS == 0); this is used as the sleep target. > > If a lock in the NFS server code fails, and it fails for the UFS lock > case for the underlying FS, then it should sleep on the UFS vnode > being unlocked. > > The veto interface actually implies a couple of semantic changes; the > real implementation would probably be as a NULL lock entry to allow > the routine to not be called at all, saving the vnode_if parameter > list deconstruction/reconstruction. > > This allows the substitution of a chaining interface for a file system > stacking layer. > > Now you are probably asking "but how can this work when an intermediate > non-NULL layer fans out or in from multiple vnodes?". > > > The union FS case is one of the most interesting cases for this, since > what you want to do is conditionally assert a lock on two or more > underlying FS's, either of which could have NULL or non-NULL veto code. 
> The reason it is interesting is stack operand collapse in a stacking > instance. > > I could have the following simple case: > > > (syscalls or NFS or AFP or SMB or NetWare kernel server) > > consumer vn_lock > | ^ | ^ > v | v | > quota layer quota VOP_LOCK (NULL) > | ^ | ^ > v | v | > uid mapping layer uid VOP_LOCK (NULL) > | ^ | ^ > v | v | > FFS FFS VOP_LOCK (NULL) > > Really, you want to collapse NULL layer entries. But since the stack > could be reentered from the top, how can you do this without endangering > the locking of terminal nodes based on intermediate nodes? > > It turns out that the function collapse for the VOP_LOCK's in this > case is NULL; but say we replace FFS with the NFS client, where the > last layer is non-NULL? > > We would want to collapse to the NFS VOP_LOCK call, since the > intermediate chainings are NULL, but the terminal chaining is not. > Similar collapse could remove the uid mapping layer's VOP_LOOKUP, > leaving the quota VOP_LOOKUP (which has to be there to hide the > quota file and protect it) followed by the FFS VOP_LOOKUP. The > call-down chain is abbreviated. This is a general win in the veto > interface cases. The only place you are required to propagate is > the non-NULL cases, and the non-NULL case will only occur when a > fan-out or fan-in of vnodes occurs between layers. > > Currently collapse is not implemented. Part of the support for > collapse without full kernel recompilation on VOP addition was the > 0->1 FS instance count changes to the vfs_init.c code and the > addition of the structure sizing field in the vnode_if.c generation > in my big patch set (where the vnode_if.c generated had the structure > vfs_op_descs size computed in the vnode_if.c file. The change did > not simply allow the transition from 0->N loadable FS's (part of > the necessary work for discardable fallback drivers for the FS, > assuming kernel paging at some point in the future), and it did not > just allow you to add VFS OPS to the vnode_if without having to > recompile all FS modules and LKM's (it's stated intent). The change > also allows (with the inclusion of a structure sort, since the init > causes a structure copy anyway to get it into a stack instantiation) > the simplification of the vnode_if call to eliminate the intermediate > functioncall stub: a necessary step towards call graph collapse. You > want this so that if you have 10 FS layers in a stack, you only have > to call one or two veto functions out of the 10... and if they are > all NULL, the one is synthetic anyway. This is interesting. It is similar to the internal driver architecture we use in our 3D graphics system (was Reality Lab, now Microsoft's Direct3D). The driver is split up into different modules depending on functionality. The consumer (Direct3D) has a stack which it pushes driver modules onto for all the required functionality. This used to be useful for reconfiguring the stack at runtime to select different rendering algorithms etc. Direct3D broke that unfortunately but that is another story. It communicates with the drivers by sending service calls to the top driver in the stack. Each service call has a well defined number. If that module understands the service, it implements it and returns a result. Otherwise, it passes the service call down to the next driver in the stack. Some modules override service calls in lower layers and they typically do their own work and then pass the service onto the next layer in the stack. 
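The pass-down dispatch described here (before the service-table optimization covered next) can be modeled as a simple chain of modules keyed by call number. The names below are invented for the illustration and do not correspond to the Direct3D driver code.

/*
 * Sketch of a pass-down service-call chain: each module either handles
 * a numbered service call or forwards it to the module below it.
 */
#include <stdio.h>

struct module {
    const char *name;
    /* Return nonzero if handled; otherwise the call is passed down. */
    int (*service)(struct module *self, int callnum, void *arg);
    struct module *below;
};

static int
dispatch(struct module *top, int callnum, void *arg)
{
    struct module *m;

    for (m = top; m != NULL; m = m->below)
        if (m->service != NULL && m->service(m, callnum, arg))
            return (1);
    return (0);                 /* nobody in the stack implements it */
}

static int
quota_service(struct module *self, int callnum, void *arg)
{
    (void)arg;
    if (callnum != 7)
        return (0);             /* not ours; let it pass down */
    printf("%s handled call %d\n", self->name, callnum);
    return (1);
}

int
main(void)
{
    struct module ffs = { "ffs", NULL, NULL };
    struct module quota = { "quota", quota_service, &ffs };

    dispatch(&quota, 7, NULL);                  /* handled by quota layer */
    printf("call 9 handled: %d\n", dispatch(&quota, 9, NULL));
    return (0);
}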
To optimise the system, we added a service call table in the stack head. When a module is pushed onto the stack, it is called to 'bid' some of its services into the service call table. Each module in turn going up the stack puts a function pointer into the table for each of the services it wants to implement. If it is overriding a lower module, it just overwrites the pointer. If you add service calls, nothing needs to recompile (as long as the service call table is large enough) because the new services just go after the existing ones. > > > This is a big win in reducing the current code duplication, which you > want to do not only to reduce code size, but to make FS's more robust. > The common behaviours of FS's *should* be implemented in common code. Agreed. The worst candidates at the moment are VOP_LOOKUP, VOP_RENAME and the general locking/releasing protocol, IMHO. > > The Lite2 code recognizes this at the VOP_LOCK level in a primitive > fashion by introducing the lockmgr() call, but since the model is not > uniformly applied, and deadly-embrace or two caller starvation deadlocks > can still occur in the Lite2 model. Going to the next step, a veto > model, both increases the code robustness considerably, as well as > resolving the state wind/unwind problems inherent in fan out. The > fan out problem is *the* problem with the unionfs, at this point. Well at the moment, I think we have to just grit our teeth and merge in the lite2 code as it stands. We have to at least try to converge with the other strains of 4.4, if only to try and share the load of maintaining the filesystem code. I strongly believe that there should be a consensus between the different 4.4 groups over FS development or we just end up with chaos. -- Doug Rabson, Microsoft RenderMorphics Ltd. Mail: dfr@render.com Phone: +44 171 734 3761 FAX: +44 171 734 6426 From owner-freebsd-fs Thu Aug 8 14:54:40 1996 Return-Path: owner-fs Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id OAA16913 for fs-outgoing; Thu, 8 Aug 1996 14:54:40 -0700 (PDT) Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id OAA16885 for ; Thu, 8 Aug 1996 14:54:29 -0700 (PDT) Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id OAA17616; Thu, 8 Aug 1996 14:48:29 -0700 From: Terry Lambert Message-Id: <199608082148.OAA17616@phaeton.artisoft.com> Subject: Re: NFS Diskless Dispare... To: dfr@render.com (Doug Rabson) Date: Thu, 8 Aug 1996 14:48:28 -0700 (MST) Cc: terry@lambert.org, michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org In-Reply-To: from "Doug Rabson" at Aug 8, 96 06:47:44 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-fs@freebsd.org X-Loop: FreeBSD.org Precedence: bulk > > Conceptually, it's pretty tribial to support; it's not supported > > because the stacking is not correctly implemented for these FS's. > > Look at the /sys/miscfs/nullfs use of VOP_VGET. > > VFS_VGET is not implemented in NFS because the concept just doesn't apply. > VFS_VGET is only relavent for local filesystems. NFS does have a flat > namespace in terms of filehandles but not one which you could squeeze into > the VFS_VGET interface. The flat name space is the nfsnodes, not the file handles. In the NFS case, you would simply *not* implement recovery without reallocation. 
The allocation time is small compared to the wire time, and the actions could be interleaved by assuming a success response, with an additional dealloc overhead for the failure case.
> > > You can't do this for NFS. If you use exclusive locks in NFS and a > > > server dies, you easily can end up holding onto a lock for the root vnode > > > until the server reboots. To make it work for NFS, you would have to make > > > the lock interruptable which forces you to fix code which does not check > > > the error return from VOP_LOCK all over the place. > > > > This is one of the "flags" fields, and it only applies to the NFS client > > code. Actually, since the NFSnode is not transiently destroyed as a > > result of server reboot (statelessness *is* a win, no matter what the > > RFS advocates would have you believe), there isn't a problem with holding > > the reference. > > So the NFS code would degrade the exclusive lock back to a shared lock? > Hmm. I don't think that would work since you can't get the exclusive lock > until all the shared lockers release their locks.
You would unhold the lock and set reassert pending availability from the rpc.mount negotiation succeeding. Do this by setting up a fake sleep address. The trade off is between blocking a process (which you will have to do anyway) and hanging the kernel. The locks are local. The only possible race condition is local stacking on top of the NFS on the client side. You can either not allow it, or you can accept the fact that someone might win the thundering herd race (in which case you just get delayed a bit), or you can FIFO the request list with an array and a request entrancy limit when the array is full, where you degrade to thundering herd to get into the FIFO list. It's unlikely that someone will be running hundreds of processes from an NFS server that crashes, and care who gets their page requests satisfied first. The delay from misordering is going to be *nothing* compared with the delay for a network resource which is unavailable long enough to have the request list fill up. I think it's a non-problem to unwind the state, and the collision avoidance is well worth the worst case being slightly degraded. Currently in BSD and SunOS, if the server can't satisfy a page request from one local process, it blocks and the whole system goes to hell. This way, only the processes which are relying on the unreliable resource go to hell. Even so, I still vote for flagging the NFS mount to force a copy to swap of any file being used as swap store from an unreliable server. It's a better long term solution anyway.
> > One of the things Sun recommends is not making the mounts on mount > > points in the root directory; to avoid exactly this scenario (it really > > doesn't matter in the diskless/dataless case, since you will hang on > > swap or page-in from image-file-as-swap-store anyway). > > It doesn't matter if they are on mount points in root. If a lock is stuck > in a sub-filesystem, then the 'sticking' can propagate across the mount > point.
Well, yes, I suppose. There are better ways to fix that; specifically, lock the node that is covered before you lock the covering node in a mount point traversal. The issue is resolved locally after the second process waiting for the node, without propagating up past the mount point. I'm more concerned with interaction between multiple mounts of CDROM's on a changer device. It's more likely, if you ask me.
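The bounded FIFO of lock waiters with a thundering-herd fallback mentioned above could be modeled as simply as this. It is purely an illustrative sketch with invented names: a fixed array queues waiters in order, and once it is full, further requesters are refused and must spin and retry.

/*
 * Sketch of a bounded FIFO of lock waiters with a retry fallback.
 */
#include <stdio.h>

#define WAITQ_SIZE 4

struct waitq {
    int slots[WAITQ_SIZE];      /* queued requester ids */
    int head, count;
};

static int
waitq_enter(struct waitq *wq, int id)
{
    if (wq->count == WAITQ_SIZE)
        return (0);             /* full: caller falls back to retrying */
    wq->slots[(wq->head + wq->count) % WAITQ_SIZE] = id;
    wq->count++;
    return (1);
}

static int
waitq_next(struct waitq *wq)
{
    int id;

    if (wq->count == 0)
        return (-1);
    id = wq->slots[wq->head];
    wq->head = (wq->head + 1) % WAITQ_SIZE;
    wq->count--;
    return (id);
}

int
main(void)
{
    struct waitq wq = { {0}, 0, 0 };
    int i;

    for (i = 1; i <= 6; i++)
        if (!waitq_enter(&wq, i))
            printf("requester %d must spin and retry\n", i);
    printf("first queued requester to run: %d\n", waitq_next(&wq));
    return (0);
}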
Nevertheless, if you are running something off an NFS server, and it can't run, then it can't run. FreeBSD is no less graceful about that than any commercial OS. > > The root does not need to be locked for the node lookup for the root > > for a covering node in any case; this is an error in the "node x covers > > node y" case in the lookup case. You can see that the lookup code > > documents a race where it frees and relocks the parent node to avoid > > exactly this scenario, actually. A lock does not need to be held > > in the lookup for the parent in the NFS lookup case for the mount > > point traversal. I believe this is an error in the current code. > > Have to think about this some more. Are you saying that when lookup is > crossing a mountpoint, it does not need any locks in the parent > filesystem? It needs locks on the covered node, but it does not need to propagate the collision to root. The only case this fails is when / is NFS mounted and the server goes down. You have worse problems at that point, and hanging for the server to come back up is most likely the right thing to do in that case anyway. > > The easiest way to think of this is in terms of provider interfaces > > and consumer interfaces. There are many FS provider interfaces. The > > FS consumer interfaces are the syscall layer (the vfs_subr.c) and the > > NFS client. This goes hand in hand with the discussion we had about > ^^^^^^ > Do you mean NFS server here? Yes, thanks; sorry about that. > > the VOP_READDIR interface needing to be split into "get buffer/reference > > buffer element" (remember the conversation about killing off the cookie > > interface about a year ago?!?!). > > I remember that. I think I ended up agreeing with you about it. The > details are a bit vague.. I saved them; I can forward them if need be. The details were vague because I wanted an interface that let me tell it what I wanted back, but a struct direct only return would be acceptable for an interim implementation. That's one that could be broken up without too much trouble. [ ... ] > > Currently collapse is not implemented. Part of the support for > > collapse without full kernel recompilation on VOP addition was the > > 0->1 FS instance count changes to the vfs_init.c code and the > > addition of the structure sizing field in the vnode_if.c generation > > in my big patch set (where the vnode_if.c generated had the structure > > vfs_op_descs size computed in the vnode_if.c file. The change did > > not simply allow the transition from 0->N loadable FS's (part of > > the necessary work for discardable fallback drivers for the FS, > > assuming kernel paging at some point in the future), and it did not > > just allow you to add VFS OPS to the vnode_if without having to > > recompile all FS modules and LKM's (it's stated intent). The change > > also allows (with the inclusion of a structure sort, since the init > > causes a structure copy anyway to get it into a stack instantiation) > > the simplification of the vnode_if call to eliminate the intermediate > > functioncall stub: a necessary step towards call graph collapse. You > > want this so that if you have 10 FS layers in a stack, you only have > > to call one or two veto functions out of the 10... and if they are > > all NULL, the one is synthetic anyway. > > This is interesting. It is similar to the internal driver architecture we > use in our 3D graphics system (was Reality Lab, now Microsoft's Direct3D). > The driver is split up into different modules depending on functionality. 
> The consumer (Direct3D) has a stack which it pushes driver modules onto > for all the required functionality. This used to be useful for > reconfiguring the stack at runtime to select different rendering > algorithms etc. Direct3D broke that unfortunately but that is another > story.
I cheated for this; the two "competing" vnode stacking architectures are Heidemann's (the one we are using) and Rosenthal's (which lost out). Rosenthal alludes to stack collapse in his Usenix paper on "A file system stacking architecture". The general problem with Rosenthal's support is the same problem Novell was having in their "Advanced File System Design": personal views. A personal view allows the FS to have a canonical form, and each user can choose his view on the FS. The problem with this is the same problem Windows95 has now with desktop themes: support is impossible. Imagine the user who is told to "drag that icon to the wastebasket to fix the problem"... he may have a beaker of acid, or a black hole or a trash compactor, or whatever... there are no Schelling points in common that the user and the technical support person agree on so that they can communicate effectively. You can steal the personal view idea of a canonical form for a directory structure by specifying a canonicalization name space for files regardless of their name in the real name space. This was the basis of some of my internationalization work about two years ago (the numeric name space suggestion that allowed you to rename system critical files like /etc/passwd to Japanese equivalents and have NIS and login keep working). Rosenthal needed stack collapse to reduce the memory requirements per view instance so he could have views at all.
> It communicates with the drivers by sending service calls to the top > driver in the stack. Each service call has a well defined number. If > that module understands the service, it implements it and returns a > result. Otherwise, it passes the service call down to the next driver in > the stack. Some modules override service calls in lower layers and they > typically do their own work and then pass the service onto the next layer > in the stack.
Yes. This is exactly how the Heidemann thesis wants the VFS stacking to work. It fails because of the way the integration into the Lite code occurred in a rush as a result of the USL lawsuit and settlement. Specifically, there's no concept of adding a new VFS OP without rebuilding FFS (which is used to get the max number of VFS OPs allowed, in the current FreeBSD/NetBSD code).
> To optimise the system, we added a service call table in the stack head. > When a module is pushed onto the stack, it is called to 'bid' some of its > services into the service call table. Each module in turn going up the > stack puts a function pointer into the table for each of the services it > wants to implement. If it is overriding a lower module, it just > overwrites the pointer.
This is not quite the same inheritance model. Basically, you still need to be able to call each inferior layer. Consider the unionfs that unions two NFS mounts. Any fan in/fan out layer must be non-null.
> If you add service calls, nothing needs to recompile (as long as the > service call table is large enough) because the new services just go after > the existing ones.
Yes. The vnode_if.c structure sizing is how I ensured the table was large enough: that's where the table is defined, so it should be where the size is defined.
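The "size is defined where the table is defined" point can be illustrated with an operation table that carries its own element count plus spare slots, so that registering a new operation does not require recompiling the table's consumers. A hypothetical sketch, not the actual vnode_if generator:

/*
 * The table's count derives from the table itself, and spare slots
 * leave room for dynamically registered operations.  Names invented.
 */
#include <stdio.h>

struct op_desc {
    const char *opname;
    int (*opfunc)(void *arg);
};

#define SPARE_OPS 8

static int default_lookup(void *arg) { (void)arg; return (0); }
static int default_lock(void *arg)   { (void)arg; return (0); }

static struct op_desc op_table[ /* known ops */ 2 + SPARE_OPS ] = {
    { "lookup", default_lookup },
    { "lock",   default_lock },
};

/* Count derives from the table itself, not from some FS's private copy. */
static const int op_table_size = (int)(sizeof(op_table) / sizeof(op_table[0]));

static int
register_op(const char *name, int (*func)(void *))
{
    int i;

    for (i = 0; i < op_table_size; i++)
        if (op_table[i].opname == NULL) {
            op_table[i].opname = name;
            op_table[i].opfunc = func;
            return (i);         /* new op number */
        }
    return (-1);                /* table full: would need a rebuild */
}

int
main(void)
{
    printf("table has %d slots, %d spare\n", op_table_size, SPARE_OPS);
    printf("new op registered at slot %d\n",
        register_op("readdirents", default_lookup));
    return (0);
}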
The use of the FFS table to do the sizing in the init was what broke the ability to add service calls dynamically in the Heidemann code as integrated into 4.4Lite.
> > This is a big win in reducing the current code duplication, which you > > want to do not only to reduce code size, but to make FS's more robust. > > The common behaviours of FS's *should* be implemented in common code. > > Agreed. The worst candidates at the moment are VOP_LOOKUP, VOP_RENAME and > the general locking/releasing protocol, IMHO.
The lookup is more difficult because of the way directory management is dependent on file management. It's not possible to remove the FFS directory management code and replace it with an override with the current code arrangement. Specifically, I can't replace the FFS directory structure code with a btree (for instance). The lookup path buffer deallocation patches, which pushed the deallocation up into the consumer interface where the allocation took place, were a move toward severability of the directory interface. They had a side effect of moving toward the ability to support multiple name spaces of FS's that require it (VFAT/NTFS/UMSDOS/NETWARE/HFS), and of abstracting the component representation type (for Unicode support and more internationalization). This doesn't resolve the separability problem of the directory code, but it goes a long way toward freeing up the dependencies to allow incremental changes. I seriously dislike the relookup for the rename code, and think that it needs to be rethought. But separability was a necessary first step.
> > The Lite2 code recognizes this at the VOP_LOCK level in a primitive > > fashion by introducing the lockmgr() call, but since the model is not > > uniformly applied, and deadly-embrace or two caller starvation deadlocks > > can still occur in the Lite2 model. Going to the next step, a veto > > model, both increases the code robustness considerably, as well as > > resolving the state wind/unwind problems inherent in fan out. The > > fan out problem is *the* problem with the unionfs, at this point. > > Well at the moment, I think we have to just grit our teeth and merge in > the lite2 code as it stands. We have to at least try to converge with the > other strains of 4.4, if only to try and share the load of maintaining the > filesystem code. I strongly believe that there should be a consensus > between the different 4.4 groups over FS development or we just end up > with chaos.
The Lite 2 code merge *needs* to take place. I need to spend more time on it now that I'm good for work + 1 hour or so a day of sitting. I'll subscribe to that list pretty soon now. As to maintenance and design... well, I think we have a problem no matter what we do. The Heidemann thesis, and the other FICUS documents, are *the* design documents, IMO. The problem is that the current code in the 4.4 camps does not conform to the design documents. I think that no matter what, that needs to be corrected. Then there are issues of kludges for the interface design, or for missing technology pieces that simply have not been considered in the 4.4 code. The biggest kludge is that there is no documented bottom-end interface. We already have an unresolvable discrepancy because of VM differences. The second biggest kludge is the workaround for the directory structure size differences... the origin of the "cookie" crap in the VOP_READDIR interface. NetBSD and FreeBSD solved this problem in a bad way, and are in fact not interoperable at this point because of that.
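For reference, the "get buffer / reference buffer element" split of VOP_READDIR mentioned earlier in the thread might look roughly like this two-call iterator. The interface names are invented and this is not the proposed kernel API; the point is only that the consumer steps through an opaque block and never needs per-entry cookies.

/*
 * Rough model of a split directory-read interface: one call fills a
 * block of entries, a second call steps through them.
 */
#include <stdio.h>
#include <string.h>

struct dirblock {
    char data[512];
    int nbytes;
    int offset;             /* iteration cursor within data */
};

struct dirent_model {
    int namelen;
    char name[32];
};

/* "Get buffer": fill a block of entries starting at a directory offset. */
static int
dir_getblock(const struct dirent_model *entries, int nentries,
    int start, struct dirblock *db)
{
    int i, used = 0;

    for (i = start; i < nentries; i++) {
        if (used + (int)sizeof(entries[i]) > (int)sizeof(db->data))
            break;
        memcpy(db->data + used, &entries[i], sizeof(entries[i]));
        used += sizeof(entries[i]);
    }
    db->nbytes = used;
    db->offset = 0;
    return (i);             /* next start offset for the following call */
}

/* "Reference buffer element": step to the next entry in the block. */
static const struct dirent_model *
dir_nextent(struct dirblock *db)
{
    static struct dirent_model ent;

    if (db->offset + (int)sizeof(ent) > db->nbytes)
        return (NULL);
    memcpy(&ent, db->data + db->offset, sizeof(ent));
    db->offset += sizeof(ent);
    return (&ent);
}

int
main(void)
{
    struct dirent_model dir[] = { { 1, "." }, { 2, ".." }, { 5, "hello" } };
    struct dirblock db;
    const struct dirent_model *dp;

    dir_getblock(dir, 3, 0, &db);
    while ((dp = dir_nextent(&db)) != NULL)
        printf("%s\n", dp->name);
    return (0);
}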
Finally, there's the fact that 4.4 as shipped didn't support kernel module loading of any kind, and so there was no effort to limit the recompilation necessary for adding VOP's in the default vfs_init, the vfs_fs_init, or in the vnode_if method generation code. Short of starting up CSRG again, I don't see a common source for solving the Lite/Lite2 VFS problems. Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.