From owner-freebsd-hackers  Wed Aug 18 13:44:47 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135])
        by hub.freebsd.org (Postfix) with ESMTP
        id EA4F215D9F; Wed, 18 Aug 1999 13:43:47 -0700 (PDT)
        (envelope-from tlambert@usr06.primenet.com)
Received: (from daemon@localhost)
        by smtp05.primenet.com (8.9.1/8.9.1) id NAA113206;
        Wed, 18 Aug 1999 13:43:27 -0700
Received: from usr06.primenet.com(206.165.6.206) via SMTP
        by smtp05.primenet.com, id smtpdDReHUa; Wed Aug 18 13:43:17 1999
Received: (from tlambert@localhost)
        by usr06.primenet.com (8.8.5/8.8.5) id NAA28863;
        Wed, 18 Aug 1999 13:43:14 -0700 (MST)
From: Terry Lambert
Message-Id: <199908182043.NAA28863@usr06.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: wrstuden@nas.nasa.gov
Date: Wed, 18 Aug 1999 20:43:14 +0000 (GMT)
Cc: tlambert@primenet.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: from "Bill Studenmund" at Aug 18, 99 11:59:01 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> > > Right.  That exported struct lock * makes locking down to the
> > > lowest-level file easy - you just feed it to the lock manager, and
> > > you're locking the same lock the lowest-level fs uses.  You then
> > > lock all vnodes stacked over this one at the same time.  Otherwise,
> > > you just call VOP_LOCK below and then lock yourself.
> >
> > I think this defeats the purpose of the stacking architecture; I
> > think that if you look at an unadulterated NULLFS, you'll see what I
> > mean.
>
> Please be more precise.  I have looked at an unadulterated NULLFS, and
> found it lacking.  I don't see how this change breaks stacking.

OK, there's the concept of "collapse" of stacking layers.

This was first introduced in the Rosenthal stacking vnode architecture,
out of Sun Microsystems.

Rosenthal wanted to ensure that, when you stack 500 putatively "null"
NULLFSs, the function call overhead does not increase proportionally.

To resolve this, he introduced the concept of a "collapsed" VFS stack.
That is, the array of function vectors actually called is a
one-dimensional projection of the two-dimensional stack: the visible
entry for each VOP is that of the first layer on the way down the
stack that implements it.

We can visualize this like so:

        |                      VOPs
Layer   |  VOP1   VOP2   VOP3   VOP4   VOP5   VOP6   ...
--------+------------------------------------------------
L1      |  -      -      -      imp    -      -      ...
L2      |  imp    -      -      imp    -      imp    ...
L3      |  imp    -      -      imp    imp    -      ...
L4      |  -      -      imp    -      -      -      ...
L5      |  imp    imp    imp    imp    imp    imp    ...

The resulting "collapsed" array of entry vectors looks like so:

           L2VOP1 L5VOP2 L4VOP3 L1VOP4 L3VOP5 L2VOP6 ...

There is an implicit assumption here that most stacks will not be
randomly staggered like this example.  The idea behind this assumption
is that additional layers will most frequently add functionality,
rather than replace it.

Heidemann carried this idea over into his architecture, to be employed
at the point that a VFS stack is first instanced.

The 4.4BSD implementation of this is partially flawed.  The first place
the flaw is apparent is the UFS/FFS "stack" of layers: the VOP
descriptor array exported by the combination of the two is hard-coded
as a precollapsed stack.  This is actually antithetical to the design.
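
To make the collapse concrete, here is a minimal sketch of the
operation; the types and sizes (vop_t, struct vfs_layer, NVOPS) are
hypothetical stand-ins, not the real 4.4BSD vnodeop descriptor
machinery:

/*
 * A sketch of the collapse operation.  The types and sizes here are
 * hypothetical, not the real 4.4BSD vnodeop descriptor machinery.
 */
#include <stddef.h>

#define NVOPS   64                      /* hypothetical VOP count */

typedef int (*vop_t)(void *args);

struct vfs_layer {
        vop_t   ops[NVOPS];             /* NULL = "not implemented here" */
};

/*
 * Collapse a stack of nlayers layers (layers[0] is the top) into a
 * single dispatch vector; done once, when the stack is first instanced.
 */
void
vfs_collapse(struct vfs_layer **layers, int nlayers, vop_t collapsed[])
{
        int vop, i;

        for (vop = 0; vop < NVOPS; vop++) {
                collapsed[vop] = NULL;
                for (i = 0; i < nlayers; i++) {
                        if (layers[i]->ops[vop] != NULL) {
                                /* first implementor on the way down wins */
                                collapsed[vop] = layers[i]->ops[vop];
                                break;
                        }
                }
                /*
                 * A slot nobody implements stays NULL, which the
                 * descriptor call mechanism can treat as a call
                 * failure; wiring in a non-error default here is what
                 * breaks proxying of unknown VOPs (see below).
                 */
        }
}

A call through the collapsed vector then costs a single indirect
function call, no matter how many null layers sit between the caller
and the layer that actually implements the operation.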

The second place the flaw is apparent is in the inability to add VOPs
to an existing kernel: the entry point vector is a fixed size, and is
not expanded implicitly by the act of adding a VFS layer that contains
a new VOP.

The use of non-error vfs_defaults is also flawed for proxies.  The flaw
is not on the consumer side of the VFS stack, but on the producer end
on the other side of the proxy: although the producer does not
implement a particular VOP, it needs to _NOT_ use the local vfs_default
for that VOP; instead, it needs to proxy the VOP over to the other side
for remote processing.  Getting a vfs_default VOP after a collapse,
instead of a NULL entry point that the descriptor call mechanism treats
as a call failure, damages the ability to proxy unknown VOPs.

> > Intermediate FS's should not trap VOP's that are not applicable
> > to them.
>
> True. But VOP_LOCK is applicable to layered fs's. :-)

Only for translation layers that require local backing store.  I'm
prepared to make an exception for them, and require that they
explicitly call the VOP in the underlying vnode over which they are
stacked.  This is the same compromise that both Rosenthal and
Heidemann consciously chose.

> > One of the purposes of doing a VOP_LOCK on intermediate vnodes
> > that aren't backing objects is to deal with the global vnode
> > pool management.  I'd really like FS's to own their vnode pools,
> > but even without that, you don't need the locking, since you
> > only need to flush data on vnodes that are backing objects.
> >
> > If we look at a stack of FS's with intermediate exposure into the
> > namespace, then it's clear that the issue is really only applicable
> > to objects that act as a backing store:
> >
> > ----------------------  --------------------  --------------------
> > FS                      Exposed in hierarchy  Backing object
> > ----------------------  --------------------  --------------------
> > top                     yes                   no
> > intermediate_1          no                    no
> > intermediate_2          no                    yes
> > intermediate_3          yes                   no
> > bottom                  no                    yes
> > ----------------------  --------------------  --------------------
> >
> > So when we lock "top", we only lock in intermediate_2 and in bottom.
>
> No. One of the things Heidemann notes in his dissertation is that to
> prevent deadlock, you have to lock the whole stack of vnodes at once,
> not bit by bit.
>
> i.e. there is one lock for the whole thing.

This is not true for a unified VM and buffer cache environment, and a
significant reduction in overhead can be achieved thereby.

Heidemann did his work on SVR4, which does not have a unified VM and
buffer cache.  The deadlock discussion in his dissertation is only
applicable to systems where the coherency model is such that each and
every vnode has buffers associated with it.  That is, it applies to
vnodes which act as backing store (buffer cache object references).

If you separate the concepts, such that you don't have to deal with
vnodes that have no coherency issues, then you can drastically reduce
the number of coherency operations required (locking is a coherency
operation).

In addition to this, you can effectively obtain what neither the
Rosenthal nor the SVR4 version of the Heidemann stacking framework can
otherwise obtain: intermediate VFS layer NULL VOP call collapse.

The way you obtain this is by caching the vnode of the backing object
in the intermediate layer, and dereferencing it to get at its VOP
vector directly.  This means that a functional layer that shadows an
underlying VOP, separated from it by 1,000 NULLFS layers, does not
incur a 1,000-function-call overhead.
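
Here is a minimal sketch of the idea; the structures and names
(struct xvnode, XVOP_PUTPAGES, and so on) are hypothetical stand-ins,
not the actual BSD vnode or VOP code:

/*
 * A sketch of caching the backing-object vnode in an intermediate
 * layer.  The structures and names are hypothetical stand-ins, not
 * the actual BSD vnode/VOP code.
 */
#include <stddef.h>

struct xvnode;                          /* forward declaration */
typedef int (*xvop_t)(struct xvnode *vp, void *args);

enum { XVOP_LOCK, XVOP_PUTPAGES, XVOP_COUNT };  /* hypothetical VOP slots */

struct xvnode {
        xvop_t           v_ops[XVOP_COUNT];     /* this vnode's VOP vector */
        struct xvnode   *v_lower;               /* next layer down, or NULL */
        struct xvnode   *v_backing;             /* cached backing-object vnode */
        int              v_isbacking;           /* nonzero if vnode has pages */
};

/* Resolve and cache the backing object once, when the stack is instanced. */
static void
xv_cache_backing(struct xvnode *vp)
{
        struct xvnode *bp = vp;

        while (bp != NULL && !bp->v_isbacking)
                bp = bp->v_lower;
        vp->v_backing = bp;     /* NULL if nothing below has backing store */
}

/*
 * A coherency operation (flush, lock, ...) dispatches through the
 * cached backing vnode's vector directly, skipping every null layer
 * in between.
 */
static int
xv_putpages(struct xvnode *vp, void *args)
{
        struct xvnode *bp = vp->v_backing;

        if (bp == NULL)
                return (0);     /* no backing object: nothing to flush */
        return (bp->v_ops[XVOP_PUTPAGES](bp, args));
}

The backing vnode is resolved once, when the stack is instanced, so a
coherency operation pays for one indirect call regardless of how many
null layers sit in between.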

> > > Actually isn't the only problem when you have vnode fan-in (union
> > > FS)?  i.e. a plain compressing layer should not introduce vnode
> > > locking problems.
> >
> > If it's a block compression layer, it will.  Also a translation
> > layer; consider a pure Unicode system that wants to remotely mount
> > an FS from a legacy system.  To do this, it needs to expand the
> > pages from the legacy system [only it can, since the legacy system
> > doesn't know about Unicode] in a 2:1 ratio.  Now consider doing a
> > byte-range lock on a file on such a system.  To propagate the lock,
> > you have to do an arithmetic conversion at the translation layer.
> > This gets worse if the lower-end FS is exposed in the namespace as
> > well.
>
> Wait.  byte-range locking is different from vnode locking.  I've been
> talking about vnode locking, which is different from the byte-range
> locking you're discussing above.

Conceptually, they're not really different at all.  You want to apply
an operation against a stack of vnodes, and only involve the relevant
vnodes when you do it.

> > > Nope. The problem is that while stacking (null, umap, and overlay
> > > fs's) work, we don't have the coherency issues worked out so that
> > > upper layers can cache data.  i.e. so that the lower fs knows it
> > > has to ask the upper layers to give pages back. :-)  But multiple
> > > ls -lR's work fine. :-)
> >
> > With UVM in NetBSD, this is (supposedly) not an issue.
>
> UBC. UVM is a new memory manager. UBC unifies the buffer cache with
> the VM system.

I was under the impression that the "U" in "UVM" was for "Unified".

Does NetBSD not have a unified VM and buffer cache?  Is the "U" in
"UVM" referring not to buffer cache unification, but to platform
unification?

It was my understanding from John Dyson, who had to work on NetBSD for
NCI, that the new NetBSD stuff actually unified the VM and the buffer
cache.

If this isn't the case, then, yes, you will need to lock all the way
up and down, and eat the copy overhead for the concurrency for the
intermediate vnodes.  8-(.

> > You could actually think of it this way, as well: only FS's that
> > contain vnodes that provide backing should implement VOP_GETPAGES
> > and VOP_PUTPAGES, and all I/O should be done through paging.
>
> Right. That's part of UBC. :-)

Yep.  Again, if NetBSD doesn't have this, it's really important that
it obtain it.  8-(.


                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message