Date:      Wed, 18 Aug 1999 18:19:42 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        wrstuden@nas.nasa.gov
Cc:        tlambert@primenet.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject:   Re: BSD XFS Port & BSD VFS Rewrite
Message-ID:  <199908181819.LAA14096@usr02.primenet.com>
In-Reply-To: <Pine.SOL.3.96.990817092121.6014C-100000@marcy.nas.nasa.gov> from "Bill Studenmund" at Aug 17, 99 01:44:34 pm

> > > > > > 2.	Advisory locks are hung off private backing objects.
> > > I'm not sure. The struct lock * is only used by layered filesystems, so
> > > they can keep track both of the underlying vnode lock, and if needed their
> > > own vnode lock. For advisory locks, would we want to keep track both of
> > > locks on our layer and the layer below? Don't we want either one or the
> > > other? i.e. layers bypass to the one below, or deal with it all
> > > themselves.
> > 
> > I think you want the lock on the intermediate layer: basically, on
> > every vnode that has data associated with it that is unique to a
> > layer.  Let's not forget, also, that you can expose a layer into
> > the namespace in one place, and expose it covered under another
> > layer, at another.  If you locked down to the backing object, then
> > the only issue you would be left with is one or more intermediate
> > backing objects.
> 
> Right. That exported struct lock * makes locking down to the lowest-level
> file easy - you just feed it to the lock manager, and you're locking the
> same lock the lowest level fs uses. You then lock all vnodes stacked over
> this one at the same time. Otherwise, you just call VOP_LOCK below and
> then lock yourself.

I think this defeats the purpose of the stacking architecture; I
think that if you look at an unadulterated NULLFS, you'll see what I
mean.

Intermediate FS's should not trap VOP's that are not applicable
to them.

One of the purposes of doing a VOP_LOCK on intermediate vnodes
that aren't backing objects is to deal with global vnode pool
management.  I'd really like FS's to own their vnode pools, but
even without that, you don't need the locking, since you only need
to flush data on vnodes that are backing objects.

If we look at a stack of FS's with intermediate exposure into the
namespace, then it's clear that the issue is really only applicable
to objects that act as a backing store:


----------------------	----------------------	--------------------
FS			Exposed in hierarchy	Backing object
----------------------	----------------------	--------------------
top			yes			no
intermediate_1		no			no
intermediate_2		no			yes
intermediate_3		yes			no
bottom			no			yes
----------------------	----------------------	--------------------

So when we lock "top", we only lock in intermediate_2 and in bottom.

Then we attempt to lock in intermediate_3, but it fails: not because
there is a lock on the vnode in intermediate_3, but because there is
a lock in bottom.

It's unnecessary to lock the vnodes in the intermediate path, or
even at the exposure level, unless they are vnodes that have an
associated backing store.

The need to lock in intermediate_2 exists because it is a translation
layer or a namespace escape.  It deals with compression, or with
file-as-a-directory folding, or with file-hiding (perhaps for a
quota file), etc.  If it didn't, it wouldn't need backing store
(and therefore wouldn't need to be locked).
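
To make the rule concrete, here's a rough userland sketch (not real
VFS code; the layer names and "backing object" flags just mirror the
table above) of a lock request that only ever takes locks on layers
that have backing objects:

/*
 * Illustrative model only; the structures are invented, not the
 * real vnode/lock interfaces.
 */
#include <stdio.h>

struct layer {
        const char      *name;
        int             has_backing;    /* acts as backing store? */
        int             locked;         /* simulated lock state */
        struct layer    *lower;         /* next layer down, or NULL */
};

/* Lock from 'l' down; fail if any backing layer is already held. */
static int
vop_lock_stack(struct layer *l)
{
        struct layer *p;

        for (p = l; p != NULL; p = p->lower)
                if (p->has_backing && p->locked)
                        return (-1);    /* contention at a backing layer */
        for (p = l; p != NULL; p = p->lower)
                if (p->has_backing)
                        p->locked = 1;  /* only backing layers take the lock */
        return (0);
}

int
main(void)
{
        struct layer bottom = { "bottom", 1, 0, NULL };
        struct layer i3 = { "intermediate_3", 0, 0, &bottom };
        struct layer i2 = { "intermediate_2", 1, 0, &i3 };
        struct layer i1 = { "intermediate_1", 0, 0, &i2 };
        struct layer top = { "top", 0, 0, &i1 };

        printf("lock via top: %s\n",
            vop_lock_stack(&top) == 0 ? "granted" : "denied");
        printf("lock via intermediate_3: %s\n",
            vop_lock_stack(&i3) == 0 ? "granted" : "denied");
        return (0);
}

Locking via "top" succeeds and marks intermediate_2 and bottom held;
the later attempt via intermediate_3 is then denied, not because of
intermediate_3 itself, but because bottom is already held.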


> > For a layer with an intermediate backing object, I'm prepared to
> > declare it "special", and proxy the operation down to any inferior
> > backing object (e.g. a union FS that adds files from two FS's
> > together, rather than just directory entry lists).  I think such
> > layers are the exception, not the rule.
> 
> Actually isn't the only problem when you have vnode fan-in (union FS)? 
> i.e.  a plain compressing layer should not introduce vnode locking
> problems. 

If it's a block compression layer, it will.  So will a translation
layer; consider a pure Unicode system that wants to remotely mount
an FS from a legacy system.  To do this, it needs to expand the
pages from the legacy system [only it can, since the legacy system
doesn't know about Unicode] in a 2:1 ratio.  Now consider doing a
byte-range lock on a file on such a system.  To propagate the lock,
you have to do an arithmetic conversion at the translation layer.
This gets worse if the lower-end FS is exposed in the namespace as
well.

You could make the same arguments for other types of translation or
namespace escapes.
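
As a minimal sketch of the arithmetic involved (the structure and the
2:1 ratio are purely illustrative; a real layer would rewrite the
byte-range fields of the lock request it hands to the lower vnode):

#include <stdio.h>

struct byterange {
        long long start;        /* byte offset */
        long long len;          /* length in bytes */
};

/*
 * The upper (Unicode) layer sees two bytes for every byte in the
 * lower (legacy) file, so a range locked upstairs maps to half the
 * offset and length downstairs, rounded out to whole characters.
 */
static struct byterange
xlate_lock_down(struct byterange upper)
{
        struct byterange lower;

        lower.start = upper.start / 2;
        lower.len = (upper.len + (upper.start % 2) + 1) / 2;
        return (lower);
}

int
main(void)
{
        struct byterange up = { 100, 31 };      /* bytes 100..130 upstairs */
        struct byterange lo = xlate_lock_down(up);

        printf("upper [%lld,+%lld] -> lower [%lld,+%lld]\n",
            up.start, up.len, lo.start, lo.len);
        return (0);
}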


> > I think that export policies are the realm of /etc/exports.
> > 
> > The problem with each FS implementing its own policy, is that this
> > is another place that copyinstr() gets called, when it shouldn't.
> 
> Well, my thought was that, like with current code, most every fs would
> just call vfs_export() when it's presented an export operation. But by
> retaining the option of having the fs do its own thing, we can support
> different export semantics if desired.

I think this bears on whether the NFS server VFS consumer is
allowed access to the VFS stack at the particular intermediate
layer.  I think this is really an administrative policy decision,
and not an option for the VFS.

I think it would be bad if a given VFS could refuse to participate
in a stacking operation because it didn't like who was stacking.

If we insist on the ability for a VFS to refuse stacking, then
we should generalize the idea, such that an intermediate VFS could
refuse exposure into the filesystem namespace accessible to users.

Consider the case of a VFS without quota support, stacked under a
VFS layer that provided quota support by hiding a file in the top
level directory ("quota") and then folding the directory closed by
rerooting in a subdirectory of the top level directory ("root/").

It's reasonable to assume that most admins that want to enforce
quotas would *not* want the possibility of exposing the VFS without
quota support in the user accessible namespace.  Should the VFS
without quotas refuse such exposure?

I think the answer is "no", and that it is an administrative
control issue, not a VFS's preference issue.  Administrators enforce
this by protecting the path to exposure points, or by mounting
stacks over top of exposure points, which results in the exposure
being hidden under another mount.  Using the QUOTAFS example, you
mount the FS to be quota-enforced on /home, and then you mount
the QUOTAFS over top of it, and have it cover "/home" itself,
hiding the underlying FS from exposure.
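
Here's a toy model of why that ordering hides the lower FS; the mount
list and lookup are simulated, and QUOTAFS is still the hypothetical
layer from the example:

#include <stdio.h>
#include <string.h>

struct mount_entry {
        const char *path;
        const char *fs;
};

/* Mount list in mount order; later entries cover earlier ones. */
static struct mount_entry mounts[] = {
        { "/home", "the FS without quota support" },
        { "/home", "QUOTAFS, stacked over the mount below it" },
};

/* Resolve to the topmost (most recent) mount covering the path. */
static const char *
resolve(const char *path)
{
        int i;

        for (i = (int)(sizeof(mounts) / sizeof(mounts[0])) - 1; i >= 0; i--)
                if (strncmp(path, mounts[i].path, strlen(mounts[i].path)) == 0)
                        return (mounts[i].fs);  /* prefix match suffices here */
        return ("not mounted");
}

int
main(void)
{
        printf("/home/user resolves through: %s\n", resolve("/home/user"));
        return (0);
}

Every lookup under /home goes through QUOTAFS; the unprotected FS is
no longer reachable in the user-visible namespace.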


> > I would resolve this by passing a standard option to the mount code
> > in user space.  For root mounts, a vnode is passed down.  For other
> > mounts, the vnode is parsed and passed if the option is specified.
> 
> Or maybe add a field to vfsops. This info says what the mount call will
> expect (I want a block device, a regular file, a directory, etc), so it
> fits. :-)

This is actually an elegant solution to the problem.  Much of the
time, we don't consider data interfaces where they would be appropriate,
because of their widespread use in inappropriate ways (e.g. "ps").
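
As a hedged sketch of what such a field might look like (the names
vfs_mountarg and VFSMNT_* are invented here for illustration; nothing
like this exists in the tree today):

#include <stdio.h>

/* What kind of object does this fs expect as its mount argument? */
#define VFSMNT_NONE     0       /* no backing object (e.g. procfs)      */
#define VFSMNT_BLKDEV   1       /* a block device node (ffs, cd9660)    */
#define VFSMNT_REGFILE  2       /* a regular file                       */
#define VFSMNT_DIR      3       /* a directory of an already-mounted fs */
                                /* (null, umap, union layers)           */

struct vfsops_sketch {
        const char      *vfs_name;
        int             vfs_mountarg;   /* which of the above it wants */
        /* ...the usual vfs_mount, vfs_unmount, etc. would follow... */
};

static const struct vfsops_sketch tab[] = {
        { "ffs",    VFSMNT_BLKDEV },
        { "nullfs", VFSMNT_DIR },
        { "procfs", VFSMNT_NONE },
};

int
main(void)
{
        size_t i;

        for (i = 0; i < sizeof(tab) / sizeof(tab[0]); i++)
                printf("%s expects mount-argument type %d\n",
                    tab[i].vfs_name, tab[i].vfs_mountarg);
        return (0);
}

The generic mount(2) path could then validate and look up the argument
once, centrally, before calling into the fs-specific mount code.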


> Also, if we leave it to userland, what happens if someone writes a
> program which calls sys_mount with something the fs doesn't expect. :-)

Well, that gets to another grail of mine: when a device containing
a filesystem "arrives", I believe it should trigger a mount into
the list of mounted filesystems.

I don't necessarily mean that it should also be exported into the
filesystem hierarchy at that point (but it's an option, using the
"last mounted on" information).


> > I think that you will only be able to find rare examples of FS's
> > that don't take device names as arguments.  But for those, you
> > don't specify the option, and it gets "NULL", and whatever local
> > options you specify.
> 
> I agree I can't see a leaf fs not taking a device node. But layered
> fs's certainly will want something else. :-)

I think they want a vnode of an already mounted FS.  The trick is
to enforce the "already mounted" part of that.  I'm comfortable with
doing this by saying "it's not already mounted until you can look
up a vnode on it".


> > The point is that, for FS's that can be both root and sub-root,
> > the mount code doesn't have to make the decision, it can be punted
> > to higher level code, in one place, where the code can be centrally
> > maintained and kept from getting "stale" when things change out
> > from under it.
> 
> True.
> 
> And with good comments we can catch the times when the centrally located
> code changes & breaks an assumption made by the fs. :-)

8-).


> > > Except for a minor buglet with device nodes, stacking works in NetBSD at
> > > present. :-)
> > 
> > Have you tried Heidemann's student's stacking layers?  There is one
> > encryption, and one per-file compression with namespace hiding, that
> > I think it would be hard pressed to keep up with.  But I'll give it
> > the benefit of the doubt.  8-).
> 
> Nope. The problem is that while stacking (null, umap, and overlay fs's)
> work, we don't have the coherency issues worked out so that upper layers
> can cache data. i.e. so that the lower fs knows it has to ask the upper
> layers to give pages back. :-) But multiple ls -lR's work fine. :-)

With UVM in NetBSD, this is (supposedly) not an issue.

You could actually think of it this way, as well: only FS's that
contain vnodes that provide backing should implement VOP_GETPAGES
and VOP_PUTPAGES, and all I/O should be done through paging.
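
Here's a small userland model of that idea (the structures are
invented, not the real vnode/UVM interfaces): a layer without a
backing object provides no getpages of its own and simply hands the
request down, so only the backing layer ever supplies pages.

#include <stdio.h>
#include <stddef.h>

struct mvnode {
        const char      *layer;
        struct mvnode   *lower;         /* layer below, or NULL */
        int             (*getpages)(struct mvnode *, int); /* NULL = no backing */
};

/* A backing layer actually supplies the page (here it just reports it). */
static int
backing_getpages(struct mvnode *vp, int pgno)
{
        printf("page %d supplied by %s\n", pgno, vp->layer);
        return (0);
}

/* Generic bypass: walk down until some layer supplies the pages. */
static int
vop_getpages(struct mvnode *vp, int pgno)
{
        while (vp != NULL && vp->getpages == NULL)
                vp = vp->lower;
        return (vp != NULL ? vp->getpages(vp, pgno) : -1);
}

int
main(void)
{
        struct mvnode bottom = { "bottom (backing)", NULL, backing_getpages };
        struct mvnode null_layer = { "null layer", &bottom, NULL };
        struct mvnode top = { "top", &null_layer, NULL };

        return (vop_getpages(&top, 7));
}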


> > > I agree it's ugly, but it has the advantage that it doesn't grow the
> > > on-disk inode. A lot of folks have designs on the remaining 64 bits free.
> > > :-)
> > 
> > Well, so long as we can resolve the issue for a long, long time;
> > I plan on being around to have to put up with the bugs, if I can
> > wrangle it... 8-).
> 
> :-)
> 
> I bet by then (559447 AD) we won't be using ffs, so the problem will be
> moot. :-)

Unless I'm the curator of a computer museum... 8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.





