Date:      Mon, 16 Aug 1999 16:04:11 -0700 (PDT)
From:      Bill Studenmund <wrstuden@nas.nasa.gov>
To:        Terry Lambert <tlambert@primenet.com>
Cc:        Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject:   Re: BSD XFS Port & BSD VFS Rewrite
Message-ID:  <Pine.SOL.3.96.990816143421.27345M-100000@marcy.nas.nasa.gov>
In-Reply-To: <199908162118.OAA04940@usr09.primenet.com>

On Mon, 16 Aug 1999, Terry Lambert wrote:

> > > 2.	Advisory locks are hung off private backing objects.
> > I'd vote against your implementation suggestion for VOP_ADVLOCK on an
> > efficiency concern. If we actually make a VOP call, that should be the
> > end of the story. I.e. either add a vnode flag to indicate pass/fail-ness,
> > or add a genfs/std call to handle the problem.
> > 
> > I'd actually vote for the latter. Hang the byte-range locking off of the
> > vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on
> > OS flavor) to handle the call. That way all fs's that can will share code,
> > and the callers need only call VOP_ADVLOCK() - no other logic.
> 
> OK.  Here's the problem with that:  NFS client locks in a stacked
> FS on top of the NFS client FS.

Ahh, but it'd be the fs's decision to map genfs_advlock()/vop_stdadvlock()
to its vop_advlock_desc entry or not. In this case, NFS wouldn't want to
do that.

Though it would mean growing the fs footprint.
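
A minimal userland sketch of that arrangement, assuming a much-simplified
vnode: the names std_advlock(), nfs_advlock() and vop_advlock() and the
struct layout are invented for illustration and are not the NetBSD or
FreeBSD kernel interfaces. Each fs picks which routine sits in its advlock
slot, and callers only ever go through the one entry point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for kernel types. */
struct vnode;
typedef int (*advlock_fn)(struct vnode *vp, long start, long len);

struct vnode {
    advlock_fn advlock;  /* this fs's entry for the advlock operation */
    int nlocks;          /* toy stand-in for a lock list hung off the vnode */
    int remote_calls;    /* counts trips to the (imaginary) lock server */
};

/* Shared routine in the spirit of genfs_advlock()/vop_stdadvlock():
 * all lock state lives on the vnode itself. */
int std_advlock(struct vnode *vp, long start, long len)
{
    (void)start;
    if (len <= 0)
        return -1;
    vp->nlocks++;
    return 0;
}

/* An fs with extra needs (NFS here) declines the shared routine and
 * installs its own, which may still reuse the shared one internally. */
int nfs_advlock(struct vnode *vp, long start, long len)
{
    vp->remote_calls++;          /* pretend to talk to the server */
    return std_advlock(vp, start, len);
}

/* Callers need no other logic than this dispatch. */
int vop_advlock(struct vnode *vp, long start, long len)
{
    return vp->advlock(vp, start, len);
}
```

The cost is the one noted above: every fs that opts in carries one more
table entry.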

> Specifically, you need to separate the idea of asserting a lock
> against the local vnode, asserting the lock via NFS locking, and
> coalescing the local lock list, after both have succeeded, or
> reverting the local assertion, should the remote assertion fail.

Right. But my thought was that you'd be calling an NFS routine, so it
could do the right thing.
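
The sequence Terry describes (assert locally, assert remotely, then
coalesce or revert) could be sketched as below. This is a single-owner toy,
not NFS client code: the range list, the remote_lock_fn hook and every
function name are assumptions made for the example, and conflict checking
against other lock owners is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAXLOCKS 16

struct range { long start, end; };          /* half-open [start, end) */

struct locklist {
    struct range r[MAXLOCKS];
    size_t n;
};

/* Hypothetical hook standing in for the NFS lock request; 0 means granted. */
typedef int (*remote_lock_fn)(long start, long end);

int remote_grant(long start, long end) { (void)start; (void)end; return 0; }
int remote_deny(long start, long end)  { (void)start; (void)end; return -1; }

/* Step 1: record the lock locally, uncoalesced, so it can be undone exactly. */
int local_assert(struct locklist *ll, long start, long end)
{
    if (start >= end || ll->n == MAXLOCKS)
        return -1;
    ll->r[ll->n].start = start;
    ll->r[ll->n].end = end;
    ll->n++;
    return 0;
}

/* Undo the most recent local assertion. */
void local_revert(struct locklist *ll)
{
    assert(ll->n > 0);
    ll->n--;
}

/* Step 3: merge the newest entry with every entry it overlaps or touches. */
void coalesce_last(struct locklist *ll)
{
    assert(ll->n > 0);
    struct range m = ll->r[--ll->n];
    size_t i = 0;
    while (i < ll->n) {
        if (ll->r[i].start <= m.end && m.start <= ll->r[i].end) {
            if (ll->r[i].start < m.start) m.start = ll->r[i].start;
            if (ll->r[i].end > m.end)     m.end = ll->r[i].end;
            ll->r[i] = ll->r[--ll->n];  /* absorb entry i, fill the hole */
            i = 0;                      /* grown range may touch earlier ones */
        } else {
            i++;
        }
    }
    ll->r[ll->n++] = m;
}

/* Local assert, remote assert, then coalesce on success or revert on failure. */
int nfs_style_lock(struct locklist *ll, remote_lock_fn remote,
                   long start, long end)
{
    if (local_assert(ll, start, end) != 0)
        return -1;
    if (remote(start, end) != 0) {      /* step 2 failed: back out */
        local_revert(ll);
        return -1;
    }
    coalesce_last(ll);
    return 0;
}
```

Deferring the coalesce until the remote side has answered is what keeps the
revert trivial: the rejected range is still a separate entry.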

> > NetBSD actually needs this to get unionfs to work. Do you want to talk
> > privately about it?
> 
> If you want.  FreeBSD needs it for unionfs and nullfs, so it's
> something that would be worth airing.
> 
> I think you could say that no locking routine was an approval of
> the upper level lock.  This lets you bail on all FS's except NFS,
> where you have to deal with the approve/reject from the remote
> host.  The problem with this on FreeBSD is the VFS_default stuff,
> which puts a non-NULL interface on all FS's for all VOP's.

I'm not familiar with the VFS_default stuff. All the vop_default_desc
routines in NetBSD point to error routines.

> Yes, this NULL is the same NULL I suggested for advisory locks,
> above.

I'm not sure. The struct lock * is only used by layered filesystems, so
they can keep track both of the underlying vnode lock, and if needed their
own vnode lock. For advisory locks, would we want to keep track both of
locks on our layer and the layer below? Don't we want either one or the
other? i.e. layers bypass to the one below, or deal with it all
themselves.

> > > 5.	The idea of "root" vs. "non-root" mounts is inherently bad.
> > You forgot:
> > 
> > 	5)	Update export lists
> > 
> > 		If you call the mount routine with no device name
> > 		(args.fspec == 0) and with MNT_UPDATE, you get
> > 		routed to the vfs_export routine
> 
> This must be the job of the upper level code, so that there is
> a single control point for export information, instead of spreading
> it throughout each FS's mount entry point.

I agree it should be detangled, but think it should remain the fs's job to
choose to call vfs_export. Otherwise an fs can't implement its own export
policies. :-)

> > I thought it was? Admittedly the only reference code I have is the ntfs
> > code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s it
> > is, I thought it'd be an ok reference.
> 
> No.

We've lost the context, but what I was trying to say was that I thought
the marking-the-vnode-as-mounted-on bit was done in the mount syscall at
present. At least that's what
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_syscalls.c?rev=1.130
seems to be doing.

> Basically, what you would have is the equivalent of a variable
> length "mounted volume" table, from which mappings (and exports,
> based on the mappings) are externalized into the namespace.

Ahh, sounds like you're talking about a new formalism..

> Right.  It should just have a "mount" entry point, and the rest
> of the stuff moves to higher level code, called by the mount system
> call, and the mountroot stuff during boot, to externalize the root
> volume at the top of the hierarchy.
> 
> An ideal world would mount a / that had a /dev under it, and then
> do transparent mounts over top of that.

That would be quite a different place than we have now. ;-)

> > > 	The conversion of the root device into a vnode pointer, or
> > > 	a path to a device into a vnode pointer, is the job of upper
> > > 	level code -- specifically, the mount system call, and the
> > > 	common code for booting.
> > 
> > My one concern about this is you've assumed that the user is mounting a
> > device onto a filesystem.
> 
> No.  Vnode, not bdevvp.  The bdevvp stuff is for the boot time stuff
> in the upper level code, and only applies to the root volume.

Maybe I mis-parsed. I thought you were talking about parsing the first
mount option (in mount /dev/disk there, the /dev/disk option) into a
vnode. The concern below is that different fs's have different ideas as to
what that node should be. Some want it to be a device node which no one else
is using (most leaf fs's), while some others want a directory (nullfs, etc.),
some want a file or device (the HSM system I'm working on) while others
don't care (in mount -t kernfs /kern /kern , the first kern doesn't matter
at all). But all is well with different support routines which the
mount_foo() routine can call.

> > Layered filesystems won't do that. nullfs,
> > umapfs, and unionfs will want a directory. The hierarchical storage
> > system I'm working on will want a file. kernfs, procfs, and an fs which I
> > haven't checked into the NetBSD tree don't really need the extra
> > parameter. Supporting all these different cases would be a hassle for
> > upstream code.
> > 
> > > 	This removes a large amount of complex code from each of
> > > 	the file systems, and centralizes the maintenance task into
> > > 	one set of code that either works for everyone, or no one
> > > 	(removing the duplication of code/introduction of errors
> > > 	issue).
> > 
> > Might I suggest a common library of routines which different mount
> > routines can call? That way we'd get code sharing while letting the fs
> > make decisions about what it expects of the input arguments.
> 
> This is the "footprint" problem, all over again.  Reject/accept (or 
> "accept if no VOP") seems more elegant, and also reduces footprint.

Very true. The problem is that the current VFS system was designed as a
black box. It gets handed all calls, and it gets to decide policy, and do
everything on its own. We're now basically discussing ways of having the
plethora of fs's we now have do things the same way. :-)

> > > 7.	The struct nameidata (namei.h) is broken in conception.
> 
> Can you push a Unicode name down from an appropriate system call?
> 
> I don't see any way to deal with an NT FS for characters outside
> ISO 8859-1, otherwise.  8-(.

Hmmm. I think the real problem is that the kernel(s) is(are) not at all
designed well for different languages.

> > > 9.	The implementation of namei() is POSIX non-compliant
> > > 
> > > 	The implementation of namei() is by means of coroutine
> > > 	"recursion"; this is similar to the only recursion you can
> > > 	achieve in FORTRAN.
> > > 
> > > 	The upshot of this is that the use of the "//" namespace
> > > 	escape allowed by POSIX can not be usefully implemented.
> > > 	This is because it is not possible to inherit a namespace
> > > 	escape deeper than a single path component for a stack of
> > > 	more than one layer in depth.
> > > 
> > > 	This needs to be fixed, both for "natural" SMBFS support,
> > > 	and for other uses of the namespace escape (HTTP "tunnels",
> > > 	extended attribute and/or resource fork access in an OS/2
> > > 	HPFS or Macintosh HFS implementation, etc.), including
> > > 	forward looking research.
> > > 
> > > 	This is related to item 7.
> > 
> > I'm sorry. This point didn't parse. Could you give an example?
> > 
> > I don't see how the namei recursion method prevents catching // as a
> > namespace escape.
> 
> 
> //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork
> 
> You can't inherit the fact that you are looking at the resource fork
> in the terminal component, ONLY.

Yep, there's no easy way to do that now.. The one thing which comes to
mind is to have lookup() rip out the first component and save it in the
namei struct.

Though the devil's advocate in me points out that this difficulty is not
inherent in the recursion setup, but in how lookup() is designed. :-)
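
A sketch of the "rip out the first component and save it" idea, in plain
userland C. The struct and its field names are invented and are not the
real struct nameidata. It assumes an escape is a path that starts with
exactly two slashes, since POSIX leaves only that case open to
implementation-defined interpretation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ESC_MAX 64

/* Hypothetical slice of a nameidata-like struct. */
struct ni_sketch {
    char        escape[ESC_MAX]; /* saved escape component, "" if none */
    const char *rest;            /* path left for component-wise lookup */
};

/* Returns 1 if an escape was split off, 0 if the path has none,
 * -1 if the escape name does not fit in the buffer. */
int ni_split_escape(const char *path, struct ni_sketch *ni)
{
    ni->escape[0] = '\0';
    ni->rest = path;

    /* Exactly two leading slashes followed by a name; "///x" and "//"
     * alone are treated as ordinary paths. */
    if (path[0] != '/' || path[1] != '/' ||
        path[2] == '/' || path[2] == '\0')
        return 0;

    const char *name = path + 2;
    size_t len = strcspn(name, "/");
    if (len >= ESC_MAX)
        return -1;

    memcpy(ni->escape, name, len);
    ni->escape[len] = '\0';
    ni->rest = name + len;       /* "/intermediate_dir/..." or "" */
    return 1;
}
```

With the escape held in the struct rather than in the path, every later
component lookup, including the terminal one, can still see it.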

> > > 	Quotas should be an abstract stacking layer that can be
> > > 	applied to any FS, instead of an FFS specific monstrosity.
> > 
> > It should certainly be possible to add a quota layer on top of any leaf
> > fs. That way you could de-couple quotas. :-)
> 
> Yes, assuming stacking works in the first place...

Except for a minor buglet with device nodes, stacking works in NetBSD at
present. :-)

> > One other suggestion I've heard is to split the 64 bits we have for time
> > into 44 bits for seconds, and 20 bits for microseconds. That's more than
> > enough modification resolution, and also pushes things to past year
> > 500,000 AD. Versioning the inode would cover this easily.
> 
> Ugh.  But possible...

I agree it's ugly, but it has the advantage that it doesn't grow the
on-disk inode. A lot of folks have designs on the remaining 64 bits free.
:-)
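
To make the 44/20 split concrete, here is a small packing sketch. Putting
the seconds in the high bits, treating them as unsigned, and the ts_*
names are all assumptions of the example, not a proposed on-disk format.
The arithmetic does hold up: 2^20 = 1,048,576 covers the 1,000,000
microsecond values, and 2^44 seconds is roughly 557,000 years.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_BITS 20
#define USEC_MASK ((UINT64_C(1) << USEC_BITS) - 1)
#define SEC_MAX   ((UINT64_C(1) << 44) - 1)   /* largest 44-bit value */

/* Pack 44 bits of seconds and 20 bits of microseconds into 64 bits.
 * Returns 0 on success, -1 if either value is out of range. */
int ts_pack(uint64_t sec, uint32_t usec, uint64_t *out)
{
    if (sec > SEC_MAX || usec >= 1000000)
        return -1;
    *out = (sec << USEC_BITS) | usec;
    return 0;
}

/* Recover the seconds field (upper 44 bits). */
uint64_t ts_sec(uint64_t packed)
{
    return packed >> USEC_BITS;
}

/* Recover the microseconds field (lower 20 bits). */
uint32_t ts_usec(uint64_t packed)
{
    return (uint32_t)(packed & USEC_MASK);
}
```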

Take care,

Bill






